MESSAGE
DATE | 2007-11-18 |
FROM | Ruben Safir
|
SUBJECT | Re: [NYLXS - HANGOUT] Website Updates
|
On Sun, Nov 18, 2007 at 12:11:33AM -0500, Ron Guerin wrote:
> Ruben Safir wrote:
>
> > Well two things. First, you can only identify the headers IF you identify them
> > as headers. Thats what the regex does, so i don't see how you can cut it out
> > of a search on the body. Even the binaries need to be searched until you reach the
> > end of the content type marker.
> >
> > thats the whole problem, right. Grep through the file and find From lines.
>
> If you're doing that, it's no wonder it's taking so long, and you're
> very likely to pick up headers that aren't headers. You should only be
> parsing headers from the first line of the message until the header
> delimiter (the first line containing only CRLF, iirc), and then not
> again until the next message.
>
> In this semi-made up example:
>
> In-Reply-To: <473F9BD1.7080201-at-vnetworx.net>
> User-Agent: Mutt/1.5.6i
> Sender: owner-hangout-at-mrbrklyn.com
> Precedence: bulk
> Lines: 115
>
> On Sat, Nov 17, 2007 at 08:56:33PM -0500, Ron Guerin wrote:
> > Ruben Safir wrote:
> >
> > > Anyway, even that code I wrote is now running 18 hours plus and
> From: IAmNotAHeader-at-example.com
>
> You should treat everything up to and including the "Lines: 115" as a
> header, and everything below, including "From:
> IAmNotAHeader-at-example.com", you should not be parsing for headers at
> all, because anything after the header delimiter is not a header,
> guaranteed.
>
> Also make sure you account for folded headers. (ie: headers containing a
> CRLF because they're split across multiple lines)
>
> - Ron
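For reference, the rule Ron describes (read headers only up to the first empty line, and join folded continuation lines) could look roughly like the sketch below. It is written in Python purely for illustration and is not the code either of us is running. The function name `parse_headers` and the assumption that one message has already been isolated as a list of lines are both hypothetical.

```python
def parse_headers(lines):
    """Return (headers, body_lines) for ONE message given as a list of lines.

    Assumption: the caller has already isolated a single message. A leading
    mbox "From " separator line, if present, is skipped rather than parsed.
    """
    headers = []  # ordered list of (name, value) pairs
    for i, raw in enumerate(lines):
        line = raw.rstrip("\r\n")
        if i == 0 and line.startswith("From "):
            continue                          # mbox separator, not a header
        if line == "":
            # Header delimiter: everything after it is body, never a header.
            return headers, lines[i + 1:]
        if line[0] in " \t" and headers:
            # Folded header: a continuation line starts with whitespace.
            name, value = headers[-1]
            headers[-1] = (name, value + " " + line.strip())
            continue
        name, sep, value = line.partition(":")
        if sep:
            headers.append((name.strip(), value.strip()))
    return headers, []                        # no delimiter: headers only
```

With Ron's example, the "From: IAmNotAHeader-at-example.com" line ends up in the body list and is never looked at as a header, since the scan stops at the empty line after "Lines: 115".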
Folded headers, you mean headers within the body, I assume. The mail is supposed to move those off the first line of the mail. Mutt itself does this for any "From" that it finds that is not "From: emailaddress valid date".
Otherwise this is a decent fault in the entire email protocol.
But most of all, how do you know when you've reached the end of the body? A blank line isn't enough, or every paragraph would look like a new message.
What I can do is glob up the body until a line feed, and then regex just the new paragraphs separated by a line feed.
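One way to picture the message boundary question: in an mbox-style file, a new message is normally marked by a blank line followed by a "From " separator line, and body lines that begin with "From " are written as ">From ". The sketch below splits on that rule. It is a rough illustration in Python under stated assumptions, not a full mbox parser: the `FROM_LINE` pattern (sender, then a date ending in a four-digit year) and the name `split_mbox` are guesses made up for this example.

```python
import re

# Assumed separator shape: "From <sender> <date ending in a year>", e.g.
# "From owner-hangout Sun Nov 18 00:11:33 2007". Illustrative only.
FROM_LINE = re.compile(r"^From \S+ +\S.*\d{4}\s*$")

def split_mbox(lines):
    """Yield each message as a list of lines, split on "From " separators."""
    current = []
    prev_blank = True                 # start of file counts as a boundary
    for raw in lines:
        line = raw.rstrip("\r\n")
        if prev_blank and FROM_LINE.match(line):
            if current:
                yield current         # previous message is complete
            current = [raw]           # separator line opens the next one
        elif current:
            current.append(raw)       # header or body line of this message
        prev_blank = (line == "")
    if current:
        yield current
```

Blank lines between paragraphs do not end a message here; only a blank line followed by a matching separator does. Python's standard `mailbox.mbox` class handles the same job and is probably the safer choice over a hand-rolled splitter.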
Ruben
--
http://www.mrbrklyn.com - Interesting Stuff
http://www.nylxs.com - Leadership Development in Free Software
So many immigrant groups have swept through our town that Brooklyn, like Atlantis, reaches mythological proportions in the mind of the world - RI Safir 1998
http://fairuse.nylxs.com DRM is THEFT - We are the STAKEHOLDERS - RI Safir 2002
"Yeah - I write Free Software...so SUE ME"
"The tremendous problem we face is that we are becoming sharecroppers to our own cultural heritage -- we need the ability to participate in our own society."
"> I'm an engineer. I choose the best tool for the job, politics be damned.< You must be a stupid engineer then, because politics and technology have been attached at the hip since the 1st dynasty in Ancient Egypt. I guess you missed that one."
© Copyright for the Digital Millennium