MESSAGE
DATE | 2007-11-18 |
FROM | Ron Guerin
|
SUBJECT | Re: [NYLXS - HANGOUT] Website Updates
|
Ruben Safir wrote:
> Well two things. First, you can only identify the headers IF you identify them
> as headers. Thats what the regex does, so i don't see how you can cut it out
> of a search on the body. Even the binaries need to be searched until you reach the
> end of the content type marker.
>
> thats the whole problem, right. Grep through the file and find From lines.
If you're doing that, it's no wonder it's taking so long, and you're very likely to pick up headers that aren't headers. You should only be parsing headers from the first line of the message until the header delimiter (the first line containing only CRLF, iirc), and then not again until the next message.
In this semi-made up example:
In-Reply-To: <473F9BD1.7080201-at-vnetworx.net>
User-Agent: Mutt/1.5.6i
Sender: owner-hangout-at-mrbrklyn.com
Precedence: bulk
Lines: 115

On Sat, Nov 17, 2007 at 08:56:33PM -0500, Ron Guerin wrote:
> Ruben Safir wrote:
> >
> > Anyway, even that code I wrote is now running 18 hours plus and
From: IAmNotAHeader-at-example.com
You should treat everything up to and including "Lines: 115" as headers. Everything below that, including "From: IAmNotAHeader-at-example.com", should not be parsed for headers at all, because anything after the header delimiter is guaranteed not to be a header.
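A minimal sketch of that rule, assuming Python and a single message already read into one string. The function name `split_headers` is made up for illustration, and splitting an archive into individual messages is left out:

```python
def split_headers(raw_message):
    """Split one raw message into (header_lines, body_text).

    Header parsing stops at the first empty line (the header
    delimiter); nothing after it is ever inspected for headers.
    """
    # Accept both CRLF and bare LF line endings.
    lines = [line.rstrip("\r") for line in raw_message.split("\n")]
    for i, line in enumerate(lines):
        if line == "":
            return lines[:i], "\n".join(lines[i + 1:])
    # No delimiter found: treat the whole message as headers.
    return lines, ""


# The semi-made-up example from above, with the body shortened.
sample = (
    "In-Reply-To: <473F9BD1.7080201-at-vnetworx.net>\r\n"
    "User-Agent: Mutt/1.5.6i\r\n"
    "Sender: owner-hangout-at-mrbrklyn.com\r\n"
    "Precedence: bulk\r\n"
    "Lines: 115\r\n"
    "\r\n"
    "On Sat, Nov 17, 2007 at 08:56:33PM -0500, Ron Guerin wrote:\r\n"
    "From: IAmNotAHeader-at-example.com\r\n"
)

headers, body = split_headers(sample)
```

Here `headers` ends with "Lines: 115", and the "From: IAmNotAHeader-at-example.com" line only ever appears in `body`, so a header regex run against `headers` alone cannot match it.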
Also make sure you account for folded headers (i.e., headers that contain a CRLF because they're split across multiple lines).
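A rough sketch of unfolding, assuming Python and a list of header lines with the line endings already stripped. It relies on the convention that a continuation line starts with a space or tab; the name `unfold_headers` is hypothetical:

```python
def unfold_headers(header_lines):
    """Join folded continuation lines onto the header they belong to."""
    unfolded = []
    for line in header_lines:
        # A line starting with space or tab continues the previous header.
        if line[:1] in (" ", "\t") and unfolded:
            unfolded[-1] += line
        else:
            unfolded.append(line)
    return unfolded


folded = [
    "Subject: Re: [NYLXS - HANGOUT]",
    " Website Updates",
    "Lines: 115",
]
result = unfold_headers(folded)
# result == ["Subject: Re: [NYLXS - HANGOUT] Website Updates", "Lines: 115"]
```

Unfolding before matching means each header is seen as one logical line, so a pattern anchored at the start of a line never mistakes a continuation for a new header.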
- Ron