MESSAGE
DATE | 2007-11-18
FROM | Ron Guerin
SUBJECT | Re: [NYLXS - HANGOUT] Website Updates
Ruben Safir wrote:
> On Sun, Nov 18, 2007 at 12:17:55AM -0500, Ron Guerin wrote:
>> Ruben Safir wrote:
>>
>> Sorry, I missed this before...
>>
>>> Maybe there is a magic way around this if the header tells me how many lines of
>>> content there is. Then I can gobble up the content without viewing
>>> the individual lines.
>> Yeah, chop off the headers by assuming everything from the start of the
>> file to the first blank line is a header. Parse those to your heart's
>> content for headers.
>>
>>> I'm open to suggestions. Meanwhile I just noticed that the message body is being doubled so I
>>> need to look at the code again in the morning when I get home from work.
>> Then when you get to the body, don't try to parse the entire body for
>> everything either. Keep cutting it down, parse out the pieces you don't
>> need to do anything else with, like the binaries. You don't need to be
>> running text searches on those, it's just going to burn up cycles and
>> heaven forbid, actually match something.
>>
>> Then when you've got only the parts of the body you want to search, run
>> regexes on that last bit of remaining content.
>>
>
> How do you know when the body ends? The body ends with a line feed and a fromline
>
> I can get the headers like you suggest, but the delimiter is the next header.
>
> From ruben-at-mrbrklyn.com Sun 18 Nov 00:38:27 2007
> Smilies :) ;)

Oh, I see. You know, Kevin was right. You should be using the existing message parsing libraries for Perl, because you don't know what you're looking for here. And I'm not being snippy; I don't know what you're looking for here either, *except* that I do know there's a message delimiter there that you're not picking up. There's no hard and fast rule about that delimiter, IIRC, which is one really good reason to use an existing, well-vetted library to read these mbox files.
Just breaking these down into individual messages would make your life a whole lot easier.
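
To make that concrete, here is a minimal sketch of the "let a library split the mbox, then only search the text parts" approach. It uses Python's standard `mailbox` module purely as an illustration, not Perl; a vetted Perl mbox module would fill the same role. The sample data and the helper names `text_parts` and `summarize` are made up for this example.

```python
import mailbox
import os
import tempfile

# Hypothetical two-message mbox, written to a temp file for the demo.
SAMPLE = (
    "From alice@example.com Sun Nov 18 00:38:27 2007\n"
    "From: alice@example.com\n"
    "Subject: first\n"
    "\n"
    "Body of the first message.\n"
    "\n"
    "From bob@example.com Sun Nov 18 00:40:00 2007\n"
    "From: bob@example.com\n"
    "Subject: second\n"
    "\n"
    "Body of the second message.\n"
)


def text_parts(msg):
    """Yield only the text/* payloads, skipping binary attachments."""
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            yield payload.decode(charset, errors="replace")


def summarize(path):
    """Return (subject, searchable_text) for each message in the mbox."""
    box = mailbox.mbox(path)  # the library finds the message delimiters
    try:
        return [(m["Subject"], "".join(text_parts(m))) for m in box]
    finally:
        box.close()


with tempfile.NamedTemporaryFile("w", suffix=".mbox", delete=False) as fh:
    fh.write(SAMPLE)
    sample_path = fh.name

results = summarize(sample_path)
os.unlink(sample_path)

for subject, text in results:
    print(subject, "->", text.strip())
```

The point of the design is that the code never looks for the delimiter itself: the library hands back one message at a time, headers are read by name, and regexes only ever need to run on the text returned by `text_parts`.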
- Ron