MESSAGE
DATE | 2007-11-17 |
FROM | Ruben Safir
|
SUBJECT | Re: [NYLXS - HANGOUT] Website Updates
|
On Sat, Nov 17, 2007 at 08:56:33PM -0500, Ron Guerin wrote:
> Ruben Safir wrote:
>
> > Anyway, even that code I wrote is now running 18 hours plus and
> > still parsing mail. It makes me appreciate what the boys working
> > for Wall Street go through. My regex must be chewing up too much
> > CPU power. It's using 99% of the CPU to do this and still running
> > and it has parsed a little over 30,000 messages.
> >
> > m/^From\s+[-.=\w]+\-at-[-.\w]+\.\w{2,3}\s+\w{3}\s+\w{3}\s+\d{1,2}\s+\d\d:\d\d:\d\d\s+\d\d\d\d/
> >
> > is the From line regex. And it is still missing some From Headers.
> >
> > Perhaps I should reduce this to a more generalized format such as
> >
> > m/^From\s+w.*\-at-w.*\s+\w{3}\s+\w{3}\s+\d{1,2}\s+\d\d:\d\d:\d\d\s+\d\d\d\d/
> >
> > or even
> >
> > m/^From\s+w.*\-at-w.*\s+\w{3}\s+\w.*\s+\d+\s+\d.*\s+\d.*/
> >
> >
> > Would that make it run faster?
>
> If you can optimize those regexes, it can only help. But also make sure
> you're not looking at anything but the *headers*, because you neither
> want to parse the message body looking for headers, nor do you want to
> treat something that looks like a From: header like a header if it's in
> the message body.
>
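One possible reason the first regex still misses some From lines is that it insists on a "-at-" address ending in a two or three letter suffix. The sketch below is Python rather than the Perl in the thread, and the sample lines and the LOOSE pattern are made up for illustration. It compares the strict pattern with a looser one that accepts any run of non-space characters as the sender and keeps the date part strict, without resorting to ".*".

```python
import re

# The strict pattern from the thread, carried over to Python syntax.
STRICT = re.compile(
    r"^From\s+[-.=\w]+-at-[-.\w]+\.\w{2,3}"
    r"\s+\w{3}\s+\w{3}\s+\d{1,2}\s+\d\d:\d\d:\d\d\s+\d\d\d\d"
)

# Hypothetical looser variant: any non-space sender, same date layout.
# \S+ cannot run past the first space, so it backtracks far less than ".*".
LOOSE = re.compile(
    r"^From \S+\s+\w{3}\s+\w{3}\s+\d{1,2}\s+\d\d:\d\d:\d\d\s+\d{4}"
)

# Invented sample lines, not taken from the real archive.
samples = [
    "From ruben-at-mrbrklyn.com Sat Nov 17 20:56:33 2007",
    "From user-at-example.info Sat Nov 17 20:56:33 2007",  # 4-letter suffix
    "From MAILER-DAEMON Sat Nov 17 20:56:33 2007",         # no address at all
    "From what I can tell, it works",                      # ordinary body text
]

for line in samples:
    print(bool(STRICT.match(line)), bool(LOOSE.match(line)), line)
```

Whether the looser pattern is safe enough depends on what the archive really contains, so it is worth checking against the lines the strict one rejects.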
Well, two things. First, you can only identify the headers IF you identify them as headers. That's what the regex does, so I don't see how you can cut it out of a search on the body. Even the binaries need to be searched until you reach the end of the content type marker.
That's the whole problem, right? Grep through the file and find From lines.
Then the body itself needs to be captured and entered into the database.
Maybe there is a magic way around this if the header tells me how many lines of content there are. Then I can gobble up the content without viewing the individual lines.
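A rough sketch of one way to do that, in Python rather than Perl. It assumes the usual mbox convention that a separator line follows a blank line (or starts the file), and it runs the regex only on lines that pass a cheap "From " prefix test, so most body lines never reach the regex engine. The function name split_mbox and the loose pattern are illustrative assumptions, not the code from the thread.

```python
import re

# Assumed loose separator pattern: sender is any non-space run, then
# weekday, month, day, hh:mm:ss and a four digit year.
FROM_LINE = re.compile(
    r"^From \S+\s+\w{3}\s+\w{3}\s+\d{1,2}\s+\d\d:\d\d:\d\d\s+\d{4}"
)

def split_mbox(lines):
    """Yield (headers, body) pairs; each is a list of lines without newlines."""
    headers, body = None, []
    in_headers = False
    prev_blank = True  # the start of the file counts as "after a blank line"
    for raw in lines:
        line = raw.rstrip("\n")
        # Cheap tests first; the regex only sees plausible candidates.
        if prev_blank and line.startswith("From ") and FROM_LINE.match(line):
            if headers is not None:
                yield headers, body      # finish the previous message
            headers, body, in_headers = [line], [], True
        elif headers is None:
            pass                         # junk before the first separator
        elif in_headers:
            if line == "":
                in_headers = False       # blank line ends the header block
            else:
                headers.append(line)
        else:
            body.append(line)            # body lines are stored, never parsed
        prev_blank = (line == "")
    if headers is not None:
        yield headers, body

# Invented sample data for illustration.
sample = [
    "From a-at-b.com Sat Nov 17 20:56:33 2007\n",
    "Subject: one\n",
    "\n",
    "From here on it is body text\n",
    "\n",
    "From c-at-d.org Sun Nov 18 01:02:03 2007\n",
    "Subject: two\n",
    "\n",
    "hello\n",
]
messages = list(split_mbox(sample))
```

Each body is appended to exactly one message, which may also make it easier to see where a doubled body comes from.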
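A minimal sketch of that idea in Python, assuming the messages carry a "Lines:" header and that its value can be trusted. Neither is guaranteed: many messages have no such header, and a wrong count would desynchronize the parser, so the function returns None for the body when the header is absent and leaves the fallback scan to the caller. The name read_message is hypothetical.

```python
import io

def read_message(stream):
    """Read one message from a text stream positioned just after the
    'From ' separator line. Returns (headers, body_or_None)."""
    headers = {}
    while True:
        line = stream.readline()
        if line in ("", "\n"):           # end of file or end of headers
            break
        line = line.rstrip("\n")
        # Skip folded continuation lines; keep simple "Name: value" pairs.
        if ":" in line and not line[0].isspace():
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
    count = headers.get("lines", "")
    if count.isdigit():
        # Gobble the body without inspecting the individual lines.
        body = "".join(stream.readline() for _ in range(int(count)))
        return headers, body
    return headers, None                 # no usable count: scan instead

# Invented sample message for illustration.
stream = io.StringIO(
    "Subject: hi\n"
    "Lines: 2\n"
    "\n"
    "From the top\n"
    "second\n"
    "From x-at-y.org Sat Nov 17 20:56:33 2007\n"
)
headers, body = read_message(stream)
next_line = stream.readline()
```

After the call the stream sits on the next separator line, so a body line that happens to start with "From " is never tested against the regex.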
> So if you cut back on the data you regex against, that's probably going
> to help as much, if not more than anything else, because the headers are
> probably a relatively small percentage of your data.
>
I'm open to suggestions. Meanwhile I just noticed that the message body is being doubled so I need to look at the code again in the morning when I get home from work.
Ruben
--
http://www.mrbrklyn.com - Interesting Stuff
http://www.nylxs.com - Leadership Development in Free Software
So many immigrant groups have swept through our town that Brooklyn, like Atlantis, reaches mythological proportions in the mind of the world - RI Safir 1998
http://fairuse.nylxs.com DRM is THEFT - We are the STAKEHOLDERS - RI Safir 2002
"Yeah - I write Free Software...so SUE ME"
"The tremendous problem we face is that we are becoming sharecroppers to our own cultural heritage -- we need the ability to participate in our own society."
"> I'm an engineer. I choose the best tool for the job, politics be damned.< You must be a stupid engineer then, because politcs and technology have been attached at the hip since the 1st dynasty in Ancient Egypt. I guess you missed that one."
© Copyright for the Digital Millennium