MESSAGE
DATE: 2007-11-17
FROM: Ron Guerin
SUBJECT: Re: [NYLXS - HANGOUT] Website Updates

Ruben Safir wrote:
> Anyway, even that code I wrote is now running 18 hours plus and
> still parsing mail. It makes me appreciate what the boys working
> for Wall Street go through. My regex must be chewing up too much
> CPU power. It's using 99% of the CPU to do this and still running
> and it has parsed a little over 30,000 messages.
>
> m/^From\s+[-.=\w]+\-at-[-.\w]+\.\w{2,3}\s+\w{3}\s+\w{3}\s+\d{1,2}\s+\d\d:\d\d:\d\d\s+\d\d\d\d/
>
> is the From line regex. And it is still missing some From Headers.
>
> Perhaps I should reduce this to a more generalized format such as
>
> m/^From\s+w.*\-at-w.*\s+\w{3}\s+\w{3}\s+\d{1,2}\s+\d\d:\d\d:\d\d\s+\d\d\d\d/
>
> or even
>
> m/^From\s+w.*\-at-w.*\s+\w{3}\s+\w.*\s+\d+\s+\d.*\s+\d.*/
>
>
> Would that make it run faster?

If you can optimize those regexes, it can only help. But also make sure you're matching against nothing but the *headers*: you don't want to parse the message body looking for headers, and you don't want to treat something in the body that merely looks like a From: header as a real header.

Cutting back on the data you run the regex against will probably help as much as, if not more than, anything else, because the headers are probably a relatively small percentage of your data.

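A rough sketch of the header-only idea, written in Python rather than the Perl used in the thread, and not code from the original script. It assumes an mbox-style file where each message starts with a "From " separator line after a blank line and the headers end at the first blank line. The names `FROM_LINE` and `iter_headers` and the simplified separator pattern are assumptions made for illustration.

```python
import re

# Assumed, simplified separator pattern:
# "From <sender> <Day> <Mon> <dd> <hh:mm:ss> <yyyy>"
FROM_LINE = re.compile(
    r"^From \S+\s+\w{3}\s+\w{3}\s+\d{1,2}\s+\d\d:\d\d:\d\d\s+\d{4}"
)

def iter_headers(lines):
    """Yield one list of header lines per message, skipping message bodies."""
    headers = None        # header lines of the message being collected
    in_headers = False    # True between a separator and the next blank line
    prev_blank = True     # a separator is only accepted after a blank line
    for raw in lines:
        line = raw.rstrip("\n")
        # Cheap prefix test first, so the regex runs on very few lines.
        if prev_blank and line.startswith("From ") and FROM_LINE.match(line):
            if headers is not None:
                yield headers
            headers = [line]
            in_headers = True
        elif in_headers:
            if line == "":
                in_headers = False   # blank line ends the header block
            else:
                headers.append(line)
        # Body lines fall through untouched: no header regex is run on them.
        prev_blank = (line == "")
    if headers is not None:
        yield headers
```

With this shape, any per-header regex only ever sees the lists yielded by `iter_headers`, so a line in the body that happens to look like a From: header is never examined.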
- Ron