MESSAGE
DATE | 2007-11-16
FROM | Ruben Safir
SUBJECT | Re: [NYLXS - HANGOUT] Website Updates
On Fri, Nov 16, 2007 at 03:24:40PM -0500, Ron Guerin wrote:
> Kevin Mark wrote:
> > On Thu, Nov 15, 2007 at 06:32:01PM -0500, Ruben Safir wrote:
> >> I'm busy rewriting a lot of the functionality of the NYLXS and the Freeom-IT
> >> websites. It's been a bit longer to do that I thought, much because of the
> >> first step is parsing all the old mail boxes I have and pulling out all the
> >> HANGOUT mailings to store in a database. This is well over 700,000 lines of
> >> data (thanks for the binaries in the mail guys!).
> >>
> >> This makes testing go slower than I'd like and the large sample is important
> >> for the purpose of shaking out my regex routines on the From lines in the
> >> mboxes.
> > Whats the problem with parsing email from mail boxes into a database?
>
> I can think of one. Speaking from pain of experience, you need to
> maintain the same level of privacy that your posters expected when they
> posted to the list. That's possibly going to be a problem when the
> expectations on this list in regard to an archive were pretty restrictive.
>
> Oh! Make that two. You need to honor the "X-No-Archive: yes" headers
> unless you had a stated policy about not honoring it. We no longer
> honor it on nylug-talk, and I know we lost one fairly active list member
> who left after we announced that.
>
> - Ron
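For the "X-No-Archive: yes" point above, here is a minimal sketch of how an import script could test for the header before storing a message. The function name `no_archive_requested` is hypothetical, and it assumes each message is available as a single string with the headers separated from the body by the first blank line. Whether to honor the header at all remains the list's policy decision.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the header block of $message
# carries "X-No-Archive: yes", and 0 otherwise.
sub no_archive_requested {
    my ($message) = @_;

    # Headers end at the first blank line; anything after that is body
    # text and must not be mistaken for a header.
    my ($headers) = split /\r?\n\r?\n/, $message, 2;
    return 0 unless defined $headers;

    # Header names are case-insensitive; accept "yes" in any case too.
    return $headers =~ /^X-No-Archive:\s*yes\s*$/mi ? 1 : 0;
}
```

A caller would skip the database insert whenever this returns 1.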
I'm trying to open the mboxes (there are about 4 of them) and put them into an array. They are nearly 350 megs each, and with a gig of RAM (Memory: 1033332k/1047744k available (2077k kernel code, 13764k reserved, 782k data, 212k init, 130240k highmem)) I thought I'd have no trouble just absorbing them into an array and then working through it. Unfortunately this is proving to be too hard: the script runs slowly and brings the box to a halt. The GIMP handles 3M images with no problem and runs filters on them, so I'm actually perplexed by this.
The code looks like this:

    #!/usr/bin/perl
    use strict;
    use SQLHANDLE;
    use Date::Manip;
    use Fcntl ':flock';

    my $base = '/usr/';
    opendir(LOGS, $base) or die "$!";
    my @archives = grep { m/mbox/ } readdir(LOGS);
    my ($logfile, $tmp);

    for $tmp (@archives) {
        my $logfile = $tmp;
        my @newentries;
        print STDERR "Processing $logfile *****************\n";
        #my $line;
        open LOG, "<$base$logfile" or die $!;
        flock(LOG, LOCK_EX);

        print STDERR "Opening $logfile\n";
        @newentries = <LOG>;    #All the new lines since run last time

        my $size = @newentries;
        print STDERR "OK - have $logfile which is $size lines large\n";

        flock(LOG, LOCK_UN);    #unlock the file

        #######OK BUILD Database#################
        my $i;
        my $message = $newentries[0];
        my $From = '';
        my $Date = '';
        my $Subject = '';
        my $From_line = $newentries[0];    #The First Line is a From: line
        for ($i = 1; $i < @newentries; $i++) {    #starting from the second line

            if ($From eq '') {
                $From = $newentries[$i] if $newentries[$i] =~ /^From:/;
            }
            if ($Date eq '') {
                $Date = $newentries[$i] if $newentries[$i] =~ /^Date:/;
            }
            if ($Subject eq '') {
                $Subject = $newentries[$i] if $newentries[$i] =~ /^Subject:/;
            }

            #these we need - these below we do not need

            my $Reply = $newentries[$i] if $newentries[$i] =~ /^Reply-to:/;
            my $To = $newentries[$i] if $newentries[$i] =~ /^To:/;
            my $Cc = $newentries[$i] if $newentries[$i] =~ /^Cc/;
            my $Content = $newentries[$i] if $newentries[$i] =~ /^Content-type:/;

            #This delimits the messages in the archive
            if ($newentries[$i] =~ m/^From\s+[-.=\w]+\@[-.\w]+\.\w{2,3}\s+\w{3}\s+\w{3}\s+\d{1,2}\s+\d\d:\d\d:\d\d\s+\d\d\d\d/
                and $newentries[$i - 1] eq "\n") {
                #LOOKING FOR THE NEXT FROM LINE
                &enterdata($From, $Date, $Subject, $message) if $From_line =~ m/hangout/i;
                # print STDERR "New Message\n";
                #reset message data after storing it
                $From = '';
                $Date = '';
                $Subject = '';
                $message = $newentries[$i];
                $From_line = $newentries[$i];
                next;
            }
            #if none of the above conditions are met, then it should be
            #the body of the message.
            #print STDERR ".";
            #print "line $i $newentries[$i]" if $newentries[$i] =~ m/^From /;
            $message = join '', $message, $newentries[$i];

        }
        #The last message does not have a From line at the end
        &enterdata($From, $Date, $Subject, $message) if $From_line =~ /hangout/i;
    }
I'm going to have to change it and instead walk through the files line by line, which all but means recoding the core of the program.
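A minimal sketch of what such a line-by-line pass might look like follows. It reads one line at a time instead of loading the whole mbox into an array, so memory use stays near the size of the largest single message. The names `process_mbox` and `$store` are hypothetical, the delimiter regex is a simplified stand-in for the stricter one in the script, and file locking is left out. It is an illustration under those assumptions, not a drop-in replacement.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: stream one mbox file and call $store->(...) for
# every message whose envelope "From " line matches /hangout/i.
# Returns the number of messages handed to $store.
sub process_mbox {
    my ($path, $store) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";

    my ($from_line, $from, $date, $subject, $message) = ('', '', '', '', '');
    my $prev_blank = 1;    # start of file counts as a message boundary
    my $stored     = 0;

    # Hand over the message collected so far, if it belongs to the list.
    my $flush = sub {
        return unless $from_line =~ /hangout/i;
        $store->($from, $date, $subject, $message);
        $stored++;
    };

    while (my $line = <$fh>) {
        # Simplified delimiter (an assumption): an envelope "From " line
        # that directly follows a blank line starts a new message.
        if ($prev_blank and $line =~ /^From\s+\S+\s+\w{3}\s+\w{3}\s+\d{1,2}\s/) {
            $flush->() if $from_line ne '';
            ($from, $date, $subject) = ('', '', '');
            $from_line = $message = $line;
            $prev_blank = 0;
            next;
        }

        # Keep only the first occurrence of each header we need.
        $from    = $line if $from    eq '' and $line =~ /^From:/;
        $date    = $line if $date    eq '' and $line =~ /^Date:/;
        $subject = $line if $subject eq '' and $line =~ /^Subject:/;

        $message .= $line;
        $prev_blank = ($line eq "\n");
    }

    # The last message has no following "From " line to trigger a flush.
    $flush->() if $from_line ne '';
    close $fh;
    return $stored;
}
```

With the script's existing routine this would be called as `process_mbox("$base$logfile", \&enterdata)` for each archive.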
Ruben
http://www.nylxs.com - Leadership Development in Free Software
So many immigrant groups have swept through our town that Brooklyn, like Atlantis, reaches mythological proportions in the mind of the world - RI Safir 1998
http://fairuse.nylxs.com DRM is THEFT - We are the STAKEHOLDERS - RI Safir 2002
"Yeah - I write Free Software...so SUE ME"
"The tremendous problem we face is that we are becoming sharecroppers to our own cultural heritage -- we need the ability to participate in our own society."
"> I'm an engineer. I choose the best tool for the job, politics be damned.< You must be a stupid engineer then, because politics and technology have been attached at the hip since the 1st dynasty in Ancient Egypt. I guess you missed that one."