MESSAGE
DATE | 2004-09-22 |
FROM | Ruben Safir
|
SUBJECT | [hangout] Spamassassin Weak Points
|
The problems with SpamAssassin are rather large and become pretty clear once you're using it for substantial mail. Many of the problems have already been mentioned here, but I'll review them one by one.
The first problem is documentation and setup. SpamAssassin's documentation is not clear, and there is no clear, detailed write-up of its architecture. In fact, it seems to have been built without a clear road map for its design. SpamAssassin.pm itself seems to be mostly, as Ron put it, a REGEX monster with a lot of rules. Regexes in Perl are fairly efficient, but SpamAssassin itself is not. It is very inefficient. In addition to spamassassin, you have spamc, sa-learn and spamd.
spamc doesn't (surprise) even have a perldoc page. spamassassin itself does, as does sa-learn. Installation documentation is very complex due to the need for multiple outside programs, which would best be built as dependencies or, better, built into SpamAssassin itself. These include razor, razor2, dccproc, pyzor, etc. If installing by hand, these collaboration modules can be a nightmare. Pyzor, for example, required a complete re-installation of the Python environment. This kind of minefield reminds me significantly of the problems of trying to install GNOME by hand. The codependency problem is hell.
Now let's look at some of the internals of the documentation. For example, you have this little beauty from perldoc spamassassin:
-k, --revoke Revoke this message. This will revoke the mail message read from STDIN from various spam-blocker databases. Currently, these are Vipul's Razor ( http://razor.sourceforge.net/ ).
What? Huh? Revoke a message from stdin of WHAT? What does this mean? Uh oh, I better run off now to a mail list or IRC channel and beg for an explanation.
Or how about this one?:
Quite often, if you've been on the internet for a while, you'll have accumulated a few old email accounts that nowadays get nothing but spam.
SpamAssassin lets you set them up as aliases, as follows:
spamtrap1: "| /path/to/spamassassin -r -w spamtrap1"
What the heck is spamtrap1 in this example? What does this do exactly? It's bewildering.
Now, if you spend a lot of time struggling with all the documentation and if you have now made spamassassin a full time hobby, and you've integrated yourself into the spamassassin cadre and union of spamassassin A++ professionals, all this bad design begins to become clear to you, maybe.
Now, in its design, it's working on the wrong end and it is too passive. Mail comes into your mail system and is then processed by your MTA and sent, not to spamassassin, but to procmail. Procmail itself is one of those mysterious, poorly documented applications that often doesn't do what it was supposed to (and which, unfortunately, we all love). It then filters all the mail, headers and bodies, through multiple tests. Instead of cutting the mailing IP address off at the HELO interface when it sees multiple spams coming through from the same IP address, it just keeps processing them. Now, if you have something like a sendmail milter on the RBL, you can possibly short-circuit the spam source. But spamassassin only filters and tags, and depends on user intervention to learn (with sa-learn). This is all a pitfall, and the spammers, who are not as stupid as Sunny would like to think they are, have worked on defeating this system at its weak points.
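To make the "cut the IP off after repeated spam" idea concrete, here is a minimal sketch in Python. It is not how SpamAssassin or any sendmail milter actually works; the class name, the threshold, the time window and the `Connect:<ip> REJECT` line format (modelled on a sendmail access map) are all assumptions for illustration.

```python
from collections import defaultdict, deque
import time


class SpamSourceTracker:
    """Counts spam verdicts per source IP inside a sliding time window.

    Hypothetical helper: threshold and window are illustrative defaults.
    """

    def __init__(self, threshold=5, window_seconds=600):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # ip -> timestamps of spam verdicts

    def record_spam(self, ip, now=None):
        """Record one spam from ip; return True once ip reaches the threshold."""
        now = time.time() if now is None else now
        hits = self._hits[ip]
        hits.append(now)
        # Drop verdicts that have aged out of the window.
        while hits and now - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) >= self.threshold

    def blocked_ips(self, now=None):
        """Return the sorted IPs with at least `threshold` recent spam verdicts."""
        now = time.time() if now is None else now
        blocked = []
        for ip, hits in self._hits.items():
            recent = sum(1 for ts in hits if now - ts <= self.window_seconds)
            if recent >= self.threshold:
                blocked.append(ip)
        return sorted(blocked)


def access_entries(ips):
    """Format IPs as access-map style lines (assumed format, check your MTA)."""
    return [f"Connect:{ip}\tREJECT" for ip in ips]
```

A filter could call `record_spam()` each time a message is tagged as spam, and something outside this sketch would have to write the `access_entries()` output where the MTA reads it and tell the MTA to reload.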
For example, they iterate messages with binaries just above the maximum sizes folks commonly use for spamc, and thereby just blow past the whole thing. If you raise the max size, it causes spamd to choke on the mail, and your log will fill with notices that spamd is not responding to spamc. Spamassassin doesn't seem to have enough common sense not to attempt to munch on binary data (which is a problem suffered by embperl as well, for what it is worth) despite really having all the information it needs to do this in the headers. Spamassassin doesn't have a graceful way of timing out. In fact, since the mail is ALREADY accepted by the MTA, it has no way of saying, yo!! This is too hard right now, send again later!, because the SMTP session is already closed. It also doesn't gracefully just say, oh - this is too big right now, write it out to a temp file and process it later when I'm not busy. No, it's just a dumb filter with some backend communication to Razor et al.
When users create Bayesian databases, the process can really bog down. My own Bayesian database just dragged the CPU to a halt. This is not a good thing (TM by Vaughn Scott). Multiply this by 40 power users, like we have at NYLXS, and you can have a huge problem. The documentation recommends not turning on the local user .spamassasin.conf files. Great, and this is evidently set up as default behavior.
So let's say we want to fix some of these problems. Let's integrate spamassassin into the MTA! Let's look at how we can do that:
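As an illustration of "the headers already tell you what is binary", here is a small Python sketch that hands a scanner only the headers and the text parts of a message, capped at a fixed length. It is not SpamAssassin's behavior; the function name and the 20000-character cap are assumptions chosen for the example.

```python
from email import message_from_bytes

MAX_SCAN_CHARS = 20000  # hypothetical cap, in the range of a typical spamc size limit


def scannable_text(raw_message, max_chars=MAX_SCAN_CHARS):
    """Return header text plus text/* bodies, truncated to max_chars.

    Non-text parts (images, executables, archives) are skipped on the
    strength of their Content-Type header alone and are never decoded.
    """
    msg = message_from_bytes(raw_message)
    chunks = ["".join(f"{name}: {value}\n" for name, value in msg.items())]
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no payload of their own
        if part.get_content_maintype() != "text":
            continue  # binary attachment: skip without decoding
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:
            # Unknown charset label: fall back to a lossy UTF-8 decode.
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)[:max_chars]
```

With something like this in front of the filter, an oversized binary attachment no longer decides whether the message gets scanned at all; only the text that could carry the spam content is examined.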
http://wiki.apache.org/spamassassin/IntegratedInMta
tells me for sendmail to use procmail - UGG
OK but there is a link for a high speed vilter! Wonderful. Click on that link:
http://www.msys.ch/products/software/unix/sendmail/smtp-vilter/
Bonk!!!! Anyone know how to read German?
OK, how about Roaring Penguin? Well, truthfully, this is where I stopped until this week. This week we were mail bombed by spam coming out of nearly every corner of the planet. A little over 400,000 entries in the from and mail log in about 6 hours. I dropped the procmail spamc entry to about 200,000
:0fw
* < 20000
| spamc -f
and set a maximum of 5 children. A lot of spam gets through, but the system doesn't swap out and freeze anymore, about 80% of the spam is caught, and the larger spams designed to kill SA are failing to do that now. God Bless America (and China too).
I upgraded SA and I've got to roll up my sleeves and look further at solutions which integrate into the sendmail milters. But overall, there are serious design flaws in SA which could be addressed.
It should be a wrapper for the MTA, add access denial entries and HUP the MTA when under stress so that it takes the new entries. It needs to gracefully time out. It needs to renice itself down (or up depending on how you look at it). It needs to generally play nice.
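Two of the items on that wish list, timing out gracefully and renicing, can be sketched in a few lines of Python. This is only an illustration of the behavior being asked for, not something SA provides; the function name, the 30-second timeout and the niceness of 10 are assumptions, and the sketch is POSIX-only because it relies on `os.nice`.

```python
import os
import subprocess


def run_filter(command, message, timeout_seconds=30, niceness=10):
    """Pipe message (bytes) through command at lowered priority.

    Returns (output, True) on success. On a timeout, a non-zero exit or a
    missing program it returns (message, False), so the mail is passed
    through unfiltered instead of being lost or stalling delivery.
    """

    def lower_priority():
        os.nice(niceness)  # runs in the child just before exec

    try:
        result = subprocess.run(
            command,
            input=message,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,  # the child is killed when this expires
            preexec_fn=lower_priority,
            check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return message, False
    return result.stdout, True
```

For example, `run_filter(["spamc", "-f"], raw_mail)` would hand the message to spamc and fall back to the untouched message if the filter stalls. Adding access denial entries and sending HUP to the MTA are left out of this sketch.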
Meanwhile, I'm going to run out and buy more RAM this week for the mail server, just for SA.
Ruben
-- __________________________ Brooklyn Linux Solutions
So many immigrant groups have swept through our town that Brooklyn, like Atlantis, reaches mythological proportions in the mind of the world - RI Safir 1998
DRM is THEFT - We are the STAKEHOLDERS - RI Safir 2002 http://fairuse.nylxs.com
http://www.mrbrklyn.com - Consulting http://www.inns.net <-- Happy Clients http://www.nylxs.com - Leadership Development in Free Software http://www2.mrbrklyn.com/resources - Unpublished Archive or stories and articles from around the net http://www2.mrbrklyn.com/downtown.html - See the New Downtown Brooklyn....
____________________________ NYLXS: New Yorker Free Software Users Scene Fair Use - because it's either fair use or useless.... NYLXS is a trademark of NYLXS, Inc