Wednesday, April 6

SpamBayes

I was fed up with the low but increasingly annoying flow of spam into my mailboxes, so I finally decided to set up a spam filter. I chose SpamBayes, as I had heard good things about it (besides, it's written in Python).

As I use Debian (unstable), installation was just an apt-get install spambayes away. Setup and integration with Evolution, my mail client, were a bit more tedious. (By the way, SpamAssassin might have been a sensible choice, as it integrates well with Evolution.) First, I investigated piping messages to one of the SpamBayes scripts. I even found a script (sb_evoscore.py) written specifically for use with Evolution. Both solutions had a few drawbacks, though, so in the end I settled on the standard proxy server approach.

The user interface of the SpamBayes server impressed me. The server includes a simple web interface that you can use to configure SpamBayes, review messages, train the filter, or view statistics. Configuration of the server was straightforward.

The server is started by running the script sb_server.py, which resides in /usr/bin, so it should be in your path. I was slightly annoyed that the script immediately litters the working directory with files, and that it has no way of daemonizing, i.e., detaching from the terminal. I created the directory .spambayes in my home directory for storing the SpamBayes database. To run the server automatically, I whipped up a simple init script. It runs sb_server.py in the background as the specified user (just one user though - this will not work for a multiuser system with several people running the SpamBayes server). You will need to create /etc/default/spambayes, which sets the variables DBDIR (the directory for the databases) and RUNAS (the name of the user), e.g.:

DBDIR=/home/gintas/.spambayes
RUNAS=gintas
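
To illustrate what such a launcher has to do, here is a rough Python sketch rather than the init script itself. It reads the two variables from a file in the format shown above and starts sb_server.py detached from the terminal. The function names are made up for this example, and running the server with DBDIR as its working directory is an assumption based on the observation that the script drops its files wherever it is started.

```python
import shlex
import subprocess


def parse_defaults(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # shlex.split removes optional shell-style quotes around the value.
        parts = shlex.split(value)
        settings[key.strip()] = parts[0] if parts else ""
    return settings


def start_server(settings, script="/usr/bin/sb_server.py"):
    """Start the server in the background (hypothetical helper).

    Must be called as root so that the process can switch to RUNAS.
    The `user` argument needs Python 3.9+ on a POSIX system.
    """
    return subprocess.Popen(
        [script],
        cwd=settings["DBDIR"],        # assumption: files end up in the cwd
        user=settings["RUNAS"],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,       # detach from the controlling terminal
    )
```

With the example file above, parse_defaults(open("/etc/default/spambayes").read()) yields a dictionary with the keys DBDIR and RUNAS, which can be passed straight to start_server().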

I have not yet figured out why, but after I switch networks the SpamBayes server sometimes wedges and refuses to connect to a POP3 server because it cannot resolve the domain name. For now, I have added /etc/init.d/spambayes restart to my suspend script as a workaround.

As I had anticipated setting up a Bayesian spam filter, I had been marking spam as such in Evolution for a while rather than simply deleting it. However, when I wanted to train the filter, I could not find the spam folder on my filesystem (Evolution stores mail in the mbox format, in ~/.evolution/mail/local). My first attempt at training the filter was to copy the contents of the Spam folder to a temporary mail folder, which would show up as a file, feed that to SpamBayes as "spam", and feed the other mailboxes as "ham". However, I noticed that the filter didn't work well. Then I found out why the Spam mail folder was not showing up as a file - Spam is actually a virtual folder, and when a message is marked as spam, it is simply hidden from view rather than moved to a different mailbox. It makes some sense - if you change your mind about the message, you don't have to know where it came from; it will reappear where it was. So the spam training went fine, but supplying the "good" mailboxes was a mistake, because they still contained the spam too. In the end I had to create another temporary mail folder, copy some good messages to it, and use that one to train the filter.
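
Copying the good messages by hand works, but the same thing can be sketched with the mailbox module from the Python standard library. This is only an illustration: copy_messages and the keep predicate are hypothetical names, and how Evolution records the junk mark inside the mbox file is not covered here, so the predicate has to be supplied by whoever knows which messages are clean.

```python
import mailbox


def copy_messages(src_path, dst_path, keep):
    """Copy every message for which keep(msg) is true into another mbox.

    Returns the number of messages copied. The source mbox is only read.
    """
    src = mailbox.mbox(src_path, create=False)
    dst = mailbox.mbox(dst_path)          # created if it does not exist
    copied = 0
    try:
        for msg in src:
            if keep(msg):
                dst.add(msg)
                copied += 1
        dst.flush()                       # write the new messages to disk
    finally:
        dst.close()
        src.close()
    return copied
```

For example, keep could accept only messages whose From header belongs to a known correspondent. The resulting file can then be fed to SpamBayes as "ham" in the same way as the temporary folder described above.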

Wiring up Evolution to use the proxy was easy. I had to change the POP3 server settings in my Evolution accounts to point to localhost:proxied_port as the server, so that Evolution would get messages with the spam indication headers. To use the filtering, I added two rules, one for messages tagged as spam by SpamBayes, and another one for "unsure" (the tag can be found in the header "X-SpamBayes-Classification"). I set the former one to give the message Spam status and mark it read, so I wouldn't even notice it, and the latter to mark the message as Spam but leave it unread, so that I would have a look at it before discarding it. These rules suit me well, as I have never had a false positive, and most of the "unsure" messages (21 out of 24) are spam.
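
The logic of the two rules can be written down as a small Python sketch. It is not how Evolution evaluates its filters, just the same decision expressed in code. The function name and the returned labels are invented for this example, and the handling of a trailing "; score" part in the header is an assumption for setups where SpamBayes is configured to append one.

```python
from email import message_from_string

HEADER = "X-SpamBayes-Classification"


def filter_action(raw_message):
    """Return (status, read_state) for a message fetched through the proxy."""
    msg = message_from_string(raw_message)
    value = msg.get(HEADER, "")            # header lookup is case-insensitive
    # Assumption: anything after a ';' (e.g. a score) is not part of the tag.
    tag = value.split(";", 1)[0].strip().lower()
    if tag == "spam":
        return ("junk", "read")            # hidden and never noticed
    if tag == "unsure":
        return ("junk", "unread")          # hidden, but still worth a look
    return ("normal", "unread")            # ham or untagged: leave alone


example = "X-SpamBayes-Classification: unsure\nSubject: hello\n\nbody\n"
print(filter_action(example))
```

Treating a missing or unknown tag as normal mail is the cautious choice here, since a message that never passed through the proxy should not be hidden.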

After you train SpamBayes, remember to run a sanity check by querying some common "ham" and "spam" words - that's how I discovered my blunder. Words such as "money" and "rich" should show up as spam content. As for ham content, you know best which words are most frequent in your emails (in my case "python" was a clear winner, with 87 ham messages vs. 0 spam messages).
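
To see why such counts are a useful check, here is a rough sketch of how a per-word spam probability can be derived from them, following Gary Robinson's adjustment that SpamBayes is based on. The function name is made up, the constants 0.45 and 0.5 are what I believe the default settings to be, and the totals of trained messages (300 spam, 500 ham) are invented for the example.

```python
def word_spam_prob(spam_count, ham_count, n_spam, n_ham,
                   strength=0.45, unknown_prob=0.5):
    """Estimate the probability that a message containing the word is spam.

    spam_count / ham_count: trained messages of each kind containing the word.
    n_spam / n_ham: total trained messages of each kind (both must be > 0).
    strength and unknown_prob are assumed defaults, not values from the post.
    """
    n = spam_count + ham_count
    if n == 0:
        return unknown_prob               # word never seen: stay neutral
    spam_ratio = spam_count / n_spam
    ham_ratio = ham_count / n_ham
    raw = spam_ratio / (spam_ratio + ham_ratio)
    # Pull the raw estimate toward unknown_prob when evidence is thin.
    return (strength * unknown_prob + n * raw) / (strength + n)


# "python": 87 ham messages, 0 spam messages (totals are hypothetical).
print(word_spam_prob(0, 87, n_spam=300, n_ham=500))
```

A word seen in 87 ham messages and no spam ends up very close to 0. If a typical ham word comes out near 0.5 instead, the training data is probably contaminated, which is exactly the symptom of the mixed mailboxes described earlier.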

Further training of SpamBayes is performed either by reviewing the messages through the web interface, or by running a proxy for the outgoing mail server so that you can send emails to fictitious addresses used for notifying SpamBayes about mistakes. I went for the former. Now once in a while I visit the message review page to classify the unsure messages, though even that is probably unnecessary, as SpamBayes should be chugging along well enough with the existing database.

In conclusion, with zero false positives, zero false negatives and just a handful of unsure messages to date, I'm quite satisfied with SpamBayes. I had looked around on the web for information on using Evolution with SpamBayes and found very little, so I hope that this article will turn out to be useful to someone.