The biggest spam corpus the world has ever seen


It's been a while. My old laptop had a fan fail, and I didn't really have an easy way to recover data off of my old drive, so today I just went through the pain of doing a recovery. Anyway, I now have the data needed to generate my blog, and can thus start writing entries again.

Today, I was thinking about the idea of personalized spam filters. It occurs to me that the advantage of a personalized spam filter is not that you get less spam. Rather, you get more spam, but your edge case mail is less likely to be marked as spam because of its quasi-spamlike contents. I think what's clear here is that spam is always spam. Doctors don't want viagra spam, even if they talk about viagra over email. Financial advisors don't want hot stock tip spam, even if they talk about hot stock tips via email. So, one prevailing attitude is that making it less likely for a user to get his hot stock tip ham marked as hot stock tip spam is a good thing, even if it means he gets more hot stock tip spam. This makes sense, except that the best option is to simply not get your ham marked as spam, and the best way to do that is to have a larger corpus.

Many spam filters work by using a Bayesian classifier trained on a corpus of spam and non-spam (ham). In theory, the bigger your (valid) corpus, the better the filter is at classifying emails as spam or ham. So when everyone has their own training set, each individual filter should be less accurate, since its corpus doesn't have the volume of the entire userbase. With all that in mind we can present a solution that can be taken in a number of directions.
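To make the "corpus of spam and ham" idea concrete, here is a minimal sketch of a word-count naive Bayes filter. It is a toy under stated assumptions, not how any particular filter is implemented: the class name `NaiveBayesFilter`, the whitespace tokenizer, and the add-one smoothing are all choices made for illustration.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy Bayesian filter: the 'corpus' is just per-label word counts."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Assumption: lowercase + whitespace split is a good enough tokenizer.
        self.counts[label].update(text.lower().split())
        self.messages[label] += 1

    def spam_probability(self, text):
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        total_msgs = sum(self.messages.values())
        log_score = {}
        for label in ("spam", "ham"):
            total_words = sum(self.counts[label].values())
            # Smoothed prior: how common is this label in the corpus?
            score = math.log((self.messages[label] + 1) / (total_msgs + 2))
            for word in text.lower().split():
                # Add-one smoothing so unseen words don't zero everything out.
                score += math.log(
                    (self.counts[label][word] + 1) / (total_words + vocab + 1)
                )
            log_score[label] = score
        # Turn the two log scores into P(spam | text); clamp to avoid overflow.
        diff = max(min(log_score["ham"] - log_score["spam"], 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))


f = NaiveBayesFilter()
f.train("buy hot stock tips now", "spam")
f.train("hot stock tip buy now", "spam")
f.train("meeting notes for tuesday", "ham")
f.train("lunch on tuesday", "ham")
print(f.spam_probability("hot stock tips"))   # well above 0.5
print(f.spam_probability("tuesday meeting"))  # well below 0.5
```

With only four training messages the estimates are crude, which is the point: every extra (correctly labeled) message tightens the word counts the scores are built from.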

The basic idea is that everyone makes their corpus available in a distributed fashion. This is achieved by running a small daemon which classifies a message as spam or ham upon request. A level above this is a central daemon that tracks the corpora. When a request is made to the central daemon, the message is sent to a subset of the cluster, and the results are aggregated (averaged, or combined some other way) to come up with a score from that subset. Subsets can either be picked randomly from the connected nodes, or a heuristic can be implemented that tends to push certain messages to certain areas of the cluster (so we end up with a financial advisor subnode which deals with messages related to stocks, or somesuch). This basically means that the message is tested against a massive corpus drawn from many different types of people, which should classify messages that are spam to everyone better than a single corpus that might tend to favor an edge case.
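The central daemon's job can be sketched in a few lines. This is only an in-process illustration: each "node" is a plain Python callable standing in for a remote daemon, the function name `classify_via_cluster` and its parameters are made up for the example, and a simple mean is just one possible way to combine the scores. It covers the random-subset variant, not the routing heuristic.

```python
import random
from statistics import mean


def classify_via_cluster(message, nodes, sample_size=3, threshold=0.5, rng=random):
    """Ask a random subset of nodes for a spam score and average the answers.

    nodes: list of callables taking a message and returning a score in [0, 1].
           (Hypothetical stand-ins for the per-user daemons.)
    Returns (average_score, is_spam).
    """
    if not nodes:
        raise ValueError("no nodes connected")
    # Pick the subset of the cluster that will see this message.
    subset = rng.sample(nodes, min(sample_size, len(nodes)))
    # "Map": each chosen node scores the message against its own corpus.
    scores = [node(message) for node in subset]
    # "Reduce": combine the per-node scores; a plain average is assumed here.
    score = mean(scores)
    return score, score >= threshold


# Three fake nodes with fixed opinions, just to show the call shape.
nodes = [lambda m: 0.9, lambda m: 0.8, lambda m: 0.7]
print(classify_via_cluster("hot stock tip inside", nodes))
```

Swapping the `rng.sample` line for a lookup keyed on message content would give the "financial advisor subnode" style of routing.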

To speed up checks (since running a map-reduce operation over a part of the cluster won't be that fast), each message will be stripped of user/host-specific information and a checksum will be calculated, so that if the same spam comes in again it need not be tested again, but can simply be marked as spam.
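A rough sketch of the checksum shortcut, using Python's standard `email` and `hashlib` modules. Which headers count as "user/host-specific" is an assumption here (the `VOLATILE_HEADERS` set is a guess, not a definitive list), and `fingerprint` and `SpamCache` are names invented for the example.

```python
import hashlib
from email import message_from_string

# Assumption: these headers vary per recipient/host and should be ignored.
VOLATILE_HEADERS = {
    "to", "cc", "bcc", "delivered-to", "received",
    "message-id", "date", "return-path",
}


def fingerprint(raw_message):
    """Checksum of a message with per-user/per-host headers stripped."""
    msg = message_from_string(raw_message)
    kept = sorted(
        f"{name.lower()}:{value.strip()}"
        for name, value in msg.items()
        if name.lower() not in VOLATILE_HEADERS
    )
    if msg.is_multipart():
        body = "\n".join(
            str(part.get_payload()) for part in msg.walk() if not part.is_multipart()
        )
    else:
        body = msg.get_payload()
    # Collapse whitespace so trivial reflowing does not change the checksum.
    normalized = "\n".join(kept) + "\n\n" + " ".join(body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SpamCache:
    """Remember checksums of known spam so repeats skip the slow cluster check."""

    def __init__(self):
        self.known_spam = set()

    def check(self, raw_message, slow_check):
        digest = fingerprint(raw_message)
        if digest in self.known_spam:
            return True  # seen before: mark as spam without re-testing
        is_spam = slow_check(raw_message)
        if is_spam:
            self.known_spam.add(digest)
        return is_spam
```

An exact checksum only matches copies that are identical after stripping, so spam that randomizes its subject or body would still fall through to the full cluster check.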


published at Wed Nov 21 20:52:14 2007 (-0800) by alexbl
Tags: distributed, spam, filtering