Spam
Tuesday, October 17th, 2006
Spammers seem to be breaking through common filtering systems with increasing regularity of late. Either that, or MIT’s spam filtering services suck compared to Stanford’s. The first exciting distributed spam filtering system I encountered was a simple hash database containing checksums of known spams. Clearly the spammers have long since conquered this one, but I wonder whether another axis of filtering might extend the idea, using sliding-window hashes to recognize content similarity. In particular, while spams often try to be somewhat unique, they generally have a very cut-and-paste feel, which suggests that greater success might be found by simply matching small regions of commonality through local fingerprints (à la LBFS).
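A minimal sketch of the sliding-window idea, under some loud assumptions: the window size, the polynomial rolling hash, the sampling mask, and the names `normalize`, `fingerprints`, and `overlap` are all made up for illustration, and this is not how LBFS or any real spam filter is implemented. It hashes every fixed-size window of the normalized message, keeps a sample of those hashes as fingerprints, and scores a message by how many of its fingerprints already appear in a known-spam set.

```python
import re

WINDOW = 16           # bytes per sliding window (assumed value)
BASE = 257            # polynomial base for the rolling hash (assumed)
MOD = (1 << 61) - 1   # large prime modulus to keep collisions rare
SAMPLE_MASK = 0x7     # keep hashes whose low 3 bits are zero (about 1 in 8)


def normalize(text: str) -> bytes:
    """Lowercase and collapse whitespace so trivial edits do not change hashes."""
    return re.sub(r"\s+", " ", text.lower()).strip().encode("utf-8")


def fingerprints(text: str) -> set:
    """Return a sampled set of rolling hashes, one candidate per WINDOW-byte window."""
    data = normalize(text)
    if len(data) < WINDOW:
        return set()
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window
    h = 0
    for b in data[:WINDOW]:           # hash of the first window
        h = (h * BASE + b) % MOD
    kept = set()
    if h & SAMPLE_MASK == 0:
        kept.add(h)
    for i in range(WINDOW, len(data)):
        # Slide by one byte: drop data[i - WINDOW], append data[i].
        h = ((h - data[i - WINDOW] * top) * BASE + data[i]) % MOD
        if h & SAMPLE_MASK == 0:
            kept.add(h)
    return kept


def overlap(message: str, known_spam: set) -> float:
    """Fraction of the message's fingerprints already present in known_spam."""
    fps = fingerprints(message)
    return len(fps & known_spam) / len(fps) if fps else 0.0
```

Because each hash depends only on the bytes inside its window, a pasted region produces the same fingerprints no matter what surrounds it, so a shared database could hold the union of `fingerprints(...)` over reported spams and flag any message whose `overlap` exceeds some tuned threshold.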
Similarly, today’s most successful spams seem to rely almost exclusively on image text to sneak past filters. Gmail and others seem to have responded by adding simple OCR to their filters, which in turn led spammers to italic, anti-aliased, non-fixed-width image text to break through that barrier. As these images get more expensive to generate (and they certainly seem to be identical across large numbers of spams in a given day, with only the Markov-generated “poetry” varying from message to message), can’t we similarly use range hashes or other efficient fingerprints to recognize these oft-reused images from a distributed known-spam database?



