I’m sick and tired of spam and even more ticked off about false positives. Machine learning (ML) researchers keep on telling me that ML is going to make the spam problem go away. Yet, for this user, it’s worse than ever.
Anti-spam technology.
Don’t get me wrong: huge strides in anti-spam technology have been made. Paul Graham’s essay A Plan for Spam discusses a now-popular technique, Bayesian filtering. According to Sophos, 92.8% of all mail sent in the first quarter of 2008 was spam. Wow. So anti-spam technology is taking care of most of the problem, but that’s just not good enough.
Amazon’s Mechanical Turk.
The idea of artificial artificial intelligence (AAI) isn’t new, but the term AAI comes from Amazon’s Jeff Bezos. From Wikipedia:
Artificial artificial intelligence (AAI) is a term coined by Jeff Bezos. Certain computational tasks, such as identifying whether a person in a photograph is male or female, are tasks that humans still do better and faster than computers. If perfect artificial intelligence systems existed, computer programs could complete those tasks. The idea of artificial artificial intelligence is to outsource those parts of a computer program to humans. AAI is the underlying principle behind Amazon Mechanical Turk.
Amazon’s Mechanical Turk outsources human labor at the task level. Workers get paid for completing HITs (human intelligence tasks), which companies, individuals, and organizations pay for on a per-HIT basis.
AAI + Spam = Nirvana
We should turn down the intensity of the anti-spam filters to avoid false positives (several important messages of mine, on both the sending and receiving side, go into spam EVERY WEEK). We should then supplement anti-spam models by having real humans evaluate the messages that fall in the gray area.
One option is to literally use Mechanical Turk as part of an anti-spam solution. Take the messages that are clearly spam according to your model and toss them. Let the kosher mail pass without harassment. But then send questionable mail to the Mechanical Turk to have humans help you figure out what to do. Another solution is to build a fully-integrated Anti-spam Turk.
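To make the three-way split concrete, here is a minimal sketch in Python. It assumes the filter already produces a spam score between 0 and 1. The names (Route, Thresholds, route_message) and the cutoff values are hypothetical, picked only for illustration, and the step that actually hands a message to human reviewers is left out.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    DELIVER = "deliver"            # clearly legitimate mail goes to the inbox
    HUMAN_REVIEW = "human_review"  # gray-area mail goes to human reviewers
    DISCARD = "discard"            # clearly spam, toss it


@dataclass(frozen=True)
class Thresholds:
    # Illustrative cutoffs only; real values would need tuning on real mail.
    ham_at_or_below: float = 0.20
    spam_at_or_above: float = 0.95


def route_message(spam_score: float, t: Thresholds = Thresholds()) -> Route:
    """Route a message using the score from an existing spam model."""
    if not 0.0 <= spam_score <= 1.0:
        raise ValueError("spam_score must be between 0 and 1")
    if spam_score >= t.spam_at_or_above:
        return Route.DISCARD
    if spam_score <= t.ham_at_or_below:
        return Route.DELIVER
    # Everything in between is the gray area where the model is unsure.
    return Route.HUMAN_REVIEW
```

Raising the spam cutoff is the "turn down the intensity" knob: fewer messages get tossed automatically, and more of them land in front of a human.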
There are clearly privacy issues which need to be managed carefully. But I’m confident that those issues can be solved with technology better than relying on AI to understand the difference between spam and important mail. Until SMTP is updated to include a much higher level of authentication/security, spam will be an issue. We either need to leverage some cost-based model like bonds, or we need to supplement AI with a little AAI.
7 responses so far ↓
Bob Ngu // June 28, 2008 at 5:56 am |
Ah spam, a subject near and dear to my heart. I actually managed the anti-spam engineering team at McAfee years ago; the founder and main developer of SpamAssassin was on my team. Bayesian filtering was a key component of SpamAssassin, and I will never look at “ham” and “spam” the same way again.
Privacy concerns aside, the thought of leveraging a service like Mechanical Turk for sorting out ham/spam has crossed my mind before, but even as an expert, I have at times found it hard to tell whether some emails are spam or not. I am afraid some spammers have gotten that good. So if I can’t tell for certain, would the Mechanical Turk do any better?
Also, I have learned that not everyone has the same definition of spam. What some people consider spam, others might not. This is one of the challenges of using Bayesian filtering when it is not person-specific, so there is bound to be some confusion.
In the long term, I feel that solutions like SPF will probably work best, but the challenge there is that every mail gateway/server has to use it for the solution to be 100% effective.
Scott Lawton // June 28, 2008 at 7:13 pm |
Granting that it’s orthogonal to your main point: have you tried SpamStopsHere.com?
Advantage: good control of factors to yield vanishingly small false positives as long as you’re willing to do some post-processing on the client. For me: a huge improvement over SpamAssassin plus my usual client-side tools.
Disadvantage: not cheap (though IMHO well worth it based on time saved); can have problems if your regular ISP changes email domain info without adequate notice.
(I’m just a customer … though should figure out their affiliate program at some point. Might as well try to get paid for the time it takes to type this….)
Bob Ngu // June 28, 2008 at 9:03 pm |
Scott,
I perused the technical details for SpamStopsHere.com and I don’t see anything that makes it a superior solution to SpamAssassin or other anti-spam software. Bayesian filtering is only one component of SpamAssassin; if memory serves me correctly (back in 2004), SpamAssassin had all the major technical features listed by SpamStopsHere.com.
BTW when I said SpamAssassin, I meant SpamAssassin the open source server solution at http://spamassassin.apache.org/, not the McAfee Personal SpamKiller solution.
Scott Lawton // June 28, 2008 at 10:52 pm |
I’m also talking about the open-source SpamAssassin as implemented by 2 different ISPs (pair and Superb). I may well be wrong, but it seems to me that lots of people who rely on SpamAssassin still complain about spam — especially regarding false positives. In any case, I don’t want to hijack this post so I’ll just say: the everyday difference on my email was remarkable. (And to reiterate: my only economic stake here is that if the company gets more users they’ll have an even better product.) Feel free to contact me off-line; my email is easy to find.
Bob Ngu // June 29, 2008 at 12:11 am |
Well, not all SpamAssassin implementations are created equal. I will leave it at that.
One thing we can agree on: stop hijacking the post. Thanks for letting us take it for a spin, Mike.
Michael F. Martin // June 29, 2008 at 1:49 am |
Thanks for the link to the cost-based bonds post. That is a very cool idea.
It would work best if the person who flagged it as spam got part of the fee charged to the spammer, and the rest went to cover the overhead of chasing down the spammers to collect the fees.
I think the problem, ultimately, is jurisdictional. Many of the spammers are set up in unusual spots like Vanuatu, where the local government is actually relieved to have some foreign investment coming in, even if it’s from spam farms!
Ryo // June 29, 2008 at 2:58 am |
I think CloudMark had (has?) a model similar to what you seem to be describing. They have a plug-in for mail clients, and when you mark a message as spam, it sends a “fingerprint” of the message to a centralized database, and that information is used to filter similar messages for other people. So when new spam appears, a few people will see it and categorize it as spam, but nobody else sees it. I don’t know how well it worked, though.
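For what it’s worth, here is a rough sketch of how a shared fingerprint database like that could work. This is not CloudMark’s actual algorithm, which I don’t know. It assumes a plain SHA-256 hash of the whitespace- and case-normalized message body, so it only catches near-identical copies, and the names (fingerprint, SharedSpamRegistry, min_reports) are made up for illustration.

```python
import hashlib
import re
from collections import Counter


def fingerprint(body: str) -> str:
    """Hash a normalized copy of the message body.

    Assumption: lowercasing and collapsing whitespace is enough
    normalization for this sketch; a real system would need something
    far more tolerant of small changes to the text.
    """
    normalized = re.sub(r"\s+", " ", body.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SharedSpamRegistry:
    """In-memory stand-in for the centralized fingerprint database."""

    def __init__(self, min_reports: int = 3) -> None:
        # Number of user reports needed before a fingerprint is trusted.
        self.min_reports = min_reports
        self._reports = Counter()

    def report_spam(self, body: str) -> None:
        """Called when a user marks a message as spam in the mail client."""
        self._reports[fingerprint(body)] += 1

    def is_known_spam(self, body: str) -> bool:
        """Check an incoming message against what other users reported."""
        return self._reports[fingerprint(body)] >= self.min_reports
```

The first few recipients see the message and report it. Once the report count reaches min_reports, everyone else’s copy can be filtered before they ever see it.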