As always, the gnarly questions that come up when I give a speech are the best part of the experience. On Thursday I spoke at The Media Center's Emerging Technology conference at Stanford University, where I was asked a toughie: could bad actors drown the Tail in a tide of spam and other robotic graft?
The fear, to put it simply, is that the Tail is fragile and that niche quality can easily be overwhelmed by industrial-scale automated marketing, fraud and vandalism. After all, the signal-to-noise ratio of the Tail is already low; could digital graffiti push it to insignificance? The fall of the once-great Usenet, which collapsed under the weight of autospam and trolls, is the cautionary example. Will the same happen to the blogosphere and all the trust networks and other recommendation mechanisms that we use to pull diamonds from the rough?
I think the answer is no. At the conference I said, controversially, that I felt a combination of technologies had finally come together to turn the corner on the email spam problem and I didn't see why the same approach wouldn't work for comment spam and other automated web scams. In short, I think the good guys, backed by smart technology, will win in the end. (Ross Mayfield did a fine job covering the talk here.)
My confidence comes primarily from seeing the spam problem, which was until recently deemed an out-of-control epidemic, get reduced to a very manageable two-or-three-a-day annoyance in my own case. I use four spam filters, two on the server and two on the client, and the result is the graph above, taken from just one day. Although more spam than real email is sent to me, almost all of it is filtered out with virtually no false positives.
(Note: I know I may not be typical, but I do think I'm not a total outlier either. I have a public email address, so I get a good bit of spam. And of those four filters, three are automatically provided by my company, ISP or email software. I only pay for the fourth. In other words, most people's problems are not as bad as mine and their solutions are equally within reach, so their experience shouldn't be much worse than my own.)
I call this multiple-filter approach a "cocktail therapy", intentionally invoking the drug combinations that have shown such success in fighting AIDS. The best strategy is to attack simultaneously on many fronts.
My spam-blockers are a combination of Bayesian filters and networked blacklists. Although they all play a role, I think the most powerful among them is Cloudmark's SafetyBar, which uses collaborative filtering. Its client software fingerprints email that you identify as spam and sends that fingerprint to all the other Cloudmark clients out there. If enough people with enough good track records call something spam, it probably is. The client software automatically moves a blacklisted email into the spam folder for everyone else, saving them the trouble of clicking on it themselves. (If someone mistakenly calls something spam and others reverse it, the mistaken user's rating will go down, reducing their influence until they earn enough confirmatory credit again.) This is The Wisdom of Crowds at work, market forces applied to fight automated crime.
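To make the mechanism concrete, here is a minimal sketch of reputation-weighted collaborative filtering as described above. It is not Cloudmark's actual algorithm: the `SpamNetwork` class, the trust values, the threshold and the exact-hash `fingerprint` function are all assumptions for illustration (a real fingerprint would presumably be fuzzy enough to survive small changes to the message).

```python
import hashlib
from collections import defaultdict


def fingerprint(message: str) -> str:
    """Hypothetical fingerprint: hash of the lowercased, whitespace-normalized text."""
    normalized = " ".join(message.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class SpamNetwork:
    """Toy shared blacklist in which each report counts for the reporter's trust."""

    def __init__(self, threshold: float = 2.0, start_trust: float = 1.0):
        self.threshold = threshold      # assumed: total trust needed to blacklist
        self.start_trust = start_trust  # assumed: trust given to a new user
        self.trust = {}                 # user -> current trust rating
        self.reports = defaultdict(set)  # fingerprint -> users who reported it

    def _trust_of(self, user: str) -> float:
        return self.trust.setdefault(user, self.start_trust)

    def report_spam(self, user: str, message: str) -> None:
        """A user clicks 'this is spam'; the fingerprint is shared with everyone."""
        self._trust_of(user)
        self.reports[fingerprint(message)].add(user)

    def is_spam(self, message: str) -> bool:
        """Blacklisted once enough trusted users have reported the same fingerprint."""
        reporters = self.reports.get(fingerprint(message), set())
        return sum(self._trust_of(u) for u in reporters) >= self.threshold

    def reverse(self, message: str) -> None:
        """Others say the message was legitimate: halve each reporter's trust."""
        for user in self.reports.pop(fingerprint(message), set()):
            self.trust[user] = self._trust_of(user) * 0.5


net = SpamNetwork(threshold=2.0)
net.report_spam("alice", "Buy  CHEAP pills")
print(net.is_spam("buy cheap pills"))   # one report is not enough
net.report_spam("bob", "buy cheap pills")
print(net.is_spam("Buy cheap pills"))   # two trusted reports cross the threshold
```

The halving in `reverse` is one arbitrary way to model "the mistaken user's rating will go down"; earning credit back through confirmed reports is left out to keep the sketch short.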
I think we can combat the problem of comment and trackback spam using the same forces--democracy and shared information. The computers that are depositing links to porn sites in my comments and trackbacks (now all deleted, albeit by hand) are doing the same at other blogs. Yet the implicit data contained in my deletions is lost if it isn't shared with those other sites. A little bit of software can make all the difference, recording the thumbs-down "votes" of vigilant bloggers pruning their comments of spam and automatically passing that information on to other blogs, where the deletions can happen with no human intervention.
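As a rough sketch of what that "little bit of software" might look like, the example below records which blogs deleted comments linking to a given domain and flags new comments once enough independent blogs agree. Everything here is hypothetical: the `SharedDeletionLog` class, the choice of linked domains as the shared signal, and the two-blog threshold are assumptions, not a description of any existing tool.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def linked_domains(comment: str) -> set:
    """Return the lowercased hostnames of all http(s) links found in a comment."""
    domains = set()
    for url in URL_RE.findall(comment):
        try:
            host = urlparse(url).hostname
        except ValueError:  # malformed URL; ignore it
            continue
        if host:
            domains.add(host)
    return domains


class SharedDeletionLog:
    """Toy shared record of which blogs deleted comments linking to which domains."""

    def __init__(self, min_blogs: int = 2):
        self.min_blogs = min_blogs     # assumed: distinct blogs needed to auto-delete
        self.votes = defaultdict(set)  # domain -> blogs that cast a thumbs-down

    def record_deletion(self, blog: str, comment: str) -> None:
        """A blogger deleted this comment by hand; share that vote."""
        for domain in linked_domains(comment):
            self.votes[domain].add(blog)

    def should_delete(self, comment: str) -> bool:
        """True if any linked domain was voted down by enough distinct blogs."""
        return any(
            len(self.votes.get(domain, ())) >= self.min_blogs
            for domain in linked_domains(comment)
        )


log = SharedDeletionLog(min_blogs=2)
log.record_deletion("blog-a", "great post http://pills.example/buy")
log.record_deletion("blog-b", "see http://pills.example/other")
print(log.should_delete("nice site! http://pills.example/new"))
```

Counting distinct blogs rather than raw deletions means a single overzealous (or malicious) blogger cannot blacklist a domain alone.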
There are, to be sure, differences between email and comment spam. For starters, most comment spam occurs in the parts of the web only visited by machines: dusty archives patrolled just by spiders and, unlike your own inbox, rarely seen by humans. Furthermore, Google has added financial fuel to the fire by making links, even if nobody follows them, essentially free money, which is more than can be said for most herbal Viagra email spams.
But the similarities are even more striking. Already, an open source project called MT Blacklist (MT=Movable Type, the blogging software from SixApart that underlies the TypePad hosting services that I use, too) is showing how collaborative comment-spam filtering might work. It's still pretty crude and not yet automatic, but Jay Allen, its creator, describes the promising future of the project here. Another advance is Google's "no-follow" tag, which allows bloggers to deprive comment spammers of Google juice.
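In practice the "no-follow" tag is a `rel="nofollow"` attribute on a link, which tells search engines not to count that link toward the target's ranking. Below is a minimal sketch of how blogging software might stamp it onto every link in a submitted comment. The `add_nofollow` function is hypothetical and regex-based for brevity; a real implementation would more likely use a proper HTML parser.

```python
import re

A_TAG = re.compile(r"<a\b([^>]*)>", re.IGNORECASE)
REL_ATTR = re.compile(r"""\s+rel\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE)


def add_nofollow(comment_html: str) -> str:
    """Rewrite every <a ...> tag so that it carries rel="nofollow"."""

    def rewrite(match):
        # Drop any rel attribute the commenter supplied, then add our own.
        attrs = REL_ATTR.sub("", match.group(1))
        return f'<a{attrs} rel="nofollow">'

    return A_TAG.sub(rewrite, comment_html)


print(add_nofollow('Nice post! <a href="http://pills.example/">cheap pills</a>'))
```

The link still works for any human who clicks it; it simply stops paying the spammer in search ranking.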
The low-hanging fruit for this is hosted blogging services such as TypePad, where everything is done on a central server and the information about who-is-deleting-what can be easily shared. I was pleased to hear that the Cloudmark and SixApart folks have met to talk about this. "We're going to be adapting the lessons of the email spam world to comment spam," Michael Sippey, SixApart's VP for Products, told me. "We're looking at ways to include community feedback mechanisms into that filtering process." No announcements yet, but I'd be surprised if within the year my trackback and comment spam wasn't automagically cleaned up as rapidly as my inbox now is.