February 14, 2005

Can blog spam be solved like email spam?

As always, the gnarly questions that come up when I give a speech are the best part of the experience. On Thursday I spoke at The Media Center's Emerging Technology conference at Stanford University, where I was asked a toughie: could bad actors drown the Tail in a tide of spam and other robotic graft?

The fear, to put it simply, is that the Tail is fragile and that niche quality can easily be overwhelmed by industrial-scale automated marketing, fraud and vandalism. After all, the signal-to-noise ratio of the Tail is already low; could digital graffiti push it to insignificance? The fall of the once-great Usenet, which collapsed under the weight of autospam and trolls, is the cautionary example. Will the same happen to the blogosphere and all the trust networks and other recommendation mechanisms that we use to pull diamonds from the rough?

I think the answer is no. At the conference I said, controversially, that I felt a combination of technologies had finally come together to turn the corner on the email spam problem and I didn't see why the same approach wouldn't work for comment spam and other automated web scams.  In short, I think the good guys, backed by smart technology, will win in the end. (Ross Mayfield did a fine job covering the talk here.)

My confidence comes primarily from seeing the spam problem, which was until recently deemed an out-of-control epidemic, get reduced to a very manageable two-or-three-a-day annoyance in my own case. I use four spam filters, two on the server and two on the client, and the result is the graph above, taken from just one day. Although more spam than real email is sent to me, almost all of it is filtered out with virtually no false positives.

(Note: I know I may not be typical, but I do think I'm not a total outlier either. I have a public email address, so I get a good bit of spam. And of those four filters, three are automatically provided by my company, ISP or email software. I only pay for the fourth. In other words, most people's problems are not as bad as mine and their solutions are equally within reach, so their experience shouldn't be much worse than my own.)

I call this multiple-filter approach a "cocktail therapy", intentionally invoking the drug combinations that have shown such success in fighting AIDS. The best strategy is to attack simultaneously on many fronts.
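To see why layering helps, here is a rough back-of-the-envelope sketch in Python. The miss rates are made-up illustrative numbers, not measurements from the filters described here, and the calculation assumes the filters fail independently of one another, which real filters only approximate. The names `combined_miss_rate` and `flagged_by_any` are hypothetical.

```python
from typing import Callable, Iterable, Sequence

# A filter is any function that returns True when a message looks like spam.
SpamFilter = Callable[[str], bool]


def combined_miss_rate(miss_rates: Iterable[float]) -> float:
    """Fraction of spam that slips past every filter.

    Assumes each filter misses spam independently of the others
    (an idealization; real filters tend to share blind spots).
    """
    remaining = 1.0
    for rate in miss_rates:
        remaining *= rate
    return remaining


def flagged_by_any(message: str, filters: Sequence[SpamFilter]) -> bool:
    """Cocktail rule: treat a message as spam if any single filter flags it."""
    return any(check(message) for check in filters)


# Hypothetical numbers: four filters that each miss 20% of spam.
print(combined_miss_rate([0.2, 0.2, 0.2, 0.2]))
```

Under those assumptions, four mediocre filters together let through well under 1% of spam. The catch is that the same "any filter can veto" rule compounds false positives too, so each layer has to be conservative about what it flags.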

My spam-blockers are a combination of Bayesian filters and networked blacklists. Although they all play a role, I think the most powerful among them is Cloudmark's SafetyBar, which uses collaborative filtering. Its client software fingerprints email that you identify as spam and sends that fingerprint to all the other Cloudmark clients out there. If enough people with enough good track records call something spam, it probably is. The client software automatically moves a blacklisted email into the spam folder for everyone else, saving them the trouble of clicking on it themselves. (If someone mistakenly calls something spam and others reverse it, the mistaken user's rating will go down, reducing their influence until they earn enough confirmatory credit again). This is The Wisdom of Crowds at work, market forces applied to fight automated crime.
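The reputation-weighted voting described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not Cloudmark's actual algorithm: the class name, the threshold, the halving penalty and the exact-hash fingerprint are all assumptions made for the example. In particular, an exact hash is easy for spammers to dodge by changing a single character, so a real system would presumably need fuzzier fingerprints.

```python
import hashlib
from collections import defaultdict


class ReputationFilter:
    """Toy collaborative filter: spam reports are weighted by reporter trust."""

    def __init__(self, threshold=2.0):
        self.threshold = threshold               # total trust needed to blacklist
        self.trust = defaultdict(lambda: 1.0)    # reporter -> trust weight
        self.reports = defaultdict(set)          # fingerprint -> reporters

    @staticmethod
    def fingerprint(message):
        # Assumption: normalize case and whitespace, then hash the result.
        normalized = " ".join(message.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def report_spam(self, reporter, message):
        """Record that this reporter marked the message as spam."""
        self.reports[self.fingerprint(message)].add(reporter)

    def score(self, message):
        """Sum of the trust weights of everyone who reported this message."""
        reporters = self.reports.get(self.fingerprint(message), ())
        return sum(self.trust[r] for r in reporters)

    def is_spam(self, message):
        return self.score(message) >= self.threshold

    def reverse(self, message):
        """Others say this was not spam: clear it and penalize its reporters."""
        for reporter in self.reports.pop(self.fingerprint(message), set()):
            self.trust[reporter] *= 0.5


spam_filter = ReputationFilter(threshold=2.0)
spam_filter.report_spam("alice", "Buy  CHEAP pills")
spam_filter.report_spam("bob", "buy cheap pills")
print(spam_filter.is_spam("Buy cheap pills"))
```

The design choice worth noticing is that a mistaken report costs the reporter influence, so a spammer who joins the network and starts flagging legitimate mail quickly stops mattering.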

I think we can combat the problem of comment and trackback spam using the same forces--democracy and shared information. The computers that are depositing links to porn sites in my comments and trackbacks (now all deleted, albeit by hand) are doing the same at other blogs. Yet the implicit data contained in my deletions is lost if it isn't shared with those other sites. A little bit of software can make all the difference, recording the thumbs-down "votes" of vigilant bloggers pruning their comments of spam and automatically passing that information on to other blogs, where the deletions can happen with no human intervention.
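As a rough sketch of what that "little bit of software" might look like, the Python below pools deletions from several blogs and auto-deletes a new comment once enough different blogs have thrown out comments linking to the same domain. Everything here is hypothetical: the `SharedDeletionLog` name, the choice of linked domains as the shared signal, and the `min_blogs` threshold are assumptions for illustration, not a description of any existing tool.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def linked_domains(comment):
    """Return the set of host names that a comment links to."""
    domains = set()
    for url in URL_RE.findall(comment):
        try:
            host = urlparse(url).hostname
        except ValueError:
            continue  # skip malformed URLs rather than fail
        if host:
            domains.add(host.rstrip("."))
    return domains


class SharedDeletionLog:
    """Pools bloggers' deletions so that other blogs can act on them."""

    def __init__(self, min_blogs=3):
        self.min_blogs = min_blogs
        # domain -> set of blogs that deleted a comment linking to it
        self.deleted_by = defaultdict(set)

    def record_deletion(self, blog, comment):
        """A blogger deleted this comment by hand: count it as a thumbs-down."""
        for domain in linked_domains(comment):
            self.deleted_by[domain].add(blog)

    def should_auto_delete(self, comment):
        """True if enough distinct blogs have deleted links to the same domain."""
        return any(
            len(self.deleted_by.get(domain, ())) >= self.min_blogs
            for domain in linked_domains(comment)
        )
```

Counting distinct blogs rather than raw deletions means one overzealous (or malicious) blogger cannot blacklist a site alone. A fuller version would probably also weight votes by each blogger's track record, as the email filters do.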

There are, to be sure, differences between email and comment spam. For starters, most comment spam occurs in the parts of the web only visited by machines: dusty archives patrolled just by spiders and, unlike your own inbox, rarely seen by humans. Furthermore, Google has added financial fuel to the fire by making links, even if nobody follows them, essentially free money, which is more than can be said for most herbal Viagra email spams.

But the similarities are even more striking. Already, an open source project called MT Blacklist (MT=Movable Type, the blogging software from SixApart that underlies the TypePad hosting services that I use, too) is showing how collaborative comment-spam filtering might work. It's still pretty crude and not yet automatic, but Jay Allen, its creator, describes the promising future of the project here. Another advance is Google's "no-follow" tag, which allows bloggers to deprive comment spammers of Google juice.
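In practice the "no-follow" tag is a `rel="nofollow"` attribute on the link itself. A minimal sketch of how blogging software might emit commenter-supplied links with it follows. The function name `render_comment_link` is hypothetical, and the sketch only escapes the values for HTML; it does not check that the URL uses a safe scheme.

```python
from html import escape


def render_comment_link(url, text):
    """Render a commenter-supplied link that search engines should not credit."""
    return '<a href="{}" rel="nofollow">{}</a>'.format(
        escape(url, quote=True),  # escape quotes and ampersands in the URL
        escape(text),             # escape any markup in the visible text
    )


print(render_comment_link("http://example.com/?a=1&b=2", "my site"))
```

This does nothing to stop the spam from being posted. It only removes the search-ranking payoff, which is why it works best alongside filtering rather than instead of it.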

The low-hanging fruit for this is hosted blogging services such as TypePad, where everything is done on a central server and the information about who-is-deleting-what can be easily shared. I was pleased to hear that the Cloudmark and SixApart folks have met to talk about this. "We're going to be adapting the lessons of the email spam world to comment spam," Michael Sippey, SixApart's VP for Products, told me. "We're looking at ways to include community feedback mechanisms into that filtering process." No announcements yet, but I'd be surprised if within the year my trackback and comment spam wasn't automagically cleaned up as rapidly as my inbox now is.

TrackBack

Listed below are links to weblogs that reference Can blog spam be solved like email spam?:

» Can blog spam be solved like email spam? from del.icio.us WebCites
Suggests something like collaborative filtering to delete SPAM. This is a little what is going on with MT-Blacklist.... [Read More]

» Critical Content from Stevens Weblog
[Read More]

» Cloudmark from Johnnie Moore's Weblog
I'm currently trialling the Cloudmark's SafetyBar anti-spam software, after learning about it from Chris Anderson. Here's what interests me: at the moment, I am not plagued with Spam and the built-in filters on Outlook 2003 work pretty well. So far,... [Read More]

» Spamming the Bloggers from Jim Glass and the VSXUE Team
[Read More]

» links for 2005-03-11 from Journal
Can blog spam be solved like email spam? If it can, does that mean it can’t? After all email spam is far from gone (categories: spam blogging) MT-Blacklist/Comment Spam Clearinghouse Jay Allen talks about the future of MT Blacklist... [Read More]

» Site Difficulties Update from The Left Coaster
As Steve noted in his post, comments had been disabled on the site in response to a very bad spam attack today. Things should be back to normal now, but I thought I'd explain what happened and what we are... [Read More]

Comments

I think Thunderbird warrants a mention here. It has free spam blocking which is good but has to be trained. If the Mozilla foundation could set up a system where marking something as spam would send information to their server to be downloaded by others it may make a real difference. Microsoft is doing something similar with their Anti-Spyware beta.

I'm not sure I know what the question means. Does it mean anything in the context of eBay, Netflix, Amazon, abeBooks? Does it mean anything in the context of the wealth distribution? Or is this just about some important subset of the systems that leverage long tail dynamics?

One of the key features of the long tail systems is that the tail is full of junk - stuff that by most measures doesn't rise above some quality bar or another. So asking if bad actors can pollute it seems a bit strange.

I'm not convinced that the email spam issue has been 'solved' at all; maybe it's just paranoia, but I'm beginning to see this as the calm before the storm. Hotmail & Gmail's spam filters have been growing progressively worse over time. A lot of spam I've seen lately appears to be chunked with random text (physics texts are popular for some reason) which can only be an attempt to degrade the effect of the filters.

I always wonder what the point of that is. Whenever I get an e-mail with a bunch of nonsense words in the subject, even if it makes it through my filters, I just delete it without opening it. Are spammers sending these things out specifically to mess with the filtering, or do they actually expect people to open their e-mails?

I dunno. I think you would actually want a lot more spam filters with more false positives, so that it'll throw out real mail and make that graph look more like the Long Tail. :)

I asked that question, and while your post here is more well-thought-out than the on-your-feet response you gave at the conference, I still don't buy that e-mail spam is conquered. And I don't think recommendation spam will be easily defeated.

If/when spammers realize recommendations are the key to navigating the long tail, they're going to be pulling out all of the stops to influence that navigation. MT Blacklist is a cool application and it's serving me pretty well, but I still have to wade through and delete/report a lot of crap each day. There also are setup problems that I'm still fighting through. This isn't turnkey.

And remember: It's the people who aren't using Blacklist and who own blogs packed with comment spam that encourage/enable this spamming.

While I think your argument has merit and I hope it goes a long way toward solving the problem, I wouldn't underestimate how cunning and innovative the spammers can be ...

Only to say that, educated and literate as I am, I can't make heads or tails--hah!--of what you're talking about on your blog here. "Gnarly" I know but "robotic graft" and "signal-to-noise-ratio!" Totally foreign. And I got here from some nice parenting blog. How could it be!? Good luck.

I generally agree with your answer, but I'm still struggling with the premise of the question. Doesn't it turn on the definition of "spam"? By definition, a lot of content in the tail is irrelevant to the majority of consumers. We might call that spam, or we might just call it content that appeals to a minority interest. I've never been clear on the difference. Eric.

Eric, it seems to me that the difference between spam and other irrelevant content is that the latter sits out there on the web, and the spam ends up in your mailbox. Likewise w/ programs (rather than people) that deposit comments on blogs; it's not just irrelevant content, it's delivered to your door in bulk.

And Eve, I think any good parenting blog should address the issue of signal-to-noise ratio.

David, not sure the push/pull distinction works--especially in the day of RSS readers, email alerts like Google News alerts, etc. In the end, the problem is wanted v. unwanted; the medium we use to get there seems irrelevant to me. Eric.


The Long Tail by Chris Anderson
