Bring tha noize!
Following my original post on the subject, I've been having some interesting discussions with smart folk about signal-to-noise ratios in the Long Tail. Those ratios are important because they dictate consumer behavior. Too much noise, and people don't buy. And without good filters (from search to recommendations), the Long Tail is just noise.
This may seem a bit arcane, but as I'm writing a whole chapter on filters, recommendations and other tools and techniques to drive demand down the Tail, it's worth drilling down a bit more here.
John Hagel thought I'd gone off the rails a bit with my analysis. He's as bright as they come, so I've gone to more than the usual lengths to reconsider this issue. Here's our debate, which I'll share with all of you because I think it eventually led to an interesting insight.
I originally posted this conceptual graph:
In a post, John took issue with it:
[Chris] asserts that signal to noise ratios decrease as one move down the tail. Really? Isn't that subjective? I may be a real outlier (aren't we all?) but, at least for me in the realm of music, the signal to noise ratio decreases as I move up the tail. The real point, I think, is that the sheer quantity (rather than the quality) of items increases as we move down the tail and the ready availability of information about these items diminishes - that's what increases the difficulty of connecting with relevant resources as we move down the tail.
We batted this back and forth in several emails, but I didn't make much headway. John's taste in music is quite niche, he says. He doesn't like anything in the top 100, or even much in the top 1,000. And he suspects that there are lots of people like him (not necessarily in his particular niche, but in ones like it). As a result, he argues that his s/n ratio actually goes the other way, which is to say that it's zero at the head (no signal), peaks somewhere in the middle or even further down, and only then eventually falls under the weight of a zillion garage bands at the end of the Tail.
I tried to persuade John with yet more conceptual charts, explaining that the shape I drew was the aggregation of everyone's s/n ratios, which are indeed all different but together end up looking like the one I originally posted. Like this:
But he was still unconvinced. This was surprising, since I think it's totally straightforward: there's more noise in the tail because there's more everything there. Most stuff doesn't sell very well, so the volume of the material available--and by extension the volume of stuff you don't want--rises as the Long Tail falls. Like this:
Whatever you're looking for, there's more stuff you aren't looking for the further you go down the tail. Which is why the signal-to-noise ratio gets worse, even if you're more likely (with good search and filters) to find what you want as you go down the tail.
It sounds like a paradox, but it isn't. Much of what you want is in the tail. Most of what you don't want is also in the tail. That's why you need increasingly powerful filters to extract the good from the greater bad.
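To make the not-quite-paradox concrete, here is a minimal sketch with made-up numbers. The segment sizes and the counts of wanted albums are assumptions chosen for illustration, not data from anyone's collection. It shows how most of what a listener wants can sit in the tail while the tail's signal-to-noise ratio is still far lower than the head's.

```python
# Hypothetical numbers, chosen only to illustrate the argument.
HEAD_SIZE = 5_000      # titles ranked 1 to 5,000
TAIL_SIZE = 595_000    # titles ranked 5,001 to 600,000
wanted_in_head = 20    # albums this imaginary listener wants from the head
wanted_in_tail = 80    # albums this imaginary listener wants from the tail


def signal_to_noise(wanted, total):
    """Wanted titles divided by unwanted titles in a segment."""
    return wanted / (total - wanted)


share_in_tail = wanted_in_tail / (wanted_in_head + wanted_in_tail)
head_sn = signal_to_noise(wanted_in_head, HEAD_SIZE)
tail_sn = signal_to_noise(wanted_in_tail, TAIL_SIZE)

print(f"Share of wanted albums found in the tail: {share_in_tail:.0%}")
print(f"Head s/n: {head_sn:.5f}")
print(f"Tail s/n: {tail_sn:.5f}")
```

With these invented figures, four out of five wanted albums are in the tail, yet each one is buried under far more unwanted titles than in the head. That is the case for stronger filters further down.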
But conceptual graphs weren't doing it; John was still skeptical. So I turned to actual data. I analyzed the music collections of five people: me, Anne (my wife), my assistant Peter Arcuni, John himself (turns out that he's into rockabilly, surf music and Algerian "rai music", which seems to be some sort of ethno-techno thing), and Koranteng Ofosu-Amaah, for no other reason than that he showed up amongst my trackbacks and appears to have quite fringy music taste (African pop, mostly) that he was kind enough to provide Amazon links to. I've got Koranteng twice: once for a random list of things he's listening to, and once for his best-of-the-year list.
Overall, the number of albums that I included from each subject's collection ranged from a few dozen to more than 200 titles. (This is too small to be anything more than suggestive. But I'm working with some companies to extend this analysis to proper large-n data sets, which I hope to be able to include in the book.)
This is what I found (using logarithmic curve-fitting to smooth the underlying numbers):
What's important to note here is that everyone, no matter how niche, shows a falling s/n ratio as you go farther down the tail. Why is this, when John, Koranteng and I (indeed, all of us except for Anne) have no albums in the top 100, and only a few in the top 1,000? Why did it not show the rising-then-falling shape I predicted in my conceptual graph above?
The answer is that there is so much music out there that even what we consider niche is usually still top-decile.
In the above chart I binned the Amazon ranks of record collections by 5,000s, which is the smallest unit that gives any decent overview. I cut off the chart at 100,000 for visual impact, although the full analysis and the collections included several albums in the 600,000 range.
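For readers who want to try this on their own collection, the binning and smoothing can be sketched roughly as follows. This is not the spreadsheet behind the chart: the ranks are invented, the helper names `bin_counts` and `fit_log_curve` are hypothetical, and "logarithmic curve-fitting" is read here as a least-squares fit of count = a + b * ln(rank), which is an assumption about the functional form.

```python
import math
from collections import Counter

BIN_WIDTH = 5_000   # bin size described in the post
CUTOFF = 100_000    # chart cutoff described in the post


def bin_counts(ranks, width=BIN_WIDTH, cutoff=CUTOFF):
    """Count albums per rank bin; bin 0 covers ranks 1..width."""
    n_bins = cutoff // width
    counts = Counter((r - 1) // width for r in ranks if 1 <= r <= cutoff)
    return [counts.get(i, 0) for i in range(n_bins)]


def fit_log_curve(counts, width=BIN_WIDTH):
    """Least-squares fit of count = a + b * ln(bin midpoint rank)."""
    xs = [math.log((i + 0.5) * width) for i in range(len(counts))]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, counts))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Invented sales ranks for one imaginary collection (not real data).
ranks = [850, 1200, 2400, 3100, 4700, 6200, 7900, 9500, 12000, 14500,
         18000, 23000, 31000, 47000, 64000, 88000, 150000, 610000]

counts = bin_counts(ranks)
a, b = fit_log_curve(counts)
print("Albums per 5,000-rank bin:", counts)
print(f"Fitted curve: count = {a:.2f} + {b:.2f} * ln(rank)")
```

A negative `b` means the fitted density of the collection falls as rank increases. Albums ranked beyond the cutoff are simply dropped from the chart, as described above.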
All that rising-then-falling shape that I illustrated with the conceptual s/n graph actually takes place entirely in the top 5,000-10,000 for most people. By the time you're past that, the density of almost everyone's music collection (which is to say, their s/n ratio) falls as you go down the tail. When you're binning by 5,000s, as I have, all the fine structure in the head is obscured by the larger perspective.
Top 100 is irrelevant in an abundant market. Even the library of my most mainstream subject (Anne) had an average rank of 3,000. In the big picture of the Long Tail, there are so many items that even today's niche looks relatively popular. For instance, the average sales rank in my own collection was 25,000. That may sound super-fringe, but it still puts my average in the top 5% of Amazon's offerings. You've got to pull back and see the whole market. And at that resolution, the falling s/n ratio curve I originally described emerges for almost all of us.
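The arithmetic behind those percentages is easy to check. The sketch below assumes a catalog of 500,000 titles, which is simply the size implied by "rank 25,000 is in the top 5%". The actual catalog size is not stated here, and `top_share` is a hypothetical helper.

```python
def top_share(rank, catalog_size):
    """Fraction of the catalog ranked at or above `rank`."""
    return rank / catalog_size


# Assumption: 500,000 titles, the size implied by "25,000 is top 5%".
ASSUMED_CATALOG = 500_000

for label, avg_rank in [("most mainstream subject", 3_000),
                        ("author", 25_000)]:
    share = top_share(avg_rank, ASSUMED_CATALOG)
    print(f"{label}: average rank {avg_rank:,} -> top {share:.1%}")
```

Under that assumption the two averages land at roughly the top 0.6% and the top 5% of the catalog. A larger catalog would push both even closer to the head.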
Long Tails are long, and it's illuminating to stand back and see the whole curve. The microstructure of the current hits business, the blockbuster charts our culture has so long fixated on, is quickly lost in the macrostructure of the entire music universe. It's a big world out there, and the top 40 is just the beginning of it, not the end.







Yes, I think this is something that most people just don't understand about the market for music: a title that sells 10,000 units in a given year is probably in the top 10% of sales for that year and certainly in the top 20% (i.e. 90% of all discs released in that year will have sold less than that).
Indeed, I saw a figure somewhere, probably based on SoundScan data, stating that most CDs sell fewer than 1,000 units. The leap from selling 1,000 units to 10,000 units is much, much bigger than most people realize (including many musicians who do not realize just how big an accomplishment it is to sell 10,000 records).
That's because the power laws are in full effect in the music biz, where most of the real spoils go only to the very top part of the curve. Indeed, the L.A. Times reported a while back that "[s]tatistics tabulated by SoundScan... [indicate that]...[o]f the 6,188 albums released [in 2000], only 50 sold more than a million copies. Sixty-five sold 500,000 units and 356 sold 100,000 or more."
So I agree with you. In aggregate, there's much more noise at the long tail end of the curve.
Jake
Posted by: Jake | June 13, 2005 at 01:35 AM
It would be interesting to plot this out with real data for an N > 5. For instance, Audioscrobbler's aggregate listening data for ~150,000 people is sitting for the taking at: http://www.audioscrobbler.com/data/
Looking at actual listening behavior also removes some of the potential self-report bias of just asking people what their top records of the year are. Of course, it also reveals taste preferences instead of just purchase preferences, which may in fact show different distribution patterns. If anything I would suspect that repeated listening patterns are even more top-heavy than people realize (and self-report).
Posted by: an_anon_poster | June 13, 2005 at 08:34 AM
One weakness in your data analysis: you are confounding current sales with cumulative sales or past sales.
The probability of being represented in a music collection will be some function of the lifetime popularity of the CD (in other words, its cumulative sales) or possibly its peak popularity at some point in the past.
A CD's rank on Amazon, on the other hand, is almost purely dictated by its current sales. (I believe that Amazon does give some weight to historical sales, but sales that happened more than a month ago have very little weight.)
In my own case, my CD collection has twenty years' worth of music. I wasn't even an early adopter, so other people will have been collecting for more years.
So, for example, I own a Paul McCartney CD that was a best-seller in 1985 when I got it as a birthday present. This CD currently ranks #27,163 on Amazon, which is remarkable for an old CD.
Anyway, your curve, if plotted for my collection, would be bumped up for the 25-30,000 bin, even though in reality that CD represents a purchase from the <5,000 bin.
Posted by: Jakob Nielsen | June 13, 2005 at 09:44 AM
Anon: Absolutely, I need to do this with larger N datasets (I've updated the post to make that clearer). BigChampagne is one company that has that data and we're talking. For the record, the Ns for the individuals I looked at ranged from a low of 20 in one case to a high of more than 300. John did a rough analysis of his own 3,000-CD collection, and I took a few liberties in charting it. Overall, as I mentioned, it's just suggestive at this size but does point the way to a proper analysis as a next step.
Posted by: chris anderson | June 13, 2005 at 10:42 AM
Jakob: Thanks for yet another insightful comment. I did realize that the problem with Amazon rankings is they're mostly based on current sales, which depresses the rankings of older titles (although back-catalog sales represent a bigger part of total sales in music than in any other industry I've looked at).
Given that I won't be able to tell when people bought their music, or what the sales rank was at that time, I need to look for useful proxies. One, using P2P data such as that from BigChampagne, would be to rank music by the number of personal collections it appears in. That way I could compare individual data with collective data, which seems like a reasonable approach to plotting personal collections on an overall popularity chart. Do you agree?
Posted by: chris anderson | June 13, 2005 at 10:50 AM
It feels to me like this data is skewed toward the top of the tail right now because no good filter exists for music. That is, how am I supposed to find stuff that's further down the curve? For instance, when I'm working, I like to listen to "chill" music such as Morcheeba and Air. Someone recently recommended Zero 7 to me, and I really like it, but the word-of-mouth filter is not so good when you have hundreds of thousands of albums to deal with. I know there are guys out there trying to brute-force it by hiring grad students to evaluate music and categorize it, and there are companies trying to do some sort of waveform analysis, but these both seem like they're doomed. Netflix does a good job of creating these recommendations, but I haven't seen anything very good for music. So, two things:
1. What sorts of filters work best for recommending music?
2. Once the filters are working well, won't more of the music in people's collections start shifting down the tail? I think this has already happened a little with TV since there used to be just a few channels and now with more channels the viewing audience is spread out over more of them.
Posted by: Chris Neumann | June 13, 2005 at 05:10 PM
that first graph intrigued me since it reminds me of a Temperature vs Pressure phase diagram. i've been turning over the problem(?) of "cult" content (movies like "Rocky Horror", products like Pez, aso) in my head for a while and haven't seen that resolved graphically (to be fair, i'm not following all this as closely as i could; apologies). but it's like cult items just never quite reach a kind of critical mass; the kind that might carry them out of their niche. so i started equating that with the energy required for state changes.
you can see an example phase diagram here.
plus i like how this graph might offer - via sublimation - a kind of explanation for inexplicable success stories. "The Blair Witch" movie comes to mind. it very much reminds me of something starting out as a solid and going straight to a gas.
anyway, thought i'd throw that out there. and if someone could point me to where "cult" stuff is folded into all this - i'd appreciate it.
Posted by: csven | June 13, 2005 at 06:21 PM
Estimates based on SoundScan source data (web addresses listed at the end of this note) for albums released in 2004:
~40,000 new album titles were released in 2004.
~269 million units of new album titles were sold in 2004.
~55% of new album titles sold fewer than 100 units.
~25% of new album titles sold at least 100 units but fewer than 1000 units.
46 new album titles sold at least one million units. yes, just slightly more than 0.1% of new album titles reached platinum status.
source data:
www.narm.com/2004Convention/Nielsen.ppt
http://forums.moopy.co.uk/showthread.php?t=15480
Posted by: M.V. | June 14, 2005 at 03:25 AM
Hi there Chris... I give you:
On The Long Tail of Music, Metrics and Recommendations
I have some better data for you (read my entire music collection), some commentary (it's a blog after all, I need to add some bits of value) and pointers to some artists who inhabit this fringe
Posted by: Koranteng Ofosu-Amaah | June 14, 2005 at 11:43 AM
Spot on, the last point..."It's a big world out there, and the top 40 is just the beginning of it, not the end."
Top 40 works as the end as well. It's the last stop. Majors. Big push, marketing etc. Distilled by refined ears and bigger budgets. Most of top40 surfs off the long tail. Time defines the tail in a way. The Killers were part of the noise, off the radar 3/4 years ago, known to less than more. Now they are Top40 Top10. Part of the short curve. Or are they just kicking out to another or the next wave/ larger universe.
That said, the top40 enables both financially and formatically. It thankfully gets lost in what is now an ocean of an industry and music. ("Top 100 is irrelevant in an abundant market") The relevancy exists as the enabler. It's just that the record industry has become the music industry and the economies of the new digital world have punctuated that clearly.
Top40's expensive sounds, production values, and mass-media machine marketing serve as a center benchmark, one to destroy musically, but a center that's needed to go long on. It (top40) oddly enuff is the noise for the non music consumer but the closest they come to any tail at all. The start of their "consumer behavior" and a bigger piece of the entertainment segment.
It was not that long ago that the music sector did not compete with games, dvds, the net etc for play.
Posted by: mediaeater | June 14, 2005 at 05:27 PM
You might get some traction by backing off the Shannonesque view of information implicit in S/N ratios, and instead look at Bateson's definition of information: "A difference that makes a difference". Implying the question: A difference to whom? That puts POV at the core of the analysis, and (I believe) suggests some new directions for evolution of 'search' and other elements of citizens' media platforms.
Like Jakob's point (hi!) as well. The surrounding market changes over time. For that matter, we change over time. Today's POV is not yesterday's.
Posted by: Tim Oren | June 15, 2005 at 03:24 PM
"there's more noise in the tail because there's more everything there."
I don't think that holds at all. S/N ratio is based on the proportion, not the absolute size of either the signal or noise.
Posted by: Hamish MacEwan | June 18, 2005 at 05:14 AM
One thing to be careful of when talking about using filters to "move people down the tail". It makes the tail itself sound static, which it isn't.
If there was so little music that we all could listen to everything first before buying, the tail would really represent the taste of everyone, but obviously we need the filtering because it isn't like that.
But when the filtering helps you move down the tail, you'll also buy what you like, which moves it UP the tail. So, the very act of filtering changes the tail itself. Since Amazon is itself an example of the long tail, that is why the results seem a bit counter-intuitive.
My guess is that all of this is just getting started, and as time goes on the curve will become a bit more linear, with a dropoff that isn't quite so severe at the top. Or at least that the gap between the top 40 and the niches won't be quite so big, as the popular niches get pushed up the tail by the newer forms of exposure and filters.
Posted by: Shawn Fumo | June 22, 2005 at 09:11 AM