This weekend I ran a session at the "SciFoo" camp (an interdisciplinary meeting of scientists and technologists) at Google that was focused on an interesting statistical problem in the Long Tail. I've long argued that the "natural" shape of most markets is a powerlaw, and that any deviation from that shape is due to some bottleneck in distribution. Get rid of the bottleneck and you can tap the latent demand in the market, unlocking the potential of the Long Tail.
The usual example I give is this one, which shows US box office revenues over a three-year period (2003-2005). Remember that a powerlaw looks like a straight line in a log-log plot, so the key part of the chart below is where the real-world data drops off the line (around rank 350).
In this case, the explanation for the fall-off is simple: they just ran out of screens. The carrying capacity of the US megaplex theater network is about 100 films per year, or 300 over three years. Over the same period about 13,000 films were shown in film festivals, although only a tiny fraction of them got mainstream commercial distribution. But if you could distribute niche films as easily as the blockbusters, the curve would look like the straight line predicted by the theory, which is, as it happens, exactly what we see with the Netflix data.
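To make the straight-line claim concrete, here's a minimal sketch in Python. The market is hypothetical (revenue = 1000 × rank^-1.2, numbers I made up, not the box office data), and `loglog_slope` is just an illustrative helper name. It fits an ordinary least-squares line to log(revenue) against log(rank).

```python
import math

def loglog_slope(ranks, values):
    """Least-squares slope and intercept of log10(value) against log10(rank)."""
    xs = [math.log10(r) for r in ranks]
    ys = [math.log10(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical market: revenue = 1000 * rank^-1.2 (made-up numbers).
ranks = list(range(1, 1001))
revenue = [1000.0 * r ** -1.2 for r in ranks]

slope, intercept = loglog_slope(ranks, revenue)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

Because the made-up data is an exact powerlaw, the fit recovers the exponent (-1.2) as the slope and log10 of the scale (3) as the intercept. Real data is noisier, and a capacity limit like the one above would show up as points falling below this line past some rank.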
However, there is another common distribution that looks a lot like a powerlaw at first, but then deviates from the straight line on its own, even without scarcity effects and other distortions. It's called a lognormal distribution and it looks a little like this:
How can you tell one from the other? This is obviously an important question, since if the theory has any real predictive power, you've got to be able to say ahead of time whether the "natural" shape of the market is a powerlaw or a lognormal. Otherwise you can't tell if the fall-off is due to a removable bottleneck, such as inefficient distribution, or not.
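One rough way to see the difference, sketched here in Python under made-up parameters: compare the log-log slope over the top ranks with the slope over the rest. A pure powerlaw has the same slope everywhere, while an idealized lognormal keeps getting steeper. The function names, the split at rank 1,000, and the values of `mu` and `sigma` are all my own assumptions for illustration. This is an eyeball diagnostic only, not a proper statistical test, and it can't separate a lognormal from a powerlaw that has been truncated by a bottleneck.

```python
import math
from statistics import NormalDist

def loglog_slope(ranks, values):
    """Least-squares slope of log(value) against log(rank)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

def lognormal_rank_curve(n_items, mu, sigma):
    """Idealized value at each rank (1 = largest) if sizes are lognormal."""
    std_normal = NormalDist()
    return [math.exp(mu + sigma * std_normal.inv_cdf(1 - (rank - 0.5) / n_items))
            for rank in range(1, n_items + 1)]

def head_and_tail_slopes(values, split):
    """Log-log slope over ranks 1..split and over the remaining ranks."""
    ranks = list(range(1, len(values) + 1))
    head = loglog_slope(ranks[:split], values[:split])
    tail = loglog_slope(ranks[split:], values[split:])
    return head, tail

n = 10_000  # hypothetical number of titles
curves = {
    "powerlaw": [1000.0 / rank for rank in range(1, n + 1)],
    "lognormal": lognormal_rank_curve(n, mu=0.0, sigma=2.0),
}

slopes = {}
for name, values in curves.items():
    slopes[name] = head_and_tail_slopes(values, split=1000)
    head, tail = slopes[name]
    print(f"{name:9s} head slope {head:.2f}, tail slope {tail:.2f}")
```

With these inputs the powerlaw reports the same slope for head and tail, and the lognormal's tail slope is markedly steeper than its head slope, which is the "deviates from the straight line on its own" behavior described above.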
The difference between those two curves is the subject of a lot of research at the cutting edge of complexity theory, and the simple answer seems to be that it comes down to the nature of the network effects that create unequal ("rich get richer") distributions such as the powerlaw and lognormal in the first place. I lay out the basics of that research in this presentation, which I gave on Saturday at Google. (Make sure you're displaying the notes field in that PowerPoint so you can read my narration.)
One final note: This doesn't affect any of the conclusions of the book, which are based on real-world data rather than predictions (although I discuss the problem briefly in the notes on page 229). Instead, it's an issue when people attempt to make predictions based on the theory, such as estimating the latent value of a television or film archive based on the assumption that the natural shape of demand would be a powerlaw. But billions of dollars in valuation depend on getting that calculation right. If you're into this stuff, check out the presentation and see if you can see a path to a solution.
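To show why the choice of shape matters for that kind of estimate, here's a hedged sketch in Python. Everything numeric is an assumption: a hypothetical archive of 13,000 titles, of which only the top 300 are observed, and a made-up powerlaw (value = 1000 / rank). The lognormal is calibrated to agree with the powerlaw at rank 1 and rank 300, so the two shapes match at both ends of the observed head, and then both are extrapolated into the unobserved tail.

```python
import math
from statistics import NormalDist

N_TITLES = 13_000  # hypothetical archive size
HEAD = 300         # ranks assumed to be actually observed
STD_NORMAL = NormalDist()

def normal_score(rank, n_items):
    """Standard-normal quantile for the item at a given rank (1 = largest)."""
    return STD_NORMAL.inv_cdf(1 - (rank - 0.5) / n_items)

# Powerlaw demand: value = 1000 / rank (scale and exponent are made up).
power = [1000.0 / rank for rank in range(1, N_TITLES + 1)]

# Lognormal demand, calibrated to match the powerlaw at rank 1 and rank HEAD.
z_top = normal_score(1, N_TITLES)
z_cut = normal_score(HEAD, N_TITLES)
sigma = (math.log(power[0]) - math.log(power[HEAD - 1])) / (z_top - z_cut)
mu = math.log(power[0]) - sigma * z_top
lognormal = [math.exp(mu + sigma * normal_score(rank, N_TITLES))
             for rank in range(1, N_TITLES + 1)]

# Total value of everything beyond the observed head, under each assumption.
power_tail = sum(power[HEAD:])
lognormal_tail = sum(lognormal[HEAD:])
print(f"tail value if powerlaw:  {power_tail:,.0f}")
print(f"tail value if lognormal: {lognormal_tail:,.0f}")
```

With these made-up inputs the lognormal tail comes out well below the powerlaw tail, even though the two curves agree at rank 1 and rank 300. The size of the gap depends entirely on the parameters chosen, so treat it as an illustration of the risk rather than an estimate for any real archive.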
UPDATE: See related discussion: Jakob Nielsen analyzes the observed fall-off in web traffic on one site here. Chris Edwards gives another perspective on that here. And the always astute Nick Carr is following the conversations and adds commentary here.
In the academic world, there are good posts here ("how not to fit a straight line") and here (powerlaws versus lognormals in web tagging). Meanwhile, the two must-read papers in this domain are:
Mitzenmacher, M. (2003). "A brief history of generative models for power law and lognormal distributions". Internet Mathematics 1: 226–251.
Limpert, E., Stahel, W. and Abbt, M. (2001). "Log-normal Distributions across the Sciences: Keys and Clues". BioScience 51 (5): 341–352.












