The Japanese edition of The Long Tail was released today, with the canonical powerlaw front and center. It's been interesting to see how various publishers have dealt with the whole book-named-after-a-curve thing. In the US, Hyperion decided that emphasizing the statistical roots of the book would be off-putting to the mainstream audience they were targeting, so they went the Tipping Point route with a simple, somewhat abstract image meant largely to give subject-matter context for the title, subhead and other text that's actually doing the work of selling the book. The curve, meanwhile, was relegated to the inside back flap. Whatever you might think about that decision, it worked.
In the UK publishers tend to be a bit more playful with their covers. Here's what Random House UK did with the book (is that a donkey tail? I'll try not to read too much into it...).
[UPDATE] Here's the Brazilian version. Not sure what those arrow things are, but I think it's a graceful and modern design, albeit a bit generic. (Thanks to Luigui Moterani for the heads up that it was out.)
[UPDATE 2] Here's the Taiwanese version. Typography aside, I quite like the integration of the curve into the cover, which evokes the powerlaw without feeling too math-y. Perhaps a path worth considering for the paperback edition in the US?
The book is now being translated into nearly two dozen other languages, so we'll get to see how more countries and cultures deal with the question of how to express a complicated idea simply, and how much math they think their readers can handle.
As a side note, I'm sorry about having so many self-promotional posts in a row. I've been traveling 22 of the last 30 days (I've just arrived in NYC from Paris, where I keynoted the IDC IT Forum) and next month doesn't look much better. But I'll try to at least update my sadly neglected sidebar over the next week and turn to finishing some proper research-driven posts over the weekend.
Reader Marco Ganz writes to tell me that in the 1970s, Ferrari had a beautiful racing car called the Coda Lunga (Long Tail).
From the QV500 guide to classic racing cars:
[In the fall of 1969, Enzo Ferrari was working on improving the 512 model for the racing team led by Mario Andretti.] Modifications were made throughout the year, particularly to the aerodynamics. Other alterations included a switch from a louvred rear window to a clear plastic item, helmet bubbles of varying shapes and sizes and a periscope rear view mirror. However, unarguably the most striking development came at Le Mans where a Coda Lunga or Long Tail version was used. It featured special low drag bodywork to take advantage of the massive speeds attainable down the chicane-less Mulsanne Straight, the extended rear bodywork being adorned with vertical fins, a wraparound lip spoiler and an enveloping tail facia. No less than 11 512's were entered for the 24 Hour race in 1970, four of which were Coda Lunga's, both Ferrari and Porsche producing one-race specials for this event. Scuderia Ferrari, NART, Scuderia Filipinetti and Ecurie Francorchamps each had a Long Tail in their armoury, all but two of the 11 512's retiring, four being knocked out simultaneously in a freak multiple pile up. The two finishers were both Coda Lunga's and they came home in fourth (NART) and fifth (Ecurie Francorchamps) behind a trio of Porsche's. After Le Mans, development started on the lighter, more powerful 512 M for 1971.
How cool is it that the library would be the hot event space in Manhattan this season? Credit NYPL Live impresario Paul Holdengräber, who in his first season has made the majestic midtown landmark with the twin lions the place to be for speakers and audiences alike. This spring alone speakers have included John Updike, Arianna Huffington, Malcolm Gladwell, Bernard Henri Lévy, Tina Brown and Salman Rushdie. And now me and Lawrence Lessig.
Lessig and I are going to discuss the Long Tail's effect on blockbuster culture, from media fragmentation to Google's book projects. Even if you've heard me speak before, it's worth coming for Lessig, who is perhaps the smartest and most articulate thinker on digital economics on the planet. The event is on Thursday, September 28th at 7:00. Sparks are sure to fly.
Tickets are $15. Buy them here.
Great news. The Long Tail has been shortlisted for the Financial Times/Goldman Sachs Business Book of the Year award. Last year Thomas Friedman's The World is Flat won.
The mission is: "To identify the book that provides the most compelling and enjoyable insight into modern business issues, including management, finance and economics."
My competition is:
The winner will be announced at a gala dinner in NYC on Oct 26th.
Hollywood has been the last refuge of the hit, the one industry that seemed to be able to resist the force of gravity dragging down blockbusters everywhere else. This year's Pirates of the Caribbean: Dead Man's Chest, which set a box office record in its opening weekend, seemed to prove that the movie-making business still had the gift of mobilizing mass audiences. But no other film this year has done nearly as well, and when you look at the industry at large, the year so far isn't looking good at all. Total ticket sales (in tickets, not dollars) are down 9% from 2004, and if it weren't for Pirates, they'd be down 15%. The last refuge isn't looking so safe anymore.
Source: BoxOfficeMojo (YTD calculations with numbers of tickets require a subscription)
Over the last two years of doing Long Tail research, one of the biggest learning curves for me has been just getting re-used to proper analytical tools. It's been close to two decades since I last did any real science, and back then our computational tools pretty much consisted of pocket calculators and Fortran. I remember being seen as a bit of a physics lightweight for stooping to spreadsheets on occasion (this was the late 80s), and the graphing programs that we used to plot data took as long to understand as the data itself. Since then I've pretty much lived in Excel (which I actually use more than Word), and that's always been enough. Until now.
In early 2004, when I started this project, I was pretty much counting on the kindness of strangers to do most of the data-massaging work for me. I'd craft a data request and then negotiate with the appropriate company to have it generated by the clever elves in their technical division. But as time went on my requests got more ambitious and I started getting more and more raw data and massive data sets, often with millions of entries. That's when I realized that Excel wasn't going to do the trick anymore. For starters, it can only handle 65,536 rows (that's been raised to 1 million rows in Excel 2007, if you can bear the bugs that still exist in the beta version now available).
For files bigger than a spreadsheet can handle, you'll want a database. Because I already had Access, I used that. It's a graceless, over-complicated program, but it can at least handle huge files. I mostly used it to parse data sets and do very simple queries (sorts, sums, filters and the like) that would output files small enough to analyze in Excel. But the data sets kept getting more complicated, and the questions I was trying to answer were getting more sophisticated and thus outside of the usual tutorials. I was either going to have to learn how to program databases properly or find some other tool.
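If you'd rather not wrestle with Access, that parse-then-summarize step can also be sketched with Python's built-in sqlite3 module. What follows is a minimal sketch, not anything I actually ran: the sales.csv file and its title and units columns are hypothetical stand-ins for whatever raw data you've been handed. The idea is simply to load the big file into a throwaway table, run a sort-and-sum query, and write out something small enough for Excel.

    import csv
    import sqlite3

    # Load a large raw file into a throwaway in-memory SQLite database.
    # "sales.csv" and its columns (title, units) are hypothetical stand-ins.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (title TEXT, units INTEGER)")

    with open("sales.csv", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        conn.executemany(
            "INSERT INTO sales (title, units) VALUES (?, ?)",
            ((row["title"], int(row["units"])) for row in reader),
        )

    # A simple sort-and-sum query: total units per title, biggest sellers first.
    rows = conn.execute(
        "SELECT title, SUM(units) AS total FROM sales "
        "GROUP BY title ORDER BY total DESC"
    ).fetchall()

    # Write a summary file small enough to open in Excel for charting.
    with open("sales_by_title.csv", "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["title", "total_units"])
        writer.writerows(rows)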
There are plenty of dedicated statistics and scientific graphing programs out there, but all the ones I've found are limited in what they can do and relatively inflexible in their accepted data formats. In my case, the data sets ranged from search terms to subscriber-level ringtone records. Each was different and all needed a lot of scrubbing to put in a form that would allow the sort of calculations I was after, from powerlaw exponents to head/tail ratios.
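To make that less abstract, here's a minimal Python sketch (not the code I actually used) of the two calculations I just mentioned: it fits a powerlaw exponent by least squares on the log-log rank-sales plot and computes a simple head/tail split. Treating the top 10% of items as the "head" is just an illustrative assumption.

    import math

    def powerlaw_stats(sales, head_fraction=0.10):
        """Fit a powerlaw exponent to rank-ordered sales and compute a head/tail split.

        sales: per-item sales counts, in any order.
        head_fraction: share of top-ranked items treated as the "head"
        (10% here is purely an illustrative choice).
        """
        ranked = sorted((s for s in sales if s > 0), reverse=True)
        n = len(ranked)

        # Least-squares fit of log(sales) vs. log(rank); the slope is -exponent.
        xs = [math.log(rank) for rank in range(1, n + 1)]
        ys = [math.log(s) for s in ranked]
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        slope = (
            sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs)
        )
        exponent = -slope

        # Head/tail split: what share of total volume the top-ranked items capture.
        head_n = max(1, int(n * head_fraction))
        head_share = sum(ranked[:head_n]) / sum(ranked)
        return exponent, head_share

    # Toy example: 2,000 items whose sales roughly follow a 1/rank curve.
    demo = [round(10000 / r) for r in range(1, 2001)]
    exp_, head_ = powerlaw_stats(demo)
    print(f"exponent ~ {exp_:.2f}, top 10% of items = {head_:.0%} of sales")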
And so I ended up where I'd started twenty years ago: writing code. As far as I can tell, it's the only way to really have the ultimate flexibility to ask any question of any kind of data set. I know of course that real geeks use Perl for this (as in "so I just tossed off a few lines of Perl", as if it were as simple as an email), but to be honest I just don't have the time to learn a new programming language. So I did what no self-respecting geek would admit to: I used Basic.
In fairness, a lot has changed since the "GOTO 50" days. For one thing, Microsoft's Visual Basic is now nearly as powerful (and as difficult to learn) as C++. It's fully structured, object-oriented and net-aware. It's also a pig to use, with even the simplest programs requiring all sorts of structural overhead. I gave up on that pretty quickly and went looking for something simpler.
I found it in Liberty Basic. It's a proper structured programming language without the unnecessary complexity of languages trying to prove that they're "enterprise class". I know its string handling isn't nearly as good as Perl's, but everything else about it is so easy to use, from its editor to its debugger, that I can deal with writing my own text parsers and the like. And like most Basics these days, you can use it in interpreted or compiled mode, so you can create stand-alone programs if you want. It's not fast, but to be honest that's probably more due to my grossly unoptimized algorithms than it is the language itself.
If you've been dabbling in large-scale data analysis and running up against the limits of Excel, you might want to give Liberty a shot, at least for rough, proof-of-concept stuff. Here, for instance, is a program I wrote to rank the individual words used in the search terms of the notorious AOL dataset. (Others have written programs to rank the whole terms, but I wanted to look at it at word level, to compare with Zipf analyses of natural language). I had to use a tiny (50,000 terms, or about 250,000 words) subset of the whole set, and even that took nearly 40 minutes to run because I didn't bother to optimize my lookup code, but it was enough to get a usable answer.
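For anyone who just wants the gist, here's the approach re-sketched in Python (a re-sketch, not my original program): split each query into words, tally them in a dictionary, and sort by count. The queries.txt file name and the tab-separated layout with the query text in the second column are assumptions about how you've extracted your sample, not a description of the official dataset files.

    from collections import Counter

    # Rank individual words across a sample of search queries.
    # "queries.txt" and the tab-separated layout (query text in the second
    # column) are assumptions about how the sample was extracted.
    counts = Counter()
    with open("queries.txt", encoding="utf-8", errors="ignore") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue
            counts.update(fields[1].lower().split())

    # Print every word in rank order: rank, word, frequency.
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        print(rank, word, freq)

A hash-based tally like this avoids the kind of unoptimized lookup loop I blamed for that 40-minute run time.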
Now that I've relearned some programming, the next step will be to move up to a proper tool that really can handle data sets as big as AOL's. I've just bought the legendary Mathematica, which is a favorite on Wall Street and in universities alike, so that may well be the answer, although it's far from intuitive. I'll probably start by replicating my search word ranking program to see how much faster and more powerful Mathematica is. I can already tell this is going to take a while. Data analysis can get easier, but that still doesn't make it actually, you know, easy.
One of the great things about setting out to do proper research on a subject is that it tends to inspire others to do the same. So along with all the impressive business analysis of the Long Tail that we've been chronicling here, there's quite a bit of academic research that's starting to come out, too. Here's a round-up of some of the more interesting recent papers that have come to my attention. There are no doubt others (indeed, I know of a few coming out in journals that don't allow pre-publication posting), so I'll update this post or add another one as they cross my desk:
Anita Elberse and Felix Oberholzer-Gee (Harvard Business School)
Sample quote: "We study the distribution of revenues across products in the context of the U.S. home video industry for the 2000 to 2005 period. We find superstar and long-tail effects in home video sales, but each effect comes with a twist. There is a long-tail effect in that the number of titles that sell only a few copies every week increases almost twofold during our study period. But at the same time, the number of non-selling titles rises rapidly; it is now four times as high as in 2000."
Erik Brynjolfsson (MIT), Yu “Jeffrey” Hu (Purdue University) and Duncan Simester (MIT)
Sample quote: "The 80/20 rule has proved to describe the product sales distribution very well in a traditional business environment. However, the Internet seems to have changed this balance. By greatly lowering search costs, by creating virtually unlimited “shelf space” and by facilitating powerful recommender system, information technology in general and online markets in particular have the potential to radically increase the collective share of niche products, and flatten the sales distribution."
Erik Brynjolfsson (MIT), Yu “Jeffrey” Hu (Purdue University) and Michael D. Smith (Carnegie Mellon University)
Sample quote: "The study used a data set collected from a medium-sized retailing company that sells the same assortment of clothing through a catalog and an Internet website. We found that, Internet customers were much more likely to by niche products. Interesting, even after controlling for customer selection bias between the two channels by focusing only on those customers who used both channels, product sales were still significantly more evenly distributed on the Internet than through the catalog channel. The more even product distribution online is consistent with the theory that lower search cost through the Internet channel, caused by Internet search, browsing, and recommendation tools, can increase the collective share of niche and obscure products, leading to a more even product sales distribution online."
Jure Leskovec (Carnegie Mellon University), Lada A. Adamic (University of Michigan) and Bernardo A. Huberman (HP Labs)
Sample quote: "We analyze how user behavior varies within user communities defined by a recommendation network. Product purchases follow a ’long tail’ where a significant share of purchases belongs to rarely sold items. We establish how the recommendation network grows over time and how effective it is from the viewpoint of the sender and receiver of the recommendations. While on average recommendations are not very effective at inducing purchases and do not spread very far, we present a model that successfully identifies communities, product and pricing categories for which viral marketing seems to be very effective"
Paul Caron (University of Cincinnati)
Sample quote: "[T]he many hundreds or thousands of law review articles with only a
few readers each may cumulatively have many readers—the proverbial “long tail.” And this “long tail” is important—because it signals the importance of microaudiences and microcommunities of scholars."
For students of powerlaws, George Zipf's 1949 observation that the frequency of words used in the English language followed a powerlaw distribution (James Joyce's Ulysses was one of the first test cases) is a profound thing. It not only defined Zipf's Law, which gives a simple rule to explain why some words are far more commonly used than others, but also suggests that there's something elemental in powerlaws.
Several generations of linguists have been taught that network effects--specifically "preferential attachment" between words, which explains why some are more commonly paired and thus appear more often than others--account for that powerlaw shape and its intrinsic inequality. This is a big deal in everything from complexity theory to economics, and is key to understanding when and where powerlaws emerge, which is a big part of the ongoing Long Tail research.
But while exploring this, I came across the work of Wentian Li, a remarkable polymath who studies genetics, statistics and powerlaws. Li noticed that Zipf's Law doesn't just apply to English. It applies to any language. Not only that, but it actually applies to any random collection of characters, such as those that might be typed by monkeys banging on keyboards. Forget "preferential attachment" or any other theory of network effects within human languages. Li showed that Zipf's Law is just a statistical artifact of ranking and randomness:
To summarize in one sentence, the Zipf's law in "monkey languages" is caused by an exponential transformation of variables, and if the first variable follows an exponential distribution, the second would follow a power-law distribution. To be more specific, the first variable is the word's length, and the second variable is the word's rank.
In other words, what Li found is that if you just bang on a keyboard generating random letters, with a space (which one assumes is no more likely than any other character) defining the end of each "word", you'll find that the length of the words falls off in an exponential distribution. And if you rank those "words" in terms of their frequency in random use, you'll get a powerlaw. Not because of any linguistic quality of language, but simply because that's what the gods of math dictate.
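Roughly, the back-of-the-envelope version of Li's argument goes like this (assuming 26 letters plus a space, each typed with equal probability 1/27). Any particular "word" of length L--those L specific letters followed by a space--appears with probability (1/27)^(L+1), so frequency falls off exponentially with length. But there are 26^L possible words of length L, so a word's rank grows roughly like 26^L, which means L grows like log(rank)/log(26). Put the two together and you get

    frequency ≈ rank^(-log 27 / log 26) ≈ rank^(-1.01)

which is a powerlaw with an exponent almost exactly the 1 that Zipf measured in real English.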
I couldn't quite get my head around this, so I wrote a few lines of code to replicate the experiment. And by golly Li's right. Monkey Text looks like this:
And when you plot the "word" length (for 1000 words in this case) you get a probability distribution that looks like this:
In my experiment, the average word length was 26 characters (corrected from an earlier figure of 5.7; see comments), but the single most common word length was one character. Because there are only 26 one-character "words", those are the most common when you rank word frequency, and that's why when you treat monkey language like a real language you get the same powerlaw Zipf observed in Joyce. Just as the short words "A", "I", "The" and "An" are common in English, so are "C", "Z", "Y" and the other single letters common in Monkey. Nothing to do with meaning; everything to do with math. Freaky.
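If you want to replicate it yourself, here's a minimal sketch of the experiment in Python (a re-sketch, not my original code): generate random characters from a 27-symbol alphabet, split on spaces, then look at both the word-length distribution and the rank-frequency curve.

    import random
    from collections import Counter

    random.seed(1)  # any seed; it just makes the run repeatable

    # The "monkey at a keyboard": 26 letters plus a space, all equally likely.
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    text = "".join(random.choice(alphabet) for _ in range(200000))
    words = [w for w in text.split(" ") if w]  # drop empties from runs of spaces

    # Word-length distribution: falls off roughly exponentially with length.
    length_counts = Counter(len(w) for w in words)
    for length in sorted(length_counts):
        print(length, length_counts[length])

    # Rank-frequency: rank every distinct "word" by how often it appears.
    # On log-log axes this traces a powerlaw, with the 26 single-letter
    # "words" at the top of the ranking, just like "a" and "I" in English.
    word_counts = Counter(words)
    for rank, (word, freq) in enumerate(word_counts.most_common(20), start=1):
        print(rank, word, freq)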
FREE was available in all digital forms--ebook, web book, and audiobook--for free shortly after the hardcover was published on July 7th. The ebook and web book were free for a limited time and limited to certain geographic regions as determined by each national publisher; the unabridged MP3 audiobook (get zip file here) will remain free forever, available in all regions.
Order the hardcover now!