[Update: I've posted additional analysis from the MIT team here]

One of the most quoted statistics in my original article was the data point that 57% of Amazon’s book sales are in the Long Tail, defined as beyond the 100,000 books available in the typical Barnes and Noble superstore (we sometimes used 130,000, which is the inventory of its larger stores). We arrived at that number by working with the team at the MIT Sloan School that had produced one of the best economics papers on Amazon to date. Because Amazon, as a public company closely scrutinized by analysts looking for any hint of a forward-looking statement, does not reveal its sales figures, the MIT team had to attempt to reverse-engineer them from the information Amazon does provide, chiefly sales rank of individual titles.

The forensic economics process they used, which had been tried by others at smaller scale, was to buy a set number of books at various ranks and see how those purchases changed the sales rank. Based on a sufficient number of these absolute-sales-to-relative-rank conversions, they were able to calculate a sales curve. In that curve, based on 2001-2003 data, 39.2% of sales fell beyond the 100,000 rank position. We updated that using 2004 sales data from Amazon, and came up with the 57% figure used in the story.

Although all the other comparable data sets we had, such as Netflix and Rhapsody, showed Long Tail sales (sales of titles not carried by comparable bricks-and-mortar stores) in the 20%-25% range, the MIT methodology seemed solid and Amazon wasn't commenting one way or another. So we went with it. It was the best data available.

After publication the number rightly raised eyebrows. First, it was well above what other retailers in the book market were experiencing. Second, we were told that Amazon’s sales rank algorithms were, well, notoriously funky; indeed, they apparently showed hysteresis at certain points, jumping backward thousands of rank steps when certain sales thresholds were crossed. Finally, Amazon sources suggested that the MIT methodology may have undercounted Head (top 100) sales, leading to an overcount of the Tail.

Subsequently, I had the chance to speak with Jeff Bezos about the figure, and although he didn’t know the exact number, he also thought ours was very high. He guessed it was closer to the 20% range, in line with the other examples. In October, I posted on the debate here.

For the book, it was clearly important to get this figure right, or at least righter. Amazon still isn’t releasing the hard numbers, but we do at least have a bit more experience at reverse-engineering them. A number of other academics have taken a stab at it since then, as well as some independent experts.

I asked one of them, Morris Rosenthal of Foner Books, to extend the analysis he’s done over the past two years on correlating Amazon sales rank to absolute sales and apply that approach to the Long Tail. Fortunately, he took to the project with a vengeance and has built the beginning of what may be the best analytical framework for estimating Amazon’s sales yet.

I’ll post his working methodology and conclusions here in hopes that others will help us improve it. Rosenthal and I are doing this because we recognize that this method, despite the work and experience that has gone into it, is still imperfect and its conclusions may be off base. So we’re looking for feedback and advice.

Like the MIT team, Rosenthal works by buying books and watching how that changes rank. The difference is that Amazon has revamped its ranking methodology since last year, and we now have a few other datapoints to cross-check the conclusions with. I’ve shortened his explanation in the following, but the entire thing is in this file for those interested.

The bottom line: his research with aggressive assumptions puts the
Long Tail (titles beyond the top 100,000) at 36% of Amazon’s book
sales. Conservative assumptions, meanwhile, put it under 20%.
Cross-checking it against Amazon’s book revenues seems to suggest
something in the mid-to-high twenties. In any case, it’s certainly
less than 57% and even 39%. **But the Long Tail still appears to be
somewhere between a quarter and a third of Amazon’s book business,
which is a significant fraction by any measure.**

*Over to Rosenthal:*

The basic sales rate assumptions come from over 1,000 data points for a collection of books that I hand gathered last November to update my rank equivalency graph at www.fonerbooks.com/surfing.htm, some including artificial buys.

Under the new system, two sales of any title, independent of whether it's ever sold before, will propel it into the top 50,000 books for a few hours. The exact rank and the length of time it stays there depend on the day of the week, the season, etc. The decay rate is fastest in the first 24 hours after the buys cease, with the rank dropping anywhere from 100,000 to 175,000 places in that time, again depending on day and season. This is a little tougher to determine than you might expect due to frequent and frustrating freezes in the overall ranking system. After the initial jolt, a bit of historical weight is introduced. A title that sells very rarely (never) will drop 100,000 the next day, 400,000 over the course of the week, another 200,000 the next week, and 150,000 a week for a couple weeks after that. With no more known sales in the interim, it stands around 2,000,000 today, eight months later.

By the same token, for another infrequent seller, but one that had sold at least 20 copies through real and artificial buys in a couple years of Amazon life, the initial decay rate is about 75,000 in the first 24 hours, then 30,000 a day for a couple days, then 20,000 a day for a few weeks. When it gets to the range between 800,000 and 1,000,000, where it would have lived under the old system, the stability gets a little erratic and it may actually improve on a given day. However, as near as I can tell, it will continue slowly dropping every time a new title from further down the tail sells after it does, but the probability of that happening drops rapidly.

A few quick conclusions can be drawn from this, though they haven't been fully tested :-)

1) Amazon has sold approximately 2,000,000 unique titles in the last eight months.

2) Amazon sells somewhere between 150,000 and 200,000 unique titles on any given day. The top 30,000 titles average over 1 copy a day.

3) Of the top 100,000, we can estimate that the 70,000 beyond rank 30,000 also sell only one copy that day. Based on a straight line log-log graph, I'd estimate that the 20,000 positions between 10,000 and 30,000 account for 28,000 sales. So we’re up to 98,000 sales on the body, vs. 100,000 on the long tail, with the top 10,000 to go.

4) The ranks between 1,000 and 10,000 are selling a couple copies a day; my latest graph estimated around 11 copies a day at the 1,000 rank. I regraphed it all the way from 10,000 to 1 on log-log with another straight line approximation. I arrive at 36,000 copies for the next 9,000 titles. That brings the body up to 134,000 vs. 100,000 for the long tail.

5) Finally, we have the top 1,000 books to deal with. These are books selling at least 11 copies a day. This time I extended the straight line out rather than setting the top title to 1,000 copies a day, and got the top at 2,100 copies a day, still an obvious underestimation. We get a little over 8,000 sales for the top 10 books, reading the trailing graph line. Between 10 and 100, we're talking about 90 titles ranging from 220 copies a day down to 50, or another 10,000 sales. The final bracket, from 100 to 1,000, sees sales ranging from over 50 a day down to 11 a day, or another 24,000. That gives us about 42,000 for the top 1,000 books.

6) So, for a given day, the "body" sells 176,000 books, and the long tail 100,000, or about 36% for the long tail, using the 100,000 break point.

7) Now comes a checksum. 276,000 books a day equals 101 million books a year. Amazon's North American media sales on the year will be a little under $3.0 billion, and we can attribute about $2.0 billion of that to books based on the books-to-other-media ratio in this press release.
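
The tally in the steps above is simple enough to re-run. What follows is a minimal sketch of that arithmetic in Python, using only the bracket totals quoted in the list. The bracket labels and variable names are illustrative, and the tail figure assumes the upper estimate of 200,000 unique titles a day, with every title beyond rank 100,000 selling one copy.

```python
# Daily unit sales by rank bracket, copied from the numbered steps above.
# Bracket labels and variable names are illustrative only.
body_brackets = {
    "top 10": 8_000,
    "11 to 100": 10_000,
    "101 to 1,000": 24_000,
    "1,001 to 10,000": 36_000,
    "10,001 to 30,000": 28_000,
    "30,001 to 100,000": 70_000,
}

# Assumption: 200,000 unique titles a day (the upper estimate), so 100,000
# titles fall beyond rank 100,000, each selling a single copy.
tail_units = 200_000 - 100_000

body_units = sum(body_brackets.values())
total_units = body_units + tail_units
tail_share = tail_units / total_units
annual_units = total_units * 365

print(f"body: {body_units:,} copies a day")
print(f"tail: {tail_units:,} copies a day")
print(f"tail share: {tail_share:.1%}")
print(f"annual volume: {annual_units / 1e6:.0f} million copies")
```

Lowering `tail_units` toward the 150,000-uniques end of the range shrinks the tail share, which is the sensitivity to keep in mind when reading the 36% figure.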

Despite the huge importance of used sales to Amazon's bottom line, if I understand their annual reports correctly, they only include the net from these sales in their North American sales number. If they do 25 million used book transactions (a guesstimate; it might be a little higher, since books are more likely to be used items) and net a couple of dollars per transaction (which may be high given the number of Z-shops and auction sellers), it doesn't make a dent worth mentioning in the $2 billion of gross sales for books. If we declared the average selling price of a book on Amazon to be $20, we could call it a perfect match and go home. My latest research shows top 100 titles average $15, but further out the curve they average $25 (but with a higher availability of cheap, used titles), so the $20 average selling price may not be a bad approximation.
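
The price cross-check reduces to a single division. Here is a rough sketch, assuming the $2.0 billion book revenue estimate, the 276,000 copies a day from the tally, and the guessed 25 million used transactions netting about $2 each. All figures come from the text; the variable names are illustrative.

```python
# Figures quoted in the text; variable names are illustrative only.
daily_units = 276_000            # estimated copies sold per day
book_revenue = 2.0e9             # estimated annual book revenue, in dollars

annual_units = daily_units * 365
implied_avg_price = book_revenue / annual_units

# Used-book guesstimate: 25 million transactions netting about $2 each.
used_transactions = 25_000_000
net_per_used_sale = 2.0
used_net_revenue = used_transactions * net_per_used_sale
used_share = used_net_revenue / book_revenue

print(f"annual copies: {annual_units / 1e6:.1f} million")
print(f"implied average price: ${implied_avg_price:.2f}")
print(f"used-sale net as a share of book revenue: {used_share:.1%}")
```

The implied average price lands just under $20, and the used-sale net is a low single-digit percentage of the book total, which is why it is treated as negligible here.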

That said, it's a bit of a scary-good match, so I'll have to go back and look at my methodology to make sure I'm not abusing the log-log technique or the like. Also keep in mind that the 200,000 unique titles a day is probably high, which increases the contribution of the long tail. Without inside information from Amazon, it's impossible to say for sure if the orphan book decay rate is really fixed by new titles selling past it. In the short-term vs. mid-term discussion, the 200,000 uniques a day is an important factor, and I'll look at it some more. Even if Amazon does 200,000 uniques a day, but only 400,000 uniques a week and 500,000 a month, etc., the break point would keep books on the long tail that intuitively belong in the head for selling multiple copies a week.
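
One way to double-check a straight-line log-log estimate is to fit a power law through two anchor points and sum it rank by rank, rather than reading totals off a graph. The sketch below does that for the bracket from rank 1,000 to rank 10,000. The 11 copies a day at rank 1,000 comes from the text. The 2 copies a day at rank 10,000 is an assumption standing in for "a couple copies a day", and the function names are hypothetical.

```python
import math


def loglog_line(r1, s1, r2, s2):
    """Return (a, b) such that sales = a * rank**b passes through both points."""
    b = math.log(s2 / s1) / math.log(r2 / r1)
    a = s1 / r1 ** b
    return a, b


def bracket_sales(a, b, first_rank, last_rank):
    """Sum estimated daily sales over every rank from first_rank to last_rank."""
    return sum(a * rank ** b for rank in range(first_rank, last_rank + 1))


# Anchor at rank 1,000 is from the text; anchor at rank 10,000 is an assumption.
a, b = loglog_line(1_000, 11.0, 10_000, 2.0)
estimate = bracket_sales(a, b, 1_001, 10_000)

print(f"slope on the log-log plot: {b:.2f}")
print(f"ranks 1,001 to 10,000: about {estimate:,.0f} copies a day")
```

Under those assumptions the sum comes out in the mid-30,000s, in the same neighborhood as the 36,000 read off the graph. A different choice for the rank-10,000 anchor moves the result noticeably, so this is a sanity check on the technique rather than a replacement for the graph-based figures.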