
September 06, 2006

Comments

Brandon

So the Long Tail distribution is merely a result of the difficulty of finding a person who will recommend a given product?

Patrick Hall

The particular letters that came out as the most common in your experiment ("C", "Z" and "Y") would not necessarily be the most frequent in a repeat run of Monkey generation; it's not quite right to say that Monkey "has" most frequent letters.

What it has (like any random string) is a consistent distribution of particular word lengths, not of particular words (or letters) themselves.

Chris Anderson

Patrick,

Sorry, I wasn't clear. All the single-letter "words" are equally common in monkey language; I just chose to give a few examples that aren't also words in English. I've now updated the post to make this clearer.

Chris

Chris Anderson

Brandon,

No, that's not a conclusion one can draw from Li's work. Most powerlaws *are* probably due to network effects, so recommendations or other word-of-mouth would indeed explain those curves. But language, despite often being given as an example of this, may not actually fit the model. Because of the definition of "word" as a string of characters ended by a space (which comes up 1/27th of the time in Monkey Text, and more often in human languages due to our aversion to long words), variety is constrained by an exponentially falling-off function. And that's what leads to the powerlaw observed in word use. No network effects required.

A fair question is in what other categories variety is similarly constrained, since that could explain powerlaws there, too. I suspect that may be a future research project.
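
For anyone who wants to reproduce the Monkey Text experiment, here is a minimal sketch in Python (the sample size and random seed are arbitrary choices, not the ones behind the original post):

    import random
    from collections import Counter

    random.seed(1)  # arbitrary seed, just so the run is repeatable
    ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus the space bar

    # Let the monkey type 100,000 random keystrokes, then split on spaces.
    text = "".join(random.choice(ALPHABET) for _ in range(100_000))
    words = [w for w in text.split(" ") if w]  # discard empty "words"

    lengths = Counter(len(w) for w in words)
    for n in range(1, 9):
        observed = lengths[n] / len(words)
        predicted = (26 / 27) ** (n - 1) * (1 / 27)  # exponential fall-off
        print(f"length {n}: observed {observed:.4f}  predicted {predicted:.4f}")

Each extra letter of word length multiplies the predicted frequency by 26/27, which is exactly the exponentially falling-off constraint on variety described above.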

Craig Walker

Here's something to consider: the power law shown in the Monkey Text is a relationship between word length and word frequency. But that's not the case with natural language; Zipf's law talks about the relationship between a word's frequency and its rank among all the words.

For example, in Monkey Text, "O" would occur at the same frequency as "A" and "I", because they're all the same length. However, in English, "O" is pretty infrequent (it's just barely a word at all) and will probably be well into the long tail of the language. Similarly, (I'm guessing that) "please" (a 5-letter word) would be more common than "hew" (a 3-letter word).

Thus, for English at least, meaning is important after all.

Craig Walker

Er, "please" is a 6-letter word. Insert appropriate 4-letter word here. :-P

Kevin Hillstrom

I have found that variations of Zipf's law apply very well to my field (Database Marketing) --- when I worked at Lands' End and Eddie Bauer, I could explain the dropoff in performance of the customer segments we mailed catalogs to by using variations of Zipf's law.

Neat!

Keshava

Seems like there should be a second, reinforcing effect at play here. Not only would shorter words be more common in a "real" language, but wouldn't most societies choose to use shorter words for concepts that arose more often? I have a hard time believing that any society would ever develop the word "Scormioligiocalsit" to represent the concept of "me". There are obvious exceptions, but it seems that generally we use shorter words for more common (or more general) terms, with longer words for more precise concepts, which as such are less frequent anyway.

Mark Harrison

> Just as the short words "A", "I", "The" and "An"
> are common in English, so are "C", "Z" , "Y" and
> the other single letters common in Monkey.

Similar effect, different causes, surely.

The set of 1-letter words in Monkey would be expected to comprise about 3.7% of the word sample size (because there is a 1/27 chance that the "second" letter will be SPACE, thus rendering the word size equal to one.)
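
A quick check of that figure, as a sketch assuming empty "words" produced by consecutive spaces are thrown away:

    # A non-empty word has length exactly one when the keystroke after
    # its first letter is the space bar, which happens 1 time in 27.
    print(f"{1 / 27:.1%}")  # -> 3.7%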

In most languages, the bulk of linguistic development has been verbal, with written text (and in particular, written text with commonly accepted spelling) a relatively recent phenomenon [1]... so it's the single-syllable nature that should be studied, surely?

It strikes me as plausible that the "most commonly-needed to be expressed" concepts would have been the first to evolve in language, and at a time when proto-humans were capable of a smaller range of sounds. The ability to distinguish quickly between "food" and "run" is probably a better survival trait for a hunter-gatherer than the ability to distinguish between "vermillion" and "scarlet"!

[1] My wife's grandfather was an officer in the British army both as a young man in the 1st World War, and called back from retirement in the 2nd World War. His comment that "one big difference was that in the second world war, the enlisted men could read and write" has long stuck with me. I had not realised that mass literacy was an event that happened in the lifetime of people I knew!

David

> I had not realised that mass literacy was an event
> that happened in the lifetime of people I knew!

It may be disappearing again within our lifetime too :)

Chris Anderson

Mark,

I hadn't actually done the arithmetic, but I'm pleased to note that in my sample of 1,000 words it just so happened that 37 (3.7%) were single characters--exactly as predicted. Not all my models work so neatly!

Chris

Kevin Kelly

Not exactly related to Zipf's law is the curious distribution of numbers, or rather single-digit integers: 1, 2, 3 ... 9, 0. Since they are one digit you would expect them all to appear with equal frequency, but for some reason the number one is more common, apparently (if my memory is right) even in randomly generated numbers. It's an example of a pattern buried deep in "just the math."
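
What Kevin is describing sounds like Benford's law, which governs the leading digits of quantities that span several orders of magnitude. A minimal sketch, using the powers of 2 as the classic example:

    import math
    from collections import Counter

    # Leading digits of the first 1,000 powers of 2
    leading = Counter(str(2 ** k)[0] for k in range(1, 1001))

    for d in range(1, 10):
        observed = leading[str(d)] / 1000
        benford = math.log10(1 + 1 / d)  # Benford's predicted frequency
        print(f"digit {d}: observed {observed:.3f}  Benford {benford:.3f}")

The digit 1 leads roughly 30% of the time, against the 11% that a uniform choice of digits would give.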

Bertil

Chris,

When typing "monkey", a word of length n has probability

f(n) = A * (1/27)^n

where A is just a fitting constant for the density function f.

A Zipf distribution is more like:

f(n) = B * (1/n)^gamma
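
To make the difference concrete, here is a minimal sketch (the constants A, B and gamma are arbitrary placeholders): the exponential loses a fixed amount of log-frequency per unit of n, while the power law loses a fixed amount per doubling of n.

    import math

    A, B, gamma = 1.0, 1.0, 1.0  # arbitrary constants; only the shapes matter

    for n in [1, 2, 4, 8, 16, 32]:
        monkey = A * (1 / 27) ** n   # exponential: a line on semi-log axes
        zipf = B * (1 / n) ** gamma  # power law: a line on log-log axes
        print(f"n={n:2d}  log10 monkey = {math.log10(monkey):7.2f}"
              f"  log10 zipf = {math.log10(zipf):6.2f}")

By n = 32 the exponential sits more than 40 orders of magnitude below the power law, which is why the two are easy to tell apart on the right axes.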


> Most powerlaws *are* probably due to network effects,

I should have my second paper on this out over the weekend.

Bertil

Li's comment insists on the importance of transformations in the emergence of Zipfian distributions; rules like "this word is long and commonly used, let's shorten it":

weblog -> blog

Or "this word is short and commonly used, let's lengthen it" (a less frequent rule -- I couldn't find an English example; this one is French for "today", literally "the day of today"):

hui -> aujourd'hui

Rob

Um. I repeated the experiment, making words out of 26 characters (+1 space), so 27 items to choose from. I pick an item until I reach a space, store the word length, and start over, repeating this for 100,000 item-picks.

The distribution of the word lengths is the same, but I get an average word length of 26 (as might be expected from the math). Where did you get the 5.7?

John

In probability theory and statistics, the geometric distribution is either of two discrete probability distributions - the one that concerns us here is the probability distribution of the number X of failures before the first success in a sequence of Bernoulli trials, where the probability of success on a single trial is p.

If we take a 27 letter alphabet, then the probability that a character selected at random is the space bar = 1/27.

To determine expected word length, let us define a 'trial' as the act of hitting a random key on our 27-letter keyboard, and a 'success' as hitting the space key.

For a geometric distribution, the probability that there are n fails before the first success is

P(n) = (1-p)^n*p.

So the probability that we get a word of zero length (i.e., just the space character) is 1/27.

The probability that we get a word length of X is 26/27 times the probability that we get a word length of X-1.

So, one would expect your most common non-empty word length to be 1 character. How did you get an average word length of 5.7?

And, why is this all so mysterious? And, what's it all got to do with the Long Tail?

I am missing something, perhaps...
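
Putting Rob's experiment and John's geometric distribution together in a short sketch (seed and sample size are arbitrary), the analytic mean and the simulated average both come out at about 26:

    import random

    random.seed(1)  # arbitrary seed
    p = 1 / 27  # probability that any given keystroke is the space bar

    # Analytic mean of the geometric distribution: failures before success.
    print("expected word length:", (1 - p) / p)  # -> 26.0

    # Monte Carlo check over 100,000 monkey-typed words.
    def word_length():
        n = 0
        while random.random() > p:  # type letters until the space comes up
            n += 1
        return n

    print(sum(word_length() for _ in range(100_000)) / 100_000)  # ~26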

Chris Anderson

Rob, John,

Thanks for catching my stupid error. I accidentally took the average of the word-length *frequencies* (so all those 130-letter and other big word lengths with low frequency were driving down my average). Of course when I did it properly I got the same result you did.

As for what this has to do with the Long Tail, it has to do with the forces that cause powerlaws to emerge in the first place. The usual explanation is network effects ("preferential attachment" within the system that leads to some entities being more "popular" than others, similar to the effect of word of mouth in culture). Zipf's Law has usually been explained this way, too. What's interesting about this is that you don't require network effects to explain it in language, contrary to what the conventional wisdom in linguistics says. So this is at least one powerlaw that can be explained by statistics alone, rather than network effects.

It's worth noting that language is a special case in which the length of words is constrained by the scarcity function of there being only 26 characters and one space, with a 1/27 chance that the next character will end the word. Obviously many of the markets we're considering have no such scarcity function, and thus variety would not necessarily decline exponentially as it does with language. In those cases, perhaps network effects are indeed the explanation for the observed powerlaws.

Chris

Don Draper

"What interesting about this is that you don't require network effects to explain this in languages, contrary to what the convential wisdom in linguistics says."

I don't see this at all, but maybe I'm still missing something. The "monkey language" thing is a simple formula relating the lengths of "words" (random strings of characters, not the language concept) to their distribution. The experiment has nothing to do with language: the results are purely driven by the frequency of the end-of-word/success marker. The fact that letters and spaces are used helps the imagination, but it would be the same experiment if it involved shapes or colors or anything else with 27 items, one of which marks a new starting point.

If you wanted to make this claim about language, you would have to show this distribution with real words. Of course, Zipf did that for English and showed different results (words distributed not by length but by frequency, with network effects invoked to explain that frequency). Even for that, I'd rather see it done with real words (morpheme != grapheme) and multiple languages (especially, for example, fusional vs. agglutinative) before declaring any universals.

I thought Li's remarks were meant to show that this kind of distribution could be predictable based on the rule that determines frequency: the "if first variable follows an exponential distribution" bit. Probably the term "Monkey Languages" for the experiment is too tempting, since it sounds related to actual languages.

Chris Anderson

Don Draper,

You're right that I should have given more historical and academic context for this. If you Google around you can find lots of discussion on how word pairings and other network effects in language could be responsible for Zipf's law. Here's one example.

As I've said, I agree with you that such explanations are not necessary because statistics alone explain the observation. But that hasn't stopped lots of linguists from trying to apply complexity and network theory to this all the same. Hence my post.

Chris

john

Hmmmm.... methinks you doth protest too much! My understanding of the 'Long Tail' - as you describe it - is that the emergence of the internet creates potentially huge opportunities for businesses that develop the savvy to provide products & services to global, but increasingly specialised market spaces. In many ways, these businesses need to ignore traditional (geographic/cultural) marketspaces & focus more on product marketspaces. Long tails abound when businesses change their spectacles!

In all of this I agree with you - and I admire the way you have developed these ideas in your blog.

But!!! Let's not get carried away here! In the discrete world, the geometric, binomial, or Poisson distributions reflect real-world behaviour with pinpoint accuracy. In the continuous world, there exist a multitude of phenomena which can be approximated by the above distributions, or simply by power series expansions. (Think of the curve for the charging/discharging of an inductor or capacitor, for example. The model here is to use e^x appropriately; e^x has a very simple power series expansion, with the added bonus that, because of the factorial in the denominator, you get a very good approximation very quickly.)
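
A minimal illustration of that convergence (x = 1 is an arbitrary test point):

    import math

    # Partial sums of e^x = sum of x^k / k! converge quickly, thanks to
    # the factorial in the denominator.
    x = 1.0
    total, term = 0.0, 1.0  # term starts at x^0 / 0! = 1
    for k in range(10):
        total += term
        term *= x / (k + 1)  # next term: x^(k+1) / (k+1)!

    print(total, math.exp(x))  # ten terms agree to about 7 decimal places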

My point? The world is full of natural phenomena which have geometric series or geometric-series-like properties. Nothing at all to do with networking effects - it's just the way these things work.

I'm not a linguist but I have a hunch that at least some linguistic phenomena (word frequency, word length, etc) have geometric or exponential behaviours.

I think you run the risk of diluting your original (and valuable) thesis if you try to include the above types of phenomena as examples of long-tailism.

OTOH, I would be fascinated (genuinely, I don't mean to be patronising here) if you could demonstrate that some of these phenomena did indeed have long-tail behaviours which were affected by human cultures, interactions, or the emergence of global communications & distribution capacity.

This has turned out to be much longer & more pedantic than I intended - please forgive me, I am too tired now to put it through the email diplomacy filter...

Best Regards,

John

Bertil

The fact that something can be independent of a network doesn't imply it has to be; consider the sigmoidal adoption curve, or technological lock-in through increasing returns: both can be explained without any networks---but both are much more interesting when considered in a reticular structure with more properties than fictional random ties or a square lattice.

Chris,
You didn't reply to my remark about your Monkey language not being a power law: I went through your theory and, in both cases, the tail is rather long, but only the Zipfian fits your findings; the dynamic processes that produce each are quite different too, however independent of a network both are. Most graphs that I could find were not clear on whether the ordinate axis was logarithmic or not. Since Monkey language fits a line with a linear abscissa and logarithmic ordinate, while a Zipfian fits a line with both logarithmic, that could explain the confusion.

printer cartridges

The Gaussian distribution is supported by the central limit theorem, and has to do with adding random variables rather than multiplying them. There are also the “stable distributions”, which can turn up if some of the assumptions of the CLT fail to hold.
