« Academic research on the Long Tail | Main | Hollywood: The Year of Not Enough Blockbusters »

September 16, 2006

When spreadsheets aren't enough

Over the last two years of doing Long Tail research, one of the biggest learning curves for me has been just getting re-used to proper analytical tools. It's been close to two decades since I last did any real science, and back then our computational tools pretty much consisted of pocket calculators and Fortran. I remember being seen as a bit of a physics lightweight for stooping to spreadsheets on occasion (this was the late 80s), and the graphing programs that we used to plot data took as long to understand as the data itself. Since then I've pretty much lived in Excel (which I actually use more than Word), and that's always been enough. Until now.

In early 2004, when I stated this project, I was pretty much counting on the kindness of strangers to do most of the data-massaging work for me. I'd craft a data request and then negotiate with the appropriate company to have it generated by the clever elves in their technical division. But as time went on my requests got more ambitious and I started getting more and more raw data and massive data sets, often with millions of entries. That's when I realized that Excel wasn't going to do the trick anymore. For starters, it can only handle 65,000 rows (that's been raised to 1 million rows in Excel 2007, if you can bear the bugs that still exist in the beta version now available).

For bigger files than a spreadsheet can handle, you'll want a database. Because I already had Access, I used that. It's a graceless, over-complicated program, but it can at least handle huge files. I mostly used it to parse data sets and do very simple queries (sorts, sums, filters and the like) that would output files that were small enough to analyze in Excel. But the data sets kept getting more complicated, and the questions I was trying to answer were getting more sophisticated and thus outside of the usual tutorials. I was either going to have to learn how to program databases properly or find some other tool.

There are plenty of dedicated statistics and scientific graphing programs out there, but all the ones I've found are limited in what they can do and relatively inflexible in their accepted data formats. In my case, the data sets ranged from search terms to subscriber-level ringtone records. Each was different and all needed a lot of scrubbing to put in a form that would allow the sort of calculations I was after, from powerlaw exponents to head/tail ratios.

And so I ended up where I'd started twenty years ago: writing code. As far as I can tell, it's the only way to really have the ultimate flexibility to ask any question of any kind of data set. I know of course that real geeks use Perl for this (as in "so I just tossed off a few lines of Perl", as if it were as simple as an email), but to be honest I just don't have the time to learn a new programming language. So I did what no self-respecting geek would admit to: I used Basic.

In fairness, a lot has changed since the "GOTO 50" days. For one thing, Microsoft's Visual Basic is now nearly as powerful (and as difficult to learn) as C++. It's fully structured, object-oriented and net-aware. It's also a pig to use, with even the most simple programs requiring all sorts of structure overhead. I gave up on that pretty quickly and went looking for something simpler.

I found it in Liberty Basic. It's a proper structured programming language without the unnecessary complexity of languages trying to prove that they're "enterprise class". I know its string handling isn't nearly as good as Perl, but everything else about it is so easy to use, from its editor to its debugger, that I can deal with writing my own text parsers and the like. And like most Basics these days, you can use it in interpreted or compiled mode, so you can create stand-alone programs if you want. It's not fast, but to be honest that's probably more due to my grossly unoptomized algorithms than it is the language itself.

If you've been dabbling in large-scale data analysis and running up against the limits of Excel, you might want to give Liberty a shot, at least for rough, proof-of-concept stuff. Here, for instance, is a program I wrote to rank the individual words used in the search terms of the notorious AOL dataset. (Others have written programs to rank the whole terms, but I wanted to look at it at word level, to compare with Zipf analyses of natural language). I had to use a tiny (50,000 terms, or about 250,000 words) subset of the whole set, and even that took nearly 40 minutes to run because I didn't bother to optimize my lookup code, but it was enough to get a usable answer.

Now that I've relearned some programming, the next step will be to move up to a proper tool that really can handle data sets as big as AOL's. I've just bought the legendary Mathematica, which is a favorite on Wall Street and in universities alike, so that may well be the answer, although it's far from intuitive. I'll probably start by replicating my search word ranking program to see how much faster and more powerful Mathematica is. I can already tell this is going to take a while. Data analysis can get easier, but that still doesn't make it actually, you know, easy.

TrackBack

TrackBack URL for this entry:
http://www.typepad.com/services/trackback/6a00d8341bfb6353ef00d834b1a5df53ef

Listed below are links to weblogs that reference When spreadsheets aren't enough:

Comments

Good luck with that. Just reading it gave me a brain-ache.

You really should consider using R or S-Plus. R and S-Plus are related statistical tools. R is open source and has a plethora of add-ins available through CRAN. S-Plus is commercial, prettier, and has some modules built in that R does not. The third alternative is SAS, which is cumbersome but blindingly fast for datasets that can't be coerced into memory.

Mathematica is great for symbolic computing, draws the prettiest graphs, and is lots of fun. MatLab is the heavy lifting tool of choice for numerical computing. I see MatLab on the desks of more quants than Mathematica and it has good statistical tools. However, I suspect you'd be better off with a stats package.

R -- http://www.r-project.org/
S-Plus -- http://www.insightful.com/products/splus/default.asp
MatLab -- http://www.mathworks.com/products/matlab/
SAS -- http://www.sas.com/technologies/analytics/statistics/insight/index.html

-r hayes

Many social science researchers use SPSS for data analysis.

I actually was fairly suprised none of your technical/statistical collaborators had already suggested R or S-Plus, as R Hayes did above. The tools you surveyed don't seem particularly suited for what it seems you need to do. Makes me wonder if you've been talking to the wrong crowd all along...

I recommend you check out http://www.tableausoftware.com/

They have an extremely flexible graph/visualization tool that queries the database directly rather than a spreadsheet. You can make nearly any type of data visualization you can think of fairly easily.

Is the raw data available? It was not clear from this posting. You can send URLs to my email. To be frank, it looks solvable with a few lines of APL, but I'd like to be able to demo that.

Mathematica is feels much more about symbolic math. And Matlab doesn't do well with missing/unknown values, which makes it more painful than necessary for statistical analyses.

I miss SAS terribly, but R is probably what you want to learn here.

Also hellos to Reilly

My statistician girlfriend swears by R. And I remember Mathematica from my undergraduate days being a pain in the ass. So I'm going to echo some of the above commenters.

Check out Edward Tufte's book, Visual Explanations. He has devoted his whole academic career at Yale to helping people present information. www.edwardtufte.com

I'll also go for R. Use the RODBC library to connect with databases, the 'merge' command is great, and it has more statistical tests than any person can handle. Besides the standard 'Introduction to R', I can also recommend http://www.math.csi.cuny.edu/Statistics/R/printable/simple-12-twoside-letter.pdf
And if you don't know how to do something, you can always look at the mailinglist http://tolstoy.newcastle.edu.au/R/

I think I mentioned this to you before, but JMP is the GUI version of SAS, and I think you'd like it as a middle step between spreadsheets and writing real code. It has pretty much every statistical test you can think of, and a very nice interactive mode where clicking on charts highlights the relevant rows and vice versa.

Not to make too strong a point, but that code is, well, ghastly, and Liberty looks to be somewhat, er, retro - like the 90's never happened :)

Without wanting to plug any specific language, I wrote a script in about 20 minutes that analysed 3.6m lines of search terms containing 8.9m words of which 589K were unique. 253K (I'm rounding as I go) occurred more than once.

Total execution time? 3 minutes, 40 seconds. About as long as it took to write this comment. And my scripting language of choice is by no means the fastest (hint: red gemstone).

For the record, my top 10:
of 113444
- 103649 (OK, I know it's not a real word!)
in 95224
the 85864
and 73232
for 72755
to 48050
free 41745
a 37089
google 34886

If any of this would be useful, just scream.

Since I place some value on an improved UI and since I used to use S back at Bellcore, I took a look at S-Plus at the site mentioned by R Hayes.

This website immediately scared me off the product: no price, which violates guideline #1 for B2B sites. Further violating usability guidelines, product info is given in PDF files that are slow and annoying to view.

I did give them the two minutes that users visit websites on average, but then I left. If the commercial product has so poor service on its website, then why pay when there's a free version.

I have tried about ten stats packages over the years, and they have all been bad in terms of usability. There were a few decent ones for the Mac back in the days when there was 3rd party software developed for that platform.

Please the above site out for an excellent data analysis tool


Have you tried Vilno? It's a new data crunching programming language at

www.my.opera.com/datahelper

I second the recommendation of Edward Tufte's site and books. There's a long thread on his message board about graphing software:
http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=00000p&topic_id=1&topic=Ask+E%2eT%2e

I've used Aabel on the Mac to supplant Excel.

I also agree with Nielsen's comments on the S-Plus site!

Here's a ruby script that does the same as your basic script:

#!/usr/local/bin/ruby
wordcnt = Hash.new(0)
while line = gets
  id,words,rest = line.split("\t")
  words.split(/\s+/).each do |word|
    wordcnt[word] += 1
  end
end
x = 0
wordcnt.sort{|a,b| b[1]<=>a[1]}.each { |e|
  break if x == 10
  x += 1
  puts "#{e[0]}\t#{e[1]}"
}
took less than 2 minutes to run across 3,558,412 lines

Verify your Comment

Previewing your Comment

This is only a preview. Your comment has not yet been posted.

Working...
Your comment could not be posted. Error type:
Your comment has been posted. Post another comment

The letters and numbers you entered did not match the image. Please try again.

As a final step before posting your comment, enter the letters and numbers you see in the image below. This prevents automated programs from posting comments.

Having trouble reading this image? View an alternate.

Working...

Post a comment

Tidbits

Search this site

The Long Tail by Chris Anderson

Notes and sources for the book

FREE will be available in all digital forms--ebook, web book, and audiobook--for free shortly after the hardcover is published on July 7th (exact dates will be announced here as each form is released). The ebook and web book will be free for a limited time, the unabridged audiobook will be available free forever.[Update: the first free versions have now been released.]

Order the hardcover now!