When spreadsheets aren't enough
Over the last two years of doing Long Tail research, one of the biggest learning curves for me has been just getting re-used to proper analytical tools. It's been close to two decades since I last did any real science, and back then our computational tools pretty much consisted of pocket calculators and Fortran. I remember being seen as a bit of a physics lightweight for stooping to spreadsheets on occasion (this was the late 80s), and the graphing programs that we used to plot data took as long to understand as the data itself. Since then I've pretty much lived in Excel (which I actually use more than Word), and that's always been enough. Until now.
In early 2004, when I stated this project, I was pretty much counting on the kindness of strangers to do most of the data-massaging work for me. I'd craft a data request and then negotiate with the appropriate company to have it generated by the clever elves in their technical division. But as time went on my requests got more ambitious and I started getting more and more raw data and massive data sets, often with millions of entries. That's when I realized that Excel wasn't going to do the trick anymore. For starters, it can only handle 65,000 rows (that's been raised to 1 million rows in Excel 2007, if you can bear the bugs that still exist in the beta version now available).
For bigger files than a spreadsheet can handle, you'll want a database. Because I already had Access, I used that. It's a graceless, over-complicated program, but it can at least handle huge files. I mostly used it to parse data sets and do very simple queries (sorts, sums, filters and the like) that would output files that were small enough to analyze in Excel. But the data sets kept getting more complicated, and the questions I was trying to answer were getting more sophisticated and thus outside of the usual tutorials. I was either going to have to learn how to program databases properly or find some other tool.
There are plenty of dedicated statistics and scientific graphing programs out there, but all the ones I've found are limited in what they can do and relatively inflexible in their accepted data formats. In my case, the data sets ranged from search terms to subscriber-level ringtone records. Each was different and all needed a lot of scrubbing to put in a form that would allow the sort of calculations I was after, from powerlaw exponents to head/tail ratios.
And so I ended up where I'd started twenty years ago: writing code. As far as I can tell, it's the only way to really have the ultimate flexibility to ask any question of any kind of data set. I know of course that real geeks use Perl for this (as in "so I just tossed off a few lines of Perl", as if it were as simple as an email), but to be honest I just don't have the time to learn a new programming language. So I did what no self-respecting geek would admit to: I used Basic.
In fairness, a lot has changed since the "GOTO 50" days. For one thing, Microsoft's Visual Basic is now nearly as powerful (and as difficult to learn) as C++. It's fully structured, object-oriented and net-aware. It's also a pig to use, with even the most simple programs requiring all sorts of structure overhead. I gave up on that pretty quickly and went looking for something simpler.
I found it in Liberty Basic. It's a proper structured programming language without the unnecessary complexity of languages trying to prove that they're "enterprise class". I know its string handling isn't nearly as good as Perl, but everything else about it is so easy to use, from its editor to its debugger, that I can deal with writing my own text parsers and the like. And like most Basics these days, you can use it in interpreted or compiled mode, so you can create stand-alone programs if you want. It's not fast, but to be honest that's probably more due to my grossly unoptomized algorithms than it is the language itself.
If you've been dabbling in large-scale data analysis and running up against the limits of Excel, you might want to give Liberty a shot, at least for rough, proof-of-concept stuff. Here, for instance, is a program I wrote to rank the individual words used in the search terms of the notorious AOL dataset. (Others have written programs to rank the whole terms, but I wanted to look at it at word level, to compare with Zipf analyses of natural language). I had to use a tiny (50,000 terms, or about 250,000 words) subset of the whole set, and even that took nearly 40 minutes to run because I didn't bother to optimize my lookup code, but it was enough to get a usable answer.
Now that I've relearned some programming, the next step will be to move up to a proper tool that really can handle data sets as big as AOL's. I've just bought the legendary Mathematica, which is a favorite on Wall Street and in universities alike, so that may well be the answer, although it's far from intuitive. I'll probably start by replicating my search word ranking program to see how much faster and more powerful Mathematica is. I can already tell this is going to take a while. Data analysis can get easier, but that still doesn't make it actually, you know, easy.



Good luck with that. Just reading it gave me a brain-ache.
Posted by: Jeffrey Parsons | September 16, 2006 at 10:44 AM
You really should consider using R or S-Plus. R and S-Plus are related statistical tools. R is open source and has a plethora of add-ins available through CRAN. S-Plus is commercial, prettier, and has some modules built in that R does not. The third alternative is SAS, which is cumbersome but blindingly fast for datasets that can't be coerced into memory.
Mathematica is great for symbolic computing, draws the prettiest graphs, and is lots of fun. MatLab is the heavy lifting tool of choice for numerical computing. I see MatLab on the desks of more quants than Mathematica and it has good statistical tools. However, I suspect you'd be better off with a stats package.
R -- http://www.r-project.org/
S-Plus -- http://www.insightful.com/products/splus/default.asp
MatLab -- http://www.mathworks.com/products/matlab/
SAS -- http://www.sas.com/technologies/analytics/statistics/insight/index.html
-r hayes
Posted by: R Hayes | September 16, 2006 at 12:35 PM
Many social science researchers use SPSS for data analysis.
Posted by: Mindy McAdams | September 16, 2006 at 01:06 PM
I actually was fairly suprised none of your technical/statistical collaborators had already suggested R or S-Plus, as R Hayes did above. The tools you surveyed don't seem particularly suited for what it seems you need to do. Makes me wonder if you've been talking to the wrong crowd all along...
Posted by: Todd W. | September 16, 2006 at 05:24 PM
I recommend you check out http://www.tableausoftware.com/
They have an extremely flexible graph/visualization tool that queries the database directly rather than a spreadsheet. You can make nearly any type of data visualization you can think of fairly easily.
Posted by: Nisarg Kothari | September 16, 2006 at 07:18 PM
Is the raw data available? It was not clear from this posting. You can send URLs to my email. To be frank, it looks solvable with a few lines of APL, but I'd like to be able to demo that.
Posted by: Randy MacDonald | September 16, 2006 at 08:02 PM
Mathematica is feels much more about symbolic math. And Matlab doesn't do well with missing/unknown values, which makes it more painful than necessary for statistical analyses.
I miss SAS terribly, but R is probably what you want to learn here.
Also hellos to Reilly
Posted by: joshua schachter | September 17, 2006 at 09:08 AM
My statistician girlfriend swears by R. And I remember Mathematica from my undergraduate days being a pain in the ass. So I'm going to echo some of the above commenters.
Posted by: YLlama | September 17, 2006 at 06:43 PM
Check out Edward Tufte's book, Visual Explanations. He has devoted his whole academic career at Yale to helping people present information. www.edwardtufte.com
Posted by: Dan Koifman | September 17, 2006 at 08:05 PM
I'll also go for R. Use the RODBC library to connect with databases, the 'merge' command is great, and it has more statistical tests than any person can handle. Besides the standard 'Introduction to R', I can also recommend http://www.math.csi.cuny.edu/Statistics/R/printable/simple-12-twoside-letter.pdf
And if you don't know how to do something, you can always look at the mailinglist http://tolstoy.newcastle.edu.au/R/
Posted by: notme | September 18, 2006 at 01:13 AM
I think I mentioned this to you before, but JMP is the GUI version of SAS, and I think you'd like it as a middle step between spreadsheets and writing real code. It has pretty much every statistical test you can think of, and a very nice interactive mode where clicking on charts highlights the relevant rows and vice versa.
Posted by: Kevin Marks | September 18, 2006 at 01:46 AM
Not to make too strong a point, but that code is, well, ghastly, and Liberty looks to be somewhat, er, retro - like the 90's never happened :)
Without wanting to plug any specific language, I wrote a script in about 20 minutes that analysed 3.6m lines of search terms containing 8.9m words of which 589K were unique. 253K (I'm rounding as I go) occurred more than once.
Total execution time? 3 minutes, 40 seconds. About as long as it took to write this comment. And my scripting language of choice is by no means the fastest (hint: red gemstone).
For the record, my top 10:
of 113444
- 103649 (OK, I know it's not a real word!)
in 95224
the 85864
and 73232
for 72755
to 48050
free 41745
a 37089
google 34886
If any of this would be useful, just scream.
Posted by: Mike Woodhouse | September 18, 2006 at 07:14 AM
Since I place some value on an improved UI and since I used to use S back at Bellcore, I took a look at S-Plus at the site mentioned by R Hayes.
This website immediately scared me off the product: no price, which violates guideline #1 for B2B sites. Further violating usability guidelines, product info is given in PDF files that are slow and annoying to view.
I did give them the two minutes that users visit websites on average, but then I left. If the commercial product has so poor service on its website, then why pay when there's a free version.
I have tried about ten stats packages over the years, and they have all been bad in terms of usability. There were a few decent ones for the Mac back in the days when there was 3rd party software developed for that platform.
Posted by: Jakob Nielsen | September 18, 2006 at 05:04 PM
Please the above site out for an excellent data analysis tool
Posted by: random | September 19, 2006 at 12:37 AM
Have you tried Vilno? It's a new data crunching programming language at
www.my.opera.com/datahelper
Posted by: datahelper | September 24, 2006 at 03:46 PM
I second the recommendation of Edward Tufte's site and books. There's a long thread on his message board about graphing software:
http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=00000p&topic_id=1&topic=Ask+E%2eT%2e
I've used Aabel on the Mac to supplant Excel.
I also agree with Nielsen's comments on the S-Plus site!
Posted by: Rob Simmon | September 26, 2006 at 06:23 PM
Here's a ruby script that does the same as your basic script:
#!/usr/local/bin/ruby wordcnt = Hash.new(0) while line = gets id,words,rest = line.split("\t") words.split(/\s+/).each do |word| wordcnt[word] += 1 end end x = 0 wordcnt.sort{|a,b| b[1]<=>a[1]}.each { |e| break if x == 10 x += 1 puts "#{e[0]}\t#{e[1]}" }took less than 2 minutes to run across 3,558,412 linesPosted by: Joe | November 19, 2006 at 10:47 AM