Over the last two years of doing Long Tail research, one of the biggest learning curves for me has been just getting re-used to proper analytical tools. It's been close to two decades since I last did any real science, and back then our computational tools pretty much consisted of pocket calculators and Fortran. I remember being seen as a bit of a physics lightweight for stooping to spreadsheets on occasion (this was the late 80s), and the graphing programs that we used to plot data took as long to understand as the data itself. Since then I've pretty much lived in Excel (which I actually use more than Word), and that's always been enough. Until now.
In early 2004, when I stated this project, I was pretty much counting on the kindness of strangers to do most of the data-massaging work for me. I'd craft a data request and then negotiate with the appropriate company to have it generated by the clever elves in their technical division. But as time went on my requests got more ambitious and I started getting more and more raw data and massive data sets, often with millions of entries. That's when I realized that Excel wasn't going to do the trick anymore. For starters, it can only handle 65,000 rows (that's been raised to 1 million rows in Excel 2007, if you can bear the bugs that still exist in the beta version now available).
For bigger files than a spreadsheet can handle, you'll want a database. Because I already had Access, I used that. It's a graceless, over-complicated program, but it can at least handle huge files. I mostly used it to parse data sets and do very simple queries (sorts, sums, filters and the like) that would output files that were small enough to analyze in Excel. But the data sets kept getting more complicated, and the questions I was trying to answer were getting more sophisticated and thus outside of the usual tutorials. I was either going to have to learn how to program databases properly or find some other tool.
There are plenty of dedicated statistics and scientific graphing programs out there, but all the ones I've found are limited in what they can do and relatively inflexible in their accepted data formats. In my case, the data sets ranged from search terms to subscriber-level ringtone records. Each was different and all needed a lot of scrubbing to put in a form that would allow the sort of calculations I was after, from powerlaw exponents to head/tail ratios.
And so I ended up where I'd started twenty years ago: writing code. As far as I can tell, it's the only way to really have the ultimate flexibility to ask any question of any kind of data set. I know of course that real geeks use Perl for this (as in "so I just tossed off a few lines of Perl", as if it were as simple as an email), but to be honest I just don't have the time to learn a new programming language. So I did what no self-respecting geek would admit to: I used Basic.
In fairness, a lot has changed since the "GOTO 50" days. For one thing, Microsoft's Visual Basic is now nearly as powerful (and as difficult to learn) as C++. It's fully structured, object-oriented and net-aware. It's also a pig to use, with even the most simple programs requiring all sorts of structure overhead. I gave up on that pretty quickly and went looking for something simpler.
I found it in Liberty Basic. It's a proper structured programming language without the unnecessary complexity of languages trying to prove that they're "enterprise class". I know its string handling isn't nearly as good as Perl, but everything else about it is so easy to use, from its editor to its debugger, that I can deal with writing my own text parsers and the like. And like most Basics these days, you can use it in interpreted or compiled mode, so you can create stand-alone programs if you want. It's not fast, but to be honest that's probably more due to my grossly unoptomized algorithms than it is the language itself.
If you've been dabbling in large-scale data analysis and running up against the limits of Excel, you might want to give Liberty a shot, at least for rough, proof-of-concept stuff. Here, for instance, is a program I wrote to rank the individual words used in the search terms of the notorious AOL dataset. (Others have written programs to rank the whole terms, but I wanted to look at it at word level, to compare with Zipf analyses of natural language). I had to use a tiny (50,000 terms, or about 250,000 words) subset of the whole set, and even that took nearly 40 minutes to run because I didn't bother to optimize my lookup code, but it was enough to get a usable answer.
Now that I've relearned some programming, the next step will be to move up to a proper tool that really can handle data sets as big as AOL's. I've just bought the legendary Mathematica, which is a favorite on Wall Street and in universities alike, so that may well be the answer, although it's far from intuitive. I'll probably start by replicating my search word ranking program to see how much faster and more powerful Mathematica is. I can already tell this is going to take a while. Data analysis can get easier, but that still doesn't make it actually, you know, easy.