September 16, 2006

Listed below are links to weblogs that reference When spreadsheets aren't enough:

Comments

Jeffrey Parsons

Good luck with that. Just reading it gave me a brain-ache.

R Hayes

You really should consider using R or S-Plus. R and S-Plus are related statistical tools. R is open source and has a plethora of add-ins available through CRAN. S-Plus is commercial, prettier, and has some modules built in that R does not. The third alternative is SAS, which is cumbersome but blindingly fast for datasets that can't be coerced into memory.

Mathematica is great for symbolic computing, draws the prettiest graphs, and is lots of fun. MatLab is the heavy lifting tool of choice for numerical computing. I see MatLab on the desks of more quants than Mathematica and it has good statistical tools. However, I suspect you'd be better off with a stats package.

R -- http://www.r-project.org/
S-Plus -- http://www.insightful.com/products/splus/default.asp
MatLab -- http://www.mathworks.com/products/matlab/
SAS -- http://www.sas.com/technologies/analytics/statistics/insight/index.html

-r hayes

Mindy McAdams

Many social science researchers use SPSS for data analysis.

Todd W.

I actually was fairly surprised none of your technical/statistical collaborators had already suggested R or S-Plus, as R Hayes did above. The tools you surveyed don't seem particularly suited for what it seems you need to do. Makes me wonder if you've been talking to the wrong crowd all along...

Nisarg Kothari

I recommend you check out http://www.tableausoftware.com/

They have an extremely flexible graph/visualization tool that queries the database directly rather than a spreadsheet. You can make nearly any type of data visualization you can think of fairly easily.

Randy MacDonald

Is the raw data available? It was not clear from this posting. You can send URLs to my email. To be frank, it looks solvable with a few lines of APL, but I'd like to be able to demo that.

joshua schachter

Mathematica feels much more about symbolic math. And Matlab doesn't do well with missing/unknown values, which makes it more painful than necessary for statistical analyses.

I miss SAS terribly, but R is probably what you want to learn here.

Also hellos to Reilly

YLlama

My statistician girlfriend swears by R. And I remember Mathematica from my undergraduate days being a pain in the ass. So I'm going to echo some of the above commenters.

Dan Koifman

Check out Edward Tufte's book, Visual Explanations. He has devoted his whole academic career at Yale to helping people present information. www.edwardtufte.com

notme

I'll also go for R. Use the RODBC library to connect with databases, the 'merge' command is great, and it has more statistical tests than any person can handle. Besides the standard 'Introduction to R', I can also recommend http://www.math.csi.cuny.edu/Statistics/R/printable/simple-12-twoside-letter.pdf
And if you don't know how to do something, you can always look at the mailing list: http://tolstoy.newcastle.edu.au/R/

Kevin Marks

I think I mentioned this to you before, but JMP is the GUI version of SAS, and I think you'd like it as a middle step between spreadsheets and writing real code. It has pretty much every statistical test you can think of, and a very nice interactive mode where clicking on charts highlights the relevant rows and vice versa.

Mike Woodhouse

Not to make too strong a point, but that code is, well, ghastly, and Liberty looks to be somewhat, er, retro - like the 90's never happened :)

Without wanting to plug any specific language, I wrote a script in about 20 minutes that analysed 3.6m lines of search terms containing 8.9m words of which 589K were unique. 253K (I'm rounding as I go) occurred more than once.

Total execution time? 3 minutes, 40 seconds. About as long as it took to write this comment. And my scripting language of choice is by no means the fastest (hint: red gemstone).

For the record, my top 10:
of 113444
- 103649 (OK, I know it's not a real word!)
in 95224
the 85864
and 73232
for 72755
to 48050
free 41745
a 37089
google 34886

If any of this would be useful, just scream.
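
The kind of tally described above (total words, unique words, words seen more than once, and a top-N list) can be sketched in a few lines of Ruby. This is only an illustration, not the script the commenter actually ran: the `word_stats` name, the whitespace tokenization, and the sample search terms are all assumptions.

```ruby
# Hypothetical helper: summarize word frequencies across search-term lines.
# Assumes one search term per line, with words separated by whitespace.
def word_stats(lines, top_n = 10)
  counts = Hash.new(0)
  lines.each do |line|
    line.split.each { |word| counts[word] += 1 }
  end
  {
    total_words: counts.values.sum,
    unique_words: counts.size,
    repeated_words: counts.count { |_word, n| n > 1 },
    # Highest count first; ties broken alphabetically so output is stable.
    top: counts.sort_by { |word, n| [-n, word] }.first(top_n)
  }
end

# Made-up sample input, for illustration only.
sample = ["free music", "free ringtones", "music of the 80s"]
stats = word_stats(sample, 3)
puts "#{stats[:total_words]} words, #{stats[:unique_words]} unique, " \
     "#{stats[:repeated_words]} repeated"
stats[:top].each { |word, n| puts "#{word}\t#{n}" }
```

For a real file, something like `word_stats(File.foreach("terms.txt"))` would stream the lines rather than load them all at once (the filename here is hypothetical).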

Jakob Nielsen

Since I place some value on an improved UI and since I used to use S back at Bellcore, I took a look at S-Plus at the site mentioned by R Hayes.

This website immediately scared me off the product: no price, which violates guideline #1 for B2B sites. Further violating usability guidelines, product info is given in PDF files that are slow and annoying to view.

I did give them the two minutes that users spend on a website on average, but then I left. If the commercial product has such poor service on its website, then why pay when there's a free version?

I have tried about ten stats packages over the years, and they have all been bad in terms of usability. There were a few decent ones for the Mac back in the days when there was 3rd party software developed for that platform.

random

Please check the above site out for an excellent data analysis tool

datahelper


Have you tried Vilno? It's a new data crunching programming language at

www.my.opera.com/datahelper

Rob Simmon

I second the recommendation of Edward Tufte's site and books. There's a long thread on his message board about graphing software:
http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=00000p&topic_id=1&topic=Ask+E%2eT%2e

I've used Aabel on the Mac to supplant Excel.

I also agree with Nielsen's comments on the S-Plus site!

Joe

Here's a ruby script that does the same as your basic script:

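#!/usr/local/bin/ruby
# Count how often each word occurs in the second tab-separated field.
wordcnt = Hash.new(0)
while line = gets
  id, words, rest = line.split("\t")
  next if words.nil?  # skip lines that have no second field
  words.split(/\s+/).each do |word|
    wordcnt[word] += 1
  end
end
# Print the ten most frequent words, highest count first.
wordcnt.sort_by { |_word, count| -count }.first(10).each do |word, count|
  puts "#{word}\t#{count}"
end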

It took less than 2 minutes to run across 3,558,412 lines.

printer cartridges

Macros in Google Spreadsheets would make it a much more productive application. Looking forward to seeing that functionality.

Solver, as mentioned in the post, would also be very much appreciated.
