Sunday, February 08, 2009

Working with large data sets is hard

It seems that the most difficult problems in programming today come from massive data sets, a relative lack of processing speed, or both.

Memory is effectively free, so we've decided to store everything we possibly can. And then we have problems analysing these massive data sets.

Google has attempted to solve the problem with its brilliant MapReduce technique. It relies on spreading the work across a number of cheap PCs on your network, and because of this, it scales well.
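To make the idea concrete, here's a toy single-machine sketch of the MapReduce pattern, counting words. This is not Google's implementation: the function names and the in-memory "shuffle" step are my own stand-ins for what the real framework does across many machines.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # The "map" step: emit a (word, 1) pair for every word.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group values by key. In a real cluster the framework does this,
    # routing each key to the machine that will reduce it.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # The "reduce" step: collapse all the counts for one word.
    return key, sum(values)

documents = ["the cat sat", "the cat ran"]
mapped = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# counts == {"the": 2, "cat": 2, "sat": 1, "ran": 1}
```

Because each map call is independent and each reduce call only needs the values for its own key, both phases can be farmed out to as many cheap machines as you have, which is where the scaling comes from.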

Anyone other than Google is a slave to the Big O.

If your algorithms aren't linear, or nearly linear, you're going to struggle when you run them over large data sets, and there's not a lot you can do about it other than throw processors at the problem.
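A small illustration of why the distinction bites. Both functions below answer the same question (does the list contain a duplicate?), but one is quadratic and one is linear; the example is mine, not from the post.

```python
def has_duplicates_quadratic(items):
    # O(n^2): compare every pair. Doubling the input
    # roughly quadruples the work.
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False

def has_duplicates_linear(items):
    # O(n): one pass with a hash set. Doubling the input
    # merely doubles the work.
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```

At a million items the quadratic version is doing on the order of half a trillion comparisons; no amount of extra hardware makes that pleasant, whereas the linear version finishes in a million steps.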
