Sunday, June 12, 2011

External Sort

If you're ever tasked with sorting a data set larger than can fit into the memory of a single machine, you shouldn't need to panic. Put on your outside-the-box hat (pun intended) and get to work.

First of all, divide the data into blocks that are small enough to be sorted in memory, then sort them and write the results to disk (or network, or anywhere external to your process). If memory size is M, and total data to be sorted is MxN then you should now have N blocks of locally sorted data.

Next, do an N-way merge. I did it by getting N buffered readers over the N blocks. By continually getting the next lowest value from the pool of N readers (it's easy if you continually sort the readers by last obtained value in ascending order) and writing the obtained values into an output file, you will end up with N globally sorted blocks of data.

For most people attempting a very large sort, this is usually the end result (I am assuming not a lot of people have this requirement very often, and even less frequently is it their first time at seeing it.) If you're left wanting, however, you must continue...

For large values of N it first becomes prohibitive to buffer the input, then to even access one value for every N at the same time. In this case, you will need to perform a second (or higher) pass, working on less than N blocks at a time. A 32-bit Windows machine should only begin to approach this next hurdle somewhere after the 10's of terabytes mark (depending of course, in the size of the objects being sorted)...

No comments: