Sunday, February 28, 2010

MapReduce in a Nutshell: Reduce

What we've established so far (see my earlier post) is that we can take a large set of data, partition it into smaller sets, and distribute those sets for processing to take advantage of parallelism. Each of those smaller sets ("library slices" in the example) generates its own set of intermediate results. In order to get our results, we must process all of the intermediate results, but we've already established that the data set is too large to perform the operation in one go. Here is where the Reduce step of MapReduce comes into play.

Remember that we called Map() once* for every book. We now need to call Reduce once* for every distinct word.

public void Reduce(string key, IEnumerable<byte> values)
long total = 0;
foreach (byte value in values)
total += value;
Emit(key, total);

Take a look at the intermediate results:
WORDS:[("", ),("", )]
WORDS:[("", ),("", )]

We're only going to call Reduce() once* per distinct word, but the key "the" exists in two of our slices. Both of those slices need to be logically joined before Reduce() can be called, with the combined intermediate values from both slices. Seeing as we're not using a framework, we need to manually set up some sort of routing that will ensure matching keys are joined. This is why Reduce() takes a key and a list of values as its input.

We used 5 Mapper objects, but there's no requirement for the same number of Reducers. Because we don't care about optimum performance yet, I've chosen to use 3. In real life, we would tune the number based on the hardware that's available, and the amount of data that needs to be pushed around to complete the process.

Here's a simple routing function that partitions words based on their initial letter, always routing matching words to the same reducer:
if (word[0] <= 'e')
// route to reducer 0
else if (word[0] <= 'r')
// route to reducer 1
// route to reducer 2

As it turns out, the naive routing function will split our data like this:

Reducer 0Reducer 1Reducer 2

The program continues to execute, this time calling Emit() once for every word, and its final (total) count. The behaviour of Emit() is not concrete, but remember that it's (probably) being called on different physical machines, and the final results - once emitted - need to be routed back to the start.

No comments: