Sunday, February 28, 2010

MapReduce in a Nutshell: Reduce

What we've established so far (see my earlier post) is that we can take a large set of data, partition it into smaller sets, and distribute those sets for processing to take advantage of parallelism. Each of those smaller sets ("library slices" in the example) generates its own set of intermediate results. In order to get our results, we must process all of the intermediate results, but we've already established that the data set is too large to perform the operation in one go. Here is where the Reduce step of MapReduce comes into play.

Remember that we called Map() once* for every book. We now need to call Reduce() once* for every distinct word.

public void Reduce(string key, IEnumerable<byte> values)
{
    long total = 0;
    foreach (byte value in values)
        total += value;
    Emit(key, total);
}
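
For example, in our library the word "the" was emitted 3 times by slice 0 and 4 times by slice 1 (see the previous post), so once those intermediate values have been joined, the reduce phase would make a call equivalent to:
Reduce("the", new byte[] { 1, 1, 1, 1, 1, 1, 1 }); // emits ("the", 7)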


Take a look at the intermediate results:
SLICE:0
WORDS:[("the", [1,1,1]),("chicken", [1]),("sticking", [1]),...]
SLICE:1
WORDS:[("the", [1,1,1,1]),("chicken", [1]),("road", [1,1]),...]


We're only going to call Reduce() once* per distinct word, but the key "the" exists in two of our slices. Both of those slices need to be logically joined before Reduce() can be called, with the combined intermediate values from both slices. Seeing as we're not using a framework, we need to manually set up some sort of routing that will ensure matching keys are joined. This is why Reduce() takes a key and a list of values as its input.

We used 5 Mapper objects, but there's no requirement for the same number of Reducers. Because we don't care about optimum performance yet, I've chosen to use 3. In real life, we would tune the number based on the hardware that's available, and the amount of data that needs to be pushed around to complete the process.

Here's a simple routing function that partitions words based on their initial letter, always routing matching words to the same reducer:
int Route(string word)
{
    if (word[0] <= 'e')
        return 0; // route to reducer 0
    if (word[0] <= 'r')
        return 1; // route to reducer 1
    return 2;     // route to reducer 2
}


As it turns out, the naive routing function will split our data like this:
Reducer 0   Reducer 1   Reducer 2
a           feathers    sticking
and         first       the
beginning   god         till
butt        heaven      up
cared       in          which
chicken     make        who
created     not         would
crossed     or          years
does        reported    you
down        road        your
earth
egg


The program continues to execute, this time calling Emit() once for every distinct word, together with its final (total) count. The behaviour of Emit() is not concrete, but remember that it's (probably) being called on different physical machines, and the final results - once emitted - still need to be routed back to wherever the job started.
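
Putting the pieces together, here is a minimal single-process sketch of the routing and joining described above - my own illustration, not part of any framework. It assumes the intermediate dictionaries have been collected from the Emitters, joins matching keys into one of three reducer buckets via the Route() function, and then calls Reduce() once per distinct word:
// Illustrative sketch only: in a real deployment each bucket would live on its own
// reducer machine, and Reduce()/Emit() would run there.
void RunReducePhase(IEnumerable<Dictionary<string, List<byte>>> emitterResults)
{
    // One bucket per reducer, each mapping a word to all of its intermediate values.
    var buckets = new Dictionary<string, List<byte>>[3];
    for (int i = 0; i < buckets.Length; i++)
        buckets[i] = new Dictionary<string, List<byte>>();

    // Shuffle: route every (word, values) pair to a reducer, joining matching keys.
    foreach (var slice in emitterResults)
        foreach (var pair in slice)
        {
            var bucket = buckets[Route(pair.Key)];
            if (!bucket.ContainsKey(pair.Key))
                bucket.Add(pair.Key, new List<byte>());
            bucket[pair.Key].AddRange(pair.Value);
        }

    // Reduce: one call per distinct word, with the combined values from all slices.
    foreach (var bucket in buckets)
        foreach (var pair in bucket)
            Reduce(pair.Key, pair.Value);
}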

Saturday, February 27, 2010

MapReduce in a Nutshell: Map

You have a problem to solve, but the set of input data is far too large to solve it by simply looping over all the data.

What if you broke the input data into chunks and analysed only a chunk at a time, somehow aggregating the intermediate results into something meaningful at a later stage?

How do you perform this partial analysis in such a way that an aggregation is possible?

The standard Wikipedia example counts all the words in a collection of books. The only problem is that the collection of books is too large, or the intermediate word counts are too large, to run on one conventional computer.

So. First of all, we have the superset of books: the library. We need to break it up. Maybe if it was 1/5th the size, we could analyze it in a single process? We will call each 1/5th a slice of the library.

We still need to define our function that will do the first step of analysis on each book, remembering it will be called once* for each and every book in the library.
void Map(Book book)
{
    foreach (string word in book.Words)
        EmitIntermediate(word, 1);
}


Now that we have the function, where do we put it? On a Mapper object, of course. We will create one* Mapper for each slice of the library, then we will invoke the Map() function once per book in that library slice. Then for every word in the book, we will call the mysterious EmitIntermediate() function.

For simplicity, let's just assume that EmitIntermediate delegates its call to an Emitter object. We will probably have one Emitter per library slice; each Mapper will have its own Emitter. This is good for isolating Mappers: we're not sharing state between them, so there won't be any concurrency issues even though multiple Mappers are working in parallel.

Here is a possible implementation of an Emitter for our example - it differs from the standard example in that it performs some arithmetic, continuously updating the slice's word counts:
class Emitter
{
    private Dictionary<string, long> intermediate; // constructed elsewhere
    public void Emit(string key, long value)
    {
        if (!intermediate.ContainsKey(key))
            intermediate.Add(key, 0);
        intermediate[key] += value;
    }
}


Here is another possible implementation of an Emitter - it's more like the standard example as it stores a list of values, to be summed up at a later stage. Because we know that "1" is the only value we ever emit, we can reduce the amount of storage space required by defining the values as a list of bytes.
class AnotherEmitter
{
    private Dictionary<string, List<byte>> intermediate; // constructed elsewhere
    public void Emit(string key, byte value)
    {
        if (!intermediate.ContainsKey(key))
            intermediate.Add(key, new List<byte>());
        intermediate[key].Add(value);
    }
}


There's no requirement for the Emitter's keys and values to be backed by an object in memory; this is just a simplification to explain the MapReduce concept.

Let's start this program up and run it. We take our library and divide it into slices. For each slice we create a Mapper and an Emitter. For each book in the slice we call the Map() function, which calls EmitIntermediate() for every word it finds. Once all the slices have been mapped, we inspect the Emitters. There's one Emitter per slice, and it contains all of the words for all the books in that slice.
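
As a rough single-process sketch of that run - my own illustration, where the Mapper class, the shape of the slices, and an AnotherEmitter that constructs its own dictionary are all assumptions - the map phase might look something like this:
// Illustrative sketch only: in a real deployment each Mapper (and its Emitter)
// would run on its own compute node, processing its slice in parallel with the others.
class Mapper
{
    private readonly AnotherEmitter emitter;
    public Mapper(AnotherEmitter emitter) { this.emitter = emitter; }

    public void Map(Book book)
    {
        foreach (string word in book.Words)
            EmitIntermediate(word, 1);
    }

    private void EmitIntermediate(string word, byte count)
    {
        emitter.Emit(word, count); // delegate to this Mapper's own Emitter
    }
}

// One Mapper and one Emitter per slice; Map() is called once per book in the slice.
List<AnotherEmitter> RunMapPhase(IEnumerable<IEnumerable<Book>> librarySlices)
{
    var emitters = new List<AnotherEmitter>();
    foreach (var slice in librarySlices)
    {
        var emitter = new AnotherEmitter();
        var mapper = new Mapper(emitter);
        foreach (Book book in slice)
            mapper.Map(book);
        emitters.Add(emitter);
    }
    return emitters;
}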

Here's a concrete example of our entire library:
SLICE:0
BOOK:"Fight Club"
WORDS:"Sticking feathers up your butt does not make you a chicken"
BOOK:"The Bible"
WORDS:"In the beginning God created the heaven and the earth"
SLICE:1
BOOK:"Still Life With Woodpecker"
WORDS:"who cared which crossed the road first, the chicken or the egg"
BOOK:"FIASCO"
WORDS:"would not be reported till years down the road"
SLICE:N
BOOK:*
WORDS:*


When we examine the intermediate values for the Emitter on slice 0, we find "the" 3 times (all from the Bible) and "chicken" just 1 time. Unsurprisingly, there's not a lot of crossover between Fight Club and the Bible. The Emitter on slice 1 has slightly different results: "the" comes in 4 times (3 times from Still Life, and 1 time from FIASCO), but "chicken" is again there just 1 time (this time thanks to Still Life).

What we need now is something that will aggregate our intermediate results so that we get 7x "the"s and 2x "chicken"s. In the next post, it's time for the Reduce step.

* Lies to children: if we expand our example to become more robust, we could end up retrying the Map() call for a book that previously failed, or even retrying an entire slice with a brand new Mapper if a compute node failed.

Friday, February 26, 2010

Edgar Frank

Why are relational databases so ubiquitous? We are taught about them in Computer Science lessons at university; we see job advertisements requiring SQL skills all the time; almost every project I've touched in my professional career has used a DBMS of some sort, to various degrees of efficacy.

I will admit that to some extent my exposure is biased and self-fulfilling: I took the course; someone heard I was interested in developing the skills; I started looking to reuse and improve those skills; and now every project wants to leverage existing skills, and a subset of those projects would hire people who know no other language but SQL.

But why? Who is designing all these applications? Are they really considering the actual requirements, or are they following the approach prescribed by DBMS vendors: use a relational database. Heck, use our DBMS.

Most database systems worth their salt offer two primary features: a reliable transactional system and a way to protect data integrity - in other words, ACID and entity/referential/data integrity.

Even if we don't use explicit transactions, or even consider them in our designs, the ACID properties (atomic, consistent, isolated, durable) stop simultaneous updates from rendering our entities in an inconsistent state.

Entity, referential and data integrity are enforced by the DBMS, but we need to design the structure of the database to take advantage of them. We declare entity integrity by defining primary keys (no two rows are duplicates), referential integrity by defining foreign keys (no unresolved dependencies) and data integrity by defining column types such as int and datetime.

SQL Server also offers us the ability to store massive amounts of data, indexes to seek right into the bits of data that you need, and a powerful query optimiser that will join related bits of data in the most efficient way possible. It also allows the data to be restored to arbitrary points in time in the past. With failover clustering it even attempts resilience against hardware failures. But despite its prevalence, I know that for any given task there necessarily exists something (in concept at least) more efficient.

My perceived problem with DBMSes lies in their applicability to general problems - in short, their ubiquity. People don't question whether or not they need a DBMS; instead they question which vendor to purchase one from. Understanding everything a database does isn't something everybody wants to do, but more and more people are available in the market to build your company's new idea into a tangible product. A recruitment agent has a much easier job selling a labelled skill than selling some undefined/unquantifiable genius. A man (or woman) can make a living by writing SQL code for another man (or woman). It is everywhere. But why re-invent the wheel?

Well, when you don't know how large your application will have to scale up to at design time, you can take the bet that it won't scale, you can save some time and money by hiring a SQL developer off the street, and you will probably deliver something that's functionally correct if you do it right. But down the line, usually when you become a victim of your own success, it will become the bottleneck, and seemingly impossible to extract from the code that's been written around it...

Monday, February 22, 2010

iSCSI Target

I have wanted to set up my own SQL Server failover cluster for quite a while, but never really understood how: a long time ago you had to pick special hardware off a list provided by Microsoft, you needed to run a special edition of Windows Server, and you needed Fibre Channel. In short, a recent university graduate wasn't going to get anywhere.

It all seems to have changed though. iSCSI has emerged as an alternative to Fibre Channel, which means that you use your gigabit network to communicate with the SAN instead of some cheeky fibre optic cables (all those years ago it still wouldn't have helped to know that you could use copper cables too). Windows Server 2008 ships with the iSCSI initiator built-in, although you can download it for free from Microsoft if you're running Windows Server 2003 or earlier.

My first mission was simple. Get it working. I chose to run iSCSI Enterprise Target under Ubuntu, with a 16GB USB flash drive formatted in NTFS. This would be my SAN. Note: you need to choose the correct block device name so that you don't end up destroying hardware or losing data. For my flash disk I used /dev/sdc1 (maybe an Ubuntu guru could tell me why /dev/sdc without the number didn't work 100%) and yours will almost certainly be different. [edit: on a fresh install of Karmic, I had to run sudo apt-get update first, otherwise it complained that the package couldn't be found]
sudo apt-get install iscsitarget
Once the installation was complete, I edited the service configuration file
sudo gedit /etc/ietd.conf
Adding in the following two lines:
Target iqn.2010-02.com.extellect:ta.tb.tc.td
Lun 0 Path=/dev/sdc1,Type=fileio

After saving the file and closing the editor, I restarted the Ubuntu machine, and powered up the Windows Server.

I clicked on Start > Administrative Tools > iSCSI Initiator, clicked Yes to the questions about the firewall, and thought about setting the service as a dependency of the SQL Server service so that the drive would be available whenever SQL Server was running.

Looking at the iSCSI Target properties window:
Discovery tab: Add portal: aa.bb.cc.dd port 3260
Targets tab: automatically populated thanks to discovery: iqn.2010-02.com.extellect:ta.tb.tc.td and the status is "Inactive"
Click the Log on button, then OK.
The status is now "Connected"

At this point take a look at the Ubuntu machine to see the connection
sudo cat /proc/net/iet/session

Back to the Windows machine, this time opening Computer Management > Storage > Disk Management
I saw my flash drive but the disk said it was offline - right click and choose online from the context popup menu. Now I saw a familiar (and happy) sight:
Disk 2 Basic 14.95 GB Online - CRUZER (F:) 14.95 GB NTFS Healthy (Primary Partition)

Opened up SQL Server Management Studio and created a new database, choosing F: as the default data and log location. Created a table, inserted and then selected "Hello, World!"

Checked on the Ubuntu machine, and there were the .mdf and .ldf files!

Sunday, February 14, 2010

Sequential vs. Random I/O

I mentioned here that sequential I/O was good and random I/O was bad. Bad is the wrong word to use. Random I/O is comparatively less efficient than sequential I/O when accessing the same amount of data. The point to take home is that while a hard disk head is seeking to a different location, it isn't reading and returning data. In the sequential I/O scenario there is very little seeking, and the drive can quickly return all the data in a single stream. In contrast, if random I/O causes the disk to spend half its time seeking, it can only return half the amount of data in the same amount of time as the sequential scenario.

Here are two performance considerations for designing and administering a SQL Server database and server:

Index seeks and scans use random and sequential I/O respectively. The performance gained by index seeks comes from the reduction in the amount of data that needs to be accessed. Finding 1 row in just 4 random I/Os is much quicker than scanning through 10000 I/Os' worth of data sequentially. There is a point at which scanning is quicker, and that's when the tables are small.

Log files are accessed sequentially, so it makes sense to put each database's log file onto its own disk for the best performance. Sharing one disk between many different log files forfeits that benefit, because the disk head has to keep seeking between them.
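
If you want a rough feel for the difference on your own hardware, here is a small and deliberately naive C# sketch that reads the same number of blocks from a large file sequentially and then at random offsets. The file path is a placeholder, and the OS file cache will blur the numbers unless the file is much larger than available memory:
using System;
using System.Diagnostics;
using System.IO;

class IoComparison
{
    const int BlockSize = 8 * 1024; // 8 KB per read
    const int BlockCount = 10000;   // reads per test

    static void Main()
    {
        string path = @"C:\temp\bigfile.dat"; // placeholder: substitute a suitably large file
        var buffer = new byte[BlockSize];
        var random = new Random();
        long fileLength = new FileInfo(path).Length;

        // Sequential: read block after block with little or no seeking.
        var stopwatch = Stopwatch.StartNew();
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
            for (int i = 0; i < BlockCount; i++)
                stream.Read(buffer, 0, BlockSize);
        Console.WriteLine("Sequential: {0} ms", stopwatch.ElapsedMilliseconds);

        // Random: seek to an arbitrary offset before every read.
        stopwatch = Stopwatch.StartNew();
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read))
            for (int i = 0; i < BlockCount; i++)
            {
                stream.Seek((long)(random.NextDouble() * (fileLength - BlockSize)), SeekOrigin.Begin);
                stream.Read(buffer, 0, BlockSize);
            }
        Console.WriteLine("Random: {0} ms", stopwatch.ElapsedMilliseconds);
    }
}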

Tuesday, February 09, 2010

Caution: Assembly Under Test

What do we test? What can we test? What should we test? Three subtly different questions with radically different answers in the general case.

We start by answering the questions that weren't asked: why do we test and why should we test?

Testing increases developers' confidence to introduce new changes without breaking existing functionality. Anecdotally, it also reduces overall project completion time, because bugs are found and dealt with sooner and less time is spent in the testing phase. It would be a tall order to define all the reasons we should test in one paragraph of one tiny blog.

There are usually ancillary reasons why we do test. Someone else forces us to, or we choose to, or we need some experience for the CV. People's motivations are endlessly creative. Heck, I can imagine many a political reason for coding unit tests.

We can test pretty much anything, but it doesn't mean we should. As we become more familiar with various testing tools and APIs we increase our range of what can be tested, but we should focus on the application of our knowledge to what we should be testing.

What should we test? Best question yet. From the point of view of the person who's paying you to develop software: we should test things that make our project more attractive than a (perhaps imaginary) competitor's project that doesn't employ unit testing. Do we deliver it faster? Do we deliver it with fewer bugs? Is everyone on the team happy? Did we get to do a little bit of CV++ along the way?

What do we test? In an ideal world, we test the objects we write. Seeing as I come from an object-oriented background, I'd say we test the state and behaviour of objects. We test state by inspecting the values of properties. We test behaviour by inspecting the return values of methods, and sometimes by ensuring that we're interacting correctly with other objects.

  • State: make assertions against expected and actual property values
  • Behaviour: make assertions against expected and actual method return values
  • Interactive behaviour: make assertions against the objects with which our objects interact - what we're testing is that we're actually interacting with them in the expected manner

Mock objects are the flavour of the week when it comes to testing interactive behaviour - we set up expectations on a mock object, interact with it, and then validate our expectations. They are there so that we can test that our object under test is interacting correctly with its surroundings.

Stub objects don't fit anywhere in this list of tests because we don't assert against them... we only use stub objects to allow our tests to compile and execute. We choose to use stub objects because we can control them; control their dependencies. They are an important part of any isolation framework. We don't want our logging method to keep writing disk files or sending e-mails while our tests run, but we don't want it to throw NullReferenceExceptions either. We definitely don't want to code our application to continuously check whether the logger is null (this would be tedious and error prone), nor would we want to introduce special branching logic to skip logging under test conditions. We could code up our own NullLogger : ILogger that implements all of ILogger's events/properties/methods to do nothing, but this would be fragile and often a waste of time when functionality is added to or removed from the ILogger interface a month down the line. Good frameworks like Rhino.Mocks allow dynamic stubs to be built on the fly without you ever having to care about creating and maintaining your own NullLogger implementations. Saving us time, saving us money.
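
As a small illustrative sketch - the ILogger, OrderProcessor and Order types here are invented for the example, while MockRepository.GenerateStub is the real Rhino.Mocks call - replacing a hand-rolled NullLogger with a dynamic stub looks roughly like this:
ILogger logger = MockRepository.GenerateStub<ILogger>(); // a do-nothing ILogger, built on the fly
var processor = new OrderProcessor(logger);              // hypothetical object under test
processor.Process(new Order());                          // any logging calls quietly do nothing

// No assertions are ever made against the stub; it exists only so the test can run.
Assert.IsTrue(processor.LastOrderSucceeded);             // assert against the object under test instead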

Just because we can test state and behaviour (even interactive behaviour) doesn't mean we should. There needs to be justification. Does it make the application development experience quicker and better? Does it reduce overall costs? Does it allow you to sleep better at night (no, not because you are so exhausted from fighting fires and fixing production bugs!)? Are the tests valid? Will they help someone newly joining the team to understand what's going on? You can be as creative as you want in your reasoning, but without proper justification, you may just be wasting time writing and running tests.

Sunday, February 07, 2010

Unblock

Windows 7 is pretty security-conscious. I downloaded the latest Rhino.Mocks zip, extracted the .dll and .xml, referenced them into a new Visual Studio test project, and CTRL+R,CTRL+T'd to run the test. Fail! Test run deployment issue: 'Rhino.Mocks.dll' is not trusted.

Unbeknownst to me, there's a new option in Windows 7 (visible from the file's properties window): "This file came from another computer and might be blocked to help protect this computer." Smart. And there's an Unblock button nearby. Click it.

It doesn't end there. I extracted the .xml file too, so that I'd have access to the Intellisense code documentation comments. This file needs to be unblocked too. Even if you've unblocked the .dll, the error message will persist: 'Rhino.Mocks.dll' is not trusted. Lies lies lies. Repeat step 1 to unblock the .xml.

Finally, Visual Studio will likely be working with cached assemblies, so the original .dll and .xml that were copied into the bin directory before the initial test run deployment failure are probably still there (and they're still blocked). Delete them and let Visual Studio pick up the fresh unblocked files, and you'll be on your way again in no time.

Confidence & Unit Tests

Why do we waste our time with these stupid unit tests, some might ask. Well, if your system's requirements can change on a dime then I'd say (besides the initial measure-twice-cut-once rule) that the next most important thing you can do is have an automated suite of tests that can ensure your "tiny" "quick" code change doesn't have a massive negative impact on the rest of the system. An automated build that runs unit tests covering most of the code base gives developers confidence to change the relevant pieces of source code and get quick turnaround times for new features and bug fixes. Yes! Quick turnaround times, because the code was all already under test. I hate to say it, but "agile".

Measure Twice, Cut Once

Unit testing offers software developers a parallel with the world of physical construction: measure twice, cut once. Whether you're in charge of putting together a wooden tie-rack or a skyscraper, it's slightly more than just inconvenient to find that someone in the process delivered a wrong-sized artifact because they were rushing and only looked over their work once. A second pair of eyes, a buddy check, even double-checking on the developer's own part can reduce the occurrence of these kinds of errors. Unit tests are a great tool for software engineers in this regard. We code up the business logic once, and then we get to test it from an alternate perspective (please, for heaven's sake, don't mimic the exact flow of logic in your unit tests: even if there's a bug, your tests will erroneously pass!)

Friday, February 05, 2010

ConditionalAttribute("DEBUG")

This is quite a neat concept that I only partially grasped until now. When you apply this attribute to a target (either a method or a class), it is compiled just like any other custom attribute. Look at the assembly in a tool like .NET Reflector and you'll see it there, exactly as you defined it. Bog standard.

The clever part is what happens next, whether you're referencing the target yourself (perhaps even from the same assembly in which it is defined), or someone else is referencing it by linking to your assembly as a 3rd party library.

The compiler recognizes that the target has a conditional attribute, and it checks its list of defined symbols to see if there's a match. If there's a match, it proceeds as normal, but if there's no match, the compiler modifies the caller to leave the method call (or the custom attribute declaration) out of the newly compiled assembly entirely. Simple as that. In this way, the caller (and this is the important part) will never know that the target method or attribute class exists. This can have great performance implications: code that's compiled with the DEBUG symbol defined (or whatever symbol you choose for your project) will make all the calls, which could include very verbose tracing, but code compiled without the symbol will have the method calls (and custom attributes) elided. In the case of method calls, not only is the call itself skipped, but any logic for building up its arguments is skipped too. For example:

ConditionalTargetClass conditionalTargetClass = new ConditionalTargetClass();
conditionalTargetClass.conditionalTargetMethod(Factorial(2000));


becomes

ConditionalTargetClass conditionalTargetClass = new ConditionalTargetClass();

There are a few limitations:
When applying the attribute to a method, that method's return type must be void.
When applying it to a class, the class must derive from System.Attribute.

Another example:
[Conditional("DEBUG")]
public class DebugAttribute : Attribute { /* EMPTY */ }

[Debug]
public class Program { /* EMPTY */ }


becomes

[Conditional("DEBUG")]
public class DebugAttribute : Attribute { /* EMPTY */ }

public class Program { /* EMPTY */ }

Wednesday, February 03, 2010

Method Scenario ExpectedBehaviour

What do you call your unit test methods? How do you logically structure them? I've been over this before, so it's time for a little refinement. One unit test assembly for every assembly under test. One test fixture for every class under test. Many test methods for every method under test. And to make everything easy to read, every test method should follow this pattern (not my idea, but I wish it was):

Method_Scenario_ExpectedBehaviour

For example if we're testing a new Security class with an Authenticate method, we might create the following test methods inside the SecurityTestFixture class:

Authenticate_UserNotFound_ThrowsAuthenticationException
Authenticate_IncorrectPassword_ThrowsAuthenticationException


Of course no application development should be driven by negative tests - at first you'll want to test that your authentication process works, then later down the line you'll want to make sure that it's robust. So there would likely be more test methods like:

Authenticate_Success_ReturnsPrincipal

Note how easy it is for someone to read the test method name and understand what it's testing. That's on purpose. Also note how the scenarios aren't long and complex, and only one logical unit of work is tested per test method. That's on purpose too. It means it's easy to pinpoint the root cause of a failure.
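
A skeleton of that fixture might look something like this - the Security class and its Authenticate method are assumed for illustration, and the attributes are MSTest's:
[TestClass]
public class SecurityTestFixture
{
    [TestMethod]
    [ExpectedException(typeof(AuthenticationException))]
    public void Authenticate_UserNotFound_ThrowsAuthenticationException()
    {
        var security = new Security();                     // assumed class under test
        security.Authenticate("no.such.user", "password"); // scenario: user not found
    }

    [TestMethod]
    public void Authenticate_Success_ReturnsPrincipal()
    {
        var security = new Security();
        var principal = security.Authenticate("known.user", "correct password");
        Assert.IsNotNull(principal);                       // expected behaviour
    }
}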

Fake: Stub vs. Mock

When you're writing unit tests, you'll occasionally find yourself in the situation where you need to create some kind of fake object and pass it into your object/method under test. What you do with that object next is what determines whether it's a mock object or just a stub. If your test makes an assertion on the fake (i.e. what methods were called on it, and with what arguments) then that fake is a mock object. If the test does not make any assertions on the fake, then it's just a stub.

In short:
  • all mocks are fakes
  • all stubs are fakes
  • no mock is a stub
  • no stub is a mock

Many so-called "mocking frameworks" actually allow you to dynamically create both mocks and stubs, so it would be better to call them isolation frameworks, as their primary function is to let you test your own objects in isolation by mocking/stubbing out external dependencies.
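
To make the distinction concrete, here is a hedged sketch using Rhino.Mocks - the ILogger and TransferService types are invented for the example, while GenerateMock, GenerateStub and AssertWasCalled are the real Rhino.Mocks calls. The same dynamically generated fake becomes a mock or a stub depending only on whether we assert against it:
// As a mock: the test's assertion is about the interaction itself.
ILogger mockLogger = MockRepository.GenerateMock<ILogger>();
var service = new TransferService(mockLogger);               // hypothetical object under test
service.Transfer("savings", "cheque", 100m);
mockLogger.AssertWasCalled(l => l.Log("Transfer complete")); // fails the test if the call never happened

// As a stub: no assertions are made against it; it only lets the test compile and run.
ILogger stubLogger = MockRepository.GenerateStub<ILogger>();
var quieterService = new TransferService(stubLogger);
decimal balance = quieterService.Transfer("savings", "cheque", 100m);
Assert.AreEqual(900m, balance);                              // assert on the return value (900m is made up for the example)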