Sunday, February 28, 2010

MapReduce in a Nutshell: Reduce

What we've established so far (see my earlier post) is that we can take a large set of data, partition it into smaller sets, and distribute those sets for processing to take advantage of parallelism. Each of those smaller sets ("library slices" in the example) generates its own set of intermediate results. In order to get our results, we must process all of the intermediate results, but we've already established that the data set is too large to perform the operation in one go. Here is where the Reduce step of MapReduce comes into play.

Remember that we called Map() once* for every book. We now need to call Reduce once* for every distinct word.

public void Reduce(string key, IEnumerable<byte> values)
{
    long total = 0;
    foreach (byte value in values)
        total += value;
    Emit(key, total);
}


Take a look at the intermediate results:
SLICE:0
WORDS:[("the", [1,1,1]), ("chicken", [1]), ...]
SLICE:1
WORDS:[("the", [1,1,1,1]), ("chicken", [1]), ...]


We're only going to call Reduce() once* per distinct word, but the key "the" exists in two of our slices. Both of those slices need to be logically joined before Reduce() can be called, with the combined intermediate values from both slices. Seeing as we're not using a framework, we need to manually set up some sort of routing that will ensure matching keys are joined. This is why Reduce() takes a key and a list of values as its input.

We used 5 Mapper objects, but there's no requirement for the same number of Reducers. Because we don't care about optimum performance yet, I've chosen to use 3. In real life, we would tune the number based on the hardware that's available, and the amount of data that needs to be pushed around to complete the process.

Here's a simple routing function that partitions words based on their initial letter, always routing matching words to the same reducer:
static int Route(string word)
{
    if (word[0] <= 'e')
        return 0; // route to reducer 0
    else if (word[0] <= 'r')
        return 1; // route to reducer 1
    else
        return 2; // route to reducer 2
}
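
To see how matching keys are joined before reduction, here's a minimal sketch (my own illustration, not framework code) that merges the per-slice intermediate dictionaries - the AnotherEmitter shape from the Map post - into one combined dictionary per reducer, using Route() above:

using System.Collections.Generic;

static Dictionary<string, List<byte>>[] Shuffle(IEnumerable<Dictionary<string, List<byte>>> slices)
{
    // One combined dictionary per reducer.
    Dictionary<string, List<byte>>[] joined = new Dictionary<string, List<byte>>[3];
    for (int i = 0; i < joined.Length; i++)
        joined[i] = new Dictionary<string, List<byte>>();

    foreach (Dictionary<string, List<byte>> slice in slices)
    {
        foreach (KeyValuePair<string, List<byte>> pair in slice)
        {
            Dictionary<string, List<byte>> reducer = joined[Route(pair.Key)];
            if (!reducer.ContainsKey(pair.Key))
                reducer.Add(pair.Key, new List<byte>());
            // Matching keys from different slices are merged here.
            reducer[pair.Key].AddRange(pair.Value);
        }
    }
    return joined;
}

Each reducer then calls Reduce(key, values) once for each distinct key it received.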


As it turns out, the naive routing function will split our data like this:

Reducer 0     Reducer 1     Reducer 2
a             feathers      sticking
and           first         the
beginning     god           till
butt          heaven        up
cared         in            which
chicken       make          who
created       not           would
crossed       or            years
does          reported      you
down          road          your
earth
egg

The program continues to execute, this time calling Emit() once for every distinct word and its final (total) count. The behaviour of Emit() is not concrete, but remember that it's (probably) being called on different physical machines, and the final results - once emitted - need to be routed back to where the job was started.

Saturday, February 27, 2010

MapReduce in a Nutshell: Map

You have a problem to solve, but the set of input data is far too large to solve it by simply looping over all the data.

What if you broke the input data into chunks and analysed only a chunk at a time, somehow aggregating the intermediate results into something meaningful at a later stage?

How do you perform this partial analysis in such a way that an aggregation is possible?

The standard Wikipedia example counts all the words in a collection of books. The only problem is: the collection of books is too large, or the intermediate word counts are too large, to process on one conventional computer.

So. First of all, we have the superset of books: the library. We need to break it up. Maybe if it was 1/5th the size, we could analyze it in a single process? We will call each 1/5th a slice of the library.

We still need to define our function that will do the first step of analysis on each book, remembering it will be called once* for each and every book in the library.
void Map(Book book)
{
    foreach (string word in book.Words)
        EmitIntermediate(word, 1);
}


Now we have the function, where do we put it? On a Mapper object, of course. We will create one* mapper for each slice of the library, then we will invoke the Map() function once per book in that library slice. Then for every word in the book, we will call the mysterious EmitIntermediate() function.

For simplicity, let's just assume that EmitIntermediate delegates its call to an Emitter object. We will probably have one Emitter per library slice; each Mapper will have its own Emitter. This is good for isolating Mappers: it means we're not sharing state between them, so there won't be any concurrency issues even though we're dealing with multiple Mappers working in parallel.

Here is a possible implementation of an Emitter for our example - it differs from the standard example in that it performs some arithmetic, continuously updating the slice's word counts:
class Emitter
{
    private Dictionary<string, long> intermediate; // constructed elsewhere
    public void Emit(string key, long value)
    {
        if (!intermediate.ContainsKey(key))
            intermediate.Add(key, 0);
        intermediate[key] += value;
    }
}


Here is another possible implementation of an Emitter - it's more like the standard example as it stores a list of values, to be summed up at a later stage. Because we know that "1" is the only value we ever emit, we can reduce the amount of storage space required by defining the values as a list of bytes.
class AnotherEmitter
{
    private Dictionary<string, List<byte>> intermediate; // constructed elsewhere
    public void Emit(string key, byte value)
    {
        if (!intermediate.ContainsKey(key))
            intermediate.Add(key, new List<byte>());
        intermediate[key].Add(value);
    }
}


There's no requirement for the Emitter's keys and values to be backed by an object in memory; this is just a simplification to explain the MapReduce concept.

Let's start this program up and run it. We take our library and divide it into slices. For each slice we create a Mapper and an Emitter. For each book in the slice we call the Map() function, which calls EmitIntermediate() for every word it finds. Once all the slices have been mapped, we inspect the Emitters. There's one Emitter per slice, and it contains all of the words for all the books in that slice.
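
As a rough sketch (my own illustration; the library, Mapper and Book names are assumptions), the run just described might look like this:

foreach (List<Book> slice in library.Slices)
{
    // One Emitter and one Mapper per slice: no shared state.
    AnotherEmitter emitter = new AnotherEmitter();
    Mapper mapper = new Mapper(emitter);
    foreach (Book book in slice)
        mapper.Map(book); // calls EmitIntermediate(word, 1) for every word
    // emitter now holds this slice's intermediate word counts
}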

Here's a concrete example of our entire library:
SLICE:0
BOOK:"Fight Club"
WORDS:"Sticking feathers up your butt does not make you a chicken"
BOOK:"The Bible"
WORDS:"In the beginning God created the heaven and the earth"
SLICE:1
BOOK:"Still Life With Woodpecker"
WORDS:"who cared which crossed the road first, the chicken or the egg"
BOOK:"FIASCO"
WORDS:"would not be reported till years down the road"
SLICE:N
BOOK:*
WORDS:*


When we examine our intermediate values for the Emitter on slice 0, we find "the" 3 times (all from the Bible) and "chicken" just once. Unsurprisingly, not a lot of crossover between Fight Club and the Bible. The Emitter on slice 1 has slightly different results: "the" comes in 4 times (3 times from Still Life, and 1 time from FIASCO), and "chicken" is there just once (this time thanks to Still Life).

What we need now is something that will aggregate our intermediate results so that we get 7x "the"s and 2x "chicken"s. In the next post, it's time for the Reduce step.

* Lies to children: if we expand our example to become more robust, we could end up retrying Map() on a book that previously failed, or even retrying an entire slice with a brand new Mapper if the compute node failed.

Friday, February 26, 2010

Edgar Frank

Why are relational databases so ubiquitous? We are taught about them in Computer Science lessons at university; we see job advertisements requiring SQL skills all the time; almost every project I've touched in my professional career has used a DBMS of some sort, to various degrees of efficacy.

I will admit that to some extent my exposure is biased and self-fulfilling: I took the course, someone heard I was interested in developing the skills, I started looking to reuse and improve my skills, now every project wants to leverage existing skills and a subset of those would hire people that knew no other language but SQL.

But why? Who is designing all these applications? Are they really considering the actual requirements, or are they following the approach prescribed by DBMS vendors: use a relational database. Heck, use our DBMS.

Most database systems worth their salt offer two primary features: a reliable transactional system and a way to protect data integrity; ACID and EI/RI/DI.

Even if we don't use explicit transactions, or even consider them in our designs, the ACID properties (atomic, consistent, isolated, durable) stop simultaneous updates from rendering our entities in an inconsistent state.

Entity, referential and data integrity are enforced by the DBMS, but we need to design the structure of the database to take advantage of them. We declare entity integrity by defining primary keys (no two rows are duplicates), referential integrity by defining foreign keys (no unresolved dependencies) and data integrity by defining column types such as int and datetime.

SQL Server also offers us the ability to store massive amounts of data, indexes to seek right into the bits of data that you need, and a powerful query optimiser that will join related bits of data in the most efficient way possible. It also allows the data to be restored to arbitrary points in time in the past. With failover clustering it even attempts resilience against hardware failures. But despite its prevalence, I know there necessarily exists something (in concept at least) more efficient for every given task.

My perceived problem in DBMS systems lies in their applicability to general problems - in short their ubiquity. People don't question whether or not they need a DBMS; instead they question which vendor to purchase them from. Understanding everything a database does isn't something everybody wants to do, but more and more people are available in the market to build your company's new idea into a tangible product. A recruitment agent has a much easier job selling a labelled skill than selling some undefined/unquantifiable genius. A man (or woman) can make a living by writing SQL code for another man (or woman). It is everywhere. But why re-invent the wheel?

Well, when you don't know how large your application will have to scale up to at design time, you can take the bet that it won't scale, you can save some time and money by hiring a SQL developer off the street, and you will probably deliver something that's functionally correct if you do it right. But down the line, usually when you become a victim of your own success, it will become the bottleneck, and seemingly impossible to extract from the code that's been written around it...

Monday, February 22, 2010

iSCSI Target

I have wanted to setup my own SQL Server failover cluster for quite a while, but never really understood how: a long time ago you had to pick special hardware off a list provided by Microsoft and you needed to run a special edition of Windows Server and you needed Fibre Channel. In short, a recent university graduate wasn't going to get anywhere.

It all seems to have changed though. iSCSI has emerged as an alternative to Fibre Channel, which means that you use your gigabit network to communicate with the SAN instead of some cheeky fibre optic cables (all those years ago it still wouldn't have helped to know that you could use copper cables too). Windows Server 2008 ships with the iSCSI initiator built-in, although you can download it for free from Microsoft if you're running Windows Server 2003 or earlier.

My first mission was simple. Get it working. I chose to run iSCSI Enterprise Target under Ubuntu, with a 16GB USB flash drive formatted in NTFS. This would be my SAN. Note: you need to choose the correct block device name so that you don't end up destroying hardware or losing data. For my flash disk I used /dev/sdc1 (maybe an Ubuntu guru could tell me why /dev/sdc without the number didn't work 100%) and yours will almost certainly be different. [edit:on a fresh install of karmic, i had to run sudo apt-get update first, otherwise it complained that the package couldn't be found]
sudo apt-get install iscsitarget
Once the installation was complete, I edited the service configuration file
sudo gedit /etc/ietd.conf
Adding in the following two lines:
Target iqn.2010-02.com.extellect:ta.tb.tc.td
Lun 0 Path=/dev/sdc1,Type=fileio

After saving the file and closing the editor, I restarted the Ubuntu machine, and powered up the Windows Server.

I clicked on Start > Administrative Tools > iSCSI Initiator, then clicked Yes to the questions about the firewall, and thought about setting the service as a dependency of the SQL Server service so that the drive would be available whenever SQL Server was running.

Looking at the iSCSI Target properties window:
Discovery tab: Add portal: aa.bb.cc.dd port 3260
Targets tab: automatically populated thanks to discovery: iqn.2010-02.com.extellect:ta.tb.tc.td and the status is "Inactive"
Click the Log on button, then OK.
The status is now "Connected"

At this point take a look at the Ubuntu machine to see the connection
sudo cat /proc/net/iet/session

Back to the Windows machine, this time opening Computer Management > Storage > Disk Management
I saw my flash drive but the disk said it was offline - right click and choose online from the context popup menu. Now I saw a familiar (and happy) sight:
Disk 2 Basic 14.95 GB Online - CRUZER (F:) 14.95 GB NTFS Healthy (Primary Partition)

Opened up SQL Server Management Studio and created a new database, choosing F: as the default data and log location. Created a table, inserted and then selected "Hello, World!"

Checked on the Ubuntu machine and there was the mdf and ldf!

Sunday, February 14, 2010

Sequential vs. Random I/O

I mentioned here that sequential I/O was good and random I/O was bad. Bad is the wrong word to use. Random I/O is comparatively less efficient than sequential I/O when accessing the same amount of data. The point to take home is that while a hard disk head is seeking to a different location, it isn't reading and returning data. In the sequential I/O scenario there is very little seeking and the drive can quickly return all the data in a single stream. In contrast, random I/O makes the disk spend a large share of its time seeking: if half the time goes on seeks, only half as much data can come back in the same amount of time as the sequential scenario.

Here are two performance considerations for designing and administering a SQL Server database and server:

Index seeks and scans use random and sequential I/O respectively. The performance gain from an index seek comes from the reduction in the amount of data that needs to be accessed. Finding 1 row in just 4 random I/Os is much quicker than scanning through 10000 I/Os worth of data sequentially. There is a point at which scanning is quicker, and that's when the tables are small.

Log files are accessed sequentially, so it makes sense to put each database's log file onto its own disk for the best performance. Sharing one disk between many different log files forfeits that benefit: the heads are forced to seek back and forth between the logs, turning sequential writes back into random I/O.

Tuesday, February 09, 2010

Caution: Assembly Under Test

What do we test? What can we test? What should we test? Three subtly different questions with radically different answers in the general case.

We start by answering the questions that weren't asked: why do we test and why should we test?

Testing increases developers' confidence to introduce new changes without breaking existing functionality. Anecdotally, it also reduces overall project completion time: bugs are found and dealt with sooner, so less time is spent in the testing phase. It would be a tall order to define all the reasons we should test in one paragraph of one tiny blog.

There are usually ancillary reasons why we do test. Someone else forces us to, or we choose to, or we need some experience for the CV. People's motivations are endlessly creative. Heck, I can imagine many a political reason for coding unit tests.

We can test pretty much anything, but it doesn't mean we should. As we become more familiar with various testing tools and APIs we increase our range of what can be tested, but we should focus on the application of our knowledge to what we should be testing.

What should we test? Best question yet. From the point of view of the person who's paying you to develop software: we should test things that make our project more attractive than a (perhaps imaginary) competitor's project that doesn't employ unit testing. Do we deliver it faster? Do we deliver it with fewer bugs? Is everyone on the team happy? Did we get to do a little bit of CV++ along the way?

What do we test? In an ideal world, we test the objects we write. Seeing as I come from an object-oriented background, I'd say we test the state and behaviour of objects. We test state by inspecting the values of properties. We test behaviour by inspecting the return values of methods, and sometimes by ensuring that we're interacting correctly with other objects.

  • State: make assertions against expected and actual property values
  • Behaviour: make assertions against expected and actual method return values
  • Interactive behaviour: make assertions against the objects with which our objects interact - what we're testing is that we're actually interacting with them in the expected manner

Mock objects are the flavour of the week when it comes to testing interactive behaviour - we set up expectations on a mock object, interact with it, and then validate our expectations. They are there so that we can verify that the object under test interacts correctly with its surroundings.

Stub objects don't fit anywhere in this list of tests because we don't assert against them... we only use stub objects in order to allow our tests to compile and execute. We choose to use stub objects because we can control them, and control their dependencies. They are an important part of any isolation framework. We don't want our logging method to keep writing disk files or sending e-mails while our tests run, but we don't want it to throw NullReferenceExceptions either. We definitely don't want to code our application to continuously check if the logger is null (this would be tedious and error prone), nor would we want to introduce special branching logic to skip logging under test conditions. We could code up our own NullLogger : ILogger that implements all of ILogger's events/properties/methods to do nothing, but this would be fragile and often a waste of time when functionality is added to or removed from the ILogger interface a month down the line. Good frameworks like Rhino.Mocks allow dynamic stubs to be built on the fly without you ever having to care about creating and maintaining your own NullLogger implementations. Saving us time, saving us money.
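
To illustrate (a minimal sketch; the ILogger interface here is a made-up example, while MockRepository.GenerateStub is the real Rhino.Mocks factory method), a dynamic stub costs one line:

using Rhino.Mocks;

public interface ILogger
{
    void Log(string message);
}

// One line, and no hand-rolled NullLogger to maintain:
ILogger logger = MockRepository.GenerateStub<ILogger>();
logger.Log("writes no files, sends no e-mails, throws no exceptions");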

Just because we can test state and behaviour (even interactive behaviour) doesn't mean we should. There needs to be justification. Does it make the application development experience quicker and better? Does it reduce overall costs? Does it allow you to sleep better at night (no, not because you are so exhausted from fighting fires and fixing production bugs!)? Are the tests valid? Will they help someone newly joining the team to understand what's going on? You can be as creative as you want in your reasoning, but without proper justification, you may just be wasting time writing and running tests.

Sunday, February 07, 2010

Unblock

Windows 7 is pretty security-conscious. I downloaded the latest Rhino.Mocks zip, extracted the .dll and .xml, referenced them into a new Visual Studio test project, and CTRL+R,CTRL+T'd to run the test. Fail! Test run deployment issue: 'Rhino.Mocks.dll' is not trusted.

Unbeknownst to me, there's a new option in Windows 7 (visible from the file's properties window) "this file came from another computer and might be blocked to help protect this computer." Smart. And there's an unblock button nearby. Click it.

It doesn't end there. I extracted the .xml file too, so that I'd have access to the Intellisense code documentation comments. This file needs to be unblocked too. Even if you've unblocked the .dll, the error message will persist: 'Rhino.Mocks.dll' is not trusted. Lies lies lies. Repeat step 1 to unblock the .xml.

Finally, Visual Studio will likely be working with cached assemblies, so the original .dll and .xml that were copied into the bin directory before the initial test run deployment failure are probably still there (they're still blocked). Delete them and let Visual Studio pick up the fresh unblocked files, and you'll be on your way again in no time.

Confidence & Unit Tests

Why do we waste our time with these stupid unit tests, some might ask. Well, if your system's requirements can change on a dime then I'd say (besides the initial measure-twice-cut-once rule) that the next most important thing you can do is have an automated suite of tests that can ensure your "tiny" "quick" code change doesn't have massive negative impact on the rest of the system. An automated build that runs unit tests covering most of the code base gives developers confidence to change the relevant pieces of source code and get quick turnaround times for new features and bug fixes. Yes! Quick turnaround times because the code was all already under test. I hate to say it, but "agile".

Measure Twice, Cut Once

Unit testing offers software developers a parallel insight into the world of physical construction: measure twice, cut once. Whether you're in charge of putting together a wooden tie-rack or a sky scraper, it's slightly more than just inconvenient to find that someone in the process delivered the wrong sized artifact because they were rushing and only looked over their work once. A second pair of eyes, a buddy check, even double checking on the developer's own part can reduce the occurrence of these kinds of errors. Unit tests are a great tool for software engineers in this regard. We code up the business logic once, and then we get to test it from an alternate perspective (please, for heaven's sake, don't mimic the exact flow of logic in your unit tests: even if there's a bug your tests will erroneously pass!)

Friday, February 05, 2010

ConditionalAttribute("DEBUG")

This is quite a neat concept that I only partially grasped until now. When you apply this attribute to a target (either a method or a class) it is compiled just like any other custom attribute. Look at the assembly in a tool like .NET Reflector and you'll see it there, exactly as you defined it. Bog standard.

The clever part is what happens next, whether you're referencing the target yourself (perhaps even from the same assembly in which it is defined), or someone else is referencing it by linking to your assembly as a 3rd party library.

The compiler recognizes the target has a conditional attribute, and it checks its list of defined symbols to see if there's a match. If there's a match, it proceeds as normal, but if there's no match, the compiler modifies the caller to leave the method call (or the custom attribute declaration) out of the newly compiled assembly entirely. Simple as that. In this way, the caller (and this is the important part) will never know that the target method or attribute class exists. This can have great performance implications: code that's compiled with the DEBUG symbol defined (or whatever symbol you choose for your project) will make all the calls, which could include very verbose tracing, but code without the symbol will have the method calls (and custom attributes) elided. In the case of method calls, not only the method call is skipped, but any logic for building up arguments is also skipped. For example:

ConditionalTargetClass conditionalTargetClass = new ConditionalTargetClass();
conditionalTargetClass.conditionalTargetMethod(Factorial(2000));


becomes

ConditionalTargetClass conditionalTargetClass = new ConditionalTargetClass();

There are a few limitations:
When applying the attribute to a method, that method's return type must be void.
When applying it to a class, the class must derive from System.Attribute.
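
For reference, here's a possible declaration for the target method used in the example above (my sketch; the parameter type is an assumption):

public class ConditionalTargetClass
{
    [System.Diagnostics.Conditional("DEBUG")]
    public void conditionalTargetMethod(long value) // must return void
    {
        System.Console.WriteLine(value);
    }
}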

Another example:
[Conditional("DEBUG")]
public class DebugAttribute : Attribute { /* EMPTY */ }

[Debug]
public class Program { /* EMPTY */ }


becomes

[Conditional("DEBUG")]
public class DebugAttribute : Attribute { /* EMPTY */ }

public class Program { /* EMPTY */ }

Wednesday, February 03, 2010

Method Scenario ExpectedBehaviour

What do you call your unit test methods? How do you logically structure them? I've been over this before, so it's time for a little refinement. One unit test assembly for every assembly under test. One test fixture for every class under test. Many test methods for every method under test. And to make everything easy to read, every test method should follow this pattern (not my idea, but I wish it was):

Method_Scenario_ExpectedBehaviour

For example if we're testing a new Security class with an Authenticate method, we might create the following test methods inside the SecurityTestFixture class:

Authenticate_UserNotFound_ThrowsAuthenticationException
Authenticate_IncorrectPassword_ThrowsAuthenticationException


Of course no application development should be driven by negative tests - at first you'll want to test that your authentication process works, then later down the line you'll want to make sure that it's robust. So there would likely be more test methods like:

Authenticate_Success_ReturnsPrincipal

Note how easy it is for someone to read the test method name and understand what it was doing. That's on purpose. Also note how the scenarios aren't long and complex, and only one logical unit of work is tested per test method. That's on purpose too. It means it's easy to pinpoint the root cause of failure.
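
As a sketch of how one of those names plays out in code (assuming NUnit, and hypothetical Security, EmptyUserStore and AuthenticationException types):

using NUnit.Framework;

[TestFixture]
public class SecurityTestFixture
{
    [Test]
    public void Authenticate_UserNotFound_ThrowsAuthenticationException()
    {
        // Scenario: an empty user store, so the user can't be found.
        Security security = new Security(new EmptyUserStore());
        Assert.Throws<AuthenticationException>(
            () => security.Authenticate("unknown", "password"));
    }
}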

Fake: Stub vs. Mock

When you're writing unit tests, you'll occasionally find yourself in the situation where you need to create some kind of fake object and pass it into your object/method under test. What you do with that object next is what determines whether it's a mock object or just a stub. If your test makes an assertion on the fake (i.e. what methods were called on it, and with what arguments) then that fake is a mock object. If the test does not make any assertions on the fake, then it's just a stub.

In short:
all mocks are fakes
all stubs are fakes
no mock is a stub
no stub is a mock

Many so-called "mocking frameworks" actually allow you to dynamically create both mocks and stubs, so it would be better to call them isolation frameworks: their primary function is to let you test your own objects in isolation by mocking/stubbing out external dependencies.
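
In Rhino.Mocks terms (a sketch; OrderProcessor and ILogger are hypothetical, the framework calls are real), the distinction looks like this:

using Rhino.Mocks;

// Stub: lets the test compile and run; no assertions are made against it.
ILogger stub = MockRepository.GenerateStub<ILogger>();

// Mock: the test's assertion is about how the fake was used.
ILogger mock = MockRepository.GenerateMock<ILogger>();
new OrderProcessor(mock).Process();
mock.AssertWasCalled(logger => logger.Log("order processed"));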

Tuesday, January 26, 2010

Visual Studio Shortcuts

We all make typogarphical mistakes. Oops. Put the caret between the 'a' and 'r' and hit Ctrl+T.

This shortcut allows you to move a single letter one position to the right every time you use it. More precisely, it swaps the characters on the left and right of the caret, then advances the caret one character to the right. The advancement of the caret means you can keep hitting Ctrl+T to effectively move a single letter right through a word.

Sometimes I'll type in a class name without declaring the corresponding import statement at the top of my C# source file. Visual Studio adds a visual clue (small red box under the final character of the type name) and if you move your hands off the keyboard and wiggle the mouse around a bit, Visual Studio will display a context menu to choose the namespace. If you want to keep your hands on the keyboard, hit Ctrl+Shift+F10 and then hit the up and down arrows to select the appropriate namespace from the list.

Some other shortcuts can be found here: http://www.microsoft.com/downloads/details.aspx?FamilyID=e5f902a8-5bb5-4cc6-907e-472809749973&displaylang=en

Tuesday, January 12, 2010

Split Range - Partitions to Filegroups

So I asked the questions: How do I know which filegroup any given partition is mapped onto? How do I know which is the new partition when a range is split on a new boundary? And which partition keeps the data and which is removed when two partitions are merged? The first question is answered by looking at the sys.destination_data_spaces table. The second is answered in this post, and the third will be the subject of a later post.

We can set up a partition for this exercise with the following SQL:
CREATE PARTITION FUNCTION pfLeft(INT) AS RANGE LEFT FOR VALUES()
CREATE PARTITION SCHEME psLeft AS PARTITION pfLeft TO (FG1)


Our diagnostic SQL, which we run repeatedly, is:
SELECT destination_data_spaces.*
FROM sys.destination_data_spaces
INNER JOIN sys.partition_schemes ON sys.destination_data_spaces.partition_scheme_id = sys.partition_schemes.data_space_id
WHERE sys.partition_schemes.name = 'psLeft'
ORDER BY destination_id ASC


For a partition function with no boundaries (i.e. all rows fit in one partition) we see that the first (one and only) destination_id maps to the data_space_id of FG1.
partition_scheme_id destination_id data_space_id
------------------- -------------- -------------
65601               1              2

If we set the next used filegroup to FG2 and then split the partition on the boundary of 0 we see:
partition_scheme_id destination_id data_space_id
------------------- -------------- -------------
65601               1              3
65601               2              2

The interesting bit is that FG2 is now used for the first destination_id. When the existing partition was divided, the LEFT sub-partition (including value 0) was assigned to the new filegroup. Let's try it again by setting the next used filegroup to FG3 and splitting on 100:
partition_scheme_id destination_id data_space_id
------------------- -------------- -------------
65601               1              3
65601               2              4
65601               3              2

Again, it's the LEFT sub-partition (including value 100) of the partition that was being divided that is assigned to the new filegroup. If we try this one more time with a split on -100, I'd hope to see FG4 take up the LEFT sub-partition of our left-most partition.
partition_scheme_id destination_id data_space_id
------------------- -------------- -------------
65601               1              5
65601               2              3
65601               3              4
65601               4              2


Let's do everything again from the start, but this time we'll use:
CREATE PARTITION FUNCTION pfRight(INT) AS RANGE RIGHT FOR VALUES()
CREATE PARTITION SCHEME psRight AS PARTITION pfRight TO (FG1)


We'll also need to change the diagnostic code to show the psRight partition scheme:
SELECT destination_data_spaces.*
FROM sys.destination_data_spaces
INNER JOIN sys.partition_schemes ON sys.destination_data_spaces.partition_scheme_id = sys.partition_schemes.data_space_id
WHERE sys.partition_schemes.name = 'psRight'
ORDER BY destination_id ASC


Then we run all the splits in order: 0 -> FG2, 100 -> FG3, -100 -> FG4
destination_id data_space_id
-------------- -------------
1              2
2              5
3              3
4              4

The latest results are consistent with what we've seen previously, but now it's the RIGHT segment (including the boundary value) of the newly divided partition that gets moved onto the new filegroup.

In summary, if your partition function is defined as RANGE LEFT, then a SPLIT RANGE (x) alteration will split an existing partition and the LEFT segment will become the new partition on the NEXT USED filegroup; the other segment stays on its original filegroup. Conversely, if your partition function is defined as RANGE RIGHT, then the RIGHT segment becomes the new partition on the NEXT USED filegroup.

Wednesday, January 06, 2010

Splitting a PDF

I discovered a novel way of splitting a large PDF file into more manageable chunks: print out the range you want to the "print to file" printer in Ubuntu, making sure to choose the PDF output!

Tuesday, January 05, 2010

Horizontal Partitioning 1

When you want to speed up the physical reads on a SQL Server, you have two options: faster disks, or more disks. But only once you've determined that physical reads are a (potential) bottleneck do you get to make the choice.

If using faster disks is not enough to get your system back to an acceptable level of performance, then there's still the option of adding more disks. Merely adding disks won't - in itself - speed anything up. To realize performance gains you need to restructure your data. SQL Server allows you to partition tables, and also indexes. But as usual, the devil resides in the details.

Partition functions can only take one parameter. This means that the partition in which each row resides is determined by the value of just one column in that row.

If your original table had a clustered index, you'll probably want to keep it. However, this has a big consequence: you will need to make the partition function congruent with the clustered index. SQL Server will complain if you leave the partition column out of the clustered index: "Partition columns for a unique index must be a subset of the index key". It gets worse if you want a composite clustered index - you should be aware that in some cases SQL Server appears to store data internally sorted first by the partition column, and then by any other columns in the composite clustered index. If your original table was sorted by Col_1,Col_2 and you choose Col_2 as your partition column, then your table may be sorted internally by Col_2,Col_1. Actually, it's not this straightforward: I need some time to figure this out; it will be the subject of a later post.

Tuesday, December 08, 2009

Think Big

I was playing around with my new installation of Windows 7 today, frustrated that I couldn't install SQL Server 2008 Management Studio Express (not yet supported), but determined to have fun. I was also playing around with 64-bit Windows Server 2008 running SQL Server 2008 on a machine with 8GB of RAM. I thought it would be fun to create a table with 2^32 rows, and I was right. If you run SELECT COUNT(*) FROM #Big on a table of this size, you're shown an error message "Arithmetic overflow error converting expression to data type int." No fear though, just use the COUNT_BIG() function instead of COUNT(). Bingo: 4294967296!

Tuesday, November 17, 2009

Realistic Scalability

Everybody seems to focus on adding processors and memory (or complete nodes) when they talk about scalability, but not a lot of mention is made about adding new people to manage the systems. True scalability should, in my opinion, include factors like the cost of human labour. You've designed a new system: great! It runs with only 20 errors a day on 1000 engines: super! It's linearly scalable, so your boss buys another 1000 engines: unfortunately, this means an extra 20 errors per day; this could mean another person needs to be added to the support team!

Tuesday, October 06, 2009

Escribir en español

I'm using Ubuntu 8.10 and learning to speak Spanish, so I wanted to know how to type the accented characters and inverted exclamation and question marks. As it turns out: it's pretty easy, even with a generic UK keyboard. Follow these easy steps to do it yourself:

System > Preferences > Keyboard

On the Layouts tab, click the "Other Options..." button, then expand the "Compose key position" element. Select "Right Alt is Compose." Close the Keyboard Layout Options dialog and focus the cursor on the "Type to test settings" box.

For á, é, í, ó and ú:
Press and release Right-Alt, then press and release @', then press and release the vowel key.

For the upper case Á, É, Í, Ó and Ú:
Press and release Right-Alt, then press and release @', then hold shift while pressing and releasing the vowel key.

For ñ:
Press and release Right-Alt, then press and release ~#, then press and release n.

For the upper case Ñ:
Press and release Right-Alt, then press and release ~#, then hold shift while pressing and releasing N.

For ¿:
Press and release Right-Alt, then hold shift while pressing and releasing the ?/ key twice.

For ¡:
Press and release Right-Alt, then hold shift while pressing and releasing the !1 key twice.

Tuesday, June 30, 2009

IIS and Process Orphaning

IIS has some very neat features. Consider process orphaning: your application has become unresponsive... IIS can take it out of the application pool so that it receives no further requests, and can automatically attach a debugger or write a full memory dump for you to examine later. And while all this is happening, it will have brought up a new worker to seamlessly service all other inbound requests. All in the name of reliability.

Thursday, May 28, 2009

Isolation / Immutability / Synchronization

If there's any one recipe for software runtime scalability, it's this: (IIS)^2. It's no secret I'm a fan of Internet Information Services, but in today's post it plays the role of host to our application, and my focus instead is on the underlying application's design.

Isolation:
Prevent state from being shared, allowing us to read and write to our hearts' content. Shared Nothing (SN) architectures are the ultimate realization of isolation. The definition of the isolation boundaries within an application might become contentious, and someone will need to be responsible for owning each isolated domain.

Immutability:
Prevent state from being modified - we can have as many readers as we want. In reality though, immutability by code contract is not as straightforward as simply not providing setter methods. In distributed environments especially, the .NET runtime must (de)serialize objects, which requires default constructors and public get/set accessors. We could choose to pass restrictive interface references instead, but this would force the (un)boxing of structs. Possibly, the application of custom attributes to state modifying methods/properties could be inspected statically by a tool like FxCop to prevent accidental object mutation. Objects do require mutation at times to be useful.
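
As a tiny sketch (mine) of immutability by code contract - and exactly the shape that upsets serializers like XmlSerializer, which demand a default constructor and public setters:

[Serializable]
public sealed class Price
{
    private readonly decimal amount;

    public Price(decimal amount)
    {
        this.amount = amount;
    }

    // Read-only by contract: no setter to misuse.
    public decimal Amount
    {
        get { return this.amount; }
    }
}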

Synchronization:
Mark critical sections where locks must be acquired before reading or writing. This is the typical style of programming that I've seen in Java and C# and doesn't easily lend itself to distributed programming. Also, while waiting, valuable resources are often wasted. It tends not to scale as well.

The best solution to a large distributed computing problem will undoubtedly be a combination of the three approaches above. Isolated tasks are likely candidates for physical distribution as they will scale horizontally. Immutability is hard to enforce completely, and synchronization implies waiting for other tasks. For these reasons, I give top priority to isolation, and let the other two duke it out as I see fit on the day.

Saturday, May 23, 2009

Without Intellisense, I'm Nothing

Imagine you're given a task: write the code for a program in a hitherto unknown DSL, embedding that code within a hitherto unknown markup language. What would be your tool of choice? Without any easily understood reference documentation, it's not going to be easy. Without the visual clues that you're missing a quote (or your string contains an extra quote) or that your function call doesn't exist or has the wrong arguments, you're going to have to run the program just to debug it. If that running process involved multiple clicks and typing followed by a short wait, you're going to get frustrated and probably won't deliver the program on time. Intellisense is great. You can write Javascript inside an HTML page, and Visual Studio will squiggle under everything you've done wrong!

Sunday, May 10, 2009

Hello Ruby

def reverse(value)
  if value.length > 1 then
    value[value.length-1] + reverse(value[0,value.length-1])
  else
    value[0]
  end
end
The string formatting functions of Ruby caught my interest, but I haven't checked if any equivalents exist in Python.
irb> "#{reverse 'Hello, World!'}"
=> "!dlroW ,olleH"

Saturday, April 18, 2009

Synchronisation... is... slooooow...

a.k.a. an argument for Shared Nothing. Continuing my !random post, I decided to see how long it would take to build up a cumulative normal distribution using the values generated by System.Random.Next(). I used and timed the following configurations:
1) Four threads on four cores sharing a Random (with thread synchronisation): 180 seconds.
2) One thread on one core with its own Random: 63 seconds.
3) Four threads on four cores, each with their own Random (sharing nothing): 17 seconds.
The lesson here is that synchronisation is slow. So slow in this case, that it was actually faster to single-thread the application. However, when each thread was given its own object that it didn't have to share with the others, the speed increase was dramatic. Not only was it massively faster, but I'd put my money on it scaling pretty well with an 8, 16, 32 or even 64 way server.

Out of curiosity, I also tried this configuration:
4) Four threads on four cores, each with their own Random, locking it unnecessarily: 39 seconds.
I found it interesting that even uncontended locks could halve the speed of my algorithm.

The moral of the story is to think about your locking strategy (hint: avoid putting yourself in the position where you need one) when looking to parallelize tasks.

!random

System.Random's Next() method isn't guaranteed to be thread-safe, so if you start calling it from multiple threads at the same time, you're likely to end up with very little entropy indeed! I cooked up a static class, with a static System.Random field, and set 4 threads in motion calling Next() continuously. Each thread got its own core to run on, and very soon the only "random" number returned from Next() was 0 - the object had been well and truly corrupted. At this point I needed to choose: a single System.Random protected by lock() statements, or multiple System.Random objects. If I chose the single route (why?) all the synchronization would slow me down, and I'd end up not using each core to its fullest potential. If I chose the multiple route (only as many System.Random objects as there were threads), I would need to seed each one with a different value, otherwise they could - if created at the same time - return the same series across more than one System.Random object.
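
Here's a minimal sketch (my own) of the multiple-Random route, giving each thread its own instance seeded with a different value:

using System;
using System.Threading;

static class ThreadLocalRandom
{
    private static int seed = Environment.TickCount;

    [ThreadStatic]
    private static Random local; // one instance per thread, created lazily

    public static Random Instance
    {
        get
        {
            if (local == null)
                local = new Random(Interlocked.Increment(ref seed)); // distinct seed per thread
            return local;
        }
    }
}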

Interestingly, if I called the overloaded version of Next(int min, int max), soon the only return values would be min.

Thursday, March 26, 2009

Buffer vs. Array

To clear up any misconceptions, System.Array.Copy is not the same as System.Buffer.BlockCopy unless you're operating on arrays of System.Byte. The Buffer class copies bytes from one array to another. If you have an array of System.Int32 (4 bytes each) and you copy 3 bytes (not items) from your source to your destination array, you will just get the first 3 bytes of your 4 byte System.Int32. Depending on the endian-ness of your system, this could give you different results. Also, the System.Buffer class only works on primitives. Not C# primitives (which include strings) and not even derivatives of System.ValueType (e.g. enums and your own structs). Clearly it can't work on reference types safely (imagine just copying 3 of the 4 bytes from one object reference to another) but I would have expected it to work with enumerations (essentially ints).
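
A short example of the difference (my sketch; the exact result assumes a little-endian system):

int[] source = { 0x01020304 };
int[] viaArray = new int[1];
int[] viaBuffer = new int[1];

// System.Array.Copy counts elements: the whole Int32 arrives.
System.Array.Copy(source, viaArray, 1);              // viaArray[0] == 0x01020304

// System.Buffer.BlockCopy counts bytes: only 3 of the 4 arrive, so on a
// little-endian system the most significant byte is left at zero.
System.Buffer.BlockCopy(source, 0, viaBuffer, 0, 3); // viaBuffer[0] == 0x00020304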

Monday, March 23, 2009

Hello Python

>>> def reverse(value):
...     if len(value) > 1:
...         return value[len(value) - 1] + reverse(value[:len(value) - 1])
...     else:
...         return value[0]


>>> print reverse('!dlroW ,olleH')
Hello, World!

System.OutOfMemoryException Part 4

And then there were 2. I'm talking about code paths that lead to an OutOfMemoryException being thrown by the CLR.

#1 is the standard "You've run out of contiguous virtual memory, sonny boy!", which is pretty easy to do in a 32-bit process. With the advent of 64-bit operating systems, you get a little more headroom. Actually, a lot more headroom: precisely 2^32 times as much as you had to start with! But, due to limitations built into the CLR, you're no further from an OutOfMemoryException...

#2 is when you try and allocate more than 2GiB in a single object. In the following example, it's the +1 that pushes you over the edge:
new System.Byte[(long)System.Int32.MaxValue + 1];
If you're using a different ValueType your mileage may vary:
new System.Int32[(System.Int32.MaxValue / sizeof(System.Int32)) + 1];
Or if you're on a 64-bit system:
new System.IntPtr[(System.Int32.MaxValue / sizeof(System.Int64)) + 1];

Many thanks to this blog for helping me in the right direction here.

Hopscotch in the 64-bit Minefield

So it's no secret I've been playing with virtualization, Windows, Linux, IA32 and amd64. Virtualization looks the part, but so does a 24-bit color depth 1280 x 1024 bitmap of a fake desktop. You can't do much with either.

Microsoft has given us WOW64, allowing us to run old x86 (IA32) software on new 64-bit operating systems. It's so seamless you forget you've got it: until you try and install ISA Server 2006.

Getting Windows Hyper-V Server up and running on a headless box... well, let's just say I'm not that patient. I eventually settled for Hyper-V role on a full-fat Windows Server 2008 Enterprise Edition install, with crash-carted DKM. It doesn't appear to run on anything except the 64-bit version, either. Even then, loads of caveats await:
gigabit network card? you'll have to use a ridiculously slow emulated "switch" instead.
multiple cpus? not!
scsi drives? think again.
usb? umm... for some reason nobody in the world has ever thought of plugging a usb device into a server. i was the first. i'm so proud!
Granted, all these restrictions are imposed on the non-MS (see ubuntu) and older guest operating systems like Windows XP, but isn't half the pitch behind virtualization about getting rid of old physical kit and consolidating servers?

Flex Builder 3? Oh yes. But not for Linux. No wait... there's a free alpha version but it's not compatible with 64-bit.

VMWare ESXi? Check your hardware list first.

Saturday, March 21, 2009

Debug Enable!

All wireless functions on my O2 wireless box II appear to have died overnight. This is a problem for me because I use wireless all the time... to stream music to my Apple TV, to surf the Internet from my laptop, and play games on my iPod Touch. The wired ethernet is still happily routing packets between my LAN and the Internet. So I climbed up in my cupboard and pulled down the old Netgear DG834G - what a beautiful beast. It runs BusyBox, an embedded Linux operating system, and you can connect to it using any telnet client. Just navigate to http://<netgear-router>/setup.cgi?todo=debug from your browser first, and you're A for Away - you can then telnet into it. Reboot the router when you're finished to disable the telnet server until next time.

Things that frustrated me:
There was no way to change the default gateway address handed out by either router. I suspect I could have done so with the Netgear box in time, but there is no text editor in the distribution.

Getting my network back up and running would be difficult without the usual trance music coming through the Apple TV keeping me focused. Still, I needed to:

  1. put both routers onto the same subnet
  2. keep DHCP running on the O2 box (so that the default gateway would remain the O2 box address)
  3. turn off DHCP on the Netgear box (otherwise it would give out its own IP address as default gateway)
  4. turn off the wireless interface on the O2 box for good


In a nutshell:
o2wirelessbox is now 192.168.1.254/24 with DHCP handing out addresses in the range 64-253
netgear is now 192.168.1.1/24 with disabled DHCP (I would have liked to use the range 2-63)

Wednesday, March 11, 2009

Gigabit Ethernet

... is actually quite fast. So fast - in fact - that the bottleneck on both PCs under my desk is currently the PCI bus into which the network cards are plugged. I had no idea that the bus ran at 33MHz and was only able to transfer 32 bits per cycle (math says: 33M * 32b = 1056Mb). Todo: is this duplex?

There's a very useful FAQ regarding Gigabit Ethernet available here.

In a test with 4 network cards (2 in each machine) I saw the following:

w) 1A -> 2C (93Mb/s)
x) 1A -> 2D (902Mb/s)
y) 1B -> 2C (93Mb/s)
z) 1B -> 2D (294Mb/s)

1A: Gigabit embedded on motherboard (1000 link at switch)
1B: Gigabit on PCI bus (1000 link at switch)
2C: Fast embedded on motherboard ( 100 link at switch)
2D: Gigabit on PCI bus (1000 link at switch)

Machine 1 was sending large UDP datagrams. Machine 2 was not listening; it was just up so that ARP could get the MAC addresses of its adapters (without which, we could not send a datagram).

Interestingly:
tests w + y appeared to be throttled by the router as it managed a mixed 100/1000 route
test x was great and showed that an onboard gigabit controller can actually do its job
test z showed the PCI bus being the bottleneck, allowing 3x faster than Fast, but 3x slower than Gigabit.

Saturday, March 07, 2009

Extellect Utilities

I've finally put a set of C# productivity classes on Google Code under the Apache 2.0 License. So, check 'em out, see if they make your work any easier, and let me know what you think.

Remoting, bindTo and IPAddress.Any

Just solved a fun problem. While running a .NET remoting server on my ubuntu box with multiple NICs I saw some strange behavior where the server would return an alternate IP address, and the client would attempt to re-connect (using a new Socket) to the new IP address. Problem being: the server was responding with its localhost address (127.0.1.1), which the client was resolving to its own loopback adapter. See the problem yet?

It turns out that in the absence of a specific binding, the server binds to IPAddress.Any. When a client attempts to connect, it's redirected to the server's bound address. Unless client and server are hosted on the same physical machine, there's really no point in ever using the loopback adaptor... which makes it a strange choice for default.

The solution:
Before you create and register your TcpServerChannel, you need to set some options.
IDictionary properties = new Hashtable();
properties["bindTo"] = "dotted.quad.ip.address";
properties["port"] = port; // the port your server listens on, defined elsewhere
IChannel serverChannel = new TcpServerChannel(properties, new BinaryServerFormatterSinkProvider());
RemotingConfiguration.RegisterWellKnownServiceType(typeof(Explorer), "Explorer.rem", WellKnownObjectMode.SingleCall);


Voila! 'appiness should ensue...

PS. If for some reason the dotted quad doesn't appeal to your particular situation (e.g. load balancing), you can set two other properties instead:
properties["machineName"] = "load.balanced.server.name";
properties["useIpAddress"] = false;


PPS. I think the client will always make two socket connections, A + B. A is used at the start to do some initialization and get the address for connection B. B is used for the meat and bones of the operation, and finally A is used just before they're both closed.

Friday, March 06, 2009

Goodput - Season Finale

I thought I'd take a look at a couple of different (and by no means an exhaustive list of) options for transferring a reasonably large file across a network. Over the past couple of days I tried sending a 700MB DivX using the .NET Remoting API (over both TCP and HTTP), the .NET Sockets API (over TCP), and finally using a mounted network share, reading the file as if it was local.

The table that follows shows the results of these tests:

I'd advocate taking a pinch of salt when interpreting the numbers.

In general, remoting doesn't support the state-machine classes the C# compiler generates when it compiles an iterator block (e.g. the yield return keyword): quickly remedied by exposing the IEnumerator<T> as a remote MarshalByRefObject itself, wrapping the call to the iterator block. This gave us a nice looking (easy to read) interface, but will have increased the chattiness of the application, as every call to MoveNext() and Current would have required a network call. Further to this, the default SOAP serialization used with HTTP remoting doesn't support generic classes, so I had to write a non-generic version of my Streamable<T> class.
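
The workaround, sketched from memory (simplified; not the production code), makes the enumerator itself a remote object so that MoveNext() and Current become network calls instead of the iterator class being serialized:

using System;
using System.Collections;
using System.Collections.Generic;

public class RemoteEnumerator<T> : MarshalByRefObject, IEnumerator<T>
{
    private readonly IEnumerator<T> inner; // wraps the compiler-generated iterator

    public RemoteEnumerator(IEnumerator<T> inner) { this.inner = inner; }

    public T Current { get { return inner.Current; } }
    object IEnumerator.Current { get { return inner.Current; } }
    public bool MoveNext() { return inner.MoveNext(); }
    public void Reset() { inner.Reset(); }
    public void Dispose() { inner.Dispose(); }
}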

The performance of the HTTP/SOAP remoting was abysmal and there was very little gain by switching to a faster network. Even with what I suspect to be a massively chatty protocol (mine, not theirs), the bottleneck was probably somewhere else.

TCP remoting was next up. Under the covers it will have done all the marshalling/unmarshalling on a single TCP socket, but the chatty protocol (e.g. Current, MoveNext(), Current, MoveNext() etc.) probably let it down. TCP/Binary remoting's performance jumped 2.5x when given a 10x faster network, indicating some other bottleneck as it still used just 16% of the advertised available bandwidth.

CIFS was pretty quick, but not as quick as the System.Net.Sockets approach. Both used around 30% of the bandwidth on the Gigabit tests, indicating that some kind of parallelism might increase the utilization of the network link. An inverse-multiplexer could distribute the chunks evenly (round-robin) over 3 sockets sharing the same ethernet link, and a de-inverse-multiplexer (try saying that 10 times faster, after a couple of beers) could put them back together.

Back on track...

Seeing as TCP/Binary remoting was the problem area that drove me to research this problem, I thought I'd spend a little more time trying to optimise it - without changing the algorithm/protocol/interface - by parameterizing the block size. The bigger the block size, the fewer times the network calls MoveNext() and get_Current have to be made, but the trade-off is that we have to deal with successively larger blocks of data.

What the numbers say: transmission rate is a bit of an upside down smile; at very low block sizes the algorithm is too chatty, at 4M it's at its peak, and beyond that something else becomes the bottleneck. At the 4M peak, the remote iteration invocations would only have been called 175 times, and the data transfer rate was 263Mb/s (roughly 89% of the observed CIFS' 296Mb/s).

Thursday, March 05, 2009

Full Duplex

Simple English: if two computers are connected by a full duplex ethernet link, then they should be able to carry out two conversations with each other simultaneously. For example, imagine two computers named A and B with a 100Mb/s full-duplex connection linking them both. A starts "talking" at 100Mb/s and B "listens". B also starts "talking" at 100Mb/s and A "listens". The total data moving up and down the link is 200Mb/s. That's full duplex, baby!

Only, in real life you don't get the full 100Mb/s in either direction. On my PC, I managed to get 91Mb/s in one direction and 61Mb/s in the other direction. If I stopped the 91Mb/s conversation (call it X), the 61Mb/s conversation (call it Y) would quickly use up the extra bandwidth, becoming a 91Mb/s conversation itself. As soon as I restarted X, it reclaimed its original 91Mb/s, and Y returned to its original 61Mb/s. Freaky.

Goodput - Part 2

So then I thought to myself, "Hey, you have two NICs in each machine. Why don't you try and get double the throughput?" Even though all my NICs are gigabit ethernet, my modem/router/switch is only capable of 10/100 (a gigabit switch is in the post, as I type). Yesterday's tests indicated that I was getting roughly 89Mb/s, so I'd be aiming for 178Mb/s with my current hardware setup. And a glorious (hypothetical) 1.78Gb/s when the parcel arrives from amazon.co.uk.

What would have to change? For starters, the server was binding one socket to System.Net.IPAddress.Any; we'd have to create two sockets and bind each one to its own IP address. Easy enough. The client would also have to connect to one of the two new server IP addresses.

Wait a minute... there isn't any System.Net.Sockets option on the client side to specify which ethernet adapter to use. You only specify the remote IP address. Oh no! This means we could end up sending/receiving all the data through just one of the client's NICs. Luckily, you can modify the routing table so that all traffic to a particular subnet can be routed via a specific interface. I'm using ubuntu as the client, and my routing table looks like this, which indicates that eth0 would get all the traffic to my LAN:

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.0     *               255.255.255.0   U     1      0        0 eth0
192.168.1.0     *               255.255.255.0   U     1      0        0 eth1


I want to add a static route, with higher precedence than the LAN subnet, using eth1 for all communication with the remote IP 192.168.1.73, leaving eth0 with the traffic for the rest of the 192.168.1.0/24 subnet. I type this command at the console:

sudo route add -net 192.168.1.73 netmask 255.255.255.255 dev eth1

Disaster averted. The routing table now looks like this, and I'm happy to say my diagnostics report that I'm getting around 170Mb/s with my new trick in place. It's not the 178Mb/s I was hoping for (I've lost about 4.5% on each connection), but it's still 190% of the original throughput.

Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.1.73    *               255.255.255.255 UH    0      0        0 eth1
192.168.1.0     *               255.255.255.0   U     1      0        0 eth0
192.168.1.0     *               255.255.255.0   U     1      0        0 eth1


Throughput comparison reading a 700MB DivX file:
73Mb/s - Mounted network share using CIFS (although it also appeared to be caching the file on disk, incurring disk write penalties)
89Mb/s - 1x NIC using System.Net.Sockets
170Mb/s - 2x NIC using System.Net.Sockets

Goodput - Part 1

I've been interested recently in maximizing the application-level throughput of data across networks. The word I didn't know I was looking for - but found anyway - was goodput.

At first I tried streaming across a large DivX file. When I realised that I might be measuring disk seek time (doubtful, at the pitiful data rates I was achieving) I transitioned my tests to stream meaningless large byte arrays directly from memory (being careful not to invoke the wrath of the garbage collector, or any of the slow operating system-level memory functions).

What I noticed was that the application-level control of the stream was a big factor in slowing down the effective data transfer rate. In short, if the design of the logical/physical stream protocol was "bad", so was the goodput.

Throughout the test, data was transmitted in chunks of varying sizes (from 4k to 128k). Firstly, just to see what it was like, I tried establishing a new System.Net.Sockets connection for each chunk. Not good. It's expensive to create new connections; this is why database connection pooling has really gained ground. Next I tried a single connection where the client explicitly requested the next chunk. Also really bad. It was chatty like an office secretary, and got less done. So I tried a design I thought would be pretty good, where one request resulted in many chunks being returned. For some reason, I thought that prepending each chunk with its size would be a good idea. It was 360% better than the previous incarnations, but the extra size information was just repeating data that wasn't at all useful to the protocol I had devised: it was wasting bits and adding extra CPU load, and giving nothing in return; it had to go. Stripping these items from the stream resulted in an extra 3.6% of throughput.
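
A sketch of the shape of that final design (the names are mine, not the original code): one request comes in, then raw chunks are streamed back-to-back with no per-chunk size prefix, the total length having been agreed up front.

using System;
using System.Net.Sockets;

static void StreamChunks(Socket socket, long totalBytes)
{
byte[] buffer = new byte[128 * 1024]; // one of the chunk sizes under test
long remaining = totalBytes; // agreed when the request was made, not per chunk
while (remaining > 0)
{
int count = (int)Math.Min((long)buffer.Length, remaining);
socket.Send(buffer, 0, count, SocketFlags.None);
remaining -= count;
}
}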

Interestingly, I noticed that the choice of buffer size could drastically affect the goodput, especially when it was ((1024*128)+4) bytes. I expect this was something to do with alignment. It would be cool to do some more tests, looking for optimal buffer sizes.

Saturday, February 28, 2009

LiveCycle Data Services / amd64 / ubuntu

I haven't tried out the alternative - BlazeDS - yet, but I've had the time of my life trying to get ubuntu to use all my RAM and both my monitors, and even more fun trying to get LCDS installed. Nothing has been exceptionally difficult, but I've noticed a lot of things just are not compatible. Knowing not only where you're going, but also the route to get there, can make a major difference in total time taken. That's why I document my "mistakes" - for next time. Next time...

  1. Well, maybe next time I'll get Hyper-V Server 2008 to work. Even VMWare ESXi wasn't happy with my SCSI drives. Xen looked like more than I was prepared to bite off, nevermind chew.

  2. Accessing any more than 3GB of physical RAM on an ubuntu install requires the amd64 package. Similar limits are imposed on other 32 bit operating systems. In this case I chose to upgrade to the 64 bit version than enable PAE.

  3. To do anything useful in an unfamiliar operating system you'll want to use the desktop version.

  4. The graphics driver and the operating system seemed to fight like small children if the driver's not open source ... I'm just glad that nVidia have made the effort for a 64 bit driver so I can see my desktop in TwinView, rather than being forced to run the server version (see 3).

  5. The Flash plugin that you can download from Adobe's website is - at time of writing - only compatible with i386 and not amd64. The problem's easily solved with sudo apt-get install flashplugin-nonfree, which was already set up inside Aptitude.

  6. When you download a .bin file, you're expected to know you're going to chmod +x and execute it. However, in the case of lcds261-lin.bin you can't just do this inside a terminal from the CTRL-ALT-F7 session; you must run it from one of the CTRL-ALT-F{2..6} sessions. Otherwise you'll get a Java error: java.lang.UnsatisfiedLinkError: Can't load library: /usr/lib/jvm/java-6-openjdk/jre/lib/amd64/xawt/libmawt.so. Amusingly, one of the lines in the stack trace read: at ZeroGay.a(DashoA10*..)

  7. Even if you asked the installer for a Tomcat install, don't expect your container to be started when installation is complete. Perhaps it's too much to expect a reasonable error message too. Anyway, I created a new service account user "tomcat" with sudo adduser tomcat and then gave it ownership of the (otherwise locked down) Tomcat install directory and started the server.

  8. Actually check that the samples are working - front-to-back - at this point. First time around, I celebrated an early victory when I first saw the sample pages running inside Tomcat. There was a leeeettle bit more setup to do re: the sample database. The LCDS install ships with an HSQLDB database, which needs write permission on a bunch of directories. In the spirit of the previous step (and, perhaps, a pre-existing condition in which I derive pleasure from creating new service accounts) I created a new user called hsqldb with permission to create the .lck (lock) files. sudo adduser hsqldb and then /opt/lcds/sampledb$ sudo chown -R hsqldb .

  9. LCDS is free under Adobe's developer license, even on more than 1 CPU (and even in production environments, though there only on a single CPU). This is great for developers like me who can tinker at home without parting with hard cash. We can even demo our applications to clients for free. The people with the cheque books can keep them in their pockets until bigger test and production environments are commissioned.

Friday, February 27, 2009

System.OutOfMemoryException Part 3

It's not every day you come home and plug 8GB of physical memory into your home PC. Before now, I wasn't completely aware that there was even a limit to the amount of memory you could usefully install on a Windows XP machine. There are all sorts of hurdles it seems. Luckily, Windows Server 2008 x64 has no problem with addressing it all. Bonus. But can you imagine the size of the swap files you would start to rack up if you ran a bunch of memory-hungry processes simultaneously? Processes only see virtual memory... Windows is smart about whether this memory is backed by physical memory pages, or pages allocated in a system file called the swap file. As we move to the wider 64-bit addresses, we're not going to get another System.OutOfMemoryException; instead we'll run out of disk space for the swap file, or spend so much time thrashing that we'll want the small address space back!

Saturday, February 21, 2009

Gen / Spec

I think it's common for vendors and consultancies to push their own technologies as solutions to the problems of clients. Even individual consultants are often bullish about the skills they've acquired. Together, these behaviours make it difficult (if not impossible) for the optimum solution to be found and implemented. Of course, an optimum solution isn't always what a client wants: consider delivering a sub-optimal solution that leaves your client better off than the nearest competitor. Still, I feel that recognising that an optimal solution does exist is an important step towards better architectures.

Adam Smith - the economist - wrote about the division of labour in his magnum opus, The Wealth of Nations. To paraphrase: he talks of a pin factory and the difference in productivity between two scenarios - sharing responsibilities across a group of employees, and assigning specific tasks to employees. There are many conclusions one can draw, but the one that shines out particularly to me is that performance gains can be made by choosing the right tool for the job (RTFJ).

Databases. Great for relational data storage and retrieval. They're designed to meet ACID requirements, and even with all the log shipping and replication in the world, don't scale quite as nicely as other technologies can/do. In my book, that would have been reason enough to keep business logic out of the database. However, certain cases might call for a gargantuan set of data to be worked on at once, and it might be prudent to "bring the computation to the data".

Grids and high performance computing. Great for compute intensive operations. However they're distributed by nature, and that generally makes things more difficult. They usually offer only a subset of the common operating system constructs we're used to - well conceptually, anyway. Spinning up a new thread locally is the "equivalent" of starting up a new process on a remote machine. Also there's the problem of moving data. (De)serialization is computationally intensive - optimisations can take the form of using shared metadata of common versions (e.g. .NET assemblies, and binary serialization) which bring new problems of managing versioning across the environment.

Whatever you're doing, always make "efficient" use of your CPU. Use asynchronous patterns for non-CPU tasks (e.g. waiting on I/O) using callbacks. Thread.Sleep() and spinning in a tight loop are generally evil (but I'm sure there exists a case where both are specifically wonderful).
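
For instance, a sketch of the classic APM shape - the calling thread is free while the I/O is in flight:

using System;
using System.IO;

static void BeginReading(Stream stream)
{
byte[] buffer = new byte[4096];
// BeginRead returns immediately; no thread sits blocked on the I/O.
stream.BeginRead(buffer, 0, buffer.Length, delegate(IAsyncResult ar)
{
int bytesRead = stream.EndRead(ar); // runs when the data arrives
if (bytesRead > 0)
{
// process buffer[0..bytesRead) and issue the next BeginRead here
}
}, null);
}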

Distribute only what you have to. If your constraint is virtual ("addressable") memory then it might be OK just to have multiple processes on the same machine with lots of physical memory, talking to each other via some non-network IPC mechanism.

Cache hits should be quick. Cache misses (generally) shouldn't result in multiple simultaneous requests to insert fresh data in the cache. Tricky bit is not making any "threads" wait while the cache data is inserted. That ties my previous point in with the next:

DRY. Don't repeat yourself. This goes for operations as well as for boilerplate copy-and-pasted code. If a cache can give you the result of an expensive operation you've already computed, for less cost, then consider caching. In-memory, disk, distributed and bespoke caches exist. Each will have a different cost and complexity profile.
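
A minimal sketch of that cache-miss guard (the names are hypothetical; the double-checked insert stops two racing misses from both writing, though a per-key "pending" entry would be needed to stop them both computing):

using System.Collections.Generic;

static readonly object padlock = new object();
static readonly Dictionary<string, string> cache = new Dictionary<string, string>();

static string Get(string key)
{
lock (padlock)
{
string hit;
if (cache.TryGetValue(key, out hit))
return hit; // cache hits stay quick
}
string fresh = ExpensiveOperation(key); // computed outside the lock, so hits aren't blocked
lock (padlock)
{
if (!cache.ContainsKey(key)) // double-check before inserting
cache.Add(key, fresh);
return cache[key];
}
}

static string ExpensiveOperation(string key)
{
return key.ToUpper(); // stand-in for the real work
}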

Thursday, February 12, 2009

Licensing Components in .NET - Part 2

I reckon the only reason LicFileLicenseProvider is part of the framework is to get the point across to the licensing greenhorns. All of a sudden, brains start ticking: you can load the license from anywhere. You begin crafting nefarious schemes using public key cryptography. It's brilliantly academic. But something's wrong. It would be easier just to hack the code. Steal the intellectual property. Hey, maybe that's why Microsoft stopped at the LicFileLicenseProvider? Maybe, and here's a thought: maybe they should have.

Another crazy piece of the (even crazier) licensing puzzle: licenses.licx files and lc.exe.

lc.exe is a tool written in .NET by Microsoft, which is used transparently by msbuild when your Visual Studio projects are compiling. Looking inside the assembly's resource strings, we discover it:

Generates a .NET Licenses file and adds it to the manifest of the given assembly
Usage:
lc /target:TargetAssembly /complist:filename [/outdir:path] [/i:modules] [/v] [/nologo]

Options:
/target: Target assembly for the generated licenses file
/complist: Licensed component list file
/outdir: Output directory for the generated licenses file
/i: Specify modules to load
/v Verbose output
/nologo Suppress the display of the startup banner

The entry point into this assembly is the Main method of the System.Tools.LicenseCompiler class. Of (arguably) most importance is the /target: switch. This is the name of the assembly into which the compiled resource will be added. In Elsie.Target.exe this would be a resource named "Elsie.Target.exe.licenses", containing a binary stream representation of a serialized .NET object. More to come...

If you add an empty text file named "licenses.licx" to your project, Visual Studio automatically sets its BuildAction:EmbeddedResource and CopyToOutput:DoNotCopy. It also calls lc.exe before calling csc.exe (yes, I'm a C#-a-holic). It makes the decision based on the .licx extension, and you can have as many .licx files as you want in a single project (OK, that may not be true, but why would you want that many? Anyway, it will generate one /complist:[filename.licx] for each .licx file in your project).

So what do you type in this/these text file(s)? If you really care, we'll have to make a 3rd installment.

Licensing Components in .NET - Part 1

I'm going to wear three hats here: the fantastic component developer who's written a third party library, and the poor sod who has to make the aforementioned fantastic component (not developer) work with the existing build environment. Finally, I'll be that guy who has to run this application and explain to his boss why it's suddenly stopped working.

Unsurprisingly, Microsoft do have a how-to for developing licensed components. There's also a nice diagram here. Some of the information presented in these tutorials is a little misleading, so I figured I'd get back to trying on my hats.

Hat 1

I've developed the one component that could change the world. That's a little c, not big C for Component. There's no requirement for my class to fit into Microsoft's ComponentModel namespace (which is, incidentally, where all the bog standard licensing classes reside). So, I apply [LicenseProvider(typeof(LicFileLicenseProvider))] to my fantastic class, and somewhere in the class I make a call to the static method LicenseManager.Validate (passing in the type of my class, as this is how the manager figures out that I wanted it to use the LicFileLicenseProvider). There are two overloads for Validate:
a) public static License Validate(Type type, object instance)
b) public static void Validate(Type type)
Option (a) offers the fullest functionality, and it makes sense to make the call in my class's constructor - after all, the error message (of the LicenseException that's thrown if the call to Validate fails for some reason) WANTS me to do this: "An instance of type 'Elsie.Proprietary.Fantastic' was being created, and a valid license could not be granted for the type 'Elsie.Proprietary.Fantastic'. Please, contact the manufacturer of the component for more information." Because the call to Validate returns a new disposable License object, I'm responsible at least for ensuring it gets cleaned up properly. I'll assign it to an instance field, and make my class implement IDisposable.
Option (b) is a little less messy - I don't have to pass in an instance of my class, and I don't have to worry about managing a License object's lifetime. "A valid license cannot be granted for the type Elsie.Proprietary.Fantastic. Contact the manufacturer of the component for more information."
That's it. I don't even have to create a license file.
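
Putting Hat 1 into code (a sketch - the framework types are real, the class body is hypothetical):

using System;
using System.ComponentModel;

namespace Elsie.Proprietary
{
[LicenseProvider(typeof(LicFileLicenseProvider))]
public class Fantastic : IDisposable
{
private License license;

public Fantastic()
{
// Option (a): validate against this instance and keep the returned
// License, because we now own its lifetime.
license = LicenseManager.Validate(typeof(Fantastic), this);
}

public void Dispose()
{
if (license != null)
{
license.Dispose();
license = null;
}
}
}
}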

Hat 2

I'm going to use the Fantastic class, so I mock up a new project of my own (which I call Elsie.Target.exe) and I add an assembly reference to it. Then I create (probably in notepad2.exe) a one line txt file: inside it I type "Elsie.Proprietary.Fantastic is a licensed component." I make sure the file is called "Elsie.Proprietary.Fantastic.lic" and I make sure it's copied to my working directory (probably by setting BuildAction:Content, and CopyToOutput:CopyAlways). Inside my application, I call the Fantastic constructor (within a using statement, because the class implements IDisposable, because the component developer was a responsible guy after all). Hidden inside the constructor, Fantastic checks if I'm allowed to use it by loading the .lic file. If the checks are successful, I go on my way to being a superstar developer. Otherwise, an exception will be thrown and it's back to the streets for me!
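
In code, my side of the bargain is tiny (sketch):

// Elsie.Proprietary.Fantastic.lic sits in the working directory and contains:
// Elsie.Proprietary.Fantastic is a licensed component.
using (Elsie.Proprietary.Fantastic fantastic = new Elsie.Proprietary.Fantastic())
{
// if we get here, LicenseManager found and accepted the .lic file
}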

Hat 3

I'm in the London office at 7am. I deployed Elsie.Target.exe, along with Elsie.Proprietary.dll and Elsie.Proprietary.Fantastic.lic last night. While I've been sleeping, everyone in APAC has been delighted with just how fantastic the application is. In my excitement, I forgot about being a cheeky monkey and changing the .lic file contents to read "... is *not* a licensed component". This is lucky for me, because it would BREAK the application!

Other examples:
Good: "Elsie.Proprietary.Fantastic is a licensed component."
Good: "Elsie.Proprietary.Fantastic is a licensed component. Yes it is!"
Good: "Elsie.Proprietary.Fantastic is a licensed component. NOT!"
Bad: "Elsie.Proprietary.Fantastic is a licensed component"
Bad: "Elsie.Proprietary.Fantastic is not a licensed component"
It turns out that the key is valid if the text in the file starts with the fully qualified type name followed by " is a licensed component." (trailing full stop included).

This is crazy! So crazy, in fact, that it might just work...

Wednesday, December 03, 2008

C# Verbatim Identifier

I've known for a long time that C# identifiers could have an optional @ sign in front of them, but until recently I thought that the character became part of the identifier.

int foo = 3;
int @foo = 4; //<-- error here: identifier 'foo' already in scope

So, it's really a way to call into code that may have been written in another CLR language, where identifiers can clash with C#'s reserved keywords.

By way of example, to call the following VB.NET function

Public Class VerbatimIdentifierTest
Public Shared Function int(ByVal value As Double) As Integer
Return Convert.ToInt32(value)
End Function
End Class

from C# you'd invoke:

VerbatimIdentifierTest.@int(1.0);

Saturday, October 25, 2008

Amazon EC2

Amazon's Elastic Compute Cloud (EC2) is a nice idea, but it's important not to overlook the "elastic" in the name. If there's an obvious temporal pattern to your service's usage (examples: 1. it could easily consume 100% CPU of 5 machines between 7 am and 11 am, but slow to a dawdle for the rest of the day, or 2. a large system could lie unused on weekends and holidays), and you design your application with parallelism in mind, this could be a cost effective strategy for hosting the service. As they bill you per "instance-hour" (loosely defined as an "instance" available for all - or part of - an hour), an available _idle_ instance will cost the same in a month as an available _fully-utilised_ instance. Costs start to fall as soon as you turn off your under-utilised instances dynamically, which is something you can't yet do when renting physical space in a data center (AFAIK).

For small businesses without the need for elasticity (note: does that spell "unsuccessful"?), I'm not convinced it would be as cost effective as renting some tin in a data center and running multiple virtualized instances on some technology like Hyper-V, ESX or Xen.

Wednesday, October 22, 2008

Assert.That(actual, Is.EqualTo(expected).Within(tolerance))

New to NUnit 2.4 back in March 2007:

using NUnit.Framework.SyntaxHelpers;

I really like this idea as it can make some unit test code much more legible!
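
For example (illustrative values, assuming the usual using NUnit.Framework alongside):

[Test]
public void SumIsCloseEnough()
{
Assert.That(0.1 + 0.2, Is.EqualTo(0.3).Within(0.0000001));
}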

Saturday, October 18, 2008

Language Optimization Observations

Programming languages are becoming prolific. Most recently the trend of domain specific languages has emerged. On any given project, multiple languages will be employed (to varying degrees of success). Their design can be influenced by many factors. Some - if not most - languages enable one to succinctly and elegantly define a solution to a problem, speeding up initial development and subsequent modifications. However, as the languages get more abstract they appear to get slower and slower to run. Maybe it's that they're solving bigger problems? Maybe it's that their runtime environments have been designed for the general case, and literally "karpullions" of CPU cycles are being wasted maintaining state in which we [for a given application] are not interested. The smart money is on over-generalized and sometimes poorly implemented runtime environments.

Thursday, October 16, 2008

Duct Typing

defn: A portmanteau of duck-typing and duct-taping. Used to describe the effects of defining .NET extension methods in your own code to give existing objects "new" functionality.
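
For example, a sketch that duct-types a (hypothetical) Reversed method onto string:

using System;

public static class StringDuctTape
{
// Reads like an instance method on string, but it's just compiler duct tape.
public static string Reversed(this string value)
{
char[] chars = value.ToCharArray();
Array.Reverse(chars);
return new string(chars);
}
}

// usage: "stressed".Reversed() returns "desserts"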

Closures in C#

If I'm using the wrong terminology: sorry; you lose. The C# compiler is a wonderful thing. Among its many gifts to us are iterator blocks (think: yield return) and anonymous methods (think: delegate literals). Neither of these constructs has a direct parallel in MSIL, the common intermediate language to which all C# code is compiled. Instead the C# compiler defines new types to encapsulate their behaviours.

Iterator Blocks

An iterator block becomes a state machine; a disposable enumerator. Code defined in C# as this:

public static IEnumerable<int> PublicEnumerable()
{
yield return 1;
yield return 2;
yield return 3;
}

is compiled down to the MSIL equivalent of this:

public static IEnumerable<int> PublicEnumerable()
{
return new <PublicEnumerable>d__0(-2);
}

The class <PublicEnumerable>d__0 is a closure, and if we take a peek in Reflector we see the following definition:

[CompilerGenerated]
private sealed class <PublicEnumerable>d__0 : IEnumerable<int>, IEnumerable, IEnumerator<int>, IEnumerator, IDisposable
{
// Fields
private int <>1__state;
private int <>2__current;

// Methods
[DebuggerHidden]
public <PublicEnumerable>d__0(int <>1__state);
private bool MoveNext();
[DebuggerHidden]
IEnumerator<int> IEnumerable<int>.GetEnumerator();
[DebuggerHidden]
IEnumerator IEnumerable.GetEnumerator();
[DebuggerHidden]
void IEnumerator.Reset();
void IDisposable.Dispose();

// Properties
int IEnumerator<int>.Current { [DebuggerHidden] get; }
object IEnumerator.Current { [DebuggerHidden] get; }
}

The object instance can be returned to the caller of PublicEnumerable where it will behave like the iterator block we know and love.

Anonymous Methods

Anonymous methods can access local variables defined in the same scope. When a delegate literal contains a reference to a method local variable (see: i in the following example):

public static void DelegateLiteralTest()
{
Int32 i = 5;
Action<Int32> add = delegate(Int32 value) { i += value; };
Console.WriteLine(i);
add(1);
Console.WriteLine(i);
add(2);
Console.WriteLine(i);
}

the C# compiler generates a new type (closure) that resembles the following:

[CompilerGenerated]
private sealed class <>c__DisplayClass1
{
// Fields
public int i;

// Methods
public <>c__DisplayClass1();
public void <DelegateLiteralTest>b__0(int value);
}

Notice how the local variable is now scoped to the instance of the closure, and how it's an instance of this closure that's used inside the method:

public static void DelegateLiteralTest()
{
<>c__DisplayClass1 __closure = new <>c__DisplayClass1();
__closure.i = 5;
Console.WriteLine(__closure.i);
__closure.<DelegateLiteralTest>b__0(1);
Console.WriteLine(__closure.i);
__closure.<DelegateLiteralTest>b__0(2);
Console.WriteLine(__closure.i);
}

I think this is pretty darn cool.

Sunday, October 12, 2008

Singleton Pattern vs. C# Static Class

At the most basic level, both constructs are useful in limiting the number of instances of an object that can be created. There are numerous subtle differences at the implementation level, but the single biggest difference is at an object-oriented level: static classes do not support polymorphism. This stems from a combination of factors at the language level; in short, a static class cannot inherit from a base class (abstract or concrete) nor can it implement any interface methods. Conversely, you can neither define static virtual methods on a class, nor can you define them on an interface. So, if you're ever using dependency injection - a technique that relies heavily on polymorphism - you will likely find that static classes will be inadequate, limiting, and plain frustrating.

There are a number of other runtime level differences, but these are all secondary to the polymorphism issue. The simplest is that a static class can never be instantiated, and all methods are executed against the type instance. Contrast this with the singleton, of which exactly one instance is created, and where methods are executed against that instance.
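
A sketch of the difference (IClock and SystemClock are hypothetical):

using System;

public interface IClock
{
DateTime Now { get; }
}

public sealed class SystemClock : IClock
{
public static readonly SystemClock Instance = new SystemClock();
private SystemClock() { }
public DateTime Now { get { return DateTime.Now; } }
}

// SystemClock.Instance can be passed anywhere an IClock is expected
// (and replaced by a fake in unit tests); a static class can't be passed at all.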

Next time you're involved in object design and you come across a potential application for the singleton pattern, don't forget this posting!

Monday, October 06, 2008

Delegate vs. Event

What's the difference between a delegate and an event in C#, you ask? An event provides better encapsulation (read: a more object-oriented approach) while a delegate is "merely" an immutable multicast type-safe function pointer. In other words, an event is an encapsulated delegate, where only the += and -= functions are non-private. That means that although we can add new subscribers to the event from another class, we cannot invoke it. The add and remove accessors are the only members to retain the event's declared access (e.g. public).

It's easier to explain delegates first.

Delegates


  • Function pointer: a delegate acts as a layer of indirection allowing references to methods to be passed between objects. The target methods are invoked when the delegate is invoked.
  • Type-safe: unlike function pointers in C++, a delegate knows the type of the target object, and the signature of the target method.
  • Multicast: a delegate maintains a list of methods; it isn't limited to invoking just one.
  • Immutable: delegate instances - like strings - cannot be changed. When you add/remove another target method to/from a delegate instance, you're really creating a completely new delegate instance with a superset/subset of target methods.

    A delegate example


    Define the delegate type, or use one of the many predefined types shipped with the FCL.
    public delegate void Function(string value);
    Define the target methods that we can call. These will have to conform to the signature of the chosen delegate type.

    static void WriteOut(string value)
    {
    Console.Out.WriteLine(value);
    }

    static void WriteError(object value)
    {
    Console.Error.WriteLine(value);
    }

    Define a class with a public delegate field (bad object-oriented encapsulation, but it's just for the purpose of example).

    class DF
    {
    public Function function;
    }

    Construct a new instance of this class, add a couple of target methods and invoke them through the delegate:

    DF df = new DF();
    df.function += new Function(WriteOut);
    df.function += new Function(WriteError);
    df.function("Delegate");


    Events


    An event is a construct that is functionally similar to a delegate, but provides more encapsulation. Specifically, adding/removing target methods is given the accessibility of the event (e.g. public/protected/internal). Everything else is given the accessibility modifier of "private". This allows subscribers to be notified of an event, but not to trigger it themselves.

    An event example


    Again, pick a delegate type from the FCL or define your own. For this example, I will reuse the Function delegate type, and the two Function methods defined above.
    Define a class with an Event:

    class EF
    {
    public event Function function;
    }

    Construct a new instance of this class, add a couple of target methods:

    EF ef = new EF();
    ef.function += new Function(WriteOut);
    ef.function += new Function(WriteError);

    Because of the encapsulation, we cannot call the following method from outside the declaring type:

    //ef.function("Event"); // ... can only appear on the left hand side of += or -= (unless used from within the event's declaring type)


    A good technical resource:
    Difference between a delegate and an event.
    Wednesday, October 01, 2008

    Indispensable Software Engineering Tools for .NET

    Requirements
    =
    JIRA: rudimentary but ok, not so great for managing project plans
    -
    Source Control
    =
    Subversion: all check-ins must have a reference back to the original requirement.
    Everything included in the release must be held in the versioned repository.
    Everything means everything... environment specific config files.
    (Don't store usernames and passwords in source control, but that's ok because you didn't hard code them into your configuration anyway, did you!?)
    -
    Build
    =
    CruiseControl.NET:
    > get latest from source control, tag so that the build can be repeated.
    > ensure compiled assemblies are versioned so they can be traced back to the build.
    > build, run unit tests and other code metric tools.
    Code Metric Tools:
    >FxCop
    >NUnit
    >NCover
    >anything for cyclomatic complexity?
    Post Build
    =
    FishEye: great for seeing all metrics, committed changes etc. Useful when you need to see what changes went into a build.
    -

    Friday, September 26, 2008

    Pulsed Threads

    It happened quite a while back, but I thought I'd put a note up here as a reminder: a pulsed thread is still waiting to reacquire the lock, and another thread could acquire it first. This blog entry is not about how best to re-write the producer/consumer problem as an event-driven model; it's to show how naive multi-threaded operations are - among other things - low hanging fruit when it comes to finding bugs...

    Imagine a group of producer threads [P0..Pn] and a group of consumer threads [C0..Cn] writing to and reading from a queue of arbitrary size. Access to the queue needs to be thread safe, but the consumer threads could conceivably take a long time to process the items they consume from the queue. It is an error to attempt to read from an empty queue. For this example, we do not consider throttling the producers.

    For a first (naive) attempt in C#:

    01: // consumer
    02: while (running)
    03: {
    04: Int64 value;
    05: lock (queue)
    06: {
    07: if (queue.Count == 0)
    08: {
    09: Monitor.Wait(queue);
    10: }
    11: value = queue.Dequeue();
    12: }
    13: Console.WriteLine("{0}:{1}", threadId, value);
    14: }

    01: // producer
    02: while (running)
    03: {
    04: lock (queue)
    05: {
    06: Int64 item = Interlocked.Increment(ref seed);
    07: queue.Enqueue(item);
    08: Console.WriteLine("{0}:{1}", threadId, item);
    09: Monitor.Pulse(queue);
    10: }
    11: }

    The result? After several thousand iterations we get a "Queue empty" exception on line 11 of the consumer. Puzzled, we look a little closer: it's not immediately obvious. One of the 5 consumer threads (call it C1) has acquired the lock, tested the size of the queue and put itself in a wait state until a producer thread (call it P1) has added an item. P1 pulses the queue to signal an item has been added. An extract from MSDN says:
    When the thread that invoked Pulse releases the lock, the next thread in the ready queue (which is not necessarily the thread that was pulsed) acquires the lock.
    So, while we were expecting C1 (waiting on the monitor) to acquire the lock, we were wrong. C0 was next in the queue for the lock.

    To get the application running correctly, we'd need to change line 7 of the consumer to:

    07: while (queue.Count == 0)

    Kids, always read the label carefully!

    Sunday, September 14, 2008

    Coupling: Interfaces and Abstract Classes

    When "loosely coupled" is la mode du jour, nobody really likes an abstract class. Well nobody on my team. I suspect there's more going on here than the simple increase in coupling (e.g. the number of assumptions a caller must make) over calling an interface method.

    Deriving from an abstract class:
    1) forces you to inherit from an object hierarchy, and
    2) forces you to provide an implementation for a set of abstract methods, and
    3) usually means that some behaviour is "thrown in for free" in the base class(es).

    Implementing an interface
    1) forces you to provide an implementation for a set of method signatures.

    So... use more interfaces; achieve a lower level of coupling, and you'll also implicitly be choosing composition over inheritance.
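
    A tiny sketch of the contrast (hypothetical types):

    using System;

    // Deriving: you buy into the hierarchy, plus whatever behaviour comes "for free".
    public abstract class MessageHandlerBase
    {
    public void Log(string text) { Console.WriteLine(text); } // free behaviour
    public abstract void Handle(string message);
    }

    // Implementing: callers can assume nothing beyond the signatures.
    public interface IMessageHandler
    {
    void Handle(string message);
    }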

    Friday, August 08, 2008

    Exceptions

    If only the System.Exception class were abstract or only contained protected constructors... This would force people to define their own custom exceptions (a good thing), but still wouldn't solve humankind's insatiable appetite for the catch-all exception handler.
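
    If the constructors were protected, everyone would be nudged into something like this sketch (the name is hypothetical):

    using System;

    public class PriceFeedException : Exception
    {
    public PriceFeedException(string message) : base(message) { }
    public PriceFeedException(string message, Exception inner) : base(message, inner) { }
    }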

    Wednesday, August 06, 2008

    System.OutOfMemoryException Part 2

    We've been seeing loads of these recently, not as a result of memory leaks (the usual suspects) but the result of trying to do too much at one time. Now that 64-bit operating systems are here, couldn't we take advantage of the extra virtual address space? Microsoft have a comparison of 32-bit and 64-bit memory architecture for 64-bit editions of Windows XP and Windows Server 2003
    outlining the changes between the two memory architectures. Next step is to see whether the 64-bit operating systems are approved.

    UnhandledException

    From .NET 2.0 onwards, unhandled exceptions on any thread cause the process to terminate. This is - according to a number of experts - the "correct" behaviour, and is different to how they were handled back in .NET 1.1.

    The AppDomain.UnhandledException event allows you to perform some kind of diagnostics (such as logging the exception and stack trace, or performing a mini-dump) after such an exception is thrown, but the CLR is going to exit - whether you like it or not - just as soon as all event handlers have run.

    So... if you are a responsible application developer who is spawning new threads (or queueing user work items to the thread pool), please please please ensure that - _if_ you can handle the exception - it is caught and not rethrown. Even if it means storing the exception and letting another thread handle it, as in the case of the asynchronous programming model (APM).
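
    A sketch of both halves (the names are mine): last-chance diagnostics on the AppDomain, and a worker that stores its exception instead of letting it escape.

    using System;
    using System.Threading;

    static Exception stored; // handed to an owning thread to inspect later

    static void Main()
    {
    // Last-chance diagnostics only; the process still terminates afterwards.
    AppDomain.CurrentDomain.UnhandledException += delegate(object sender, UnhandledExceptionEventArgs e)
    {
    Console.Error.WriteLine("Fatal: " + e.ExceptionObject);
    };

    ThreadPool.QueueUserWorkItem(delegate(object state)
    {
    try
    {
    // ... the real work ...
    }
    catch (Exception ex)
    {
    stored = ex; // caught, not rethrown; another thread decides what to do
    }
    });
    }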

    Wednesday, June 25, 2008

    Goodbye to Awkward Dictionaries in .NET

    I take enjoyment in analytic work, and often find myself writing code to empirically test a bunch of XML files to see if I've got the "implied schema" right. For example, I might have 500 files of similar structure, and I'd want to see the possible values (and count) of the "/Reports/Globals/@Client" node's value (using XPath). Usually this would entail using a Dictionary, and every access would have to test if the dictionary already contains the key ... until now. After seeing that Ruby's designers allow you to specify a default value for a Hash, I took another step forward and let a user of the class provide a default function, enabling more complex defaults such as empty lists.

    using System;
    using System.Collections.Generic;

    namespace ConsoleApplication6
    {
    class Program
    {
    public delegate T DefaultFunction<T>();

    public class RubyDictionary<TKey, TValue>
    {
    private IDictionary<TKey, TValue> inner = new Dictionary<TKey, TValue>();
    private DefaultFunction<TValue> defaultFunction;

    private static TValue Default()
    {
    return default(TValue);
    }

    public RubyDictionary()
    {
    defaultFunction = Default;
    }

    public RubyDictionary(TValue defaultValue)
    {
    defaultFunction = delegate()
    {
    return defaultValue;
    };
    }

    public RubyDictionary(DefaultFunction<TValue> defaultFunction)
    {
    this.defaultFunction = defaultFunction;
    }

    public TValue this[TKey key]
    {
    get
    {
    if (!inner.ContainsKey(key))
    {
    inner.Add(key, defaultFunction());
    }
    return inner[key];
    }
    set
    {
    inner[key] = value;
    }
    }
    }

    static void Main(string[] args)
    {
    try
    {
    RubyDictionary<string, int> dict = new RubyDictionary<string, int>();
    Console.WriteLine(dict["9"]);
    dict["8"] += 1;
    Console.WriteLine(dict["8"]);
    dict["7"] = 5;
    Console.WriteLine(dict["7"]);
    dict["6"]++;
    Console.WriteLine(dict["6"]);

    dict["5"]--; dict["5"]--;
    Console.WriteLine(dict["5"]);


    RubyDictionary<string, IList<string>> lists = new RubyDictionary<string, IList<string>>(delegate() { return new List<string>(); });
    lists["shopping"].Add("bread");
    lists["shopping"].Add("milk");
    lists["todo"].Remove("write this dictionary class");

    foreach (string item in lists["shopping"])
    {
    Console.WriteLine(item);
    }

    RubyDictionary<int, int> foo = new RubyDictionary<int, int>(777);
    Console.WriteLine(foo[8] - 111);
    }
    catch (Exception exception)
    {
    Console.Error.WriteLine(exception);
    }
    finally
    {
    if (System.Diagnostics.Debugger.IsAttached)
    {
    Console.WriteLine("Press enter to continue...");
    Console.ReadLine();
    }
    }
    }
    }
    }