The Chronicles of Nojo: June 2011

Wednesday, June 29, 2011

Version Tolerant Serialization

Somewhere in the vast gap between version 1.1 and version 4 of the .NET Framework, Microsoft came up with a solution to the version intolerance problem of serialization. I may have been living under a rock for several years, because I hear it was actually new in version 2.0.

In the object-oriented .NET Framework, memory to represent the state of an object instance is only ever allocated for fields*. Not properties. Properties are syntactic sugar applied to methods invoked to access the state of the object, which is always stored in fields. If you want to serialize an instance's state - it's the fields that must be written to the wire. To deserialize something off the wire - you guessed it - the fields are the destinations of the wire values.

Consider: an assembly A that exposes one type T. Initially (going against my natural desire to start counting at 0) we label them A1 and T1. And T1 looks like this:

namespace A {
  [Serializable]
  public class T {
    private int f;
    public int F {
      get { return f; }
      set { f = value; }
    }
  }
}

Another developer, D1, takes a copy of the A1 assembly and writes a fantastic application with it, connecting via (*cough*) .NET Remoting to a server that also has a copy of A1. The developer's job is done, and he retires comfortably in the Bahamas, but not before losing all the source code (and forgetting where it was even deployed).

Meanwhile, somebody working on the server team realizes that two ints are better than one, and that he can make the server even better if only he could add another int field G to type T.

Here's where the fun starts.

Prior to .NET 2.0, changing the fields of T would introduce a breaking change. Clients who only had access to A1's T1 would be unable to deserialize an instance of A2's T2, nor would they be able to serialize A1's T1 into the format required by the server (A2's T2). What they wished for (and Microsoft gave them) was:

namespace A {
  [Serializable]
  public class T {
    private int f;
    [OptionalField(VersionAdded = 2)]
    private int g;
    public int F {
      get { return f; }
      set { f = value; }
    }
    public int G {
      get { return g; }
      set { g = value; }
    }
  }
}

This allows the server to load A2 and serialize T2 down to the wire (and deserialize T1 off the wire).
It also allows the client to load A1 and serialize T1 down to the wire (and deserialize T2 off the wire).
Unfortunately for the fictional company stuck using .NET 1.1 with no source code, they'd have to get someone to bring them up to version 2.0 of .NET before they could appreciate the benefit.
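
One wrinkle the server team might hit: when T2 is deserialized from a T1 payload, g is simply left at its default of 0, because deserialization does not run constructors or field initializers. Here is a minimal sketch of one way to supply a different default, using the framework's OnDeserializing callback. The method name SetDefaults and the -1 marker value are my own assumptions for illustration, not part of the scenario above.

```csharp
using System;
using System.Runtime.Serialization;

namespace A {
  [Serializable]
  public class T {
    private int f;

    [OptionalField(VersionAdded = 2)]
    private int g;

    // Runs before the fields are populated from the wire, so a value that
    // is present overwrites this default and a missing value leaves it intact.
    [OnDeserializing]
    private void SetDefaults(StreamingContext context) {
      g = -1; // hypothetical "not supplied by an older sender" marker
    }

    public int F {
      get { return f; }
      set { f = value; }
    }
    public int G {
      get { return g; }
      set { g = value; }
    }
  }
}
```

The callback only fires during deserialization, so ordinary construction with new T() still leaves g at 0.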

Tuesday, June 21, 2011

Ambient Transactions

With a title like that, you'd be forgiven for thinking that the post was going to be about purchasing a beer at Cafe del Mar while watching the sun set. Unfortunately not.

First, consider this sample block of code:

using (SqlConnection connection = new SqlConnection(connectionString))
using (SqlCommand command = new SqlCommand(commandText, connection))
{
  // do something
}

There is no mention of any transactions; if you're like me you'd think that two things happen by the end of the block:

SQL Server doesn't still hold any locks for the data accessed in the block, and

The connection was returned to the pool for somebody else to use.

Wrong on both counts.
See what happens if the block was wrapped in another block (even a couple of frames higher on the stack):

using (TransactionScope transactionScope = new TransactionScope())
{
  // substitute original block here
}

Although nothing in our inner block explicitly references any transactions, an ambient transaction (i.e. one on the current thread) has been set up by our outer block, and the SQL Server connection enlists in it when it is opened. At the point the inner block completes, the transaction is incomplete; although the connection is returned to the connection pool, it's in a cordoned-off section of the pool where it cannot be used by any other thread that's not sharing the same transaction.
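
To see the "ambient" part without needing a database, here is a small sketch that only uses System.Transactions. The class and method names (AmbientDemo, HasAmbientTransaction) are made up for the demo; the method stands in for the inner block, which never mentions a transaction but can still observe one through Transaction.Current.

```csharp
using System;
using System.Transactions;

public static class AmbientDemo
{
    // Stand-in for the inner block: no transaction is passed in,
    // yet the ambient one is visible via Transaction.Current.
    public static bool HasAmbientTransaction()
    {
        return Transaction.Current != null;
    }

    public static void Main()
    {
        Console.WriteLine(HasAmbientTransaction()); // before any scope exists

        using (TransactionScope scope = new TransactionScope())
        {
            Console.WriteLine(HasAmbientTransaction()); // inside the scope
            scope.Complete();
        }

        Console.WriteLine(HasAmbientTransaction()); // after the scope is disposed
    }
}
```

Run as a console application, this should print False, True, False. A SqlConnection opened where the middle line sits would find the same ambient transaction and enlist in it.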

Let's imagine we set Max Pool Size=1 in our connection string. This means we have 1 connection in the pool, but it's only available to the ambient transaction. If we try to obtain another connection from the pool from a different thread with no transaction or a different transaction (even a different ambient transaction), we would time out waiting. If, instead, we repeated the inner block twice within the outer block, it would be fine: the second acquisition of a connection from the connection pool would grab the one with the (still open) ambient transaction. If we shared our transaction with another thread, we'd be able to acquire that same connection from the connection pool too.

Here's a fun exercise for the imagination: Set Max Pool Size=2; then open two connections concurrently within the same TransactionScope. You'll automatically have enlisted in a distributed transaction (not just the lightweight transaction outlined in the first part of the post). Hey presto, you're sharing a transaction across more than one SQL connection into the same server!

There are several points to take away from this post:
You can force connections not to enlist in ambient transactions by using the Enlist=false connection string property.
You can implicitly participate in transactions (whether lightweight or distributed) even when you're not expecting to.
The connection pool is slightly more complex than at first it seemed - even returning a connection to the pool doesn't guarantee its availability for other operations.
Locks can remain held long after the connection has been disposed (returned to the pool).


Sunday, June 12, 2011

External Sort

If you're ever tasked with sorting a data set larger than can fit into the memory of a single machine, you shouldn't need to panic. Put on your outside-the-box hat (pun intended) and get to work.

First of all, divide the data into blocks that are small enough to be sorted in memory, then sort them and write the results to disk (or network, or anywhere external to your process). If memory size is M, and total data to be sorted is MxN then you should now have N blocks of locally sorted data.

Next, do an N-way merge. I did it by getting N buffered readers over the N blocks. By continually getting the next lowest value from the pool of N readers (it's easy if you continually sort the readers by last obtained value in ascending order) and writing the obtained values into an output file, you will end up with a single, globally sorted output containing all N blocks' worth of data.
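
Here is a minimal sketch of the merge step. It is not the code I actually used: the "blocks" are in-memory sequences standing in for buffered readers over files, the values are plain ints, and instead of keeping the readers sorted it does a simple linear scan for the lowest current value on each step. The names ExternalSortSketch and Merge are invented for the example.

```csharp
using System;
using System.Collections.Generic;

public static class ExternalSortSketch
{
    // Merges N individually sorted sequences into one sorted sequence.
    // Each sequence stands in for a buffered reader over one on-disk block.
    public static IEnumerable<int> Merge(IList<IEnumerable<int>> blocks)
    {
        var readers = new List<IEnumerator<int>>();
        try
        {
            foreach (IEnumerable<int> block in blocks)
            {
                IEnumerator<int> reader = block.GetEnumerator();
                if (reader.MoveNext()) readers.Add(reader); // keep non-empty readers only
                else reader.Dispose();
            }

            while (readers.Count > 0)
            {
                // Linear scan for the reader holding the lowest current value.
                int lowest = 0;
                for (int i = 1; i < readers.Count; i++)
                    if (readers[i].Current < readers[lowest].Current) lowest = i;

                yield return readers[lowest].Current;

                // Advance that reader; drop it once its block is exhausted.
                if (!readers[lowest].MoveNext())
                {
                    readers[lowest].Dispose();
                    readers.RemoveAt(lowest);
                }
            }
        }
        finally
        {
            // Release any readers still open if the caller stops early.
            foreach (IEnumerator<int> reader in readers) reader.Dispose();
        }
    }

    public static void Main()
    {
        var blocks = new List<IEnumerable<int>>
        {
            new[] { 1, 4, 9 },
            new[] { 2, 3, 10 },
            new[] { 5, 6, 7, 8 }
        };
        Console.WriteLine(string.Join(",", Merge(blocks))); // 1,2,3,4,5,6,7,8,9,10
    }
}
```

The linear scan costs O(N) per output value; for large N a heap or a sorted structure keyed on each reader's current value would be the obvious swap.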

For most people attempting a very large sort, this is usually the end result (I am assuming not many people have this requirement very often, and it's rarer still for this to be their first encounter with it). If you're left wanting, however, you must continue...

For large values of N it first becomes prohibitive to buffer the input, then to even access one value for every N at the same time. In this case, you will need to perform a second (or higher) pass, working on fewer than N blocks at a time. A 32-bit Windows machine should only begin to approach this next hurdle somewhere after the tens of terabytes mark (depending, of course, on the size of the objects being sorted)...

Asynchronous ASP.NET MVC

Since ASP.NET MVC 2, Microsoft's thrown the AsyncController class into the framework, enabling asynchronous ASP.NET MVC applications without forcing developers to hand-craft their own weird and wonderful solutions. The AsyncController exposes an AsyncManager property, which allows you to increment/decrement the number of outstanding operations, and collect arguments to pass through to the XxxCompleted method when all operations are complete. To use the said controller, do this:

Derive your controller from System.Web.Mvc.AsyncController, which is treated differently by ASP.NET, and allows you access to the AsyncManager.

For each logical asynchronous method you need, provide a pair of methods that follow a set naming convention, where the method name prefix matches the action name:
1) Begin method name suffix is Async and return type is void
2) End method name suffix is Completed, parameters match those set up in the AsyncManager, and return type is an ActionResult.

For example:

public void IndexAsync()
{
    ViewData["Message"] = "Welcome to Asynchronous ASP.NET MVC!";
    AsyncManager.OutstandingOperations.Increment();
    WebService.WebService webService = new MvcApplication1.WebService.WebService();
    webService.HelloWorldCompleted += (sender, e) =>
    {
        AsyncManager.Parameters["greeting"] = e.Result;
        AsyncManager.OutstandingOperations.Decrement();
    };
    webService.HelloWorldAsync();
}

public ActionResult IndexCompleted(string greeting)
{
    ViewData["AsyncMessage"] = greeting;
    return View();
}

In case the execution flow doesn't appear obvious: on receipt of a request for the Index action, ASP.NET uses reflection to find the method pair with the prefix Index and invokes the first half of the method pair (IndexAsync) on its thread pool. The method implementation declares one asynchronous operation to the AsyncManager. We use the Event-based Asynchronous Pattern to call a demo ASMX web service asynchronously from this client - the Completed event handler sets a parameter value and decrements the number of outstanding operations. ASP.NET waits for the number of outstanding operations to reach zero, then looks for an IndexCompleted method with a string parameter named "greeting" (because this is what we called the parameter when we assigned the result on the AsyncManager, during the web service completed event handler). It invokes it (the second half of the method pair) and the rest - they say - is history.

Friday, June 10, 2011

Asynchronous ASMX Web Services

To keep threads from blocking in an ASMX web service, Microsoft have given us a nifty pattern to employ: instead of declaring your method with a signature like this

[WebMethod]
public ReturnType Test(ArgumentType arg);

you declare a pair of methods. The first is prefixed with Begin, returns an IAsyncResult, and takes an additional AsyncCallback and some state.
The second is prefixed with End and takes an IAsyncResult. When you provide these methods, ASP.NET treats them as a pair and ensures they are called at the right times.
The idea is that you can spawn as much asynchronous work as you like in the Begin method, then when you're ready, invoke the asyncCallback passed into the Begin method. This signals ASP.NET to call your End method, which is responsible for returning the final result of the function pair.

[WebMethod]
public IAsyncResult BeginTest(string user, AsyncCallback asyncCallback, object state)
{
    SqlConnection connection = new SqlConnection(@"Server=???;Async=true");
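    // Pass the user value as a parameter rather than formatting it into the SQL text.
    SqlCommand command = new SqlCommand(@"WAITFOR DELAY '00:00:04' SELECT @user", connection);
    command.Parameters.AddWithValue("@user", user);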
    connection.Open();
    return command.BeginExecuteReader(asyncCallback, command);
}

[WebMethod]
public string EndTest(IAsyncResult asyncResult)
{
    SqlDataReader reader = null;
    SqlCommand command = (SqlCommand)asyncResult.AsyncState;
    try
    {
        reader = command.EndExecuteReader(asyncResult);
        do
        {
            while (reader.Read())
            {
                return reader.GetString(0);
            }
        } while (reader.NextResult());
        throw new InvalidOperationException("No results returned from reader.");
    }
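    finally
    {
        if (reader != null)
            reader.Dispose();
        // Capture the connection before disposing the command that references it.
        SqlConnection connection = command.Connection;
        command.Dispose();
        connection.Dispose(); // Dispose also closes the connection
    }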
}

In a service implemented asynchronously like this, 100 different clients could concurrently (and synchronously) execute calls to Test with the server efficiently allocating just 1 thread to service all the requests.
But there's a distinction: the server is asynchronous, yet the client calls are still synchronous by default (unless you skipped ahead and read the next bit).
That is to say, if a client with just one CPU core attempted to make 100 simultaneous calls simply by threading the requests, the throughput wouldn't be great, and there would be the overhead of having 100 threads context switching, garbage collecting etc.
It would be a far better option to make the calls asynchronously from the client too. After all, a web service call is I/O.

Visual Studio 2010 gives us an option when we generate the Web Service Reference - it's a check box titled "Generate asynchronous operations".
The proxy generated when this option is checked conforms to Microsoft's Event-based Asynchronous Pattern (EAP).
You subscribe to a completion event, and invoke the proxy method (which returns void). Once the response is ready, the event is raised and your callback invoked, at which point you get the result (or error).
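
The shape of an EAP call can be sketched without a real service. Everything named here (FakeProxy, HelloWorldCompletedEventArgs, EapDemo) is a hypothetical stand-in I've written for illustration: a generated proxy would make an HTTP request, and its event args would normally derive from AsyncCompletedEventArgs rather than plain EventArgs.

```csharp
using System;
using System.Threading;

// Simplified, hypothetical event args; real generated ones carry more detail.
public class HelloWorldCompletedEventArgs : EventArgs
{
    public HelloWorldCompletedEventArgs(string result, Exception error)
    {
        Result = result;
        Error = error;
    }
    public string Result { get; private set; }
    public Exception Error { get; private set; }
}

// Hypothetical stand-in for a generated web service proxy.
public class FakeProxy
{
    public event EventHandler<HelloWorldCompletedEventArgs> HelloWorldCompleted;

    // Returns immediately; the result arrives later through the event.
    public void HelloWorldAsync()
    {
        ThreadPool.QueueUserWorkItem(delegate
        {
            EventHandler<HelloWorldCompletedEventArgs> handler = HelloWorldCompleted;
            if (handler != null)
                handler(this, new HelloWorldCompletedEventArgs("Hello World", null));
        });
    }
}

public static class EapDemo
{
    public static void Main()
    {
        var proxy = new FakeProxy();
        using (var done = new ManualResetEvent(false))
        {
            // Subscribe first, then start the call.
            proxy.HelloWorldCompleted += (sender, e) =>
            {
                Console.WriteLine(e.Error == null ? e.Result : e.Error.Message);
                done.Set();
            };
            proxy.HelloWorldAsync(); // does not block the caller
            done.WaitOne();          // only here so the console demo doesn't exit early
        }
    }
}
```

In a real client you would not wait on an event like this; the point of the pattern is that the calling thread is free to start the other 99 calls.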

This gives us an incredibly efficient way of working. With just one thread on the client, and one thread on our web server, (and potentially just one thread on the SQL Server in our simple "WAITFOR" example) we can make 100 calls in the same 4* seconds it takes to make just 1 call. We could probably stretch it to 1000 calls even! The point is that a blocked thread (usually) harms a system's performance, whether it's the end client, an intermediate server, or the end server crunching the numbers.

Wednesday, June 08, 2011

Asynchronous ASP.NET / SQL Server

We all agree that waiting is evil, right? Well, threads are evil too! Ok, we need a few threads to get the job done, but they (ideally) shouldn't ever block.

Achieving high concurrency has never been only about running operations on multiple threads. In fact, the best performance is when the thread count is a number very close to the hardware thread count[1].

.NET provides us with the Asynchronous Programming Model (APM) that gives us the ability to use non-blocking operations that might otherwise have blocked up a thread waiting for them to complete. It allows us to create as few threads as are absolutely necessary and to use them efficiently.
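
If the APM is new to you, here is a minimal sketch of the Begin/End shape using a file read, chosen only because it needs no server. The temp file, the buffer size and the name ApmDemo are assumptions for the demo; the FileStream BeginRead/EndRead calls are the framework's own.

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading;

public static class ApmDemo
{
    public static void Main()
    {
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "hello");

        byte[] buffer = new byte[16];
        using (var done = new ManualResetEvent(false))
        using (var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                           FileShare.Read, 4096, true)) // true = async I/O
        {
            // Begin returns straight away; the callback runs once the read completes.
            stream.BeginRead(buffer, 0, buffer.Length, asyncResult =>
            {
                var s = (FileStream)asyncResult.AsyncState;
                int count = s.EndRead(asyncResult); // End collects the result or rethrows
                Console.WriteLine(Encoding.UTF8.GetString(buffer, 0, count));
                done.Set();
            }, stream);

            done.WaitOne(); // only to keep the console demo alive
        }
        File.Delete(path);
    }
}
```

The same three ingredients (a Begin call, an AsyncCallback, an End call) reappear in the SqlCommand code later in this post.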

Let's say we have an ASP.NET web application (ignoring MVC until Part Two). The page will display a piece of data that takes 30 seconds to calculate, based on some input data. The calculation doesn't happen locally; in fact it's running on a cluster of big SQL Server boxes. We have 1000 users, and the database query that takes 30 seconds scales like magic (even at 2000 users, it still takes less than 45 seconds on average). Can we put the ASP.NET part of this application onto just one box? The answer is yes - I'll show you.

First of all, we want ASP.NET to treat our page differently to a regular ASP.NET page; when we begin our asynchronous operation, we want to signal ASP.NET that the thread we started with is now free for the next request. We also want ASP.NET to call us back when our result is ready. This is done by setting the following page attribute:

<%@ Page Async="true" ...

Our request will now be handled by an IHttpAsyncHandler and will now process in 4 phases instead of 1.
Synchronously:

PreInit, Init, InitComplete, PreLoad, Load, LoadComplete, PreRender, PreRenderComplete, SaveState, SaveStateComplete, Render

Asynchronously:

PreInit, Init, InitComplete, PreLoad, Load, LoadComplete, PreRender

Begin

End

PreRenderComplete, SaveState, SaveStateComplete, Render

See what's happened here? Instead of synchronously performing phases 1 and 4 as a single operation (on the same thread) like we would do in a regular ASP.NET page, we break the long running task into 4 chunks, running each independently, and allowing .NET to efficiently allocate tasks to physical threads. That's correct: now we have no guarantee that Render will be called on the same thread as Init. And why should it? No reason, that's why! (However, there's every possibility it *might* be more efficient to use the same physical thread - yes, just one - if the server in question had only a single core CPU. The point is we shouldn't write code that expects this to have been the case). Also note that two new phases have been sandwiched in between 1 and 4: Begin (2) and End (3).
To use Begin and End, we need only to (write and) register our callbacks, and ASP.NET will ensure they're called.

protected void Page_Load(object sender, EventArgs e)
{
    AddOnPreRenderCompleteAsync(
        command_BeginExecuteReader,
        command_BeginExecuteReader_AsyncCallback
        );
}

We register (as many as we need) pairs of Begin and End methods that match the following signatures (both the Begin and End methods are called asynchronously, hence the async look and feel of the parameter lists):

public delegate IAsyncResult BeginEventHandler(object sender, EventArgs e, AsyncCallback cb, object extraData);

public delegate void EndEventHandler(IAsyncResult ar);

So far so good (hopefully), but all this would be pointless if SQL Server didn't also give us asynchronous connections that can call back into the thread pool when the command completes. To use it, set

Asynchronous Processing=true;

in the connection string and code up your operation to the APM pattern:

void command_BeginExecuteReader_AsyncCallback(IAsyncResult asyncResult)
{
    SqlCommand command = (SqlCommand)asyncResult.AsyncState;
    SqlDataReader reader = command.EndExecuteReader(asyncResult);
    do
    {
        while (reader.Read())
        {
            Response.Write(/*do-something-with-the-reader*/);
        }
    } while (reader.NextResult());
    reader.Dispose();
    command.Connection.Close();
    command.Dispose();
    command.Connection.Dispose();
}

IAsyncResult command_BeginExecuteReader(object sender, EventArgs e, AsyncCallback cb, object extraData)
{
    SqlConnection connection = new SqlConnection(@"Server=Scandium;Trusted_Connection=true;Asynchronous Processing=true;");
    SqlCommand command = new SqlCommand(@"/*hectique-code*/", connection);
    connection.Open();
    return command.BeginExecuteReader(cb, command);
}

Did you spot the ASP.NET trick that allows this magic to work? See how our Begin method takes an AsyncCallback from the caller (ASP.NET)... well, we pass that callback to SQL Server instead of directly wiring up our End method as we might in a Windows application. That way, as soon as SQL Server is ready, ASP.NET gets notified (not us directly), which gives control to our page for us to handle the response. Then, when we're done (i.e. the End method call stack is unwound), ASP.NET moves on to the final phase to do its PreRenderComplete etc.

This is how it's done. Simples. For the sake of clarity I've not bothered with error handling in the example. Resource management across threads makes C#'s using statement impossible, so implementers of IDisposable are dealt with in a slightly tricky way. Thrown exceptions would need special care too.

[1] I made this figure up, but it does seem to prove itself anecdotally quite a lot.
