Thursday, March 26, 2009
Buffer vs. Array
To clear up any misconceptions, System.Array.Copy is not the same as System.Buffer.BlockCopy unless you're operating on arrays of System.Byte. The Buffer class copies bytes, not items, from one array to another. If you have an array of System.Int32 (4 bytes each) and you copy 3 bytes from your source to your destination array, you will just get the first 3 bytes of your 4-byte System.Int32. Depending on the endianness of your system, this could give you different results. Also, the System.Buffer class only works on arrays of primitives. Not C# primitives (which include strings) and not even other derivatives of System.ValueType (e.g. enums and your own structs). Clearly it can't work on reference types safely (imagine just copying 3 of the 4 bytes from one object reference to another) but I would have expected it to work with enumerations (essentially ints).
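To make the byte-versus-item point concrete, here is a minimal Python sketch of what a 3-byte copy does to a 4-byte integer. It is only an analogue of the behaviour described above, not the .NET API: block_copy is a made-up helper, and the byte order is passed in explicitly so both cases can be seen on one machine.

```python
import struct


def block_copy(src_ints, dst_ints, byte_count, byte_order="<"):
    """Copy the first byte_count raw bytes of src over dst.

    Both lists are treated as packed arrays of 32-bit signed ints.
    byte_order is "<" for little-endian or ">" for big-endian.
    Hypothetical helper for illustration only.
    """
    src = struct.pack(f"{byte_order}{len(src_ints)}i", *src_ints)
    dst = bytearray(struct.pack(f"{byte_order}{len(dst_ints)}i", *dst_ints))
    if byte_count > len(src) or byte_count > len(dst):
        raise ValueError("byte_count exceeds one of the arrays")
    # Overwrite only the leading bytes; the rest of dst is left untouched.
    dst[:byte_count] = src[:byte_count]
    return list(struct.unpack(f"{byte_order}{len(dst_ints)}i", bytes(dst)))


little = block_copy([0x11223344], [0], 3, "<")
big = block_copy([0x11223344], [0], 3, ">")
print(hex(little[0]), hex(big[0]))
```

On a little-endian layout the three low-order bytes arrive and the top byte is lost; on a big-endian layout it is the other way round.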
Monday, March 23, 2009
Hello Python
>>> def reverse(value):
...     if len(value) > 1:
...         return value[-1] + reverse(value[:-1])
...     else:
...         return value
...
>>> print(reverse('!dlroW ,olleH'))
Hello, World!
System.OutOfMemoryException Part 4
And then there were 2. I'm talking about code paths that lead to an OutOfMemoryException being thrown by the CLR.
#1 is the standard "You've run out of contiguous virtual memory, sonny boy!", which is pretty easy to do in a 32-bit process. With the advent of 64-bit operating systems, you get a little more headroom. Actually, a lot more headroom: precisely 2^32 times as much as you had to start with! But, due to limitations built into the CLR, you're no further from an OutOfMemoryException...
#2 is when you try to allocate more than 2GiB in a single object. In the following example, it's the +1 that pushes you over the edge:
new System.Byte[System.Int32.MaxValue + 1L]; // 1L: as a plain int constant, MaxValue + 1 overflows at compile time
If you're using a different ValueType your mileage may vary:
new System.Int32[(System.Int32.MaxValue / sizeof(System.Int32)) + 1];
Or if you're on a 64-bit system:
new System.IntPtr[(System.Int32.MaxValue / sizeof(System.Int64)) + 1];
Many thanks to this blog for helping me in the right direction here.
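The element counts in the three snippets above all come from the same sum, which can be checked with a few lines of Python. This is just the post's arithmetic written out; it does not model the CLR itself, and INT32_MAX and first_failing_length are names invented for the sketch.

```python
INT32_MAX = 2**31 - 1  # value of System.Int32.MaxValue


def first_failing_length(element_size):
    """Element count used in the post: (Int32.MaxValue / element size) + 1."""
    return INT32_MAX // element_size + 1


for name, size in [("Byte", 1), ("Int32", 4), ("IntPtr (64-bit)", 8)]:
    count = first_failing_length(size)
    # count * size is the payload size in bytes that the post says is one step too far.
    print(name, count, count * size)
```

In each case the payload works out to exactly 2GiB (2^31 bytes), which is why the element count shrinks as the element gets wider.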
Hopscotch in the 64-bit Minefield
So it's no secret I've been playing with virtualization, Windows, Linux, IA32 and amd64. Virtualization looks the part, but so does a 24-bit color depth 1280 x 1024 bitmap of a fake desktop. You can't do much with either.
Microsoft has given us WOW64, allowing us to run old x86 (IA32) software on new 64-bit operating systems. It's so seamless you forget you've got it: until you try and install ISA Server 2006.
Getting Windows Hyper-V Server up and running on a headless box... well, let's just say I'm not that patient. I eventually settled for Hyper-V role on a full-fat Windows Server 2008 Enterprise Edition install, with crash-carted DKM. It doesn't appear to run on anything except the 64-bit version, either. Even then, loads of caveats await:
gigabit network card? you'll have to use a ridiculously slow emulated "switch" instead.
multiple cpus? not!
scsi drives? think again.
usb? umm... for some reason nobody in the world has ever thought of plugging a usb device into a server. i was the first. i'm so proud!
Granted, all these restrictions are imposed on the non-MS (see ubuntu) and older guest operating systems like Windows XP, but isn't half the pitch behind virtualization about getting rid of old physical kit and consolidating servers?
Flex Builder 3? Oh yes. But not for Linux. No wait... there's a free alpha version but it's not compatible with 64-bit.
VMware ESXi? Check your hardware list first.
Saturday, March 21, 2009
Debug Enable!
All wireless functions on my O2 wireless box II appear to have died overnight. This is a problem for me because I use wireless all the time... to stream music to my Apple TV, to surf the Internet from my laptop, and play games on my iPod Touch. The wired ethernet is still happily routing packets between my LAN and the Internet. So I climbed up in my cupboard and pulled down the old Netgear DG834G - what a beautiful beast. It runs embedded Linux with BusyBox, and you can connect to it using any telnet client. Just navigate to http://<netgear-router>/setup.cgi?todo=debug from your browser first, and you're A for Away - you can then telnet into it. Reboot the router when you're finished to disable the telnet server until next time.
Things that frustrated me:
There was no way to change the default gateway address handed out by either router. I suspect I could have done so with the Netgear box in time, but there is no text editor in the distribution.
Getting my network back up and running would be difficult without the usual trance music coming through the Apple TV to keep me focused. Still, I needed to:
- put both routers onto the same subnet
- keep DHCP running on the O2 box (so that the default gateway would remain the O2 box address)
- turn off DHCP on the Netgear box (otherwise it would give out its own IP address as default gateway)
- turn off the wireless interface on the O2 box for good
In a nutshell:
o2wirelessbox is now 192.168.1.254/24 with DHCP handing out addresses in the range 64-253
netgear is now 192.168.1.1/24 with disabled DHCP (I would have liked to use the range 2-63)
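A quick sanity check of that addressing plan, sketched with Python's ipaddress module. The addresses are the ones listed above; the variable names and the check itself are my own illustration, not something either router runs.

```python
import ipaddress

lan = ipaddress.ip_network("192.168.1.0/24")
o2_box = ipaddress.ip_address("192.168.1.254")
netgear = ipaddress.ip_address("192.168.1.1")

# DHCP pool handed out by the O2 box: host numbers 64 to 253 inclusive.
dhcp_pool = [lan.network_address + host for host in range(64, 254)]

# Both routers must sit in the same subnet...
both_on_lan = o2_box in lan and netgear in lan
# ...and neither static address may collide with a DHCP lease.
static_clash = o2_box in dhcp_pool or netgear in dhcp_pool

print(both_on_lan, static_clash, len(dhcp_pool))
```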
Wednesday, March 11, 2009
Gigabit Ethernet
... is actually quite fast. So fast, in fact, that the bottleneck on both PCs under my desk is currently the PCI bus into which the network cards are plugged. I had no idea that the bus ran at 33MHz and was only able to transfer 32 bits per cycle (math says: 33M * 32b = 1056Mb/s). Todo: is this duplex?
There's a very useful FAQ regarding Gigabit Ethernet available here.
In a test with 4 network cards (2 in each machine) I saw the following:
w) 1A -> 2C (93Mb/s)
x) 1A -> 2D (902Mb/s)
y) 1B -> 2C (93Mb/s)
z) 1B -> 2D (294Mb/s)
1A: Gigabit embedded on motherboard (1000 link at switch)
1B: Gigabit on PCI bus (1000 link at switch)
2C: Fast embedded on motherboard ( 100 link at switch)
2D: Gigabit on PCI bus (1000 link at switch)
Machine 1 was sending large UDP datagrams. Machine 2 was not listening; it was just up so that ARP could get the MAC addresses of its adapters (without which, we could not send a datagram).
Interestingly:
tests w + y appeared to be throttled by the router as it managed a mixed 100/1000 route
test x was great and showed that an onboard gigabit controller can actually do its job
test z showed the PCI bus being the bottleneck, allowing 3x faster than Fast, but 3x slower than Gigabit.
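The back-of-envelope numbers in this post can be reproduced in a few lines of Python. The figures are the ones measured above; the function name and the rounding are my own, and the PCI figure is the theoretical peak from the clock and width quoted earlier, not a measurement.

```python
def pci_peak_mbps(clock_mhz=33, width_bits=32):
    """Theoretical peak of the bus: one transfer of width_bits per clock cycle."""
    return clock_mhz * width_bits


fast, pci_gigabit, onboard_gigabit = 93, 294, 902  # Mb/s from tests w/y, z and x

print(pci_peak_mbps())                           # theoretical bus peak in Mb/s
print(round(pci_gigabit / fast, 1))              # test z versus Fast Ethernet
print(round(onboard_gigabit / pci_gigabit, 1))   # onboard gigabit versus test z
```

Both ratios come out at roughly 3, which is where the "3x faster than Fast, but 3x slower than Gigabit" summary comes from.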
Saturday, March 07, 2009
Extellect Utilities
I've finally put a set of C# productivity classes on Google Code under the Apache 2.0 License. So, check 'em out, see if they make your work any easier, and let me know what you think.
Remoting, bindTo and IPAddress.Any
Just solved a fun problem. While running a .NET remoting server on my ubuntu box with multiple NICs I saw some strange behavior where the server would return an alternate IP address, and the client would attempt to re-connect (using a new Socket) to the new IP address. Problem being: the server was responding with its localhost address (127.0.1.1), which the client was resolving to its own loopback adapter. See the problem yet?
It turns out that in the absence of a specific binding, the server binds to IPAddress.Any. When a client attempts to connect, it's redirected to the server's bound address. Unless client and server are hosted on the same physical machine, there's really no point in ever using the loopback adapter... which makes it a strange choice for default.
The solution:
Before you create and register your TcpServerChannel, you need to set some options.
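// Bind to one specific interface instead of the default (IPAddress.Any).
IDictionary properties = new Hashtable();
properties["bindTo"] = "dotted.quad.ip.address";
properties["port"] = port;
IChannel serverChannel = new TcpServerChannel(properties, new BinaryServerFormatterSinkProvider());
// Register the channel before exposing the service (false = don't require security).
ChannelServices.RegisterChannel(serverChannel, false);
RemotingConfiguration.RegisterWellKnownServiceType(typeof(Explorer), "Explorer.rem", WellKnownObjectMode.SingleCall);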
Voila! 'appiness should ensue...
PS. If for some reason the dotted quad doesn't appeal to your particular situation (e.g. load balancing), you can set two other properties instead:
properties["machineName"] = "load.balanced.server.name";
properties["useIpAddress"] = false;
PPS. I think the client will always make two socket connections, A + B. A is used at the start to do some initialization and get the address for connection B. B is used for the meat and bones of the operation, and finally A is used just before they're both closed.
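The difference between binding to "any" address and binding to one specific address is not peculiar to remoting; it shows up at the plain socket level too. Here is a rough Python analogue (not the remoting API; bound_address is a made-up helper, and port 0 simply asks the OS for any free port).

```python
import socket


def bound_address(host):
    """Bind a TCP socket to host on an OS-chosen port and report the local address."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[0]


# The wildcard address tells a caller nothing about which interface to use...
print(bound_address("0.0.0.0"))
# ...whereas an explicit bind reports exactly the address that was asked for.
print(bound_address("127.0.0.1"))
```

A server bound to the wildcard has to pick some address to advertise, and that choice is where the post's 127.0.1.1 surprise came from; binding explicitly removes the guesswork.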
Friday, March 06, 2009
Goodput - Season Finale
I thought I'd take a look at a couple of different (and by no means an exhaustive list of) options for transferring a reasonably large file across a network. Over the past couple of days I tried sending a 700MB DivX using the .NET Remoting API (over both TCP and HTTP), the .NET Sockets API (over TCP), and finally using a mounted network share, reading the file as if it was local.
The table that follows shows the results of these tests:
I'd advocate taking a pinch of salt when interpreting the numbers.
In general, remoting doesn't support the compiler-generated closures that C# emits for an iterator block (e.g. one using the yield return keyword): quickly remedied by exposing the IEnumerator<T> as a remote MarshalByRefObject itself, wrapping the call to the iterator block. This gave us a nice looking (easy to read) interface, but will have increased the chattiness of the application, as every call to MoveNext() and Current would have required a network call. Further to this, the default SOAP serialization used with HTTP remoting doesn't support generic classes, so I had to write a non-generic version of my Streamable<T> class.
The performance of the HTTP/SOAP remoting was abysmal and there was very little gain by switching to a faster network. Even with what I suspect to be a massively chatty protocol (mine, not theirs), the bottleneck was probably somewhere else.
TCP remoting was next up. Under the covers it will have done all the marshalling/unmarshalling on a single TCP socket, but the chatty protocol (e.g. Current, MoveNext(), Current, MoveNext() etc.) probably let it down. TCP/Binary remoting's performance jumped 2.5x when given a 10x faster network, indicating some other bottleneck as it still used just 16% of the advertised available bandwidth.
CIFS was pretty quick, but not as quick as the System.Net.Sockets approach. Both used around 30% of the bandwidth on the Gigabit tests, indicating that some kind of parallelism might increase the utilization of the network link. An inverse-multiplexer could distribute the chunks evenly (round-robin) over 3 sockets sharing the same ethernet link, and a de-inverse-multiplexer (try saying that 10 times faster, after a couple of beers) could put them together.
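For what it's worth, the round-robin idea can be sketched in a few lines of Python. This only shows the chunk bookkeeping, with lists standing in for the 3 sockets; it is a thought experiment rather than anything I measured, and both function names are invented.

```python
def inverse_multiplex(chunks, lanes=3):
    """Deal chunks out round-robin: chunk i goes to lane i % lanes."""
    return [chunks[i::lanes] for i in range(lanes)]


def reassemble(lane_chunks):
    """Undo inverse_multiplex by taking one chunk from each lane in turn."""
    out = []
    longest = max((len(lane) for lane in lane_chunks), default=0)
    for i in range(longest):
        for lane in lane_chunks:
            if i < len(lane):
                out.append(lane[i])
    return out


chunks = [bytes([n]) * 4 for n in range(10)]
lanes = inverse_multiplex(chunks)
print([len(lane) for lane in lanes])
print(reassemble(lanes) == chunks)
```

A real version would also have to cope with lanes delivering at different speeds, which is exactly the part this sketch leaves out.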
Back on track...
Seeing as TCP/Binary remoting was the problem area that drove me to research this problem, I thought I'd spend a little more time trying to optimise it - without changing the algorithm/protocol/interface - by parameterizing the block size. The bigger the block size, the fewer times the network calls MoveNext() and get_Current have to be made, but the trade-off is that we have to deal with successively larger blocks of data.
What the numbers say: transmission rate is a bit of an upside down smile; at very low block sizes the algorithm is too chatty, at 4M it's at its peak, and beyond that something else becomes the bottleneck. At the 4M peak, the remote iteration invocations would only have been called 175 times, and the data transfer rate was 263Mb/s (roughly 89% of the observed CIFS' 296Mb/s).
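The 175 and 89% figures fall straight out of the file size and the block size. A small Python check, assuming the 700MB file is 700 * 2^20 bytes and a 4M block is 4 * 2^20 bytes (remote_blocks is a name made up for the sketch):

```python
MIB = 2**20


def remote_blocks(file_bytes, block_bytes):
    """Number of blocks, and so remote iterations, needed to move the file."""
    return -(-file_bytes // block_bytes)  # ceiling division


file_size = 700 * MIB
for block in (64 * 1024, 1 * MIB, 4 * MIB, 16 * MIB):
    print(block, remote_blocks(file_size, block))

# Peak remoting rate as a share of the observed CIFS rate, in percent.
print(round(263 / 296 * 100))
```

Smaller blocks multiply the number of round trips quickly, which fits the "too chatty" end of the curve.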
Thursday, March 05, 2009
Full Duplex
Simple English: if two computers are connected by a full duplex ethernet link, then they should be able to carry out two conversations with each other simultaneously. For example, imagine two computers named A and B with a 100Mb/s full-duplex connection linking them both. A starts "talking" at 100Mb/s and B "listens". B also starts "talking" at 100Mb/s and A "listens". The total data moving up and down the link is 200Mb/s. That's full duplex, baby!
Only, in real life you don't get the full 100Mb/s in either direction. On my PC, I managed to get 91Mb/s in one direction and 61Mb/s in the other direction. If I stopped the 91Mb/s conversation (call it X), the 61Mb/s conversation (call it Y) would quickly use up the extra bandwidth, becoming a 91Mb/s conversation itself. As soon as I restarted X, it reclaimed its original 91Mb/s, and Y returned to its original 61Mb/s. Freaky.
Goodput - Part 2
So then I thought to myself, "Hey, you have two NICs in each machine. Why don't you try and get double the throughput?" Even though all my NICs are gigabit ethernet, my modem/router/switch is only capable of 10/100 (a gigabit switch is in the post, as I type). Yesterday's tests indicated that I was getting roughly 89Mb/s, so I'd be aiming for 178Mb/s with my current hardware setup. And a glorious (hypothetical) 1.78Gb/s when the parcel arrives from amazon.co.uk.
What would have to change? For starters, the server was binding one socket to System.Net.IPAddress.Any; we'd have to create two sockets and bind each one to its own IP address. Easy enough. The client would also have to connect to one of the two new server IP addresses.
Wait a minute... there isn't any System.Net.Sockets option on the client side to specify which ethernet adapter to use. You only specify the remote IP address. Oh no! This means we could end up sending/receiving all the data through just one of the client's NICs. Luckily, you can modify the routing table so that all traffic to a particular subnet can be routed via a specific interface. I'm using ubuntu as the client, and my routing table looks like this, which indicates that eth0 would get all the traffic to my LAN:
Destination     Gateway  Genmask          Flags  Metric  Ref  Use  Iface
192.168.1.0     *        255.255.255.0    U      1       0    0    eth0
192.168.1.0     *        255.255.255.0    U      1       0    0    eth1
I want to add a static route, with higher precedence than the LAN subnet, using eth1 for all communication with the remote IP 192.168.1.73, leaving eth0 with the traffic for the rest of the 192.168.1.0/24 subnet. I type this command at the console:
sudo route add -net 192.168.1.73 netmask 255.255.255.255 dev eth1
Disaster averted. The routing table now looks like this, and I'm happy to say my diagnostics report that I'm getting around 170Mb/s with my new trick in place. It's not the 178Mb/s I was hoping for (I've lost about 4.5% on each connection), but it's still 190% of the original throughput.
Destination     Gateway  Genmask          Flags  Metric  Ref  Use  Iface
192.168.1.73    *        255.255.255.255  UH     0       0    0    eth1
192.168.1.0     *        255.255.255.0    U      1       0    0    eth0
192.168.1.0     *        255.255.255.0    U      1       0    0    eth1
Throughput comparison reading a 700MB DivX file:
73Mb/s - Mounted network share using CIFS (although it also appeared to be caching the file on disk, incurring disk write penalties)
89Mb/s - 1x NIC using System.Net.Sockets
170Mb/s - 2x NIC using System.Net.Sockets
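Why does the host route win? The more specific (longer) prefix takes precedence over the /24 entries. Here is a simplified Python sketch of that selection using the routing table above. It ignores metrics and everything else a real kernel considers, ties go to the first entry listed, and ROUTES and pick_interface are names I made up.

```python
import ipaddress

# (destination network, interface), mirroring the routing table above
ROUTES = [
    ("192.168.1.73/32", "eth1"),
    ("192.168.1.0/24", "eth0"),
    ("192.168.1.0/24", "eth1"),
]


def pick_interface(destination, routes=ROUTES):
    """Return the interface of the matching route with the longest prefix."""
    dest = ipaddress.ip_address(destination)
    best = None
    for cidr, iface in routes:
        net = ipaddress.ip_network(cidr)
        # Strictly greater: on a tie the earlier route is kept.
        if dest in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, iface)
    return best[1] if best else None


print(pick_interface("192.168.1.73"))  # the one remote host pinned to eth1
print(pick_interface("192.168.1.10"))  # everything else on the LAN
```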
Goodput - Part 1
I've been interested recently in maximizing the application-level throughput of data across networks. The word I didn't know I was looking for - but found anyway - was goodput.
At first I tried streaming across a large DivX file. When I realised that I might be measuring disk seek time (doubtful, at the pitiful data rates I was achieving) I transitioned my tests to stream meaningless large byte arrays directly from memory (being careful not to invoke the wrath of the garbage collector, or any of the slow operating system-level memory functions).
What I noticed was that the application-level control of the stream was a big factor in slowing down the effective data transfer rate. In short, if my design of the logical/physical stream protocol was "bad", so would be the goodput.
Throughout the test, data was transmitted in chunks of varying sizes (from 4k to 128k). Firstly, just to see what it was like, I tried establishing a new System.Net.Sockets.Socket connection for each chunk. Not good. This is why database connection pooling has really gained ground. It's expensive to create new connections. Next I tried a single connection where the client explicitly requested the next chunk. Also really bad. It was chatty like an office secretary, and got less done. So I tried a design I thought would be pretty good, where 1 request resulted in many chunks being returned. For some reason, I thought that prepending each chunk with its size would be a good idea. It was 360% better than the previous incarnations, but the extra size information was just repeating data that wasn't at all useful to the protocol I had devised: it was wasting bits and adding extra CPU load, and giving nothing in return; it had to go. Stripping these items from the stream resulted in an extra 3.6% of throughput.
Interestingly, I noticed that the choice of buffer size could drastically affect the goodput, especially when it was ((1024*128)+4) bytes. I expect this was something to do with alignment. It would be cool to do some more tests, looking for optimal buffer sizes.
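To show the two stream designs side by side, here is a small Python sketch: one sender prepends each chunk with its size, the other relies on both ends agreeing on a fixed chunk size up front. It is not the code I benchmarked (that was C#), the 4-byte prefix and the 1024-byte CHUNK are assumptions, and a local socketpair stands in for the network.

```python
import socket
import struct

CHUNK = 1024  # assumed fixed chunk size known to both ends


def recv_exact(sock, n):
    """Keep reading until n bytes have arrived (or the peer closes)."""
    parts, remaining = [], n
    while remaining > 0:
        data = sock.recv(remaining)
        if not data:
            break
        parts.append(data)
        remaining -= len(data)
    return b"".join(parts)


def send_prefixed(sock, chunks):
    """Design 1: every chunk is preceded by a 4-byte big-endian length."""
    for chunk in chunks:
        sock.sendall(struct.pack("!I", len(chunk)) + chunk)


def recv_prefixed(sock, count):
    out = []
    for _ in range(count):
        (size,) = struct.unpack("!I", recv_exact(sock, 4))
        out.append(recv_exact(sock, size))
    return out


def send_raw(sock, chunks):
    """Design 2: no prefix; the receiver already knows the chunk size."""
    for chunk in chunks:
        sock.sendall(chunk)


def recv_raw(sock, count, chunk_size=CHUNK):
    return [recv_exact(sock, chunk_size) for _ in range(count)]


a, b = socket.socketpair()
chunks = [bytes([n]) * CHUNK for n in range(3)]
send_prefixed(a, chunks)
prefixed_ok = recv_prefixed(b, 3) == chunks
send_raw(a, chunks)
raw_ok = recv_raw(b, 3) == chunks
a.close()
b.close()
print(prefixed_ok, raw_ok)
```

The prefix only earns its keep when chunk sizes vary; with a fixed size it is the redundant information described above.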