Tuesday, April 7, 2009

Life is Like a Box of Sockets

You never know what you're going to get.

It's interesting--I've been using my sockets class in programs I've built for the last three years.  Over that time, the code went mostly unchanged.  It worked and it worked well.  This tried-and-true code was a threaded, synchronous sockets class that was laden with graceful error handling and offered ease of use as well as high performance and scalability.

Until my latest project, anyway.  There are apparently some issues with using synchronous sockets in heavily threaded code, especially when it comes to event timing and delegated processing.  My file transfer service application exposed these issues because it has a lot of concurrent sockets connecting, disconnecting, and reconnecting very rapidly.  Normally, when a connection or two connects and then disconnects when finished, nothing bad happens.  This new app is different.  Very quickly I started hitting conditions where the app would just stop altogether.  Deadlocks, occasional exceptions, etc.; I had them all.  After fighting with it for a day or more, I decided to hell with it and began a new sockets class.

My new sockets are a hybrid synchronous-asynchronous design.  Sends for both the client and server remain synchronous so as to guarantee that data is sent in the order it was submitted.  I could create a queue and do the sends in an async manner, but there's no point when a synchronous send gets me the same effect.  Receives, connects, and disconnects are all asynchronous.  This has brought me no end of grief while figuring out the unique quirks of the delegated callback system in the .NET framework.  One of the earliest oddities ended up being a savior--pending callbacks still fire when the connection behind them goes away.  As such, if the other end disconnects, the data-received callback fires with 0 bytes.  Since a receive on a live TCP connection never completes with zero bytes (a zero-length send puts nothing on the stream), that's basically a good way to tell "hey, the thing on the other end died--disconnect time!"  As such, there's no need to actually poll sockets on a timed interval anymore.  Goodbye, threaded loop.  That's one less synchronization problem to worry about.
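The zero-byte signal is not specific to .NET.  Here is a minimal sketch in Python (not the class described above, and blocking rather than callback-driven) showing the same convention: a receive that returns zero bytes means the other end has closed.  The function name and the use of socket.socketpair() are my own choices for illustration.

```python
import socket


def receive_until_closed(sock):
    """Collect data until recv() returns b"", which means the peer closed."""
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:  # zero bytes: the other end has shut down
            break
        chunks.append(data)
    return b"".join(chunks)


# A connected pair of sockets stands in for client and server.
client, server = socket.socketpair()
client.sendall(b"hello")
client.close()  # the "thing on the other end" goes away

received = receive_until_closed(server)
server.close()
print(received)
```

No timer and no polling loop are needed to notice the disconnect; the receive path reports it on its own.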

Perhaps my biggest failing was taking so long to figure out that my events would fire in unpredictable order on the server side until I synclocked the globally-accessed resources, such as socket profiles, during both connect and disconnect events.  Until then, something like this could happen:  Client on socket 3 disconnects, reconnects, and sends a message.  When that happens really fast, the asynchronous callbacks are not guaranteed to fire in the order the events occurred.  The disconnect and the new incoming data may well end up in two callbacks that run simultaneously.  While one is busy servicing the disconnect, the incoming data gets processed and all of a sudden the socket gets yanked out from under the data receipt handler.  Whoops.  Your socket encountered an access exception when not expected!  WTF?
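The shape of the fix can be sketched in a few lines.  This is Python rather than VB.NET, with threading.Lock standing in for SyncLock; the SocketRegistry class, its method names, and the profile dictionary are all hypothetical stand-ins for illustration, not the real class.

```python
import threading


class SocketRegistry:
    """Shared table of per-socket profiles, guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._profiles = {}

    def on_connect(self, socket_id, profile):
        with self._lock:
            self._profiles[socket_id] = profile

    def on_disconnect(self, socket_id):
        with self._lock:
            return self._profiles.pop(socket_id, None)

    def on_data(self, socket_id, data):
        # Look up and use the profile under the same lock, so a disconnect
        # cannot remove it halfway through handling the data.
        with self._lock:
            profile = self._profiles.get(socket_id)
            if profile is None:
                return False  # socket already gone; drop the data quietly
            profile["bytes"] += len(data)
            return True

    def count(self):
        with self._lock:
            return len(self._profiles)


registry = SocketRegistry()


def churn(socket_id):
    # Connect, send, disconnect, over and over, as fast as possible.
    for _ in range(1000):
        registry.on_connect(socket_id, {"bytes": 0})
        registry.on_data(socket_id, b"x" * 10)
        registry.on_disconnect(socket_id)


threads = [threading.Thread(target=churn, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(registry.count())
```

The point is that the data handler takes the same lock as connect and disconnect, so it either sees a live profile or sees nothing at all, never a half-removed one.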

Anyway, things are looking up.  I've been running a test for a while now, sending sequentially numbered messages from 24 different client instances, all constantly connecting to the server, sending, and disconnecting in random order.  Guess what.  70 GB has been transmitted so far without incident.  Not a single out-of-order packet, mismatched socket ID, or anything else, with packets of random sizes ranging from 10 bytes to 64 KB.  Excellent.
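For anyone wanting to try a similar check, here is a rough Python sketch of the idea on a single connection.  The wire format (a sequence number and a payload length in front of each message) is purely my assumption for the example, as are all the function names; the real test's format and its 24 clients are not reproduced here.

```python
import os
import random
import socket
import struct
import threading

# Assumed framing: 4-byte sequence number, then 4-byte payload length.
HEADER = struct.Struct("!II")


def send_numbered(sock, count, rng):
    """Send `count` messages with random payloads of 10 bytes to 64 KB."""
    for seq in range(count):
        payload = os.urandom(rng.randint(10, 64 * 1024))
        sock.sendall(HEADER.pack(seq, len(payload)) + payload)
    sock.close()


def recv_exact(sock, n):
    """Read exactly n bytes, or fail if the peer closes first."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-message")
        buf.extend(chunk)
    return bytes(buf)


def check_sequence(sock, count):
    """Return True if messages 0..count-1 arrive complete and in order."""
    for expected in range(count):
        seq, length = HEADER.unpack(recv_exact(sock, HEADER.size))
        if seq != expected:
            return False
        recv_exact(sock, length)
    return True


left, right = socket.socketpair()
sender = threading.Thread(target=send_numbered,
                          args=(left, 50, random.Random(7)))
sender.start()
in_order = check_sequence(right, 50)
sender.join()
right.close()
print(in_order)
```

The sender runs on its own thread so that a full socket buffer cannot stall the checker.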

Hopefully I make a good backup before I manage to break it again.  Heh.  This story isn't over yet.  There are still more things to test and more things to go wrong.  I'll surely run into something.
