Friday, March 11, 2011

Uncaging Cassandra

The ongoing debate between proponents of locking and multi-version concurrency control (MVCC) is almost as heated as that between supporters of vi and emacs. There is nothing fundamentally wrong with doing concurrency control in your data store using multiple reader, single writer locking. Except that multi-version concurrency control is often better.

In an MVCC system with ongoing transactions there may be several copies of a given item of data, representing its state at different points in time. A consistent read over multiple data items can be obtained by using the version of each item that most recently pre-dates the start of the transaction. In other words, the read simply ignores updates from others that occur after its transaction began. Using this system, readers never block each other, nor do they block writers.

Compared to locking, an MVCC approach places very little extra latency on reads. Instead of adding distributed communication to manage a global lock, there is just some extra data managed in the read - multiple copies of the value, from which one is selected for use according to the versioning information. Writes are a bit more involved, as they must preserve the old value and timestamp before updating the new one. So, the reduced read cost comes at the price of more storage space and slightly more activity on writes. Which, as it happens, is exactly the design philosophy on which Cassandra is built...

In most systems built using MVCC, the versioning is entirely server side and not exposed to the client. We don't currently have that luxury with Cassandra, but it does have some features we can use to layer MVCC on top of the existing storage engine. Since it's a schemaless environment, we can simply inline the versions as additional Columns in the existing row, rather than requiring distinct storage. A get becomes a get_slice and the client uses the timestamps to decide which value to use. Of course we may need to consider a more involved approach to storage if we're sharing the store with clients that are not MVCC aware, but that kind of problem is true of external locking solutions too. With the new expiring columns support in 0.7 we can also avoid doing explicit garbage collection of older values - just set the expiry to an interval longer than we expect any transaction to take.

Of course we can't use MVCC as a complete replacement for locking on Cassandra, as there is no CAS support. We're stuck with locking on write to avoid lost updates. Nevertheless, the mixed approach has got some interesting possibilities to explore. Now I just need a few extra hours in each day so that I have time to actually try it...

No comments: