Tuesday, March 8, 2011

More Speed!

"You don't have to outrun the bear, you just have to outrun the guy next to you."

Nevertheless it's always a good idea to run as fast as possible - there is no telling when the guy next to you may put on a sudden burst of acceleration.

So we did some more performance improvements on JBossTS.

For transactions that involve multiple XAResources, the protocol overhead is divided between network round trips to communicate between the TM and RMs, and disks writes to create a recovery log. There are a limited number of tricks that can be done with the network I/O (issuing the prepares in parallel rather than series springs to mind), but the disk I/O is another matter.

JBossTS writes its log through an abstract storage API called the ObjectStore. There are several ObjectStore implementations, including one that uses a relational database. Most however are filesystem based. The default store is an old, reliable piece of code that has worked well for years, but recent hardware evolution has led us to question some of the design decisions.

For years now multicore chips meant the number of concurrent threads in a typical production app server deployment rising and the processor capability in general outstripping the disk I/O capability. Relatively recently SSD have rebalanced things a bit, but we still see a large number of threads (transactions) contending for relatively limited I/O capability.

Time to break out the profiler again...

For total reliability a transaction log write must be forced to disk before the transaction can proceed. No in-memory fs block buffering by the O/S, thank you. This limits the ability of the O/S and disk to batch and reorder writes for efficiency. Essentially there are a fixed number of syncs() a drive can perform per second and for small data like tx logs it's not bottlenecked on the I/O bandwidth.

The design of the current default store is such that it syncs once for each transaction. This becomes a problem when the number of transactions your CPUs and RMs can handle exceeds the number of syncs your disk array can handle. Which is pretty much what's happened. So, back to the drawing board.

Clearly what we need is some way for multiple transaction log writes to be batched into a single sync. Which in turn means putting the log records in a single file, with all the packing and garbage collection problems that entails. Then you have the thread management problems, making sure individual writes blocks until the shared sync is done. That's a lot of fairly complex code and testing. Nightmare.

So we found someone who has already done it and borrowed their implementation. Got to love this open source thing :-)

The new ObjectStore implementation is based on Clebert's brilliant Journal code from HornetQ. The same bit of code that makes persistent messaging in HornetQ so staggeringly quick.

The Journal API is not a perfect for for what we want, but nothing that can't be fixed with another layer of abstraction. One adaptor layer later and we're ready to run another microbenchmark.

First let us get a baseline by using an in-memory ObjectStore (basically a ConcurrentHashMap). Useless for production use, but helpful to establish the runtime needed to execute a basic transaction with two dummy resources and all the log serialization overhead but no actual disk write.

53843 tx/second using 400 threads.

ok, that will do for starters - thanks to lock reduction and other performance optimizations made earlier we're pretty much saturating the quad core i7 CPU. We could probably improve matters a bit by tweaking the ConcurrentHashMap lock striping, but let's move on and swap in the default ObjectStore:

1650 tx/second using 100 threads.

iowait through the roof, CPU mostly idle. dear oh dear. Look on the bright side: that means we've got a lot of scope for improvement.

Adding more threads just causes more scheduling and locking overhead - we're bottlenecked on the number of disk syncs the 2x HDD RAID-1 can handle.

Let's wheel out the secret weapon and plug in it...

20306 tx/second at 400 threads.

yup, you read that right. Don't get too excited though - you won't see that kind of performance improvement in production. We're running an empty transaction with dummy resources - no business logic and no RM communication overhead. Still, pretty sweet huh? Better buy Clebert a beer next time you see him. And one for me too of course.

And if you thought that was good...

Linux has a nifty little library for doing efficient asynchronous I/O operations. If you are running the One True Operating System and you don't mind polluting your Pure Java with a little bit of native code you can drop a .so file into LD_LIBRARY_PATH and leave the other guy to be eaten by that bear:

36491 tx/second at 400 threads.

Coming Soon to a JBossTS release near you. Enjoy.

1 comment:

Andrig T Miller said...

Awesome stuff. I can't wait to get my hands on this for some testing in AS 7!

Great work!