This post is part of a series describing the improvements in NServiceBus 6.

The new RabbitMQ transport for NServiceBus 6 has one overriding theme: speed.

Although we've added a few other features as well, the biggest news is how much faster we've made the new version of the RabbitMQ transport. We've redesigned the message pump to be more efficient, so it can handle more incoming messages. Outgoing messages are sent faster. We've even contributed changes to the official RabbitMQ Client project to increase its performance. Almost everything we've done was focused on making your systems faster and more efficient.

Let's take a closer look at the improvements to the RabbitMQ transport and find out just how much faster the new version is.

Faster receiving

In order to handle multiple incoming messages more efficiently, we redesigned the RabbitMQ message pump. Now it's a lot easier to scale a single endpoint for maximum performance while receiving messages.

Previously, if you set the NServiceBus concurrency settings to have up to five messages processed concurrently, the message pump would create five separate polling loops, channels, and queue consumers. In addition to this, a PrefetchCount value in the connection string controlled the number of messages the broker would send a consumer before waiting for message acknowledgement. This prefetch count made sure each consumer always had enough work to keep itself busy, but each consumer would apply this value separately. As a result, an endpoint could end up prefetching more messages than might be expected (Concurrency × PrefetchCount).

While this approach worked, it was more complex than it needed to be. The use of multiple polling loops and consumers put an upper limit on the amount of concurrent work that could be done, and finding the optimal balance between the concurrency level and the prefetch count could require a lot of trial and error.

The new design makes this much simpler. Instead of creating multiple polling loops, the new message pump doesn't have any loops at all. Now it uses an event-based consumer and creates a Task to handle each incoming message. And it does that without any extra channels or consumers either.
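To make the idea concrete, here is a minimal sketch of an event-driven pump that starts one Task per message instead of running polling loops. It is an illustration only, not the transport's actual code: `EventDrivenPump`, `OnMessageReceived`, and `WhenIdle` are hypothetical names, and limiting concurrency with a `SemaphoreSlim` is an assumption about how such a pump could be built.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical model of a loop-free pump: the consumer raises a callback
// for each delivery, and the pump starts one Task per message.
public sealed class EventDrivenPump
{
    private readonly SemaphoreSlim concurrencyLimiter;
    private readonly Func<string, Task> handler;
    private readonly ConcurrentBag<Task> started = new ConcurrentBag<Task>();

    public EventDrivenPump(int maxConcurrency, Func<string, Task> handler)
    {
        concurrencyLimiter = new SemaphoreSlim(maxConcurrency);
        this.handler = handler;
    }

    // Stand-in for the consumer's "message received" event handler.
    public void OnMessageReceived(string message)
    {
        started.Add(ProcessAsync(message));
    }

    private async Task ProcessAsync(string message)
    {
        // Wait asynchronously for a free slot; no thread is blocked here.
        await concurrencyLimiter.WaitAsync().ConfigureAwait(false);
        try
        {
            await handler(message).ConfigureAwait(false);
        }
        finally
        {
            concurrencyLimiter.Release();
        }
    }

    // Completes once every message received so far has been handled.
    public Task WhenIdle() => Task.WhenAll(started);
}
```

In this model a single consumer feeds all handlers, so raising the concurrency limit only changes the semaphore's size rather than adding loops, channels, or consumers.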

The new design also sets the consumer's PrefetchCount to three times the Concurrency value by default so that the endpoint can continue processing messages without waiting for more to be fetched from the server. This default makes it easier to scale effectively without as much trial and error, as there aren't as many switches and levers you need to experiment with. However, for those intent on tweaking for maximum performance, the default multiplier can be changed, or you can override the whole formula with a specific value.
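The arithmetic behind the old and new prefetch behavior can be sketched as follows. `PrefetchMath` and its method names are hypothetical helpers written for this illustration; they are not part of the transport's API. The only facts taken from the description above are the old formula (Concurrency × PrefetchCount), the new default multiplier of three, and the option to override either the multiplier or the whole value.

```csharp
using System;

// Hypothetical helper that mirrors the prefetch arithmetic described above.
public static class PrefetchMath
{
    // Old design: every consumer applied PrefetchCount separately,
    // so the endpoint as a whole could prefetch concurrency × prefetchCount.
    public static int OldEndpointPrefetch(int concurrency, int prefetchCountPerConsumer)
        => concurrency * prefetchCountPerConsumer;

    // New design: a single consumer whose prefetch defaults to 3 × concurrency.
    // The multiplier can be changed, or an explicit value can replace the formula.
    public static int NewEndpointPrefetch(
        int concurrency,
        int multiplier = 3,
        int? explicitPrefetchCount = null)
        => explicitPrefetchCount ?? concurrency * multiplier;
}
```

For example, with a concurrency of 5, `OldEndpointPrefetch(5, 10)` gives 50 messages, while `NewEndpointPrefetch(5)` gives 15, `NewEndpointPrefetch(5, multiplier: 4)` gives 20, and `NewEndpointPrefetch(5, explicitPrefetchCount: 100)` gives 100.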

Increasing receive performance and efficiency is a big win for any message-based system. But we didn't stop there. We wanted to make sending messages faster too.

Faster sending

In the previous version of the transport, whenever you sent a message outside of a message handler, we had to create and open a new channel, use that channel to send the message, and then close the channel. Closing the channel also blocked the calling thread so it could wait for confirmation that the broker received the message. All this resource allocation and thread blocking was a significant drain on performance.

In the new version of the transport, we keep a pool of open channels for sending messages. If there is an unused channel in the pool when one is needed, it's reused. Otherwise, a new channel is created and added to the pool when the sending code finishes with it. This means there's no longer a channel opening/closing cost incurred per message.
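The pooling behavior described above can be modeled in a few lines. This is a sketch under stated assumptions, not the transport's implementation: `ChannelPool<TChannel>`, `Rent`, and `Return` are hypothetical names, and the channel type is left generic so the example does not depend on the RabbitMQ client library.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical channel pool: reuse an idle channel when one exists,
// otherwise open a new one; channels enter the pool when they are returned.
public sealed class ChannelPool<TChannel>
{
    private readonly ConcurrentQueue<TChannel> idle = new ConcurrentQueue<TChannel>();
    private readonly Func<TChannel> openChannel;

    public ChannelPool(Func<TChannel> openChannel)
    {
        this.openChannel = openChannel;
    }

    // Take an unused channel from the pool, or open a new one if the pool is empty.
    public TChannel Rent()
        => idle.TryDequeue(out var channel) ? channel : openChannel();

    // Called when the sending code is finished; the channel stays open for reuse.
    public void Return(TChannel channel) => idle.Enqueue(channel);
}
```

Sending code would call `Rent`, publish on the channel, and call `Return` in a `finally` block, so the cost of opening a channel is paid only when the pool has no idle channel to hand out.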

Instead, we now create a Task per message and can verify that messages are delivered to the broker without blocking any threads. The use of tasks also allows you to send messages in parallel by starting all your send operations, collecting the tasks, and then waiting for them all to finish with a single await Task.WhenAll(tasks). This is extremely useful in fan-out situations, such as when you process files coming from a third party and send out an individual message for each record in the file.
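The fan-out pattern mentioned above might look like the sketch below. `FanOut.SendAll` and its `send` parameter are hypothetical: `send` stands in for whatever asynchronous send call your endpoint uses, and the example deliberately avoids naming a specific NServiceBus API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class FanOut
{
    // `send` is a hypothetical stand-in for an asynchronous send operation.
    public static Task SendAll<TRecord>(
        IEnumerable<TRecord> records,
        Func<TRecord, Task> send)
    {
        // Start every send operation first. ToList forces the lazy Select
        // to run now, so all sends are in flight before anything is awaited.
        List<Task> tasks = records.Select(send).ToList();

        // A single task that completes when all sends have completed.
        return Task.WhenAll(tasks);
    }
}
```

A caller would write `await FanOut.SendAll(records, record => ...)`, passing a lambda that sends one message per record. If any send fails, the task returned by `Task.WhenAll` faults, so failures surface at that single await.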

With these improvements, sending is much faster across the board. But there were even more performance gains to be made deeper down the call stack.

Faster internals

While we were doing performance testing on the new version of the RabbitMQ transport, we noticed a serious performance drop on more modern CPUs. On a machine with a brand new Skylake i7-6700K, we saw performance that was five times worse than on an older Sandy Bridge i7-2600K. Even an older Core 2 Duo machine was outperforming the brand new Skylake CPU.

Upon further testing, we discovered that performance suffered for every processor generation after Sandy Bridge. The effect was the most pronounced on the newest Skylake processors, which really should have been the fastest of the bunch.

Since the RabbitMQ .NET client is also open source, we were able to track down some nasty lock contention occurring in its ConsumerWorkService, develop a fix, and get it accepted into the 4.1.0 release. The result is faster performance for all developers using the .NET RabbitMQ library, including those on NServiceBus.

How much faster?

The results are pretty amazing. The RabbitMQ transport is over five times as fast as before, both at sending and receiving messages.

Prior to the release of NServiceBus 6, we ran comprehensive performance tests between NServiceBus versions 5 and 6. In both cases, we used RabbitMQ Server 3.6.5 on Erlang/OTP 18.3. The hardware used for the benchmark doesn't matter much [1] because we compared the throughput of different versions on the same hardware. The only important detail about the hardware setup is that we used a workstation with a Skylake CPU, which suffers from the performance bug mentioned earlier.

We specifically compared three versions of the RabbitMQ transport:

  • 3.4 – NServiceBus 5 with the RabbitMQ client containing the lock contention bug
  • 3.5 – NServiceBus 5 with the updated RabbitMQ client fixing the lock contention bug
  • 4.1 – NServiceBus 6 with the RabbitMQ client lock contention fix as well as all of the performance improvements in NServiceBus 6 and the newest version of the transport

Each test case was run three times, and the fastest result for each scenario was used.

The following table shows the throughput improvement in each comparison. For instance, a value of 2.0 would mean that the newer version handled twice as many messages per second.

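Matchup                  | Versions compared | Send throughput improvement | Receive throughput improvement
-------------------------|-------------------|-----------------------------|-------------------------------
V6 improvements only     | 3.5 => 4.1        | 5.45                        | 1.69
V6 + lock contention fix | 3.4 => 4.1        | 5.54                        | 6.66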

The message here is clear. The RabbitMQ transport is fast – more than five times as fast as before, both at sending and receiving messages.
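For readers who want to work with these ratios, the sketch below shows how an improvement factor is computed and how the two published rows can be combined. `ThroughputRatios` is a hypothetical helper, and the derived 3.4 => 3.5 figure is an inference, not a measured result: it assumes both rows were computed against the same 4.1 measurement.

```csharp
using System;

// Hypothetical helper for reasoning about throughput improvement factors.
public static class ThroughputRatios
{
    // An improvement of 2.0 means the newer version handled
    // twice as many messages per second as the older one.
    public static double Improvement(double oldMessagesPerSecond, double newMessagesPerSecond)
        => newMessagesPerSecond / oldMessagesPerSecond;

    // If A => C and B => C are both known, the implied A => B factor is
    // (C / A) / (C / B) = B / A, assuming the same C measurement in both.
    public static double ImpliedStep(double fullImprovement, double laterStepImprovement)
        => fullImprovement / laterStepImprovement;
}
```

Under that assumption, `ImpliedStep(6.66, 1.69)` is roughly 3.9 for receiving and `ImpliedStep(5.54, 5.45)` is roughly 1.02 for sending, which would suggest that the lock contention fix mostly benefits the receive side.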

Other features

Even though making things go fast is one of our favorite things to do, we managed to make a few other improvements in the RabbitMQ transport as well.

Security

As more infrastructure moves to the cloud, it is becoming increasingly important for systems to be able to communicate securely, whether running in the same rack or on opposite sides of the planet.

To enable secure communication with RabbitMQ brokers, we added support for the AMQPS protocol. If your broker already has a certificate installed, securing the connection is as simple as adding UseTls=true to your connection string. For additional security, we also support client-side authentication by using client certificates. These features were back-ported to the 3.2.0 release of the transport, so it's usable with NServiceBus 5 as well.

As a result, you can now easily use a hosted RabbitMQ provider like CloudAMQP to manage and maintain the servers and configuration for you. This makes RabbitMQ a much more attractive transport when it comes to deploying solutions to the cloud.

Built-in connection auto recovery

Another improvement makes the RabbitMQ connection recovery process more efficient. When there is a connection interruption between the endpoint and the broker, the transport needs to be able to reestablish the connection, recreate the channels on that connection, and resume message consumption.

When we first built the RabbitMQ transport, the .NET RabbitMQ client did not have any concept of automatic connection recovery. So we built our own. Over time, the RabbitMQ client has added this feature, so we are now using the built-in auto-recovery in the transport. Being closer to the metal, the RabbitMQ client is able to handle connection recovery much more transparently, without having to destroy and recreate components left in a faulted state. From the outside, the connection more or less appears to pause and then resume automatically, rather than spamming the logs with error messages.

Summary

Rabbits should be fast, so we made the RabbitMQ transport go fast. Suffice it to say, this is the fastest RabbitMQ transport we've ever delivered.

Together with the improvements in security and connection auto-recovery, there are now a lot more reasons to consider building an NServiceBus solution with RabbitMQ.

So go ahead and give NServiceBus 6 a try today.


About the author: Brandon Ording is a developer at Particular Software who maintains both the NServiceBus core and the RabbitMQ transport. He used to have rabbits as a kid, and finds the digital ones to be much easier to care for.

[1] The actual hardware used for the performance benchmark was a workstation with an Intel Core i7-6700K "Skylake" CPU at 4.4 GHz, a RAID 1 hard disk array (non-SSD), and 32 GB of RAM. RAM utilization was quite low during the test and shouldn't be considered an important factor. Since RabbitMQ performance is heavily I/O bound, faster SSDs would be a great way to improve overall throughput. But since the benchmark measures relative improvement on the same hardware, the disks used largely don't matter either.
