TL;DR: After temporary allocations were removed from the Azure Event Hubs partition-key encoding path, the Jenkins lookup3 hash loop itself became the next interesting place to look. Tightening that loop reduced CPU overhead, but it also raised the bar for review, portability, and correctness.
I like performance work most when it starts with a boring question: why is this small method showing up so much?
That question came up while looking at the Azure Event Hubs client. Event Hubs is built for streaming data at scale. Applications publish large numbers of events, often using a partition key so related events land in the same partition. The SDK has to turn that key into a partition choice, and part of that process is a hash function.
Hashing a string doesn’t sound expensive. It usually isn’t. But “usually” disappears when the method runs on the publishing hot path for every event. In cloud systems, tiny costs do not stay tiny for long. They multiply by event count, by customer count, by instance count, and by time.
A quick recap
Event Hubs uses partition keys to keep related events together without forcing publishers to pick a specific partition identifier themselves. If two events use the same partition key, the resolver should make the same partition choice for both.
The first post in this series looked at the encoding side of that path: turning the partition key into bytes without allocating a temporary byte[] for every publish. This post looks at the next layer down: the hash loop that consumes those bytes.
The Azure Event Hubs resolver uses Bob Jenkins’ lookup3 algorithm, a non-cryptographic hash designed for fast lookup and distribution workloads rather than security. lookup3 processes input in 12-byte chunks and mixes three 32-bit values at a time, which helps explain why the implementation looks unusually low-level compared to simpler string hashing code.
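For readers who have not seen lookup3 before, the core mix step can be sketched in C# roughly as follows. The rotation amounts come from Bob Jenkins' public-domain lookup3.c. The names Lookup3Sketch and Mix are hypothetical and are not taken from the Azure SDK source, so treat this as an illustration of the shape of the work, not the SDK's implementation.

```csharp
using System.Numerics;

// Hypothetical illustration type; not the Azure SDK's resolver.
public static class Lookup3Sketch
{
    // One lookup3 "mix" round over the three 32-bit state words.
    // Rotation amounts follow Bob Jenkins' public-domain lookup3.c.
    public static void Mix(ref uint a, ref uint b, ref uint c)
    {
        unchecked // wrap-around arithmetic is part of the algorithm
        {
            a -= c; a ^= BitOperations.RotateLeft(c, 4);  c += b;
            b -= a; b ^= BitOperations.RotateLeft(a, 6);  a += c;
            c -= b; c ^= BitOperations.RotateLeft(b, 8);  b += a;
            a -= c; a ^= BitOperations.RotateLeft(c, 16); c += b;
            b -= a; b ^= BitOperations.RotateLeft(a, 19); a += c;
            c -= b; c ^= BitOperations.RotateLeft(b, 4);  b += a;
        }
    }
}
```

In the full algorithm, each 12-byte chunk adds one 32-bit word to each of a, b, and c before a round like this runs, which is why the loop reads three words per iteration.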
The problem was in the hash loop
After my earlier optimization removed temporary allocations from the partition-key encoding path, what remained was CPU time inside the Jenkins lookup3 hash loop itself, which made it the next interesting place to look.
The existing implementation already used span-based reads, which is a good starting point. But repeated slicing and indexed access can still leave bounds checks and extra instructions in a tight loop. In the follow-up PR, I rewrote the 12-byte mix loop around a ref-based traversal pattern using unaligned reads. Instead of asking the runtime to check that every slice and index access is in range, the loop computes the safe number of chunks once and then advances a byte reference through that region.
The changelog entry for the change summarized the measured result as up to 39% faster hash calculation across the tested input sizes, with the most notable gains for smaller keys between 8 and 32 bytes. That improvement came on top of the earlier allocation work, while preserving the exact bit-for-bit Jenkins lookup3 output.
    // data holds the encoded partition-key bytes; chunks is the number of
    // full 12-byte blocks, computed once before the loop.
    ref byte current = ref MemoryMarshal.GetReference(data);
    for (int index = 0; index < chunks; index++)
    {
        uint first = Unsafe.ReadUnaligned<uint>(ref current);
        // current is a byte reference, so these offsets are byte offsets.
        uint second = Unsafe.ReadUnaligned<uint>(ref Unsafe.Add(ref current, 4));
        uint third = Unsafe.ReadUnaligned<uint>(ref Unsafe.Add(ref current, 8));
        current = ref Unsafe.Add(ref current, 12);
        // Jenkins lookup3 mix step continues here.
    }
This is the kind of code that should make reviewers slow down. Unsafe.ReadUnaligned<T> is powerful, but it removes some guard rails. The benefit is that the code can express exactly how the loop walks the input bytes. The cost is that everyone involved in the change has to reason carefully about alignment, bounds, platform behavior, and bit-for-bit compatibility.
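To make the "compute the safe number of chunks once" idea concrete, here is a minimal, self-contained sketch. It is not the SDK's code. It assumes the loop follows the lookup3.c convention of leaving the final 1 to 12 bytes for a separate tail step, and the names ChunkWalkSketch, CountFullChunks, FoldWithRefs, and FoldWithSlices are hypothetical. The multiply-and-XOR fold is only a placeholder for the real mix step, so that the two traversal styles can be compared.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Hypothetical illustration type; not the Azure SDK's resolver.
public static class ChunkWalkSketch
{
    // Assumption: like lookup3.c, the main loop runs only while more than
    // 12 bytes remain, so the last 1..12 bytes are left for the tail step.
    public static int CountFullChunks(int length) =>
        length <= 12 ? 0 : (length - 1) / 12;

    // Ref-based walk: the chunk count is the only bounds reasoning needed.
    public static uint FoldWithRefs(ReadOnlySpan<byte> data)
    {
        int chunks = CountFullChunks(data.Length);
        uint acc = 0;
        ref byte current = ref MemoryMarshal.GetReference(data);
        for (int index = 0; index < chunks; index++)
        {
            uint first = Unsafe.ReadUnaligned<uint>(ref current);
            uint second = Unsafe.ReadUnaligned<uint>(ref Unsafe.Add(ref current, 4));
            uint third = Unsafe.ReadUnaligned<uint>(ref Unsafe.Add(ref current, 8));
            current = ref Unsafe.Add(ref current, 12);
            acc = unchecked((acc * 31) ^ first ^ second ^ third); // placeholder for the mix
        }
        return acc;
    }

    // Slice-based baseline: every Slice and Read is bounds-checked.
    public static uint FoldWithSlices(ReadOnlySpan<byte> data)
    {
        int chunks = CountFullChunks(data.Length);
        uint acc = 0;
        for (int index = 0; index < chunks; index++)
        {
            ReadOnlySpan<byte> chunk = data.Slice(index * 12, 12);
            uint first = MemoryMarshal.Read<uint>(chunk);
            uint second = MemoryMarshal.Read<uint>(chunk.Slice(4));
            uint third = MemoryMarshal.Read<uint>(chunk.Slice(8));
            acc = unchecked((acc * 31) ^ first ^ second ^ third); // same placeholder
        }
        return acc;
    }
}
```

Both methods read in the machine's native byte order, so they should agree on any platform. The ref-based version is only safe because CountFullChunks guarantees that every read stays inside the span.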
That trade-off only makes sense because this method sits on a hot path in a high-throughput SDK. I would not use this style for ordinary application code unless a profiler had already proven the method deserved that attention.
Correctness still comes first
One of the best parts of the pull request history is not the optimization itself. It is the review discussion around correctness.
Hashing code has a strict contract. The optimized code must not merely be faster. It must map every input to exactly the same output as before. Otherwise, a performance improvement becomes a behavior change.
For Event Hubs, that behavior change would be operationally visible. A different hash output would not just be a correctness bug. It could change partition affinity for existing workloads after an SDK upgrade.
During the first optimization pass, the review discussion surfaced a subtle issue around endianness. Reading four bytes into an integer can mean different things depending on whether a machine is little-endian or big-endian. The original lookup3 implementation distinguishes between little-endian and big-endian processing paths, so this concern is part of the algorithm’s design rather than just a theoretical portability issue. Most developer machines and cloud hosts are little-endian today, but portable code cannot rely on “most” unless the behavior is intentional.
That discussion led to a deliberate choice: make the byte interpretation consistent. Before merging, we validated the behavior with a large set of random partition keys, and the newer unsafe implementation was tested on x64, ARM64, and Apple silicon.
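One way to pin down the byte interpretation, and to check it against a safe baseline, might look like the sketch below. It assumes little-endian is the intended interpretation, which the text above does not state explicitly. The names EndianReadSketch, ReadSafe, ReadFast, and AgreeOnRandomInput are hypothetical. This is a miniature of the idea, not the validation that ran on the pull request.

```csharp
using System;
using System.Buffers.Binary;
using System.Runtime.CompilerServices;

// Hypothetical illustration type; not the Azure SDK's resolver.
public static class EndianReadSketch
{
    // Safe baseline: always interprets the four bytes as little-endian.
    public static uint ReadSafe(ReadOnlySpan<byte> source) =>
        BinaryPrimitives.ReadUInt32LittleEndian(source);

    // Fast path: native-order unaligned read, byte-swapped on big-endian
    // hosts so the result matches ReadSafe on every platform.
    public static uint ReadFast(ref byte source)
    {
        uint value = Unsafe.ReadUnaligned<uint>(ref source);
        return BitConverter.IsLittleEndian
            ? value
            : BinaryPrimitives.ReverseEndianness(value);
    }

    // Differential check: random buffers, every valid (including unaligned) offset.
    public static bool AgreeOnRandomInput(int seed, int iterations)
    {
        var random = new Random(seed);
        byte[] buffer = new byte[64];
        for (int i = 0; i < iterations; i++)
        {
            random.NextBytes(buffer);
            for (int offset = 0; offset + 4 <= buffer.Length; offset++)
            {
                uint expected = ReadSafe(buffer.AsSpan(offset));
                uint actual = ReadFast(ref buffer[offset]);
                if (expected != actual)
                {
                    return false;
                }
            }
        }
        return true;
    }
}
```

A fixed seed keeps any failure reproducible. The same pattern scales up to comparing a whole optimized hash against the original implementation across many random keys.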
Performance work is humbling because the compiler, runtime, processor, and operating system all get a vote. A benchmark that looks good on one laptop is only a starting point.
Why this matters in the cloud
Cloud cost is often discussed in terms of service tiers, instance sizes, and autoscaling rules. Those matter. But the code inside the process also affects how much capacity an application needs.
If a publisher spends less CPU time per event, it can push more events through the same instance. If it allocates less memory per event, the garbage collector runs less often. If the client library does less work, every application using that library benefits after upgrading.
That is the quiet value of SDK optimization. Nobody adds a feature flag. Nobody changes their architecture. The system just has a little more room to breathe.
The gains are also shared. An optimization in a widely used library may save a few microseconds in one application, but across many applications and many hours, that small saving becomes meaningful. This is one reason open source performance work is satisfying: the benefit compounds outside the original benchmark.
When should you use these techniques?
Start with measurement. If the code is not on a hot path, keep the readable version. The simplest code is usually the cheapest code to maintain.
When measurement shows that a method does matter, look for allocation and copying first. Search for ToArray, GetBytes, stream-to-array conversions, repeated LINQ operations, and temporary collections. These are often easier to improve than the algorithm itself.
Use span-based overloads when the data does not need to be owned. Prefer stack memory only for small, bounded, synchronous work. Use array pooling for larger buffers when ownership does not escape the method. Reach for Unsafe only after the safer options are exhausted and the method has tests that protect its behavior.
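The stack-or-pool advice can be sketched as follows. This is an illustration under stated assumptions, not the SDK's code: the 256-byte threshold is arbitrary, the names EncodingSketch, StackLimit, and SumUtf8Bytes are hypothetical, and summing the bytes is only a stand-in for whatever the real method does with them, such as hashing.

```csharp
#nullable enable
using System;
using System.Buffers;
using System.Text;

// Hypothetical illustration type; not the Azure SDK's resolver.
public static class EncodingSketch
{
    // Assumed threshold: small keys use the stack, larger ones use the pool.
    private const int StackLimit = 256;

    // Encodes the key as UTF-8 without allocating a fresh byte[] per call,
    // then consumes the bytes (a simple sum stands in for hashing).
    public static int SumUtf8Bytes(string key)
    {
        int maxBytes = Encoding.UTF8.GetMaxByteCount(key.Length);
        byte[]? rented = null;
        Span<byte> buffer = maxBytes <= StackLimit
            ? stackalloc byte[StackLimit]
            : (rented = ArrayPool<byte>.Shared.Rent(maxBytes));
        try
        {
            int written = Encoding.UTF8.GetBytes(key, buffer);
            int sum = 0;
            foreach (byte value in buffer.Slice(0, written))
            {
                sum += value;
            }
            return sum;
        }
        finally
        {
            // The rented array never escapes the method, so it is safe to return.
            if (rented is not null)
            {
                ArrayPool<byte>.Shared.Return(rented);
            }
        }
    }
}
```

The buffer is used synchronously and never leaves the method, which is what makes both the stack allocation and the pooled array safe here.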
Most importantly, keep the trade-off visible. The optimized partition key resolver is harder to read than the original version. That cost is acceptable only because the method runs at scale, benchmarks showed meaningful improvement, and review validated correctness across platforms.
What I took away
This was not about making clever code for its own sake. It was about finding a small piece of code that a large system executed again and again, then asking whether the work it did was necessary.
The later pass reduced overhead inside the Jenkins lookup3 loop. It kept the same goal as my earlier allocation work: do less work while preserving behavior.
There are more stories like this hiding in production libraries. Some are about allocations. Some are about copying. Some are about retry loops, closures, or collection access. I may write about more of them later, because the interesting part is rarely the trick itself. The interesting part is learning when the extra complexity is worth paying for.
Further reading:
- Follow-up Azure SDK pull request: PartitionResolver Jenkins lookup3 optimizations
- Original Azure SDK pull request: Optimize partition resolver compute hash
- Azure Event Hubs documentation
- .NET Span<T> documentation
Common questions
This section answers the questions I would ask before copying any of these ideas into application code.
Should application code use Unsafe.ReadUnaligned<T>?
Usually not. It can be appropriate in a tiny, heavily tested, performance-critical loop. Most application code should stop at span-based APIs and clear ownership boundaries.
What is the main lesson?
Optimize the code that runs at scale. Leave the rest alone.