TL;DR: Small code paths become expensive when cloud workloads execute them millions of times. The Azure Event Hubs partition key resolver is one of those paths. By removing temporary allocations from the partition-key encoding path, the Azure Event Hubs SDK reduced garbage collection pressure on a publishing hot path.
I like performance work most when it starts with a boring question: why is this small method showing up so much?
That question came up while looking at the Azure Event Hubs client. Event Hubs is built for streaming data at scale. Applications publish large numbers of events, often using a partition key so related events land in the same partition. The SDK has to turn that key into a partition choice, and part of that process is a hash function.
Hashing a string doesn’t sound expensive. It usually isn’t. But "usually" disappears when the method runs on the publishing hot path for every event. In cloud systems, tiny costs do not stay tiny for long. They multiply by event count, by customer count, by instance count, and by time.
Why does a partition key need hashing?
Event Hubs uses partitions to spread event traffic. A partition key gives publishers a way to keep related events together without picking a specific partition identifier themselves. If two events use the same partition key, the resolver should make the same partition choice for both.
That means the client needs a stable mapping from a key such as customer-1234 to a partition. The SDK first turns the partition key into bytes and then feeds those bytes into the hash algorithm used by the partition resolver.
The Azure Event Hubs resolver uses Bob Jenkins’ lookup3 algorithm, a non-cryptographic hash designed for fast lookup and distribution workloads rather than security.
This is a small piece of code, but it sits on an important path. Every event with a partition key pays for it. For buffered publishing, resolving the partition on the client also helps the SDK group events into partition-specific batches before sending. If a customer publishes a few events per minute, the cost is noise. If they publish thousands of events per second, the same code becomes part of the price of running the system.
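The mapping described above can be sketched in a few lines. This is not the SDK's implementation: StandInHash is an FNV-1a hash folded to 16 bits that I picked for brevity instead of lookup3, and PartitionMappingSketch, ResolvePartitionIndex, and the modulo step are names and choices assumed for illustration. The indexes it produces will not match what Event Hubs assigns.

```csharp
using System;
using System.Text;

public static class PartitionMappingSketch
{
    // Stand-in 16-bit hash (FNV-1a folded to 16 bits). NOT lookup3.
    public static short StandInHash(ReadOnlySpan<byte> data)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (byte b in data)
            {
                hash ^= b;
                hash *= 16777619;
            }
            // Fold the upper half into the lower half, then truncate to 16 bits.
            return (short)(hash ^ (hash >> 16));
        }
    }

    // Hypothetical helper: the same key always yields the same index
    // as long as partitionCount does not change.
    public static int ResolvePartitionIndex(string partitionKey, int partitionCount)
    {
        if (partitionCount <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(partitionCount));
        }

        // Deliberately uses the allocating overload to keep the sketch short.
        byte[] keyBytes = Encoding.UTF8.GetBytes(partitionKey);
        short hash = StandInHash(keyBytes);

        // short % int is computed as int, so Math.Abs cannot overflow here.
        return Math.Abs(hash % partitionCount);
    }
}
```

The property that matters is determinism: calling ResolvePartitionIndex twice with customer-1234 and the same partition count returns the same index both times.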
The first lesson is not "optimize every hash function." The lesson is to ask whether the code runs often enough to matter.
The problem was hidden in a byte array
The original implementation had a shape many .NET developers have written before. Take a string, encode it as UTF-8, and pass the resulting byte[] to the next method.
byte[] partitionKeyBytes = Encoding.UTF8.GetBytes(partitionKey);
short hash = GenerateHashCode(partitionKeyBytes);
That code is simple. It is also allocation-heavy on a hot path. Encoding.UTF8.GetBytes(string) has to create a new byte array because the caller receives ownership of that array. The runtime cannot know whether the caller will keep it for a millisecond, store it in a static field, or mutate it later.
For a partition key resolver, that array exists only to bridge two steps: encode the key, then hash the bytes. Once the hash is computed, the array is no longer useful. Allocating it for every publish is work we can avoid.
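If you want to see that allocation for yourself, a rough probe like the one below is enough. It is a sketch rather than a benchmark: AllocationProbe, its method names, and the fixed 256-byte stack buffer are assumptions for illustration, and it relies on GC.GetAllocatedBytesForCurrentThread to compare the array-returning overload with the span-based one.

```csharp
using System;
using System.Text;

public static class AllocationProbe
{
    // Keeps results observable so the loops are not optimized away.
    public static long Sink;

    // Bytes allocated on this thread while encoding into a new array each time.
    public static long MeasureArrayPath(string key, int iterations)
    {
        Sink += Encoding.UTF8.GetBytes(key).Length; // warm-up call

        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++)
        {
            byte[] bytes = Encoding.UTF8.GetBytes(key);
            Sink += bytes.Length;
        }
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    // Bytes allocated on this thread while encoding into one reused stack buffer.
    public static long MeasureSpanPath(string key, int iterations)
    {
        // Allocated once, outside the loop, so the stack does not grow per iteration.
        Span<byte> buffer = stackalloc byte[256];
        if (Encoding.UTF8.GetMaxByteCount(key.Length) > buffer.Length)
        {
            throw new ArgumentException("Key is too long for this sketch.", nameof(key));
        }

        Sink += Encoding.UTF8.GetBytes(key.AsSpan(), buffer); // warm-up call

        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++)
        {
            Sink += Encoding.UTF8.GetBytes(key.AsSpan(), buffer);
        }
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }
}
```

On the runtimes I have used, I would expect the array path to grow with the iteration count and the span path to stay close to zero. For real numbers, use a proper benchmarking harness.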
In the PR, I changed the ownership model. Instead of asking the encoding API to allocate memory, the resolver provides memory for the encoding API to fill. The sample below simplifies some details from the production implementation.
// View the key's characters without copying the string.
ReadOnlySpan<char> partitionKeyChars = partitionKey.AsSpan();
// Upper bound on the encoded size; no extra pass over the input.
int maxByteCount = Encoding.UTF8.GetMaxByteCount(partitionKeyChars.Length);
byte[]? rentedBuffer = null;
// Small keys use stack memory; larger keys rent from the shared pool.
Span<byte> hashBuffer = maxByteCount <= MaximumStackLimit
    ? stackalloc byte[maxByteCount]
    : rentedBuffer = ArrayPool<byte>.Shared.Rent(maxByteCount);
int bytesWritten = Encoding.UTF8.GetBytes(partitionKeyChars, hashBuffer);
// Hash only the bytes that were actually written.
ReadOnlySpan<byte> bytesToHash = hashBuffer[..bytesWritten];
short hash = GenerateHashCode(bytesToHash);
if (rentedBuffer is not null)
{
    ArrayPool<byte>.Shared.Return(rentedBuffer);
}
That sample is simplified, but the idea is the important part. The code asks for an upper bound with GetMaxByteCount, uses stack memory for small keys, rents from ArrayPool<byte> for larger keys, and then slices the buffer down to the bytes that were actually written.
The result is less garbage collection work. The benchmarks I included in the original pull request showed roughly 38-47 percent higher throughput for tested partition key sizes, with the per-call allocations removed.
| Change | Measured effect |
|---|---|
| Replace Encoding.UTF8.GetBytes(string) allocation with caller-provided buffers | Removed per-call allocations from the measured path |
| Use stack memory for small buffers and ArrayPool<byte> for larger ones | Reduced allocation pressure while keeping stack usage bounded |
Those numbers are nice, but the more useful point is the pattern: if data is only needed for the duration of a synchronous method, try not to allocate an owned array just to pass it to the next line.
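One detail worth spelling out when applying this pattern elsewhere is exception safety: if the work done on the buffer can throw, the rented array should still go back to the pool. The sketch below wraps the same stack-or-pool choice in a try/finally. PooledEncodingSketch, the SpanHasher delegate, and the MaximumStackLimit value of 256 are assumptions for illustration, not the production code.

```csharp
#nullable enable
using System;
using System.Buffers;
using System.Text;

public static class PooledEncodingSketch
{
    // Hypothetical threshold; pick a value that keeps stack usage small and bounded.
    private const int MaximumStackLimit = 256;

    // Custom delegate so the sketch does not depend on span types
    // being allowed as generic type arguments.
    public delegate short SpanHasher(ReadOnlySpan<byte> bytes);

    public static short HashPartitionKey(string partitionKey, SpanHasher hasher)
    {
        ReadOnlySpan<char> chars = partitionKey.AsSpan();
        int maxByteCount = Encoding.UTF8.GetMaxByteCount(chars.Length);
        byte[]? rented = null;

        // Stack memory for small keys, pooled memory for everything else.
        Span<byte> buffer = maxByteCount <= MaximumStackLimit
            ? stackalloc byte[maxByteCount]
            : (rented = ArrayPool<byte>.Shared.Rent(maxByteCount));

        try
        {
            int written = Encoding.UTF8.GetBytes(chars, buffer);

            // Only the encoded prefix is handed to the hash function.
            return hasher(buffer.Slice(0, written));
        }
        finally
        {
            // Runs even if encoding or hashing throws.
            if (rented is not null)
            {
                ArrayPool<byte>.Shared.Return(rented);
            }
        }
    }
}
```

The encoded bytes never leave HashPartitionKey, which is what makes both stack memory and pooled memory safe here. If the bytes had to outlive the method, an owned array would be the right choice again.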
Why not just pool everything?
Pooling is useful, but it is not free. Renting from ArrayPool<T> and returning to it has a cost. For small buffers that live only inside one synchronous method, stackalloc is often a better fit because the memory is tied to the method frame and disappears when the method returns.
That does not mean stackalloc should be scattered everywhere. Stack space is limited. It is safe when the size is small and controlled. It becomes dangerous when the size comes from outside input without a cap.
That is why I used a threshold in the partition key resolver. Small keys used stack memory. Larger keys used the shared array pool. The code traded a little complexity for predictable behavior across both common and uncommon key sizes.
There was another small detail: the code sizes the buffer with an upper bound instead of first computing the exact encoded byte count. For UTF-8, GetByteCount has to inspect the input to compute the exact size. GetMaxByteCount only needs the character count, so it returns a safe upper bound without a separate pass over the key. Since the code slices the buffer down to bytesWritten after encoding, the unused tail of the buffer is never hashed.
Over-renting a buffer sounds wasteful at first. In this case, it avoids an additional pass over the input before the actual encoding and hashing work begins.
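To see the gap between the two sizing calls, the small sketch below reports the exact count, the upper bound, and the number of bytes actually written for a given key. ByteCountComparison, Measure, and the 512-byte stack cutoff are assumptions for illustration only.

```csharp
using System;
using System.Text;

public static class ByteCountComparison
{
    public static (int Exact, int UpperBound, int Written) Measure(string key)
    {
        // Walks the characters to compute the precise encoded length.
        int exact = Encoding.UTF8.GetByteCount(key);

        // Derived from the character count alone; never smaller than the exact count.
        int upperBound = Encoding.UTF8.GetMaxByteCount(key.Length);

        // A plain array is fine for the large case in a demo; the resolver would rent instead.
        Span<byte> buffer = upperBound <= 512
            ? stackalloc byte[upperBound]
            : new byte[upperBound];

        // The return value says how much of the buffer holds real data.
        int written = Encoding.UTF8.GetBytes(key.AsSpan(), buffer);
        return (exact, upperBound, written);
    }
}
```

For an ASCII key such as customer-1234, Exact and Written are both 13, while UpperBound is noticeably larger. That difference is the over-allocation the resolver accepts in exchange for skipping the counting pass.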
What I took away
This was not about making clever code for its own sake. It was about finding a small piece of code that a large system executed again and again, then asking whether the work it did was necessary.
The first pass removed memory allocation from the partition key encoding path. Once that allocation problem was gone, the hash loop itself became the next interesting place to look. That is a separate story.
Further reading:
- Original Azure SDK pull request: Optimize partition resolver compute hash
- Azure Event Hubs documentation
- .NET ArrayPool<T> documentation
- .NET Span<T> documentation
Common questions
This section answers the questions I would ask before copying any of these ideas into application code.
Should every Encoding.UTF8.GetBytes call be rewritten?
No. Rewrite it only when the allocation shows up in a measured hot path and the bytes do not need to escape the method.
Is stackalloc always faster than ArrayPool<T>?
No. stackalloc is a good fit for small, bounded buffers inside synchronous methods. ArrayPool<T> is a better fit when buffers are larger or when stack usage would be risky.
What is the main lesson?
Optimize the code that runs at scale. Leave the rest alone.