Stop guessing: the performance loop for production code

TL;DR: A benchmark can tell you whether code got faster. It cannot tell you whether the code mattered. For that, use a loop: profile with a profiling harness, improve a hot path, benchmark and compare, profile again, then ship and observe production.

The first benchmark I wrote looked deceptively easy.

A class. A few attributes. A method. Run BenchmarkDotNet and get a table.

It looked a lot like unit testing, which made me dangerously confident.

That confidence did not survive contact with real code.

Benchmarks resemble unit tests structurally, but they answer a completely different question. A unit test asks, “Does this behavior still work?” A benchmark asks, “What distribution of timings and allocations did this code produce under these conditions?” The second question is harder because the benchmark does not know whether the measured code matters.

Many performance investigations go wrong right there. The hard part is not adding [Benchmark] to a method. The hard part is deciding what to measure, what to cut away, what to keep, and when the result is good enough to ship.

The loop comes before the benchmark

The workflow I use has five steps:

  1. Profile using a profiling harness.
  2. Improve one hot path.
  3. Benchmark and compare.
  4. Profile the improved code again.
  5. Ship it and observe production.

The order matters.

Profiles find candidates.

Benchmarks compare alternatives.

Profiling comes first because it shows where the system spends time or allocates memory. Benchmarking comes after that because it compares a focused change under repeatable conditions.

If you skip the first profile, you are guessing. Sometimes you will guess right. Most of the time you will polish code that was not holding the system back.

[Figure: Flame graph showing CPU cost in the NServiceBus publish pipeline. A flame graph can show the relationship between infrastructure code and business code before a benchmark exists.]

The example I will use throughout this series comes from the NServiceBus pipeline. Conceptually, the pipeline is similar to middleware in ASP.NET Core. A message flows through a chain of behaviors. Each behavior can do work before and after the next behavior runs. Serialization, tracing, transactions, persistence, and user code can all sit behind that abstraction.

That makes the pipeline a good performance target. It runs on the message-processing hot path, and every user benefits if the infrastructure gets out of the way.
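The behavior chain described above can be sketched in a few lines. This is a minimal illustration, not the NServiceBus implementation: `IBehaviorSketch`, `LoggingBehavior`, and `SketchPipeline` are hypothetical names invented for this example. It shows how each behavior does work before and after calling the next one, and why the shape is worth profiling.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Usage: three behaviors wrap a final "handler" step.
var log = new List<string>();
var pipeline = new SketchPipeline(new IBehaviorSketch[]
{
    new LoggingBehavior("tracing", log),
    new LoggingBehavior("transaction", log),
    new LoggingBehavior("serialization", log),
});

await pipeline.Invoke(() =>
{
    log.Add("handler");
    return Task.CompletedTask;
});

// Prints the before steps in order, then the handler, then the after steps in reverse.
Console.WriteLine(string.Join(" > ", log));

// Hypothetical stand-in for a pipeline behavior; not the NServiceBus interface.
public interface IBehaviorSketch
{
    Task Invoke(Func<Task> next);
}

public sealed class LoggingBehavior : IBehaviorSketch
{
    readonly string name;
    readonly List<string> log;

    public LoggingBehavior(string name, List<string> log)
    {
        this.name = name;
        this.log = log;
    }

    public async Task Invoke(Func<Task> next)
    {
        log.Add(name + ":before"); // work before the rest of the chain
        await next();              // hand over to the next behavior
        log.Add(name + ":after");  // work after the rest of the chain
    }
}

public sealed class SketchPipeline
{
    readonly IReadOnlyList<IBehaviorSketch> behaviors;

    public SketchPipeline(IReadOnlyList<IBehaviorSketch> behaviors) => this.behaviors = behaviors;

    public Task Invoke(Func<Task> terminator) => InvokeFrom(0, terminator);

    // Every step creates a new delegate for "next", so the cost repeats per message.
    Task InvokeFrom(int index, Func<Task> terminator) =>
        index == behaviors.Count
            ? terminator()
            : behaviors[index].Invoke(() => InvokeFrom(index + 1, terminator));
}
```

In this naive version, each invocation allocates one delegate per behavior. That kind of repeated, per-message cost is what a profile of the real pipeline can make visible.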

Start by becoming performance aware

Performance work does not have to start with a profiler. It can start with a few uncomfortable questions during normal development:

  • How often will this code run when the system is under load?
  • What does it allocate on that path?
  • Does it copy data that could be reused, pooled, streamed, or passed as a span?
  • Can setup work move out of the hot path?
  • Which parts are under our control, and which belong to another team, package, or service?
  • What would make us stop?

That last question is not a joke. Performance work has no natural stopping point. There is always another allocation to remove, another branch to simplify, another loop to tighten. Without a stopping rule, the investigation turns into hobby work.

I know this about myself. I once solved it at home by shutting off my internet around midnight. If I could no longer search for the next clue, I would finally go to bed.

For product work, perfect code is the wrong target. Code needs to be fast enough, cheap enough, and simple enough to maintain.

I call myself a principal chocolate lover these days, not a performance engineer. My job is still to ship useful software. Performance work has to earn its place beside everything else.

Why throughput and memory matter

When code runs at scale, small costs become visible. CPU time limits throughput. Allocations increase garbage collector pressure. More pressure means more pauses, more cores, more instances, or a larger cloud bill.

That bill is where performance stops sounding abstract. Someone puts down a credit card, the cloud turns CPU, memory, throughput units, and premium throughput units into cheerful line items, and then someone has to explain why the number got so large.

There is also a waste angle. If the same workload can run with fewer resources, the system uses less capacity for the same business outcome. That is good for cost, and it is good for the amount of energy we ask the platform to burn.

The Microsoft Teams migration to newer .NET versions is a good public example. The team reported large Azure compute cost reductions after moving to .NET 6 and benefiting from runtime performance improvements. Most teams will not get that size of win from one change. They do not need to. A few percent on a hot path can still matter when the code runs all day.

That is the mindset behind the loop: find repeated work, remove unnecessary cost, measure the change, and check whether the larger system improved.
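As a small illustration of "remove unnecessary cost, measure the change", the sketch below compares two ways of reading a number out of a string: one copies the digits into a new string, the other parses them in place through a span. The header format and the `AllocationSketch` helper are assumptions made up for this example. `GC.GetAllocatedBytesForCurrentThread` gives only a rough count, so treat it as a quick sanity check, not as a replacement for a memory profiler or a benchmark.

```csharp
using System;

// Hypothetical hot-path input: a header value such as "retries=3".
const string header = "retries=3";
const int iterations = 100_000;

// Warm up both paths so first-call work is not counted.
AllocationSketch.ParseWithSubstring(header);
AllocationSketch.ParseWithSpan(header);

long substringBytes = AllocationSketch.MeasureAllocatedBytes(() =>
{
    for (int i = 0; i < iterations; i++) AllocationSketch.ParseWithSubstring(header);
});

long spanBytes = AllocationSketch.MeasureAllocatedBytes(() =>
{
    for (int i = 0; i < iterations; i++) AllocationSketch.ParseWithSpan(header);
});

Console.WriteLine($"Substring: {substringBytes:N0} bytes, span: {spanBytes:N0} bytes");

public static class AllocationSketch
{
    // Copies the digits into a new string before parsing.
    public static int ParseWithSubstring(string value) =>
        int.Parse(value.Substring(value.IndexOf('=') + 1));

    // Parses the same characters in place, without a copy.
    public static int ParseWithSpan(string value) =>
        int.Parse(value.AsSpan(value.IndexOf('=') + 1));

    // Bytes allocated on the current thread while the action runs.
    public static long MeasureAllocatedBytes(Action action)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        action();
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }
}
```

A few bytes per call look harmless in isolation. Multiplied by every message on a busy endpoint, they become the garbage collector pressure described above.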

A profiling harness makes the system visible

The first concrete step is a profiling harness. It runs the part of the system you want to understand while keeping unrelated work out of the profile.

For the NServiceBus pipeline investigation, the profiling harness used local infrastructure, a fast JSON serializer, and in-memory persistence. It was not comparing transports, serializers, databases, or cloud services. It was making pipeline invocation visible under a CPU profiler and a memory profiler.

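using System;
using NServiceBus;

// Local infrastructure only: the harness isolates pipeline invocation,
// not transports, serializers, or databases.
EndpointConfiguration endpointConfiguration = new EndpointConfiguration("PipelineHarness");
endpointConfiguration.UseTransport<MsmqTransport>();
endpointConfiguration.UseSerialization<SystemJsonSerializer>();
endpointConfiguration.UsePersistence<NonDurablePersistence>();

IEndpointInstance endpoint = await Endpoint.Start(endpointConfiguration);

// First snapshot point: the endpoint is started, nothing has been published yet.
Console.WriteLine("Warmup complete. Attach profiler and press enter.");
Console.ReadLine();

// Run the publish pipeline often enough for patterns to show up in the profile.
// SomethingHappened is an event type defined elsewhere in the harness.
for (int messageNumber = 0; messageNumber < 1000; messageNumber++)
{
    await endpoint.Publish(new SomethingHappened
    {
        Number = messageNumber
    });
}

// Second snapshot point: compare it with the first to see what publishing cost.
Console.WriteLine("Published. Take snapshot and press enter.");
Console.ReadLine();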

This is not production code. It is an instrument. The console prompts create clear snapshot points. The local transport keeps setup simple. The persistence choice avoids unrelated database work. The message loop runs the pipeline enough times for profiling tools to show useful patterns.

A good profiling harness has boring rules: build and run in Release mode, keep the run long enough to profile, remove avoidable noise, and emit symbols so profiler stacks point back to useful code. If tiered just-in-time compilation gets in the way during early investigation, disable it for the profiling harness and document that choice.
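Some of those rules can be checked by the harness itself at startup. The sketch below is one possible guard, not part of the original harness: `HarnessGuard` is a hypothetical helper. It uses `DebuggableAttribute.IsJITOptimizerDisabled` to detect a Debug build, and reports the `DOTNET_TieredCompilation` environment variable, which is the runtime's documented switch for turning tiered compilation off when set to `0`.

```csharp
using System;
using System.Diagnostics;
using System.Reflection;

// Run these checks at harness startup, before the first profiler prompt.
Assembly harness = Assembly.GetExecutingAssembly();

if (!HarnessGuard.IsJitOptimized(harness))
{
    Console.WriteLine("Warning: built without optimizations. Rebuild with -c Release.");
}

if (Debugger.IsAttached)
{
    Console.WriteLine("Warning: a debugger is attached and will distort timings.");
}

// Only reports the setting; the variable has to be set before the process starts.
string tiered = Environment.GetEnvironmentVariable("DOTNET_TieredCompilation") ?? "(not set)";
Console.WriteLine($"DOTNET_TieredCompilation = {tiered}");

public static class HarnessGuard
{
    // Debug builds mark the assembly with a DebuggableAttribute that disables the JIT optimizer.
    public static bool IsJitOptimized(Assembly assembly)
    {
        var attribute = assembly.GetCustomAttribute<DebuggableAttribute>();
        return attribute is null || !attribute.IsJITOptimizerDisabled;
    }
}
```

Printing the configuration at startup also documents the choice: anyone reading the console output next to a profile can see how the harness was run.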

The profiling harness is not the final truth. It is the first map.

Profiles decide what deserves a benchmark

Once the profiling harness is running, take at least two views: memory and CPU. Memory often gives faster wins in .NET because allocations are easier to spot and easier to remove than algorithmic CPU costs.

[Figure: Memory profiler view showing behavior chain allocations in the NServiceBus pipeline. The useful target was not every allocation in the process. It was the allocation pattern connected to pipeline invocation.]

The profiler will show noise. Some of that noise may be large. That does not make it your best target.

In the pipeline example, some allocations came from Microsoft Message Queuing (MSMQ). That mattered less than it first appeared. MSMQ was not the target of the investigation, the code was outside the pipeline work, and its user base was shrinking. Optimizing the pipeline itself would help every transport. Optimizing MSMQ-specific overhead would not.

Context decides what the profiler output means. The tool can show you where cost appears. It cannot decide which cost is worth paying down.

Benchmark the hot path, not the whole system

After profiling points at a hot path, the benchmark can become small and focused. That usually means copying the relevant code into a controlled benchmark project, trimming unrelated dependencies, and comparing before and after versions.

Copying code feels wrong because duplication is usually how mess grows. In this case, the copy is part of a controlled experiment. It lets you remove dependency injection containers, external input/output, unrelated behaviors, and other moving parts that would blur the result.

using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;

[ShortRunJob]       // fewer iterations: quick feedback while steering
[MemoryDiagnoser]   // adds allocation columns to the result table
public class PipelineExecutionBenchmark
{
    BaseLinePipeline<IBehaviorContext> pipelineBeforeOptimizations;
    PipelineOptimization<IBehaviorContext> pipelineAfterOptimizations;
    BehaviorContext behaviorContext;

    // Each benchmark runs once per depth value.
    [Params(10, 20, 40)]
    public int PipelineDepth { get; set; }

    // Construction happens here, outside the measured path.
    [GlobalSetup]
    public void SetUp()
    {
        behaviorContext = new BehaviorContext();

        pipelineBeforeOptimizations = CreateBeforePipeline(PipelineDepth);
        pipelineAfterOptimizations = CreateAfterPipeline(PipelineDepth);
    }

    // The baseline: results for After() are reported relative to this method.
    [Benchmark(Baseline = true)]
    public Task Before()
    {
        return pipelineBeforeOptimizations.Invoke(behaviorContext);
    }

    [Benchmark]
    public Task After()
    {
        return pipelineAfterOptimizations.Invoke(behaviorContext);
    }
}

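The benchmark measures one thing: pipeline invocation after setup. [GlobalSetup] keeps construction out of the measured path. [Params] repeats the comparison at pipeline depths of 10, 20, and 40, so the cases stay realistic and the result does not hinge on a single size. [ShortRunJob] runs fewer iterations, which keeps the feedback loop moving.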

When the direction looks right, run a longer benchmark. Short runs are for steering. Longer runs are for confidence.

A benchmark win is not the finish line

Once the benchmark shows a win, put the improved code back into the profiling harness and profile again.

[Figure: Optimized CPU flame graph showing less infrastructure overhead in the NServiceBus publish pipeline. After the optimization, the same profiling harness shows less infrastructure overhead around pipeline invocation.]

This second profile answers a different question: did the focused improvement still matter when placed back into a larger execution path?

A microbenchmark might show an operation running five times faster. The whole subsystem may improve by less because it still does other work. That is fine. The dramatic benchmark number is not the prize. The system-level effect is.
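The gap between the microbenchmark number and the system-level effect follows Amdahl's law. The sketch below works through one case with made-up numbers: assume the optimized operation accounted for 20% of the total time before the change. Both the 20% share and the `SpeedupSketch` helper are assumptions for illustration, not measurements from the pipeline investigation.

```csharp
using System;

// Hypothetical numbers: the operation is 20% of total time and became 5x faster.
double overall = SpeedupSketch.OverallSpeedup(fractionOfTotal: 0.20, localSpeedup: 5.0);
Console.WriteLine($"Overall speedup: {overall:F2}x");

public static class SpeedupSketch
{
    // Amdahl's law: the untouched share of the work limits the total gain.
    public static double OverallSpeedup(double fractionOfTotal, double localSpeedup)
    {
        if (fractionOfTotal < 0 || fractionOfTotal > 1)
            throw new ArgumentOutOfRangeException(nameof(fractionOfTotal));
        if (localSpeedup <= 0)
            throw new ArgumentOutOfRangeException(nameof(localSpeedup));

        return 1.0 / ((1.0 - fractionOfTotal) + fractionOfTotal / localSpeedup);
    }
}
```

Under these assumptions the total comes out at 1 / (0.80 + 0.04), or roughly 1.19 times faster. The second profile is what tells you the real share, which is why the benchmark number alone cannot.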

Then production gets the final vote. Watch throughput, latency, allocation rate, garbage collection behavior, and cost. If the assumptions were wrong, learn from that. If the assumptions were right, write down why the change worked so the team can reuse the knowledge.
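Allocation rate and garbage collection behavior can be read from inside the process with standard runtime APIs. The sketch below is a minimal illustration under stated assumptions: `GcSnapshot` and `Workload` are hypothetical names, and the loop stands in for a sampling interval. In production these numbers would normally flow into whatever monitoring the team already uses.

```csharp
using System;

GcSnapshot before = GcSnapshot.Capture();

// Stand-in workload; in production this would be a time interval, not a loop.
for (int i = 0; i < 100_000; i++)
{
    Workload.Sink = new byte[256];
}

GcSnapshot after = GcSnapshot.Capture();

Console.WriteLine($"Allocated: {after.AllocatedBytes - before.AllocatedBytes:N0} bytes");
Console.WriteLine($"Gen0 collections: {after.Gen0Collections - before.Gen0Collections}");
Console.WriteLine($"Gen2 collections: {after.Gen2Collections - before.Gen2Collections}");

public static class Workload
{
    // Storing the array in a field keeps the allocation from being optimized away.
    public static byte[] Sink = Array.Empty<byte>();
}

// Process-wide counters; subtract two snapshots to get the change over an interval.
public readonly record struct GcSnapshot(long AllocatedBytes, int Gen0Collections, int Gen2Collections)
{
    public static GcSnapshot Capture() => new(
        GC.GetTotalAllocatedBytes(precise: false),
        GC.CollectionCount(0),
        GC.CollectionCount(2));
}
```

Comparing the deltas before and after a release, under similar load, is one way to check whether the allocation savings seen in the benchmark survived in production.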

Do not rewrite before you understand

Performance investigations often start near code that looks ugly. That can tempt a team into a rewrite. Rewrites feel clean because the new code has not met production yet.

The loop pushes in the other direction. Profile first. Improve one path. Benchmark the change. Profile again. Ship when the gain justifies the complexity. Write down the trade-offs.

After a few cycles, the team knows more about the code than it did before. Sometimes that knowledge makes a rewrite unnecessary. Sometimes it makes a rewrite safer because the team now has benchmarks, profiles, and production observations to guide the new design.

That is the point of the performance loop. It turns “this code is slow” into a repeatable investigation.

Common questions

Can I use this approach for application code?

Yes. Most applications have framework-like infrastructure inside them: message pumps, request pipelines, validation layers, serialization boundaries, caching layers, or database adapters. Start there if the business code is too broad to isolate.

Should every benchmark become a regression test?

No. Keep benchmarks that protect important hot paths. Delete or archive the experiments that helped you learn but would be expensive to maintain.

What should I do first on Monday?

Pick one path that runs often. Build a tiny profiling harness around it. Take one memory profile and one CPU profile. Do not write the benchmark until the profiles tell you what deserves one.

Performance loop status

  • [x] Understand the loop
  • [ ] Build profiling harness
  • [ ] Profile
  • [ ] Improve
  • [ ] Benchmark
  • [ ] Profile again
  • [ ] Ship and observe

About the author

Daniel Marbach
