TL;DR: Before writing a benchmark, build a small profiling harness that makes the code path visible. Run it in Release mode, keep unrelated work out, add clear profiler snapshot points, and collect both memory and CPU evidence.
Production code is a terrible place to start a performance investigation.
There is too much happening at once.
Not because production code is bad. Because it is alive. It has real configuration, real input/output, logging, retries, dependency injection, serialization, network calls, database calls, and all the other parts that make software useful. Attach a profiler to all of that and the result is often a wall of noise.
A benchmark is too small for that first step. A benchmark compares a focused operation under controlled conditions. But before we can focus, we need to know where to look. That is why the performance loop starts with a profiling harness.
A profiling harness is not a benchmark
A profiling harness is a small executable that runs enough of the system to make one code path observable. It is not trying to produce a statistically significant timing table. It is trying to create a clean window for a memory profiler and a CPU profiler.
The profiling harness is not production.
That is intentional.
For the NServiceBus pipeline investigation, the target was pipeline invocation. NServiceBus receives messages from transports such as RabbitMQ, Azure Service Bus, Amazon SQS, or Microsoft Message Queuing (MSMQ), then executes a chain of behaviors. Those behaviors do infrastructure work before customer message handlers run: deserialization, correlation, tracing, transactions, persistence integration, and more.
That pipeline runs on the hot path for every message. If the framework spends too much time or memory there, every customer pays the cost before their own code does any useful work.
The profiling harness had one job: exercise the publish and receive pipelines enough times that a profiler could show where the pipeline allocated memory and spent CPU.
Boring harnesses produce useful profiles
The best harnesses remove drama from the measurement.
In this case, the profiling harness used MSMQ because it was available locally on the machine. Old. Rusty. Local. Good enough.
That made it useful. No container setup. No cloud dependency. No extra account configuration. Just enough transport infrastructure to move messages through the pipeline.
Boring was a feature.
The profiling harness also used System.Text.Json. Not because the investigation compared JSON serializers, but because the serializer needed to be fast enough to stay out of the way. Persistence was non-durable for the same reason: database input/output would have dominated the profile and distracted from pipeline invocation.
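using NServiceBus;
// Local MSMQ transport: no containers or cloud accounts to set up.
EndpointConfiguration endpointConfiguration = new EndpointConfiguration("PipelineHarness");
endpointConfiguration.UseTransport<MsmqTransport>();
// A fast serializer and non-durable persistence keep unrelated work out of the profile.
endpointConfiguration.UseSerialization<SystemJsonSerializer>();
endpointConfiguration.UsePersistence<NonDurablePersistence>();
IEndpointInstance endpoint = await Endpoint.Start(endpointConfiguration);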
This code is not interesting by itself. Good. The profiling harness should not become a second application.
Create profiler snapshot points
Profiling tools are easier to use when the program tells you where the interesting part starts and stops. A few console prompts can be enough.
Console.WriteLine("Warmup complete. Attach profiler and press enter.");
Console.ReadLine();
for (int messageNumber = 0; messageNumber < 1000; messageNumber++)
{
    await endpoint.Publish(new SomethingHappened
    {
        Number = messageNumber
    });
}
Console.WriteLine("Published. Take snapshot and press enter.");
Console.ReadLine();
The first prompt appears once startup and any warmup work are done, and it holds the process there while you attach the profiler. The loop runs the publish path enough times to show allocation and CPU patterns. The second prompt keeps the process alive while you take a snapshot.
This is crude. It is also effective. The profiling harness is an instrument. Make it clear, repeatable, and easy to throw away.
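If the same prompts end up copied into several harnesses, one option is to wrap them in a small helper. The sketch below is an assumption and not code from the investigation: `SnapshotPoints`, `Pause`, and `RunAsync` are hypothetical names. It uses only the base class library, runs a warmup loop, pauses, runs the measured loop, and pauses again. The allocation total it prints comes from `GC.GetTotalAllocatedBytes` and is a rough process-wide figure, not a replacement for a profiler snapshot.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical helper; not part of NServiceBus or any profiler SDK.
public static class SnapshotPoints
{
    // Writes a prompt and blocks until a line is read (enter on a console).
    public static void Pause(string message, TextReader input, TextWriter output)
    {
        output.WriteLine(message);
        input.ReadLine();
    }

    // Runs warmup iterations, pauses, runs the measured iterations, pauses again.
    // Returns the bytes allocated process-wide during the measured loop.
    public static async Task<long> RunAsync(
        Func<int, Task> operation,
        int warmupIterations,
        int measuredIterations,
        TextReader input,
        TextWriter output)
    {
        for (int i = 0; i < warmupIterations; i++)
        {
            await operation(i);
        }

        Pause("Warmup complete. Attach profiler and press enter.", input, output);

        long before = GC.GetTotalAllocatedBytes(precise: true);
        for (int i = 0; i < measuredIterations; i++)
        {
            await operation(i);
        }
        long allocated = GC.GetTotalAllocatedBytes(precise: true) - before;

        output.WriteLine($"Allocated roughly {allocated:N0} bytes in the measured loop.");
        Pause("Done. Take snapshot and press enter.", input, output);
        return allocated;
    }
}
```

In the harness above, the call might look like `await SnapshotPoints.RunAsync(n => endpoint.Publish(new SomethingHappened { Number = n }), 100, 1000, Console.In, Console.Out);`. Passing the reader and writer in keeps the helper usable from a script or a test, where nobody is there to press enter.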
Run it like production code, not debug code
Most integrated development environments default to Debug mode because that is the best experience for normal development. Debug mode is the wrong default for profiling. The compiler emits unoptimized code, the just-in-time compiler skips optimizations such as inlining, and the result can point you at costs that do not exist in the same shape in Release mode.
Build and run the profiling harness in Release mode. Also emit symbols so profiler stacks point back to useful methods and source lines. Without symbols, the profile becomes harder to connect to the code you can change.
<PropertyGroup>
  <DebugType>pdbonly</DebugType>
  <DebugSymbols>true</DebugSymbols>
</PropertyGroup>
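It is easy to start the harness from the IDE and forget that the configuration is still Debug. A startup check can catch that. The sketch below is an assumption, not part of the original harness: `BuildGuard` is a hypothetical name. It reads the `DebuggableAttribute` that the compiler stamps on the assembly and also reports an attached debugger, since a debugger can change how the code is compiled and run.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Reflection;

// Hypothetical startup check for a profiling harness.
public static class BuildGuard
{
    // True when the assembly was compiled with JIT optimizations disabled,
    // which is what a typical Debug build does.
    public static bool IsJitOptimizerDisabled(Assembly assembly)
    {
        return assembly.GetCustomAttribute<DebuggableAttribute>()
            is { IsJITOptimizerDisabled: true };
    }

    // Writes a warning for each condition that makes a profile less trustworthy.
    public static void Warn(Assembly assembly, TextWriter output)
    {
        if (IsJitOptimizerDisabled(assembly))
        {
            output.WriteLine("Warning: built without optimizations. Use -c Release.");
        }

        if (Debugger.IsAttached)
        {
            output.WriteLine("Warning: a debugger is attached.");
        }
    }
}
```

Calling `BuildGuard.Warn(typeof(Program).Assembly, Console.Out);` as the first line of the harness is one way to use it, assuming the entry point class is named `Program`. The check only warns. Whether it should refuse to run is a matter of taste.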
During early investigations, I sometimes disable tiered just-in-time compilation in the profiling harness. That can reduce warmup effects and make the first profiles easier to read. It is a trade-off, not a commandment.
<PropertyGroup>
  <TieredCompilation>false</TieredCompilation>
</PropertyGroup>
Tiered compilation is part of modern .NET runtime behavior, and it can make production code faster over time. Disabling it can be useful while looking for raw allocation and call-stack patterns, but final validation should still reflect the runtime configuration you ship.
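Because a setting like this changes what a profile means, it helps to record it next to the profile. The sketch below is an assumption: `RunConditions` is a hypothetical name. It reports the `System.Runtime.TieredCompilation` runtime configuration value, which is where the project property normally ends up, and the `DOTNET_TieredCompilation` environment variable, which can set the same knob. It shows only what is visible through those two channels and does not ask the runtime what it actually decided.

```csharp
using System;
using System.Runtime;
using System.Runtime.InteropServices;

// Hypothetical helper that describes the conditions a profile was taken under.
public static class RunConditions
{
    public static string Describe()
    {
        // Value from runtimeconfig.json, if the project or publish step set one.
        string configValue =
            AppContext.GetData("System.Runtime.TieredCompilation")?.ToString()
            ?? "(not set)";

        // Environment override, if one is present for this process.
        string environmentValue =
            Environment.GetEnvironmentVariable("DOTNET_TieredCompilation")
            ?? "(not set)";

        return string.Join(
            Environment.NewLine,
            $"Runtime: {RuntimeInformation.FrameworkDescription}",
            $"Server GC: {GCSettings.IsServerGC}",
            $"System.Runtime.TieredCompilation (runtimeconfig): {configValue}",
            $"DOTNET_TieredCompilation (environment): {environmentValue}");
    }
}
```

Printing `RunConditions.Describe()` before the first prompt puts the configuration in the console output, so it can be pasted into the notes that go with each snapshot.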
Take both memory and CPU profiles
I usually take at least two profiles: memory and CPU.
Memory comes first because allocations are often the easiest .NET performance wins. A temporary array, closure allocation, delegate allocation, or repeated copy can be visible in a memory profiler and removable without changing the whole algorithm.
CPU comes next because it shows where the process spends execution time. CPU profiles are often harder to interpret. They can involve algorithms, runtime behavior, synchronization, and compiler decisions. But they are essential because allocation-free code can still burn CPU.
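To make the closure and delegate allocations mentioned above concrete, here is a small illustration. It is not taken from the NServiceBus pipeline, and `ClosureAllocationDemo` is a hypothetical name. Both methods compute the same sum. The first captures a parameter in a lambda, which typically makes the compiler create a closure object and a delegate on every call. The second does the same work inline. No byte counts are shown because they depend on the runtime version and on what the just-in-time compiler manages to optimize away.

```csharp
using System;

// Hypothetical illustration of an avoidable per-call allocation.
public static class ClosureAllocationDemo
{
    // The lambda captures 'offset', so each call typically allocates
    // a closure object plus a delegate instance.
    public static int SumWithClosure(int[] values, int offset)
    {
        Func<int, int> addOffset = value => value + offset;
        int sum = 0;
        foreach (int value in values)
        {
            sum += addOffset(value);
        }
        return sum;
    }

    // Same result with no delegate and no captured state.
    public static int SumInline(int[] values, int offset)
    {
        int sum = 0;
        foreach (int value in values)
        {
            sum += value + offset;
        }
        return sum;
    }

    // Rough check of managed bytes allocated on the current thread by 'action'.
    public static long MeasureAllocatedBytes(Action action)
    {
        long before = GC.GetAllocatedBytesForCurrentThread();
        action();
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }
}
```

In a memory profiler the first method tends to show up as many small compiler-generated objects attributed to one call site. That is the kind of finding that can be removed without touching the algorithm.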

Input/output deserves attention too. Database queries, HTTP calls, file access, and cloud service calls often dominate application performance. This series focuses on CPU and memory because the pipeline investigation was about in-process overhead, but do not ignore input/output when it is part of your problem.
The profiling harness should answer one question
A good profiling harness has a narrow purpose. If the purpose is pipeline invocation, remove unrelated database work. If the purpose is serialization, do not mix in transport throughput. If the purpose is a database query, keep the rest of the application quiet enough that the query is visible.
The profiling harness does not need to be perfect. It needs to move you from “I think this might be slow” to “the profiler shows this path allocating or burning CPU.” Once you have that evidence, you can decide what deserves a benchmark.
The profiling harness gives us visibility.
Now we need to decide which signals matter.
Profiles reveal cost. Benchmarks compare alternatives.
That is the next step in the loop. But before benchmarking, we need to read the profiles without chasing every large number on the screen.
Common questions
Should the profiling harness use real production infrastructure?
Only when that infrastructure is part of the question. If the investigation is about database query performance, use a database that behaves like production. If the investigation is about in-process pipeline overhead, keep the database out of the way.
Is disabling tiered compilation cheating?
It can be useful during early profiling, but it is not the final truth. Use it deliberately, document the choice, and validate the finished change under the runtime configuration you expect in production.
How long should the profiling harness run?
Long enough for the profiler to capture useful data, short enough that the investigation stays fast. A few seconds of repeated work is often enough for the first pass.
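If a fixed message count is hard to pick, the loop can be bounded by time instead. This is a sketch under assumptions: `TimedLoop` and `RunForAsync` are hypothetical names, and the operation is whatever the harness exercises.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical helper: repeat an operation until a time budget is used up.
public static class TimedLoop
{
    // Returns how many iterations completed within 'duration'.
    public static async Task<int> RunForAsync(TimeSpan duration, Func<int, Task> operation)
    {
        Stopwatch stopwatch = Stopwatch.StartNew();
        int iterations = 0;
        while (stopwatch.Elapsed < duration)
        {
            await operation(iterations);
            iterations++;
        }
        return iterations;
    }
}
```

A call such as `await TimedLoop.RunForAsync(TimeSpan.FromSeconds(5), n => endpoint.Publish(new SomethingHappened { Number = n }));` keeps each pass the same length. The trade-off is that two runs no longer do the same amount of work, so a fixed count is easier when comparing allocation totals between snapshots.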
Performance loop status
- [x] Understand the loop
- [x] Build profiling harness
- [ ] Profile
- [ ] Improve
- [ ] Benchmark
- [ ] Profile again
- [ ] Ship and observe