Let me tell you a story about an adventure. An adventure that has kept me busy for the past two weeks and, fortunately, culminated in a happy ending. It’s the story of the biggest and longest bug hunt of my career.
It all started around May 20th, when our support department reported that customers were having trouble loading our dashboard. The issue occurred very rarely and was limited to individual users. Refreshing the page didn’t help. However, when the support team tried to reproduce the problem at the same time, they couldn’t. Everything worked for them. After about 5 minutes, the problem would just disappear, and everything would run smoothly again.
It was as if a ghost was haunting our system, leaving no trace except a sense of uncertainty.
We noted the report and searched through our logs in App Insights, but couldn’t find anything unusual. It kept bugging us, but since we couldn’t reproduce it, we decided to leave it for the time being. However, the reports started piling up, and it reached a point where it became my top priority. The problem was happening frequently and was clearly causing a lot of frustration. Luckily, we made some progress when one of our support team members experienced the issue themselves and managed to take a screenshot of the Dev Tools Network tab.
Finally, something we could work with. A 504? Gateway Timeout? We’d never seen that before. However, this could explain why we couldn’t see anything in our App Insights logs. What if the problem was occurring before the requests even reached our App Service? We have multiple App Service instances that automatically scale in and out according to our rule set, which means instances are spawned and shut down quite frequently. Additionally, a default load balancer provided by Microsoft sits directly in front of our App Services. What if that’s where the problem is?
After our questions on Stack Overflow and the Microsoft Azure forums went unanswered, we timidly submitted a support request to Microsoft to see if they had noticed anything in this area themselves.
At the same time, we noticed something: when the dashboard fails to load, not all calls are unsuccessful. The containeredIdentifiers calls, of which there are several, seem to be the most affected. To confirm the load balancer theory, we implemented a new logging mechanism:
- We generate a unique ID on the client and send it via HTTP headers to the backend.
- On the backend, the first thing we do in the method body is generate an App Insights event with this ID.
- If a support team member encounters the problem again, they can send us a HAR file.
- We can then extract the ID from the HAR file and search for it in App Insights.
- If we find the ID, we’ll know that we ourselves are causing the 504 error.
- If not, it further supports the load balancer theory.
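The client half of this mechanism is little more than an Angular HTTP interceptor. Here is a minimal sketch of the idea; the header name X-Debug-Trace-Id and the class name are illustrative placeholders, not our actual ones:

```ts
// trace-id.interceptor.ts: sketch of the client side of the logging mechanism.
// The header name 'X-Debug-Trace-Id' is an illustrative placeholder.
import { Injectable } from '@angular/core';
import {
  HttpEvent,
  HttpHandler,
  HttpInterceptor,
  HttpRequest,
} from '@angular/common/http';
import { Observable } from 'rxjs';

@Injectable()
export class TraceIdInterceptor implements HttpInterceptor {
  intercept(req: HttpRequest<unknown>, next: HttpHandler): Observable<HttpEvent<unknown>> {
    // One fresh ID per request, so every single call can be located later.
    const traceId = crypto.randomUUID();
    return next.handle(req.clone({ setHeaders: { 'X-Debug-Trace-Id': traceId } }));
  }
}
```

The interceptor is registered via the usual HTTP_INTERCEPTORS multi-provider; on the backend, the very first statement in the method body writes this ID into a custom App Insights event.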
Quick side note:
HAR files are great and super useful. They can be generated in the browser Dev Tools in the Network tab and contain all the information about recorded HTTP calls. This includes the complete requests and responses, including all headers and payloads.
A HAR file can then be imported into the Dev Tools of a new browser session on another machine, allowing us to trace everything in detail.
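Since a HAR file is just JSON, extracting the IDs doesn’t even have to be done by hand. A rough sketch of a helper script, assuming the illustrative X-Debug-Trace-Id header from above:

```ts
// har-trace-ids.ts: sketch that lists the trace IDs of all failed calls in a HAR export.
// Run with: npx ts-node har-trace-ids.ts recording.har
import { readFileSync } from 'fs';

interface HarHeader { name: string; value: string; }
interface HarEntry {
  request: { url: string; headers: HarHeader[] };
  response: { status: number };
}

const har = JSON.parse(readFileSync(process.argv[2], 'utf8'));
const entries: HarEntry[] = har.log.entries;

for (const entry of entries) {
  if (entry.response.status < 400) { continue; } // we only care about failed calls
  const traceId = entry.request.headers
    .find((h) => h.name.toLowerCase() === 'x-debug-trace-id')?.value;
  console.log(`${entry.response.status} ${entry.request.url} -> ${traceId ?? 'no trace id'}`);
}
```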
And lo and behold: we actually receive a HAR file, and the ID indeed cannot be found in App Insights. It seems the bug isn’t on our end. Conveniently, we have a video call scheduled with an Azure support engineer who understands our problem completely and wants to help us solve it.
It turns out that he can’t see any misbehavior in the load balancer logs. However, unlike us, he sees quite a few exceptions that result in 500 errors. He suggests we enable the autoheal feature on the App Service so that any 500 status codes generate a log.
Unfortunately, autoheal can only be enabled per slot. This means we can’t turn on autoheal on the deployment slot and then swap the production slot with the deployment slot to achieve zero downtime.
We actually need to find a time window where we can be unavailable for a short period while the production App Services restart (with autoheal activated). Another two days pass…
Well, in the meantime, I’m frantically Googling and pondering the bug. I keep looking at the HAR files and screenshots we received over and over again. Eventually, I notice that the service worker is somehow involved in all our backend calls. Not too surprising in itself, since we enabled the Angular service worker to cache our assets (HTML, JS, images, etc.) for faster startup and a minimal offline screen. But why is the service worker involved in the backend calls? I had explicitly excluded those calls from caching, and indeed they aren’t being cached.
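For context, the setup corresponds to the standard Angular service worker registration, roughly like this sketch (the file layout and the environment flag are assumptions):

```ts
// app.module.ts (excerpt): typical Angular service worker registration.
import { NgModule } from '@angular/core';
import { ServiceWorkerModule } from '@angular/service-worker';
import { environment } from '../environments/environment';

@NgModule({
  imports: [
    // ngsw-worker.js caches the assets listed in ngsw-config.json. Once registered,
    // however, it also sits in front of every other fetch the page makes,
    // including API calls that are not cached at all.
    ServiceWorkerModule.register('ngsw-worker.js', {
      enabled: environment.production,
    }),
  ],
})
export class AppModule {}
```

The backend calls are deliberately not part of that cache configuration, which is exactly why seeing the service worker in front of them was such a surprise.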
And now it gets really weird: if you take the time to look at the service worker’s source code closely, what do you see?
```js
async safeFetch(req) {
  try {
    return await this.scope.fetch(req);
  }
  catch (err) {
    // Whatever error is thrown while fetching is logged to the internal ngsw
    // debugger and replaced by a synthetic 504 Gateway Timeout response.
    this.debugger.log(err, `Driver.fetch(${req.url})`);
    return this.adapter.newResponse(null, {
      status: 504,
      statusText: 'Gateway Timeout',
    });
  }
}
```
Yikes! Who in their right mind would hide something like that in there?! It turns out that ordinary errors from the backend are simply swallowed and then passed on as gateway timeouts. You can’t make this stuff up.
How can we get around this?
I quickly discovered that in the Dev Tools under Application -> Service Workers, there’s an option to bypass the service worker for network requests. So, all we have to do is tell all customers to open their Dev Tools and activate this checkbox. Yeah, right. If anyone nodded their head at that, there’s no helping them. But after some more Googling, I found a solution that’s actually practical: add an ngsw-bypass header to all backend calls. This way, we can see the real errors on the client, and these turn out to be 500 responses from our App Service.
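In Angular, the practical way to attach that header everywhere is, once again, an HTTP interceptor. A sketch; the /api/ prefix check is an assumption about how backend calls are identified:

```ts
// ngsw-bypass.interceptor.ts: sketch of routing backend calls around the service worker.
import { Injectable } from '@angular/core';
import {
  HttpEvent,
  HttpHandler,
  HttpInterceptor,
  HttpRequest,
} from '@angular/common/http';
import { Observable } from 'rxjs';

@Injectable()
export class NgswBypassInterceptor implements HttpInterceptor {
  intercept(req: HttpRequest<unknown>, next: HttpHandler): Observable<HttpEvent<unknown>> {
    // Only touch calls to our own backend; anything else fetched via HttpClient
    // can still go through the service worker cache.
    if (!req.url.startsWith('/api/')) {
      return next.handle(req);
    }
    // A request carrying the ngsw-bypass header (any value) is forwarded untouched
    // by the Angular service worker, so real error responses reach the app again.
    return next.handle(req.clone({ setHeaders: { 'ngsw-bypass': 'true' } }));
  }
}
```

Registered, like the trace-ID interceptor above, through the HTTP_INTERCEPTORS multi-provider.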
Good thing we activated our autoheal log and actually got a stack trace for our 500 error: InvalidOperationException: Either set Authority or IntrospectionEndpoint. So, IdentityServer is to blame? But why only intermittently? And why do some calls work with the same token while others don’t? And if there’s a misconfiguration, why does our identity system work at all? To be honest, I still don’t have many answers to these questions today, even though (spoiler alert) the bug hunt ended successfully.
A brief explanation of the IntrospectionEndpoint: this is an endpoint on the IdentityServer side that the backend is supposed to call to check whether a reference token is still valid. The problem is, we don’t use reference tokens at all; they would be too slow for us. Instead, we work with access tokens (JWTs) that have a lifespan of 5 minutes. 5 minutes? Yes, exactly; attentive readers might recall this time frame from the beginning of this post! “After about 5 minutes, the problem would just disappear…” So, after 5 minutes, the access token expires, a new one is requested from the IdentityServer, used in the backend call, and voilà, the call succeeds.
Of course, that’s not how the application should behave: we’re actually making the call with a (still) valid access token, but somehow the OAuth library on the backend mistakenly treats it as a reference token. What happens after that is anyone’s guess, since no introspection endpoint was ever configured, nor did we intend to configure one.
Several days went by, during which I was Googling like crazy and trying out different things, until I finally stumbled upon the solution: options.SupportedTokens = SupportedTokens.Jwt tells the library that we exclusively use JWTs, so it will simply never attempt to validate anything as a reference token (reference tokens are then unsupported, which is fine for now). After implementing this one line of code, we no longer encountered the issue.
To summarize everything that remains after the bug hunt in our system:
- On the frontend, all backend calls are now equipped with an ngsw-bypass header to capture any errors transparently.
- On the backend, our OAuth setup is now configured with options.SupportedTokens = SupportedTokens.Jwt to avoid token misinterpretations.
- Currently, an autoheal log is enabled on the backend to catch errors that may not be detected by App Insights.
Even though this blog post might make it sound like the next steps were always clear, there were days of pure Googling or deep thinking in between, just to make any progress at all. Sometimes, software development can be frustrating, and the worst part is that after a week and a half of bug hunting, I can’t even stuff the bug’s oversized head and hang it on the wall as a trophy.