In this case study, I share my experience investigating a ServiceNow integration timeout issue. The root cause turned out to be something I wouldn’t have thought of.
A few months ago, I implemented a Scripted REST API integration in ServiceNow. The integration was used for sending scanned documents, along with some metadata, via an enterprise service bus (ESB). After processing the request, the integration created a service request in ServiceNow and saved the attachments. All of this took a while: 4-5 seconds, to be exact.
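Outside ServiceNow, the attachment-handling part of such an integration boils down to decoding the Base64 payload back into bytes. A minimal sketch in plain JavaScript, with hypothetical field names (the real integration also creates the service request and stores the attachment via the platform APIs):

```javascript
// Minimal sketch of the attachment step: the ESB sends the scanned document
// as a Base64 string inside the payload, and the integration decodes it back
// into bytes before storing it. Field names here are hypothetical.
function decodeAttachment(payload) {
  const bytes = Buffer.from(payload.contentBase64, "base64");
  return { fileName: payload.fileName, sizeBytes: bytes.length, bytes };
}

// Demo with a small fake payload standing in for a scanned document.
const example = {
  fileName: "scan-0001.pdf",
  contentBase64: Buffer.from("dummy document bytes").toString("base64"),
};
console.log(decodeAttachment(example).sizeBytes); // 20
```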
Usually, such a long processing time would concern me, but since the service request has a large number of business rules, even manual creation would take approximately the same amount of time. Moreover, Base64 decoding and attachment creation take time on top of creating the service request. The message volume was reasonably low, around 20-30 messages per day. Considering these numbers, the response time was acceptable. I did consider an asynchronous approach: first, only the request would be saved, and then the ESB would poll the API with the specific correlation ID until a given timeout. In the end, the decision was to go with the synchronous approach. The timeout was set to 10 seconds (around 200% of the average processing time), and everything went fine, until…
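As an aside, the polling-based alternative would have worked roughly like the sketch below (plain JavaScript; the `checkStatus` callback is a hypothetical stand-in for the real status API the ESB would call with the correlation ID):

```javascript
// Hypothetical sketch of the asynchronous alternative: the ESB submits the
// request, receives a correlation ID immediately, then polls until the
// service request has been created or a timeout is reached.
async function pollUntilDone(correlationId, checkStatus, timeoutMs, intervalMs) {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const status = await checkStatus(correlationId); // e.g. "processing" | "done"
    if (status === "done") return true;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return false; // timed out; the ESB would raise an alert here
}

// Demo with a stubbed status API that reports "done" on the third poll.
let calls = 0;
const fakeCheckStatus = async () => (++calls >= 3 ? "done" : "processing");

pollUntilDone("CORR-12345", fakeCheckStatus, 5000, 10).then((ok) => {
  console.log(ok ? "processed" : "timed out"); // prints "processed"
});
```

The benefit of this shape is that the initial HTTP call returns quickly, so a slow back end shows up as a longer polling phase rather than a hard transport timeout.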
The integration had been in place for a couple of months when, one day, I received an email from the customer stating that timeouts were occurring in the ESB and asking me to check the records. Luckily, the customer was able to provide the exact time and the correlation IDs of the affected messages, so I could quickly check them. They had all been correctly and completely processed; none were missing from ServiceNow. Thanks to the logging, I could see that the message processing itself took 8-20 seconds.
As a first step, I checked a couple of things. The payloads were not particularly large, so it was not the Base64 decoding that took a lot of time. Just to be sure, I replayed the requests in the test environment (I manually invoked the API from Postman), only to find the processing time to be around the expected 4-5 seconds. So there was nothing wrong with the messages themselves; somehow, the system was busy with other things, which caused the slow processing times.
I then thought about what happens when messages arrive in a burst. They pile up in the incoming queue and are then processed by the application nodes, but the 4-5 seconds of processing time adds up and eventually causes a timeout on the sending side, even though each message is properly processed in ServiceNow in the end. In the ServiceNow instance settings, I found the following:
Assuming the third party sends each request without cookies, the load balancer would distribute the requests evenly, and this instance would be able to serve 4 × 4 requests immediately and to accept 4 × 4 × (50 + 1) requests in a burst before starting to reject them. If the third party sent a cookie along with the requests, and the load balancer therefore directed them all to the same node, the figures would be only one-fourth of the limits described above. While it was interesting to learn something new, it turned out this was not the case here, since even the first message in the batch took 8+ seconds to process. It is also worth mentioning what the ESB experts told me: they put the messages into a queue in the ESB and send them one by one to ServiceNow, so the issue was definitely NOT caused by a burst, since a new request is sent only after the previous one has been processed.
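The arithmetic behind those figures can be spelled out. Assuming, per the settings above, 4 application nodes with 4 API semaphores each and a queue depth of 50:

```javascript
// Burst capacity arithmetic for the instance described above:
// 4 application nodes, each with 4 API semaphores and a queue depth of 50.
const nodes = 4;
const semaphoresPerNode = 4;
const queueDepth = 50;

// Requests served immediately: one per free semaphore across all nodes.
const immediate = nodes * semaphoresPerNode; // 16

// Requests accepted in a burst before rejections start: each semaphore can
// be busy with 1 request while 50 more wait in its queue.
const burst = nodes * semaphoresPerNode * (queueDepth + 1); // 816

// With sticky sessions (a cookie pinning all traffic to one node), only
// that single node's capacity is available: one-fourth of the above.
const stickyBurst = semaphoresPerNode * (queueDepth + 1); // 204

console.log(immediate, burst, stickyBurst); // 16 816 204
```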
So I tried something else and went to the Slow Job Log. I found something odd: there was something running for MORE THAN A MINUTE at the exact time the messages arrived.
I was pretty sure that this activity, whatever it might be, slowed down the system and caused the timeouts shown in the image above: JOB: PRB1273164 – Fix empty sys_update_name. This looked like a scheduled job, so I tried to check it, but there was no such record in the system. Pretty confused, I wondered whether Google knew something about this problem. Of course, there was no relevant match, so the problem was probably customer-specific. But what exactly was it? I was stuck.
And the root cause turned out to be…
After some fruitless efforts, I had the idea—I’m not sure how—that the job might have been part of a ServiceNow patch or upgrade. Due to their nature, they probably put a higher load on the system and also might need to lock some tables for a period of time. And, indeed, this is what I found:
A London upgrade happened around the same time the issue occurred, and the job fixing the problem was most likely part of the upgrade and used system resources.
What could have been done to prevent this service outage? That’s a good question. I’ve been wondering about it a lot, and I’m still not sure the outage could have been prevented at all.
Patches and upgrades have to be applied from time to time; this is also an expectation from ServiceNow. However, I still don’t understand why the out-of-the-box upgrade procedure allows the load balancer to direct traffic to a node while that node is being upgraded. Not doing so would eliminate this kind of issue. Also, for an international company operating 24/7, it’s hard to define “out-of-office” periods for upgrades (times when the system is under less load). Anyway, these patches and upgrades occur relatively rarely, so their impact is most likely not that high.
By the way, during the investigation, I spotted a couple of other things that I adjusted along the way.
So, it turns out that the root cause of an issue might be a surprising one, even an “external” reason, like an instance upgrade. So you not only need to know ServiceNow well and have some experience investigating issues, but you also need to be able to think out-of-the-box.
Finally, some advice for those who work with ServiceNow integrations:
David Tereanszky is a Lead ServiceNow Developer focusing on back-end development and integrations, experienced in web services, agile practices, enterprise-scale development and scalable solutions. He has worked on numerous international projects in various roles, including remote development and business analysis, within multinational teams for customers in Finland, Sweden, Malta, the UK and CEE countries.
Notes:
- System Diagnostics > Diagnostics Page
- A node responds with HTTP 429 Too Many Requests when the queue is full.