Investigating the root cause of a ServiceNow Scripted REST API integration timeout: you can’t expect the unexpected

In this case study, I share my experience with investigating a ServiceNow integration timeout issue. The root cause turned out to be something I wouldn’t have thought of.

Introduction

A few months ago, I implemented a Scripted REST API integration in ServiceNow. The integration received scanned documents, along with some metadata, via an enterprise service bus (ESB). After processing a request, ServiceNow creates a service request record and saves the documents as attachments. All of this takes a while: 4 to 5 seconds, to be exact.
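
To give an idea of the shape of such an integration, here is a minimal sketch of what the Scripted REST resource can look like. The target table, the custom field u_correlation_id, and the payload field names are assumptions made for illustration, not the actual implementation:

    (function process(/*RESTAPIRequest*/ request, /*RESTAPIResponse*/ response) {
        // Parsed JSON payload sent by the ESB (the field names here are assumptions)
        var body = request.body.data;

        // Create the service request; the many business rules fire on insert,
        // which is where most of the 4-5 seconds are spent
        var req = new GlideRecord('sc_request');
        req.initialize();
        req.setValue('short_description', body.short_description);
        req.setValue('u_correlation_id', body.correlation_id); // hypothetical custom field
        var sysId = req.insert();

        // Decode the scanned document and store it as an attachment
        new GlideSysAttachment().writeBase64(req, body.file_name, 'application/pdf', body.file_base64);

        response.setStatus(201);
        response.setBody({ sys_id: sysId, correlation_id: body.correlation_id });
    })(request, response);

Because the response is only sent after the insert and the attachment write have completed, the whole processing time is visible to the ESB as latency.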

Usually, such a long processing time would concern me, but since the service request has a large number of business rules, even manual creation would take approximately the same amount of time. Moreover, Base64 decoding and attachment creation add to the time needed to create the service request. The volume of messages was reasonably low, something like 20-30 messages per day. Considering these numbers, the response time was acceptable. I did think about an asynchronous approach, in which only the request would be saved at first, and the ESB would then poll the API with the specific correlation ID until a given timeout. In the end, the decision was to go with the synchronous approach; the timeout was set to 10 seconds, around 200% of the average processing time, and everything went fine, until…
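
For reference, the asynchronous variant I had in mind would have exposed a status resource that the ESB polls by correlation ID, roughly like the sketch below. The staging table u_inbound_document, its fields, and the resource path are hypothetical; this shows the idea, not something that was built:

    (function process(/*RESTAPIRequest*/ request, /*RESTAPIResponse*/ response) {
        // e.g. GET .../status/{correlation_id}
        var correlationId = request.pathParams.correlation_id;

        var doc = new GlideRecord('u_inbound_document'); // hypothetical staging table
        if (!doc.get('u_correlation_id', correlationId)) {
            response.setStatus(404); // unknown correlation ID
            return;
        }

        response.setStatus(200);
        response.setBody({
            correlation_id: correlationId,
            state: doc.getValue('u_state'),            // e.g. 'queued', 'processed', 'error'
            request_number: doc.getValue('u_request')  // filled once the service request exists
        });
    })(request, response);

The ESB would keep polling this resource until the state became final or until its own timeout expired.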

Timeouts occurred

The integration had been in place for a couple of months when, one day, I received an email from the customer stating that timeouts had occurred in the ESB and asking me to check the affected records. Luckily, the customer was able to provide the exact time and the correlation IDs of the affected messages, so I could quickly check them. They had all been processed correctly and completely; none were missing from ServiceNow. Thanks to the logging, I could see that the message processing itself took 8-20 seconds.

First steps

As a first step, I checked a couple of things. The payloads were not particularly large, so it was not the Base64 decoding that consumed the time. Just to be sure, I replayed the requests in the test environment (I manually invoked the API from Postman), only to find that the processing time was around the expected 4-5 seconds. So there was nothing wrong with the messages themselves; somehow, the system had been busy with something else at the time, which slowed the processing down.

I then thought about what happens when messages arrive in a burst. They pile up in the incoming queue and are then processed by the application nodes, but the 4-5 seconds of processing time adds up and eventually causes a timeout on the sender’s side, even though every message is ultimately processed properly in ServiceNow. On the ServiceNow instance, I found the following settings[1]:

  • nodes: there were 4 nodes available in the ServiceNow instance.
  • API_INT semaphores (~degree of parallelism within a node): 4 semaphores per node.
  • queue depth limit[2]: 50.

Assuming the third party sends each request without cookies, the load balancer distributes the requests evenly, so the instance would be able to serve 4×4 requests immediately and accept 4×4×(50+1) requests in a burst before starting to reject them. If the third party sent a cookie along with its requests, and the load balancer directed them all to the same node, the figures would be only one-quarter of the limits above. While it was interesting to learn something new, it turned out this was not the case, since even the first message in the batch took 8+ seconds to process. Also worth mentioning is what the ESB experts told me: they queue the messages in the ESB and send them to ServiceNow one by one, so the issue was definitely NOT caused by a burst, since a new request is sent only after the previous one has been processed.
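
Spelled out with the figures above (this follows the reasoning in this article; the exact queueing semantics may differ on your instance):

    // Figures from the diagnostics page, plugged into the reasoning above
    var nodes = 4;
    var semaphoresPerNode = 4;
    var queueDepth = 50;

    // Requests served immediately when the load balancer spreads traffic evenly
    var immediate = nodes * semaphoresPerNode;                      // 16

    // Requests accepted in a burst before HTTP 429 responses start
    var burstLimit = nodes * semaphoresPerNode * (queueDepth + 1);  // 816

    // With a cookie pinning all traffic to one node, both figures drop to a quarter
    var pinnedImmediate = semaphoresPerNode;                        // 4
    var pinnedBurst = semaphoresPerNode * (queueDepth + 1);         // 204

    gs.info('immediate: ' + immediate + ', burst limit: ' + burstLimit);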

Next round

So I tried something else and checked the Slow Job Log. I found something odd: a job had been running for MORE THAN A MINUTE at the exact time the messages arrived.

I was pretty sure that this activity, whatever it might be, had slowed down the system and caused the timeouts. The entry in question was: JOB: PRB1273164 – Fix empty sys_update_name. It seemed to be a scheduled job, so I tried to check it, but there was no such record in the system. I got pretty confused, and I wondered whether Google knew something about this problem. Of course, there was no relevant match, so the problem was probably something customer-specific. But what exactly was it? I got stuck on the issue.
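
For anyone in the same situation, this is roughly how such a job can be looked up from a background script. The tables queried are the out-of-the-box scheduled-job tables, and the search term is simply part of the job name from the Slow Job Log:

    // Look for the job reported in the Slow Job Log by (partial) name.
    // sys_trigger holds the job queue entries, sysauto_script the scheduled script executions.
    var tables = ['sys_trigger', 'sysauto_script'];
    for (var i = 0; i < tables.length; i++) {
        var gr = new GlideRecord(tables[i]);
        gr.addQuery('name', 'CONTAINS', 'PRB1273164');
        gr.query();
        gs.info(tables[i] + ': ' + gr.getRowCount() + ' matching record(s)');
    }

In my case, both queries came back empty, which is exactly what made the job so puzzling.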

And the root cause turned out to be…

After some fruitless efforts, I had the idea—I’m not sure how—that the job might have been part of a ServiceNow patch or upgrade. Due to their nature, they probably put a higher load on the system and also might need to lock some tables for a period of time. And, indeed, this is what I found:

A London upgrade had run at around the same time the issue occurred, and the job fixing the empty sys_update_name problem was most likely part of that upgrade and consumed system resources.
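
A quick way to confirm this kind of correlation is to compare the timestamps in the Upgrade History table with the time of the timeouts. A background-script sketch (the field names are those of the out-of-the-box sys_upgrade_history table; worth verifying on your own instance):

    // List the most recent upgrades with their start and finish times
    var upg = new GlideRecord('sys_upgrade_history');
    upg.orderByDesc('upgrade_started');
    upg.setLimit(5);
    upg.query();
    while (upg.next()) {
        gs.info(upg.getValue('from_version') + ' -> ' + upg.getValue('to_version') +
                ' | started: ' + upg.getValue('upgrade_started') +
                ' | finished: ' + upg.getValue('upgrade_finished'));
    }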

Post-mortem

What could have been done to prevent this outage of service? That’s a good question. I’ve thought about it a lot, and I’m honestly not sure the outage could have been prevented at all.

Patches and upgrades have to be applied from time to time; this is also an expectation from ServiceNow. However, I still don’t understand why the out-of-the-box upgrade procedure allows the load balancer to direct traffic to a node while that node is being upgraded; not doing so would eliminate these kinds of issues. Also, for an international company operating 24/7, it’s hard to define “out-of-office” periods for upgrades (times when the system is under less load). Anyway, patches and upgrades occur relatively rarely, so their impact is most likely not that high.

By the way, during the investigation, I spotted a couple of things that I adjusted:

  • A bunch of email reports were scheduled for the same time. As a preventive action, I spaced the jobs a minute apart, so they would not cause bursts in the system.
  • There was something else in the logs: some record queries took a long time. I checked the entries, and it turned out that some users had set the number of records per page to 500, which is a high value. After a discussion, the decision was made to remove this value from the available choices and allow a maximum of 100 records per page.
  • I again considered modifying the API and making it asynchronous. But in the end, I decided not to make such a large change, since it would affect multiple components in the system.
  • A low-hanging fruit would have been adding a retry to the ESB: if a timeout occurred in the future, the ESB would resend the message after some wait time. If ServiceNow had already processed the first request, it would respond with HTTP 409 Conflict, indicating that the re-sent message was ignored, and the ESB would simply skip this error; otherwise, ServiceNow would process the re-sent message as a regular one (a sketch of such a duplicate check follows this list).
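
Here is a minimal sketch of the duplicate check mentioned in the last point, placed at the start of the resource script; it relies on the hypothetical custom field u_correlation_id from the earlier sketch:

    (function process(/*RESTAPIRequest*/ request, /*RESTAPIResponse*/ response) {
        var body = request.body.data;

        // Idempotency check: has a request with this correlation ID already been created?
        var existing = new GlideRecord('sc_request');
        existing.addQuery('u_correlation_id', body.correlation_id); // hypothetical custom field
        existing.setLimit(1);
        existing.query();
        if (existing.next()) {
            // Already processed: tell the ESB this is a duplicate, so it can skip it
            response.setStatus(409);
            response.setBody({
                correlation_id: body.correlation_id,
                request_number: existing.getValue('number')
            });
            return;
        }

        // ...otherwise continue with the normal processing shown earlier
    })(request, response);

With this in place, a re-sent message is harmless: it is either processed normally or acknowledged as a duplicate.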

Conclusion

So, it turns out that the root cause of an issue might be a surprising one, even an “external” one, like an instance upgrade. You not only need to know ServiceNow well and have some experience investigating issues; you also need to be able to think outside the box.

Finally, some advice for those who work with ServiceNow integrations:

  • Analyse the volume and frequency of the messages before you start implementing an integration.
  • Gain agreement amongst all the parties who will be responsible for monitoring the application, and record this decision. Also make sure the monitoring and notifications work.
  • Make reasonable simplifications that will ease your life and work, but clearly communicate them, and get all stakeholders to agree on them.
  • Think about future extensibility options and hooks. Could there be major changes in the characteristics of the messages?
  • Add logging, so you will be able to track down issues.
  • If the integration is mission-critical, consider using custom tables for the logs (a minimal sketch follows this list),
    • so you can control their retention period,
    • can define custom metrics, and
    • can add reports to home pages or set up custom notifications.
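
As an illustration of the last point, logging to a custom table could look something like the sketch below; the table u_integration_log and its fields are hypothetical and would need to be created to match your own needs:

    // Write one integration event to a (hypothetical) custom log table
    function logIntegrationEvent(correlationId, level, message, durationMs) {
        var log = new GlideRecord('u_integration_log');
        log.initialize();
        log.setValue('u_correlation_id', correlationId);
        log.setValue('u_level', level);            // e.g. 'info', 'warn', 'error'
        log.setValue('u_message', message);
        log.setValue('u_duration_ms', durationMs); // enables reporting on processing times
        log.insert();
    }

    // Example call from the Scripted REST resource
    logIntegrationEvent('ESB-12345', 'info', 'Service request created', 4820);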

About the Author

David Tereanszky is a Lead ServiceNow Developer focusing on back-end development and integrations, experienced in web services, agile practices, enterprise-scale development and scalable solutions. He has worked on numerous international projects in various roles, including remote development and business analysis, within multinational teams for customers in Finland, Sweden, Malta, the UK and CEE countries.


[1] System Diagnostics > Diagnostics Page

[2] A node will respond with HTTP 429 Too Many Requests when the queue is full.
