Degraded

Washington D.C. CDN requests loading slowly

Oct 23 at 02:10pm BST
Affected services
CDN request from Los Angeles
CDN request from New York

Resolved
Nov 14 at 01:37pm GMT

The detailed investigation into this issue found two contributing factors.

The first was an update to AWS Route53, which appears to have been implemented by AWS on 17 October. This was not a change made by Sirv to its Route53 rules but a change within the Route53 service itself. The Los Angeles POP had slightly different routing logic from other POPs, and the AWS update caused its traffic to be routed to the Washington DC fallback. This went unnoticed because the higher load was still within Washington DC's capacity and requests were returned successfully.

The second contributing factor was higher-than-normal traffic during peak US trading hours on 23 and 24 October. The two factors combined caused the Washington DC hard disks to serve requests significantly more slowly than normal.

Since the issue occurred, we have taken multiple actions to prevent a repeat, to mitigate similar incidents, and to shorten resolution time if an incident does occur. Actions include:

  • New routing logic has been applied.
  • Washington DC capacity has been increased.
  • Reserve capacity has been increased in all other CDN locations.
  • A new tolerance has been introduced, beyond which a CDN location is removed from routing.
  • Additional disk monitoring has been added, to help prepare capacity upgrades early.
  • HDDs are being replaced with SSDs.
  • Updated SOP for our support team to investigate and respond to similar issues.
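
As an illustration of the routing tolerance mentioned in the list above, the following is a minimal sketch of how a location might be dropped from routing once too many of its requests are slow. It is not Sirv's implementation: the 5% tolerance, the POP names, the sample data and the function names are all assumptions made for the example. Only the 2-second definition of "slow" and the 17% figure come from this report.

```python
from __future__ import annotations

from dataclasses import dataclass

SLOW_THRESHOLD_SECONDS = 2.0  # "slow" as defined in this report
SLOW_SHARE_TOLERANCE = 0.05   # hypothetical tolerance, not Sirv's actual value


@dataclass
class PopSample:
    """Response times (in seconds) observed for one CDN location."""
    name: str
    response_times: list[float]


def slow_share(response_times: list[float]) -> float:
    """Return the fraction of requests slower than the slow threshold."""
    if not response_times:
        return 0.0
    slow = sum(1 for t in response_times if t > SLOW_THRESHOLD_SECONDS)
    return slow / len(response_times)


def routable_pops(samples: list[PopSample]) -> list[str]:
    """Return the names of locations whose slow share is within tolerance."""
    return [
        s.name
        for s in samples
        if slow_share(s.response_times) <= SLOW_SHARE_TOLERANCE
    ]


# Illustrative data only: 17% of requests slow, as reported for 24 October.
samples = [
    PopSample("washington-dc", [0.2] * 83 + [3.0] * 17),
    PopSample("new-york", [0.2] * 100),
]
print(routable_pops(samples))  # ['new-york']
```

In this sketch a location with 17% slow requests exceeds the assumed 5% tolerance and is left out of the routable set, so its traffic would go to the remaining locations.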

Updated
Oct 25 at 06:48am BST

The issue was resolved by 06:48 UTC on 25 October. Investigation points towards two causes. First, the spike was caused by Sirv's DNS provider routing Los Angeles POP requests to the Washington D.C. POP instead. This would have been fine in isolation, but the datacentre also appears to have throttled the bandwidth of the receiving servers, causing the slow responses. The investigation is ongoing, has not yet reached definitive conclusions and requires collaboration with our DNS and datacentre providers. Actions are already being taken to prevent any possible repeat of this issue. A fully detailed report will be provided here later.

Created
Oct 23 at 02:10pm BST

A dramatic increase in requests to the Washington D.C. CDN location occurred at 14:10 UTC on 23 October, causing a small proportion of requests to be returned slowly (taking longer than 2 seconds). The issue subsided over the next 3 hours, though average response time remained elevated. The issue returned the following day at 14:32 UTC on 24 October, more severely than before, causing 17% of requests to load slowly (longer than 2 seconds, and some as long as 30 seconds). The issue impacted two of Sirv's 25 CDN locations.