Elevated errors on one London CDN server

Jan 15 at 05:54pm GMT

Jan 16 at 07:04am GMT

The server became healthy again and returned to service automatically, but after some time it became unhealthy and was removed from service again. Its health checks alternated between healthy and unhealthy, moving the server in and out of service a few times, until automatic management was overridden by manual removal at 0704 UTC.

The issue was caused by a faulty hard disk. Rather than failing outright, the disk produced errors for random periods and then operated normally. Sirv already employs health checks to identify unhealthy disks, but this disk's intermittent behaviour allowed it to pass health checks and resume service. We are reviewing what more can be done to triage disks that behave in this way.
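
One general way to handle a server that alternates between healthy and unhealthy is flap detection: count health state changes within a time window and hold the server out of service once they exceed a limit. The sketch below is illustrative only and is not Sirv's implementation; the `FlapDetector` class, its thresholds and the timestamps are all assumptions.

```python
from collections import deque


class FlapDetector:
    """Hold a server out of service if its health state flips too often.

    Hypothetical sketch: the thresholds are illustrative assumptions.
    """

    def __init__(self, max_transitions=3, window_seconds=3600):
        self.max_transitions = max_transitions
        self.window_seconds = window_seconds
        self.transitions = deque()  # timestamps of recent state changes
        self.last_state = None
        self.quarantined = False

    def observe(self, timestamp, healthy):
        """Record one health check result.

        Returns True if the server may serve traffic after this check.
        """
        # A transition is any change from the previously observed state.
        if self.last_state is not None and healthy != self.last_state:
            self.transitions.append(timestamp)
        self.last_state = healthy

        # Forget transitions that fall outside the sliding window.
        while self.transitions and timestamp - self.transitions[0] > self.window_seconds:
            self.transitions.popleft()

        # Too many flips in the window: quarantine until manually cleared.
        if len(self.transitions) >= self.max_transitions:
            self.quarantined = True

        return healthy and not self.quarantined


detector = FlapDetector(max_transitions=3, window_seconds=3600)
# Hypothetical check results, 10 minutes apart: healthy, unhealthy, healthy, ...
results = [detector.observe(t, ok) for t, ok in
           [(0, True), (600, False), (1200, True), (1800, False), (2400, True)]]
```

With these assumed settings, the third state change quarantines the server, so the final healthy check no longer returns it to service. Quarantine is deliberately sticky here, which mirrors the manual removal described above.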

Actions being taken include:

  1. Expansion of our proactive disk replacement regime, to handle a wider range of disk behaviours. This regime catches potentially failing disks early, so expanding it to more scenarios will reduce the likelihood of repeat issues.

  2. An additional alert has been added to track elevated 5xx errors per CDN server, alongside the existing alert which monitors each CDN POP (each POP has multiple servers).
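
To illustrate why a per-server alert complements a per-POP alert, the sketch below compares the two views of the same data. The server names, request counts and the 2% threshold are hypothetical and are not taken from this incident or from Sirv's monitoring configuration.

```python
def error_rate(errors, requests):
    """Fraction of requests that returned a 5xx status."""
    return errors / requests if requests else 0.0


def pop_error_rate(stats):
    """Aggregate 5xx rate across every server in a POP."""
    total_errors = sum(errors for errors, _ in stats.values())
    total_requests = sum(requests for _, requests in stats.values())
    return error_rate(total_errors, total_requests)


def servers_over_threshold(stats, threshold):
    """Names of servers whose own 5xx rate exceeds the threshold."""
    return sorted(
        name
        for name, (errors, requests) in stats.items()
        if error_rate(errors, requests) > threshold
    )


# Hypothetical counts per server: (5xx responses, total requests).
pop_stats = {
    "lon-1": (5, 10_000),
    "lon-2": (400, 10_000),  # one server with elevated errors
    "lon-3": (3, 10_000),
    "lon-4": (2, 10_000),
}
THRESHOLD = 0.02  # assumed alert threshold of 2%

pop_alert = pop_error_rate(pop_stats) > THRESHOLD
server_alerts = servers_over_threshold(pop_stats, THRESHOLD)
```

In this made-up example the POP-wide rate stays just above 1% and does not cross the threshold, while the single affected server sits at 4% and is flagged by the per-server check.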

Jan 15 at 05:54pm GMT

One of the London CDN servers started returning 502 errors for a small number of requests at 1754 UTC. The server served over 99% of requests normally until 1913 UTC, when errors spiked and the server was automatically dropped from service. The server is being monitored.