A hard disk failure caused Sirv to temporarily pause processing of new images. Delivery of existing images and image uploads continued as normal. The hard disk was replaced within 45 minutes, during which 359 of 37,293 processing requests were declined.
Riak's Active Anti-Entropy (AAE) system repaired data inconsistencies but consumed 100% of the CPU, overloading the Riak cluster. About 3,000 requests were lost before the issue was resolved (less than 1% of requests during that period).
The Riak AAE throttling configuration has been updated to prevent this from happening again.