On July 6th, LootLocker experienced a service outage that resulted in a complete loss of availability for all customers lasting approximately 50 minutes.
At 22:03 UTC, an alert was sent to our on-call engineers warning of slow response times. The platform was being overwhelmed by an unanticipated spike in simultaneous requests, which overloaded a key subsystem that manages internal service discovery and orchestration. Our infrastructure automatically initiated scaling events in response to the degraded performance, but it was unable to keep up with the traffic. We then manually scaled up capacity, which also did not have the effect we expected, and we began seeing 502 and 503 responses at high volume.
While this scaling mechanism typically restores service within seconds, in this instance it did not, and we eventually rebooted the affected service. The reboot exposed an expired internal certificate that is required for secure communication between services during the bootstrapping process. Because the certificate had expired, the affected services were unable to validate each other’s identities and refused to start, causing the recovery process to fail silently.
At 01:35 UTC, our on-call team manually rotated the certificate and restored the affected services, after which all systems resumed normal operation.
This incident was the result of three independent failures that aligned:
System Overload: A sharp spike in concurrent requests pushed one of our orchestration systems beyond safe operating limits. While we had autoscaling in place, it was not tuned to respond quickly enough to this specific type of workload.
Certificate Expiry: An internal certificate, which facilitates encrypted communication between core services, had expired. Due to a misconfiguration in our observability tooling, this certificate was not being monitored for expiration.
Failure to Fail Gracefully: Our system’s failover logic assumed a successful bootstrapping process. When the expired certificate prevented services from coming online, the failover loop stalled without escalating to human operators immediately (a simplified sketch of the escalation behavior we expected follows this list).
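To illustrate the third point, here is a minimal, hypothetical sketch of the behavior we expected from the bootstrap path: a bounded retry loop that pages an on-call engineer instead of stalling silently. The function names, thresholds, and escalation path are assumptions made for the example and are not our actual failover code.

```python
import logging
import time

logger = logging.getLogger("bootstrap")

# Hypothetical thresholds for the example, not production settings.
MAX_ATTEMPTS = 3
RETRY_DELAY_SECONDS = 10

class BootstrapError(Exception):
    """Raised when a service fails to start (e.g. certificate validation fails)."""

def start_service() -> None:
    """Placeholder for the real bootstrap step (mTLS handshake, registration, etc.)."""
    raise BootstrapError("certificate validation failed")

def page_on_call(reason: str) -> None:
    """Placeholder for the real escalation path (pager, chat alert, etc.)."""
    logger.critical("Escalating to on-call: %s", reason)

def bootstrap_with_escalation() -> bool:
    """Retry the bootstrap a bounded number of times, then escalate to a human."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            start_service()
            return True
        except BootstrapError as exc:
            logger.warning("Bootstrap attempt %d failed: %s", attempt, exc)
            time.sleep(RETRY_DELAY_SECONDS)
    # The failure mode in this incident: the loop stalled here without alerting anyone.
    page_on_call(f"bootstrap failed {MAX_ATTEMPTS} times; manual intervention needed")
    return False

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    bootstrap_with_escalation()
```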
We are taking several concrete steps to ensure this type of incident does not happen again:
Proactive Certificate Monitoring: We've added automated checks and alerting for all internal and external certificates, with notifications sent well in advance of expiration (see the sketch after this list).
Improved Autoscaling Policies: We’ve updated our scaling logic to respond more aggressively to the types of spikes that triggered this incident.
Incident Detection Enhancements: We are improving observability and alerting during failover and startup processes, including specific checks for stalled bootstraps and certificate validation failures.
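As an illustration of the kind of check we mean by proactive certificate monitoring, here is a minimal sketch in Python. It connects to a host over TLS, reads the certificate's expiration date, and flags anything expiring within a configurable window. The hostnames, threshold, and alerting path are placeholders for the example, not our production tooling.

```python
import socket
import ssl
from datetime import datetime, timezone

# Hypothetical values for the example, not production settings.
WARNING_THRESHOLD_DAYS = 30
MONITORED_HOSTS = ["internal-service.example.com"]

def days_until_expiry(host: str, port: int = 443) -> int:
    """Connect to a host over TLS and return the days until its certificate expires."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires_at = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (expires_at - datetime.now(timezone.utc)).days

def check_certificates() -> None:
    """Flag any monitored certificate that expires within the warning window."""
    for host in MONITORED_HOSTS:
        remaining = days_until_expiry(host)
        if remaining <= WARNING_THRESHOLD_DAYS:
            # In a real setup this would page on-call or post to a chat channel.
            print(f"ALERT: certificate for {host} expires in {remaining} days")
        else:
            print(f"OK: certificate for {host} valid for {remaining} more days")

if __name__ == "__main__":
    check_certificates()
```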
We take incidents like this very seriously. Even short periods of downtime are unacceptable, especially when they impact the players and developers who rely on LootLocker every day. While we’re proud of our systems’ usual reliability, we know there’s no room for complacency. Thank you for your continued trust—we’re working hard to make sure we earn it every day.
If you have any questions or concerns about this incident, please don’t hesitate to contact us either on Discord or by email.