Outage Postmortem: Database Provider Issue

Incident
Andreas Stokholm (2 min read)

On Saturday at 22:29 UTC, our platform experienced an outage due to an issue with our database provider. The root cause of this outage was an expired SSL certificate on their end, which disrupted our ability to connect to their service. While this issue originated externally, we acknowledge that the absence of alerts within our monitoring systems prolonged the response time. We have since addressed this gap to ensure quicker resolution in the future.

Impact

During the outage window, customers were unable to access our services. This disruption affected all users relying on our platform, leading to downtime and inconvenience. We sincerely apologize for any frustration this may have caused.

Root Cause

The database provider's SSL certificate expired, which caused connection failures across our systems. Our monitoring tools did not detect or alert us to this issue promptly, which delayed our mitigation efforts.

The provider's postmortem can be found here: eu-west region expired TLS certificate.

Resolution

Once the expired certificate was identified as the root cause, we temporarily disabled certificate verification on our services. Services were fully operational approximately an hour and a half after the outage began.

This morning we verified that the certificate has been renewed on the provider side, and we have re-enabled certificate verification.
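For readers wondering what such a temporary toggle can look like, here is a minimal Python sketch. It is not our actual service code: the function name `build_tls_context` and the `verify` flag are hypothetical, and the sketch only shows how the standard `ssl` module lets a client switch certificate verification off and back on.

```python
import ssl


def build_tls_context(verify: bool = True) -> ssl.SSLContext:
    """Return a client TLS context; `verify` is a hypothetical emergency switch."""
    # The default context verifies the certificate chain and the hostname.
    context = ssl.create_default_context()
    if not verify:
        # Emergency mode only: the connection is still encrypted, but the
        # server's identity is no longer checked.
        # check_hostname must be turned off before verify_mode can be CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

Running without verification removes protection against impersonation of the server, which is why a switch like this should be flipped back as soon as the certificate is valid again.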

Actions Taken

  1. Implemented new monitoring rules to detect and alert on database issues proactively.
  2. Reviewed and enhanced our internal incident response processes to ensure faster triage and communication in similar scenarios.
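As an illustration of the kind of check that can catch this class of problem early, the sketch below reads a server certificate's expiry date and classifies it. It is an example under stated assumptions, not our production monitoring rule: the function names and the 14-day `WARN_DAYS` threshold are hypothetical, and only Python's standard `ssl`, `socket`, and `time` modules are used.

```python
import socket
import ssl
import time
from typing import Optional

WARN_DAYS = 14  # hypothetical threshold: warn two weeks before expiry


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the certificate's notAfter field, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # The handshake raises ssl.SSLCertVerificationError if the certificate
        # is already expired or otherwise invalid.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Convert a notAfter string into days remaining (negative if expired)."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires_at - now) / 86400


def classify(days_left: float, warn_days: int = WARN_DAYS) -> str:
    """Map the remaining days to a simple alert level."""
    if days_left <= 0:
        return "expired"
    if days_left <= warn_days:
        return "expiring"
    return "ok"
```

A scheduled job could call `classify(days_until_expiry(fetch_not_after(host)))` for each external dependency and alert on anything other than `"ok"`, treating a handshake error from `fetch_not_after` as an alert in its own right.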

Next Steps

  • Conduct regular audits of third-party providers to anticipate potential disruptions.
  • Improve visibility into external dependencies to ensure timely detection and resolution of service-affecting issues.
  • Continuously update and refine our public status page to keep customers informed during incidents.

Recommendations for Customers

We encourage all customers to subscribe to updates on our status page for real-time information during incidents and ongoing transparency about our operations.

We deeply regret any inconvenience caused by this outage and appreciate your understanding as we work to prevent such incidents in the future. Thank you for your continued trust in our services.

Contact Us

If you have any questions or concerns about this incident, please don’t hesitate to contact us either on Discord or by email.
