Date: 2026-04-28
Severity: P2 (Degraded Service)
Duration: ~3 hours degraded service; data recovery completed over the following ~2 days
Status: Resolved
Before anything else: we're sorry. Our customers trust us to keep their data safe and their services running, and on this occasion we let them down. The incident described below was ultimately caused by decisions we made, some of which set off the chain of events that led here without our realising it at the time. We're publishing this postmortem in full so that our customers can see exactly what happened, what we got wrong, and what we are changing as a result. If you were affected and we have not already been in touch, please reach out via our support channels.
Migrating was a strategic decision, motivated by the long-term benefits it unlocks both for us and for our customers. We moved from a large managed database provider to a custom MySQL deployment running with a European infrastructure partner. A custom setup gives us the freedom to grow at our own pace and to shape the database to fit our needs as the platform evolves, rather than being constrained by the shape and limits of a generic managed offering. Working with a closer, European partner also means we can have a direct, needs-based dialogue about capacity, performance, and roadmap, rather than being one ticket among many at a much larger provider. The cost picture is favourable too. The new setup costs on average roughly a tenth as much per month, and those savings translate directly into more performance and headroom for our customers without raising what they pay. Migrations of this scale are rare for us; they are planned and rehearsed carefully, which is part of why we want to be fully transparent about how this one went wrong.
On 2026-04-28, at cutover in the migration described above, we discovered that a small window of data written roughly eight days earlier, while replication to the new database was still running, had not been replicated. The service was degraded for approximately 3 hours from cutover until normal behaviour was restored. Recovery of the identified missing records was completed piecemeal over the following two days, without further service disruption. We cannot fully rule out that some additional data written during the window of instability was also affected.
The root cause was traced to a brief crash of a temporary one-off replication process eight days earlier, on 2026-04-20. The process recovered automatically within minutes, but the in-flight batches at the moment of the crash were dropped and never replayed. Because the crash was short and the process resumed normally, there was no visible signal at the time that anything had been lost.
A temporary replication process running for the duration of the migration held a database connection using credentials that were rotated as part of an unrelated configuration fix on 2026-04-20. The rotation caused the process to crash. It recovered automatically within minutes, but the data batches that were in flight at the moment of the crash were dropped and were not replayed. Due to the way replication offsets and timezones interacted, the affected window also included a small amount of data from the previous day.
The loss went undetected until migration day. The replication process continued running normally after recovering, so there was no visible signal that anything had been dropped.
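For readers who want a concrete picture of the failure mode, here is a minimal sketch of a replication loop that does not lose in-flight batches when it crashes. It is an illustration only, not the code that ran during the migration: the Replicator class, its fields, and the simulated crash are all hypothetical. The idea it shows is that the checkpoint advances only after a batch has been applied, so a restart replays whatever was in flight.

```python
from dataclasses import dataclass, field


@dataclass
class Replicator:
    """Toy replicator (hypothetical): copies ordered batches to a target."""
    source: list                      # ordered batches read from the source
    target: list = field(default_factory=list)
    committed_offset: int = 0         # durable checkpoint: next batch to apply

    def run(self, crash_at=None):
        """Apply batches from the last checkpoint; optionally simulate a crash."""
        offset = self.committed_offset
        while offset < len(self.source):
            batch = self.source[offset]        # batch is now "in flight"
            if offset == crash_at:
                # Stand-in for the connection failing mid-batch.
                raise ConnectionError("connection lost while batch in flight")
            if batch not in self.target:       # idempotent apply: replay is safe
                self.target.append(batch)
            offset += 1
            self.committed_offset = offset     # checkpoint only after the apply
```

With this shape, a crash while a batch is in flight leaves the checkpoint pointing at that batch, and the next call to run() picks it up again instead of skipping past it.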
Contributing factors:
The team prioritised restoring normal service behaviour over identifying the underlying cause. Normal behaviour was restored by ~13:30 BST, around 3 hours after cutover. Root cause analysis took place later the same evening and identified the dropped in-flight batches from the crash eight days earlier.
At that point it became clear that the original source database, which had been deleted shortly after cutover, was needed as the replay source. A restore was requested from the previous provider, which delayed recovery. Once the database was restored, the affected records were replayed over the following two days. The missing records affected a specific subset of users but did not prevent the service from operating.
The migration itself was sound. The failure was a silent secondary effect of an unrelated change made eight days earlier. A credential rotation briefly crashed a replication process, which then appeared to recover normally. The insidious part was that the process did recover and continue working, so there was no obvious signal that a small window of data had been dropped. A pre-cutover data integrity check would have caught this before traffic resumed.
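To make the idea of a pre-cutover integrity check concrete, the sketch below compares source and target in hourly buckets, using a row count and a digest per bucket. It is an assumption-laden illustration rather than a description of any tooling we run: the function names, the (id, written_at, payload) row shape, and the choice of hourly buckets are all hypothetical. Timestamps are normalised to UTC before bucketing, so both sides are grouped identically whatever timezone each server uses.

```python
import hashlib
from collections import defaultdict
from datetime import datetime, timezone


def bucket_digests(rows):
    """Map each UTC hour to (row_count, digest).

    rows: iterable of (id, written_at, payload); written_at must be a
    timezone-aware datetime (hypothetical row shape for this sketch).
    """
    buckets = defaultdict(list)
    for row_id, written_at, payload in rows:
        # Bucket by UTC hour so source and target agree on boundaries.
        hour = written_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")
        buckets[hour].append(f"{row_id}:{payload}")
    result = {}
    for hour, items in buckets.items():
        # Sort first so the digest does not depend on read order.
        digest = hashlib.sha256("\n".join(sorted(items)).encode()).hexdigest()
        result[hour] = (len(items), digest)
    return result


def mismatched_hours(source_rows, target_rows):
    """Return the sorted UTC hours whose count or digest differs."""
    src = bucket_digests(source_rows)
    dst = bucket_digests(target_rows)
    return sorted(h for h in src.keys() | dst.keys() if src.get(h) != dst.get(h))
```

An empty result from mismatched_hours would be the signal to proceed; any listed hour narrows the search to a small window that can be inspected and replayed while the source still exists.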
Recovery was significantly complicated by the early deletion of the source database. The decision was reasonable in context, since the migration had gone cleanly and the old database felt like dead weight, but it was made before data integrity in the new database had been validated. In future migrations, the source will be retained until post-cutover validation is complete.
More fundamentally, our choice to perform a live migration in order to minimise service disruption worked against us. A live cutover left no clean point at which to verify integrity before traffic resumed, and the eventual ~3 hours of degraded service plus two days of piecemeal recovery far exceeded the disruption a scheduled maintenance window would have caused. A short, planned outage would have given us the breathing room to validate the new database before opening it to writes, and would have avoided the incident entirely.
Our communication around the migration also fell short. We did post an announcement on Discord ahead of the change, but it was not clear enough about the potential impact and it did not give customers enough lead time to plan around it. Going forward, announcements for changes of this kind will be published in our in-product News feed (visible from the web console) as well as on Discord, and we will give substantially more advance notice.
If you have any questions or concerns about this incident, please don’t hesitate to contact us either on Discord or by email.