Postmortem: Data Loss During Database Migration

Date: 2026-04-28

Severity: P2 (Degraded Service)

Duration: ~3 hours degraded service; data recovery completed over the following ~2 days

Status: Resolved

A Note from the Team

Before anything else: we're sorry. Our customers trust us to keep their data safe and their services running, and on this occasion we let them down. The incident described below was ultimately caused by decisions we made, some of which set off, without our realising it at the time, the chain of events that led here. We're publishing this postmortem in full so that our customers can see exactly what happened, what we got wrong, and what we are changing as a result. If you were affected and we have not already been in touch, please reach out via our support channels.

Summary

Why we were migrating

Migrating was a strategic decision, motivated by the long-term benefits it unlocks for us and for our customers. We moved from a large managed database provider to a custom MySQL deployment run with a European infrastructure partner. A custom setup lets us grow at our own pace and shape the database to our needs as the platform evolves, rather than working within the limits of a generic managed offering. Working with a closer, European partner also means we can have a direct dialogue about capacity, performance, and roadmap, rather than being one ticket among many at a much larger provider. The cost picture is favourable too: the new setup costs, on average, roughly a tenth as much per month, and those savings translate directly into more performance and headroom for our customers without raising what they pay. Migrations of this scale are rare for us, and they are planned and rehearsed carefully, which is part of why we want to be fully transparent about how this one went wrong.

What happened

On 2026-04-28, during the migration described above, we discovered that a small window of data, written between 2026-04-19 and 2026-04-20, had not been replicated to the new database. The service was degraded for approximately 3 hours, from cutover until normal behaviour was restored. The identified missing records were recovered piecemeal over the following two days, without further service disruption. We cannot fully rule out that some additional data written during the window of instability was also affected.

The root cause was traced to a brief crash of a temporary one-off replication process eight days earlier, on 2026-04-20. The process recovered automatically within minutes, but the in-flight batches at the moment of the crash were dropped and never replayed. Because the crash was short and the process resumed normally, there was no visible signal at the time that anything had been lost.

Timeline (all times BST)

2026-04-20 ~16:20: A routine credential rotation caused the replication process to crash. In-flight data batches were dropped. The process recovered within minutes and resumed replication normally.
2026-04-28 10:34: Migration to the new database executed; cutover proceeded without issue.
2026-04-28 ~10:40: The source database was deleted, on the assumption that a rollback path was no longer needed.
2026-04-28 ~11:30: Data anomalies detected in production; investigation began.
2026-04-28 ~13:30: Service behaviour restored to normal; deeper root cause analysis deferred in favour of mitigation.
2026-04-28 ~19:00: Root cause identified. A restore of the source database was requested from the previous provider so the dropped batches could be replayed.
2026-04-29 – 2026-04-30: Source database restored; affected records replayed for impacted users. No further service disruption.

Root Cause

A temporary replication process running for the duration of the migration held a database connection using credentials that were rotated as part of an unrelated configuration fix on 2026-04-20. The rotation caused the process to crash. It recovered automatically within minutes, but the data batches that were in flight at the moment of the crash were dropped and were not replayed. Due to the way replication offsets and timezones interacted, the affected window also included a small amount of data from the previous day.
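One common way to make a batch replicator tolerate a crash of this kind is to persist the replication offset only after a batch has been applied to the target, so that a restart re-reads whatever was in flight. The sketch below is an illustration only, not a description of the tooling used in this migration: `replicate`, `apply_batch`, and the in-memory `checkpoint` dict are hypothetical names, and it assumes writes to the target are idempotent (keyed upserts), so replaying a batch is harmless.

```python
from typing import Callable, Dict, List, Sequence


def replicate(
    source: Sequence[dict],
    apply_batch: Callable[[List[dict]], None],
    checkpoint: Dict[str, int],
    batch_size: int = 2,
) -> None:
    """Copy rows from `source` to a target in batches.

    The offset in `checkpoint` is advanced only after `apply_batch`
    returns, so a crash mid-batch leaves the offset pointing at the
    start of the unfinished batch and a restart replays it.
    """
    offset = checkpoint.get("offset", 0)
    while offset < len(source):
        batch = list(source[offset:offset + batch_size])
        apply_batch(batch)                 # may raise, e.g. on an auth failure
        offset += len(batch)
        checkpoint["offset"] = offset      # persist only after success


# Hypothetical demonstration: the target rejects the second batch once,
# as if its credentials had just been rotated.
rows = [{"id": i} for i in range(5)]
target: Dict[int, dict] = {}
state = {"failed_once": False}


def apply_batch(batch: List[dict]) -> None:
    if batch[0]["id"] == 2 and not state["failed_once"]:
        state["failed_once"] = True
        raise ConnectionError("credentials rejected")
    for row in batch:
        target[row["id"]] = row            # keyed upsert, so replays are safe


checkpoint: Dict[str, int] = {}
try:
    replicate(rows, apply_batch, checkpoint)
except ConnectionError:
    pass                                   # the process "crashes" here

replicate(rows, apply_batch, checkpoint)   # the restart resumes from offset 2
```

In a real deployment the checkpoint would live in durable storage rather than a dict; the ordering of "apply, then record the offset" is the part that matters.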

The loss went undetected until migration day. The replication process continued running normally after recovering, so there was no visible signal that anything had been dropped.

Contributing factors:

No monitoring or alerting on the health or replication lag of the temporary sync process.
No pre-cutover runbook step requiring a sync health and integrity check.
The credential rotation was treated as a low-risk configuration change, with no downstream impact assessment for processes that might be holding the credential.
The source database was deleted shortly after cutover, before data integrity in the new database had been confirmed. This meant that when the issue was discovered, the source had to be restored before recovery could begin, significantly delaying the process.
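The first factor is worth making concrete. A lag check alone would not have been enough here, because the process caught up within minutes of the crash; a restart of the sync process needs to be treated as a signal in its own right. The following is a minimal sketch under that assumption. The function name `sync_alerts`, its parameters, and the five-minute threshold are all hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def sync_alerts(
    last_applied_at: Optional[datetime],
    restarts_since_start: int,
    now: datetime,
    max_lag: timedelta = timedelta(minutes=5),
) -> List[str]:
    """Return human-readable alerts for a temporary sync process."""
    alerts: List[str] = []
    if last_applied_at is None:
        alerts.append("sync has never applied a batch")
    elif now - last_applied_at > max_lag:
        alerts.append(f"replication lag {now - last_applied_at} exceeds {max_lag}")
    # A restart is flagged even when lag looks healthy again, because
    # in-flight batches may have been dropped while the process was down.
    if restarts_since_start > 0:
        alerts.append(
            f"sync process restarted {restarts_since_start} time(s); "
            "verify that no batches were dropped"
        )
    return alerts


# Hypothetical readings taken shortly after a crash and automatic recovery.
now = datetime(2026, 4, 20, 16, 30, tzinfo=timezone.utc)
healthy = sync_alerts(now - timedelta(minutes=1), 0, now)
recovered = sync_alerts(now - timedelta(minutes=1), 1, now)
```

Here `healthy` is empty, while `recovered` carries one alert even though its lag is within the threshold.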

Impact

Service degradation: ~3 hours on migration day (10:34 – ~13:30 BST).
Data gap: A small window of writes spanning 2026-04-19 to 2026-04-20 was missing from the new database after cutover.
Data recovery: The identified missing records were replayed piecemeal over 2026-04-29 to 2026-04-30. Recovery did not impact ongoing service availability.
Residual risk: We cannot fully exclude the possibility that some additional data written during the window of service instability was affected.

Resolution

The team prioritised restoring normal service behaviour over identifying the underlying cause. Normal behaviour was restored by ~13:30 BST, around 3 hours after cutover. Root cause analysis took place later the same evening and identified the dropped in-flight batches from the crash eight days earlier.

At that point it became clear that the original source database, which had been deleted shortly after cutover, was needed as the replay source. A restore was requested from the previous provider, which delayed recovery. Once the database was restored, the affected records were replayed over the following two days. The data gap affected a specific subset of users but did not prevent the service from operating.

Lessons Learned

The migration itself was sound. The failure was a silent secondary effect of an unrelated change made eight days earlier. A credential rotation briefly crashed a replication process, which then appeared to recover normally. The insidious part was that the process did recover and continue working, so there was no obvious signal that a small window of data had been dropped. A pre-cutover data integrity check would have caught this before traffic resumed.
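As an illustration of what such a check might look like, the sketch below compares per-day row counts between source and target and reports the days that disagree. It is a simplified example, not the check we will run: it uses Python's built-in `sqlite3` with in-memory databases so that it is self-contained (the real systems are MySQL), and the `events` table, the `created_at` column, and the function names are hypothetical. Row counts only detect missing or extra rows; per-day checksums would be needed to detect altered ones.

```python
import sqlite3
from typing import Dict, List, Tuple


def daily_counts(conn: sqlite3.Connection, table: str) -> Dict[str, int]:
    """Row count per calendar day, keyed by ISO date.

    `table` is interpolated into the SQL, so it must be a trusted identifier.
    """
    query = (
        f"SELECT date(created_at), COUNT(*) FROM {table} "
        "GROUP BY date(created_at)"
    )
    return dict(conn.execute(query).fetchall())


def mismatched_days(
    source: sqlite3.Connection, target: sqlite3.Connection, table: str
) -> List[Tuple[str, int, int]]:
    """Return (day, source_count, target_count) for every day that differs."""
    src, dst = daily_counts(source, table), daily_counts(target, table)
    return [
        (day, src.get(day, 0), dst.get(day, 0))
        for day in sorted(set(src) | set(dst))
        if src.get(day, 0) != dst.get(day, 0)
    ]


def make_db(timestamps: List[str]) -> sqlite3.Connection:
    """Build an in-memory database holding one row per timestamp."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
    conn.executemany(
        "INSERT INTO events (created_at) VALUES (?)", [(t,) for t in timestamps]
    )
    return conn


# Hypothetical data: the target is missing one row on each of two days.
source_db = make_db([
    "2026-04-19 23:10:00", "2026-04-19 23:50:00",
    "2026-04-20 16:18:00", "2026-04-20 16:19:00", "2026-04-20 16:25:00",
    "2026-04-21 09:00:00",
])
target_db = make_db([
    "2026-04-19 23:10:00",
    "2026-04-20 16:18:00", "2026-04-20 16:25:00",
    "2026-04-21 09:00:00",
])

gaps = mismatched_days(source_db, target_db, "events")
```

A non-empty `gaps` list would block cutover until the listed days had been re-synced and re-checked.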

Recovery was significantly complicated by the early deletion of the source database. The decision was reasonable in context, since the migration had gone cleanly and the old database felt like dead weight, but it was made before data integrity in the new database had been validated. In future migrations, the source will be retained until post-cutover validation is complete.

More fundamentally, our choice to perform a live migration in order to minimise service disruption worked against us. A live cutover left no clean point at which to verify integrity before traffic resumed, and the eventual ~3 hours of degraded service plus two days of piecemeal recovery far exceeded the disruption a scheduled maintenance window would have caused. A short, planned outage would have given us the breathing room to validate the new database before opening it to writes, and would have avoided the incident entirely.

Our communication around the migration also fell short. We did post an announcement on Discord ahead of the change, but it was not clear enough about the potential impact and it did not give customers enough lead time to plan around it. Going forward, announcements for changes of this kind will be published in our in-product News feed (visible from the web console) as well as on Discord, and we will give substantially more advance notice.

What We're Changing

Future technical changes of comparable risk will be performed within a scheduled maintenance window rather than as live cutovers, accepting a short planned outage in exchange for the ability to validate integrity before traffic resumes.
A pre-cutover checklist will be required for changes of this kind, including verification that any data replication or sync processes are live and caught up immediately before cutover.
Credential rotations affecting service accounts used by long-running processes will require an explicit downstream impact review.
Source systems will not be decommissioned or deleted until integrity in the target has been independently verified post-cutover.
Customer-facing communication for planned maintenance and migrations will be published in our in-product News feed (visible from the web console) as well as on Discord, with clear impact statements and substantially more advance notice than we gave on this occasion.
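The checklist and decommissioning rules above can be expressed as explicit gates that tooling refuses to bypass. The sketch below is one hypothetical way to model them; the `MigrationState` fields and the two gate functions are illustrative names, not an existing system.

```python
from dataclasses import dataclass


@dataclass
class MigrationState:
    replication_caught_up: bool = False
    integrity_verified_pre_cutover: bool = False
    cutover_done: bool = False
    integrity_verified_post_cutover: bool = False


def can_cut_over(state: MigrationState) -> bool:
    """Cutover requires a caught-up sync and a passing integrity check."""
    return state.replication_caught_up and state.integrity_verified_pre_cutover


def can_delete_source(state: MigrationState) -> bool:
    """The source is retained until the target is verified after cutover."""
    return state.cutover_done and state.integrity_verified_post_cutover


# Hypothetical state just after a cutover, before any post-cutover
# validation has been carried out.
state = MigrationState(
    replication_caught_up=True,
    integrity_verified_pre_cutover=True,
    cutover_done=True,
)
deletion_allowed = can_delete_source(state)   # False until validation passes
```

Keeping the two gates separate reflects that passing the pre-cutover check does not, on its own, justify removing the rollback path.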

Contact Us

If you have any questions or concerns about this incident, please don’t hesitate to contact us either on Discord or by email.


