What happened
At 22:20 UTC on May 19, our hosting provider Railway was hit by a platform-wide outage after Google Cloud incorrectly suspended their production account. This took Railway's API, control plane, and parts of their network infrastructure offline. Railway restored their platform at approximately 08:00 UTC on May 20.
Our US api service survived the network partition but didn't fully recover its internal state (database pool, background schedulers) when Railway came back online. The container stayed in a degraded state — accepting TCP connections but unable to serve HTTP — until we forced a redeploy at 14:40 UTC on May 20, which restored service at 14:42 UTC.
Customer impact
- Users on app.useatlas.dev saw login failures and 502 responses (reported as CORS errors in the browser, but the underlying cause was the API not responding).
- Hosted MCP connections at mcp.useatlas.dev timed out.
- EU and APAC API endpoints were unaffected.
Resolution
Forced redeploy of the US api service. All endpoints verified healthy:
- /api/health returning 200
- /api/auth/get-session returning 200
- CORS preflight from app.useatlas.dev returning 204 with correct headers
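
For readers who want to run similar checks against their own services, the three verifications above can be sketched roughly as follows. This is an illustration, not the script we ran: the base URL `https://api.example.com` is a placeholder, the path used for the preflight request is an assumption, and "correct headers" is simplified here to mean that `Access-Control-Allow-Origin` echoes the app origin.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Tuple

# Placeholder base URL: the real API host is not stated in this report.
API_BASE = "https://api.example.com"
APP_ORIGIN = "https://app.useatlas.dev"

Response = Tuple[int, Dict[str, str]]
Fetch = Callable[[str, str, Dict[str, str]], Response]


def http_fetch(method: str, url: str, headers: Dict[str, str]) -> Response:
    """Send one request; return (status, lower-cased response headers)."""
    req = urllib.request.Request(url, method=method, headers=headers)
    try:
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status, {k.lower(): v for k, v in resp.headers.items()}
    except urllib.error.HTTPError as err:
        # Non-2xx responses (such as 502) still carry a status worth reporting.
        return err.code, {k.lower(): v for k, v in err.headers.items()}
    except OSError:
        # Timeouts and connection errors: report status 0 (no HTTP response).
        return 0, {}


def verify(fetch: Fetch = http_fetch, base: str = API_BASE) -> Dict[str, bool]:
    """Run the three checks and return a pass/fail flag for each."""
    results: Dict[str, bool] = {}

    status, _ = fetch("GET", f"{base}/api/health", {})
    results["health"] = status == 200

    status, _ = fetch("GET", f"{base}/api/auth/get-session", {})
    results["session"] = status == 200

    # Assumption: the preflight targets /api/health; any CORS-enabled
    # route would serve the same purpose.
    preflight_headers = {
        "Origin": APP_ORIGIN,
        "Access-Control-Request-Method": "GET",
    }
    status, headers = fetch("OPTIONS", f"{base}/api/health", preflight_headers)
    results["cors_preflight"] = (
        status == 204
        and headers.get("access-control-allow-origin") == APP_ORIGIN
    )
    return results
```

Calling `verify()` with no arguments performs real HTTP requests. Passing a stub `fetch` function makes the same logic testable without network access.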
What we're doing to prevent recurrence
- Configuring a continuous liveness probe on /api/health with automatic container restart; this would have self-healed when Railway recovered, rather than waiting for manual intervention.
- Hardening our database pool, scheduler, and background plugins to recover cleanly from extended upstream network partitions.
- Reviewing our hosting topology to reduce dependency on a single provider's recovery timeline.
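
To illustrate the first item, here is a minimal sketch of the decision logic behind a liveness probe with automatic restart. It is not our production configuration or Railway's implementation: the `LivenessProbe` class, the threshold of three consecutive failures, and the `check` and `restart` callbacks are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LivenessProbe:
    """Restart a service after several consecutive failed health checks."""

    check: Callable[[], bool]      # hypothetical: True if /api/health returns 200
    restart: Callable[[], None]    # hypothetical: asks the platform to restart
    failure_threshold: int = 3     # assumed value, tune per service
    consecutive_failures: int = 0

    def tick(self) -> bool:
        """Run one probe cycle; return True if a restart was triggered."""
        try:
            healthy = self.check()
        except Exception:
            # A probe that errors or times out counts as a failed check.
            healthy = False

        if healthy:
            self.consecutive_failures = 0
            return False

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Reset the counter so the fresh container gets a clean slate.
            self.consecutive_failures = 0
            self.restart()
            return True
        return False
```

Requiring several consecutive failures keeps a single slow response from triggering a restart, while a container stuck in a degraded state, like ours was, would still be recycled after a few probe intervals.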
We're sorry for the disruption. If you experienced data loss or have questions about your workspace, reach out at support@useatlas.dev.