Atlas Status

What happened

At 22:20 UTC on May 19, our hosting provider Railway was hit by a platform-wide outage after Google Cloud incorrectly suspended Railway's production account. The suspension took Railway's API, control plane, and parts of their network infrastructure offline. Railway restored their platform at approximately 08:00 UTC on May 20.

Our US api service survived the network partition but didn't fully recover its internal state (database pool, background schedulers) when Railway came back online. The container stayed in a degraded state, accepting TCP connections but unable to serve HTTP, until we forced a redeploy at 14:40 UTC on May 20. Service was restored at 14:42 UTC.
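
The gap between "accepts TCP connections" and "serves HTTP" is what a port-only check misses. The following is a minimal sketch, not our production tooling, of a probe that tells the two apart. The function name classify, the plain-HTTP transport, and the default timeout are assumptions made for illustration.

```python
import http.client
import socket


def classify(host: str, port: int, path: str = "/api/health",
             timeout: float = 2.0) -> str:
    """Return "down", "degraded", or "healthy" for one endpoint.

    Hypothetical helper: plain HTTP is used to keep the sketch short;
    a real probe would normally go through HTTPS.
    """
    # Step 1: can we open a TCP connection at all?
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
        sock.close()
    except OSError:
        return "down"

    # Step 2: does the service answer an HTTP request within the timeout?
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException):
        # TCP worked but HTTP did not: the state described above.
        return "degraded"
    finally:
        conn.close()

    return "healthy" if status == 200 else "degraded"
```

A check that stops after step 1 would have reported the container as up for the whole incident. Only step 2 exposes the degraded state.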

Customer impact

Users on app.useatlas.dev saw login failures and 502 responses (reported as CORS errors in the browser, but the underlying cause was the API not responding).
Hosted MCP connections at mcp.useatlas.dev timed out.
EU and APAC API endpoints were unaffected.

Resolution

We forced a redeploy of the US api service and verified that all endpoints were healthy:

/api/health returning 200
/api/auth/get-session returning 200
CORS preflight from app.useatlas.dev returning 204 with correct headers
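
For readers who want to reproduce the preflight check, here is a rough sketch of one way to do it. It assumes that "correct headers" means the response echoes the requesting origin in Access-Control-Allow-Origin and lists the requested method in Access-Control-Allow-Methods. The function names and the caller-supplied URL are hypothetical, and this is not the exact script we ran.

```python
import urllib.error
import urllib.request
from typing import Mapping


def preflight_ok(status: int, headers: Mapping[str, str],
                 origin: str, method: str = "POST") -> bool:
    """Judge a preflight response (assumed criteria, see the lead-in)."""
    # Header names are case-insensitive, so normalize them first.
    normalized = {k.lower(): v for k, v in headers.items()}
    if status != 204:
        return False
    if normalized.get("access-control-allow-origin") != origin:
        return False
    allowed = {m.strip().upper()
               for m in normalized.get("access-control-allow-methods", "").split(",")}
    return method.upper() in allowed


def check_preflight(url: str, origin: str, method: str = "POST",
                    timeout: float = 5.0) -> bool:
    """Send an OPTIONS preflight to `url` and evaluate the response."""
    request = urllib.request.Request(url, method="OPTIONS", headers={
        "Origin": origin,
        "Access-Control-Request-Method": method,
    })
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return preflight_ok(response.status, dict(response.headers.items()),
                                origin, method)
    except (urllib.error.URLError, OSError):
        # Error statuses (such as 502) and network failures both count as a failed check.
        return False
```

Calling check_preflight with the API endpoint URL and the origin https://app.useatlas.dev returns True only when the preflight succeeds under the assumptions above.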

What we're doing to prevent recurrence

Configuring a continuous liveness probe on /api/health with automatic container restart. With this in place, the container would have been restarted automatically once Railway recovered, rather than waiting for manual intervention.
Hardening our database pool, scheduler, and background plugins to recover cleanly from extended upstream network partitions.
Reviewing our hosting topology to reduce dependency on a single provider's recovery timeline.
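
To illustrate the liveness-probe item above, this is a minimal sketch of the restart decision, assuming a consecutive-failure threshold. The class name LivenessWatchdog, the threshold of 3, and the probe and restart callables are all hypothetical. The actual mechanism will depend on what our platform supports.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class LivenessWatchdog:
    """Request a restart after N consecutive failed probes (assumed policy)."""
    failure_threshold: int = 3       # hypothetical value
    consecutive_failures: int = 0

    def observe(self, healthy: bool) -> bool:
        """Record one probe result; return True when a restart is due."""
        if healthy:
            # Any success clears the streak, so brief blips never trigger a restart.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            # Reset so the restarted container gets a fresh streak.
            self.consecutive_failures = 0
            return True
        return False


def run(watchdog: LivenessWatchdog, probe: Callable[[], bool],
        restart: Callable[[], None], interval_seconds: float,
        max_probes: int) -> int:
    """Probe `max_probes` times; return how many restarts were requested."""
    restarts = 0
    for _ in range(max_probes):
        if watchdog.observe(probe()):
            restart()
            restarts += 1
        time.sleep(interval_seconds)
    return restarts
```

Requiring several consecutive failures is a deliberate trade-off: it delays recovery by a few probe intervals but avoids restarting a healthy container over a single slow response.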

We're sorry for the disruption. If you experienced data loss or have questions about your workspace, reach out at support@useatlas.dev.