Issue Summary
On 03/01/2016 from 3:39 PM to 4:04 PM CET Photon’s name servers were denying service. Users were not able to connect to Photon public and enterprise cloud services. All services (Realtime, Turnbased, Chat) in all regions were affected. Users who were already connected to Photon services did not experience a disruption. The root cause of the outage was the deployment of a wrong package to the name servers.
Timeline (CET)
3:39 PM Deployment of wrong package starts
3:40 PM Service outage on first regions
3:44 PM Deployment to all name servers finished
3:44 PM Name server outage on all regions
3:47 PM Alerts from monitoring tools
3:48 PM Start of investigation
4:00 PM Deployment of correct package
4:04 PM All name servers on all regions running correctly again
4:20 PM CCU numbers for all services back to normal
Root Cause
At 3:39 PM a configuration update was meant to be released to the Photon production name servers. During a manual step the wrong configuration was accidentally pushed to the name servers. This caused the name servers to deny incoming requests and return error messages to clients. As a result, clients were not routed to Photon cloud services and could not join lobbies.
Resolution and Recovery
At 3:47 PM our monitoring systems started alerting, and the first customers reported issues. After a quick investigation our engineers identified the wrong configuration on the name servers. At 4:00 PM we started rolling out the correct package. At 4:04 PM name servers for all regions were running correctly and all Photon cloud services started to recover. By 4:20 PM CCU numbers of Photon services were back to normal.
Corrective and Preventative Measures
We are taking the following actions in order to address the underlying causes and to improve response times:
- Review and fix the current deployment mechanism, which should have prevented this issue.
- Introduce further automation steps to the process in order to prevent manual mistakes.
- Add post-deployment name server tests to our automated deployment so that a failed test triggers an alert immediately.
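
As an illustration of the last two measures, the following is a minimal sketch of a deployment loop that tests each name server right after it is updated and halts with an alert on the first failure. It is not our actual tooling: the function names, the `deploy`, `probe`, and `alert` callables, and the one-server-at-a-time rollout order are all assumptions made for this example, and how a real probe would query a name server is left open.

```python
from typing import Callable, Iterable, List

# Hypothetical probe signature: takes a name server address and returns True
# if the server answers a test request correctly. The actual request a probe
# would send to a name server is not specified here.
Probe = Callable[[str], bool]


def failing_servers(servers: Iterable[str], probe: Probe) -> List[str]:
    """Return the servers whose post-deployment probe failed or raised."""
    failed = []
    for server in servers:
        try:
            healthy = probe(server)
        except Exception:
            # A probe that crashes is treated the same as a failed probe.
            healthy = False
        if not healthy:
            failed.append(server)
    return failed


def deploy_with_checks(
    servers: Iterable[str],
    deploy: Callable[[str], None],
    probe: Probe,
    alert: Callable[[str], None],
) -> bool:
    """Deploy to one server at a time; stop and alert at the first failed probe."""
    for server in servers:
        deploy(server)
        if failing_servers([server], probe):
            alert(f"post-deployment check failed on {server}; rollout halted")
            return False
    return True


if __name__ == "__main__":
    # Stand-in callables for demonstration only (all names are made up).
    deployed: List[str] = []
    ok = deploy_with_checks(
        ["ns-a", "ns-b", "ns-c"],
        deploy=deployed.append,
        probe=lambda server: server != "ns-b",  # simulate a bad package on ns-b
        alert=print,
    )
    print(ok, deployed)
```

In this sketch a bad package reaches only the servers updated before the first failed test, and the alert is raised by the deployment itself rather than by separate monitoring.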