On Tuesday, October 27th, we became aware of a problem that affected some of our newer host deployments. Upon investigation, we determined that a software library was not present on these hosts due to an oversight in our build procedure. This shared library, once distributed across our fleet, caused an unforeseen incompatibility on some of the machines.
The affected machines were then unable to process jobs (boots, shutdowns, deployments, etc.), determine the current state of their Linodes, or manipulate their Linodes in any way (such as issuing shutdowns). Additionally, the affected machines incorrectly deconstructed networking for some Linodes.
Once the system administrators on duty became aware of the issues with these machines, all staff members were immediately alerted and began developing and executing a plan for emergency maintenance on the affected machines. Given the urgency of the situation and the largely inoperable state of these hosts, we decided to begin corrective actions without further delay. System administrators divided the workload among themselves and reported results back to managing personnel as the situation evolved.
Administrators worked until all outstanding customer issues were properly resolved, providing direct assistance to affected customers via support tickets and real-time support in our IRC channel. As a consequence of these events, we have significantly revised our internal procedures to include the following changes:
- We have adopted more formalized testing procedures for any and all updates to host machines.
- We have devised and enacted a more conservative approach to deploying system updates across our network. A key part of this approach is a gradual, more tightly controlled deployment method for such updates.
- Through customer feedback and our own internal discussions on this matter, we recognize the need for greatly improved customer communication. We are committed to developing and implementing reliable mechanisms for achieving this goal. We appreciate the feedback we have received on potential means of addressing these concerns, and are evaluating all options.
We are continuing to give this matter the full attention our customers deserve, and sincerely apologize for any issues experienced during the service outage. Both as a company and as individuals, we place a high value on the lessons learned from these unfortunate circumstances, and we look forward to being better able to serve our customers as a result of the knowledge gained.
-Linode Team
-------------------------------
On a personal note, I'd like to add that I am incredibly proud of our team's response to these issues. Every team member demonstrated unending dedication and devotion to resolving problems faced by our customers. Several staff members insisted on staying through the night to ensure the complete resolution of outstanding issues, while others worked remotely and came in early to relieve those who stayed, only to stay overnight themselves. Again, we apologize for the downtime, would like to reiterate our commitment to our customers, and assure you that this will serve to make Linode better.
-Chris