tronic wrote:
I've got a few questions for caker & co. I'm curious:
- How long into the outage before the Linode staff became aware of it?
- How long after Linode staff became aware before TP had it fully fixed?
- Did TP notify Linode about the problem, or did Linode have to notify TP about it?
- Do Linode staff receive notification if systems are down? [Monitoring system in place?]
We notified ThePlanet. ThePlanet does ping monitoring of all our hosts, with an escalation procedure to contact me if problems occur. This is another thing I'm steaming over: after the downtime last week they removed all our monitoring. I'm still waiting on a reply from them with regard to that issue. We'll also be finding a third-party monitoring solution so that we don't have all our eggs in one basket.
Power was restored minutes after we contacted ThePlanet.
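For anyone curious what host monitoring with an escalation step looks like in practice, here is a minimal sketch. It is not ThePlanet's system or ours. It assumes a TCP connection attempt in place of an ICMP ping (which normally needs root privileges), and the port, timeout, and three-failure threshold are illustrative choices, not values from our setup.

```python
import socket


def tcp_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Assumption: a TCP connect stands in for an ICMP ping here.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def update_failures(failures, host, is_up, threshold=3):
    """Track consecutive failed checks per host.

    Returns True exactly once, on the check where the host reaches
    `threshold` consecutive failures, so a person is contacted once per
    outage rather than on every failed check. A successful check resets
    the counter.
    """
    if is_up:
        failures[host] = 0
        return False
    failures[host] = failures.get(host, 0) + 1
    return failures[host] == threshold
```

A polling loop would call `tcp_reachable` for each host every minute or so, pass the result to `update_failures`, and page someone whenever it returns True. Running the checker from a different network than the hosts is what keeps the eggs out of one basket.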
tronic wrote:
- How come there wasn't a static website page (at perhaps a different IP, temporarily) with basic info (e.g. "TP's down, no ETR")?
We were busy getting everything back online ASAP, but an off-site status page is in the works.
tronic wrote:
- What does TP plan to do about their inability to keep a handle on power issues?
Are they going to hire an electrician or power engineer to perform a one-time review of how their power is distributed? Any remedial training for all NOC employees on the dos and don'ts of power handling?
*I* plan on un-doing the changes they made last week. Basically they overloaded two power circuits and left one lightly loaded. Not the most common-sense approach to distributing the load.
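To illustrate what an even split across circuits means, here is a rough sketch of a greedy assignment: place the largest loads first, each on whichever circuit currently carries the least. The amp figures are made up for the example and are not measurements from our racks, and a real layout also has to respect breaker ratings and physical cabling.

```python
def balance(loads_amps, n_circuits=3):
    """Greedily spread loads over circuits.

    Loads are placed largest first, each on the circuit with the lowest
    running total. Returns (circuits, totals), where circuits[i] lists
    the loads assigned to circuit i and totals[i] is their sum.
    """
    circuits = [[] for _ in range(n_circuits)]
    totals = [0.0] * n_circuits
    for load in sorted(loads_amps, reverse=True):
        i = totals.index(min(totals))  # least-loaded circuit so far
        circuits[i].append(load)
        totals[i] += load
    return circuits, totals


# Hypothetical per-host draws in amps.
circuits, totals = balance([4, 4, 4, 2, 2, 2])
```

With these made-up numbers each of the three circuits ends up carrying 6 amps, rather than two carrying most of the load and one sitting nearly idle.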
We're following up on every aspect of this issue.
-Chris