Linode Community Forums
Posted: Fri Sep 12, 2003 6:09 pm
Linode Staff
Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
Seriously, a few bizarre things happened today - all at the same time! What gives?!

We had a runaway Linode process on host4 that slowed it to a crawl. It just so happened that we started a handful of migrations off host4 to host8 around the same time. After a while and a lot of convincing that I should restart the Linode in question, I did. That fixed host4, and the Linodes on host4 looked a lot happier.

By that time the migrations (using scp, which was a bad idea in the first place because of performance) were hosed, with only one migration completing. I un-migrated the rest back to where they were. "Something" also caused host8 to either drop off the network or freeze completely. I wasn't able to get console access, so I had to reboot host8 :(

In the process of getting host8 rebooted, "someone" (*whistle*, looks around innocently) managed to power cycle host6. D'oh!

I'm disappointed because I thought the issue with host4 was hardware related. I'm starting to believe now that it's a software/kernel condition, since both host4 and host8 have had similar freezes. I need to capture an oops. Why haven't I already, you ask?

The one remote-console unit I have at ThePlanet is quirky. It only stays alive for a few minutes at a time and then drops off the network. Baytech wants me to send the bad module in for repair work, but I just bought a bunch of new console units for a new rack I'm building. I'm awaiting shipment of the new console units, and I'm sending two of the units up to ThePlanet as a replacement.

The host4 issue also caused a few shutdowns to not complete, so I had to follow up on those, as well.

Regarding shutdowns hanging, there seems to be a bug that's been hit about 7 or 8 times over the last few days. The job gets stuck because a call to uml_mconsole (a management utility for UMLs) is hanging.
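
One common way to keep a hung management call from blocking a whole job is to run it under a timeout. The sketch below is an illustration only, not how the Linode shutdown job actually works: the helper name run_with_timeout and the timeout values are hypothetical, and the uml_mconsole invocation appears only in a comment as an assumed example.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (finished, returncode).

    finished is False when the command did not exit within timeout_s
    seconds; in that case subprocess.run kills the child before raising.
    """
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return True, result.returncode
    except subprocess.TimeoutExpired:
        return False, None


if __name__ == "__main__":
    # Stand-in for a management call that hangs. A real job might pass
    # something like ["uml_mconsole", umid, "halt"] here (assumed usage).
    hanging = [sys.executable, "-c", "import time; time.sleep(10)"]
    finished, rc = run_with_timeout(hanging, 1)
    if not finished:
        print("management call timed out; flag the job for follow-up")
```

With this shape, the job can record the timeout and move on instead of sitting stuck behind one hung call.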

I'm also questioning the stability of the UML kernels I released. I now know (after the fact) that there were bugs in 2.4.22-1um and 2um. Hopefully 2.4.22-3um (linode9) is working fine. Just so you guys don't think I'm a total flake, I do perform a kernel compile inside the new UML kernels as a test. Lesson learned: If it ain't broke, don't fix it. I'll keep making newer kernels available, but I won't be pointing "Latest 2.4 Kernel" to anything that you guys haven't approved of first.

Our 2.4.21 kernel has uptimes of months (and counting).

I won't touch anything else today, I promise.

-Chris


Posted: Fri Sep 12, 2003 6:23 pm
Senior Member
Joined: Mon Jun 23, 2003 1:25 pm
Posts: 260
At least you only restarted one machine by mistake.

It is not as bad as logging in to the wrong remote power unit and taking offline an array of UPS systems and all of the attached devices (100+).

Didn't realise what had happened until I started getting emails, phone calls and pages telling me that the servers were down, not to mention all the other people who got the alerts as well.

I learnt two things from that. First, that the monitoring service worked, and worked well.

Second, have different passwords for everything, so you don't log into something by mistake, and actually look at what you're doing.

Adam


Posted: Fri Sep 12, 2003 6:29 pm
Linode Staff
Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
Thanks Adam, other people's suffering always makes me feel better :-)

This will forever hold a special place in my memory, along with the time I sshed into a machine (which was not connected to a remote console) and ran "/etc/rc.d/network stop".

Guess what plug #8 points to on my RPC unit? Host6, of course! With everything that was going on, and that I only have a window of a few minutes when going through my console server, I must have lost total brain function.

I also almost got into an accident (on the bike, no less) when I went out for lunch an hour ago.

I'm just going to sit in the corner for a while.

-Chris


Posted: Fri Sep 12, 2003 6:35 pm
Senior Member
Joined: Mon Jun 23, 2003 1:25 pm
Posts: 260
I was lucky, the system was not yet live, although if I had done it 2 days later I would have been well.

It was put down as testing the response of the engineers to a complete loss of all power to the system and how they recovered.

Everyone has bad days, although I tend to have bad months, especially when doing web development.

Adam


Posted: Fri Sep 12, 2003 7:47 pm
Senior Member
Joined: Thu Aug 28, 2003 12:57 am
Posts: 273
I certainly don't want to add to your worries but I have to say that this episode has me a little concerned, on a few fronts.

First - how does a single Linode bring an entire system to a crawl? I thought that each Linode can only add a maximum of 1 to the load average of the system? So the first other Linode, or host server process, to start contending with the runaway Linode will still get roughly 50% of the CPU, right? I've sat at plenty of Linux systems with a single process pegged at 100% and not even noticed. How does a single Linode manage to hog so much CPU that the system as a whole (and presumably all Linodes on that system) is seriously adversely affected?

Second - what would happen if Chris really were to have an accident on the way to lunch? I certainly don't mean to be morbid by suggesting such a possibility (I ride a CBR600F4i myself, I wouldn't wish an accident on anyone) ... but, imagine the scenario where Chris drops his bike on the way to lunch, has to go to the hospital with a broken leg, and in the meantime host8 hangs due to whatever unsolved problems still remain. What happens to all of the Linodes that are on host8? It would make me feel so much better about my Linode if there were just 1 other person with the ability to manage the systems ...

I'm getting very close to paying for a yearly contract for my Linode because I've been very happy with it so far and would really like the extra disk space. But threads like these start giving me cold feet when I think about my server being unavailable due to the lack of redundancy in the personnel department at Linode.com ...


Posted: Fri Sep 12, 2003 8:12 pm
Senior Newbie
Joined: Wed Sep 03, 2003 2:58 pm
Posts: 19
Don't blame the full moon when there's a simple, scientific answer:
Mercury is retrograde (until sometime next week) :?


Posted: Sat Sep 13, 2003 10:43 pm
Senior Member
Joined: Thu Aug 28, 2003 12:57 am
Posts: 273
Any comments on the issues I have raised? Anyone?


Posted: Sat Sep 13, 2003 11:40 pm
Linode Staff
Joined: Tue Apr 15, 2003 6:24 pm
Posts: 3090
Website: http://www.linode.com/
Location: Galloway, NJ
bji wrote:
I certainly don't want to add to your worries but I have to say that this episode has me a little concerned, on a few fronts.

First - how does a single Linode bring an entire system to a crawl? I thought that each Linode can only add a maximum of 1 to the load average of the system? So the first other Linode, or host server process, to start contending with the runaway Linode will still get roughly 50% of the CPU, right? I've sat at plenty of Linux systems with a single process pegged at 100% and not even noticed. How does a single Linode manage to hog so much CPU that the system as a whole (and presumably all Linodes on that system) is seriously adversely affected?

That was not an ordinary situation (something ran amok). The UML process was disk I/O bound, not CPU bound. Just like CPU time, disk access is scheduled by the host kernel, and no one has come up with a magic scheduler that works for every type of workload. Situations like these (even though this was the first) are reported very quickly or monitored for, and can be dealt with.

bji wrote:
Second - what would happen if Chris really were to have an accident on the way to lunch? I certainly don't mean to be morbid by suggesting such a possibility (I ride a CBR600F4i myself, I wouldn't wish an accident on anyone) ... but, imagine the scenario where Chris drops his bike on the way to lunch, has to go to the hospital with a broken leg, and in the meantime host8 hangs due to whatever unsolved problems still remain. What happens to all of the Linodes that are on host8? It would make me feel so much better about my Linode if there were just 1 other person with the ability to manage the systems ...

I'm getting very close to paying for a yearly contract for my Linode because I've been very happy with it so far and would really like the extra disk space. But threads like these start giving me cold feet when I think about my server being unavailable due to the lack of redundancy in the personnel department at Linode.com ...

Linode.com is new, just a few months old (launched June 16th, 2003). I have worked extremely hard to develop and market it, and to provide great customer service. My vision for Linode is real, is proven, and is on track. Financially, Linode is stable and growing, but I don't have the option of making hires at this time. It will be a few months before I start looking. All I have to offer is the reputation of the level of service Linode.com provides, and my commitment to Linode. However, I realize you do have a choice, and I can appreciate your very valid concern.

I have every incentive to make sure the uptime and service are as good as possible. I will work on improving contingency plans to minimize the risk of "Bus vs. Chris" situations.

Thanks for your support,
-Chris


Posted: Sun Sep 21, 2003 10:09 am
Junior Member
Joined: Wed Jul 16, 2003 1:43 am
Posts: 30
Website: http://www.alution.com
Location: Australia
Well, I'm a lot happier that Chris didn't try to spin us some story like some other hosting companies that I have been with. He has not left us in the dark, and that should be applauded.

