Friday, August 05, 2005

Hmm... that's not right

So at work there's a computer cluster that can be used by people who have bunches of time- or resource-consuming programs to run. The idea is that you submit the job to one node and it decides which of the processors to set it running on. Usually, when one submits a job it includes a time limit... if the job isn't finished by the end of its allotted time it's automatically killed to make room for other people's stuff. Last night I submitted about 20 jobs with 20 hours each. For some reason two of them didn't finish in the 20 hours I gave them and they should have been deleted. Strangely, they weren't. I noticed when I checked the queue shortly before lunch that they seemed to be running on negative time. The cluster had sent me a few (dozen) system mails saying that the job had run out of time and been deleted but that it would be tried again later. This was definitely wrong, but I assumed it was some freak occurrence that would sort itself out after a while. Three hours and 486 system mails later I decided it was probably not going to work itself out. I told my mentor/boss about it and he eventually figured out how to fix it. Excellent.


Then, I went to submit a few more jobs and noticed a considerable delay between when I typed something and when it showed up on the screen. Apparently, someone else had started a job on the cluster without submitting it the right way, so it was running on the node that's supposed to be used to submit jobs, and it was using 100% of the memory and all the swap space. Geez. If it weren't Friday...
