Well, I am not sure what the schedulers require, especially not from the technical point of view. However, I am worried about scaling: statfs will not be useful with thousands of nodes. Perhaps we need to go to supermon or something like that...
Also, I've had some issues with statfs, where it doesn't always get the status of newly rebooted nodes until statfs itself is killed and restarted. I think this happens mostly when updating to newer versions of xcpu, and rebooting the nodes into the latest xcpufs, but I think I have seen this behaviour when simply rebooting a bunch of nodes. Is it possibly related to the problem I reported some days ago about xrx "not dying properly" when a command such as "xrx -l /sbin/reboot" is issued? In this case the remote xrx disappears, and the local one segfaults (with the "keyboard guy", as Ron put it, still eating up a character).

Daniel

On 12/3/08, Abhishek Kulkarni <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> statfs is currently serving us as a bare minimum resource monitor in
> that it can display the status of the nodes, the number of jobs they are
> running etc. Most schedulers need to rely on some kind of a resource
> manager to decide the scheduling of events.
>
> What would be the best way to monitor the status of nodes and get
> notified on a state change?
>
> statfs periodically reads from the 'state' file on all the nodes. It can
> export a file 'monitor' which a scheduler would poll. 'monitor' would
> block on read and on a status change (for any node) monitor would return
> all the nodes down at that instant. Suggestions?
>
> Thanks,
>
> -- Abhishek
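
To make the proposed 'monitor' semantics concrete, here is a rough Python sketch of a read that blocks until any node changes state and then returns the nodes that are down at that instant. It is only an illustration under stated assumptions: the Monitor class, its update/read/version methods, the version counter, and the "up"/"down" state strings are all hypothetical names, not part of statfs or xcpu.

```python
import threading


class Monitor:
    """Hypothetical in-memory stand-in for the proposed 'monitor' file."""

    def __init__(self):
        self._cond = threading.Condition()
        self._states = {}   # node name -> last state string seen (assumed "up"/"down")
        self._version = 0   # bumped every time any node changes state

    def update(self, node, state):
        """Record a freshly polled state; wake readers only if it changed."""
        with self._cond:
            if self._states.get(node) != state:
                self._states[node] = state
                self._version += 1
                self._cond.notify_all()

    def version(self):
        """Return the current change counter."""
        with self._cond:
            return self._version

    def read(self, seen_version, timeout=None):
        """Block until the counter differs from seen_version (or timeout),
        then return (current_version, sorted names of nodes that are down)."""
        with self._cond:
            self._cond.wait_for(lambda: self._version != seen_version, timeout)
            down = sorted(n for n, s in self._states.items() if s == "down")
            return self._version, down
```

A scheduler would call version() once, then loop on read(last_version), keeping the returned version for the next call. The counter is a design choice in this sketch, not something from the proposal: it lets a reader that was busy between two reads still notice that something changed in the meantime.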
