[xcpu] Re: [RFC] Xcpu node monitoring

Daniel Gruner Wed, 03 Dec 2008 13:25:25 -0800

Well, I am not sure what the schedulers require, especially from a
technical point of view.  However, I am worried about scaling: statfs
will not be useful with thousands of nodes.  Perhaps we need to move
to supermon or something like that...

Also, I've had some issues with statfs, where it doesn't always pick
up the status of newly rebooted nodes until statfs itself is killed
and restarted.  I think this happens mostly when updating to a newer
version of xcpu and rebooting the nodes into the latest xcpufs, but
I think I have also seen this behaviour when simply rebooting a bunch
of nodes.

Is it possibly related to the problem I reported some days ago about
xrx "not dying properly" when a command such as "xrx -l /sbin/reboot"
is issued?  In this case the remote xrx disappears, and the local one
segfaults (with the "keyboard guy", as Ron put it, still eating up a
character).

Daniel

On 12/3/08, Abhishek Kulkarni <[EMAIL PROTECTED]> wrote:
>
>  Hello,
>
>  statfs is currently serving us as a bare-minimum resource monitor, in
>  that it can display the status of the nodes, the number of jobs they
>  are running, etc. Most schedulers need to rely on some kind of
>  resource manager to decide how to schedule events.
>
>  What would be the best way to monitor the status of nodes and get
>  notified on a state change?
>
>  statfs periodically reads the 'state' file on all the nodes. It could
>  export a file, 'monitor', which a scheduler would poll. A read on
>  'monitor' would block until the status of any node changes, and would
>  then return the list of all nodes that are down at that instant.
>  Suggestions?
>
>  Thanks,
>
>   -- Abhishek
>
>
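
As a rough illustration of the 'monitor' idea quoted above, here is a
minimal Python sketch. It is not xcpu or statfs code. It assumes a
hypothetical read_state(node) callable standing in for reading a node's
'state' file, and the names NodeMonitor, poll_once and wait_for_change
are made up for this sketch. A poller records each node's state, and a
reader blocks until any node's state differs from what it last saw.

```python
import threading


class NodeMonitor:
    """Tracks node states and lets readers block until any state changes."""

    def __init__(self):
        self._cond = threading.Condition()
        self._states = {}   # node name -> last observed state string
        self._version = 0   # bumped whenever any node's state changes

    def poll_once(self, nodes, read_state):
        """Read every node's state and wake blocked readers on a change.

        read_state is a hypothetical callable (an assumption of this
        sketch): it takes a node name and returns a state string, or
        raises OSError if the node cannot be reached.
        """
        observed = {}
        for node in nodes:
            try:
                observed[node] = read_state(node)
            except OSError:
                # Treat an unreachable node as down.
                observed[node] = "down"
        with self._cond:
            if observed != self._states:
                self._states = observed
                self._version += 1
                self._cond.notify_all()

    def wait_for_change(self, seen_version, timeout=None):
        """Block until the version differs from seen_version, or timeout.

        Returns (current_version, sorted list of nodes currently down).
        """
        with self._cond:
            self._cond.wait_for(
                lambda: self._version != seen_version, timeout
            )
            down = sorted(
                n for n, s in self._states.items() if s == "down"
            )
            return self._version, down
```

A scheduler would call wait_for_change in a loop, passing back the
version it last saw, while another thread calls poll_once periodically.
Note that poll_once reads the nodes one at a time, which is where the
scaling concern with thousands of nodes would likely show up.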
