Re: Monitoring for a hung NFS mount?

2008-03-28 Thread Michael Alan Dorman
On Fri, 28 Mar 2008 11:00:54 -0700 (PDT)
Andrew Ryan <[EMAIL PROTECTED]> wrote:

> You'd probably need 2 processes; one to drive and another process to
> go off and stat the mount point. The driver would invoke the
> stat'ers, and if the stat doesn't come back in some seconds, declare
> the mount hung. Because if the mount really is hung, the stat process
> is going to hang forever too, so you don't want your driver process
> to get hung too.

Why not implement this as a heartbeat trap? You have a process that
just stats the mount point at some regular interval and sends a trap to
mon after each stat. If the process freezes because the mount freezes,
it won't send the trap, and mon will alert.
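
A rough Python sketch of such a heartbeat process follows, for illustration only. The `send` callback is a hypothetical stand-in for whatever actually delivers the trap to mon; the thread does not show that part, so it is left to the caller. The `beats` parameter is likewise an assumption, added so the loop can be bounded when trying it out.

```python
import os
import time

def heartbeat(mount_point, send, interval=60.0, beats=None):
    """Stat `mount_point` repeatedly and call `send` after each stat that returns.

    `send` is a hypothetical callback (e.g. something that sends a trap to mon).
    `beats=None` means run forever; an integer bounds the number of heartbeats.
    Returns the number of heartbeats sent.
    """
    sent = 0
    while beats is None or sent < beats:
        os.stat(mount_point)   # blocks here if the mount is hung
        send(mount_point)      # only reached once the stat has come back
        sent += 1
        if beats is None or sent < beats:
            time.sleep(interval)
    return sent
```

If the stat never returns, `send` is never called again, and the missing heartbeat is the signal. The process itself stays blocked until the mount recovers.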

Seems a lot cleaner, IMHO.

Mike.

___
mon mailing list
mon@linux.kernel.org
http://linux.kernel.org/mailman/listinfo/mon


Re: Monitoring for a hung NFS mount?

2008-03-28 Thread Andrew Ryan
Yeah, a plain fork/alarm isn't going to help you here because it's going
to block waiting on the IO from the hung mount.

A low-tech way around this would be to exec a separate stat process
instead of forking, and have it write or touch a marker file somewhere
after it finishes the stat. The only way it won't finish the stat is if
the mount hangs.

Then sleep for the requisite number of seconds, wake up, and check the
mtime on the marker files. There are other ways too, but you get the idea.
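
The marker-file idea could be sketched in Python roughly as follows. All names here (`HELPER`, `launch_stat`, `marker_is_fresh`, `check_mounts`, the marker file naming) are made up for illustration. Note one difference from the description above: a stat that fails outright (for example, a missing path) also leaves the marker untouched, so this sketch reports "did not respond" rather than strictly "hung".

```python
import os
import subprocess
import sys
import time

# Helper run in its own process: stat the mount point, then touch the marker.
HELPER = (
    "import os, pathlib, sys; "
    "os.stat(sys.argv[1]); pathlib.Path(sys.argv[2]).touch()"
)

def launch_stat(mount_point, marker):
    """Start a helper that touches `marker` only after its stat returns."""
    return subprocess.Popen(
        [sys.executable, "-c", HELPER, mount_point, marker],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )

def marker_is_fresh(marker, started_at):
    """True if `marker` exists and was touched since `started_at` (epoch seconds).

    Compared at whole-second resolution to tolerate coarse file timestamps.
    """
    try:
        return int(os.stat(marker).st_mtime) >= int(started_at)
    except FileNotFoundError:
        return False

def check_mounts(mounts, marker_dir, wait_seconds=5.0):
    """Return the mount points whose helper did not touch its marker in time."""
    started_at = time.time()
    markers = {}
    for i, mount in enumerate(mounts):
        markers[mount] = os.path.join(marker_dir, f"mount-{i}.marker")
        launch_stat(mount, markers[mount])
    time.sleep(wait_seconds)  # the driver never touches the mounts itself
    return [m for m, mk in markers.items() if not marker_is_fresh(mk, started_at)]
```

The marker directory has to live on a local filesystem, not on one of the mounts being checked. Helpers stuck on a hung mount are left behind and will pile up if the check runs repeatedly.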

(greets from bolt.sonic.net, which I see has 71 NFS mounts. I can see why
you have this need :) )

On Fri, 28 Mar 2008, Augie Schwer wrote:

> On Fri, Mar 28, 2008 at 11:00 AM, Andrew Ryan <[EMAIL PROTECTED]> wrote:
> > You'd probably need 2 processes; one to drive and another process to go
> >  off and stat the mount point. The driver would invoke the stat'ers, and
> >  if the stat doesn't come back in some seconds, declare the mount hung.
> >  Because if the mount really is hung, the stat process is going to hang
> >  forever too, so you don't want your driver process to get hung too.
>
> This is the path I went down, but the problem I ran into was that the
> forked child inherits process info, file descriptors, etc. from the
> parent. When I run my monitor remotely (via ssh or snmp), the session
> hangs because ssh or snmp waits for the child to release the
> inherited resources.
>
> Attached is where I got stuck; you can see I try and divorce the
> parent from the child as much as possible by closing all file handles,
> but when I run it via snmp (exec) it still hangs when I walk the tree;
> hanging on the STDOUT OID bit. I do get the correct return code
> though, so my next step is to try and just grab the return value OID
> and see if I can alert on that.
>
> Of course, I still leave hung procs around, which is not really desirable.



Re: Monitoring for a hung NFS mount?

2008-03-28 Thread Augie Schwer
On Fri, Mar 28, 2008 at 11:00 AM, Andrew Ryan <[EMAIL PROTECTED]> wrote:
> You'd probably need 2 processes; one to drive and another process to go
>  off and stat the mount point. The driver would invoke the stat'ers, and
>  if the stat doesn't come back in some seconds, declare the mount hung.
>  Because if the mount really is hung, the stat process is going to hang
>  forever too, so you don't want your driver process to get hung too.

This is the path I went down, but the problem I ran into was that the
forked child inherits process info, file descriptors, etc. from the
parent. When I run my monitor remotely (via ssh or snmp), the session
hangs because ssh or snmp waits for the child to release the
inherited resources.

Attached is where I got stuck; you can see I try and divorce the
parent from the child as much as possible by closing all file handles,
but when I run it via snmp (exec) it still hangs when I walk the tree;
hanging on the STDOUT OID bit. I do get the correct return code
though, so my next step is to try and just grab the return value OID
and see if I can alert on that.

Of course, I still leave hung procs around, which is not really desirable.
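
For comparison, here is a minimal Python sketch of one way to start the child so that it shares as little as possible with the parent. It is not the attached Perl script, and it has not been tested under snmpd's exec, so it may or may not cure the hang described above. `spawn_detached` is a made-up name, and the sketch assumes a POSIX system.

```python
import subprocess

def spawn_detached(argv):
    """Start `argv` without sharing the parent's stdio, extra fds, or session."""
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,    # child gets /dev/null, not the parent's stdin
        stdout=subprocess.DEVNULL,   # nothing left open that ssh/snmp could wait on
        stderr=subprocess.DEVNULL,
        close_fds=True,              # do not leak any other inherited descriptors
        start_new_session=True,      # setsid(): child leaves the parent's session
    )
```

Because the child's stdout goes to /dev/null, the parent has to learn the result some other way, such as the exit code or a marker file. A child blocked on a hung mount is still left behind.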


-- 
Augie Schwer - [EMAIL PROTECTED] - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072


Attachment: nfs_monitor.pl (Perl program)


Re: Monitoring for a hung NFS mount?

2008-03-28 Thread Andrew Ryan
You'd probably need two processes: one to drive, and another to go off
and stat the mount point. The driver would invoke the stat'ers, and if a
stat doesn't come back within some number of seconds, declare the mount
hung. If the mount really is hung, the stat process is going to hang
forever too, so you don't want your driver process to get hung along
with it.
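
As a sketch only, the driver/stat'er split might look like this in Python. The name `check_mount`, the three result strings, and the 10-second default are assumptions made for the example. The stat'er here is simply another Python interpreter running a one-line stat.

```python
import subprocess
import sys

# Stat'er command: stat the path in argv[1]; exits non-zero if the stat raises.
STAT_SNIPPET = "import os, sys; os.stat(sys.argv[1])"

def check_mount(path, timeout=10.0):
    """Return "ok", "failed", or "hung" for `path`.

    The stat runs in a separate process, so a hang blocks the child
    and not the driver.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", STAT_SNIPPET, path],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        exit_code = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked on a hung mount may not die
        # until the mount recovers, so the driver does not wait for it again.
        child.kill()
        return "hung"
    return "ok" if exit_code == 0 else "failed"
```

The driver can call `check_mount` once per mount point and report which ones came back as "hung", so the alert can name the mount.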

--andrew

On Thu, 27 Mar 2008, Augie Schwer wrote:

> Anyone have a good way to monitor for a hung NFS mount on a remote machine?
>
> I've been at it all day trying to come up with a clever way to check
> for the hung mount, not let the monitor get hung, and return some useful
> information, like which mount is hung. But I've come to a dead end, and
> I think the best that can be done is to let the monitor time out and
> then sound an alarm based on that timeout.
>
> Anyone else have ideas?



Re: Monitoring for a hung NFS mount?

2008-03-28 Thread Augie Schwer
On Fri, Mar 28, 2008 at 8:17 AM, Jeff Price <[EMAIL PROTECTED]> wrote:
> Can you cat a file on the mounted directory and maybe do a checksum on
>  it?  If you can't open the file, consider it hung.  I am not sure of a
>  direct way to get at the NFS; maybe the NFS control port is unavailable
>  when it hangs, but I think since those get spawned on demand that might
>  not work.

The problem is that when the mount is hung, any process that tries to
read from it hangs as well. That could be your indicator of a failure
scenario, but then you have hung procs stacking up, and you can't
communicate back to your monitor agent which mount is hung.


-- 
Augie Schwer - [EMAIL PROTECTED] - http://schwer.us
Key fingerprint = 9815 AE19 AFD1 1FE7 5DEE 2AC3 CB99 2784 27B0 C072
