Jamie Wilkinson wrote:
> This one time, at band camp, Craig Dibble wrote:
>> ...right up until I deployed our new LDAP servers to production. Now I
>> find that I get intermittent failures from the keepalive script 

> Immediately I am thinking that the problem is somewhere in NSS.  Timeouts
> due to LDAP connection overheads, fd leaks in nss_ldap, nscd's very
> existence, all could be causing something to fail.

This is my thinking too, but I'm at a loss as to how I might debug this.
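
One possible starting point, offered as a rough sketch rather than
anything tested against these servers: hammer the lookup path directly
and record how often it fails and how slow it gets. Python's
pwd.getpwnam() calls the C library's getpwnam(), so it goes through
whatever nsswitch.conf selects (nss_ldap here, presumably). The function
names, attempt count and user name below are made up for illustration.

```python
import pwd
import time


def probe_lookup(name, attempts=100, delay=0.0):
    """Resolve `name` repeatedly via getpwnam() and record each outcome.

    Returns a list of (ok, elapsed_seconds) tuples, one per attempt.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            pwd.getpwnam(name)
            ok = True
        except KeyError:
            # getpwnam() raises KeyError when the lookup returns no entry,
            # which is also what a failed backend looks like to the caller.
            ok = False
        results.append((ok, time.monotonic() - start))
        if delay:
            time.sleep(delay)
    return results


def summarise(results):
    """Return (number_of_failures, slowest_lookup_in_seconds)."""
    failures = sum(1 for ok, _ in results if not ok)
    slowest = max((elapsed for _, elapsed in results), default=0.0)
    return failures, slowest


if __name__ == "__main__":
    # "someuser" is a placeholder; use an account that only exists in LDAP.
    print(summarise(probe_lookup("someuser", attempts=1000)))
```

Running this for a user known to exist only in LDAP, with nscd stopped,
should show whether lookups fail outright or merely get slow.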

> Unlike Solaris, POSIX and Linux don't cater to temporary failure, so
> anything that explodes in the pipeline is going to return a failed lookup
> (and if you're using nscd, it'll cache that negative if you're really
> unlucky.)

Again, that's what I thought, but I still get it even with the caches
cleared, disabled, or nscd stopped.

>> [1] As a temporary fix I have put a simple hook in the keepalive script
>> to die if the returned process list is empty. 

> Is there a timeout on the process list command in the keepalive script?

No, that would be my next step in tidying up the script. I'll probably
do that today just so I can see if it is in fact timing out or if there
is some other issue causing the failures. To be honest, I'd rather bin
the script and start again but that won't help me understand why this is
happening.
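
A minimal sketch of what such a timeout might look like, assuming the
check can be expressed as "run a command, capture its output". The
function name and the status strings are invented for illustration. The
point is only to tell a timeout, a failure to start the command, a
non-zero exit and an empty but successful run apart, since all of them
look like an empty process list to the caller.

```python
import subprocess


def run_with_timeout(cmd, timeout=10):
    """Run `cmd` and return (status, stdout, stderr).

    status is "ok", "timeout", "exit N", or "spawn failed: <reason>".
    """
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run() before this is raised.
        return "timeout", "", ""
    except OSError as exc:
        # Covers fork/exec problems such as a missing binary or EAGAIN.
        return f"spawn failed: {exc}", "", ""
    status = "ok" if proc.returncode == 0 else f"exit {proc.returncode}"
    return status, proc.stdout, proc.stderr


if __name__ == "__main__":
    # The ps arguments are an assumption; substitute the script's own.
    status, out, err = run_with_timeout(["ps", "-ef"], timeout=10)
    print(status, len(out.splitlines()), err.strip())
```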

> Do you get an empty process list when you run it by hand?

> The first thing to try is to replicate the conditions in the script to get a
> repeatable failure of ps.  Once you've done that, you'll have some idea as
> to where to look next.

Like I said, it's intermittent so very hard to replicate. I haven't yet
managed to figure out exactly what else may be occurring at the exact
instant it fails, or indeed how/why it fails.

As far as I have been able to ascertain, running the ps command by hand
does not fail. Interestingly, though, as part of yesterday's quick fix I
put a backticked date command in the 'die' expression to print a
timestamp to the log file (to compare against any further false
positives), and about 1/3 of the failures logged overnight have no
timestamp on them. The ps command itself is also run in backticks. I'm
not sure what that's telling me, but I think it's time to add some
debugging to the script and see what is going on at each stage.
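
As a guess only: a missing timestamp would be consistent with the
backticked date producing no output either, which would point at
spawning or reading from child processes in general rather than at ps
specifically. One way to take that variable out of the logging is to
build the timestamp in-process and log the exit status and stderr of
each stage. The sketch below assumes nothing about the real script; the
function names and log format are hypothetical.

```python
import time


def log_line(message, now=None):
    """Prefix `message` with a UTC timestamp built without a subprocess."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    return f"{stamp} {message}"


def describe_stage(stage, returncode, stdout, stderr, now=None):
    """Format one stage's outcome so an empty result is never ambiguous."""
    detail = (
        f"{stage}: rc={returncode} "
        f"stdout_len={len(stdout)} stderr={stderr.strip()!r}"
    )
    return log_line(detail, now)


if __name__ == "__main__":
    # Example values only; a real run would pass the command's results.
    print(describe_stage("ps", 1, "", "some error text\n"))
```

Because the timestamp no longer depends on a child process, every log
line gets one, and the rc and stderr fields show which stage went wrong.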

Craig
-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
