Jamie Wilkinson wrote:
> This one time, at band camp, Craig Dibble wrote:
>> ...right up until I deployed our new LDAP servers to production. Now I
>> find that I get intermittent failures from the keepalive script
> Immediately I am thinking that the problem is somewhere in NSS. Timeouts
> due to LDAP connection overheads, fd leaks in nss_ldap, nscd's very
> existence, all could be causing something to fail.

This is my thinking too, but I'm at a loss as to how I might debug this.

> Unlike Solaris, POSIX and Linux don't cater to temporary failure, so
> anything that explodes in the pipeline is going to return a failed lookup
> (and if you're using nscd, it'll cache that negative if you're really
> unlucky.)

Again, that's what I thought, but I still get it even with the caches
cleared, disabled, or nscd stopped.

>> [1] As a temporary fix I have put a simple hook in the keepalive script
>> to die if the returned process list is empty.
> Is there a timeout on the process list command in the keepalive script?

No, that would be my next step in tidying up the script. I'll probably do
that today just so I can see if it is in fact timing out or if there is
some other issue causing the failures. To be honest, I'd rather bin the
script and start again, but that won't help me understand why this is
happening.

> Do you get an empty process list when you run it by hand?
> The first thing to try is to replicate the conditions in the script to get a
> repeatable failure of ps. Once you've done that, you'll have some idea as
> to where to look next.

Like I said, it's intermittent, so very hard to replicate. I haven't yet
managed to figure out exactly what else may be occurring at the exact
instant it fails, or indeed how or why it fails.

As far as I have been able to ascertain, simply running the ps command by
hand does not seem to fail. Interestingly, though, as part of the quick
fix I put in yesterday I added a backticked date command to the 'die'
expression to print a timestamp to the log file (to compare against any
potential further false positives). About 1/3 of the failures in the log
overnight have no timestamp on them. The ps command itself is also
contained in backticks.

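
For the timeout and per-stage debugging discussed above, here is a minimal sketch of the idea in Python. It is not the keepalive script itself, which isn't shown in this thread. The helper name `run_logged`, the 10 second default and the `ps` arguments are all assumptions for illustration. It takes the timestamp inside the calling process rather than from a second backticked command, and it separates "timed out", "could not start the command", "non-zero exit" and "ran but printed nothing", which an empty backtick result lumps together.

```python
import subprocess
import time


def run_logged(cmd, timeout_s=10.0):
    """Run cmd with a timeout and report how it ended (hypothetical helper)."""
    # Timestamp is taken in-process, so it does not depend on spawning `date`.
    result = {
        "cmd": cmd,
        "started": time.strftime("%Y-%m-%d %H:%M:%S"),
        "status": None,
        "returncode": None,
        "stdout": "",
        "stderr": "",
    }
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The child was killed because it ran longer than timeout_s.
        result["status"] = "timeout"
        return result
    except OSError as exc:
        # The command could not be started at all (missing binary, fork
        # failure, out of file descriptors, and so on).
        result["status"] = "spawn-failed"
        result["stderr"] = str(exc)
        return result

    result["returncode"] = proc.returncode
    result["stdout"] = proc.stdout
    result["stderr"] = proc.stderr
    if proc.returncode != 0:
        result["status"] = "nonzero-exit"
    elif not proc.stdout.strip():
        result["status"] = "empty-output"
    else:
        result["status"] = "ok"
    return result


if __name__ == "__main__":
    # Assumed process-list command; substitute whatever the script really runs.
    outcome = run_logged(["ps", "-e", "-o", "user,pid,args"])
    print(outcome["started"], outcome["status"], outcome["returncode"],
          outcome["stderr"].strip())
```

Logging the status, return code and stderr for every run, not only the failures, would give something to compare the bad runs against.
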
I'm not sure what the missing timestamps are telling me, but I think it's
time to add some debugging to the script and see what is going on at each
stage.

Craig

--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
