"Loong, Andreas" <[email protected]> writes:

>> [It shouldn't have needed a new install -- you can do a live
>> upgrade.]
>
> Noted, we opted for this to have a side-by-side installation for testing
> purposes. Now I'm kind of glad we did it like that.

Fair enough.  I didn't realize you had them in parallel, but it's
possible to duplicate the original spool in that case.

>> > We have an internal bind which xCAT manages.
>> 
>> Why dnsmasq and bind (not that it should be relevant to the crash)?
>
> For its ability to cache results for long periods of time in the event
> of a failure, mainly. This keeps the cluster in normal operation for the
> most part, even after there's a problem with the DNS server.

Yes, but I don't understand why you wouldn't use dnsmasq also to do the
job that bind is doing.  Presumably the host database is also cached by
nscd, although I've turned that off for reasons that may no longer be
relevant involving nscd bugs.

> My fault for not being more clear. It starts up just fine and everything
> I'd expect works just as our old qmaster did. After approx 2 minutes it
> segfaults without anything new or even odd added to the messages file.

It sounds most likely to be due to a load report arriving, but I've no
idea what might be wrong.  It's rather odd with a clean install on Red
Hat.  Is the installation from the RPMs I made, or a separate build?

> If I change NSS to files, it stops segfaulting altogether and we can
> almost make use of it.
>
> I've included the log you requested at the end of the mail.

As you say, nothing very unusual there.

>> > As soon as I change back from pure files to "files dns" it takes
>> > 2-3 minutes and the qmaster segfaults again.
>> 
>> Do you mean qmaster runs for that long, or the init script waits that
>> long for it?  What do you get with and without dns in NSS and flushing
>> the nscd cache from
>> 
>>   utilbin/lx-amd64/gethostbyname -all srvname
>
> I tried as many options as I could think of to get differing results, but
> the output never changed.

There must be something different between them if it affects the
qmaster, but I don't have any good ideas about what to try.  There have
been changes made for problems with NSS-style lookup, but I'd be
surprised if any of that has caused trouble, and I don't remember any
being specifically host-related.
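
One way to look for a difference outside Grid Engine is to record what
the system resolver returns under each nsswitch.conf setting and compare
the two records.  The following is only a rough sketch, not anything
that Grid Engine itself does: it assumes Python 3 on the qmaster host,
the function names resolve and differences are made up for
illustration, and it goes through the libc resolver rather than the
gethostbyname utility above, so it may not show the same thing.

```python
import socket


def resolve(name):
    """Return what the system resolver (via NSS) reports for `name`."""
    try:
        canonical, aliases, addresses = socket.gethostbyname_ex(name)
    except OSError as err:  # gaierror and herror are OSError subclasses
        return {"name": name, "error": str(err)}

    # Reverse-resolve each address to see whether it maps back consistently.
    reverse = {}
    for addr in addresses:
        try:
            reverse[addr] = socket.gethostbyaddr(addr)[0]
        except OSError as err:
            reverse[addr] = "lookup failed: %s" % err

    return {
        "name": name,
        "canonical": canonical,
        "aliases": sorted(aliases),
        "addresses": sorted(addresses),
        "reverse": reverse,
    }


def differences(before, after):
    """Return {key: (before_value, after_value)} for keys that differ."""
    keys = sorted(set(before) | set(after))
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }
```

Call resolve("srvname") in a fresh interpreter once with "files" and
once with "files dns" (flushing the nscd cache in between), save both
results, and pass them to differences.  An empty result means the libc
view of the host is the same either way; a changed canonical name or
reverse mapping would be worth reporting.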

> Next, I'll try the GDB approach.

If you can send me a backtrace from that, I hope it will indicate what's
wrong.
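
If it is easier to work from a core file than from a live gdb session,
something along these lines should collect a backtrace of all threads
in one go.  This is a sketch under assumptions: gdb is installed, core
dumps are enabled for the qmaster, and the binary and core paths in the
comment are placeholders, not the real locations on your system.

```python
import subprocess


def core_backtrace_command(binary, core, gdb="gdb"):
    """Build a gdb command line that prints every thread's backtrace."""
    return [gdb, "-batch", "-ex", "thread apply all bt full", binary, core]


def core_backtrace(binary, core):
    """Run gdb on the core file and return its combined output as text."""
    result = subprocess.run(
        core_backtrace_command(binary, core),
        capture_output=True,
        text=True,
    )
    return result.stdout + result.stderr


# Hypothetical usage; substitute the real paths:
# print(core_backtrace("/path/to/sge_qmaster", "/path/to/core"))
```

The backtrace is much more useful if the debugging symbols for the
binary are available when gdb runs.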

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
