Sorry I didn't see this before replying to the original.

"Loong, Andreas" <[email protected]> writes:

> After some investigation, I've narrowed the problem down somewhat. But, first
> some system information:
>
> CentOS 5.8, kernel 2.6.18-308.el5, 64-bit Intel system. We're using dnsmasq
> to cache dns-queries and nscd.

Same here.

> We have an internal bind which xCAT manages.

Why dnsmasq and bind (not that it should be relevant to the crash)?

> Once the qmaster starts up, I see this in the messages-file:
> 10/31/2012 09:48:11| main|srvname|W|local configuration srvname not defined
> - using global configuration

I'm confused. I thought it crashed immediately, or is the above with
the NSS change? Otherwise, what are the last messages before the crash
(preferably with log_level info)?

> Has anyone seen anything like this before? Even if the DNS has bogus
> and faulty information (although I can't see that it has), the qmaster
> shouldn't segfault, should it?

Yes, a segfault is normally a clear program bug. An exception is when
stack allocation fails.

> As soon as I change back from pure files to "files dns" it takes 2-3
> minutes and the qmaster segfaults again.

Do you mean qmaster runs for that long, or the init script waits that
long for it?

What do you get, with and without dns in NSS and after flushing the nscd
cache, from

  utilbin/lx-amd64/gethostbyname -all srvname

> It might be worth noting that this host is an SGE 6.2u5 qmaster usually, with
> the original configuration of the resolver, it works without problems.

Are the execds always the same version as the qmaster (although I'd
expect something different if not)?

> Any ideas of what could lead to this rather strange behaviour, or how
> to dig up more information to further narrow it down?

Ideally the gdb output from my previous reply.

-- 
Community Grid Engine: http://arc.liv.ac.uk/SGE/
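
P.S. If it helps with the NSS comparison, here is a minimal sketch of an
independent cross-check using only the Python standard library. It is not
part of Grid Engine and is no substitute for the gethostbyname utility;
`resolve_report` is a hypothetical helper name chosen for illustration.
It does a forward lookup through the system resolver (so it follows
whatever nsswitch.conf says) and then a reverse lookup of each address
returned.

```python
import socket


def resolve_report(hostname):
    """Forward-resolve hostname, then reverse-resolve each address found.

    Hypothetical helper for illustration only; it goes through the
    system resolver, so results depend on the hosts line in nsswitch.conf.
    """
    report = {
        "hostname": hostname,
        "canonical": None,
        "aliases": [],
        "addresses": [],
        "reverse": {},
        "error": None,
    }
    try:
        canonical, aliases, addresses = socket.gethostbyname_ex(hostname)
    except OSError as exc:  # socket.gaierror and socket.herror are OSErrors
        report["error"] = str(exc)
        return report

    report["canonical"] = canonical
    report["aliases"] = aliases
    report["addresses"] = addresses

    # Check that each address maps back to some name; record failures
    # instead of raising, so the two configurations can be compared.
    for addr in addresses:
        try:
            report["reverse"][addr] = socket.gethostbyaddr(addr)[0]
        except OSError as exc:
            report["reverse"][addr] = "lookup failed: %s" % exc
    return report
```

Calling `resolve_report("srvname")` once with "files" and once with
"files dns" on the hosts line, invalidating the nscd hosts cache in
between (for instance with `nscd -i hosts`), and comparing the two
results should show whether the resolver returns something different
in the configuration that crashes.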
