Thanks Mark, I'll let everyone know!

mike
On Wed, Aug 10, 2011 at 10:34 AM, Mark A. Grondona <[email protected]> wrote:
> On Wed, 10 Aug 2011 10:19:56 -0700, Mike Schachter <[email protected]> wrote:
>> A note from the guy who set up DNS:
>>
>> "the dynamic dns updates about every 5 min, so a dead server will have
>> its dns entry removed"
>>
>> Also, it looks like nothing is being returned by gethostbyname:
>>
>> me@slurmctrlr:~$ perl -MSocket -e 'print (inet_ntoa(scalar gethostbyname("thenodename")),"\n")'
>> Bad arg length for Socket::inet_ntoa, length is 0, should be 4 at -e line 1.
>>
>> me@slurmctrlr:~$ perl -MSocket -e 'print (scalar gethostbyname("thenodename"), "\ntest line\n")'
>>
>> test line
>
> Node hostnames need to be resolvable whether they are up or down.
> I'm not sure dynamic DNS is going to work very well with SLURM either.
> I would put all your hosts into /etc/hosts, and don't let their IP
> addresses change.
>
> mark
>
>> On Wed, Aug 10, 2011 at 10:02 AM, Mark A. Grondona <[email protected]> wrote:
>> > On Wed, 10 Aug 2011 09:49:50 -0700, Mike Schachter <[email protected]> wrote:
>> >> me@slurmctrlr:~$ nslookup thenodename
>> >> Server:  10.0.0.1
>> >> Address: 10.0.0.1#53
>> >>
>> >> ** server can't find thenodename: NXDOMAIN
>> >
>> > Sorry, that wasn't quite sufficient, except now we know "thenodename"
>> > isn't in DNS. Does this perl script print the IP for thenodename:
>> >
>> > perl -MSocket -e 'print (inet_ntoa(scalar gethostbyname("thenodename")),"\n")'
>> >
>> > mark
>> >
>> >> On Wed, Aug 10, 2011 at 9:40 AM, Mark A. Grondona <[email protected]> wrote:
>> >> > On Wed, 10 Aug 2011 09:28:41 -0700, Mike Schachter <[email protected]> wrote:
>> >> >> No, once the node "thenodename" goes down, there is no more
>> >> >> host resolution; it can't be pinged or ssh'ed to.
>> >> >>
>> >> >> mike
>> >> >
>> >> > I meant: does
>> >> >
>> >> >     nslookup thenodename
>> >> >
>> >> > work?
>> >> >
>> >> > mark
>> >> >
>> >> >> On Wed, Aug 10, 2011 at 9:21 AM, Mark A. Grondona <[email protected]> wrote:
>> >> >> > On Wed, 10 Aug 2011 09:13:17 -0700, Mike Schachter <[email protected]> wrote:
>> >> >> >> I'm sure that slurm is built to handle failover like this! That's
>> >> >> >> why I'm so troubled by the behavior I'm seeing.
>> >> >> >>
>> >> >> >> The node configuration is relatively straightforward:
>> >> >> >>
>> >> >> >> NodeName=thenodename Procs=4 State=UNKNOWN
>> >> >> >> # .. other nodes defined the same way
>> >> >> >>
>> >> >> >> PartitionName=all Nodes=nodename1,nodename2,thenodename Priority=100
>> >> >> >> Shared=NO Default=YES
>> >> >> >
>> >> >> > Can you resolve all these hostnames on the controller node?
>> >> >> >
>> >> >> >> Could it be the State=UNKNOWN that is screwing things up? Are
>> >> >> >> there any other configuration options that could produce this
>> >> >> >> behavior? What seemed to happen was that a reconfigure command
>> >> >> >> was sent to the controller, right before the error messages I sent
>> >> >> >> previously:
>> >> >> >>
>> >> >> >> [2011-08-10T06:47:52] Reconfigure signal (SIGHUP) received
>> >> >> >>
>> >> >> >> On Wed, Aug 10, 2011 at 9:03 AM, <[email protected]> wrote:
>> >> >> >> > There are many clusters running slurm where nodes go down daily
>> >> >> >> > and you are the first person to report a problem. My best guess
>> >> >> >> > is that your slurm.conf file is bad. What does your node
>> >> >> >> > configuration line(s) look like in slurm.conf?
>> >> >> >> >
>> >> >> >> > Quoting Mike Schachter <[email protected]>:
>> >> >> >> >
>> >> >> >> >> So this morning a node went down in the middle of a bunch of
>> >> >> >> >> jobs running, the slurm controller tried to reconfigure, and
>> >> >> >> >> this was the only error message we got in the log file:
>> >> >> >> >>
>> >> >> >> >> slurmctld: error: Unable to resolve "thenodename": Unknown host
>> >> >> >> >> slurmctld: fatal: slurm_set_addr failure on thenodename
>> >> >> >> >>
>> >> >> >> >> The controller won't even restart if I set State=DOWN for the
>> >> >> >> >> node in /etc/slurm.conf. I have to manually remove the node
>> >> >> >> >> from the configuration file in order for the controller to
>> >> >> >> >> restart.
>> >> >> >> >>
>> >> >> >> >> This is a huge problem for us! We expected that nodes could go
>> >> >> >> >> down with graceful failover. Any idea what's going on?
>> >> >> >> >>
>> >> >> >> >> mike
>> >> >> >> >>
>> >> >> >> >> On Tue, Aug 2, 2011 at 4:53 PM, <[email protected]> wrote:
>> >> >> >> >>> Your log should say what is happening. If not, try logging on
>> >> >> >> >>> as root and starting the daemon by hand with lots of debugging
>> >> >> >> >>> (-v's): "slurmctld -Dvvvvv"
>> >> >> >> >>>
>> >> >> >> >>> Quoting Mike Schachter <[email protected]>:
>> >> >> >> >>>
>> >> >> >> >>>> Hi there,
>> >> >> >> >>>>
>> >> >> >> >>>> If we have a node down, and then restart the slurm controller,
>> >> >> >> >>>> for some reason slurm won't start up! Is there some way to
>> >> >> >> >>>> ameliorate this issue?
>> >> >> >> >>>>
>> >> >> >> >>>> mike
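[Editor's note] Mark's fix above is static entries in /etc/hosts, so the controller can resolve every node whether it is up or down. A minimal sketch of what that might look like for the hosts named in this thread; the IP addresses below are illustrative placeholders, not taken from the thread:

```
# /etc/hosts on the controller (and ideally on every node).
# Static entries keep slurmctld's hostname lookups working even
# when a node is dead and dynamic DNS has dropped its record.
# Addresses are hypothetical -- use your cluster's real ones.
10.0.0.11   nodename1
10.0.0.12   nodename2
10.0.0.13   thenodename
```

Per Mark's advice, the addresses must then stay fixed; if DHCP can reassign them, the static entries go stale.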
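[Editor's note] Mark's check ("Can you resolve all these hostnames on the controller node?") can be scripted for every node at once. A sketch, assuming node names appear in `NodeName=` lines of slurm.conf and that getent(1) is available (it consults /etc/hosts and DNS the same way gethostbyname does); the script name and function are my own, not from the thread:

```shell
# check_nodes CONF -- print OK/FAIL resolution status for every
# NodeName defined in the given slurm.conf (a sketch; real configs
# may also use DEFAULT lines or hostlist ranges, not handled here).
check_nodes() {
    conf="$1"
    grep -oE '^NodeName=[^[:space:]]+' "$conf" | cut -d= -f2 |
    while read -r node; do
        if getent hosts "$node" >/dev/null 2>&1; then
            echo "OK $node"
        else
            echo "FAIL $node"
        fi
    done
}

# Example: check_nodes /etc/slurm.conf
```

Running this on the controller before restarting slurmctld would have flagged "thenodename" as soon as its DNS record expired.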
