On Wed, Aug 10, 2011 at 10:34 AM, Mark A. Grondona <[email protected]> wrote:

On Wed, 10 Aug 2011 10:19:56 -0700, Mike Schachter <[email protected]> wrote:

A note from the guy who set up DNS: "the dynamic dns updates about
every 5 min, so a dead server will have its dns entry removed"

Also, looks like nothing is being returned by gethostbyname:

me@slurmctrlr:~$ perl -MSocket -e 'print (inet_ntoa(scalar gethostbyname("thenodename")),"\n")'
Bad arg length for Socket::inet_ntoa, length is 0, should be 4 at -e line 1.
me@slurmctrlr:~$ perl -MSocket -e 'print (scalar gethostbyname("thenodename"), "\ntest line\n")'
test line

Node hostnames need to be resolvable whether they are up or down. I'm
not sure dynamic dns is going to work very well with SLURM either. I
would put all your hosts into /etc/hosts, and don't let their IP
addresses change.

mark

On Wed, Aug 10, 2011 at 10:02 AM, Mark A. Grondona <[email protected]> wrote:
> On Wed, 10 Aug 2011 09:49:50 -0700, Mike Schachter <[email protected]> wrote:
>> me@slurmctrlr:~$ nslookup thenodename
>> Server:  10.0.0.1
>> Address: 10.0.0.1#53
>>
>> ** server can't find thenodename: NXDOMAIN
>
> Sorry, that wasn't quite sufficient, except now we know "thenodename"
> isn't in DNS. Does this perl script print the IP for thenodename:
>
> perl -MSocket -e 'print (inet_ntoa(scalar gethostbyname("thenodename")),"\n")'
>
> mark
>
>> On Wed, Aug 10, 2011 at 9:40 AM, Mark A. Grondona <[email protected]> wrote:
>> > On Wed, 10 Aug 2011 09:28:41 -0700, Mike Schachter <[email protected]> wrote:
>> >> No, once the node "thenodename" goes down, there is no more
>> >> host resolution, it can't be pinged or ssh'ed to.
>> >>
>> >> mike
>> >
>> > I meant does
>> >
>> >   nslookup thenodename
>> >
>> > work?
>> >
>> > mark
>> >
>> >> On Wed, Aug 10, 2011 at 9:21 AM, Mark A. Grondona <[email protected]> wrote:
>> >> > On Wed, 10 Aug 2011 09:13:17 -0700, Mike Schachter <[email protected]> wrote:
>> >> >> I'm sure that slurm is built to handle failover like this! That's
>> >> >> why I'm so troubled by the behavior I'm seeing.
>> >> >>
>> >> >> The node configuration is relatively straightforward:
>> >> >>
>> >> >> NodeName=thenodename Procs=4 State=UNKNOWN
>> >> >> # .. other nodes defined the same way
>> >> >>
>> >> >> PartitionName=all Nodes=nodename1,nodename2,thenodename Priority=100 Shared=NO Default=YES
>> >> >
>> >> > Can you resolve all these hostnames on the controller node?
>> >> >
>> >> >> Could it be the State=UNKNOWN that is screwing things up? Are
>> >> >> there any other configuration options that could produce this
>> >> >> behavior? What seemed to happen was that a reconfigure command
>> >> >> was sent to the controller, right before the error messages I sent
>> >> >> previously:
>> >> >>
>> >> >> [2011-08-10T06:47:52] Reconfigure signal (SIGHUP) received
>> >> >>
>> >> >> On Wed, Aug 10, 2011 at 9:03 AM, <[email protected]> wrote:
>> >> >> > There are many clusters running slurm where nodes go down daily and you are
>> >> >> > the first person to report a problem. My best guess is that your slurm.conf
>> >> >> > file is bad. What does your node configuration line(s) look like in
>> >> >> > slurm.conf?
>> >> >> >
>> >> >> > Quoting Mike Schachter <[email protected]>:
>> >> >> >
>> >> >> >> So this morning a node went down in the middle of a bunch of
>> >> >> >> jobs running, the slurm controller tried to reconfigure, and this
>> >> >> >> was the only error message we got in the log file:
>> >> >> >>
>> >> >> >> slurmctld: error: Unable to resolve "thenodename": Unknown host
>> >> >> >> slurmctld: fatal: slurm_set_addr failure on thenodename
>> >> >> >>
>> >> >> >> The controller won't even restart if I set State=DOWN for the node
>> >> >> >> in /etc/slurm.conf. I have to manually remove the node from the
>> >> >> >> configuration file in order for the controller to restart.
>> >> >> >>
>> >> >> >> This is a huge problem for us! We expected that nodes could go
>> >> >> >> down with graceful failover. Any idea what's going on?
>> >> >> >>
>> >> >> >> mike
>> >> >> >>
>> >> >> >> On Tue, Aug 2, 2011 at 4:53 PM, <[email protected]> wrote:
>> >> >> >>> Your log should say what is happening. If not, try logging on as root and
>> >> >> >>> starting the daemon by hand with lots of debugging (-v's):
>> >> >> >>> "slurmctld -Dvvvvv"
>> >> >> >>>
>> >> >> >>> Quoting Mike Schachter <[email protected]>:
>> >> >> >>>
>> >> >> >>>> Hi there,
>> >> >> >>>>
>> >> >> >>>> If we have a node down, and then restart the slurm controller,
>> >> >> >>>> for some reason slurm won't start up! Is there some way to
>> >> >> >>>> ameliorate this issue?
>> >> >> >>>>
>> >> >> >>>> mike
I concur with Mark, _but_ if putting all of your hosts into
/etc/hosts is not possible, SLURM has a node state of "Future" that
might be used for nodes that will appear later. It is intended more
for adding nodes from the cloud and will take some effort to manage
the node state in SLURM, but it could be used here.
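As a rough illustration only (this sketch reuses the example node names
from the thread and is not from anyone's actual config), a "Future"
node definition in slurm.conf might look like:

# Hypothetical slurm.conf fragment, based on the node lines quoted above.
# State=FUTURE tells slurmctld the node is defined but not yet available,
# so the controller should start even though the hostname may not resolve yet.
NodeName=thenodename Procs=4 State=FUTURE
PartitionName=all Nodes=nodename1,nodename2,thenodename Priority=100 Shared=NO Default=YES

When the node actually comes online (and its hostname resolves), its
state would then need to be updated administratively, e.g. with
scontrol; check the slurm.conf and scontrol man pages for your SLURM
version, since "Future"/cloud support has varied across releases.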
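Whichever route you take, the resolution check Mark did with the Perl
one-liner can be scripted for every node before (re)starting slurmctld.
A minimal sketch in Python (the hostnames here are the thread's example
names, not real hosts):

```python
import socket

def resolvable(hostname):
    """Return True if hostname resolves to an IPv4 address, like
    Perl's gethostbyname check in the thread."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False

# Check every node listed in the partition before starting the controller.
for node in ["nodename1", "nodename2", "thenodename"]:
    status = "ok" if resolvable(node) else "UNRESOLVABLE"
    print(node, status)
```

Running something like this from the controller host would have flagged
"thenodename" as unresolvable before slurmctld hit its fatal
slurm_set_addr error.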
