The following specification seems to confirm my suspicion about parsing
problems within slurmctld:


PartitionName=CLUSTER Default=yes State=UP
nodes=gpu-[1]-[4-17],gpu-[2]-[6-16],gpu-[3]-[9],gpu-[2]-[4]

This line works and shows all nodes correctly. Interestingly enough, it
uses the comma-delimited spelling gpu-[2]-[4], which slurmctld rejects
when starting:

CLUSTER      up   infinite      5   down gpu-1-[10,13,16-17],gpu-2-8
CLUSTER      up   infinite     12   idle gpu-2-[4,6-7,9-16],gpu-3-9
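
As a cross-check of the node counts in the sinfo output above, here is a
minimal Python sketch that expands a bracketed host list and counts the
names. The function names split_top_level and expand_hostlist are made up
for illustration; this is not Slurm's hostlist code, and it only handles
a single trailing bracket group per name, as in the two sinfo lines.

```python
import re

def split_top_level(expr):
    """Split on commas that are not inside [...] brackets."""
    parts, depth, cur = [], 0, []
    for ch in expr:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
    parts.append("".join(cur))
    return parts

def expand_hostlist(expr):
    """Expand e.g. 'gpu-2-[4,6-7]' into ['gpu-2-4', 'gpu-2-6', 'gpu-2-7']."""
    hosts = []
    for part in split_top_level(expr):
        m = re.fullmatch(r"([^\[\]]*)\[([^\[\]]+)\]", part)
        if m is None:
            hosts.append(part)  # plain name without a bracket group
            continue
        prefix, ranges = m.groups()
        for item in ranges.split(","):
            lo, _, hi = item.partition("-")
            hi = hi or lo  # a single number is a range of length one
            hosts.extend(f"{prefix}{n}" for n in range(int(lo), int(hi) + 1))
    return hosts

down = expand_hostlist("gpu-1-[10,13,16-17],gpu-2-8")
idle = expand_hostlist("gpu-2-[4,6-7,9-16],gpu-3-9")
print(len(down), len(idle))
```

The two counts agree with the 5 down and 12 idle nodes that sinfo
reports, and gpu-2-4 is among the idle ones.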


-Eva

On Thu, 11 Jul 2013, Eva Hocks wrote:

>
>
> yes, that's the line
>
> PartitionName=CLUSTER Default=yes State=UP 
> nodes=gpu-[1]-[4-17],gpu-[2]-[4,6-16],gpu-[3]-[9]
>
> and I have gpu-2-4 defined in the nodenames file
> NodeName=gpu-2-4 NodeAddr=10.240.31.235 CPUs=32 Sockets=2 CoresPerSocket=8 
> ThreadsPerCore=2 Gres=gpu:4 Weight=20512304 Feature=rack-2,32CPUs 
> RealMemory=245760
>
> as well as /etc/hosts
>
> 10.240.31.235   gpu-2-4.local   gpu-2-4
>
> but nevertheless slurmctld crashes:
>
> [2013-07-11T10:11:40.764] Recovered state of 29 nodes
> [2013-07-11T10:11:40.764] Recovered information about 0 jobs
> [2013-07-11T10:11:40.764] error: find_node_record: lookup failure for 
> gpu-[2]-[4]
> [2013-07-11T10:11:40.764] error: node_name2bitmap: invalid node specified 
> gpu-[2]-[4]
> [2013-07-11T10:11:40.764] error: find_node_record: lookup failure for 6-16]
> [2013-07-11T10:11:40.764] error: node_name2bitmap: invalid node specified 
> 6-16]
> [2013-07-11T10:11:40.764] fatal: Invalid node names in partition CLUSTER
>
>
> Looks to me like a parsing error. Also if torque can't resolve a
> hostname it just logs an error but still functions. slurm is completely
> dead with one missing node!
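
The "6-16]" fragment in the log is what one would get if the node list
were split on every comma, including the one inside the brackets. A small
Python sketch of that effect follows. It is purely illustrative and not
Slurm's actual parser; naive_split and bracket_aware_split are
hypothetical names.

```python
def naive_split(expr):
    """Split on every comma, ignoring bracket nesting."""
    return expr.split(",")

def bracket_aware_split(expr):
    """Split only on commas that lie outside [...] groups."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(expr):
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(expr[start:i])
            start = i + 1
    parts.append(expr[start:])
    return parts

nodes = "gpu-[1]-[4-17],gpu-[2]-[4,6-16],gpu-[3]-[9]"
print(naive_split(nodes))          # second range is cut in two
print(bracket_aware_split(nodes))  # three intact entries
```

The naive split leaves a dangling "6-16]" entry, the same string the
log reports as an invalid node.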
>
>
> Thanks
> Eva
>
> On Wed, 10 Jul 2013, John Thiltges wrote:
>
> >
> > Hi Eva,
> >
> > I wasn't able to reproduce the problem with a quick test. You have
> > config lines similar to these?
> >
> >      NodeName=gpu-1-[4-17],gpu-2-[4,6-16],gpu-3-9 ...
> >      PartitionName=... Nodes=gpu-1-[4-17],gpu-2-[4,6-16],gpu-3-9
> >
> > Regards,
> > John
> >
> > On 2013-07-10 19:20, Eva Hocks wrote:
> > >
> > > Thanks, John
> > >
> > > but this is what I have in the partition file:
> > >
> > > nodes=gpu-1-[4-17],gpu-2-[4,6-16],gpu-3-9
> > >
> > > slurm gets confused when it can't look up gpu-2-4 and then splits the
> > > gpu-2-[4,6-16] into gpu-[2]-[4] (failed lookup) and 6-16] (which is
> > > actually no node name at all but a wrong parsing after the failure)
> > >
> > > Thanks
> > > Eva
> > >
> > > On Wed, 10 Jul 2013, John Thiltges wrote:
> > >
> > >> On 07/10/2013 06:16 PM, Eva Hocks wrote:
> > > >>> The entry in partition.conf:
> > > >>> PartitionName=CLUSTER Default=yes State=UP 
> > > >>> nodes=gpu-[1]-[4-17],gpu-[2]-[4,6-16],gpu-[3]-[9]
> > > >>> causes slurmctld to crash:
> > > >>> [2013-07-10T16:03:22.923] error: find_node_record: lookup failure for 
> > >>> gpu-[2]-[4]
> > >>> [2013-07-10T16:03:22.923] error: node_name2bitmap: invalid node 
> > >>> specified gpu-[2]-[4]
> > >>> [2013-07-10T16:03:22.923] error: find_node_record: lookup failure for 
> > >>> 6-16]
> > >>> [2013-07-10T16:03:22.923] error: node_name2bitmap: invalid node 
> > >>> specified 6-16]
> > >>> [2013-07-10T16:03:22.923] fatal: Invalid node names in partition CLUSTER
> > >> It looks like the hostlist parser is confused by the brackets, finding
> > >> names of "6-16]" and "gpu-[2]-[4]".
> > >> Brackets are only needed when there is a range. If you take out the
> > >> extra brackets, it should parse OK:
> > >>       nodes=gpu-1-[4-17],gpu-2-[4,6-16],gpu-3-9
> > >> Regards,
> > >> John
> > > >
> >
>
>
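
John's suggestion quoted above (drop the brackets that enclose a single
number) can be applied mechanically. A minimal Python sketch, assuming
the node list uses only numeric bracket groups; strip_single_brackets is
a hypothetical helper, not part of Slurm:

```python
import re

def strip_single_brackets(nodes):
    """Remove brackets around a lone number, keep real ranges and lists."""
    return re.sub(r"\[(\d+)\]", r"\1", nodes)

original = "gpu-[1]-[4-17],gpu-[2]-[4,6-16],gpu-[3]-[9]"
cleaned = strip_single_brackets(original)
print(cleaned)
```

For the partition line from this thread the result is the form John
recommended, gpu-1-[4-17],gpu-2-[4,6-16],gpu-3-9.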
