Does User 500 exist on your compute nodes?

Jeff

From: David Carlet [mailto:[email protected]]
Sent: Thursday, April 09, 2015 4:00 PM
To: slurm-dev
Subject: [slurm-dev] SlurmDBD/accounting not aligning user names and UIDs?

So, I built a multi-node, single-head Warewulf cluster and installed SLURM, 
minus the accounting bits, and after a bit of learning, I got it up and running 
just fine and dandy.  Jobs submitted fine, MPI ran fine, etc.

However, when I tore SLURM out, recompiled it to be a bit more production-ready 
(integrating with MySQL for accounting and BLCR for 
checkpoint/restart)...things went awry.

According to scontrol, all my nodes show up and register.  I added the cluster 
to the accounting database (sacctmgr add cluster proto), added a test account 
(sacctmgr add account test Description="test" Organization="none"), and added a 
user to that account (sacctmgr add user protoadmin DefaultAccount=test 
Cluster=proto Partition=protonodes).  Then I attempted to submit a job (salloc 
-N8 sh)--it failed.  The error message: salloc: error: Job submit/allocate 
failed: Invalid account or account/partition combination specified.  So then I 
tried manually specifying the account (salloc -N8 --account=test sh), same 
error.  So I checked the slurmdbd log, where all I see is a long string of 
"DBD_CLUSTER_CPUS: cluster not registered" messages, which did not seem useful, 
so I checked the slurmctld.log file.  This provided some lovely error messages:

  error: User 500 not found
  _job_create: invalid account or partition for user 500, account '(null)', and partition 'protonodes'
  _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified.

(These then repeated, except with account '(test)', for when I tried 
specifying the account.)

Which I found odd--userid 500 DOES belong to protoadmin, and the associations 
for it clearly show that it is a member of the test account and associated with 
the protonodes partition.  Weird.

I tried the same thing but with root, and got the exact same type of error 
message (s/500/0/g), minus the "error: User xxx not found" bit.

I've been poking around at the config files, and as far as I can tell, reading 
the documentation, nothing seems inconsistent.  Does anyone have any ideas what 
might be holding this up?  It seems to me like the database for some reason 
can't associate the user id with the user name it stores in the database...but 
isn't it supposed to only care what the user name is?
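
One way to test that mapping directly is to ask a host to resolve the numeric 
UID itself.  The following is only a minimal sketch, assuming a POSIX shell and 
the standard id utility; the helper name uid_to_name is made up for 
illustration and is not part of SLURM.  It can be run on the head node and on a 
compute node to see whether UID 500 (and UID 0) resolve to a name there:

```shell
#!/bin/sh
# Hypothetical helper: print the user name that a numeric UID resolves to
# on this host, or print nothing if the lookup fails.
uid_to_name() {
    id -un "$1" 2>/dev/null
}

# Check the two UIDs mentioned in the log messages (0 = root, 500 = protoadmin).
for uid in 0 500; do
    name=$(uid_to_name "$uid" || true)
    if [ -n "$name" ]; then
        echo "UID $uid resolves to '$name' on $(uname -n)"
    else
        echo "UID $uid does not resolve to any user on $(uname -n)"
    fi
done
```

If UID 500 fails to resolve on the host running slurmctld, that would at least 
be consistent with the "User 500 not found" message; if it resolves everywhere, 
the problem is more likely on the accounting side.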

Also, I have the /etc/passwd, /etc/group, /etc/shadow, etc. files managed with 
Warewulf's file provisioning, so they are identical across the cluster.  The 
same goes for the slurm.conf file.
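
A rough way to double-check the "identical" claim is to compare two copies byte 
for byte.  This is only a sketch: the helper name same_file is made up, and the 
second path assumes a copy of the file also exists inside the node chroot 
(/var/chroots/nodecent65/), which may not reflect what file provisioning 
actually delivers to a booted node.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when both files have identical contents.
same_file() {
    cmp -s "$1" "$2"
}

# Assumed locations: the head node's file and the copy inside the node chroot.
head_copy=/etc/passwd
node_copy=/var/chroots/nodecent65/etc/passwd

if [ -r "$head_copy" ] && [ -r "$node_copy" ]; then
    if same_file "$head_copy" "$node_copy"; then
        echo "passwd files match"
    else
        echo "passwd files differ"
    fi
else
    echo "skipping: one of the files is not readable on this host"
fi
```

The same comparison applies to /etc/group and slurm.conf by changing the two 
paths.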

Some other tidbits:
Packages installed on head node:
yum list installed | grep slurm
slurm.x86_64 14.11.3-1.el6 (won't retype version unless it changes, which it 
doesn't)
slurm-blcr
slurm-devel
slurm-munge
slurm-pam_slurm
slurm-perlapi
slurm-plugins
slurm-sjobexit
slurm-sjstat
slurm-slurmdb-direct
slurm-slurmdbd
slurm-sql
slurm-torque (I actually need to remove this as I won't be using torque)

On the nodes:
yum --installroot=/var/chroots/nodecent65/ list installed | grep slurm
slurm
slurm-munge
slurm-plugins

Slurm.conf:
SlurmUser=slurm
usepam=no
AccountingStorageEnforce=associations
AccountingStorageHost=protohead
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
AccountStorageUser=slurmdbd
ClusterName=proto

Slurmdbd.conf:
DbdAddr=protohead
DbdHost=protohead
DbdPort=6819
SlurmUser=slurm
DebugLevel=5
StorageType=accounting_storage/MySQL
StorageHost=protohead
StoragePort=3306
StoragePass=(password)
StorageUser=slurmdbd
StorageLoc=slurm_acct_db

If someone could help me figure this out, it'd be great!  Been beating my head 
against the wall the last day and a half with this.

Thanks!
-Dave


