Dear David,
We ran into much the same problem. We create Slurm users by their numeric user ID only, i.e. 'sacctmgr add user 43224'. This should help you out.
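A rough sketch of that workaround, not our actual tooling: the hypothetical helper below (the name uid_add_cmd is made up for illustration) looks up a login name's numeric UID with 'id -u' and prints the matching sacctmgr command instead of running it, so you can review it first. The DefaultAccount and Cluster options are the ones from your own commands; I am assuming you want the same association for the UID-named user.

```shell
# Hypothetical helper: print (do not execute) the sacctmgr command that
# registers a user under the numeric UID rather than the login name.
# $1 = login name, $2 = default account, $3 = cluster name
uid_add_cmd() {
    # Resolve the login name to its numeric UID; fail if the name is unknown.
    _uid=$(id -u "$1") || return 1
    printf 'sacctmgr add user %s DefaultAccount=%s Cluster=%s\n' "$_uid" "$2" "$3"
}
```

On your head node, 'uid_add_cmd protoadmin test proto' should print 'sacctmgr add user 500 DefaultAccount=test Cluster=proto', given the passwd entry you posted.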
Best regards,
Markus

On 04/10/2015 12:01 AM, David Carlet wrote:
Yup; pdsh -w proto-node[01-12] cat /etc/passwd | grep protoadmin:

proto-node01: protoadmin:x:500:500:Proto Admin:/home/protoadmin:/bin/bash
(repeat for all the other nodes)

On Thu, Apr 9, 2015 at 5:16 PM, Sarlo, Jeffrey S <[email protected]> wrote:

Does User 500 exist on your compute nodes?

Jeff

From: David Carlet [mailto:[email protected]]
Sent: Thursday, April 09, 2015 4:00 PM
To: slurm-dev
Subject: [slurm-dev] SlurmDBD/accounting not aligning user names and UIDs?

So, I built a multi-node, single-head Warewulf cluster and installed SLURM, minus the accounting bits, and after a bit of learning, I got it up and running just fine and dandy. Jobs submitted fine, MPI ran fine, etc.

However, when I tore SLURM out, recompiled it to be a bit more production-ready (integrating with MySQL for accounting and BLCR for checkpoint/restart)...things went awry.

According to scontrol, all my nodes show up and register. I added the cluster to the accounting database (sacctmgr add cluster proto), added a test account (sacctmgr add account test Description="test" Organization="none"), and added a user to that account (sacctmgr add user protoadmin DefaultAccount=test Cluster=proto Partition=protonodes). Then I attempted to submit a job (salloc -N8 sh)--it failed. The error message:

salloc: error: Job submit/allocate failed: Invalid account or account/partition combination specified.

So then I tried manually specifying the account (salloc -N8 --account=test sh), same error. So I decided to check the slurmdbd log, and all I see are a long string of "DBD_CLUSTER_CPUS: cluster not registered"...not seemingly useful, so I decided to check the slurmctld.log file.
This provided some lovely error messages:

error: User 500 not found
_job_create: invalid account or partition for user 500, account '(null)', and partition 'protonodes'
_slurm_rpc_allocate_resources: Invalid account or account/partition combination specified.

(and then repeated, except account '(test)' for when I tried specifying the account)

Which I found odd--userid 500 DOES belong to protoadmin, and the associations for it clearly show that it is a member of the test account and associated with the protonodes partition. Weird.

I tried the same thing but with root, and got the exact same type of error message (s/500/0/g), minus the "error: User xxx not found" bit.

I've been poking around at the config files, and as far as I can tell, reading the documentation, nothing seems inconsistent. Does anyone have any ideas what might be holding this up? It seems to me like the database for some reason can't associate the user id with the user name it stores in the database...but isn't it supposed to only care what the user name is?

Also, I have the /etc/passwd, /group, /shadow, etc. files managed with warewulf's file provisioning, so they are identical across the cluster.
Same with the slurm.conf file.

Some other tidbits:

Packages installed on head node:
yum list installed | grep slurm
slurm.x86_64 14.11.3-1.el6 (won't retype version unless it changes, which it doesn't)
slurm-blcr
slurm-devel
slurm-munge
slurm-pam_slurm
slurm-perlapi
slurm-plugins
slurm-sjobexit
slurm-sjstat
slurm-slurmdb-direct
slurm-slurmdbd
slurm-sql
slurm-torque (I actually need to remove this as I won't be using torque)

On the nodes:
yum --installroot=/var/chroots/nodecent65/ list installed | grep slurm:
slurm
slurm-munge
slurm-plugins

Slurm.conf:
SlurmUser=slurm
usepam=no
AccountingStorageEnforce=associations
AccountingStorageHost=protohead
AccountingStoragePort=6819
AccountingStorageType=accounting_storage/slurmdbd
AccountStorageUser=slurmdbd
ClusterName=proto

Slurmdbd.conf:
DbdAddr=protohead
DbdHost=protohead
DbdPort=6819
SlurmUser=slurm
DebugLevel=5
StorageType=accounting_storage/MySQL
StorageHost=protohead
StoragePort=3306
StoragePass=(password)
StorageUser=slurmdbd
StorageLoc=slurm_acct_db

If someone could help me figure this out, it'd be great! Been beating my head against the wall the last day and a half with this.

Thanks!
-Dave
--
=====================================================
Dr. Markus Stöhr
Zentraler Informatikdienst BOKU Wien / TU Wien
Wiedner Hauptstraße 8-10
1040 Wien
Tel. +43-1-58801-420754
Fax +43-1-58801-9420754
Email: [email protected]
=====================================================
