Dear David,

we ran into much the same problem. We create Slurm users simply by the numeric user ID, i.e. 'sacctmgr add user 43224'. This should help you out.
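The workaround can be sketched as a short shell snippet. Hedged: the UID is taken from the current user, the sacctmgr command is only printed rather than executed (so the sketch runs without a Slurm installation), and 'test' as the default account is an assumption borrowed from David's setup below.

```shell
# Sketch of the workaround: add the accounting user by numeric UID
# rather than by name.  id -u yields the current user's UID; the
# sacctmgr command (its real -i/--immediate flag skips the
# confirmation prompt) is only echoed here, not executed.
uid=$(id -u)
echo "sacctmgr -i add user ${uid} DefaultAccount=test"
```

On a host with Slurm installed, dropping the echo would actually create the user.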

best regards,
Markus

On 04/10/2015 12:01 AM, David Carlet wrote:
Yup;
pdsh -w proto-node[01-12] cat /etc/passwd | grep protoadmin:
proto-node01: protoadmin:x:500:500:Proto Admin:/home/protoadmin:/bin/bash
(repeat for all the other nodes)
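Worth noting: the Slurm daemons resolve users through the standard NSS lookups (getpwnam/getpwuid), so `getent passwd` is a more faithful check than grepping /etc/passwd directly, especially if nsswitch.conf lists other sources. A sketch using the current user as a stand-in for protoadmin, so it runs anywhere:

```shell
# Check that a user resolves through NSS, which is what the Slurm
# daemons actually use, rather than reading /etc/passwd by hand.
# The current user stands in for protoadmin so the sketch is portable.
user=$(id -un)
getent passwd "$user" | cut -d: -f1,3   # prints name:uid
```

Across the cluster this could be combined with pdsh, e.g. `pdsh -w proto-node[01-12] getent passwd protoadmin`.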

On Thu, Apr 9, 2015 at 5:16 PM, Sarlo, Jeffrey S
<[email protected]> wrote:

    Does user 500 exist on your compute nodes?

    Jeff

    *From:* David Carlet [mailto:[email protected]]
    *Sent:* Thursday, April 09, 2015 4:00 PM
    *To:* slurm-dev
    *Subject:* [slurm-dev] SlurmDBD/accounting not aligning user names
    and UIDs?

    So, I built a multi-node, single-head Warewulf cluster and installed
    SLURM, minus the accounting bits, and after a bit of learning, I got
    it up and running just fine and dandy.  Jobs submitted fine, MPI ran
    fine, etc.

    However, when I tore SLURM out and recompiled it to be a bit more
    production-ready (integrating with MySQL for accounting and BLCR for
    checkpoint/restart), things went awry.

    According to scontrol, all my nodes show up and register.  I added
    the cluster to the accounting database (sacctmgr add cluster
    proto), added a test account (sacctmgr add account test
    Description="test" Organization="none"), and added a user to that
    account (sacctmgr add user protoadmin DefaultAccount=test
    Cluster=proto Partition=protonodes).  Then I attempted to submit a
    job (salloc -N8 sh) and it failed with: "salloc: error: Job
    submit/allocate failed: Invalid account or account/partition
    combination specified."  Manually specifying the account (salloc
    -N8 --account=test sh) gave the same error.  The slurmdbd log
    shows only a long string of "DBD_CLUSTER_CPUS: cluster not
    registered", which didn't seem useful, so I checked slurmctld.log.
    That provided some lovely error messages:

        error: User 500 not found
        _job_create: invalid account or partition for user 500, account '(null)', and partition 'protonodes'
        _slurm_rpc_allocate_resources: Invalid account or account/partition combination specified.

    (The same messages repeat, except with account '(test)', when I
    specified the account.)
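For comparing those errors against what slurmdbd actually stored, sacctmgr's show subcommands are the usual tool. A sketch that only prints the queries (so it runs without a Slurm installation); the cluster and user names are the ones from this thread:

```shell
# Print (rather than run) the sacctmgr queries that show what slurmdbd
# recorded: the registered cluster, its associations, and the user's
# association records.  Names come from the thread above.
for q in \
    "sacctmgr show cluster proto" \
    "sacctmgr show associations cluster=proto" \
    "sacctmgr show user protoadmin withassoc"
do
    echo "$q"
done
```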

    Which I found odd: userid 500 DOES belong to protoadmin, and the
    associations for it clearly show that it is a member of the test
    account and associated with the protonodes partition.  Weird.

    I tried the same thing as root and got the exact same type of
    error message (s/500/0/g), minus the "error: User xxx not found"
    bit.

    I've been poking around at the config files, and as far as I can
    tell from reading the documentation, nothing seems inconsistent.
    Does anyone have any ideas what might be holding this up?  It
    seems to me like the database for some reason can't associate the
    user ID with the user name it stores...but isn't it supposed to
    only care what the user name is?

    Also, I have the /etc/passwd, /etc/group, /etc/shadow, etc. files
    managed with Warewulf's file provisioning, so they are identical
    across the cluster.  Same with the slurm.conf file.
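The "User 500 not found" message suggests that resolution from UID back to name may be failing on the host running slurmctld, so it can help to test the lookup in both directions. A hedged sketch with the current user standing in for protoadmin/UID 500, so it runs anywhere:

```shell
# Name -> UID (getpwnam) and UID -> name (getpwuid) must both work on
# the machine running slurmctld.  getent passwd accepts either a name
# or a numeric UID as its key, exercising both lookup directions.
name=$(id -un)
uid=$(id -u)
echo "forward: $(getent passwd "$name" | cut -d: -f3)"   # expect the UID
echo "reverse: $(getent passwd "$uid" | cut -d: -f1)"    # expect the name
```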

    Some other tidbits:

    Packages installed on the head node (yum list installed | grep
    slurm):

    slurm.x86_64 14.11.3-1.el6 (all the packages below are this same
    version, so I won't retype it)
    slurm-blcr
    slurm-devel
    slurm-munge
    slurm-pam_slurm
    slurm-perlapi
    slurm-plugins
    slurm-sjobexit
    slurm-sjstat
    slurm-slurmdb-direct
    slurm-slurmdbd
    slurm-sql
    slurm-torque (I actually need to remove this, as I won't be using
    Torque)

    On the nodes (yum --installroot=/var/chroots/nodecent65/ list
    installed | grep slurm):

    slurm
    slurm-munge
    slurm-plugins

    Slurm.conf:

    SlurmUser=slurm
    usepam=no
    AccountingStorageEnforce=associations
    AccountingStorageHost=protohead
    AccountingStoragePort=6819
    AccountingStorageType=accounting_storage/slurmdbd
    AccountStorageUser=slurmdbd
    ClusterName=proto

    Slurmdbd.conf:

    DbdAddr=protohead
    DbdHost=protohead
    DbdPort=6819
    SlurmUser=slurm
    DebugLevel=5
    StorageType=accounting_storage/MySQL
    StorageHost=protohead
    StoragePort=3306
    StoragePass=(password)
    StorageUser=slurmdbd
    StorageLoc=slurm_acct_db

    If someone could help me figure this out, it'd be great!  I've
    been beating my head against the wall for the last day and a half
    with this.

    Thanks!

    -Dave




--
=====================================================
Dr. Markus Stöhr
Zentraler Informatikdienst BOKU Wien / TU Wien
Wiedner Hauptstraße 8-10
1040 Wien

Tel. +43-1-58801-420754
Fax  +43-1-58801-9420754

Email: [email protected]
=====================================================
