Mr. Jette -

Thanks. This was pilot error. I had made some changes to my /home, and as a result my slurm user account disappeared.
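
For anyone hitting the same symptom, here is a minimal sketch of a check that the account named by SlurmUser (slurm in the config quoted below) still resolves on a host. The function name check_slurm_user is made up for illustration and is not part of Slurm.

```shell
# Sketch only: verify that the account configured as SlurmUser exists.
# check_slurm_user is a hypothetical helper name, not a Slurm command.
check_slurm_user() {
    user="${1:-slurm}"    # defaults to the SlurmUser value from slurm.conf
    if uid=$(id -u "$user" 2>/dev/null); then
        echo "user $user exists (uid $uid)"
    else
        echo "user $user not found"
        return 1
    fi
}
```

Running "check_slurm_user slurm" on the controller and on each compute node should show quickly whether the account went missing somewhere.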

DMR
On 5/24/2013 6:32 PM, Morris Jette wrote:
It looks like the slurmd daemons are not running on your compute nodes. Run "sinfo" to check the node state.
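
A rough sketch of that check, assuming the Slurm client tools are in PATH. The function name check_slurm_nodes is hypothetical, and the node name in the usage note is taken from the config quoted below.

```shell
# Sketch only: show node state and, optionally, details for one node.
# check_slurm_nodes is a hypothetical helper name, not a Slurm command.
check_slurm_nodes() {
    if ! command -v sinfo >/dev/null 2>&1; then
        echo "sinfo not found in PATH"
        return 1
    fi
    sinfo -N -l                      # one line per node, with its state
    sinfo -R                         # reasons recorded for down/drained nodes
    if [ -n "${1:-}" ]; then
        scontrol show node "$1"      # full record for the named node
    fi
}
```

For example, "check_slurm_nodes prod-0001". If the nodes show as down, the usual next steps are to start slurmd on them and, once it is running, clear the state with something like "scontrol update NodeName=prod-000[1-4] State=RESUME".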

David Race <[email protected]> wrote:

    Hello -

    I am sure this is simple, but I cannot get an srun job to start:

    Here is my config file:

    #
    # Example slurm.conf file. Please run configurator.html
    # (in doc/html) to build a configuration file customized
    # for your environment.
    #
    #
    # slurm.conf file generated by configurator.html.
    #
    # See the slurm.conf man page for more information.
    #
    ClusterName=Prod
    ControlMachine=tx321
    #ControlAddr=
    #BackupController=
    #BackupAddr=
    #
    SlurmUser=slurm
    #SlurmdUser=root
    SlurmctldPort=6817
    SlurmdPort=6818
    AuthType=auth/munge
    #JobCredentialPrivateKey=
    #JobCredentialPublicCertificate=
    StateSaveLocation=/tmp
    SlurmdSpoolDir=/tmp/slurmd
    SwitchType=switch/none
    MpiDefault=none
    SlurmctldPidFile=/var/run/slurmctld.pid
    SlurmdPidFile=/var/run/slurmd.pid
    ProctrackType=proctrack/pgid
    #PluginDir=
    CacheGroups=0
    #FirstJobId=
    ReturnToService=2
    #MaxJobCount=
    #PlugStackConfig=
    #PropagatePrioProcess=
    #PropagateResourceLimits=
    #PropagateResourceLimitsExcept=
    #Prolog=
    #Epilog=
    #SrunProlog=
    #SrunEpilog=
    #TaskProlog=
    #TaskEpilog=
    #TaskPlugin=
    #TrackWCKey=no
    #TreeWidth=50
    #TmpFs=
    #UsePAM=
    #
    # TIMERS
    SlurmctldTimeout=300
    SlurmdTimeout=300
    InactiveLimit=0
    MinJobAge=300
    KillWait=30
    Waittime=0
    #
    # SCHEDULING
    SchedulerType=sched/builtin
    #SchedulerAuth=
    #SchedulerPort=
    #SchedulerRootFilter=
    SelectType=select/linear
    FastSchedule=1
    #PriorityType=priority/multifactor
    #PriorityDecayHalfLife=14-0
    #PriorityUsageResetPeriod=14-0
    #PriorityWeightFairshare=100000
    #PriorityWeightAge=1000
    #PriorityWeightPartition=10000
    #PriorityWeightJobSize=1000
    #PriorityMaxAge=1-0
    #
    # LOGGING
    SlurmctldDebug=3
    #SlurmctldLogFile=
    SlurmdDebug=3
    #SlurmdLogFile=
    JobCompType=jobcomp/none
    #JobCompLoc=
    #
    # ACCOUNTING
    #JobAcctGatherType=jobacct_gather/linux
    #JobAcctGatherFrequency=30
    #
    #AccountingStorageType=accounting_storage/slurmdbd
    #AccountingStorageHost=
    #AccountingStorageLoc=
    #AccountingStoragePass=
    #AccountingStorageUser=
    #
    # COMPUTE NODES
    NodeName=prod-000[1-4] Procs=8 State=UNKNOWN
    PartitionName=all.q Nodes=prod-000[1-4] Default=YES MaxTime=INFINITE State=UP

    =============END CONFIG===================================

    I attempt to run srun with

    srun -n 8 -vvvvvv date

    and I get:
    srun: defined options for program `srun'
    srun: --------------- ---------------------
    srun: user           : `root'
    srun: uid            : 0
    srun: gid            : 0
    srun: cwd            : /home/drace/QueueOptions/ACE/ace/daemons
    srun: ntasks         : 8 (set)
    srun: nodes          : 1 (default)
    srun: jobid          : 4294967294 (default)
    srun: partition      : default
    srun: job name       : `(null)'
    srun: reservation    : `(null)'
    srun: wckey          : `(null)'
    srun: switches       : -1
    srun: wait-for-switches : -1
    srun: distribution   : unknown
    srun: cpu_bind       : default
    srun: mem_bind       : default
    srun: cpu_freq       : 4294967294
    srun: verbose        : 5
    srun: slurmd_debug   : 0
    srun: immediate      : false
    srun: label output   : false
    srun: unbuffered IO  : false
    srun: overcommit     : false
    srun: threads        : 60
    srun: checkpoint_dir : /home/drace/QueueOptions/ACE/ace/daemons
    srun: wait           : 0
    srun: account        : (null)
    srun: comment        : (null)
    srun: dependency     : (null)
    srun: exclusive      : false
    srun: qos            : (null)
    srun: constraints    : tmp-per-node=4294967294
    srun: geometry       : (null)
    srun: reboot         : yes
    srun: rotate         : no
    srun: preserve_env   : false
    srun: network        : (null)
    srun: propagate      : NONE
    srun: prolog         : (null)
    srun: epilog         : (null)
    srun: mail_type      : NONE
    srun: mail_user      : (null)
    srun: task_prolog    : (null)
    srun: task_epilog    : (null)
    srun: multi_prog     : no
    srun: sockets-per-node  : -2
    srun: cores-per-socket  : -2
    srun: threads-per-core  : -2
    srun: ntasks-per-node   : -2
    srun: ntasks-per-socket : -2
    srun: ntasks-per-core   : -2
    srun: plane_size        : 4294967294
    srun: remote command    : `/bin/date'
    srun: debug:  propagating RLIMIT_CPU=18446744073709551615
    srun: debug:  propagating RLIMIT_FSIZE=18446744073709551615
    srun: debug:  propagating RLIMIT_DATA=18446744073709551615
    srun: debug:  propagating RLIMIT_STACK=10485760
    srun: debug:  propagating RLIMIT_CORE=0
    srun: debug:  propagating RLIMIT_RSS=18446744073709551615
    srun: debug:  propagating RLIMIT_NPROC=131072
    srun: debug:  propagating RLIMIT_NOFILE=327680
    srun: debug:  propagating RLIMIT_MEMLOCK=18446744073709551615
    srun: debug:  propagating RLIMIT_AS=18446744073709551615
    srun: debug:  propagating SLURM_PRIO_PROCESS=0
    srun: debug:  propagating UMASK=0022
    srun: debug2: srun PMI messages to port=43211
    srun: debug:  Entering slurm_allocation_msg_thr_create()
    srun: debug:  port from net_stream_listen is 34042
    srun: debug:  Entering _msg_thr_internal
    srun: debug4: eio: handling events for 1 objects
    srun: debug3: Called eio_message_socket_readable 0 4
    srun: debug3: Trying to load plugin /slurm/lib/slurm/auth_munge.so
    srun: auth plugin for Munge (http://code.google.com/p/munge/) loaded
    srun: debug3: Success.
    srun: Required node not available (down or drained)
    srun: debug2: Pending job allocation 33
    srun: job 33 queued and waiting for resources


    Any ideas?

    Thanks

    David Race



--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
