No :(
Everything is fine.
All nodes are up to date: Debian Wheezy, SLURM 2.2.7.
Users and groups come from LDAP via nss_ldap, in case that helps...
Thanks, thanks a lot. I don't understand what is going on anymore.
--
Alexis GÜNST HORN
System administrator
Exascale Computing Research
On 12/07/2011 16:22, Danny Auble wrote:
Alexis, do you happen to notice anything in your slurmctld.log that is out
of the ordinary? If possible I would run at debug level 6.
Make sure you updated your compute nodes as well.
Danny
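[For readers following along: the debug level Danny suggests can be raised on the controller without a restart. A minimal sketch, assuming a standard slurm-llnl layout and that this SLURM version supports `scontrol setdebug`:]

```shell
# Raise slurmctld logging at runtime on the controller (couperin).
# scontrol setdebug accepts a numeric level; 6 matches Danny's suggestion.
scontrol setdebug 6

# Alternatively, make it persistent by setting in slurm.conf:
#   SlurmctldDebug=6
# and then pushing the change out:
scontrol reconfigure

# Watch the controller log while reproducing the salloc problem:
tail -f /var/log/slurm-llnl/slurmctld.log
```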
On Tuesday July 12 2011 2:24:00 PM you wrote:
No problem!
Here it is:
ControlMachine=couperin
AuthType=auth/none
CacheGroups=1
CryptoType=crypto/openssl
JobCredentialPrivateKey=/etc/slurm-llnl/slurm.key
JobCredentialPublicCertificate=/etc/slurm-llnl/slurm.cert
#DisableRootJobs=NO
#EnforcePartLimits=NO
JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=2
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
UsePAM=1
# TIMERS
####InactiveLimit=300
InactiveLimit=0
KillWait=600
MinJobAge=300
SlurmctldTimeout=300
SlurmdTimeout=300
Waittime=0
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=/var/log/slurm-llnl/slurm_acct.log
#AccountingStoragePass=
#AccountingStoragePort=
#AccountingStorageType=accounting_storage/filetxt
#AccountingStorageUser=
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=DEFAULT TmpDisk=20480 State=UNKNOWN
NodeName=campion NodeHostname=campion Sockets=1 CoresPerSocket=4 RealMemory=7993
NodeName=carissimi NodeHostname=carissimi Sockets=1 CoresPerSocket=2 RealMemory=3953
NodeName=borodine NodeHostname=borodine Sockets=1 CoresPerSocket=4 RealMemory=7993
NodeName=britten NodeHostname=britten Sockets=1 CoresPerSocket=4 RealMemory=3953
NodeName=bruckner NodeHostname=bruckner Sockets=2 CoresPerSocket=4 RealMemory=24153
NodeName=buxtehude NodeHostname=buxtehude Sockets=2 CoresPerSocket=4 RealMemory=24153
NodeName=chopin NodeHostname=chopin Sockets=1 CoresPerSocket=4 RealMemory=3924
NodeName=clerambault NodeHostname=clerambault Sockets=4 CoresPerSocket=8 RealMemory=129177
PartitionName=cluster Nodes=campion,carissimi,borodine,britten,bruckner,buxtehude,chopin,clerambault Default=YES MaxTime=04:00:00 State=UP
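[A quick way to sanity-check that the controller actually registered the nodes as this config describes them, using standard SLURM client commands; `bruckner` is just the node from the salloc test below:]

```shell
# Per-node view: state, CPU count, and memory as slurmctld sees them.
sinfo -N -l

# Full record for the node used in the failing allocation.
scontrol show node bruckner

# Confirm the partition picked up the whole node list.
scontrol show partition cluster
```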
Except for the "job complete message received" message, it looks pretty
normal to me; I'll have to defer to others.
Alexis, it will probably help if you could forward a copy of your
slurm.conf file.
Regards,
Andy
On 07/12/2011 07:43 AM, Alexis GÜNST HORN wrote:
Thanks.
Here is the output:
Everything seems normal, no?
agunst@couperin:~$ salloc -vvvvv -w bruckner
salloc: defined options for program `salloc'
salloc: --------------- ---------------------
salloc: user : `agunst'
salloc: uid : 10001
salloc: gid : 10007
salloc: ntasks : 1 (default)
salloc: cpus_per_task : 1 (default)
salloc: nodes : 1 (default)
salloc: partition : default
salloc: job name : `bash'
salloc: reservation : `(null)'
salloc: wckey : `(null)'
salloc: distribution : unknown
salloc: verbose : 5
salloc: immediate : false
salloc: overcommit : false
salloc: account : (null)
salloc: comment : (null)
salloc: dependency : (null)
salloc: network : (null)
salloc: qos : (null)
salloc: constraints : mincpus=1 nodelist=bruckner
salloc: geometry : (null)
salloc: reboot : yes
salloc: rotate : no
salloc: mail_type : NONE
salloc: mail_user : (null)
salloc: sockets-per-node : -2
salloc: cores-per-socket : -2
salloc: threads-per-core : -2
salloc: ntasks-per-node : 0
salloc: ntasks-per-socket : -2
salloc: ntasks-per-core : -2
salloc: plane_size : 4294967294
salloc: cpu_bind : default
salloc: mem_bind : default
salloc: user command : `/bin/bash'
salloc: debug: Entering slurm_allocation_msg_thr_create()
salloc: debug: port from net_stream_listen is 45043
salloc: debug: Entering _msg_thr_internal
salloc: debug4: eio: handling events for 1 objects
salloc: debug3: Called eio_message_socket_readable 0 3
salloc: debug3: Trying to load plugin /usr/lib/slurm/auth_none.so
salloc: Null authentication plugin loaded
salloc: debug3: Success.
salloc: debug3: Trying to load plugin /usr/lib/slurm/select_cons_res.so
salloc: debug3: Success.
salloc: debug3: Trying to load plugin /usr/lib/slurm/select_bgq.so
salloc: debug3: Success.
salloc: debug3: Trying to load plugin /usr/lib/slurm/select_cray.so
salloc: debug3: Success.
salloc: debug3: Trying to load plugin /usr/lib/slurm/select_linear.so
salloc: debug3: Success.
salloc: debug3: Trying to load plugin /usr/lib/slurm/select_bluegene.so
salloc: debug3: Success.
salloc: debug4: eio: handling events for 1 objects
salloc: debug3: Called eio_message_socket_readable 0 3
salloc: Granted job allocation 323
salloc: debug: laying out the 8 tasks on 1 hosts bruckner
salloc: Relinquishing job allocation 323
salloc: debug3: Called eio_msg_socket_accept
salloc: debug2: got message connection from 192.168.96.104:37736 7
salloc: debug3: job complete message received
salloc: Job allocation 323 has been revoked.
salloc: debug4: eio: handling events for 1 objects
salloc: debug3: Called eio_message_socket_readable 0 3
salloc: debug2: slurm_allocation_msg_thr_destroy: clearing up message
thread
salloc: debug4: eio: handling events for 1 objects
salloc: debug3: Called eio_message_socket_readable 1 3
salloc: debug2: false, shutdown
salloc: debug: Leaving _msg_thr_internal