Alexis, do you happen to notice anything out of the ordinary in your slurmctld.log? If possible, I would run at debug level 6.
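For reference, a minimal sketch of how the controller's log level can be raised — either on the fly with scontrol, or persistently in slurm.conf (the exact numeric-to-name mapping of debug levels varies by Slurm version, so treat level "6" as an assumption for this 2.x-era setup):

```shell
# Temporarily raise slurmctld logging (lasts until restart/reconfigure):
scontrol setdebug 6

# Or set it persistently in slurm.conf:
#   SlurmctldDebug=6
# and then push the change to the daemons:
scontrol reconfigure
```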
Make sure you updated your compute nodes as well.

Danny

On Tuesday, July 12, 2011 2:24:00 PM you wrote:
> No problem!
>
> Here it is:
>
> ControlMachine=couperin
> AuthType=auth/none
> CacheGroups=1
> CryptoType=crypto/openssl
> JobCredentialPrivateKey=/etc/slurm-llnl/slurm.key
> JobCredentialPublicCertificate=/etc/slurm-llnl/slurm.cert
> #DisableRootJobs=NO
> #EnforcePartLimits=NO
> JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
> MpiDefault=none
> ProctrackType=proctrack/pgid
> ReturnToService=2
>
> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
> SlurmUser=slurm
>
> StateSaveLocation=/var/lib/slurm-llnl/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/none
>
> UsePAM=1
>
> # TIMERS
> ####InactiveLimit=300
> InactiveLimit=0
> KillWait=600
> MinJobAge=300
> SlurmctldTimeout=300
> SlurmdTimeout=300
> Waittime=0
>
> # SCHEDULING
> #DefMemPerCPU=0
> FastSchedule=1
> #MaxMemPerCPU=0
> #SchedulerRootFilter=1
> #SchedulerTimeSlice=30
> SchedulerType=sched/backfill
> SchedulerPort=7321
> SelectType=select/linear
> #SelectTypeParameters=
> #
> #
> # JOB PRIORITY
> #PriorityType=priority/basic
> #PriorityDecayHalfLife=
> #PriorityFavorSmall=
> #PriorityMaxAge=
> #PriorityUsageResetPeriod=
> #PriorityWeightAge=
> #PriorityWeightFairshare=
> #PriorityWeightJobSize=
> #PriorityWeightPartition=
> #PriorityWeightQOS=
> #
> #
> # LOGGING AND ACCOUNTING
> #AccountingStorageEnforce=0
> #AccountingStorageHost=
> #AccountingStorageLoc=/var/log/slurm-llnl/slurm_acct.log
> #AccountingStoragePass=
> #AccountingStoragePort=
> #AccountingStorageType=accounting_storage/filetxt
> #AccountingStorageUser=
> ClusterName=cluster
> #DebugFlags=
> #JobCompHost=
> #JobCompLoc=
> #JobCompPass=
> #JobCompPort=
> JobCompType=jobcomp/none
> #JobCompUser=
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=3
> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> SlurmdDebug=3
> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> #
> #
> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
> #SuspendProgram=
> #ResumeProgram=
> #SuspendTimeout=
> #ResumeTimeout=
> #ResumeRate=
> #SuspendExcNodes=
> #SuspendExcParts=
> #SuspendRate=
> #SuspendTime=
> #
> #
> # COMPUTE NODES
> NodeName=DEFAULT TmpDisk=20480 State=UNKNOWN
> NodeName=campion NodeHostname=campion Sockets=1 CoresPerSocket=4 RealMemory=7993
> NodeName=carissimi NodeHostname=carissimi Sockets=1 CoresPerSocket=2 RealMemory=3953
> NodeName=borodine NodeHostname=borodine Sockets=1 CoresPerSocket=4 RealMemory=7993
> NodeName=britten NodeHostname=britten Sockets=1 CoresPerSocket=4 RealMemory=3953
> NodeName=bruckner NodeHostname=bruckner Sockets=2 CoresPerSocket=4 RealMemory=24153
> NodeName=buxtehude NodeHostname=buxtehude Sockets=2 CoresPerSocket=4 RealMemory=24153
> NodeName=chopin NodeHostname=chopin Sockets=1 CoresPerSocket=4 RealMemory=3924
> NodeName=clerambault NodeHostname=clerambault Sockets=4 CoresPerSocket=8 RealMemory=129177
>
> PartitionName=cluster Nodes=campion,carissimi,borodine,britten,bruckner,buxtehude,chopin,clerambault Default=YES MaxTime=04:00:00 State=UP
>
> > Except for the "job complete message received" message, it looks pretty
> > normal to me; I'll have to defer to others.
> >
> > Alexis, it will probably help if you could forward a copy of your
> > slurm.conf file.
> >
> > Regards,
> > Andy
> >
> > On 07/12/2011 07:43 AM, Alexis GÜNST HORN wrote:
> >> Thanks.
> >> Here is the output:
> >> Everything seems normal, no?
> >>
> >> agunst@couperin:~$ salloc -vvvvv -w bruckner
> >> salloc: defined options for program `salloc'
> >> salloc: --------------- ---------------------
> >> salloc: user              : `agunst'
> >> salloc: uid               : 10001
> >> salloc: gid               : 10007
> >> salloc: ntasks            : 1 (default)
> >> salloc: cpus_per_task     : 1 (default)
> >> salloc: nodes             : 1 (default)
> >> salloc: partition         : default
> >> salloc: job name          : `bash'
> >> salloc: reservation       : `(null)'
> >> salloc: wckey             : `(null)'
> >> salloc: distribution      : unknown
> >> salloc: verbose           : 5
> >> salloc: immediate         : false
> >> salloc: overcommit        : false
> >> salloc: account           : (null)
> >> salloc: comment           : (null)
> >> salloc: dependency        : (null)
> >> salloc: network           : (null)
> >> salloc: qos               : (null)
> >> salloc: constraints       : mincpus=1 nodelist=bruckner
> >> salloc: geometry          : (null)
> >> salloc: reboot            : yes
> >> salloc: rotate            : no
> >> salloc: mail_type         : NONE
> >> salloc: mail_user         : (null)
> >> salloc: sockets-per-node  : -2
> >> salloc: cores-per-socket  : -2
> >> salloc: threads-per-core  : -2
> >> salloc: ntasks-per-node   : 0
> >> salloc: ntasks-per-socket : -2
> >> salloc: ntasks-per-core   : -2
> >> salloc: plane_size        : 4294967294
> >> salloc: cpu_bind          : default
> >> salloc: mem_bind          : default
> >> salloc: user command      : `/bin/bash'
> >> salloc: debug:  Entering slurm_allocation_msg_thr_create()
> >> salloc: debug:  port from net_stream_listen is 45043
> >> salloc: debug:  Entering _msg_thr_internal
> >> salloc: debug4: eio: handling events for 1 objects
> >> salloc: debug3: Called eio_message_socket_readable 0 3
> >> salloc: debug3: Trying to load plugin /usr/lib/slurm/auth_none.so
> >> salloc: Null authentication plugin loaded
> >> salloc: debug3: Success.
> >> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_cons_res.so
> >> salloc: debug3: Success.
> >> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_bgq.so
> >> salloc: debug3: Success.
> >> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_cray.so
> >> salloc: debug3: Success.
> >> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_linear.so
> >> salloc: debug3: Success.
> >> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_bluegene.so
> >> salloc: debug3: Success.
> >> salloc: debug4: eio: handling events for 1 objects
> >> salloc: debug3: Called eio_message_socket_readable 0 3
> >> salloc: Granted job allocation 323
> >> salloc: debug:  laying out the 8 tasks on 1 hosts bruckner
> >> salloc: Relinquishing job allocation 323
> >> salloc: debug3: Called eio_msg_socket_accept
> >> salloc: debug2: got message connection from 192.168.96.104:37736 7
> >> salloc: debug3: job complete message received
> >> salloc: Job allocation 323 has been revoked.
> >> salloc: debug4: eio: handling events for 1 objects
> >> salloc: debug3: Called eio_message_socket_readable 0 3
> >> salloc: debug2: slurm_allocation_msg_thr_destroy: clearing up message thread
> >> salloc: debug4: eio: handling events for 1 objects
> >> salloc: debug3: Called eio_message_socket_readable 1 3
> >> salloc: debug2: false, shutdown
> >> salloc: debug:  Leaving _msg_thr_internal
> >>
> >>
> >>
> >
