Alexis, do you happen to notice anything out of the ordinary in your 
slurmctld.log?  If possible, I would run at debug level 6.
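
For reference, a sketch of how to bump the controller to that level — the paths assume the Debian slurm-llnl layout visible in the slurm.conf below:

```shell
# Raise slurmctld verbosity on the fly (no restart needed)
scontrol setdebug 6

# Or make it persistent: set SlurmctldDebug=6 in
# /etc/slurm-llnl/slurm.conf, then push the change out:
scontrol reconfigure

# Watch the controller log while reproducing the salloc problem
tail -f /var/log/slurm-llnl/slurmctld.log
```

Remember to set it back down afterwards; level 6 is chatty.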

Make sure you updated your compute nodes as well.
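
A quick way to check that the controller and every compute node are running the same version (host list taken from your node definitions; adjust as needed):

```shell
# All of these should report the same Slurm version
slurmctld -V
for h in campion carissimi borodine britten bruckner buxtehude chopin clerambault; do
    echo -n "$h: "; ssh "$h" slurmd -V
done
```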

Danny

On Tuesday, July 12, 2011 2:24:00 PM you wrote:
> No problem!
> 
> Here it is
> 
> 
> ControlMachine=couperin
> AuthType=auth/none
> CacheGroups=1
> CryptoType=crypto/openssl
> JobCredentialPrivateKey=/etc/slurm-llnl/slurm.key
> JobCredentialPublicCertificate=/etc/slurm-llnl/slurm.cert
> #DisableRootJobs=NO
> #EnforcePartLimits=NO
> JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
> MpiDefault=none
> ProctrackType=proctrack/pgid
> ReturnToService=2
> 
> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
> SlurmctldPort=6817
> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
> SlurmdPort=6818
> SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
> SlurmUser=slurm
> 
> StateSaveLocation=/var/lib/slurm-llnl/slurmctld
> SwitchType=switch/none
> TaskPlugin=task/none
> 
> UsePAM=1
> 
> # TIMERS
> ####InactiveLimit=300
> InactiveLimit=0
> KillWait=600
> MinJobAge=300
> SlurmctldTimeout=300
> SlurmdTimeout=300
> Waittime=0
> 
> # SCHEDULING
> #DefMemPerCPU=0
> FastSchedule=1
> #MaxMemPerCPU=0
> #SchedulerRootFilter=1
> #SchedulerTimeSlice=30
> SchedulerType=sched/backfill
> SchedulerPort=7321
> SelectType=select/linear
> #SelectTypeParameters=
> #
> #
> # JOB PRIORITY
> #PriorityType=priority/basic
> #PriorityDecayHalfLife=
> #PriorityFavorSmall=
> #PriorityMaxAge=
> #PriorityUsageResetPeriod=
> #PriorityWeightAge=
> #PriorityWeightFairshare=
> #PriorityWeightJobSize=
> #PriorityWeightPartition=
> #PriorityWeightQOS=
> #
> #
> # LOGGING AND ACCOUNTING
> #AccountingStorageEnforce=0
> #AccountingStorageHost=
> #AccountingStorageLoc=/var/log/slurm-llnl/slurm_acct.log
> #AccountingStoragePass=
> #AccountingStoragePort=
> #AccountingStorageType=accounting_storage/filetxt
> #AccountingStorageUser=
> ClusterName=cluster
> #DebugFlags=
> #JobCompHost=
> #JobCompLoc=
> #JobCompPass=
> #JobCompPort=
> JobCompType=jobcomp/none
> #JobCompUser=
> JobAcctGatherFrequency=30
> JobAcctGatherType=jobacct_gather/none
> SlurmctldDebug=3
> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
> SlurmdDebug=3
> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
> #
> #
> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
> #SuspendProgram=
> #ResumeProgram=
> #SuspendTimeout=
> #ResumeTimeout=
> #ResumeRate=
> #SuspendExcNodes=
> #SuspendExcParts=
> #SuspendRate=
> #SuspendTime=
> #
> #
> # COMPUTE NODES
> NodeName=DEFAULT TmpDisk=20480 State=UNKNOWN
> NodeName=campion NodeHostname=campion Sockets=1 CoresPerSocket=4 RealMemory=7993
> NodeName=carissimi NodeHostname=carissimi Sockets=1 CoresPerSocket=2 RealMemory=3953
> NodeName=borodine NodeHostname=borodine Sockets=1 CoresPerSocket=4 RealMemory=7993
> NodeName=britten NodeHostname=britten Sockets=1 CoresPerSocket=4 RealMemory=3953
> NodeName=bruckner NodeHostname=bruckner Sockets=2 CoresPerSocket=4 RealMemory=24153
> NodeName=buxtehude NodeHostname=buxtehude Sockets=2 CoresPerSocket=4 RealMemory=24153
> NodeName=chopin NodeHostname=chopin Sockets=1 CoresPerSocket=4 RealMemory=3924
> NodeName=clerambault NodeHostname=clerambault Sockets=4 CoresPerSocket=8 RealMemory=129177
> 
> 
> PartitionName=cluster Nodes=campion,carissimi,borodine,britten,bruckner,buxtehude,chopin,clerambault Default=YES MaxTime=04:00:00 State=UP
> 
> > Except for the "job complete message received" message, it looks pretty
> > normal to me; I'll have to defer to others.
> >
> > Alexis, it will probably help if you could forward a copy of your
> > slurm.conf file.
> >
> > Regards,
> > Andy
> >
> > On 07/12/2011 07:43 AM, Alexis GÜNST HORN wrote:
> >> Thanks.
> >> Here is the output:
> >> Everything seems normal, no?
> >>
> >> agunst@couperin:~$ salloc -vvvvv -w bruckner
> >> salloc: defined options for program `salloc'
> >> salloc: --------------- ---------------------
> >> salloc: user : `agunst'
> >> salloc: uid : 10001
> >> salloc: gid : 10007
> >> salloc: ntasks : 1 (default)
> >> salloc: cpus_per_task : 1 (default)
> >> salloc: nodes : 1 (default)
> >> salloc: partition : default
> >> salloc: job name : `bash'
> >> salloc: reservation : `(null)'
> >> salloc: wckey : `(null)'
> >> salloc: distribution : unknown
> >> salloc: verbose : 5
> >> salloc: immediate : false
> >> salloc: overcommit : false
> >> salloc: account : (null)
> >> salloc: comment : (null)
> >> salloc: dependency : (null)
> >> salloc: network : (null)
> >> salloc: qos : (null)
> >> salloc: constraints : mincpus=1 nodelist=bruckner
> >> salloc: geometry : (null)
> >> salloc: reboot : yes
> >> salloc: rotate : no
> >> salloc: mail_type : NONE
> >> salloc: mail_user : (null)
> >> salloc: sockets-per-node : -2
> >> salloc: cores-per-socket : -2
> >> salloc: threads-per-core : -2
> >> salloc: ntasks-per-node : 0
> >> salloc: ntasks-per-socket : -2
> >> salloc: ntasks-per-core : -2
> >> salloc: plane_size : 4294967294
> >> salloc: cpu_bind : default
> >> salloc: mem_bind : default
> >> salloc: user command : `/bin/bash'
> >> salloc: debug: Entering slurm_allocation_msg_thr_create()
> >> salloc: debug: port from net_stream_listen is 45043
> >> salloc: debug: Entering _msg_thr_internal
> >> salloc: debug4: eio: handling events for 1 objects
> >> salloc: debug3: Called eio_message_socket_readable 0 3
> >> salloc: debug3: Trying to load plugin /usr/lib/slurm/auth_none.so
> >> salloc: Null authentication plugin loaded
> >> salloc: debug3: Success.
> >> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_cons_res.so
> >> salloc: debug3: Success.
> >> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_bgq.so
> >> salloc: debug3: Success.
> >> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_cray.so
> >> salloc: debug3: Success.
> >> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_linear.so
> >> salloc: debug3: Success.
> >> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_bluegene.so
> >> salloc: debug3: Success.
> >> salloc: debug4: eio: handling events for 1 objects
> >> salloc: debug3: Called eio_message_socket_readable 0 3
> >> salloc: Granted job allocation 323
> >> salloc: debug: laying out the 8 tasks on 1 hosts bruckner
> >> salloc: Relinquishing job allocation 323
> >> salloc: debug3: Called eio_msg_socket_accept
> >> salloc: debug2: got message connection from 192.168.96.104:37736 7
> >> salloc: debug3: job complete message received
> >> salloc: Job allocation 323 has been revoked.
> >> salloc: debug4: eio: handling events for 1 objects
> >> salloc: debug3: Called eio_message_socket_readable 0 3
> >> salloc: debug2: slurm_allocation_msg_thr_destroy: clearing up message
> >> thread
> >> salloc: debug4: eio: handling events for 1 objects
> >> salloc: debug3: Called eio_message_socket_readable 1 3
> >> salloc: debug2: false, shutdown
> >> salloc: debug: Leaving _msg_thr_internal
> >>
> >>
> >>
> >
