Hi Alexis,

Perhaps this may be of relevance:
https://computing.llnl.gov/linux/slurm/faq.html#sigpipe

I'd also suggest doing something like this:

strace -o salloc.strace -f salloc -w bruckner

That will create a file called salloc.strace that contains a trace of the
system calls made by salloc. Each time you run it that file will be
overwritten, so once you reproduce the described behavior of the shell not
being launched, send the file back to the list (attached, of course).

-Aaron

On Tue, Jul 12, 2011 at 10:27 AM, Alexis GÜNST HORN
<[email protected]> wrote:
> No :(
> Everything is fine.
> All nodes are up to date: Debian Wheezy, SLURM 2.2.7.
>
> Users and groups are resolved from LDAP via nss_ldap, in case that helps...
>
> Thanks, thanks a lot. I don't understand anything anymore.
>
>
> --
> Alexis GÜNST HORN
> System administrator
> Exascale Computing Research
>
> On 12/07/2011 16:22, Danny Auble wrote:
>
>> Alexis, do you happen to notice anything in your slurmctld.log that is
>> out of the ordinary? If possible I would run at debug level 6.
>>
>> Make sure you updated your compute nodes as well.
>>
>> Danny
>>
>> On Tuesday July 12 2011 2:24:00 PM you wrote:
>>
>>> No problem!
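Once the trace exists, the lines of interest are usually the exec of the shell and any fatal signals. A minimal sketch of that filtering step, run here against a stand-in sample trace (the real salloc.strace comes from the strace command above; the sample contents are invented for illustration):

```shell
# Create a stand-in sample trace so the filtering step can be shown;
# with a real trace, skip this line and grep the file strace produced.
printf 'execve("/bin/bash", ["bash"], [/* 30 vars */]) = 0\n--- SIGPIPE (Broken pipe) ---\nread(3, "", 4096) = 0\n' > salloc.strace

# Show only exec calls and signal deliveries from the trace.
grep -E 'execve|--- SIG' salloc.strace
```

If the shell really is never launched, you would expect to see no execve of /bin/bash at all, or a fatal signal shortly before the process exits.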
>>>
>>> Here it is:
>>>
>>> ControlMachine=couperin
>>> AuthType=auth/none
>>> CacheGroups=1
>>> CryptoType=crypto/openssl
>>> JobCredentialPrivateKey=/etc/slurm-llnl/slurm.key
>>> JobCredentialPublicCertificate=/etc/slurm-llnl/slurm.cert
>>> #DisableRootJobs=NO
>>> #EnforcePartLimits=NO
>>> JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
>>> MpiDefault=none
>>> ProctrackType=proctrack/pgid
>>> ReturnToService=2
>>>
>>> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
>>> SlurmctldPort=6817
>>> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
>>> SlurmdPort=6818
>>> SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
>>> SlurmUser=slurm
>>>
>>> StateSaveLocation=/var/lib/slurm-llnl/slurmctld
>>> SwitchType=switch/none
>>> TaskPlugin=task/none
>>>
>>> UsePAM=1
>>>
>>> # TIMERS
>>> ####InactiveLimit=300
>>> InactiveLimit=0
>>> KillWait=600
>>> MinJobAge=300
>>> SlurmctldTimeout=300
>>> SlurmdTimeout=300
>>> Waittime=0
>>>
>>> # SCHEDULING
>>> #DefMemPerCPU=0
>>> FastSchedule=1
>>> #MaxMemPerCPU=0
>>> #SchedulerRootFilter=1
>>> #SchedulerTimeSlice=30
>>> SchedulerType=sched/backfill
>>> SchedulerPort=7321
>>> SelectType=select/linear
>>> #SelectTypeParameters=
>>> #
>>> #
>>> # JOB PRIORITY
>>> #PriorityType=priority/basic
>>> #PriorityDecayHalfLife=
>>> #PriorityFavorSmall=
>>> #PriorityMaxAge=
>>> #PriorityUsageResetPeriod=
>>> #PriorityWeightAge=
>>> #PriorityWeightFairshare=
>>> #PriorityWeightJobSize=
>>> #PriorityWeightPartition=
>>> #PriorityWeightQOS=
>>> #
>>> #
>>> # LOGGING AND ACCOUNTING
>>> #AccountingStorageEnforce=0
>>> #AccountingStorageHost=
>>> #AccountingStorageLoc=/var/log/slurm-llnl/slurm_acct.log
>>> #AccountingStoragePass=
>>> #AccountingStoragePort=
>>> #AccountingStorageType=accounting_storage/filetxt
>>> #AccountingStorageUser=
>>> ClusterName=cluster
>>> #DebugFlags=
>>> #JobCompHost=
>>> #JobCompLoc=
>>> #JobCompPass=
>>> #JobCompPort=
>>> JobCompType=jobcomp/none
>>> #JobCompUser=
>>> JobAcctGatherFrequency=30
>>> JobAcctGatherType=jobacct_gather/none
>>> SlurmctldDebug=3
>>> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
>>> SlurmdDebug=3
>>> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
>>> #
>>> #
>>> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
>>> #SuspendProgram=
>>> #ResumeProgram=
>>> #SuspendTimeout=
>>> #ResumeTimeout=
>>> #ResumeRate=
>>> #SuspendExcNodes=
>>> #SuspendExcParts=
>>> #SuspendRate=
>>> #SuspendTime=
>>> #
>>> #
>>> # COMPUTE NODES
>>> NodeName=DEFAULT TmpDisk=20480 State=UNKNOWN
>>> NodeName=campion NodeHostname=campion Sockets=1 CoresPerSocket=4 RealMemory=7993
>>> NodeName=carissimi NodeHostname=carissimi Sockets=1 CoresPerSocket=2 RealMemory=3953
>>> NodeName=borodine NodeHostname=borodine Sockets=1 CoresPerSocket=4 RealMemory=7993
>>> NodeName=britten NodeHostname=britten Sockets=1 CoresPerSocket=4 RealMemory=3953
>>> NodeName=bruckner NodeHostname=bruckner Sockets=2 CoresPerSocket=4 RealMemory=24153
>>> NodeName=buxtehude NodeHostname=buxtehude Sockets=2 CoresPerSocket=4 RealMemory=24153
>>> NodeName=chopin NodeHostname=chopin Sockets=1 CoresPerSocket=4 RealMemory=3924
>>> NodeName=clerambault NodeHostname=clerambault Sockets=4 CoresPerSocket=8 RealMemory=129177
>>>
>>> PartitionName=cluster Nodes=campion,carissimi,borodine,britten,bruckner,buxtehude,chopin,clerambault Default=YES MaxTime=04:00:00 State=UP
>>>
>>>> Except for the "job complete message received" message, it looks
>>>> pretty normal to me; I'll have to defer to others.
>>>>
>>>> Alexis, it will probably help if you could forward a copy of your
>>>> slurm.conf file.
>>>>
>>>> Regards,
>>>> Andy
>>>>
>>>> On 07/12/2011 07:43 AM, Alexis GÜNST HORN wrote:
>>>>
>>>>> Thanks.
>>>>> Here's the output.
>>>>> Everything seems normal, no?
>>>>>
>>>>> agunst@couperin:~$ salloc -vvvvv -w bruckner
>>>>> salloc: defined options for program `salloc'
>>>>> salloc: --------------- ---------------------
>>>>> salloc: user : `agunst'
>>>>> salloc: uid : 10001
>>>>> salloc: gid : 10007
>>>>> salloc: ntasks : 1 (default)
>>>>> salloc: cpus_per_task : 1 (default)
>>>>> salloc: nodes : 1 (default)
>>>>> salloc: partition : default
>>>>> salloc: job name : `bash'
>>>>> salloc: reservation : `(null)'
>>>>> salloc: wckey : `(null)'
>>>>> salloc: distribution : unknown
>>>>> salloc: verbose : 5
>>>>> salloc: immediate : false
>>>>> salloc: overcommit : false
>>>>> salloc: account : (null)
>>>>> salloc: comment : (null)
>>>>> salloc: dependency : (null)
>>>>> salloc: network : (null)
>>>>> salloc: qos : (null)
>>>>> salloc: constraints : mincpus=1 nodelist=bruckner
>>>>> salloc: geometry : (null)
>>>>> salloc: reboot : yes
>>>>> salloc: rotate : no
>>>>> salloc: mail_type : NONE
>>>>> salloc: mail_user : (null)
>>>>> salloc: sockets-per-node : -2
>>>>> salloc: cores-per-socket : -2
>>>>> salloc: threads-per-core : -2
>>>>> salloc: ntasks-per-node : 0
>>>>> salloc: ntasks-per-socket : -2
>>>>> salloc: ntasks-per-core : -2
>>>>> salloc: plane_size : 4294967294
>>>>> salloc: cpu_bind : default
>>>>> salloc: mem_bind : default
>>>>> salloc: user command : `/bin/bash'
>>>>> salloc: debug: Entering slurm_allocation_msg_thr_create()
>>>>> salloc: debug: port from net_stream_listen is 45043
>>>>> salloc: debug: Entering _msg_thr_internal
>>>>> salloc: debug4: eio: handling events for 1 objects
>>>>> salloc: debug3: Called eio_message_socket_readable 0 3
>>>>> salloc: debug3: Trying to load plugin /usr/lib/slurm/auth_none.so
>>>>> salloc: Null authentication plugin loaded
>>>>> salloc: debug3: Success.
>>>>> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_cons_res.so
>>>>> salloc: debug3: Success.
>>>>> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_bgq.so
>>>>> salloc: debug3: Success.
>>>>> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_cray.so
>>>>> salloc: debug3: Success.
>>>>> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_linear.so
>>>>> salloc: debug3: Success.
>>>>> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_bluegene.so
>>>>> salloc: debug3: Success.
>>>>> salloc: debug4: eio: handling events for 1 objects
>>>>> salloc: debug3: Called eio_message_socket_readable 0 3
>>>>> salloc: Granted job allocation 323
>>>>> salloc: debug: laying out the 8 tasks on 1 hosts bruckner
>>>>> salloc: Relinquishing job allocation 323
>>>>> salloc: debug3: Called eio_msg_socket_accept
>>>>> salloc: debug2: got message connection from 192.168.96.104:37736 7
>>>>> salloc: debug3: job complete message received
>>>>> salloc: Job allocation 323 has been revoked.
>>>>> salloc: debug4: eio: handling events for 1 objects
>>>>> salloc: debug3: Called eio_message_socket_readable 0 3
>>>>> salloc: debug2: slurm_allocation_msg_thr_destroy: clearing up message thread
>>>>> salloc: debug4: eio: handling events for 1 objects
>>>>> salloc: debug3: Called eio_message_socket_readable 1 3
>>>>> salloc: debug2: false, shutdown
>>>>> salloc: debug: Leaving _msg_thr_internal

--
Aaron Knister
Systems Administrator
Division of Information Technology
University of Maryland, Baltimore County
[email protected]
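The FAQ entry linked at the top of the thread describes salloc being killed by SIGPIPE, which would match an allocation that is granted and then immediately relinquished without a shell ever appearing. A generic, non-cluster illustration of spotting a signal death in a shell exit status (the convention is 128 + signal number; SIGPIPE is signal 13, so it would show up as 141):

```shell
# Kill a throwaway subshell with a signal and inspect the exit status.
# SIGTERM (15) is used here because its default action is reliably
# "terminate"; a SIGPIPE death would report 141 the same way.
sh -c 'kill -TERM $$'
echo "exit status: $?"
```

Running this prints "exit status: 143" (128 + 15), the same pattern you would look for in the strace output or in the exit status of a salloc that dies silently.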
