That's strange! It sounds like it might be some sort of race condition then. Yeah, strace will make it run VERY slowly since it's writing out each system call to a file. One thing I noticed from the strace output was this line:
13998 read(4, "MemTotal: 524504 kB\nMemFre"..., 1024) = 656

From that output it looks as though the system only has ~512 MB of RAM. I wonder if bash wasn't launching because it couldn't allocate enough memory. I've seen bash do that before. Is there a way you could give the system more memory?

I wonder if you can run something like this repeatedly (15 times or so):

salloc -w bruckner date

-Aaron

On Tue, Jul 12, 2011 at 11:01 AM, Alexis GÜNST HORN <[email protected]> wrote:

> Hi Aaron,
>
> I'm so sad :(
> When I put "strace -o salloc.strace -f" before my salloc command, it works
> every time! But it's really slow: from
>
>   salloc: Pending job allocation 466
>   salloc: job 466 queued and waiting for resources
>
> to
>
>   salloc: job 466 has been allocated resources
>   salloc: Granted job allocation 466
>
> at least 15 seconds elapse (the node is, of course, free).
>
> And I had already read the FAQ about the SIGPIPE.
>
> --
> Alexis GÜNST HORN
> System administrator
> Exascale Computing Research
>
> On 12/07/2011 16:47, Aaron Knister wrote:
>
>> Hi Alexis,
>>
>> Perhaps this may be of relevance:
>>
>> https://computing.llnl.gov/linux/slurm/faq.html#sigpipe
>>
>> I'd also suggest doing something like this:
>>
>> strace -o salloc.strace -f salloc -w bruckner
>>
>> That will create a file called salloc.strace that contains a trace of
>> the system calls made by salloc. Each time you run it that file will be
>> overwritten, so once you reproduce the described behavior of the shell
>> not being launched, send the file back to the list (attached, of
>> course).
>>
>> -Aaron
>>
>> On Tue, Jul 12, 2011 at 10:27 AM, Alexis GÜNST HORN
>> <[email protected]> wrote:
>>
>> No :(
>> Everything is fine.
>> All nodes are up to date: Debian Wheezy, SLURM 2.2.7.
>>
>> Users and groups are linked to LDAP via nss_ldap, in case that helps...
>>
>> Thanks, thanks a lot. I don't understand anything anymore.
>>
>> --
>> Alexis GÜNST HORN
>> System administrator
>> Exascale Computing Research
>>
>> On 12/07/2011 16:22, Danny Auble wrote:
>>
>> Alexis, do you happen to notice anything in your slurmctld.log
>> that is out of the ordinary? If possible I would run at debug level 6.
>>
>> Make sure you updated your compute nodes as well.
>>
>> Danny
>>
>> On Tuesday July 12 2011 2:24:00 PM you wrote:
>>
>> No problem!
>>
>> Here it is:
>>
>> ControlMachine=couperin
>> AuthType=auth/none
>> CacheGroups=1
>> CryptoType=crypto/openssl
>> JobCredentialPrivateKey=/etc/slurm-llnl/slurm.key
>> JobCredentialPublicCertificate=/etc/slurm-llnl/slurm.cert
>> #DisableRootJobs=NO
>> #EnforcePartLimits=NO
>> JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
>> MpiDefault=none
>> ProctrackType=proctrack/pgid
>> ReturnToService=2
>>
>> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
>> SlurmctldPort=6817
>> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
>> SlurmdPort=6818
>> SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
>> SlurmUser=slurm
>>
>> StateSaveLocation=/var/lib/slurm-llnl/slurmctld
>> SwitchType=switch/none
>> TaskPlugin=task/none
>>
>> UsePAM=1
>>
>> # TIMERS
>> ####InactiveLimit=300
>> InactiveLimit=0
>> KillWait=600
>> MinJobAge=300
>> SlurmctldTimeout=300
>> SlurmdTimeout=300
>> Waittime=0
>>
>> # SCHEDULING
>> #DefMemPerCPU=0
>> FastSchedule=1
>> #MaxMemPerCPU=0
>> #SchedulerRootFilter=1
>> #SchedulerTimeSlice=30
>> SchedulerType=sched/backfill
>> SchedulerPort=7321
>> SelectType=select/linear
>> #SelectTypeParameters=
>> #
>> #
>> # JOB PRIORITY
>> #PriorityType=priority/basic
>> #PriorityDecayHalfLife=
>> #PriorityFavorSmall=
>> #PriorityMaxAge=
>> #PriorityUsageResetPeriod=
>> #PriorityWeightAge=
>> #PriorityWeightFairshare=
>> #PriorityWeightJobSize=
>> #PriorityWeightPartition=
>> #PriorityWeightQOS=
>> #
>> #
>> # LOGGING AND ACCOUNTING
>> #AccountingStorageEnforce=0
>> #AccountingStorageHost=
>> #AccountingStorageLoc=/var/log/slurm-llnl/slurm_acct.log
>> #AccountingStoragePass=
>> #AccountingStoragePort=
>> #AccountingStorageType=accounting_storage/filetxt
>> #AccountingStorageUser=
>> ClusterName=cluster
>> #DebugFlags=
>> #JobCompHost=
>> #JobCompLoc=
>> #JobCompPass=
>> #JobCompPort=
>> JobCompType=jobcomp/none
>> #JobCompUser=
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/none
>> SlurmctldDebug=3
>> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
>> SlurmdDebug=3
>> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
>> #
>> #
>> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
>> #SuspendProgram=
>> #ResumeProgram=
>> #SuspendTimeout=
>> #ResumeTimeout=
>> #ResumeRate=
>> #SuspendExcNodes=
>> #SuspendExcParts=
>> #SuspendRate=
>> #SuspendTime=
>> #
>> #
>> # COMPUTE NODES
>> NodeName=DEFAULT TmpDisk=20480 State=UNKNOWN
>> NodeName=campion NodeHostname=campion Sockets=1 CoresPerSocket=4 RealMemory=7993
>> NodeName=carissimi NodeHostname=carissimi Sockets=1 CoresPerSocket=2 RealMemory=3953
>> NodeName=borodine NodeHostname=borodine Sockets=1 CoresPerSocket=4 RealMemory=7993
>> NodeName=britten NodeHostname=britten Sockets=1 CoresPerSocket=4 RealMemory=3953
>> NodeName=bruckner NodeHostname=bruckner Sockets=2 CoresPerSocket=4 RealMemory=24153
>> NodeName=buxtehude NodeHostname=buxtehude Sockets=2 CoresPerSocket=4 RealMemory=24153
>> NodeName=chopin NodeHostname=chopin Sockets=1 CoresPerSocket=4 RealMemory=3924
>> NodeName=clerambault NodeHostname=clerambault Sockets=4 CoresPerSocket=8 RealMemory=129177
>>
>> PartitionName=cluster Nodes=campion,carissimi,borodine,britten,bruckner,buxtehude,chopin,clerambault Default=YES MaxTime=04:00:00 State=UP
>>
>> Except for the "job complete message received"
>> message, it looks pretty normal to me; I'll have to defer to others.
>>
>> Alexis, it will probably help if you could forward a copy of your
>> slurm.conf file.
>>
>> Regards,
>> Andy
>>
>> On 07/12/2011 07:43 AM, Alexis GÜNST HORN wrote:
>>
>> Thanks.
>> Here is the output. Everything seems normal, no?
>>
>> agunst@couperin:~$ salloc -vvvvv -w bruckner
>> salloc: defined options for program `salloc'
>> salloc: --------------- ---------------------
>> salloc: user : `agunst'
>> salloc: uid : 10001
>> salloc: gid : 10007
>> salloc: ntasks : 1 (default)
>> salloc: cpus_per_task : 1 (default)
>> salloc: nodes : 1 (default)
>> salloc: partition : default
>> salloc: job name : `bash'
>> salloc: reservation : `(null)'
>> salloc: wckey : `(null)'
>> salloc: distribution : unknown
>> salloc: verbose : 5
>> salloc: immediate : false
>> salloc: overcommit : false
>> salloc: account : (null)
>> salloc: comment : (null)
>> salloc: dependency : (null)
>> salloc: network : (null)
>> salloc: qos : (null)
>> salloc: constraints : mincpus=1 nodelist=bruckner
>> salloc: geometry : (null)
>> salloc: reboot : yes
>> salloc: rotate : no
>> salloc: mail_type : NONE
>> salloc: mail_user : (null)
>> salloc: sockets-per-node : -2
>> salloc: cores-per-socket : -2
>> salloc: threads-per-core : -2
>> salloc: ntasks-per-node : 0
>> salloc: ntasks-per-socket : -2
>> salloc: ntasks-per-core : -2
>> salloc: plane_size : 4294967294
>> salloc: cpu_bind : default
>> salloc: mem_bind : default
>> salloc: user command : `/bin/bash'
>> salloc: debug: Entering slurm_allocation_msg_thr_create()
>> salloc: debug: port from net_stream_listen is 45043
>> salloc: debug: Entering _msg_thr_internal
>> salloc: debug4: eio: handling events for 1 objects
>> salloc: debug3: Called eio_message_socket_readable 0 3
>> salloc: debug3: Trying to load plugin /usr/lib/slurm/auth_none.so
>> salloc: Null authentication plugin loaded
>> salloc: debug3: Success.
>> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_cons_res.so
>> salloc: debug3: Success.
>> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_bgq.so
>> salloc: debug3: Success.
>> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_cray.so
>> salloc: debug3: Success.
>> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_linear.so
>> salloc: debug3: Success.
>> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_bluegene.so
>> salloc: debug3: Success.
>> salloc: debug4: eio: handling events for 1 objects
>> salloc: debug3: Called eio_message_socket_readable 0 3
>> salloc: Granted job allocation 323
>> salloc: debug: laying out the 8 tasks on 1 hosts bruckner
>> salloc: Relinquishing job allocation 323
>> salloc: debug3: Called eio_msg_socket_accept
>> salloc: debug2: got message connection from 192.168.96.104:37736 7
>> salloc: debug3: job complete message received
>> salloc: Job allocation 323 has been revoked.
>> salloc: debug4: eio: handling events for 1 objects
>> salloc: debug3: Called eio_message_socket_readable 0 3
>> salloc: debug2: slurm_allocation_msg_thr_destroy: clearing up message thread
>> salloc: debug4: eio: handling events for 1 objects
>> salloc: debug3: Called eio_message_socket_readable 1 3
>> salloc: debug2: false, shutdown
>> salloc: debug: Leaving _msg_thr_internal
>>
>> --
>> Aaron Knister
>> Systems Administrator
>> Division of Information Technology
>> University of Maryland, Baltimore County
>> [email protected]

--
Aaron Knister
Systems Administrator
Division of Information Technology
University of Maryland, Baltimore County
[email protected]
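[Editor's note] Aaron's suggested test at the top of the thread ("run salloc -w bruckner date repeatedly, 15 times or so") can be scripted so the per-run timings are captured. A minimal sketch, assuming the node name "bruckner" from the thread; the SALLOC variable is a hypothetical indirection, not part of the original suggestion, so the launcher can be stubbed out on a machine without SLURM:

```shell
#!/bin/sh
# Run the allocation test 15 times and report how long each run takes.
# "bruckner" is the node name used throughout the thread (an assumption
# about your cluster); override SALLOC to stub out the launcher.
SALLOC="${SALLOC:-salloc}"
for i in $(seq 1 15); do
  start=$(date +%s)
  "$SALLOC" -w bruckner date >/dev/null 2>&1
  end=$(date +%s)
  echo "run $i took $((end - start))s"
done
```

If bash only intermittently fails to launch (or the allocation is intermittently slow, as under strace), the per-run timings should make the outliers stand out.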
