That's strange! It sounds like it might be some sort of race condition then.
Yeah, strace will make it run VERY slowly since it's writing out each system
call to a file. One thing I noticed from the strace output was this line:

13998 read(4, "MemTotal:       524504 kB\nMemFre"..., 1024) = 656

From that output it looks as though the system only has ~512 MB of RAM. I
wonder if bash wasn't launching because it couldn't allocate enough memory.
I've seen bash do that before. Is there a way you could give the system more
memory?
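(If it's quicker than digging through strace again, the node's total memory can be read straight from /proc/meminfo, the same file salloc was reading in that trace. A one-liner, assuming awk is available:)

```shell
# Print MemTotal from /proc/meminfo, converted from kB to MB.
# This is the same file being read in the strace line above.
awk '/^MemTotal:/ {printf "MemTotal: %.0f MB\n", $2 / 1024}' /proc/meminfo
```

Against the "MemTotal: 524504 kB" shown in the trace, that works out to about 512 MB.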

I wonder if you can run something like this repeatedly (15 times or so):
salloc -w bruckner date
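Something like this loop would do it (just a sketch for a POSIX shell; the `date` call at the bottom is a stand-in so the shape is clear, swap in the salloc line above):

```shell
# run_n_times CMD N: run CMD N times and report how many runs failed.
# For the test above you'd call: run_n_times "salloc -w bruckner date" 15
run_n_times() {
  cmd=$1; n=$2; fails=0; i=1
  while [ "$i" -le "$n" ]; do
    # Discard output; count any nonzero exit as a failure.
    $cmd >/dev/null 2>&1 || fails=$((fails + 1))
    i=$((i + 1))
  done
  echo "failures: $fails/$n"
}

run_n_times "date" 3   # stand-in command; replace with the salloc line
```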

-Aaron

On Tue, Jul 12, 2011 at 11:01 AM, Alexis GÜNST HORN <
[email protected]> wrote:

> Hi Aaron,
>
> I'm so sad :(
> When I put "strace -o salloc.strace -f" before my salloc command, it works
> every time! But it's really slow:
> from
> salloc: Pending job allocation 466
> salloc: job 466 queued and waiting for resources
> to
> salloc: job 466 has been allocated resources
> salloc: Granted job allocation 466
> at least 15 seconds go by. (the node is of course free)
>
> And I had already read the FAQ about SIGPIPE.
>
>
> --
> Alexis GÜNST HORN
> System administrator
> Exascale Computing Research
>
>
> On 12/07/2011 16:47, Aaron Knister wrote:
>
>> Hi Alexis,
>>
>> Perhaps this may be of relevance:
>>
>> https://computing.llnl.gov/linux/slurm/faq.html#sigpipe
>>
>> I'd also suggest doing something like this:
>>
>> strace -o salloc.strace -f salloc -w bruckner
>>
>> That will create a file called salloc.strace that contains a trace of
>> the system calls made by salloc. Each time you run it the file will be
>> overwritten, so once you reproduce the described behavior of the shell
>> not being launched, send the file back to the list (attached, of
>> course).
>>
>> -Aaron
>>
>> On Tue, Jul 12, 2011 at 10:27 AM, Alexis GÜNST HORN
>> <[email protected]> wrote:
>>
>>    No :(
>>    Everything is fine.
>>    All nodes are up to date: Debian Wheezy, SLURM 2.2.7
>>
>>    Users and groups come from LDAP via nss_ldap, in case that helps...
>>
>>    Thanks, thanks a lot. I don't understand anything anymore
>>
>>
>>    --
>>    Alexis GÜNST HORN
>>    System administrator
>>    Exascale Computing Research
>>
>>    On 12/07/2011 16:22, Danny Auble wrote:
>>
>>        Alexis, do you happen to notice anything in your slurmctld.log
>>        that is out
>>        of the ordinary?  If possible I would run at debug level 6.
>>
>>        Make sure you updated your compute nodes as well.
>>
>>        Danny
>>
>>        On Tuesday July 12 2011 2:24:00 PM you wrote:
>>
>>            No problems !
>>
>>            Here it is
>>
>>
>>            ControlMachine=couperin
>>            AuthType=auth/none
>>            CacheGroups=1
>>            CryptoType=crypto/openssl
>>            JobCredentialPrivateKey=/etc/slurm-llnl/slurm.key
>>            JobCredentialPublicCertificate=/etc/slurm-llnl/slurm.cert
>>            #DisableRootJobs=NO
>>            #EnforcePartLimits=NO
>>            JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
>>            MpiDefault=none
>>            ProctrackType=proctrack/pgid
>>            ReturnToService=2
>>
>>            SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
>>            SlurmctldPort=6817
>>            SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
>>            SlurmdPort=6818
>>            SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
>>            SlurmUser=slurm
>>
>>            StateSaveLocation=/var/lib/slurm-llnl/slurmctld
>>            SwitchType=switch/none
>>            TaskPlugin=task/none
>>
>>            UsePAM=1
>>
>>            # TIMERS
>>            ####InactiveLimit=300
>>            InactiveLimit=0
>>            KillWait=600
>>            MinJobAge=300
>>            SlurmctldTimeout=300
>>            SlurmdTimeout=300
>>            Waittime=0
>>
>>            # SCHEDULING
>>            #DefMemPerCPU=0
>>            FastSchedule=1
>>            #MaxMemPerCPU=0
>>            #SchedulerRootFilter=1
>>            #SchedulerTimeSlice=30
>>            SchedulerType=sched/backfill
>>            SchedulerPort=7321
>>            SelectType=select/linear
>>            #SelectTypeParameters=
>>            #
>>            #
>>            # JOB PRIORITY
>>            #PriorityType=priority/basic
>>            #PriorityDecayHalfLife=
>>            #PriorityFavorSmall=
>>            #PriorityMaxAge=
>>            #PriorityUsageResetPeriod=
>>            #PriorityWeightAge=
>>            #PriorityWeightFairshare=
>>            #PriorityWeightJobSize=
>>            #PriorityWeightPartition=
>>            #PriorityWeightQOS=
>>            #
>>            #
>>            # LOGGING AND ACCOUNTING
>>            #AccountingStorageEnforce=0
>>            #AccountingStorageHost=
>>            #AccountingStorageLoc=/var/log/slurm-llnl/slurm_acct.log
>>            #AccountingStoragePass=
>>            #AccountingStoragePort=
>>            #AccountingStorageType=accounting_storage/filetxt
>>            #AccountingStorageUser=
>>            ClusterName=cluster
>>            #DebugFlags=
>>            #JobCompHost=
>>            #JobCompLoc=
>>            #JobCompPass=
>>            #JobCompPort=
>>            JobCompType=jobcomp/none
>>            #JobCompUser=
>>            JobAcctGatherFrequency=30
>>            JobAcctGatherType=jobacct_gather/none
>>            SlurmctldDebug=3
>>            SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
>>            SlurmdDebug=3
>>            SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
>>            #
>>            #
>>            # POWER SAVE SUPPORT FOR IDLE NODES (optional)
>>            #SuspendProgram=
>>            #ResumeProgram=
>>            #SuspendTimeout=
>>            #ResumeTimeout=
>>            #ResumeRate=
>>            #SuspendExcNodes=
>>            #SuspendExcParts=
>>            #SuspendRate=
>>            #SuspendTime=
>>            #
>>            #
>>            # COMPUTE NODES
>>            NodeName=DEFAULT TmpDisk=20480 State=UNKNOWN
>>            NodeName=campion NodeHostname=campion Sockets=1 CoresPerSocket=4 RealMemory=7993
>>            NodeName=carissimi NodeHostname=carissimi Sockets=1 CoresPerSocket=2 RealMemory=3953
>>            NodeName=borodine NodeHostname=borodine Sockets=1 CoresPerSocket=4 RealMemory=7993
>>            NodeName=britten NodeHostname=britten Sockets=1 CoresPerSocket=4 RealMemory=3953
>>            NodeName=bruckner NodeHostname=bruckner Sockets=2 CoresPerSocket=4 RealMemory=24153
>>            NodeName=buxtehude NodeHostname=buxtehude Sockets=2 CoresPerSocket=4 RealMemory=24153
>>            NodeName=chopin NodeHostname=chopin Sockets=1 CoresPerSocket=4 RealMemory=3924
>>            NodeName=clerambault NodeHostname=clerambault Sockets=4 CoresPerSocket=8 RealMemory=129177
>>
>>
>>            PartitionName=cluster Nodes=campion,carissimi,borodine,britten,bruckner,buxtehude,chopin,clerambault Default=YES MaxTime=04:00:00 State=UP
>>
>>                Except for the "job complete message received" message,
>>                it looks pretty
>>                normal to me; I'll have to defer to others.
>>
>>                Alexis, it will probably help if you could forward a
>>                copy of your
>>                slurm.conf file.
>>
>>                Regards,
>>                Andy
>>
>>                On 07/12/2011 07:43 AM, Alexis GÜNST HORN wrote:
>>
>>                    Thanks.
>>                    Here the output :
>>                    Everything seems normal, no ?
>>
>>                    agunst@couperin:~$ salloc -vvvvv -w bruckner
>>                    salloc: defined options for program `salloc'
>>                    salloc: --------------- ---------------------
>>                    salloc: user : `agunst'
>>                    salloc: uid : 10001
>>                    salloc: gid : 10007
>>                    salloc: ntasks : 1 (default)
>>                    salloc: cpus_per_task : 1 (default)
>>                    salloc: nodes : 1 (default)
>>                    salloc: partition : default
>>                    salloc: job name : `bash'
>>                    salloc: reservation : `(null)'
>>                    salloc: wckey : `(null)'
>>                    salloc: distribution : unknown
>>                    salloc: verbose : 5
>>                    salloc: immediate : false
>>                    salloc: overcommit : false
>>                    salloc: account : (null)
>>                    salloc: comment : (null)
>>                    salloc: dependency : (null)
>>                    salloc: network : (null)
>>                    salloc: qos : (null)
>>                    salloc: constraints : mincpus=1 nodelist=bruckner
>>                    salloc: geometry : (null)
>>                    salloc: reboot : yes
>>                    salloc: rotate : no
>>                    salloc: mail_type : NONE
>>                    salloc: mail_user : (null)
>>                    salloc: sockets-per-node : -2
>>                    salloc: cores-per-socket : -2
>>                    salloc: threads-per-core : -2
>>                    salloc: ntasks-per-node : 0
>>                    salloc: ntasks-per-socket : -2
>>                    salloc: ntasks-per-core : -2
>>                    salloc: plane_size : 4294967294
>>                    salloc: cpu_bind : default
>>                    salloc: mem_bind : default
>>                    salloc: user command : `/bin/bash'
>>                    salloc: debug: Entering slurm_allocation_msg_thr_create()
>>                    salloc: debug: port from net_stream_listen is 45043
>>                    salloc: debug: Entering _msg_thr_internal
>>                    salloc: debug4: eio: handling events for 1 objects
>>                    salloc: debug3: Called eio_message_socket_readable 0 3
>>                    salloc: debug3: Trying to load plugin
>>                    /usr/lib/slurm/auth_none.so
>>                    salloc: Null authentication plugin loaded
>>                    salloc: debug3: Success.
>>                    salloc: debug3: Trying to load plugin
>>                    /usr/lib/slurm/select_cons_res.so
>>                    salloc: debug3: Success.
>>                    salloc: debug3: Trying to load plugin
>>                    /usr/lib/slurm/select_bgq.so
>>                    salloc: debug3: Success.
>>                    salloc: debug3: Trying to load plugin
>>                    /usr/lib/slurm/select_cray.so
>>                    salloc: debug3: Success.
>>                    salloc: debug3: Trying to load plugin
>>                    /usr/lib/slurm/select_linear.so
>>                    salloc: debug3: Success.
>>                    salloc: debug3: Trying to load plugin
>>                    /usr/lib/slurm/select_bluegene.so
>>                    salloc: debug3: Success.
>>                    salloc: debug3: Success.
>>                    salloc: debug4: eio: handling events for 1 objects
>>                    salloc: debug3: Called eio_message_socket_readable 0 3
>>                    salloc: Granted job allocation 323
>>                    salloc: debug: laying out the 8 tasks on 1 hosts
>>                    bruckner
>>                    salloc: Relinquishing job allocation 323
>>                    salloc: debug3: Called eio_msg_socket_accept
>>                    salloc: debug2: got message connection from
>>                    192.168.96.104:37736 7
>>
>>                    salloc: debug3: job complete message received
>>                    salloc: Job allocation 323 has been revoked.
>>                    salloc: debug4: eio: handling events for 1 objects
>>                    salloc: debug3: Called eio_message_socket_readable 0 3
>>                    salloc: debug2: slurm_allocation_msg_thr_destroy:
>>                    clearing up message
>>                    thread
>>                    salloc: debug4: eio: handling events for 1 objects
>>                    salloc: debug3: Called eio_message_socket_readable 1 3
>>                    salloc: debug2: false, shutdown
>>                    salloc: debug: Leaving _msg_thr_internal
>>
>> --
>> Aaron Knister
>> Systems Administrator
>> Division of Information Technology
>> University of Maryland, Baltimore County
>> [email protected]
>>
>


-- 
Aaron Knister
Systems Administrator
Division of Information Technology
University of Maryland, Baltimore County
[email protected]
