That's strange! It sounds like it might be some sort of race condition then. Yeah, strace will make it run VERY slowly since it's writing out each system call to a file. One thing I noticed from the strace output was this line:
13998 read(4, "MemTotal: 524504 kB\nMemFre"..., 1024) = 656

From that output it looks as though the system only has ~512 MB of RAM. I wonder if bash wasn't launching because it couldn't allocate enough memory. I've seen bash do that before. Is there a way you could give the system more memory?

I wonder if you can run something like this repeatedly (15 times or so):

salloc -w bruckner date

-Aaron

On Tue, Jul 12, 2011 at 11:01 AM, Alexis GÜNST HORN <[email protected]> wrote:

> Hi Aaron,
>
> I'm so sad :(
> When I put "strace -o salloc.strace -f" before my salloc command, it works
> every time! But it's really slow: from
>
>   salloc: Pending job allocation 466
>   salloc: job 466 queued and waiting for resources
>
> to
>
>   salloc: job 466 has been allocated resources
>   salloc: Granted job allocation 466
>
> at least 15 seconds elapse (the node is, of course, free).
>
> And I had already read the FAQ about the SIGPIPE.
>
> --
> Alexis GÜNST HORN
> System administrator
> Exascale Computing Research
>
> On 12/07/2011 16:47, Aaron Knister wrote:
>
>> Hi Alexis,
>>
>> Perhaps this may be of relevance:
>>
>> https://computing.llnl.gov/linux/slurm/faq.html#sigpipe
>>
>> I'd also suggest doing something like this:
>>
>> strace -o salloc.strace -f salloc -w bruckner
>>
>> That will create a file called salloc.strace that contains a trace of
>> the system calls made by salloc. Each time you run it that file will be
>> overwritten, so once you reproduce the described behavior of the shell
>> not being launched, send the file back to the list (attached, of
>> course).
>>
>> -Aaron
>>
>> On Tue, Jul 12, 2011 at 10:27 AM, Alexis GÜNST HORN
>> <[email protected]> wrote:
>>
>> No :(
>> Everything is fine.
>> All nodes are up to date: Debian Wheezy, SLURM 2.2.7.
>>
>> Users and groups are linked to LDAP via nss_ldap, in case that helps...
>>
>> Thanks, thanks a lot. I don't understand anything anymore.
>>
>> --
>> Alexis GÜNST HORN
>> System administrator
>> Exascale Computing Research
>>
>> On 12/07/2011 16:22, Danny Auble wrote:
>>
>> Alexis, do you happen to notice anything in your slurmctld.log
>> that is out of the ordinary? If possible I would run at debug level 6.
>>
>> Make sure you updated your compute nodes as well.
>>
>> Danny
>>
>> On Tuesday July 12 2011 2:24:00 PM you wrote:
>>
>> No problem!
>>
>> Here it is:
>>
>> ControlMachine=couperin
>> AuthType=auth/none
>> CacheGroups=1
>> CryptoType=crypto/openssl
>> JobCredentialPrivateKey=/etc/slurm-llnl/slurm.key
>> JobCredentialPublicCertificate=/etc/slurm-llnl/slurm.cert
>> #DisableRootJobs=NO
>> #EnforcePartLimits=NO
>> JobCheckpointDir=/var/lib/slurm-llnl/checkpoint
>> MpiDefault=none
>> ProctrackType=proctrack/pgid
>> ReturnToService=2
>>
>> SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
>> SlurmctldPort=6817
>> SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
>> SlurmdPort=6818
>> SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
>> SlurmUser=slurm
>>
>> StateSaveLocation=/var/lib/slurm-llnl/slurmctld
>> SwitchType=switch/none
>> TaskPlugin=task/none
>>
>> UsePAM=1
>>
>> # TIMERS
>> ####InactiveLimit=300
>> InactiveLimit=0
>> KillWait=600
>> MinJobAge=300
>> SlurmctldTimeout=300
>> SlurmdTimeout=300
>> Waittime=0
>>
>> # SCHEDULING
>> #DefMemPerCPU=0
>> FastSchedule=1
>> #MaxMemPerCPU=0
>> #SchedulerRootFilter=1
>> #SchedulerTimeSlice=30
>> SchedulerType=sched/backfill
>> SchedulerPort=7321
>> SelectType=select/linear
>> #SelectTypeParameters=
>> #
>> #
>> # JOB PRIORITY
>> #PriorityType=priority/basic
>> #PriorityDecayHalfLife=
>> #PriorityFavorSmall=
>> #PriorityMaxAge=
>> #PriorityUsageResetPeriod=
>> #PriorityWeightAge=
>> #PriorityWeightFairshare=
>> #PriorityWeightJobSize=
>> #PriorityWeightPartition=
>> #PriorityWeightQOS=
>> #
>> #
>> # LOGGING AND ACCOUNTING
>> #AccountingStorageEnforce=0
>> #AccountingStorageHost=
>> #AccountingStorageLoc=/var/log/slurm-llnl/slurm_acct.log
>> #AccountingStoragePass=
>> #AccountingStoragePort=
>> #AccountingStorageType=accounting_storage/filetxt
>> #AccountingStorageUser=
>> ClusterName=cluster
>> #DebugFlags=
>> #JobCompHost=
>> #JobCompLoc=
>> #JobCompPass=
>> #JobCompPort=
>> JobCompType=jobcomp/none
>> #JobCompUser=
>> JobAcctGatherFrequency=30
>> JobAcctGatherType=jobacct_gather/none
>> SlurmctldDebug=3
>> SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
>> SlurmdDebug=3
>> SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
>> #
>> #
>> # POWER SAVE SUPPORT FOR IDLE NODES (optional)
>> #SuspendProgram=
>> #ResumeProgram=
>> #SuspendTimeout=
>> #ResumeTimeout=
>> #ResumeRate=
>> #SuspendExcNodes=
>> #SuspendExcParts=
>> #SuspendRate=
>> #SuspendTime=
>> #
>> #
>> # COMPUTE NODES
>> NodeName=DEFAULT TmpDisk=20480 State=UNKNOWN
>> NodeName=campion NodeHostname=campion Sockets=1 CoresPerSocket=4 RealMemory=7993
>> NodeName=carissimi NodeHostname=carissimi Sockets=1 CoresPerSocket=2 RealMemory=3953
>> NodeName=borodine NodeHostname=borodine Sockets=1 CoresPerSocket=4 RealMemory=7993
>> NodeName=britten NodeHostname=britten Sockets=1 CoresPerSocket=4 RealMemory=3953
>> NodeName=bruckner NodeHostname=bruckner Sockets=2 CoresPerSocket=4 RealMemory=24153
>> NodeName=buxtehude NodeHostname=buxtehude Sockets=2 CoresPerSocket=4 RealMemory=24153
>> NodeName=chopin NodeHostname=chopin Sockets=1 CoresPerSocket=4 RealMemory=3924
>> NodeName=clerambault NodeHostname=clerambault Sockets=4 CoresPerSocket=8 RealMemory=129177
>>
>> PartitionName=cluster Nodes=campion,carissimi,borodine,britten,bruckner,buxtehude,chopin,clerambault Default=YES MaxTime=04:00:00 State=UP
>>
>> Except for the "job complete message received"
>> message, it looks pretty normal to me; I'll have to defer to others.
>>
>> Alexis, it will probably help if you could forward a copy of your
>> slurm.conf file.
>>
>> Regards,
>> Andy
>>
>> On 07/12/2011 07:43 AM, Alexis GÜNST HORN wrote:
>>
>> Thanks.
>> Here is the output. Everything seems normal, no?
>>
>> agunst@couperin:~$ salloc -vvvvv -w bruckner
>> salloc: defined options for program `salloc'
>> salloc: --------------- ---------------------
>> salloc: user : `agunst'
>> salloc: uid : 10001
>> salloc: gid : 10007
>> salloc: ntasks : 1 (default)
>> salloc: cpus_per_task : 1 (default)
>> salloc: nodes : 1 (default)
>> salloc: partition : default
>> salloc: job name : `bash'
>> salloc: reservation : `(null)'
>> salloc: wckey : `(null)'
>> salloc: distribution : unknown
>> salloc: verbose : 5
>> salloc: immediate : false
>> salloc: overcommit : false
>> salloc: account : (null)
>> salloc: comment : (null)
>> salloc: dependency : (null)
>> salloc: network : (null)
>> salloc: qos : (null)
>> salloc: constraints : mincpus=1 nodelist=bruckner
>> salloc: geometry : (null)
>> salloc: reboot : yes
>> salloc: rotate : no
>> salloc: mail_type : NONE
>> salloc: mail_user : (null)
>> salloc: sockets-per-node : -2
>> salloc: cores-per-socket : -2
>> salloc: threads-per-core : -2
>> salloc: ntasks-per-node : 0
>> salloc: ntasks-per-socket : -2
>> salloc: ntasks-per-core : -2
>> salloc: plane_size : 4294967294
>> salloc: cpu_bind : default
>> salloc: mem_bind : default
>> salloc: user command : `/bin/bash'
>> salloc: debug: Entering slurm_allocation_msg_thr_create()
>> salloc: debug: port from net_stream_listen is 45043
>> salloc: debug: Entering _msg_thr_internal
>> salloc: debug4: eio: handling events for 1 objects
>> salloc: debug3: Called eio_message_socket_readable 0 3
>> salloc: debug3: Trying to load plugin /usr/lib/slurm/auth_none.so
>> salloc: Null authentication plugin loaded
>> salloc: debug3: Success.
>> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_cons_res.so
>> salloc: debug3: Success.
>> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_bgq.so
>> salloc: debug3: Success.
>> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_cray.so
>> salloc: debug3: Success.
>> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_linear.so
>> salloc: debug3: Success.
>> salloc: debug3: Trying to load plugin /usr/lib/slurm/select_bluegene.so
>> salloc: debug3: Success.
>> salloc: debug4: eio: handling events for 1 objects
>> salloc: debug3: Called eio_message_socket_readable 0 3
>> salloc: Granted job allocation 323
>> salloc: debug: laying out the 8 tasks on 1 hosts bruckner
>> salloc: Relinquishing job allocation 323
>> salloc: debug3: Called eio_msg_socket_accept
>> salloc: debug2: got message connection from 192.168.96.104:37736 7
>> salloc: debug3: job complete message received
>> salloc: Job allocation 323 has been revoked.
>> salloc: debug4: eio: handling events for 1 objects
>> salloc: debug3: Called eio_message_socket_readable 0 3
>> salloc: debug2: slurm_allocation_msg_thr_destroy: clearing up message thread
>> salloc: debug4: eio: handling events for 1 objects
>> salloc: debug3: Called eio_message_socket_readable 1 3
>> salloc: debug2: false, shutdown
>> salloc: debug: Leaving _msg_thr_internal
>>
>> --
>> Aaron Knister
>> Systems Administrator
>> Division of Information Technology
>> University of Maryland, Baltimore County
>> [email protected]

--
Aaron Knister
Systems Administrator
Division of Information Technology
University of Maryland, Baltimore County
[email protected]
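[Editor's note] Aaron's suggested test at the top of the thread ("run salloc -w bruckner date repeatedly, 15 times or so") can be scripted so the per-run timings are captured. A minimal sketch, assuming the node name "bruckner" from the thread; the SALLOC variable is a hypothetical indirection, not part of the original suggestion, so the launcher can be stubbed out on a machine without SLURM:

```shell
#!/bin/sh
# Run the allocation test 15 times and report how long each run takes.
# "bruckner" is the node name used throughout the thread (an assumption
# about your cluster); override SALLOC to stub out the launcher.
SALLOC="${SALLOC:-salloc}"
for i in $(seq 1 15); do
  start=$(date +%s)
  "$SALLOC" -w bruckner date >/dev/null 2>&1
  end=$(date +%s)
  echo "run $i took $((end - start))s"
done
```

If bash only intermittently fails to launch (or the allocation is intermittently slow, as under strace), the per-run timings should make the outliers stand out.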
