[slurm-dev] Re: PySlurm for SLURM 2.3.2 API
I guess the error is with Cython, because Cython is what generates pyslurm.c. I got "Unable to find pgen, not compiling formal grammar." while installing Cython, but I ignored it because I thought it was not an error.

On Fri, Apr 15, 2016 at 9:41 AM, Naajil Aamir wrote:
> Does the pyslurm.c file need to contain some code? In pyslurm 2.3.3 this
> file is empty. I installed the package successfully, but when I try to
> import pyslurm I get the following error:
>
> import pyslurm
> Traceback (most recent call last):
>   File "", line 1, in
>   File "/usr/local/lib/python2.7/dist-packages/pyslurm/__init__.py", line 17, in
>     from .pyslurm import *
> ImportError: dynamic module does not define init function (initpyslurm)
>
> This is what I get after installing pyslurm:
>
> mpiu@fypmaster-OptiPlex-330:~/Desktop/pyslurm-slurm-2.3.3$ sudo python setup.py build
> INFO:root:Info:
> INFO:root:Info: Building PySlurm (2.3.3-1)
> INFO:root:Info: --
> INFO:root:Info:
> INFO:root:Info: Cython version 0.24 installed
> running build
> running build_py
> copying pyslurm/__init__.py -> build/lib.linux-x86_64-2.7/pyslurm
> running build_ext
> skipping 'pyslurm/pyslurm.c' Cython extension (up-to-date)
>
> mpiu@fypmaster-OptiPlex-330:~/Desktop/pyslurm-slurm-2.3.3$ sudo python setup.py install
> INFO:root:Info:
> INFO:root:Info: Building PySlurm (2.3.3-1)
> INFO:root:Info: --
> INFO:root:Info:
> INFO:root:Info: Cython version 0.24 installed
> running install
> running build
> running build_py
> running build_ext
> skipping 'pyslurm/pyslurm.c' Cython extension (up-to-date)
> running install_lib
> byte-compiling /usr/local/lib/python2.7/dist-packages/pyslurm/__init__.py to __init__.pyc
> running install_egg_info
> Removing /usr/local/lib/python2.7/dist-packages/pyslurm-2.3.3_1.egg-info
> Writing /usr/local/lib/python2.7/dist-packages/pyslurm-2.3.3_1.egg-info
> mpiu@fypmaster-OptiPlex-330:~/Desktop/pyslurm-slurm-2.3.3$
>
> On Thu, Apr 14, 2016 at 3:46 PM, Benjamin Redling
> <benjamin.ra...@uni-jena.de> wrote:
>> On 04/14/2016 11:08, Naajil Aamir wrote:
>>> Hi, hope you are doing well. I am currently working on a scheduling
>>> policy for Slurm 2.3.2; for that I need a *PYSLURM* version that is
>>> compatible with Slurm 2.3.3, which I am unable to find on the internet.
>>> It would be a great help if you could provide a link to a PySlurm
>>> repository for Slurm 2.3.2.
>>
>> Maybe the stale branches of pyslurm are what you are looking for?
>> https://github.com/PySlurm/pyslurm/branches
>>
>> 2.3.3 seems to be the oldest
>>
>> Benjamin
>> --
>> FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
>> vox: +49 3641 9 44323 | fax: +49 3641 9 44321
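The "skipping 'pyslurm/pyslurm.c' Cython extension (up-to-date)" lines suggest setup.py never regenerated the empty pyslurm.c, so the built module defines no initpyslurm. A minimal recovery sketch, assuming setup.py re-runs Cython on pyslurm.pyx when the generated .c file is absent:

```
#!/bin/bash
# Force regeneration of pyslurm.c instead of reusing the empty "up-to-date" file.
cd ~/Desktop/pyslurm-slurm-2.3.3
sudo rm -f pyslurm/pyslurm.c    # drop the empty generated file
sudo rm -rf build/              # clear cached build artifacts
sudo python setup.py build      # Cython should now emit a fresh pyslurm.c
sudo python setup.py install
python -c "import pyslurm"      # should no longer raise ImportError
```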
[slurm-dev] Re: PySlurm for SLURM 2.3.2 API
Does the pyslurm.c file need to contain some code? In pyslurm 2.3.3 this file is empty. I installed the package successfully, but when I try to import pyslurm I get the following error:

import pyslurm
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/local/lib/python2.7/dist-packages/pyslurm/__init__.py", line 17, in
    from .pyslurm import *
ImportError: dynamic module does not define init function (initpyslurm)

This is what I get after installing pyslurm:

mpiu@fypmaster-OptiPlex-330:~/Desktop/pyslurm-slurm-2.3.3$ sudo python setup.py build
INFO:root:Info:
INFO:root:Info: Building PySlurm (2.3.3-1)
INFO:root:Info: --
INFO:root:Info:
INFO:root:Info: Cython version 0.24 installed
running build
running build_py
copying pyslurm/__init__.py -> build/lib.linux-x86_64-2.7/pyslurm
running build_ext
skipping 'pyslurm/pyslurm.c' Cython extension (up-to-date)

mpiu@fypmaster-OptiPlex-330:~/Desktop/pyslurm-slurm-2.3.3$ sudo python setup.py install
INFO:root:Info:
INFO:root:Info: Building PySlurm (2.3.3-1)
INFO:root:Info: --
INFO:root:Info:
INFO:root:Info: Cython version 0.24 installed
running install
running build
running build_py
running build_ext
skipping 'pyslurm/pyslurm.c' Cython extension (up-to-date)
running install_lib
byte-compiling /usr/local/lib/python2.7/dist-packages/pyslurm/__init__.py to __init__.pyc
running install_egg_info
Removing /usr/local/lib/python2.7/dist-packages/pyslurm-2.3.3_1.egg-info
Writing /usr/local/lib/python2.7/dist-packages/pyslurm-2.3.3_1.egg-info
mpiu@fypmaster-OptiPlex-330:~/Desktop/pyslurm-slurm-2.3.3$

On Thu, Apr 14, 2016 at 3:46 PM, Benjamin Redling <benjamin.ra...@uni-jena.de> wrote:
> On 04/14/2016 11:08, Naajil Aamir wrote:
>> Hi, hope you are doing well. I am currently working on a scheduling
>> policy for Slurm 2.3.2; for that I need a *PYSLURM* version that is
>> compatible with Slurm 2.3.3, which I am unable to find on the internet.
>> It would be a great help if you could provide a link to a PySlurm
>> repository for Slurm 2.3.2.
>
> Maybe the stale branches of pyslurm are what you are looking for?
> https://github.com/PySlurm/pyslurm/branches
>
> 2.3.3 seems to be the oldest
>
> Benjamin
> --
> FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
> vox: +49 3641 9 44323 | fax: +49 3641 9 44321
[slurm-dev] Re: dynamic srun invocation limits?
Do the "thousands upon thousands" of sub-processes have dependencies among them or are they fully independent of each other? Is it necessary to spawn them using srun; i.e., are you using srun to provide job step accounting or to make them subject to scheduling policies or what? Just trying to understand your context. Gary D. Brown Adaptive Computing On Thu, Apr 14, 2016 at 4:02 PM, Pyramid Bioengineering < pyramidbioengineer...@gmail.com> wrote: > Hi All, > > Our team is using Slurm to distribute tasks across a cluster, but our > implementation may be a little different than what the typical person is > doing... maybe? > > We'll submit a very simple sbatch, like so: > > ``` > #!/bin/bash > #SBATCH --errror=/tmp/error.log > #SBATCH --output/tmp/output.log > execute_algorithm arg1 arg2 > ``` > > `execute_alogorithm` is where things get a bit funny, it can be some > variant of a complex C algorithm of ours that will spawn potentially > thousands upon thousands of subprocess invocations. Each of these > subprocess invocations are executed with `srun`, and slurm is successfully > recognizing them as job steps. It should also be noted that we are waiting > for all the threads to exit before exiting `execute_algorithm` itself. > > The question here is can slurm handle this sort of srun task allotment, > firing them all out at once? > > In testing, this has appeared to work on very small jobs that are just > above the limits of our node resources. I can actually see entries in the > job error log that slurm is recognizing that it's hitting task capacity and > waiting: > > `srun: Job step creation temporarily disabled, retrying` > > We have yet to get to a point where we can run the "thousands" of tasks > that I speak of, but that will be coming up at the end of the month, and > frankly I'm skeptical. > > Is this a common approach? If we are stuck with this approach and there is > not another way to do it, do we just build some internal scheduling logic > into the `execute_algorithm`? > > Thanks! >
[slurm-dev] dynamic srun invocation limits?
Hi All,

Our team is using Slurm to distribute tasks across a cluster, but our implementation may be a little different from what the typical person is doing... maybe?

We'll submit a very simple sbatch script, like so:

```
#!/bin/bash
#SBATCH --error=/tmp/error.log
#SBATCH --output=/tmp/output.log
execute_algorithm arg1 arg2
```

`execute_algorithm` is where things get a bit funny: it can be some variant of a complex C algorithm of ours that will spawn potentially thousands upon thousands of subprocess invocations. Each of these subprocess invocations is executed with `srun`, and Slurm is successfully recognizing them as job steps. It should also be noted that we wait for all the threads to exit before exiting `execute_algorithm` itself.

The question here is: can Slurm handle this sort of srun task allotment, firing them all out at once?

In testing, this has appeared to work on very small jobs that are just above the limits of our node resources. I can actually see entries in the job error log showing that Slurm recognizes it is hitting task capacity and waits:

`srun: Job step creation temporarily disabled, retrying`

We have yet to get to a point where we can run the "thousands" of tasks that I speak of, but that will be coming up at the end of the month, and frankly I'm skeptical.

Is this a common approach? If we are stuck with this approach and there is no other way to do it, do we just build some internal scheduling logic into `execute_algorithm`?

Thanks!
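One way to avoid leaning entirely on srun's retry loop is to throttle step creation in the launching script itself. A rough sketch, assuming the tasks are independent; `execute_task` is a hypothetical stand-in for one subprocess invocation:

```
#!/bin/bash
#SBATCH --error=/tmp/error.log
#SBATCH --output=/tmp/output.log

MAX_CONCURRENT=64   # assumption: tune to the CPUs in the allocation

for i in $(seq 1 10000); do
    # Each task becomes its own one-CPU job step; --exclusive asks Slurm
    # not to share a step's CPUs with other steps in this allocation.
    srun --exclusive -N1 -n1 execute_task "$i" &
    # Throttle: pause launching while the cap of running steps is reached.
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_CONCURRENT" ]; do
        sleep 1
    done
done
wait   # block until every step has exited, as the original script does
```

This bounds how many steps ever contend for resources at once, so the "Job step creation temporarily disabled, retrying" path is rarely exercised.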
[slurm-dev] DBD_JOB_COMPLETE: cluster not registered
Hi all,

I cannot store my job info in MySQL via slurmdbd; there is a "cluster not registered" message in the log file. I added the cluster name to my DB with sacctmgr, following http://slurm.schedmd.com/accounting.html, and my cluster name is present in the MySQL DB.

===== slurmdbd log info =====
slurmdbd: debug2: DBD_JOB_START: START CALL ID:42 NAME:test_slurm INX:0
slurmdbd: debug2: as_mysql_slurmdb_job_start() called
slurmdbd: DBD_JOB_START: cluster not registered
slurmdbd: debug2: DBD_STEP_START: ID:42.4294967294 NAME:batch SUBMIT:1460646258
slurmdbd: DBD_STEP_START: cluster not registered
slurmdbd: debug2: DBD_STEP_COMPLETE: ID:42.4294967294 SUBMIT:1460646258
slurmdbd: DBD_STEP_COMPLETE: cluster not registered
slurmdbd: debug2: DBD_JOB_START: START CALL ID:42 NAME:test_slurm INX:9
slurmdbd: debug2: as_mysql_slurmdb_job_start() called
slurmdbd: DBD_JOB_START: cluster not registered
slurmdbd: debug2: DBD_JOB_COMPLETE: ID:42
slurmdbd: debug2: as_mysql_slurmdb_job_complete() called
slurmdbd: DBD_JOB_COMPLETE: cluster not registered
=============================

===== cluster in MySQL DB =====
mysql> select * from cluster_table;

| creation_time    | 1460645091       |
| mod_time         | 1460645091       |
| deleted          | 0                |
| name             | hgcp_cluster_2_0 |
| control_host     |                  |
| control_port     | 0                |
| last_port        | 0                |
| rpc_version      | 0                |
| classification   | 0                |
| dimensions       | 1                |
| plugin_id_select | 0                |
| flags            | 0                |

1 row in set (0.00 sec)
=============================

How should I solve this problem?

Zihan Wen
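Note that control_host is empty and control_port is 0 in that row: the cluster name exists in the database, but it looks as if slurmctld has never registered with slurmdbd. A rough troubleshooting sketch (the slurm.conf path and restart method are assumptions; adjust to the local install):

```
#!/bin/bash
# 1. ClusterName in slurm.conf must exactly match the name added via sacctmgr.
grep -i '^ClusterName' /etc/slurm/slurm.conf      # expect: hgcp_cluster_2_0

# 2. slurmctld must be configured to talk to slurmdbd for accounting.
grep -i '^AccountingStorageType' /etc/slurm/slurm.conf
                                                  # expect: accounting_storage/slurmdbd

# 3. If the cluster was added after slurmctld started, restart the
#    controller so it registers with slurmdbd on startup.
scontrol shutdown
slurmctld

# 4. Verify: control_host/control_port should now be populated.
sacctmgr show cluster
```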
[slurm-dev] Re: Oversubscribing nodes
Paul,

>> To answer John's question: I don't want to limit 1 job per node, which
>> is why I don't want "SHARED=FORCE:1". I want for jobs to be able to
>> share nodes, but not cores -- and I want for users to not be able to
>> override this with "--exclusive".

OK - we are doing exactly this: sharing nodes but not CPUs (cores), and there are no issues with oversubscription. However, users can, and do, request exclusivity by using "--exclusive". A job submit plugin can strip that request.

Here are the relevant portions of our sanitized configuration, which you could potentially apply and test to negate oversubscription:

PreemptMode          = CANCEL
PreemptType          = preempt/qos
SchedulerType        = sched/backfill
SelectType           = select/cons_res
SelectTypeParameters = CR_CPU_MEMORY

NodeName= CPUs=8 CoresPerSocket=4 Sockets=2 RealMemory=16012
NodeName= CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=24028
NodeName= CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=24028
NodeName= CPUs=20 CoresPerSocket=10 Sockets=2 RealMemory=516766
NodeName= CPUs=16 CoresPerSocket=4 Sockets=4 RealMemory=129119
NodeName= CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=24013
NodeName= CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=24016
NodeName= CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=48258
NodeName= CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=48251
NodeName= CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=19973
NodeName= CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=24016
NodeName= CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=23980
NodeName= CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32075
NodeName= CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32074
NodeName= CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32076
NodeName= CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32073

PartitionName=blah1 State=UP Shared=NO Default=YES DefaultTime=01:00:00 DefMemPerCPU=512
PartitionName=blah2 Nodes=list1 DefaultTime=01:00:00 MaxTime=168:00:00 AllowQOS=qos1 Shared=NO State=UP DefMemPerCPU=512
PartitionName=blah3 Nodes=list2 State=UP Shared=NO DefaultTime=01:00:00 DenyQOS=qos2 Priority=5000 DefMemPerCPU=512
PartitionName=blah4 Nodes=list3 State=UP Shared=NO MaxTime=02:00:00 MaxNodes=4 DenyQOS=qos2 DefMemPerCPU=512
PartitionName=blah5 Nodes=list4 State=UP Shared=NO DefaultTime=01:00:00 MaxTime=26:00:00 DenyQOS=qos2 DefMemPerCPU=512

Again, this configuration affords us the ability to have nodes run multiple user jobs without CPUs being shared; if all CPUs are allocated, the controller will not dispatch further jobs to the nodes in question.

I may be incorrect here, but looking at your configuration you've specified that your preempt type is partition priority, yet your partition definitions hint at preemption configured via QOS. Is this what you wanted? The documentation states that, based upon your preemption settings, the resultant behavior would be a mix of running and suspended jobs based upon (a) job priorities and (b) partition priorities; maybe check the controller logs for preemption notices to confirm or deny this thought? At any rate, I'd suggest only using preemption based upon QOS.

HTH,
John DeSantis

On 04/14/2016 07:38 AM, Wiegand, Paul wrote:
> I guess I'm going to have to dig in with a filter or something like that.
> Our setup seems to oversubscribe no matter what I do.
>
> To answer John's question: I don't want to limit 1 job per node, which is
> why I don't want "SHARED=FORCE:1". I want for jobs to be able to share
> nodes, but not cores -- and I want for users to not be able to override
> this with "--exclusive". It seems to be the case that slurm correctly
> ignores the "--exclusive" switch when using FORCE, which is what I want.
> It's just the deploying of 50 tasks to a 16-core node that's a problem
> for us.
>
> Even though it is not what I want, I took everyone's advice and switched
> to SHARED=NO to resolve the oversubscription problems; however, the
> problem is not resolved. Though this appears to happen less frequently,
> it still happens. Currently, we have several user jobs that are
> alternating between suspended and running, and at least one of those
> nodes is definitely oversubscribed: there are 35 tasks from various user
> jobs deployed to it, while it has only 16 cores available.
>
> And I still don't see where my misunderstanding of the shared and cgroups
> parameters is. I've read the documentation repeatedly and asked on
> several email lists what I'm not understanding, and mostly people have
> been telling me "don't do that" ... which is fine as far as that goes,
> but what I *really* want to know is: "What am I misunderstanding?"
>
> I've been operating under the assumption that the problem is a failure on
> my part: I've misconfigured something somewhere. This is why I want to
> understand where my reading of the docs is flawed. Or is it possible this
> is a bug in slurm?
>
> I admit that I'm pretty frustrated. It is causing a lot of performance
> problems for our users with big jobs.
> [...]
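A quick way to test whether a configuration like the one above actually prevents core oversubscription is to flood a single node with more one-CPU jobs than it has cores. A rough sketch, where the partition name blah1 comes from the config above and node01 is a hypothetical 16-core node:

```
#!/bin/bash
# Submit more single-CPU jobs than one node has cores.
for i in $(seq 1 20); do
    sbatch -p blah1 -w node01 -n1 --mem=512 --wrap="sleep 300"
done
sleep 10

# CPUAlloc should never exceed CPUTot (16 here); surplus jobs stay pending.
scontrol show node node01 | grep -E 'CPUAlloc|CPUTot'
squeue -w node01 -o '%i %t %C'   # CPUs of running jobs should sum to <= 16
```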
[slurm-dev] Re: Slurm Checkpoint/Restart example
Danny: I'm unable to use the srun_cr command. I got this error message in the slurmctld log file after submitting srun_cr via sbatch:

[2016-04-14T19:22:42.719] job_complete: JobID=67 State=0x1 NodeCnt=2 WEXITSTATUS 255

Any idea how to fix this? And yes, my job needs more than 5 minutes.

Andy: Yes, the /mirror directory is shared across my cluster. I have configured it using NFS.

Regards,
Husen

On Thu, Apr 14, 2016 at 6:15 PM, Danny Rotscher <danny.rotsc...@tu-dresden.de> wrote:
> I've found two things: first, you could try srun_cr instead of srun; and
> second, does your job need more than 5 minutes? But I'm not sure, so you
> may try it and post the result.
>
> Am 14.04.2016 um 12:56 schrieb Husen R:
>> Hello Danny,
>>
>> I have tried to restart using "scontrol checkpoint restart " but it
>> doesn't work. In addition, the ".0" directory and its contents do not
>> exist in my --checkpoint-dir. The following is my batch job:
>>
>> ===== batch job =====
>> #!/bin/bash
>> #SBATCH -J MatMul
>> #SBATCH -o mm-%j.out
>> #SBATCH -A pro
>> #SBATCH -N 3
>> #SBATCH -n 24
>> #SBATCH --checkpoint=5
>> #SBATCH --checkpoint-dir=/mirror/source/cr
>> #SBATCH --time=01:30:00
>> #SBATCH --mail-user=hus...@gmail.com
>> #SBATCH --mail-type=begin
>> #SBATCH --mail-type=end
>>
>> srun --mpi=pmi2 ./mm.o
>> ===== end batch job =====
>>
>> Is there something that prevents me from getting the right directory
>> structure?
>>
>> Regards,
>> Husen
>>
>> On Thu, Apr 14, 2016 at 5:36 PM, Danny Rotscher
>> <danny.rotsc...@tu-dresden.de> wrote:
>>> Hello,
>>>
>>> usually the directory which is specified by --checkpoint-dir should
>>> have the following structure:
>>>
>>> |__ script.ckpt
>>> |__ .0
>>>     |__ task.0.ckpt
>>>     |__ task.1.ckpt
>>>     |__ ...
>>>
>>> But you only have to run the following command to restart your batch
>>> job:
>>> scontrol checkpoint restart
>>>
>>> I have tried only batch jobs, and currently I am trying to build
>>> MVAPICH2 with BLCR and Slurm support, because that MPI library is
>>> explicitly mentioned in the Slurm documentation.
>>>
>>> A colleague also tested DMTCP, but with no success.
>>>
>>> Kind regards,
>>> Danny
>>> TU Dresden
>>> Germany
>>>
>>> [...]
[slurm-dev] Re: Slurm Checkpoint/Restart example
Is your /mirror directory shared across your cluster?

On 04/14/2016 06:56 AM, Husen R wrote:
> Hello Danny,
>
> I have tried to restart using "scontrol checkpoint restart " but it
> doesn't work. In addition, the ".0" directory and its contents do not
> exist in my --checkpoint-dir. The following is my batch job:
>
> ===== batch job =====
> #!/bin/bash
> #SBATCH -J MatMul
> #SBATCH -o mm-%j.out
> #SBATCH -A pro
> #SBATCH -N 3
> #SBATCH -n 24
> #SBATCH --checkpoint=5
> #SBATCH --checkpoint-dir=/mirror/source/cr
> #SBATCH --time=01:30:00
> #SBATCH --mail-user=hus...@gmail.com
> #SBATCH --mail-type=begin
> #SBATCH --mail-type=end
>
> srun --mpi=pmi2 ./mm.o
> ===== end batch job =====
>
> Is there something that prevents me from getting the right directory
> structure?
>
> Regards,
> Husen
>
> On Thu, Apr 14, 2016 at 5:36 PM, Danny Rotscher wrote:
>> Hello,
>>
>> usually the directory which is specified by --checkpoint-dir should
>> have the following structure:
>>
>> |__ script.ckpt
>> |__ .0
>>     |__ task.0.ckpt
>>     |__ task.1.ckpt
>>     |__ ...
>>
>> But you only have to run the following command to restart your batch
>> job:
>> scontrol checkpoint restart
>>
>> I have tried only batch jobs, and currently I am trying to build
>> MVAPICH2 with BLCR and Slurm support, because that MPI library is
>> explicitly mentioned in the Slurm documentation.
>>
>> A colleague also tested DMTCP, but with no success.
>>
>> Kind regards,
>> Danny
>> TU Dresden
>> Germany
>>
>> Am 14.04.2016 um 11:01 schrieb Husen R:
>>> Hi all,
>>> Thank you for your reply.
>>> [...]
[slurm-dev] Re: Oversubscribing nodes
I guess I'm going to have to dig in with a filter or something like that. Our setup seems to oversubscribe no matter what I do.

To answer John's question: I don't want to limit 1 job per node, which is why I don't want "SHARED=FORCE:1". I want for jobs to be able to share nodes, but not cores -- and I want for users to not be able to override this with "--exclusive". It seems to be the case that slurm correctly ignores the "--exclusive" switch when using FORCE, which is what I want. It's just the deploying of 50 tasks to a 16-core node that's a problem for us.

Even though it is not what I want, I took everyone's advice and switched to SHARED=NO to resolve the oversubscription problems; however, the problem is not resolved. Though this appears to happen less frequently, it still happens. Currently, we have several user jobs that are alternating between suspended and running, and at least one of those nodes is definitely oversubscribed: there are 35 tasks from various user jobs deployed to it, while it has only 16 cores available.

And I still don't see where my misunderstanding of the shared and cgroups parameters is. I've read the documentation repeatedly and asked on several email lists what I'm not understanding, and mostly people have been telling me "don't do that" ... which is fine as far as that goes, but what I *really* want to know is: "What am I misunderstanding?"

I've been operating under the assumption that the problem is a failure on my part: I've misconfigured something somewhere. This is why I want to understand where my reading of the docs is flawed. Or is it possible this is a bug in slurm?

I admit that I'm pretty frustrated. It is causing a lot of performance problems for our users with big jobs. Even though the event is rare, it is highly probable that a large, long job will experience some unintended oversubscription during the lifetime of its run, and this greatly hinders performance (sometimes by orders of magnitude). Users are very annoyed with us, and I honestly have no idea why it is happening. We don't have the resources to be able to manage throughput of jobs with only exclusive nodes, so if I cannot resolve this problem, we may have to switch back to Torque/{Moab|Maui}.

I'm attaching our slurm.conf. Let me know if you need our topology.conf and cgroup.conf files as well.

Paul.

> On Apr 11, 2016, at 10:26, Aaron Knister wrote:
>
> Hi Paul,
>
> There's always the Swiss Army knife that is a submission filter. If a
> user specifies exclusive, you could literally strip that argument via a
> submission filter before the allocation request hits the scheduler.
>
> -Aaron
>
> Sent from my iPhone
>
>> On Apr 8, 2016, at 2:03 PM, Wiegand, Paul wrote:
>>
>> This is *almost* what I want, but not quite. When I do this, users can
>> throw the "--exclusive=YES" flag and get an exclusive node. I guess I'd
>> interpreted "Shared=FORCE" to mean that they could not do this, which is
>> what I wanted. I'm not sure I understand, based on the documentation,
>> why "Shared=FORCE" implies that cores are consumed differently.
>>
>> Is there no way to disable "exclusive" while also getting what I want?
>>
>> Thanks,
>> Paul.
>>
>>> On Apr 8, 2016, at 13:40, John DeSantis wrote:
>>>
>>> Paul,
>>>
>>> Try changing the Partition "Shared=FORCE" statement to "Shared=NO".
>>>
>>> We do that on all of our partitions and get the desired behavior.
>>>
>>> John DeSantis
>>>
>>> On 04/08/2016 01:06 PM, Wiegand, Paul wrote:
>>>> Greetings,
>>>>
>>>> I would like to have our cluster configured (and believed I had done
>>>> so) to work as follows:
>>>>
>>>> * User jobs can share a node, but not cores and memory
>>>> * Users cannot override the share option using "--exclusive"
>>>> * Once assigned cores and memory, jobs are restricted to just those
>>>>   assigned resources
>>>>
>>>> But this is not what is happening. Instead, if jobs request a
>>>> sufficiently small amount of memory that they can both consume their
>>>> requested amounts without contention, slurm can and does deploy jobs
>>>> to the same node ... even if there is an insufficient number of
>>>> cores. Instead, the jobs alternate between a "suspended" mode and
>>>> "running" mode and, of course, run much, much slower as a result.
>>>>
>>>> Clearly I've misunderstood my configuration. We are running Slurm
>>>> 15.08.3. Here are what I believe are the relevant parameters from our
>>>> slurm.conf. Let me know if you want me to provide more.
>>>>
>>>> ProctrackType=proctrack/cgroup
>>>> TaskPlugin=task/cgroup
>>>> SelectType=select/cons_res
>>>> SelectTypeParameters=CR_Core_Memory
>>>>
>>>> PartitionName=DEFAULT Shared=FORCE State=UP DefaultTime=00:10:00
[slurm-dev] Re: Slurm Checkpoint/Restart example
I've found two things: first, you could try srun_cr instead of srun; and second, does your job need more than 5 minutes? But I'm not sure, so you may try it and post the result.

Am 14.04.2016 um 12:56 schrieb Husen R:
> Hello Danny,
>
> I have tried to restart using "scontrol checkpoint restart " but it
> doesn't work. In addition, the ".0" directory and its contents do not
> exist in my --checkpoint-dir. The following is my batch job:
>
> ===== batch job =====
> #!/bin/bash
> #SBATCH -J MatMul
> #SBATCH -o mm-%j.out
> #SBATCH -A pro
> #SBATCH -N 3
> #SBATCH -n 24
> #SBATCH --checkpoint=5
> #SBATCH --checkpoint-dir=/mirror/source/cr
> #SBATCH --time=01:30:00
> #SBATCH --mail-user=hus...@gmail.com
> #SBATCH --mail-type=begin
> #SBATCH --mail-type=end
>
> srun --mpi=pmi2 ./mm.o
> ===== end batch job =====
>
> Is there something that prevents me from getting the right directory
> structure?
>
> Regards,
> Husen
>
> On Thu, Apr 14, 2016 at 5:36 PM, Danny Rotscher
> <danny.rotsc...@tu-dresden.de> wrote:
>> Hello,
>>
>> usually the directory which is specified by --checkpoint-dir should
>> have the following structure:
>>
>> |__ script.ckpt
>> |__ .0
>>     |__ task.0.ckpt
>>     |__ task.1.ckpt
>>     |__ ...
>>
>> But you only have to run the following command to restart your batch
>> job:
>> scontrol checkpoint restart
>>
>> I have tried only batch jobs, and currently I am trying to build
>> MVAPICH2 with BLCR and Slurm support, because that MPI library is
>> explicitly mentioned in the Slurm documentation.
>>
>> A colleague also tested DMTCP, but with no success.
>>
>> Kind regards,
>> Danny
>> TU Dresden
>> Germany
>>
>> [...]
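For reference, a rough sketch of swapping srun_cr into the batch script above — assuming, per http://slurm.schedmd.com/checkpoint_blcr.html, that srun_cr was built along with the BLCR plugin and is invoked with the same arguments as srun:

```
#!/bin/bash
#SBATCH -J MatMul
#SBATCH -N 3
#SBATCH -n 24
#SBATCH --checkpoint=5                      # periodic checkpoint interval
#SBATCH --checkpoint-dir=/mirror/source/cr  # must be visible on all nodes
#SBATCH --time=01:30:00

# srun_cr wraps srun so that BLCR can checkpoint the launched tasks.
srun_cr --mpi=pmi2 ./mm.o
```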
[slurm-dev] Re: Slurm Checkpoint/Restart example
Hello Danny,

I have tried to restart using "scontrol checkpoint restart " but it doesn't work. In addition, the ".0" directory and its contents do not exist in my --checkpoint-dir. The following is my batch job:

===== batch job =====
#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o mm-%j.out
#SBATCH -A pro
#SBATCH -N 3
#SBATCH -n 24
#SBATCH --checkpoint=5
#SBATCH --checkpoint-dir=/mirror/source/cr
#SBATCH --time=01:30:00
#SBATCH --mail-user=hus...@gmail.com
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

srun --mpi=pmi2 ./mm.o
===== end batch job =====

Is there something that prevents me from getting the right directory structure?

Regards,
Husen

On Thu, Apr 14, 2016 at 5:36 PM, Danny Rotscher <danny.rotsc...@tu-dresden.de> wrote:
> Hello,
>
> usually the directory which is specified by --checkpoint-dir should have
> the following structure:
>
> |__ script.ckpt
> |__ .0
>     |__ task.0.ckpt
>     |__ task.1.ckpt
>     |__ ...
>
> But you only have to run the following command to restart your batch job:
> scontrol checkpoint restart
>
> I have tried only batch jobs, and currently I am trying to build MVAPICH2
> with BLCR and Slurm support, because that MPI library is explicitly
> mentioned in the Slurm documentation.
>
> A colleague also tested DMTCP, but with no success.
>
> Kind regards,
> Danny
> TU Dresden
> Germany
>
> Am 14.04.2016 um 11:01 schrieb Husen R:
>> Hi all,
>> Thank you for your reply.
>> [...]
[slurm-dev] Re: PySlurm for SLURM 2.3.2 API
On 04/14/2016 11:08, Naajil Aamir wrote:
> Hi, hope you are doing well. I am currently working on a scheduling
> policy for Slurm 2.3.2; for that I need a *PYSLURM* version that is
> compatible with Slurm 2.3.3, which I am unable to find on the internet.
> It would be a great help if you could provide a link to a PySlurm
> repository for Slurm 2.3.2.

Maybe the stale branches of pyslurm are what you are looking for?
https://github.com/PySlurm/pyslurm/branches

2.3.3 seems to be the oldest.

Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321
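If one of those branches fits, a rough build sketch (the branch name "2.3.3" is an assumption based on the branches page; check the branch's README for Slurm-path options to setup.py):

```
#!/bin/bash
# Clone PySlurm and build the branch matching the old Slurm release.
git clone https://github.com/PySlurm/pyslurm.git
cd pyslurm
git branch -r            # list remote branches to find the right one
git checkout 2.3.3       # assumption: branch named as shown on GitHub
python setup.py build
sudo python setup.py install
```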
[slurm-dev] Re: Slurm Checkpoint/Restart example
Hello,

usually the directory which is specified by --checkpoint-dir should have the following structure:

|__ script.ckpt
|__ .0
    |__ task.0.ckpt
    |__ task.1.ckpt
    |__ ...

But you only have to run the following command to restart your batch job:
scontrol checkpoint restart

I have tried only batch jobs, and currently I am trying to build MVAPICH2 with BLCR and Slurm support, because that MPI library is explicitly mentioned in the Slurm documentation.

A colleague also tested DMTCP, but with no success.

Kind regards,
Danny
TU Dresden
Germany

Am 14.04.2016 um 11:01 schrieb Husen R:
> Hi all,
> Thank you for your reply.
>
> Danny:
> I have installed BLCR and SLURM successfully. I have also configured
> CheckpointType, --checkpoint, --checkpoint-dir and JobCheckpointDir in
> order for Slurm to support checkpointing.
>
> I have tried to checkpoint a simple MPI parallel application many times
> in my small cluster, and like you said, after a checkpoint completes
> there is a directory named with the job ID in --checkpoint-dir. In that
> directory there is a file named "script.ckpt". I tried to restart
> directly using the srun command below:
>
> srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o
>
> where --restart-dir is the directory that contains "script.ckpt".
> Unfortunately, I got the following error:
>
> Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file
> or directory
> srun: error: compute-node: task 0: Exited with exit code 255
>
> As we can see from the error message above, there was no "task.0.ckpt"
> file, and I don't know how to get such a file. The files that I got from
> the checkpoint operation are a file named "script.ckpt" in
> --checkpoint-dir and two files in JobCheckpointDir named ".ckpt" and
> ".ckpt.old".
>
> According to the information in the srun section of
> http://slurm.schedmd.com/checkpoint_blcr.html, after a checkpoint
> completes there should be checkpoint files of the form ".ckpt" and
> "..ckpt" in --checkpoint-dir.
>
> Any idea how to solve this?
>
> Manuel:
> Yes, BLCR doesn't support checkpoint/restart of parallel/distributed
> applications by itself
> (https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi), but it can be
> used by other software to do that (I hope the software is SLURM.. huhu).
>
> I have tried to restart an MPI application using DMTCP, but it doesn't
> work. Would you please tell me how to do that?
>
> Thank you in advance,
>
> Regards,
> Husen
>
> On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher
> <danny.rotsc...@tu-dresden.de> wrote:
>> I forgot something to add: you have to create a directory for the
>> checkpoint metadata, which is by default located in
>> /var/slurm/checkpoint:
>> mkdir -p /var/slurm/checkpoint
>> chown -R slurm /var/slurm
>> or you define your own directory in slurm.conf:
>> JobCheckpointDir=
>>
>> You can check the parameters with:
>> scontrol show config | grep checkpoint
>>
>> Kind regards,
>> Danny
>> TU Dresden
>> Germany
>>
>> Am 14.04.2016 um 06:41 schrieb Danny Rotscher:
>>> Hello,
>>>
>>> we don't get it to work either, but we have already built Slurm with
>>> BLCR.
>>> [...]
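Pulling the steps in this thread together, a rough end-to-end sketch of the BLCR workflow being described (the job ID 1234 is hypothetical; paths follow the examples above):

```
#!/bin/bash
# One-time setup: the job checkpoint metadata directory (slurm.conf's
# JobCheckpointDir, /var/slurm/checkpoint by default).
mkdir -p /var/slurm/checkpoint
chown -R slurm /var/slurm

# Confirm the checkpoint plugin is active.
scontrol show config | grep -i checkpoint  # expect CheckpointType=checkpoint/blcr

# Submit a job that checkpoints periodically (see the batch scripts above),
# or trigger a checkpoint by hand for a hypothetical job 1234:
scontrol checkpoint create 1234

# Later, restart the checkpointed job:
scontrol checkpoint restart 1234

# If all went well, the --checkpoint-dir should hold script.ckpt plus a
# .0 subdirectory containing one task.N.ckpt image per task.
```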
[slurm-dev] Re: Slurm Checkpoint/Restart example
There is a good tutorial on how to use DMTCP on their GitHub page:
https://github.com/dmtcp/dmtcp/blob/master/QUICK-START.md

I would start there. Anyway, this Slurm mailing list is probably not the best place to ask for that information.

Best regards,
Manuel

2016-04-14 11:01 GMT+02:00 Husen R:
> Hi all,
> Thank you for your reply.
>
> Danny:
> I have installed BLCR and SLURM successfully. I have also configured
> CheckpointType, --checkpoint, --checkpoint-dir and JobCheckpointDir in
> order for Slurm to support checkpointing.
> [...]
>
> Manuel:
> Yes, BLCR doesn't support checkpoint/restart of parallel/distributed
> applications by itself
> (https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi), but it can be
> used by other software to do that (I hope the software is SLURM.. huhu).
>
> I have tried to restart an MPI application using DMTCP, but it doesn't
> work. Would you please tell me how to do that?
>
> Thank you in advance,
>
> Regards,
> Husen
>
> On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher
> <danny.rotsc...@tu-dresden.de> wrote:
>> I forgot something to add: you have to create a directory for the
>> checkpoint metadata, which is by default located in
>> /var/slurm/checkpoint:
>> mkdir -p /var/slurm/checkpoint
>> chown -R slurm /var/slurm
>> or you define your own directory in slurm.conf:
>> JobCheckpointDir=
>>
>> You can check the parameters with:
>> scontrol show config | grep checkpoint
>>
>> Kind regards,
>> Danny
>> TU Dresden
>> Germany
>>
>> Am 14.04.2016 um 06:41 schrieb Danny Rotscher:
>>> Hello,
>>>
>>> we don't get it to work either, but we have already built Slurm with
>>> BLCR.
>>> [...]
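Since DMTCP keeps coming up, a rough sketch of the basic workflow from that QUICK-START, outside of any Slurm integration (the 60-second interval and ./mm.o target are illustrative):

```
#!/bin/bash
# Run the program under a DMTCP coordinator, checkpointing every 60 seconds.
dmtcp_launch --interval 60 ./mm.o

# dmtcp_launch writes ckpt_*.dmtcp images plus a generated restart script;
# after a crash or kill, resume from the latest checkpoint with either:
./dmtcp_restart_script.sh
# or:
dmtcp_restart ckpt_*.dmtcp
```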
[slurm-dev] PySlurm for SLURM 2.3.2 API
Hi, hope you are doing well. I am currently working on a scheduling policy for Slurm 2.3.2; for that I need a *PYSLURM* version that is compatible with Slurm 2.3.3, which I am unable to find on the internet. It would be a great help if you could provide a link to a PySlurm repository for Slurm 2.3.2.

Thanks in advance
[slurm-dev] Re: Slurm Checkpoint/Restart example
Hi all,
Thank you for your reply.

Danny:
I have installed BLCR and SLURM successfully. I have also configured CheckpointType, --checkpoint, --checkpoint-dir and JobCheckpointDir in order for Slurm to support checkpointing.

I have tried to checkpoint a simple MPI parallel application many times in my small cluster, and like you said, after a checkpoint completes there is a directory named with the job ID in --checkpoint-dir. In that directory there is a file named "script.ckpt". I tried to restart directly using the srun command below:

srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o

where --restart-dir is the directory that contains "script.ckpt". Unfortunately, I got the following error:

Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or directory
srun: error: compute-node: task 0: Exited with exit code 255

As we can see from the error message above, there was no "task.0.ckpt" file, and I don't know how to get such a file. The files that I got from the checkpoint operation are a file named "script.ckpt" in --checkpoint-dir and two files in JobCheckpointDir named ".ckpt" and ".ckpt.old".

According to the information in the srun section of this link, http://slurm.schedmd.com/checkpoint_blcr.html, after a checkpoint completes there should be checkpoint files of the form ".ckpt" and "..ckpt" in --checkpoint-dir.

Any idea how to solve this?

Manuel:
Yes, BLCR doesn't support checkpoint/restart of parallel/distributed applications by itself (https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi), but it can be used by other software to do that (I hope the software is SLURM.. huhu).

I have tried to restart an MPI application using DMTCP, but it doesn't work. Would you please tell me how to do that?

Thank you in advance,

Regards,
Husen

On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <danny.rotsc...@tu-dresden.de> wrote:
> I forgot something to add: you have to create a directory for the
> checkpoint metadata, which is by default located in
> /var/slurm/checkpoint:
> mkdir -p /var/slurm/checkpoint
> chown -R slurm /var/slurm
> or you define your own directory in slurm.conf:
> JobCheckpointDir=
>
> You can check the parameters with:
> scontrol show config | grep checkpoint
>
> Kind regards,
> Danny
> TU Dresden
> Germany
>
> Am 14.04.2016 um 06:41 schrieb Danny Rotscher:
>> Hello,
>>
>> we don't get it to work either, but we have already built Slurm with
>> BLCR.
>> [...]
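Before retrying the restart, it may be worth verifying that per-task BLCR images actually exist where srun's --restart-dir will look for them — a quick sketch using the paths from the message above (the .0 layout follows Danny's description):

```
#!/bin/bash
# The restart failed on /mirror/source/cr/51/task.0.ckpt, so inspect what
# the checkpoint step actually produced for job 51.
ls -lR /mirror/source/cr/51      # expect script.ckpt plus a .0/ subdirectory
ls -l /mirror/source/cr/51/.0/   # per-task task.N.ckpt images should be here

# Confirm every compute node sees the same NFS-shared directory.
srun -N3 ls /mirror/source/cr/51
```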
[slurm-dev] Re: Slurm Checkpoint/Restart example
Hi Danny, all,

As far as I know, unfortunately BLCR does not come with MPI support; at least I haven't been able to achieve it. On the other side, DMTCP (http://dmtcp.sourceforge.net/) does work with MPI.

My team is very interested in having a reliable checkpoint/restart mechanism in Slurm, so we are now working on a plugin to integrate it. We are facing some technical problems, but we are working together with the DMTCP team to solve them, and we are confident we will have the integration ready soon. Anyway, I'll send a mail to this list when it's ready.

Cheers,
Manuel

2016-04-14 7:03 GMT+02:00 Danny Rotscher:
> I forgot something to add: you have to create a directory for the
> checkpoint metadata, which is by default located in
> /var/slurm/checkpoint:
> mkdir -p /var/slurm/checkpoint
> chown -R slurm /var/slurm
> or you define your own directory in slurm.conf:
> JobCheckpointDir=
>
> You can check the parameters with:
> scontrol show config | grep checkpoint
>
> Kind regards,
> Danny
> TU Dresden
> Germany
>
> Am 14.04.2016 um 06:41 schrieb Danny Rotscher:
>> Hello,
>>
>> we don't get it to work either, but we have already built Slurm with
>> BLCR.
>>
>> You first have to install the BLCR library, which is described on the
>> following website:
>> https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html
>>
>> Then we built and installed Slurm from source, and BLCR checkpointing
>> was included.
>>
>> After that you have to set at least one parameter in the file
>> "slurm.conf":
>> CheckpointType=checkpoint/blcr
>>
>> There are two ways to create checkpoints: you can either make a
>> checkpoint with the following command from outside your job:
>> scontrol checkpoint create
>> or you can let Slurm take periodic checkpoints with the following
>> sbatch parameter:
>> #SBATCH --checkpoint
>> We also tried:
>> #SBATCH --checkpoint :
>> e.g.
>> #SBATCH --checkpoint 0:10
>> to test it, but it doesn't work for us.
>>
>> We also set the parameter for the checkpoint directory:
>> #SBATCH --checkpoint-dir
>>
>> After you create a checkpoint, and a directory named with your job ID
>> has been created in your checkpoint directory, you can restart the job
>> with the following command:
>> scontrol checkpoint restart
>>
>> We tested some sequential and OpenMP programs with different parameters
>> and it works (checkpoint creation and restarting), but *we don't get
>> any MPI library to work*; we have already tested some programs built
>> with Open MPI and Intel MPI. The checkpoint is created, but we get the
>> following error when we want to restart them:
>> - Failed to open file '/'
>> - cr_restore_all_files [28534]: Unable to restore fd 3 (type=1,err=-21)
>> - cr_rstrt_child [28534]: Unable to restore files! (err=-21)
>> Restart failed: Is a directory
>> srun: error: taurusi4010: task 0: Exited with exit code 21
>>
>> So, it would be great if you could confirm our problems; maybe then
>> SchedMD will raise the priority of such mails ;-)
>> If you get it to work, please help us to understand how.
>>
>> Kind regards,
>> Danny
>> TU Dresden
>> Germany
>>
>> Am 11.04.2016 um 10:09 schrieb Husen R:
>>> Hi all,
>>>
>>> Based on the information in this link,
>>> http://slurm.schedmd.com/checkpoint_blcr.html,
>>> Slurm is able to checkpoint whole batch jobs and then restart
>>> execution of batch jobs and job steps from checkpoint files.
>>>
>>> Could anyone please tell me how to do that?
>>> I need help.
>>>
>>> Thank you in advance.
>>>
>>> Regards,
>>>
>>> Husen Rusdiansyah
>>> University of Indonesia