[slurm-dev] Re: PySlurm for SLURM 2.3.2 API

2016-04-14 Thread Naajil Aamir
I guess the error is with Cython, because Cython generates pyslurm.c.

*Unable to find pgen, not compiling formal grammar.*
I got this warning while installing Cython but ignored it, because I thought
it was not an error.
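
A possible recovery path, sketched under the assumption that the bundled pyslurm.c
really is empty and that Cython can regenerate it from pyslurm/pyslurm.pyx (the
paths mirror the build log quoted below; the exact steps are unverified):

```
cd ~/Desktop/pyslurm-slurm-2.3.3
# remove the empty generated C file and any stale build artifacts
sudo rm -rf pyslurm/pyslurm.c build/
# regenerate pyslurm.c from the Cython source
cython pyslurm/pyslurm.pyx
# rebuild and reinstall the extension
sudo python setup.py build
sudo python setup.py install
# quick check that the compiled module now loads
python -c "import pyslurm; print(pyslurm.__file__)"
```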

On Fri, Apr 15, 2016 at 9:41 AM, Naajil Aamir 
wrote:

> Do I need the pyslurm.c file to contain some code? In pyslurm 2.3.3 this
> file is empty. I installed it successfully, but when I try to import
> pyslurm I get the following error:
>
> import pyslurm
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/local/lib/python2.7/dist-packages/pyslurm/__init__.py", line 17, in <module>
>     from .pyslurm import *
> ImportError: dynamic module does not define init function (initpyslurm)
>
> This is what I get after installing pyslurm:
>
> mpiu@fypmaster-OptiPlex-330:~/Desktop/pyslurm-slurm-2.3.3$ sudo python setup.py build
> INFO:root:Info:
> INFO:root:Info: Building PySlurm (2.3.3-1)
> INFO:root:Info: --
> INFO:root:Info:
> INFO:root:Info: Cython version 0.24 installed
> running build
> running build_py
> copying pyslurm/__init__.py -> build/lib.linux-x86_64-2.7/pyslurm
> running build_ext
> skipping 'pyslurm/pyslurm.c' Cython extension (up-to-date)
> mpiu@fypmaster-OptiPlex-330:~/Desktop/pyslurm-slurm-2.3.3$ sudo python setup.py install
> INFO:root:Info:
> INFO:root:Info: Building PySlurm (2.3.3-1)
> INFO:root:Info: --
> INFO:root:Info:
> INFO:root:Info: Cython version 0.24 installed
> running install
> running build
> running build_py
> running build_ext
> skipping 'pyslurm/pyslurm.c' Cython extension (up-to-date)
> running install_lib
> byte-compiling /usr/local/lib/python2.7/dist-packages/pyslurm/__init__.py to __init__.pyc
> running install_egg_info
> Removing /usr/local/lib/python2.7/dist-packages/pyslurm-2.3.3_1.egg-info
> Writing /usr/local/lib/python2.7/dist-packages/pyslurm-2.3.3_1.egg-info
> mpiu@fypmaster-OptiPlex-330:~/Desktop/pyslurm-slurm-2.3.3$
>
> On Thu, Apr 14, 2016 at 3:46 PM, Benjamin Redling <
> benjamin.ra...@uni-jena.de> wrote:
>
>>
>> On 04/14/2016 11:08, Naajil Aamir wrote:
>> > Hi, hope you are doing well. I am currently working on a scheduling
>> > policy for Slurm 2.3.2. For that I need a *PYSLURM* version that is
>> > compatible with Slurm 2.3.3, which I am unable to find on the internet.
>> > It would be a great help if you could provide a link to a PYSLURM for
>> > Slurm 2.3.2 repository.
>>
>> Maybe the stale branches of pyslurm are what you are looking for?
>> https://github.com/PySlurm/pyslurm/branches
>>
>> 2.3.3 seems to be the oldest
>>
>> Benjamin
>> --
>> FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
>> vox: +49 3641 9 44323 | fax: +49 3641 9 44321
>>
>
>


[slurm-dev] Re: PySlurm for SLURM 2.3.2 API

2016-04-14 Thread Naajil Aamir
Do I need the pyslurm.c file to contain some code? In pyslurm 2.3.3 this file
is empty. I installed it successfully, but when I try to import pyslurm I get
the following error:

import pyslurm
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/pyslurm/__init__.py", line 17, in <module>
    from .pyslurm import *
ImportError: dynamic module does not define init function (initpyslurm)

This is what I get after installing pyslurm:

mpiu@fypmaster-OptiPlex-330:~/Desktop/pyslurm-slurm-2.3.3$ sudo python setup.py build
INFO:root:Info:
INFO:root:Info: Building PySlurm (2.3.3-1)
INFO:root:Info: --
INFO:root:Info:
INFO:root:Info: Cython version 0.24 installed
running build
running build_py
copying pyslurm/__init__.py -> build/lib.linux-x86_64-2.7/pyslurm
running build_ext
skipping 'pyslurm/pyslurm.c' Cython extension (up-to-date)
mpiu@fypmaster-OptiPlex-330:~/Desktop/pyslurm-slurm-2.3.3$ sudo python setup.py install
INFO:root:Info:
INFO:root:Info: Building PySlurm (2.3.3-1)
INFO:root:Info: --
INFO:root:Info:
INFO:root:Info: Cython version 0.24 installed
running install
running build
running build_py
running build_ext
skipping 'pyslurm/pyslurm.c' Cython extension (up-to-date)
running install_lib
byte-compiling /usr/local/lib/python2.7/dist-packages/pyslurm/__init__.py to __init__.pyc
running install_egg_info
Removing /usr/local/lib/python2.7/dist-packages/pyslurm-2.3.3_1.egg-info
Writing /usr/local/lib/python2.7/dist-packages/pyslurm-2.3.3_1.egg-info
mpiu@fypmaster-OptiPlex-330:~/Desktop/pyslurm-slurm-2.3.3$

On Thu, Apr 14, 2016 at 3:46 PM, Benjamin Redling <
benjamin.ra...@uni-jena.de> wrote:

>
> On 04/14/2016 11:08, Naajil Aamir wrote:
> > Hi, hope you are doing well. I am currently working on a scheduling policy
> > for Slurm 2.3.2. For that I need a *PYSLURM* version that is compatible
> > with Slurm 2.3.3, which I am unable to find on the internet. It would be a
> > great help if you could provide a link to a PYSLURM for Slurm 2.3.2
> > repository.
>
> Maybe the stale branches of pyslurm are what you are looking for?
> https://github.com/PySlurm/pyslurm/branches
>
> 2.3.3 seems to be the oldest
>
> Benjamin
> --
> FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
> vox: +49 3641 9 44323 | fax: +49 3641 9 44321
>


[slurm-dev] Re: dynamic srun invocation limits?

2016-04-14 Thread Gary Brown
Do the "thousands upon thousands" of sub-processes have dependencies among
them or are they fully independent of each other?  Is it necessary to spawn
them using srun; i.e., are you using srun to provide job step accounting or
to make them subject to scheduling policies or what?  Just trying to
understand your context.

Gary D. Brown
Adaptive Computing


On Thu, Apr 14, 2016 at 4:02 PM, Pyramid Bioengineering <
pyramidbioengineer...@gmail.com> wrote:

> Hi All,
>
> Our team is using Slurm to distribute tasks across a cluster, but our
> implementation may be a little different than what the typical person is
> doing... maybe?
>
> We'll submit a very simple sbatch, like so:
>
> ```
> #!/bin/bash
> #SBATCH --error=/tmp/error.log
> #SBATCH --output=/tmp/output.log
> execute_algorithm arg1 arg2
> ```
>
> `execute_algorithm` is where things get a bit funny; it can be some
> variant of a complex C algorithm of ours that will spawn potentially
> thousands upon thousands of subprocess invocations. Each of these
> subprocess invocations is executed with `srun`, and Slurm successfully
> recognizes them as job steps. It should also be noted that we wait
> for all the threads to exit before exiting `execute_algorithm` itself.
>
> The question here is: can Slurm handle this sort of srun task allotment,
> firing them all out at once?
>
> In testing, this has appeared to work on very small jobs that are just
> above the limits of our node resources. I can actually see entries in the
> job error log that slurm is recognizing that it's hitting task capacity and
> waiting:
>
> `srun: Job step creation temporarily disabled, retrying`
>
> We have yet to get to a point where we can run the "thousands" of tasks
> that I speak of, but that will be coming up at the end of the month, and
> frankly I'm skeptical.
>
> Is this a common approach? If we are stuck with this approach and there is
> not another way to do it, do we just build some internal scheduling logic
> into the `execute_algorithm`?
>
> Thanks!
>


[slurm-dev] dynamic srun invocation limits?

2016-04-14 Thread Pyramid Bioengineering
Hi All,

Our team is using Slurm to distribute tasks across a cluster, but our
implementation may be a little different than what the typical person is
doing... maybe?

We'll submit a very simple sbatch, like so:

```
#!/bin/bash
#SBATCH --error=/tmp/error.log
#SBATCH --output=/tmp/output.log
execute_algorithm arg1 arg2
```

`execute_algorithm` is where things get a bit funny; it can be some
variant of a complex C algorithm of ours that will spawn potentially
thousands upon thousands of subprocess invocations. Each of these
subprocess invocations is executed with `srun`, and Slurm successfully
recognizes them as job steps. It should also be noted that we wait
for all the threads to exit before exiting `execute_algorithm` itself.
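
For reference, a minimal sketch of that fan-out pattern (node counts, the worker
name, and the step count are placeholders, not our real values):

```
#!/bin/bash
#SBATCH --error=/tmp/error.log
#SBATCH --output=/tmp/output.log
#SBATCH -N 4
#SBATCH -n 64

# Fire off many job steps at once; steps beyond the allocation's CPU count
# simply wait, which is what produces the message quoted below.
for i in $(seq 1 1000); do
    srun --exclusive -N1 -n1 ./execute_subtask "$i" &
done
wait   # return only after every job step has exited
```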

The question here is: can Slurm handle this sort of srun task allotment,
firing them all out at once?

In testing, this has appeared to work on very small jobs that are just
above the limits of our node resources. I can actually see entries in the
job error log that slurm is recognizing that it's hitting task capacity and
waiting:

`srun: Job step creation temporarily disabled, retrying`

We have yet to get to a point where we can run the "thousands" of tasks
that I speak of, but that will be coming up at the end of the month, and
frankly I'm skeptical.

Is this a common approach? If we are stuck with this approach and there is
not another way to do it, do we just build some internal scheduling logic
into the `execute_algorithm`?

Thanks!


[slurm-dev] DBD_JOB_COMPLETE: cluster not registered

2016-04-14 Thread shengzhao wen
Hi all,
I cannot store my job info in MySQL via slurmdbd; there is a "cluster not
registered" message in the log file.
I have added the cluster name to my DB with sacctmgr, following
http://slurm.schedmd.com/accounting.html,
and my cluster name is present in the MySQL DB.


===== slurmdbd log info =====
slurmdbd: debug2: DBD_JOB_START: START CALL ID:42 NAME:test_slurm INX:0
slurmdbd: debug2: as_mysql_slurmdb_job_start() called
slurmdbd: DBD_JOB_START: cluster not registered
slurmdbd: debug2: DBD_STEP_START: ID:42.4294967294 NAME:batch SUBMIT:1460646258
slurmdbd: DBD_STEP_START: cluster not registered
slurmdbd: debug2: DBD_STEP_COMPLETE: ID:42.4294967294 SUBMIT:1460646258
slurmdbd: DBD_STEP_COMPLETE: cluster not registered
slurmdbd: debug2: DBD_JOB_START: START CALL ID:42 NAME:test_slurm INX:9
slurmdbd: debug2: as_mysql_slurmdb_job_start() called
slurmdbd: DBD_JOB_START: cluster not registered
slurmdbd: debug2: DBD_JOB_COMPLETE: ID:42
slurmdbd: debug2: as_mysql_slurmdb_job_complete() called
slurmdbd: DBD_JOB_COMPLETE: cluster not registered


===== cluster in MySQL =====
mysql> select * from cluster_table;
+---------------+------------+---------+------------------+--------------+--------------+-----------+-------------+----------------+------------+------------------+-------+
| creation_time | mod_time   | deleted | name             | control_host | control_port | last_port | rpc_version | classification | dimensions | plugin_id_select | flags |
+---------------+------------+---------+------------------+--------------+--------------+-----------+-------------+----------------+------------+------------------+-------+
|    1460645091 | 1460645091 |       0 | hgcp_cluster_2_0 |              |            0 |         0 |           0 |              0 |          1 |                0 |     0 |
+---------------+------------+---------+------------------+--------------+--------------+-----------+-------------+----------------+------------+------------------+-------+
1 row in set (0.00 sec)


How should I solve this problem?
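
A few hedged diagnostic steps (the cluster name is taken from the table above;
the restart command depends on the init system and is only an example):

```
# Name slurmctld announces (from slurm.conf) vs. names slurmdbd knows about
scontrol show config | grep -i ClusterName
sacctmgr show cluster format=Cluster

# If they differ, register the name from slurm.conf and restart slurmctld
# so it re-registers with slurmdbd
sacctmgr add cluster hgcp_cluster_2_0
service slurm restart        # or: systemctl restart slurmctld
```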


Zihan Wen





[slurm-dev] Re: Oversubscribing nodes

2016-04-14 Thread John DeSantis

Paul,

>> To answer John's question:  I don't want to limit 1 job per node, which is 
>> why I don't want "SHARED=FORCE:1".  I want for jobs to be able to share 
>> nodes, but not cores -- and I want for users to not be able to override this 
>> with "--exclusive".

OK - we are doing exactly this: sharing nodes but not CPUs (cores), and
there are no issues with over-subscription.  However, users can, and do,
request exclusivity by using "--exclusive".  A job submit plugin can
strip that request.

Here are the relevant portions of our sanitized configuration which you
could potentially apply and test to negate over-subscription:

PreemptMode = CANCEL
PreemptType = preempt/qos
SchedulerType   = sched/backfill
SelectType  = select/cons_res
SelectTypeParameters= CR_CPU_MEMORY


NodeName= CPUs=8 CoresPerSocket=4 Sockets=2 RealMemory=16012
NodeName= CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=24028
NodeName= CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=24028
NodeName= CPUs=20 CoresPerSocket=10 Sockets=2 RealMemory=516766
NodeName= CPUs=16 CoresPerSocket=4 Sockets=4 RealMemory=129119
NodeName= CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=24013
NodeName= CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=24016
NodeName= CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=48258
NodeName= CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=48251
NodeName= CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=19973
NodeName= CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=24016
NodeName= CPUs=12 CoresPerSocket=6 Sockets=2 RealMemory=23980
NodeName= CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32075
NodeName= CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32074
NodeName= CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32076
NodeName= CPUs=16 CoresPerSocket=8 Sockets=2 RealMemory=32073


PartitionName=blah1 State=UP Shared=NO Default=YES DefaultTime=01:00:00
DefMemPerCPU=512
PartitionName=blah2 Nodes=list1 DefaultTime=01:00:00 MaxTime=168:00:00
AllowQOS=qos1 Shared=NO State=UP DefMemPerCPU=512
PartitionName=blah3 Nodes=list2 State=UP Shared=No DefaultTime=01:00:00
DenyQOS=qos2 Priority=5000 DefMemPerCPU=512
PartitionName=blah4 Nodes=list3 State=UP Shared=No MaxTime=02:00:00
MaxNodes=4 DenyQOS=qos2 DefMemPerCPU=512
PartitionName=blah5 Nodes=list4 State=UP Shared=No DefaultTime=01:00:00
MaxTime=26:00:00 DenyQOS=qos2 DefMemPerCPU=512

Again, this configuration affords us the ability to have nodes run
multiple user jobs without CPUs being shared; if all CPUs are
allocated, the controller will not dispatch jobs to the nodes in question.
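
As a quick sanity check (the node name below is a placeholder), the allocation
on a suspect node can be compared against its core count:

```
# Allocated vs. total CPUs on one node
scontrol show node nodeXYZ | grep -oE 'CPUAlloc=[0-9]+|CPUTot=[0-9]+'
# Jobs the controller believes it has placed there
squeue -w nodeXYZ -o "%.10i %.10u %.5C %.8T"
```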

I may be incorrect here, but looking at your configuration you've
specified that your preempt type is partition priority, yet your
partition definitions hint at preemption configured via QOS.  Is this
what you wanted?  The documentation states that based upon your
preemption settings, the resultant behavior would be a mix of running
and suspended jobs based upon (a) job priorities, and (b) partition
priorities; maybe check the controller logs for preemption notices to
confirm or deny this thought?  At any rate, I'd suggest only using
preemption based upon QOS.

HTH,
John DeSantis


On 04/14/2016 07:38 AM, Wiegand, Paul wrote:
> I guess I'm going to have to dig in with a filter or something like that.  
> Our setup seems to oversubscribe no matter what I do.
> 
> To answer John's question:  I don't want to limit 1 job per node, which is 
> why I don't want "SHARED=FORCE:1".  I want for jobs to be able to share 
> nodes, but not cores -- and I want for users to not be able to override this 
> with "--exclusive".  It seems to be the case that slurm correctly ignores the 
> "--exclusive" switch with using FORCE, which is what I want.  It's just the 
> deploying 50 tasks to a 16-core node that's a problem for us.
> 
> Even though it is not what I want, I took everyone's advice and switched to 
> SHARED=NO to resolve the oversubscription problems; however, the problem is 
> not resolved.  Though this appears to happen less frequently, it still 
> happens.  Currently, we have several user jobs that are alternating between 
> suspended and running, and at least one of those nodes is definitely 
> oversubscribed.  There are 35 tasks from various user jobs deployed to it, 
> while it has only 16 cores available.
> 
> And I still don't see where my misunderstanding of the shared and cgroups 
> parameters are.  I've read the documentation repeatedly and asked on several 
> email lists what I'm not understanding, and mostly people have been telling 
> me "don't do that" ... which is fine as far as that goes, but what I *really* 
> want to know is:  "What am I misunderstanding?
> 
> I've been operating under the assumption that the problem is a failure on my 
> part:  I've misconfigured something somewhere.  This is why I want to 
> understand where my reading of the docs is flawed.  Or is it possible this a 
> bug in slurm?
> 
> I admit that I'm pretty frustrated.  It is 

[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Husen R
Danny:

I'm unable to use the srun_cr command. I got this error message in the
slurmctld log file after submitting srun_cr with sbatch:

[2016-04-14T19:22:42.719] job_complete: JobID=67 State=0x1 NodeCnt=2
WEXITSTATUS 255

Any idea how to fix this?

- Yes, my job needs more than 5 minutes.

Andy:

Yes, the /mirror directory is shared across my cluster. I have configured it
using NFS.

Regards,



Husen



On Thu, Apr 14, 2016 at 6:15 PM, Danny Rotscher <
danny.rotsc...@tu-dresden.de> wrote:

> I've found two things: first, you could try srun_cr instead of srun; and
> second, does your job need more than 5 minutes?!
> But I'm not sure, so you may try it and post the result.
>
>
> Am 14.04.2016 um 12:56 schrieb Husen R:
>
>> Hello Danny,
>>
>> I have tried to restart using "scontrol checkpoint restart " but it
>> doesn't work.
>> In addition, ".0" directory and its content are doesn't exist in my
>> --checkpoint-dir.
>> The following is my batch job :
>>
>> =batch job===
>>
>> #!/bin/bash
>> #SBATCH -J MatMul
>> #SBATCH -o mm-%j.out
>> #SBATCH -A pro
>> #SBATCH -N 3
>> #SBATCH -n 24
>> #SBATCH --checkpoint=5
>> #SBATCH --checkpoint-dir=/mirror/source/cr
>> #SBATCH --time=01:30:00
>> #SBATCH --mail-user=hus...@gmail.com
>> #SBATCH --mail-type=begin
>> #SBATCH --mail-type=end
>>
>> srun --mpi=pmi2 ./mm.o
>>
>> ===end batch job
>>
>> is there something that prevents me from getting the right directory
>> structure ?
>>
>>
>> Regards,
>>
>>
>>
>> Husen
>>
>>
>>
>>
>> On Thu, Apr 14, 2016 at 5:36 PM, Danny Rotscher <
>> danny.rotsc...@tu-dresden.de> wrote:
>>
>> Hello,
>>>
>>> usually the directory, which is specified by --checkpoint-dir, should
>>> have
>>> the following structure:
>>> 
>>> |__ script.ckpt
>>> |__ .0
>>>   |__ task.0.ckpt
>>>   |__ task.1.ckpt
>>>   |__ ...
>>>
>>> But you only have to run the following command to restart your batch job:
>>> scontrol checkpoint restart 
>>>
>>> I tried only batch jobs and currently I try to build MVAPICH2 with BLCR
>>> and Slurm support, because that mpi library is explicitly mentioned in
>>> the
>>> Slurm documentation.
>>>
>>> A colleague also tested DMTCP but no success.
>>>
>>> Kind reagards
>>> Danny
>>> TU Dresden
>>> Germany
>>>
>>>
>>> Am 14.04.2016 um 11:01 schrieb Husen R:
>>>
>>> Hi all,
 Thank you for your reply

 Danny :
 I have installed BLCR and SLURM successfully.
 I also have configured CheckpointType, --checkpoint, --checkpoint-dir
 and
 JobCheckpointDir in order for slurm to support checkpoint.

 I have tried to checkpoint a simple MPI parallel application many times
 in
 my small cluster, and like you said, after checkpoint is completed there
 is
 a directory named with jobid in  --checkpoint-dir. in that directory
 there
 is a file named "script.ckpt". I tried to restart directly using srun
 command below :

 srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o

 where --restart-dir is directory that contains "script.ckpt".
 Unfortunately, I got the following error :

 Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file
 or
 directory
 srun: error: compute-node: task 0: Exited with exit code 255

 As we can see from the error message above, there was no "task.0.ckpt"
 file. I don't know how to get such file. The files that I got from
 checkpoint operation is a file named "script.ckpt" in --checkpoint-dir
 and
 two files in JobCheckpointDir named ".ckpt" and
 ".ckpt.old".

 According to the information in section srun in this link
 http://slurm.schedmd.com/checkpoint_blcr.html, after checkpoint is
 completed there should be checkpoint files of the form ".ckpt"
 and
 "..ckpt" in --checkpoint-dir.

 Any idea to solve this ?

 Manuel :

 Yes, BLCR doesn't support checkpoint/restart parallel/distributed
 application by itself (
 https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi).
 But it can be used by other software to do that (I hope the software is
 SLURM..huhu)

 I have ever tried to restart mpi application using DMTCP but it doesn't
 work.
 Would you please tell me how to do that ?


 Thank you in advance,

 Regards,


 Husen





 On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <
 danny.rotsc...@tu-dresden.de> wrote:

 I forgot something to add, you have to create a directory for the

> checkpoint meta data, which is for default located in
> /var/slurm/checkpoint:
> mkdir -p /var/slurm/checkpoint
> chown -R slurm /var/slurm
> or you define your own directory in slurm.conf:
> JobCheckpointDir=
>
> The parameters you could check with:
> scontrol show config | grep checkpoint
>
> Kind regards,
> Danny
> TU 

[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Andy Riebs
Is your /mirror directory shared across your cluster?

On 04/14/2016 06:56 AM, Husen R wrote:

Hello Danny,

I have tried to restart using "scontrol checkpoint restart " but it
doesn't work.
In addition, the ".0" directory and its contents do not exist in my
--checkpoint-dir.
The following is my batch job:

=batch job===

#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o mm-%j.out
#SBATCH -A pro
#SBATCH -N 3
#SBATCH -n 24
#SBATCH --checkpoint=5
#SBATCH --checkpoint-dir=/mirror/source/cr
#SBATCH --time=01:30:00
#SBATCH --mail-user=hus...@gmail.com
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

srun --mpi=pmi2 ./mm.o

===end batch job

is there something that prevents me from getting the right directory
structure?

Regards,

Husen

On Thu, Apr 14, 2016 at 5:36 PM, Danny Rotscher wrote:

Hello,

usually the directory, which is specified by --checkpoint-dir, should
have the following structure:

|__ script.ckpt
|__ .0
    |__ task.0.ckpt
    |__ task.1.ckpt
    |__ ...

But you only have to run the following command to restart your batch job:
scontrol checkpoint restart 

I tried only batch jobs, and currently I am trying to build MVAPICH2 with
BLCR and Slurm support, because that MPI library is explicitly mentioned
in the Slurm documentation.

A colleague also tested DMTCP, but with no success.

Kind regards
Danny
TU Dresden
Germany

Am 14.04.2016 um 11:01 schrieb Husen R:

Hi all,
Thank you for your reply

Danny :
I have installed BLCR and SLURM successfully.
I also have configured CheckpointType, --checkpoint, --checkpoint-dir and
JobCheckpointDir in order for slurm to support checkpoint.

I have tried to checkpoint a simple MPI parallel application many times in
my small cluster, and like you said, after the checkpoint is completed
there is a directory named with the jobid in --checkpoint-dir. In that
directory there is a file named "script.ckpt". I tried to restart directly
using the srun command below:

srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o

where --restart-dir is the directory that contains "script.ckpt".
Unfortunately, I got the following error:

Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file
or directory
srun: error: compute-node: task 0: Exited with exit code 255

As we can see from the error message above, there was no "task.0.ckpt"
file. I don't know how to get such a file. The files that I got from the
checkpoint operation are a file named "script.ckpt" in --checkpoint-dir
and two files in JobCheckpointDir named ".ckpt" and ".ckpt.old".

According to the information in the srun section of this link,
http://slurm.schedmd.com/checkpoint_blcr.html, after a checkpoint is
completed there should be checkpoint files of the form ".ckpt" and
"..ckpt" in --checkpoint-dir.

Any idea to solve this?

Manuel :

Yes, BLCR doesn't support checkpoint/restart of parallel/distributed
applications by itself (
https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi).
But it can be used by other software to do that (I hope the software is
SLURM..huhu)

I have tried to restart an MPI application using DMTCP before, but it
doesn't work.
Would you please tell me how to do that?

Thank you in advance,

Regards,

Husen

On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <
danny.rotsc...@tu-dresden.de>

[slurm-dev] Re: Oversubscribing nodes

2016-04-14 Thread Wiegand, Paul
I guess I'm going to have to dig in with a filter or something like that.  Our 
setup seems to oversubscribe no matter what I do.

To answer John's question:  I don't want to limit 1 job per node, which is why 
I don't want "SHARED=FORCE:1".  I want for jobs to be able to share nodes, but 
not cores -- and I want for users to not be able to override this with 
"--exclusive".  It seems to be the case that slurm correctly ignores the 
"--exclusive" switch with using FORCE, which is what I want.  It's just the 
deploying 50 tasks to a 16-core node that's a problem for us.

Even though it is not what I want, I took everyone's advice and switched to 
SHARED=NO to resolve the oversubscription problems; however, the problem is not 
resolved.  Though this appears to happen less frequently, it still happens.  
Currently, we have several user jobs that are alternating between suspended and 
running, and at least one of those nodes is definitely oversubscribed.  There 
are 35 tasks from various user jobs deployed to it, while it has only 16 cores 
available.

And I still don't see where my misunderstanding of the shared and cgroups 
parameters is.  I've read the documentation repeatedly and asked on several 
email lists what I'm not understanding, and mostly people have been telling me 
"don't do that" ... which is fine as far as that goes, but what I *really* want 
to know is:  "What am I misunderstanding?"

I've been operating under the assumption that the problem is a failure on my 
part:  I've misconfigured something somewhere.  This is why I want to 
understand where my reading of the docs is flawed.  Or is it possible this is 
a bug in slurm?

I admit that I'm pretty frustrated.  It is causing a lot of performance 
problems for our users with big jobs.  Even though the event is rare, it is 
highly probable that a large, long job will experience some unintended 
oversubscription during the lifetime of its run, and this greatly hinders 
performance (sometimes by orders of magnitude).  Users are very annoyed with 
us, and I honestly have no idea why it is happening.  We don't have the 
resources to be able to manage throughput of jobs with only exclusive nodes, so 
if I cannot resolve this problem, we may have to switch back to 
Torque/{Moab|Maui}.

I'm attaching our slurm.conf.  Let me know if you need our topology.conf and 
cgroup.conf file, as well.

Paul.


slurm.conf
Description: slurm.conf



> On Apr 11, 2016, at 10:26, Aaron Knister  wrote:
> 
> 
> Hi Paul,
> 
> There's always the Swiss Army knife that is a submission filter. If a user 
> specifies exclusive you could literally strip that argument via a submission 
> filter before the allocation request hits the scheduler. 
> 
> -Aaron
> 
> Sent from my iPhone
> 
>> On Apr 8, 2016, at 2:03 PM, Wiegand, Paul  wrote:
>> 
>> 
>> This is *almost* what I want, but not quite.  When I do this, users can 
>> throw the "--exclusive=YES" flag and get an exclusive node.  I guess I'd 
>> interpreted "Shared=FORCE" to mean that they could not do this, which is 
>> what I wanted.  I'm not sure I understand, based on the documentation, why 
>> "Shared=FORCE" implies that cores are consumed differently.  
>> 
>> Is there no way to disable "exclusive" while also getting what I want?
>> 
>> Thanks,
>> Paul.
>> 
>> 
>>> On Apr 8, 2016, at 13:40, John DeSantis  wrote:
>>> 
>>> 
>>> Paul,
>>> 
>>> Try changing the Partition "Shared=FORCE" statement to "Shared=NO".
>>> 
>>> We do that on all of our partitions and get the desired behavior.
>>> 
>>> John DeSantis
>>> 
 On 04/08/2016 01:06 PM, Wiegand, Paul wrote:
 
 Greetings,
 
 I would like to have our cluster configured (and believed I had done) to 
 work as follows:
 
 * User jobs can share a node, but not cores and memory
 * Users cannot override the share option using "--exclusive" 
 * Once assigned cores and memory, jobs are restricted to just those 
 assigned resources
 
 
 But this is not what is happening.  Instead, if jobs request a 
 sufficiently small amount of memory that they can both consume their 
 requested amounts without contention, slurm can and does deploy jobs to 
 the same node ... even if there is an insufficient number of cores.  
 Instead, the jobs alternate between a "suspended" mode and "running" mode 
 and, of course, run much, much slower as a result.
 
 Clearly I've misunderstood my configuration.  We are running Slurm 
 15.08.3.  Here are what I believe are the relevant parameters from our 
 slurm.conf.  Let me know if you want me to provide more.
 
 
 ProctrackType=proctrack/cgroup
 TaskPlugin=task/cgroup
 
 SelectType=select/cons_res
 SelectTypeParameters=CR_Core_Memory
 
 PartitionName=DEFAULT  Shared=FORCE  State=UP   DefaultTime=00:10:00



[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Danny Rotscher
I've found two things: first, you could try srun_cr instead of srun; and 
second, does your job need more than 5 minutes?!

But I'm not sure, so you may try it and post the result.
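
If it helps, a minimal sketch of what I mean, based on your earlier batch
script (untested here; I am assuming srun_cr accepts the same options as srun):

```
#!/bin/bash
#SBATCH -J MatMul
#SBATCH -N 3
#SBATCH -n 24
#SBATCH --checkpoint=5                      # periodic checkpoint interval
#SBATCH --checkpoint-dir=/mirror/source/cr
#SBATCH --time=01:30:00

# srun_cr is the BLCR-aware wrapper around srun; the rest is unchanged
srun_cr --mpi=pmi2 ./mm.o
```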

Am 14.04.2016 um 12:56 schrieb Husen R:

Hello Danny,

I have tried to restart using "scontrol checkpoint restart " but it
doesn't work.
In addition, ".0" directory and its content are doesn't exist in my
--checkpoint-dir.
The following is my batch job :

=batch job===

#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o mm-%j.out
#SBATCH -A pro
#SBATCH -N 3
#SBATCH -n 24
#SBATCH --checkpoint=5
#SBATCH --checkpoint-dir=/mirror/source/cr
#SBATCH --time=01:30:00
#SBATCH --mail-user=hus...@gmail.com
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

srun --mpi=pmi2 ./mm.o

===end batch job

is there something that prevents me from getting the right directory
structure ?


Regards,



Husen




On Thu, Apr 14, 2016 at 5:36 PM, Danny Rotscher <
danny.rotsc...@tu-dresden.de> wrote:


Hello,

usually the directory, which is specified by --checkpoint-dir, should have
the following structure:

|__ script.ckpt
|__ .0
  |__ task.0.ckpt
  |__ task.1.ckpt
  |__ ...

But you only have to run the following command to restart your batch job:
scontrol checkpoint restart 

I tried only batch jobs, and currently I am trying to build MVAPICH2 with
BLCR and Slurm support, because that MPI library is explicitly mentioned in
the Slurm documentation.

A colleague also tested DMTCP, but with no success.

Kind regards
Danny
TU Dresden
Germany


Am 14.04.2016 um 11:01 schrieb Husen R:


Hi all,
Thank you for your reply

Danny :
I have installed BLCR and SLURM successfully.
I also have configured CheckpointType, --checkpoint, --checkpoint-dir and
JobCheckpointDir in order for slurm to support checkpoint.

I have tried to checkpoint a simple MPI parallel application many times in
my small cluster, and like you said, after checkpoint is completed there
is
a directory named with jobid in  --checkpoint-dir. in that directory there
is a file named "script.ckpt". I tried to restart directly using srun
command below :

srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o

where --restart-dir is directory that contains "script.ckpt".
Unfortunately, I got the following error :

Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file
or
directory
srun: error: compute-node: task 0: Exited with exit code 255

As we can see from the error message above, there was no "task.0.ckpt"
file. I don't know how to get such file. The files that I got from
checkpoint operation is a file named "script.ckpt" in --checkpoint-dir and
two files in JobCheckpointDir named ".ckpt" and ".ckpt.old".

According to the information in section srun in this link
http://slurm.schedmd.com/checkpoint_blcr.html, after checkpoint is
completed there should be checkpoint files of the form ".ckpt" and
"..ckpt" in --checkpoint-dir.

Any idea to solve this ?

Manuel :

Yes, BLCR doesn't support checkpoint/restart parallel/distributed
application by itself (
https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi).
But it can be used by other software to do that (I hope the software is
SLURM..huhu)

I have ever tried to restart mpi application using DMTCP but it doesn't
work.
Would you please tell me how to do that ?


Thank you in advance,

Regards,


Husen





On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <
danny.rotsc...@tu-dresden.de> wrote:

I forgot something to add, you have to create a directory for the

checkpoint meta data, which is for default located in
/var/slurm/checkpoint:
mkdir -p /var/slurm/checkpoint
chown -R slurm /var/slurm
or you define your own directory in slurm.conf:
JobCheckpointDir=

The parameters you could check with:
scontrol show config | grep checkpoint

Kind regards,
Danny
TU Dresden
Germany

Am 14.04.2016 um 06:41 schrieb Danny Rotscher:

Hello,

We can't get it to work either, but we have already built Slurm with BLCR.

You first have to install the BLCR library, which is described on the
following website:
https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html

Then we built and installed Slurm from source, and BLCR checkpointing was
included.

After that you have to set at least one Parameter in the file
"slurm.conf":
CheckpointType=checkpoint/blcr

There are two ways to create checkpoints: you could either make a
checkpoint with the following command from outside your job:
scontrol checkpoint create 
or you could let Slurm do some periodical checkpoints with the following
sbatch parameter:
#SBATCH --checkpoint 
We also tried:
#SBATCH --checkpoint :
e.g.
#SBATCH --checkpoint 0:10
to test it, but it doesn't work for us.

We also set the parameter for the checkpoint directory:
#SBATCH --checkpoint-dir 

After you create a checkpoint and in your checkpoint directory is
created
a directory with name of your jobid, you could restart the job by the
following command:
scontrol checkpoint 

[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Husen R
Hello Danny,

I have tried to restart using "scontrol checkpoint restart " but it
doesn't work.
In addition, ".0" directory and its content are doesn't exist in my
--checkpoint-dir.
The following is my batch job :

=batch job===

#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o mm-%j.out
#SBATCH -A pro
#SBATCH -N 3
#SBATCH -n 24
#SBATCH --checkpoint=5
#SBATCH --checkpoint-dir=/mirror/source/cr
#SBATCH --time=01:30:00
#SBATCH --mail-user=hus...@gmail.com
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

srun --mpi=pmi2 ./mm.o

===end batch job

is there something that prevents me from getting the right directory
structure ?


Regards,



Husen




On Thu, Apr 14, 2016 at 5:36 PM, Danny Rotscher <
danny.rotsc...@tu-dresden.de> wrote:

> Hello,
>
> usually the directory, which is specified by --checkpoint-dir, should have
> the following structure:
> 
> |__ script.ckpt
> |__ .0
>  |__ task.0.ckpt
>  |__ task.1.ckpt
>  |__ ...
>
> But you only have to run the following command to restart your batch job:
> scontrol checkpoint restart 
>
> I tried only batch jobs, and currently I am trying to build MVAPICH2 with
> BLCR and Slurm support, because that MPI library is explicitly mentioned in
> the Slurm documentation.
>
> A colleague also tested DMTCP, but with no success.
>
> Kind regards
> Danny
> TU Dresden
> Germany
>
>
> Am 14.04.2016 um 11:01 schrieb Husen R:
>
>> Hi all,
>> Thank you for your reply
>>
>> Danny :
>> I have installed BLCR and SLURM successfully.
>> I also have configured CheckpointType, --checkpoint, --checkpoint-dir and
>> JobCheckpointDir in order for slurm to support checkpoint.
>>
>> I have tried to checkpoint a simple MPI parallel application many times in
>> my small cluster, and like you said, after checkpoint is completed there
>> is
>> a directory named with jobid in  --checkpoint-dir. in that directory there
>> is a file named "script.ckpt". I tried to restart directly using srun
>> command below :
>>
>> srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o
>>
>> where --restart-dir is directory that contains "script.ckpt".
>> Unfortunately, I got the following error :
>>
>> Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file
>> or
>> directory
>> srun: error: compute-node: task 0: Exited with exit code 255
>>
>> As we can see from the error message above, there was no "task.0.ckpt"
>> file. I don't know how to get such file. The files that I got from
>> checkpoint operation is a file named "script.ckpt" in --checkpoint-dir and
>> two files in JobCheckpointDir named ".ckpt" and ".ckpt.old".
>>
>> According to the information in section srun in this link
>> http://slurm.schedmd.com/checkpoint_blcr.html, after checkpoint is
>> completed there should be checkpoint files of the form ".ckpt" and
>> "..ckpt" in --checkpoint-dir.
>>
>> Any idea to solve this ?
>>
>> Manuel :
>>
>> Yes, BLCR doesn't support checkpoint/restart parallel/distributed
>> application by itself (
>> https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi).
>> But it can be used by other software to do that (I hope the software is
>> SLURM..huhu)
>>
>> I have ever tried to restart mpi application using DMTCP but it doesn't
>> work.
>> Would you please tell me how to do that ?
>>
>>
>> Thank you in advance,
>>
>> Regards,
>>
>>
>> Husen
>>
>>
>>
>>
>>
>> On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <
>> danny.rotsc...@tu-dresden.de> wrote:
>>
>> I forgot something to add, you have to create a directory for the
>>> checkpoint meta data, which is for default located in
>>> /var/slurm/checkpoint:
>>> mkdir -p /var/slurm/checkpoint
>>> chown -R slurm /var/slurm
>>> or you define your own directory in slurm.conf:
>>> JobCheckpointDir=
>>>
>>> The parameters you could check with:
>>> scontrol show config | grep checkpoint
>>>
>>> Kind regards,
>>> Danny
>>> TU Dresden
>>> Germany
>>>
>>> Am 14.04.2016 um 06:41 schrieb Danny Rotscher:
>>>
>>> Hello,

 we don't get it to work too, but we already build Slurm with the BLCR.

 You first have to install the BLCR library, which is described on the
 following website:
 https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html

 Then we build and installed Slurm from source and BLCR checkpointing has
 been included.

 After that you have to set at least one Parameter in the file
 "slurm.conf":
 CheckpointType=checkpoint/blcr

 It exists two ways to create ceckpointing, you could either make a
 checkpoint by the following command from outside your job:
 scontrol checkpoint create 
 or you could let Slurm do some periodical checkpoints with the following
 sbatch parameter:
 #SBATCH --checkpoint 
 We also tried:
 #SBATCH --checkpoint :
 e.g.
 #SBATCH --checkpoint 0:10
 to test it, but it doesn't work for us.

 We also set the parameter for the checkpoint directory:
 #SBATCH 

[slurm-dev] Re: PySlurm for SLURM 2.3.2 API

2016-04-14 Thread Benjamin Redling

On 04/14/2016 11:08, Naajil Aamir wrote:
> Hi, hope you are doing well. I am currently working on a scheduling policy
> for Slurm 2.3.2. For that I need a *PYSLURM* version that is compatible with
> Slurm 2.3.3, which I am unable to find on the internet. It would be a great
> help if you could provide a link to a PYSLURM for Slurm 2.3.2 repository.

Maybe the stale branches of pyslurm are what you are looking for?
https://github.com/PySlurm/pyslurm/branches

2.3.3 seems to be the oldest
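
For example, a hedged way to grab that branch (the branch name is guessed from
the branches page above, so check it with `git branch -r` first):

```
git clone https://github.com/PySlurm/pyslurm.git
cd pyslurm
git branch -r          # list the available release branches
git checkout 2.3.3     # or whichever old branch actually matches your Slurm
```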

Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321


[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Danny Rotscher

Hello,

usually the directory, which is specified by --checkpoint-dir, should 
have the following structure:


|__ script.ckpt
|__ .0
 |__ task.0.ckpt
 |__ task.1.ckpt
 |__ ...

But you only have to run the following command to restart your batch job:
scontrol checkpoint restart 

I tried only batch jobs, and currently I am trying to build MVAPICH2 with 
BLCR and Slurm support, because that MPI library is explicitly mentioned in 
the Slurm documentation.


A colleague also tested DMTCP, but with no success.

Kind regards
Danny
TU Dresden
Germany

Am 14.04.2016 um 11:01 schrieb Husen R:

Hi all,
Thank you for your reply

Danny :
I have installed BLCR and SLURM successfully.
I also have configured CheckpointType, --checkpoint, --checkpoint-dir and
JobCheckpointDir in order for slurm to support checkpoint.

I have tried to checkpoint a simple MPI parallel application many times in
my small cluster, and like you said, after checkpoint is completed there is
a directory named with jobid in  --checkpoint-dir. in that directory there
is a file named "script.ckpt". I tried to restart directly using srun
command below :

srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o

where --restart-dir is directory that contains "script.ckpt".
Unfortunately, I got the following error :

Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or
directory
srun: error: compute-node: task 0: Exited with exit code 255

As we can see from the error message above, there was no "task.0.ckpt"
file. I don't know how to get such file. The files that I got from
checkpoint operation is a file named "script.ckpt" in --checkpoint-dir and
two files in JobCheckpointDir named ".ckpt" and ".ckpt.old".

According to the information in section srun in this link
http://slurm.schedmd.com/checkpoint_blcr.html, after checkpoint is
completed there should be checkpoint files of the form ".ckpt" and
"..ckpt" in --checkpoint-dir.

Any idea to solve this ?

Manuel :

Yes, BLCR doesn't support checkpoint/restart parallel/distributed
application by itself ( https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi).
But it can be used by other software to do that (I hope the software is
SLURM..huhu)

I have ever tried to restart mpi application using DMTCP but it doesn't
work.
Would you please tell me how to do that ?


Thank you in advance,

Regards,


Husen





On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <
danny.rotsc...@tu-dresden.de> wrote:


I forgot something to add, you have to create a directory for the
checkpoint meta data, which is for default located in /var/slurm/checkpoint:
mkdir -p /var/slurm/checkpoint
chown -R slurm /var/slurm
or you define your own directory in slurm.conf:
JobCheckpointDir=

The parameters you could check with:
scontrol show config | grep checkpoint

Kind regards,
Danny
TU Dresden
Germany

Am 14.04.2016 um 06:41 schrieb Danny Rotscher:


Hello,

We can't get it to work either, but we have already built Slurm with BLCR.

You first have to install the BLCR library, which is described on the
following website:
https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html

Then we built and installed Slurm from source, and BLCR checkpointing was
included.

After that you have to set at least one Parameter in the file
"slurm.conf":
CheckpointType=checkpoint/blcr

There are two ways to create checkpoints: you could either make a
checkpoint with the following command from outside your job:
scontrol checkpoint create 
or you could let Slurm do some periodical checkpoints with the following
sbatch parameter:
#SBATCH --checkpoint 
We also tried:
#SBATCH --checkpoint :
e.g.
#SBATCH --checkpoint 0:10
to test it, but it doesn't work for us.

We also set the parameter for the checkpoint directory:
#SBATCH --checkpoint-dir 

After you create a checkpoint and in your checkpoint directory is created
a directory with name of your jobid, you could restart the job by the
following command:
scontrol checkpoint restart 

We tested some sequential and openmp programs with different parameters
and it works (checkpoint creation and restarting),
but *we don't get any mpi library to work*, we already tested some
programs build with openmpi and intelmpi.
The checkpoint will be created but we get the following error when we
want to restart them:
- Failed to open file '/'
- cr_restore_all_files [28534]:  Unable to restore fd 3 (type=1,err=-21)
- cr_rstrt_child [28534]:  Unable to restore files!  (err=-21)
Restart failed: Is a directory
srun: error: taurusi4010: task 0: Exited with exit code 21

So, it would be great if you could confirm our problems; maybe then
SchedMD will raise the priority of such mails ;-)
If you get it to work, please help us to understand how.

Kind regards,
Danny
TU Dresden
Germany

Am 11.04.2016 um 10:09 schrieb Husen R:


Hi all,

Based on the information in this link
http://slurm.schedmd.com/checkpoint_blcr.html,
Slurm is able to checkpoint whole batch jobs and then restart execution
of
batch jobs and 

[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Manuel Rodríguez Pascual

There is a good tutorial on how to use DMTCP on their github page,

https://github.com/dmtcp/dmtcp/blob/master/QUICK-START.md

I would start there. Anyway, probably this Slurm mailing list is not
the best place to ask for that information.
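
For what it's worth, a rough single-node sketch of the DMTCP pattern under
sbatch (flags and the restart script name are taken from the DMTCP docs; I have
not verified this combination with Slurm):

```
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 8
#SBATCH --time=01:00:00

# dmtcp_launch starts a coordinator if none is running;
# -i sets an automatic checkpoint interval in seconds
dmtcp_launch -i 300 mpirun -np 8 ./your_mpi_program

# After a checkpoint, DMTCP writes dmtcp_restart_script.sh into the working
# directory; a later job can resume with:
#   ./dmtcp_restart_script.sh
```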

Best regards,

Manuel

2016-04-14 11:01 GMT+02:00 Husen R :
> Hi all,
> Thank you for your reply
>
> Danny :
> I have installed BLCR and SLURM successfully.
> I also have configured CheckpointType, --checkpoint, --checkpoint-dir and
> JobCheckpointDir in order for slurm to support checkpoint.
>
> I have tried to checkpoint a simple MPI parallel application many times in
> my small cluster, and like you said, after checkpoint is completed there is
> a directory named with jobid in  --checkpoint-dir. in that directory there
> is a file named "script.ckpt". I tried to restart directly using srun
> command below :
>
> srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o
>
> where --restart-dir is directory that contains "script.ckpt".
> Unfortunately, I got the following error :
>
> Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or
> directory
> srun: error: compute-node: task 0: Exited with exit code 255
>
> As we can see from the error message above, there was no "task.0.ckpt" file.
> I don't know how to get such file. The files that I got from checkpoint
> operation is a file named "script.ckpt" in --checkpoint-dir and two files in
> JobCheckpointDir named ".ckpt" and ".ckpt.old".
>
> According to the information in section srun in this link
> http://slurm.schedmd.com/checkpoint_blcr.html, after checkpoint is completed
> there should be checkpoint files of the form ".ckpt" and
> "..ckpt" in --checkpoint-dir.
>
> Any idea to solve this ?
>
> Manuel :
>
> Yes, BLCR doesn't support checkpoint/restart parallel/distributed
> application by itself (
> https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi). But it can be used by
> other software to do that (I hope the software is SLURM..huhu)
>
> I have tried to restart an MPI application using DMTCP before, but it
> doesn't work.
> Would you please tell me how to do that ?
>
>
> Thank you in advance,
>
> Regards,
>
>
> Husen
>
>
>
>
>
> On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher
>  wrote:
>>
>> I forgot something to add, you have to create a directory for the
>> checkpoint meta data, which is for default located in /var/slurm/checkpoint:
>> mkdir -p /var/slurm/checkpoint
>> chown -R slurm /var/slurm
>> or you define your own directory in slurm.conf:
>> JobCheckpointDir=
>>
>> The parameters you could check with:
>> scontrol show config | grep checkpoint
>>
>> Kind regards,
>> Danny
>> TU Dresden
>> Germany
>>
>> Am 14.04.2016 um 06:41 schrieb Danny Rotscher:
>>>
>>> Hello,
>>>
>>> we don't get it to work too, but we already build Slurm with the BLCR.
>>>
>>> You first have to install the BLCR library, which is described on the
>>> following website:
>>> https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html
>>>
>>> Then we build and installed Slurm from source and BLCR checkpointing has
>>> been included.
>>>
>>> After that you have to set at least one Parameter in the file
>>> "slurm.conf":
>>> CheckpointType=checkpoint/blcr
>>>
>>> It exists two ways to create ceckpointing, you could either make a
>>> checkpoint by the following command from outside your job:
>>> scontrol checkpoint create 
>>> or you could let Slurm do some periodical checkpoints with the following
>>> sbatch parameter:
>>> #SBATCH --checkpoint 
>>> We also tried:
>>> #SBATCH --checkpoint :
>>> e.g.
>>> #SBATCH --checkpoint 0:10
>>> to test it, but it doesn't work for us.
>>>
>>> We also set the parameter for the checkpoint directory:
>>> #SBATCH --checkpoint-dir 
>>>
>>> After you create a checkpoint and in your checkpoint directory is created
>>> a directory with name of your jobid, you could restart the job by the
>>> following command:
>>> scontrol checkpoint restart 
>>>
>>> We tested some sequential and openmp programs with different parameters
>>> and it works (checkpoint creation and restarting),
>>> but *we don't get any mpi library to work*, we already tested some
>>> programs build with openmpi and intelmpi.
>>> The checkpoint will be created but we get the following error when we
>>> want to restart them:
>>> - Failed to open file '/'
>>> - cr_restore_all_files [28534]:  Unable to restore fd 3 (type=1,err=-21)
>>> - cr_rstrt_child [28534]:  Unable to restore files!  (err=-21)
>>> Restart failed: Is a directory
>>> srun: error: taurusi4010: task 0: Exited with exit code 21
>>>
>>> So, it would be great if you could confirm our problems, maybe then
>>> schedmd higher up the priority of such mails;-)
>>> If you get it to work, please help us to understand how.
>>>
>>> Kind reagards,
>>> Danny
>>> TU Dresden
>>> Germany
>>>
>>> Am 11.04.2016 um 10:09 schrieb Husen R:

 Hi all,

 Based on the information in this link
 

[slurm-dev] PySlurm for SLURM 2.3.2 API

2016-04-14 Thread Naajil Aamir
Hi, hope you are doing well. I am currently working on a scheduling policy
for Slurm 2.3.2. For that I need a *PYSLURM* version that is compatible with
Slurm 2.3.3, which I am unable to find on the internet. It would be a great
help if you could provide a link to a PYSLURM for Slurm 2.3.2 repository.
Thanks in advance


[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Husen R
Hi all,
Thank you for your reply

Danny :
I have installed BLCR and SLURM successfully.
I also have configured CheckpointType, --checkpoint, --checkpoint-dir and
JobCheckpointDir in order for slurm to support checkpoint.

I have tried to checkpoint a simple MPI parallel application many times in
my small cluster, and like you said, after the checkpoint is completed there
is a directory named with the jobid in --checkpoint-dir. In that directory
there is a file named "script.ckpt". I tried to restart directly using the
srun command below:

srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o

where --restart-dir is the directory that contains "script.ckpt".
Unfortunately, I got the following error:

Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or
directory
srun: error: compute-node: task 0: Exited with exit code 255

As we can see from the error message above, there was no "task.0.ckpt"
file. I don't know how to get such a file. The files that I got from the
checkpoint operation are a file named "script.ckpt" in --checkpoint-dir and
two files in JobCheckpointDir named ".ckpt" and ".ckpt.old".

According to the information in the srun section of this link,
http://slurm.schedmd.com/checkpoint_blcr.html, after a checkpoint is
completed there should be checkpoint files of the form ".ckpt" and
"..ckpt" in --checkpoint-dir.

Any idea how to solve this?

Manuel :

Yes, BLCR doesn't support checkpoint/restart of parallel/distributed
applications by itself ( https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi).
But it can be used by other software to do that (I hope the software is
SLURM..huhu)

I have tried to restart an MPI application using DMTCP before, but it doesn't
work.
Would you please tell me how to do that?


Thank you in advance,

Regards,


Husen





On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <
danny.rotsc...@tu-dresden.de> wrote:

> I forgot something to add: you have to create a directory for the
> checkpoint meta data, which is by default located in /var/slurm/checkpoint:
> mkdir -p /var/slurm/checkpoint
> chown -R slurm /var/slurm
> or you define your own directory in slurm.conf:
> JobCheckpointDir=
>
> The parameters you could check with:
> scontrol show config | grep checkpoint
>
> Kind regards,
> Danny
> TU Dresden
> Germany
>
> Am 14.04.2016 um 06:41 schrieb Danny Rotscher:
>
>> Hello,
>>
>> we don't get it to work too, but we already build Slurm with the BLCR.
>>
>> You first have to install the BLCR library, which is described on the
>> following website:
>> https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html
>>
>> Then we build and installed Slurm from source and BLCR checkpointing has
>> been included.
>>
>> After that you have to set at least one Parameter in the file
>> "slurm.conf":
>> CheckpointType=checkpoint/blcr
>>
>> It exists two ways to create ceckpointing, you could either make a
>> checkpoint by the following command from outside your job:
>> scontrol checkpoint create 
>> or you could let Slurm do some periodical checkpoints with the following
>> sbatch parameter:
>> #SBATCH --checkpoint 
>> We also tried:
>> #SBATCH --checkpoint :
>> e.g.
>> #SBATCH --checkpoint 0:10
>> to test it, but it doesn't work for us.
>>
>> We also set the parameter for the checkpoint directory:
>> #SBATCH --checkpoint-dir 
>>
>> After you create a checkpoint and in your checkpoint directory is created
>> a directory with name of your jobid, you could restart the job by the
>> following command:
>> scontrol checkpoint restart 
>>
>> We tested some sequential and openmp programs with different parameters
>> and it works (checkpoint creation and restarting),
>> but *we don't get any mpi library to work*, we already tested some
>> programs build with openmpi and intelmpi.
>> The checkpoint will be created but we get the following error when we
>> want to restart them:
>> - Failed to open file '/'
>> - cr_restore_all_files [28534]:  Unable to restore fd 3 (type=1,err=-21)
>> - cr_rstrt_child [28534]:  Unable to restore files!  (err=-21)
>> Restart failed: Is a directory
>> srun: error: taurusi4010: task 0: Exited with exit code 21
>>
>> So, it would be great if you could confirm our problems, maybe then
>> schedmd higher up the priority of such mails;-)
>> If you get it to work, please help us to understand how.
>>
>> Kind reagards,
>> Danny
>> TU Dresden
>> Germany
>>
>> Am 11.04.2016 um 10:09 schrieb Husen R:
>>
>>> Hi all,
>>>
>>> Based on the information in this link
>>> http://slurm.schedmd.com/checkpoint_blcr.html,
>>> Slurm able to checkpoint the whole batch jobs and then Restart execution
>>> of
>>> batch jobs and job steps from checkpoint files.
>>>
>>> Anyone please tell me how to do that ?
>>> I need help.
>>>
>>> Thank you in advance.
>>>
>>> Regards,
>>>
>>>
>>> Husen Rusdiansyah
>>> University of Indonesia
>>>
>>
>>
>


[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Manuel Rodríguez Pascual

Hi Danny, all,

As far as I know, unfortunately BLCR does not come with MPI support.
At least I haven't been able to achieve it.

On the other hand, DMTCP ( http://dmtcp.sourceforge.net/ ) does work
with MPI. My team is very interested in having a reliable
checkpoint/restart mechanism in Slurm, so we are now writing a plugin to
integrate it. We are facing some technical problems, but we are working
together with the DMTCP team to solve them, and we are confident the
integration will be ready soon.

Anyway, I'll send a mail to this list when it's ready.

Cheers,


Manuel


2016-04-14 7:03 GMT+02:00 Danny Rotscher :
> I forgot something to add, you have to create a directory for the checkpoint
> meta data, which is for default located in /var/slurm/checkpoint:
> mkdir -p /var/slurm/checkpoint
> chown -R slurm /var/slurm
> or you define your own directory in slurm.conf:
> JobCheckpointDir=
>
> The parameters you could check with:
> scontrol show config | grep checkpoint
>
> Kind regards,
> Danny
> TU Dresden
> Germany
>
> Am 14.04.2016 um 06:41 schrieb Danny Rotscher:
>>
>> Hello,
>>
>> we don't get it to work too, but we already build Slurm with the BLCR.
>>
>> You first have to install the BLCR library, which is described on the
>> following website:
>> https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html
>>
>> Then we build and installed Slurm from source and BLCR checkpointing has
>> been included.
>>
>> After that you have to set at least one Parameter in the file
>> "slurm.conf":
>> CheckpointType=checkpoint/blcr
>>
>> It exists two ways to create ceckpointing, you could either make a
>> checkpoint by the following command from outside your job:
>> scontrol checkpoint create 
>> or you could let Slurm do some periodical checkpoints with the following
>> sbatch parameter:
>> #SBATCH --checkpoint 
>> We also tried:
>> #SBATCH --checkpoint :
>> e.g.
>> #SBATCH --checkpoint 0:10
>> to test it, but it doesn't work for us.
>>
>> We also set the parameter for the checkpoint directory:
>> #SBATCH --checkpoint-dir 
>>
>> After you create a checkpoint and in your checkpoint directory is created
>> a directory with name of your jobid, you could restart the job by the
>> following command:
>> scontrol checkpoint restart 
>>
>> We tested some sequential and openmp programs with different parameters
>> and it works (checkpoint creation and restarting),
>> but *we don't get any mpi library to work*, we already tested some
>> programs build with openmpi and intelmpi.
>> The checkpoint will be created but we get the following error when we want
>> to restart them:
>> - Failed to open file '/'
>> - cr_restore_all_files [28534]:  Unable to restore fd 3 (type=1,err=-21)
>> - cr_rstrt_child [28534]:  Unable to restore files!  (err=-21)
>> Restart failed: Is a directory
>> srun: error: taurusi4010: task 0: Exited with exit code 21
>>
>> So, it would be great if you could confirm our problems, maybe then
>> schedmd higher up the priority of such mails;-)
>> If you get it to work, please help us to understand how.
>>
>> Kind reagards,
>> Danny
>> TU Dresden
>> Germany
>>
>> Am 11.04.2016 um 10:09 schrieb Husen R:
>>>
>>> Hi all,
>>>
>>> Based on the information in this link
>>> http://slurm.schedmd.com/checkpoint_blcr.html,
>>> Slurm able to checkpoint the whole batch jobs and then Restart execution
>>> of
>>> batch jobs and job steps from checkpoint files.
>>>
>>> Anyone please tell me how to do that ?
>>> I need help.
>>>
>>> Thank you in advance.
>>>
>>> Regards,
>>>
>>>
>>> Husen Rusdiansyah
>>> University of Indonesia
>>
>>
>