[slurm-dev] Re: Trying to get a simple slurm cluster going

2016-07-18 Thread Husen R
On Mon, Jul 18, 2016 at 5:52 AM, P. Larry Nelson 
wrote:

>
> Hello,
>
> While I am in search of real hardware on which to build/test Slurm,
> I am attempting to just play around with it on a test VM (Scientific
> Linux 6.8), which, of course, is using NATted networking and is a
> standalone system protected from the outside world.
>
> I downloaded the latest (16.05.2) tarball and ran the rpmbuild
> and then installed all the rpm's.  Ran the Easy Configurator
> and gave it the hostname of the VM for the ControlMachine
> and the loopback address of 127.0.0.1 for the ControlAddr.
>

Try setting ControlMachine and ControlAddr to the same value (e.g. the
hostname).
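
For example, a minimal slurm.conf excerpt along those lines (the hostname
"myvm" is only a placeholder for your VM's hostname):

# slurm.conf -- ControlMachine and ControlAddr pointing at the same host
ControlMachine=myvm
ControlAddr=myvm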

>
> Made a munge key and it started just fine.
>
> When I do a 'service slurm start', it responds "OK" for both slurmctld
> and slurmd, but slurmctld dies right away.
>
> If I do a 'slurmctld -Dvvv', I get:
> slurmctld: pidfile not locked, assuming no running daemon
> slurmctld: debug:  creating clustername file: /var/spool/clustername
> slurmctld: fatal: _create_clustername_file: failed to create file
> /var/spool/clustername
>
> The slurm.conf has this for ClusterName:
> ClusterName=SlurmCluster
>
> So, why is slurmctld trying to create file /var/spool/clustername
> instead of /var/spool/SlurmCluster.
>

clustername is just the file name; slurmctld writes the value of your
ClusterName inside that file, so the file itself will not be named SlurmCluster.
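
For example, once slurmctld has created it, something like this would be
expected (path taken from your log above):

$ cat /var/spool/clustername
SlurmCluster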

>
> Slurmd and slurmctld are started as root.
> I'm obviously missing something here
>
> Thanks!
> - Larry
>
>
> --
> P. Larry Nelson (217-244-9855) | IT Administrator
> 457 Loomis Lab | High Energy Physics Group
> 1110 W. Green St., Urbana, IL  | Physics Dept., Univ. of Ill.
> MailTo: lnel...@illinois.edu   |
> http://hep.physics.illinois.edu/home/lnelson/
>
> --
>  "Information without accountability is just noise."  - P.L. Nelson
>



-- 
Post Graduate Student
Faculty of Computer Science
University of Indonesia
Depok


[slurm-dev] Re: number of processes in slurm job

2016-07-12 Thread Husen R
On Tue, Jul 12, 2016 at 3:04 PM, David Ramírez <drami...@sie.es> wrote:

> Hi Husen
>
>
>
> I don’t use mcapich since some years. But I remenber when you compile
> MVAPCIH2 you must indicate
>
>
>
> ./configure --with-pm=no --with-pmi=slurm
>

Hi David,

Thanks for the information

>
>
> I hope you can fix it ;)
>
>
>
> I used srun --mpi=pmi2 in some programs and it works too
>
>
>
>
>
> *From:* Husen R [mailto:hus...@gmail.com]
> *Sent:* Tuesday, July 12, 2016 9:59
>
> *To:* slurm-dev <slurm-dev@schedmd.com>
> *Subject:* [slurm-dev] Re: number of processes in slurm job
>
>
>
>
>
>
>
> On Tue, Jul 12, 2016 at 2:53 PM, David Ramírez <drami...@sie.es> wrote:
>
> Hi Husen.
>
>
>
> Did you compile OpenMPI with Slurm support? For example, did you indicate
> --with-slurm and --with-pmi?
>
>
>
> Hi David,
>
> I use MVAPICH2.
>
> I found that I have to use srun instead of mpirun.
>
> I use this command : srun --mpi=pmi2 ./mm.o 6000
>
> it works.
>
>
>
> I use OpenMPI and it works fine without indicating the number of MPI procs
> in my batch file.
>
>
>
> *From:* Husen R [mailto:hus...@gmail.com]
> *Sent:* Tuesday, July 12, 2016 9:41
> *To:* slurm-dev <slurm-dev@schedmd.com>
> *Subject:* [slurm-dev] Re: number of processes in slurm job
>
>
>
>
>
>
>
> On Tue, Jul 12, 2016 at 2:24 PM, Loris Bennett <loris.benn...@fu-berlin.de>
> wrote:
>
>
> Husen R <hus...@gmail.com> writes:
>
> > Re: [slurm-dev] Re: number of processes in slurm job
> >
> > Hi,
> >
> > Thanks for your reply !
> >
> > I use this sbatch script
> >
> > #!/bin/bash
> > #SBATCH -J mm6kn2_03
> > #SBATCH -o 6kn203-%j.out
> > #SBATCH -A necis
> > #SBATCH -N 3
> > #SBATCH -n 16
> > #SBATCH --time=05:30:00
> >
> > mpirun ./mm.o 6000
>
> You need to tell 'mpirun' how many processes to start.  If you do not,
> probably all cores available will be used.  So it looks like you have 6
> cores per node and thus 'mpirun' starts 18 processes.  You should write
> something like
>
>   mpirun -np ${SLURM_NTASKS} ./mm.o 6000
>
>
>
> without specifying -np value, using "#SBATCH -n 16" as written in my
> sbatch script I hope mpirun will use 16 as the number of processes.
>
> however, I just realized mpirun doesn't read sbatch script.
>
>
> Cheers,
>
> Loris
>
>
> > regards,
> >
> > Husen
> >
> > On Tue, Jul 12, 2016 at 1:21 PM, Loris Bennett
> > <loris.benn...@fu-berlin.de> wrote:
> >
> > Husen R <hus...@gmail.com> writes:
> >
> > > number of processes in slurm job
> >
> >
> > >
> > > Hi all,
> > >
> > > I tried to run a job on 3 nodes (N=3) with 16 processes (n=16), but
> > > Slurm automatically changes that n value to 18 (n=18).
> > >
> > > I also tried other combinations of n values that are not evenly
> > > divisible by N, but Slurm automatically changes those n values to
> > > values that are evenly divisible by N.
> > >
> > > How do I change this behavior?
> > > I need to use a specific value of n for experimental purposes.
> > >
> > > Thank you in advance.
> > >
> > > Regards,
> > >
> > > Husen
> >
> >
> > You need to give more details about what you did. How did you set the
> > number of processes?
> >
> > Cheers,
> >
> > Loris
> >
> > --
> > Dr. Loris Bennett (Mr.)
> > ZEDAT, Freie Universität Berlin Email
> > loris.benn...@fu-berlin.de
> >
> >
> >
>
> --
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
>
>
>
>
>

[slurm-dev] Re: number of processes in slurm job

2016-07-12 Thread Husen R
On Tue, Jul 12, 2016 at 2:53 PM, David Ramírez <drami...@sie.es> wrote:

> Hi Husen.
>
>
>
> Did you compile OpenMPI with Slurm support? For example, did you indicate
> --with-slurm and --with-pmi?
>

Hi David,

I use MVAPICH2.
I found that I have to use srun instead of mpirun.
I use this command: srun --mpi=pmi2 ./mm.o 6000
and it works.
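
For completeness, a sketch of how the batch script quoted further down would
look with srun in place of mpirun (all directives are taken from that script):

#!/bin/bash
#SBATCH -J mm6kn2_03
#SBATCH -o 6kn203-%j.out
#SBATCH -A necis
#SBATCH -N 3
#SBATCH -n 16
#SBATCH --time=05:30:00

# launch through Slurm's PMI2 support instead of calling mpirun directly
srun --mpi=pmi2 ./mm.o 6000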

>
>
> I use OpenMPI and it works fine without indicating the number of MPI procs
> in my batch file.
>
>
>
> *From:* Husen R [mailto:hus...@gmail.com]
> *Sent:* Tuesday, July 12, 2016 9:41
> *To:* slurm-dev <slurm-dev@schedmd.com>
> *Subject:* [slurm-dev] Re: number of processes in slurm job
>
>
>
>
>
>
>
> On Tue, Jul 12, 2016 at 2:24 PM, Loris Bennett <loris.benn...@fu-berlin.de>
> wrote:
>
>
> Husen R <hus...@gmail.com> writes:
>
> > Re: [slurm-dev] Re: number of processes in slurm job
> >
> > Hi,
> >
> > Thanks for your reply !
> >
> > I use this sbatch script
> >
> > #!/bin/bash
> > #SBATCH -J mm6kn2_03
> > #SBATCH -o 6kn203-%j.out
> > #SBATCH -A necis
> > #SBATCH -N 3
> > #SBATCH -n 16
> > #SBATCH --time=05:30:00
> >
> > mpirun ./mm.o 6000
>
> You need to tell 'mpirun' how many processes to start.  If you do not,
> probably all cores available will be used.  So it looks like you have 6
> cores per node and thus 'mpirun' starts 18 processes.  You should write
> something like
>
>   mpirun -np ${SLURM_NTASKS} ./mm.o 6000
>
>
>
> without specifying -np value, using "#SBATCH -n 16" as written in my
> sbatch script I hope mpirun will use 16 as the number of processes.
>
> however, I just realized mpirun doesn't read sbatch script.
>
>
> Cheers,
>
> Loris
>
>
> > regards,
> >
> > Husen
> >
> > On Tue, Jul 12, 2016 at 1:21 PM, Loris Bennett
> > <loris.benn...@fu-berlin.de> wrote:
> >
> > Husen R <hus...@gmail.com> writes:
> >
> > > number of processes in slurm job
> >
> >
> > >
> > > Hi all,
> > >
> > > I tried to run a job on 3 nodes (N=3) with 16 processes (n=16), but
> > > Slurm automatically changes that n value to 18 (n=18).
> > >
> > > I also tried other combinations of n values that are not evenly
> > > divisible by N, but Slurm automatically changes those n values to
> > > values that are evenly divisible by N.
> > >
> > > How do I change this behavior?
> > > I need to use a specific value of n for experimental purposes.
> > >
> > > Thank you in advance.
> > >
> > > Regards,
> > >
> > > Husen
> >
> >
> > You need to give more details about what you did. How did you set the
> > number of processes?
> >
> > Cheers,
> >
> > Loris
> >
> > --
> > Dr. Loris Bennett (Mr.)
> > ZEDAT, Freie Universität Berlin Email
> > loris.benn...@fu-berlin.de
> >
> >
> >
>
> --
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
>
>
>
>


[slurm-dev] Re: number of processes in slurm job

2016-07-12 Thread Husen R
On Tue, Jul 12, 2016 at 2:37 PM, Carlos Fenoy <mini...@gmail.com> wrote:

> If you do not specify the number of nodes does it work as expected?
>

It runs on 2 nodes, not 3.
I just realized I have to use srun instead of mpirun in order for Slurm to
run my job as I expect.



> On Tue, 12 Jul 2016, 09:25 Loris Bennett, <loris.benn...@fu-berlin.de>
> wrote:
>
>>
>> Husen R <hus...@gmail.com> writes:
>>
>> > Re: [slurm-dev] Re: number of processes in slurm job
>> >
>> > Hi,
>> >
>> > Thanks for your reply !
>> >
>> > I use this sbatch script
>> >
>> > #!/bin/bash
>> > #SBATCH -J mm6kn2_03
>> > #SBATCH -o 6kn203-%j.out
>> > #SBATCH -A necis
>> > #SBATCH -N 3
>> > #SBATCH -n 16
>> > #SBATCH --time=05:30:00
>> >
>> > mpirun ./mm.o 6000
>>
>> You need to tell 'mpirun' how many processes to start.  If you do not,
>> probably all cores available will be used.  So it looks like you have 6
>> cores per node and thus 'mpirun' starts 18 processes.  You should write
>> something like
>>
>>   mpirun -np ${SLURM_NTASKS} ./mm.o 6000
>>
>> Cheers,
>>
>> Loris
>>
>> > regards,
>> >
>> > Husen
>> >
>> > On Tue, Jul 12, 2016 at 1:21 PM, Loris Bennett
>> > <loris.benn...@fu-berlin.de> wrote:
>> >
>> > Husen R <hus...@gmail.com> writes:
>> >
>> > > number of processes in slurm job
>> >
>> >
>> > >
>> > > Hi all,
>> > >
>> > > I tried to run a job on 3 nodes (N=3) with 16 processes (n=16), but
>> > > Slurm automatically changes that n value to 18 (n=18).
>> > >
>> > > I also tried other combinations of n values that are not evenly
>> > > divisible by N, but Slurm automatically changes those n values to
>> > > values that are evenly divisible by N.
>> > >
>> > > How do I change this behavior?
>> > > I need to use a specific value of n for experimental purposes.
>> > >
>> > > Thank you in advance.
>> > >
>> > > Regards,
>> > >
>> > > Husen
>> >
>> >
>> > You need to give more details about what you did. How did you set
>> the
>> > number of processes?
>> >
>> > Cheers,
>> >
>> > Loris
>> >
>> > --
>> > Dr. Loris Bennett (Mr.)
>> > ZEDAT, Freie Universität Berlin Email
>> > loris.benn...@fu-berlin.de
>> >
>> >
>> >
>>
>> --
>> Dr. Loris Bennett (Mr.)
>> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
>>
>


[slurm-dev] Re: number of processes in slurm job

2016-07-12 Thread Husen R
On Tue, Jul 12, 2016 at 2:24 PM, Loris Bennett <loris.benn...@fu-berlin.de>
wrote:

>
> Husen R <hus...@gmail.com> writes:
>
> > Re: [slurm-dev] Re: number of processes in slurm job
> >
> > Hi,
> >
> > Thanks for your reply !
> >
> > I use this sbatch script
> >
> > #!/bin/bash
> > #SBATCH -J mm6kn2_03
> > #SBATCH -o 6kn203-%j.out
> > #SBATCH -A necis
> > #SBATCH -N 3
> > #SBATCH -n 16
> > #SBATCH --time=05:30:00
> >
> > mpirun ./mm.o 6000
>
> You need to tell 'mpirun' how many processes to start.  If you do not,
> probably all cores available will be used.  So it looks like you have 6
> cores per node and thus 'mpirun' starts 18 processes.  You should write
> something like
>
>   mpirun -np ${SLURM_NTASKS} ./mm.o 6000
>

Without specifying an -np value, I hoped mpirun would use 16 as the number of
processes, since "#SBATCH -n 16" is written in my sbatch script.
However, I just realized that mpirun doesn't read the sbatch script directives.
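
For reference, a sketch of the same batch script with Loris' suggestion
applied, so the task count requested from Slurm is passed to mpirun explicitly
(directives taken from the script above; whether the explicit -np is needed
depends on how the MPI library was built):

#!/bin/bash
#SBATCH -J mm6kn2_03
#SBATCH -o 6kn203-%j.out
#SBATCH -A necis
#SBATCH -N 3
#SBATCH -n 16
#SBATCH --time=05:30:00

# SLURM_NTASKS is set by sbatch from -n, so mpirun starts exactly 16 processes
mpirun -np ${SLURM_NTASKS} ./mm.o 6000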

>
> Cheers,
>
> Loris
>
> > regards,
> >
> > Husen
> >
> > On Tue, Jul 12, 2016 at 1:21 PM, Loris Bennett
> > <loris.benn...@fu-berlin.de> wrote:
> >
> > Husen R <hus...@gmail.com> writes:
> >
> > > number of processes in slurm job
> >
> >
> > >
> > > Hi all,
> > >
> > > I tried to run a job on 3 nodes (N=3) with 16 processes (n=16), but
> > > Slurm automatically changes that n value to 18 (n=18).
> > >
> > > I also tried other combinations of n values that are not evenly
> > > divisible by N, but Slurm automatically changes those n values to
> > > values that are evenly divisible by N.
> > >
> > > How do I change this behavior?
> > > I need to use a specific value of n for experimental purposes.
> > >
> > > Thank you in advance.
> > >
> > > Regards,
> > >
> > > Husen
> >
> >
> > You need to give more details about what you did. How did you set the
> > number of processes?
> >
> > Cheers,
> >
> > Loris
> >
> > --
> > Dr. Loris Bennett (Mr.)
> > ZEDAT, Freie Universität Berlin Email
> > loris.benn...@fu-berlin.de
> >
> >
> >
>
> --
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
>


[slurm-dev] Re: number of processes in slurm job

2016-07-12 Thread Husen R
Hi,

Thanks for your reply !

I use this sbatch script

#!/bin/bash
#SBATCH -J mm6kn2_03
#SBATCH -o 6kn203-%j.out
#SBATCH -A necis
#SBATCH -N 3
#SBATCH -n 16
#SBATCH --time=05:30:00

mpirun ./mm.o 6000

regards,

Husen

On Tue, Jul 12, 2016 at 1:21 PM, Loris Bennett <loris.benn...@fu-berlin.de>
wrote:

>
> Husen R <hus...@gmail.com> writes:
>
> > number of processes in slurm job
> >
> > Hi all,
> >
> > I tried to run a job on 3 nodes (N=3) with 16 processes (n=16), but
> > Slurm automatically changes that n value to 18 (n=18).
> >
> > I also tried other combinations of n values that are not evenly divisible
> > by N, but Slurm automatically changes those n values to values that are
> > evenly divisible by N.
> >
> > How do I change this behavior?
> > I need to use a specific value of n for experimental purposes.
> >
> > Thank you in advance.
> >
> > Regards,
> >
> > Husen
>
> You need to give more details about what you did.  How did you set the
> number of processes?
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de
>


[slurm-dev] Node order in squeue nodelist column

2016-07-11 Thread Husen R
Hi

I'm wondering how Slurm arranges nodes in the squeue NODELIST column.
Are the nodes simply arranged in alphabetical order (based on node name) from
left to right, or is it priority order?

The following is the output of squeue command :

 JOBID  PARTITION  NAME      USER   ST   TIME  NODES  NODELIST(REASON)
  1194  part1      md50n2_0  necis  R   30:26      2  compute-node,head-node

In the NODELIST column, compute-node always appears before head-node.
I have tried setting a weight for each node in slurm.conf, but the node
order is still unchanged.

It seems that the first node in the node list is the one that runs MPI rank 0,
and this affects my experiment results.
I want MPI rank 0 to run on head-node, because that node is the submit node.
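
In case it is useful, one Slurm mechanism for controlling which node gets
task 0 is the "arbitrary" task distribution, which places tasks in the order
listed in a host file. A hedged sketch only (node names taken from the squeue
output above, reduced to two tasks for brevity):

#!/bin/bash
#SBATCH -N 2
#SBATCH -n 2
#SBATCH -w head-node,compute-node

# one host name per task, in rank order: rank 0 lands on head-node
export SLURM_HOSTFILE=$PWD/hosts.txt
printf "head-node\ncompute-node\n" > "$SLURM_HOSTFILE"

srun --distribution=arbitrary --mpi=pmi2 ./mm.o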

Thank you in advance,


Husen


[slurm-dev] Re: How to setup node sequence

2016-06-28 Thread Husen R
Hi,

I tried setting a weight for each node, but the order of the nodes still has
not changed.
The following is part of my slurm.conf:

NodeName=node1 weight=1 
NodeName=node2 weight=10 ...
NodeName=node3 weight=20 ...

With that configuration I want Slurm to choose node1 when a job requests one
node, but the selected node is always node2.

Any idea how to solve this?
Thank you in advance,


Husen

On Tue, Jun 14, 2016 at 12:57 PM, Husen R <hus...@gmail.com> wrote:

> Thanks !
>
> I'll check it out.
>
> Regards,
>
>
> Husen
>
> On Mon, Jun 13, 2016 at 5:40 PM, Benjamin Redling <
> benjamin.ra...@uni-jena.de> wrote:
>
>>
>>
>>
>> On 06/13/2016 09:50, Husen R wrote:
>> > Hi all,
>> >
>> > How to setup node sequence/order in slurm ?
>> > I configured nodes in slurm.conf like this -> Nodes =
>> head,compute,spare.
>> >
>> > Using that configuration, if I use one node in my job, I hope slurm will
>> > choose head as computing node (as it is in a first order). However slurm
>> > always choose compute, not head.
>> >
>> > how to fix this ?
>>
>> http://slurm.schedmd.com/slurm.conf.html
>>
>> "
>> Weight
>> The priority of the node for scheduling purposes. All things being
>> equal, jobs will be allocated the nodes with the lowest weight which
>> satisfies their requirements. For example, a heterogeneous collection of
>> nodes might be placed into a single partition for greater system
>> utilization, responsiveness and capability. It would be preferable to
>> allocate smaller memory nodes rather than larger memory nodes if either
>> will satisfy a job's requirements. The units of weight are arbitrary,
>> but larger weights should be assigned to nodes with more processors,
>> memory, disk space, higher processor speed, etc. Note that if a job
>> allocation request can not be satisfied using the nodes with the lowest
>> weight, the set of nodes with the next lowest weight is added to the set
>> of nodes under consideration for use (repeat as needed for higher weight
>> values). If you absolutely want to minimize the number of higher weight
>> nodes allocated to a job (at a cost of higher scheduling overhead), give
>> each node a distinct Weight value and they will be added to the pool of
>> nodes being considered for scheduling individually. The default value is
>> 1.
>> "
>>
>> Benjamin
>> --
>> FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
>> vox: +49 3641 9 44323 | fax: +49 3641 9 44321
>>
>
>


[slurm-dev] RE: working directory of a completed job

2016-06-22 Thread Husen R
Hello Giovanni,

Thank you for your reply!

The job completion log is not yet configured in my Slurm config.

On Tue, Jun 21, 2016 at 8:48 PM, Giovanni Torres <torre...@helix.nih.gov>
wrote:

> You could get this from the job completion log:
>
>
>
> $ scontrol show config | grep JobComp
>
>
>
> Giovanni
>
>
>
> *From:* Husen R [mailto:hus...@gmail.com]
> *Sent:* Tuesday, June 21, 2016 1:59 AM
> *To:* slurm-dev <slurm-dev@schedmd.com>
> *Subject:* [slurm-dev] working directory of a completed job
>
>
>
> Hi,
>
> How to get a workdir of a completed job ?
>
> The command "scontrol show job JOBID" is only for running job.
>
> If I use that command for completed job, the following error message
> appeared..
> "slurm_load_jobs error: Invalid job id specified"
>
> thank you in advance
>
> Regards,
>
> Husen
>
>
>


[slurm-dev] Re: working directory of a completed job

2016-06-22 Thread Husen R
Hello Jason,

Thank you for your reply!

I use slurmdbd as my AccountingStorageType.
In the end, I just handle the jobs that are in the running/pending state:
I use the shell command "scontrol show job jobid | grep WorkDir | cut -d "=" -f2"
to get the WorkDir.

Regards,


Husen

On Tue, Jun 21, 2016 at 9:48 PM, Jason Bacon <bacon4...@gmail.com> wrote:

>
>
> Hello Husen,
>
> See http://slurm.schedmd.com/sacct.html.
>
> If you're using AccountingStorageType=accounting_storage/filetxt, you can
> also grep/awk/more the JobComp file.
>
> Jason
>
> On 06/21/16 00:56, Husen R wrote:
>
>> working directory of a completed job
>> Hi,
>>
>> How to get a workdir of a completed job ?
>> The command "scontrol show job JOBID" is only for running job.
>>
>> If I use that command for completed job, the following error message
>> appeared..
>> "slurm_load_jobs error: Invalid job id specified"
>>
>> thank you in advance
>> Regards,
>>
>> Husen
>>
>>
>
> --
> All wars are civil wars, because all men are brothers ... Each one owes
> infinitely more to the human race than to the particular country in
> which he was born.
> -- Francois Fenelon
>


[slurm-dev] working directory of a completed job

2016-06-20 Thread Husen R
Hi,

How do I get the workdir of a completed job?
The command "scontrol show job JOBID" only works for running jobs.

If I use that command for a completed job, the following error message
appears:
"slurm_load_jobs error: Invalid job id specified"

thank you in advance
Regards,

Husen


[slurm-dev] Re: How to setup slurm database accounting feature

2016-05-24 Thread Husen R
Hi Chris,

Thank you for your reply !

I have followed your suggestion: I killed the slurmdbd process and dropped all
tables in slurm_acct_db.
However, when I try to run "sudo sacctmgr add cluster hpctesis", the error
message "Database is busy or waiting for lock from other user." still
appears.

Therefore, I decided to change my cluster name to something else, and it
works!
I don't know why my first cluster name didn't work.

In addition, I have a new question. Does the sacct command only display jobs
from the current day?
I use sacct to display all jobs, but all I get is the jobs from the current
day; jobs executed before the current day do not appear.

I know I can use "sacct -c" to display job completion data, but that command
does not display jobs in the RUNNING state, which is what I want.

So, is there a way to change sacct's behavior, so that I can display all jobs
(RUNNING, FAILED, CANCELLED, COMPLETED, etc.) from every day available in the
Slurm database at once?
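
For what it's worth, a hedged sketch of such a query using sacct's time-window
and state filters (the start date below is only a placeholder; by default sacct
reports jobs since 00:00 of the current day, which matches the behavior
described above):

# all jobs since May 1st, in any of these states
sacct -S 2016-05-01 -E now \
      --state=RUNNING,FAILED,CANCELLED,COMPLETED \
      --format=JobID,JobName,Partition,State,Start,End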

Thank you in advance.

Regards,
Husen

On Mon, May 23, 2016 at 12:25 PM, Christopher Samuel <sam...@unimelb.edu.au>
wrote:

>
> On 23/05/16 14:08, Husen R wrote:
>
> > anyone can tell me how to solve this please ?
>
> Kill all your slurmdbd's first, and check that all the associated
> processes are gone.
>
> Then as long as you've not got any important data there (and that seems
> unlikely if you can't create your first cluser) drop all the tables in
> that database by hand and then start one slurmdbd in debugging mode with:
>
> slurmdbd -D -v
>
> and see what happens.
>
> By the way is this is standard MySQL or MariaDB or is it a clustered
> version (Galera/Percona-xtradb/etc)?
>
> All the best,
> Chris
> --
>  Christopher SamuelSenior Systems Administrator
>  VLSCI - Victorian Life Sciences Computation Initiative
>  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>  http://www.vlsci.org.au/  http://twitter.com/vlsci
>


[slurm-dev] Re: How to setup slurm database accounting feature

2016-05-22 Thread Husen R
Hi,

In order to investigate the error message "Database is busy or waiting for
lock from other user.", I looked at the process list in MySQL. This is the
output of the "show processlist \G;" command:


*** 1. row ***
 Id: 10
   User: root
   Host: localhost
 db: slurm_acct_db
Command: Sleep
   Time: 2556
  State:
   Info: NULL
*** 2. row ***
 Id: 11
   User: root
   Host: localhost
 db: slurm_acct_db
Command: Sleep
   Time: 153
  State:
   Info: NULL
*** 3. row ***
 Id: 39
   User: root
   Host: localhost
 db: slurm_jobcomp_db
Command: Sleep
   Time: 2556
  State:
   Info: NULL
*** 4. row ***
 Id: 41
   User: root
   Host: localhost
 db: slurm_acct_db
Command: Query
   Time: 2468
  State: Waiting for table metadata lock
   Info: create table if not exists "hpctesis_assoc_table" (`creation_time`
int unsigned not null, `mod_time`
*** 5. row ***
 Id: 42
   User: root
   Host: localhost
 db: slurm_acct_db
Command: Query
   Time: 2405
  State: Waiting for table metadata lock
   Info: create table if not exists "hpctesis_assoc_table" (`creation_time`
int unsigned not null, `mod_time`
*** 6. row ***
 Id: 43
   User: root
   Host: localhost
 db: slurm_acct_db
Command: Query
   Time: 2372
  State: Waiting for table metadata lock
   Info: create table if not exists "hpctesis_assoc_table" (`creation_time`
int unsigned not null, `mod_time`
*** 7. row ***
 Id: 44
   User: root
   Host: localhost
 db: slurm_acct_db
Command: Query
   Time: 2357
  State: Waiting for table metadata lock
   Info: create table if not exists "hpctesis_assoc_table" (`creation_time`
int unsigned not null, `mod_time`
*** 8. row ***
 Id: 46
   User: root
   Host: localhost
 db: slurm_acct_db
Command: Query
   Time: 2099
  State: Waiting for table metadata lock
   Info: create table if not exists "hpctesis_assoc_table" (`creation_time`
int unsigned not null, `mod_time`
*** 9. row ***
 Id: 47
   User: root
   Host: localhost
 db: NULL
Command: Query
   Time: 0
  State: NULL
   Info: show processlist
9 rows in set (0.00 sec)

ERROR:
No query specified


Based on the output above, I see several processes whose Info is "create table
if not exists "hpctesis_assoc_table" (`creation_time` int unsigned not null,
`mod_time`" and whose State is "Waiting for table metadata lock".

I checked the slurm_acct_db database: the table named hpctesis_assoc_table
exists, but when I try to run an SQL SELECT on it, the MySQL server does not
respond.

I guess this is the problem that prevents me from running the sacctmgr
command.

Can anyone tell me how to solve this, please?
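
(For reference only, a hedged sketch of how a blocked metadata lock is usually
inspected and cleared from the mysql client; the session IDs are taken from the
processlist output above, and killing them is an illustration, not a verified
fix.)

-- re-check which sessions are blocked or idle in slurm_acct_db
SHOW PROCESSLIST;

-- terminate the long-idle connections that may be holding the metadata lock
KILL 10;
KILL 39;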

Thank you in advance,

Regards,


Husen

On Mon, May 23, 2016 at 10:26 AM, Husen R <hus...@gmail.com> wrote:

> Hi,
>
> Yes, I can connect as the slurmdbd 'storageuser'. I also can create and
> drop tables.
> I don't know how to solve this.
> the message "Database is busy or waiting for lock from other user." is
> keep appearing everytime I try to add cluster using sacctmgr.
>
> I need help
>
> Regards,
>
> Husen
>
> On Sun, May 22, 2016 at 2:05 PM, Daniel Letai <d...@letai.org.il> wrote:
>
>> It might be a permissions issue - can you connect as the slurmdbd
>> 'storageuser' to your db and create and drop tables?
>> From http://slurm.schedmd.com/accounting.html :
>>
>>
>>- *StorageUser*: Define the name of the user we are going to connect
>>to the database with to store the job accounting data.
>>
>> MySQL Configuration
>>
>> While Slurm will create the database tables automatically you will need
>> to make sure the StorageUser is given permissions in the MySQL or MariaDB
>> database to do so. As the *mysql* user grant privileges to that user
>> using a command such as:
>>
>> GRANT ALL ON StorageLoc.* TO 'StorageUser'@'StorageHost';
>> (The ticks are needed)
>>
>> (You need to be root to do this. Also in the info for password usage
>> there is a line that starts with '->'. This is a continuation prompt since the
>> previous mysql statement did not end with a ';'. It assumes that you wish
>> to input more info.)
>>
>> If you want Slurm to create the database itself, and any future
>> databases, you can change your grant line to be *.* instead of StorageLoc.*
>>
>>
>>

[slurm-dev] Re: How to setup slurm database accounting feature

2016-05-22 Thread Husen R
Hi,

Yes, I can connect as the slurmdbd 'storageuser', and I can also create and
drop tables.
I don't know how to solve this.
The message "Database is busy or waiting for lock from other user." keeps
appearing every time I try to add a cluster using sacctmgr.

I need help.

Regards,

Husen

On Sun, May 22, 2016 at 2:05 PM, Daniel Letai <d...@letai.org.il> wrote:

> It might be a permissions issue - can you connect as the slurmdbd
> 'storageuser' to your db and create and drop tables?
> From http://slurm.schedmd.com/accounting.html :
>
>
>- *StorageUser*: Define the name of the user we are going to connect
>to the database with to store the job accounting data.
>
> MySQL Configuration
>
> While Slurm will create the database tables automatically you will need to
> make sure the StorageUser is given permissions in the MySQL or MariaDB
> database to do so. As the *mysql* user grant privileges to that user
> using a command such as:
>
> GRANT ALL ON StorageLoc.* TO 'StorageUser'@'StorageHost';
> (The ticks are needed)
>
> (You need to be root to do this. Also in the info for password usage there
> is a line that starts with '->'. This is a continuation prompt since the
> previous mysql statement did not end with a ';'. It assumes that you wish
> to input more info.)
>
> If you want Slurm to create the database itself, and any future databases,
> you can change your grant line to be *.* instead of StorageLoc.*
>
>
>
>
> On 05/22/2016 06:16 AM, Husen R wrote:
>
> Hi,
>
> The following is the error message I got from slurmdbd.log. I got this
> error message after I try to add my clustername=hpctesis to slurmdbd using
> command  "sudo sacctmgr add cluster hpctesis".
>
>
> [2016-05-22T10:04:33.047] error: We should have gotten a new id: Table
> 'slurm_acct_db.hpctesis_job_table' doesn't exist
> [2016-05-22T10:04:33.047] error: couldn't add job 386 at job completion
> [2016-05-22T10:04:33.047] DBD_JOB_COMPLETE: cluster not registered
>
> Should I create a table named hpctesis_job_table manually ?
>
> as far as I understood, slurm should able to do this by it self..am I
> right ?
> how to solve this ?
>
> I need help.
> Thank you in advance,
>
>
> Regards,
>
>
> Husen
>
> On Sat, May 21, 2016 at 7:31 PM, Husen R <hus...@gmail.com> wrote:
>
>> Hi daniel,
>>
>> Thank you for your reply !
>>
>> The error regarding mysql socket has been solved.
>>  I forget to run slurmdbd daemon prior to running slurmctld daemon.
>>
>> however, I got this error message when I try to add cluster using
>> sacctmgr command :
>>
>>
>> --
>>
>> $ sudo sacctmgr add cluster comeon
>>
>>  Adding Cluster(s)
>>   Name  = comeon
>> Would you like to commit changes? (You have 30 seconds to decide)
>> (N/y): y
>>  Database is busy or waiting for lock from other user.
>>
>> ---
>>
>> How to fix this ?
>> Thank you in advance.
>>
>> Regards,
>>
>>
>> Husen
>>
>> On Sat, May 21, 2016 at 6:28 PM, Daniel Letai < <d...@letai.org.il>
>> d...@letai.org.il> wrote:
>>
>>>
>>> Does the socket file exists?
>>> What's in your /etc/my.cnf (or my.cnf.d/some other config file) under
>>> [mysqld]?
>>> [mysqld]
>>> socket=/path/to/datadir/mysql/mysql.sock
>>>
>>> If a socket value doesn't exist, either create one, or create a link
>>> between the actual socket file and /var/run/mysqld/mysqld.sock
>>> BTW - either you have a typo in your mail, or your socket is
>>> misconfigured - never saw mysqld.soc (without 'k' at end) as the name of
>>> the socket, although it's certainly legal.
>>>
>>> Other option is that the mysql server is not running - did you start the
>>> daemon?
>>>
>>> On 05/21/2016 01:45 PM, Husen R wrote:
>>>
>>>> Re: [slurm-dev] How to setup slurm database accounting feature
>>>> I checked slurmctld.log, I got this error message. how to solve this ?
>>>>
>>>> [2016-05-21T17:37:40.589] error: mysql_real_connect failed: 2002 Can't
>>>> connect to local MySQL server through socket '/var/run/mysqld/mysqld.soc$
>>>> [2016-05-21T17:37:40.589] fatal: You haven't inited this storage yet.
>>>>
>>>> Thank you in advance
>>>>
>>

[slurm-dev] Re: How to setup slurm database accounting feature

2016-05-21 Thread Husen R
Hi,

The following is the error message I got in slurmdbd.log after I tried to add
my cluster name (hpctesis) to slurmdbd using the command "sudo sacctmgr add
cluster hpctesis".


[2016-05-22T10:04:33.047] error: We should have gotten a new id: Table
'slurm_acct_db.hpctesis_job_table' doesn't exist
[2016-05-22T10:04:33.047] error: couldn't add job 386 at job completion
[2016-05-22T10:04:33.047] DBD_JOB_COMPLETE: cluster not registered

Should I create a table named hpctesis_job_table manually?

As far as I understand, Slurm should be able to do this by itself. Am I right?
How do I solve this?

I need help.
Thank you in advance,


Regards,


Husen

On Sat, May 21, 2016 at 7:31 PM, Husen R <hus...@gmail.com> wrote:

> Hi daniel,
>
> Thank you for your reply !
>
> The error regarding mysql socket has been solved.
>  I forget to run slurmdbd daemon prior to running slurmctld daemon.
>
> however, I got this error message when I try to add cluster using sacctmgr
> command :
>
>
> --
>
> $ sudo sacctmgr add cluster comeon
>
>  Adding Cluster(s)
>   Name  = comeon
> Would you like to commit changes? (You have 30 seconds to decide)
> (N/y): y
>  Database is busy or waiting for lock from other user.
>
> ---
>
> How to fix this ?
> Thank you in advance.
>
> Regards,
>
>
> Husen
>
> On Sat, May 21, 2016 at 6:28 PM, Daniel Letai <d...@letai.org.il> wrote:
>
>>
>> Does the socket file exists?
>> What's in your /etc/my.cnf (or my.cnf.d/some other config file) under
>> [mysqld]?
>> [mysqld]
>> socket=/path/to/datadir/mysql/mysql.sock
>>
>> If a socket value doesn't exist, either create one, or create a link
>> between the actual socket file and /var/run/mysqld/mysqld.sock
>> BTW - either you have a typo in your mail, or your socket is
>> misconfigured - never saw mysqld.soc (without 'k' at end) as the name of
>> the socket, although it's certainly legal.
>>
>> Other option is that the mysql server is not running - did you start the
>> daemon?
>>
>> On 05/21/2016 01:45 PM, Husen R wrote:
>>
>>> Re: [slurm-dev] How to setup slurm database accounting feature
>>> I checked slurmctld.log, I got this error message. how to solve this ?
>>>
>>> [2016-05-21T17:37:40.589] error: mysql_real_connect failed: 2002 Can't
>>> connect to local MySQL server through socket '/var/run/mysqld/mysqld.soc$
>>> [2016-05-21T17:37:40.589] fatal: You haven't inited this storage yet.
>>>
>>> Thank you in advance
>>> oe
>>> Regards,
>>>
>>>
>>> Husen
>>>
>>> On Sat, May 21, 2016 at 3:16 PM, Husen R <hus...@gmail.com> wrote:
>>>
>>> dear all,
>>>
>>> I tried to configure slurm accounting feature using database.
>>> I already read the instruction available in this page
>>> http://slurm.schedmd.com/accounting.html, but the accounting
>>> feature still not working.
>>> I got this error message when I try to execute sacct command :
>>>
>>> sacct: error: Problem talking to the database: Connection refused
>>>
>>> the following is my slurm.conf:
>>>
>>>
>>> --Slurm.conf
>>>
>>> #
>>> # Sample /etc/slurm.conf for mcr.llnl.gov <http://mcr.llnl.gov>
>>>
>>> #
>>> ControlMachine=head-node
>>> ControlAddr=head-node
>>> #BackupController=mcrj
>>> #BackupAddr=emcrj
>>> #
>>> AuthType=auth/munge
>>> CheckpointType=checkpoint/blcr
>>> #Epilog=/usr/local/slurm/etc/epilog
>>> FastSchedule=1
>>> #JobCompLoc=/var/tmp/jette/slurm.job.log
>>> JobCompType=jobcomp/mysql
>>> #AccountingStorageType=accounting_storage/mysql
>>> AccountingStorageType=accounting_storage/slurmdbd
>>> AccountingStorageHost=localhost
>>> AccountingStoragePass=/var/run/munge/munge.socket.2
>>> ClusterName=comeon
>>> JobCompHost=head-node
>>> JobCompPass=password
>>> JobCompPort=3306
>>> JobCompUser=root
>>> JobCredentialPrivateKey=/usr/loca

[slurm-dev] Re: How to setup slurm database accounting feature

2016-05-21 Thread Husen R
Hi daniel,

Thank you for your reply !

The error regarding the MySQL socket has been solved:
I forgot to run the slurmdbd daemon prior to running the slurmctld daemon.
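
A minimal sketch of that startup order when running the daemons by hand
(-D keeps each daemon in the foreground and -v adds verbose logging):

# start the accounting daemon first, then the controller
slurmdbd -D -v &
sleep 5          # give slurmdbd a moment to come up before slurmctld connects
slurmctld -D -v &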

However, I get this error message when I try to add a cluster using the
sacctmgr command:

--

$ sudo sacctmgr add cluster comeon

 Adding Cluster(s)
  Name  = comeon
Would you like to commit changes? (You have 30 seconds to decide)
(N/y): y
 Database is busy or waiting for lock from other user.
---

How do I fix this?
Thank you in advance.

Regards,


Husen

On Sat, May 21, 2016 at 6:28 PM, Daniel Letai <d...@letai.org.il> wrote:

>
> Does the socket file exists?
> What's in your /etc/my.cnf (or my.cnf.d/some other config file) under
> [mysqld]?
> [mysqld]
> socket=/path/to/datadir/mysql/mysql.sock
>
> If a socket value doesn't exist, either create one, or create a link
> between the actual socket file and /var/run/mysqld/mysqld.sock
> BTW - either you have a typo in your mail, or your socket is misconfigured
> - never saw mysqld.soc (without 'k' at end) as the name of the socket,
> although it's certainly legal.
>
> Other option is that the mysql server is not running - did you start the
> daemon?
>
> On 05/21/2016 01:45 PM, Husen R wrote:
>
>> Re: [slurm-dev] How to setup slurm database accounting feature
>> I checked slurmctld.log, I got this error message. how to solve this ?
>>
>> [2016-05-21T17:37:40.589] error: mysql_real_connect failed: 2002 Can't
>> connect to local MySQL server through socket '/var/run/mysqld/mysqld.soc$
>> [2016-05-21T17:37:40.589] fatal: You haven't inited this storage yet.
>>
>> Thank you in advance
>>
>> Regards,
>>
>>
>> Husen
>>
>> On Sat, May 21, 2016 at 3:16 PM, Husen R <hus...@gmail.com> wrote:
>>
>> dear all,
>>
>> I tried to configure slurm accounting feature using database.
>> I already read the instruction available in this page
>> http://slurm.schedmd.com/accounting.html, but the accounting
>> feature still not working.
>> I got this error message when I try to execute sacct command :
>>
>> sacct: error: Problem talking to the database: Connection refused
>>
>> the following is my slurm.conf:
>>
>>
>> --Slurm.conf
>>
>> #
>> # Sample /etc/slurm.conf for mcr.llnl.gov <http://mcr.llnl.gov>
>>
>> #
>> ControlMachine=head-node
>> ControlAddr=head-node
>> #BackupController=mcrj
>> #BackupAddr=emcrj
>> #
>> AuthType=auth/munge
>> CheckpointType=checkpoint/blcr
>> #Epilog=/usr/local/slurm/etc/epilog
>> FastSchedule=1
>> #JobCompLoc=/var/tmp/jette/slurm.job.log
>> JobCompType=jobcomp/mysql
>> #AccountingStorageType=accounting_storage/mysql
>> AccountingStorageType=accounting_storage/slurmdbd
>> AccountingStorageHost=localhost
>> AccountingStoragePass=/var/run/munge/munge.socket.2
>> ClusterName=comeon
>> JobCompHost=head-node
>> JobCompPass=password
>> JobCompPort=3306
>> JobCompUser=root
>> JobCredentialPrivateKey=/usr/local/etc/slurm.key
>> JobCredentialPublicCertificate=/usr/local/etc/slurm.cert
>> MsgAggregationParams=WindowMsgs=2,WindowTime=100
>> PluginDir=/usr/local/lib/slurm
>> JobCheckpointDir=/mirror/source/cr
>> #Prolog=/usr/local/slurm/etc/prolog
>> MailProg=/usr/bin/mail
>> SchedulerType=sched/backfill
>> SelectType=select/linear
>> SlurmUser=slurm
>> SlurmctldLogFile=/var/tmp/slurmctld.log
>> SlurmctldPort=7002
>> SlurmctldTimeout=300
>> SlurmdPort=7003
>> SlurmdSpoolDir=/var/tmp/slurmd.spool
>> SlurmdTimeout=300
>> SlurmdLogFile=/var/tmp/slurmd.log
>> StateSaveLocation=/var/tmp/slurm.state
>> #SwitchType=switch/none
>> TreeWidth=50
>> #
>> # Node Configurations
>> #
>> NodeName=DEFAULT CPUs=8 RealMemory=5949 TmpDisk=64000 State=UNKNOWN
>> NodeName=head-node,compute-node,spare-node
>> NodeAddr=head-node,compute-node,spare-node SocketsPerBoard=1
>> CoresPerSocket=4 ThreadsPerCore=2
>> #
>> # Partition Configurations
>

[slurm-dev] Re: How to setup slurm database accounting feature

2016-05-21 Thread Husen R
I checked slurmctld.log and got this error message. How do I solve this?

[2016-05-21T17:37:40.589] error: mysql_real_connect failed: 2002 Can't
connect to local MySQL server through socket '/var/run/mysqld/mysqld.soc$
[2016-05-21T17:37:40.589] fatal: You haven't inited this storage yet.

Thank you in advance

Regards,


Husen

On Sat, May 21, 2016 at 3:16 PM, Husen R <hus...@gmail.com> wrote:

> dear all,
>
> I tried to configure slurm accounting feature using database.
> I already read the instruction available in this page
> http://slurm.schedmd.com/accounting.html, but the accounting feature
> still not working.
> I got this error message when I try to execute sacct command :
>
> sacct: error: Problem talking to the database: Connection refused
>
> the following is my slurm.conf:
>
>
> --Slurm.conf
>
> #
> # Sample /etc/slurm.conf for mcr.llnl.gov
> #
> ControlMachine=head-node
> ControlAddr=head-node
> #BackupController=mcrj
> #BackupAddr=emcrj
> #
> AuthType=auth/munge
> CheckpointType=checkpoint/blcr
> #Epilog=/usr/local/slurm/etc/epilog
> FastSchedule=1
> #JobCompLoc=/var/tmp/jette/slurm.job.log
> JobCompType=jobcomp/mysql
> #AccountingStorageType=accounting_storage/mysql
> AccountingStorageType=accounting_storage/slurmdbd
> AccountingStorageHost=localhost
> AccountingStoragePass=/var/run/munge/munge.socket.2
> ClusterName=comeon
> JobCompHost=head-node
> JobCompPass=password
> JobCompPort=3306
> JobCompUser=root
> JobCredentialPrivateKey=/usr/local/etc/slurm.key
> JobCredentialPublicCertificate=/usr/local/etc/slurm.cert
> MsgAggregationParams=WindowMsgs=2,WindowTime=100
> PluginDir=/usr/local/lib/slurm
> JobCheckpointDir=/mirror/source/cr
> #Prolog=/usr/local/slurm/etc/prolog
> MailProg=/usr/bin/mail
> SchedulerType=sched/backfill
> SelectType=select/linear
> SlurmUser=slurm
> SlurmctldLogFile=/var/tmp/slurmctld.log
> SlurmctldPort=7002
> SlurmctldTimeout=300
> SlurmdPort=7003
> SlurmdSpoolDir=/var/tmp/slurmd.spool
> SlurmdTimeout=300
> SlurmdLogFile=/var/tmp/slurmd.log
> StateSaveLocation=/var/tmp/slurm.state
> #SwitchType=switch/none
> TreeWidth=50
> #
> # Node Configurations
> #
> NodeName=DEFAULT CPUs=8 RealMemory=5949 TmpDisk=64000 State=UNKNOWN
> NodeName=head-node,compute-node,spare-node
> NodeAddr=head-node,compute-node,spare-node SocketsPerBoard=1
> CoresPerSocket=4 ThreadsPerCore=2
> #
> # Partition Configurations
> #
> PartitionName=DEFAULT State=UP
> PartitionName=comeon Nodes=head-node,compute-node,spare-node
> MaxTime=168:00:00 MaxNodes=32 Default=YES
>
>
> 
>
> what is the difference between slurmdbd and mysql ?
> based on the information in this page,
> http://slurm.schedmd.com/accounting.html, slurmdbd has its own
> configuration file called slurmdbd.conf.
> is there any example of slurmdbd.conf file ? where should I store this
> file ? how do I setup slurm to read slurmdbd.conf file ?
>
> I have installed mysql. I also have created slurm_acct_db database.
> I need help.
>
> Thank you in advance
>
> regards,
>
>
> Husen
>
>
>
>


[slurm-dev] How to setup slurm database accounting feature

2016-05-21 Thread Husen R
dear all,

I tried to configure the Slurm accounting feature using a database.
I have already read the instructions available on this page,
http://slurm.schedmd.com/accounting.html, but the accounting feature is still
not working.
I get this error message when I try to execute the sacct command:

sacct: error: Problem talking to the database: Connection refused

the following is my slurm.conf:

--Slurm.conf

#
# Sample /etc/slurm.conf for mcr.llnl.gov
#
ControlMachine=head-node
ControlAddr=head-node
#BackupController=mcrj
#BackupAddr=emcrj
#
AuthType=auth/munge
CheckpointType=checkpoint/blcr
#Epilog=/usr/local/slurm/etc/epilog
FastSchedule=1
#JobCompLoc=/var/tmp/jette/slurm.job.log
JobCompType=jobcomp/mysql
#AccountingStorageType=accounting_storage/mysql
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
AccountingStoragePass=/var/run/munge/munge.socket.2
ClusterName=comeon
JobCompHost=head-node
JobCompPass=password
JobCompPort=3306
JobCompUser=root
JobCredentialPrivateKey=/usr/local/etc/slurm.key
JobCredentialPublicCertificate=/usr/local/etc/slurm.cert
MsgAggregationParams=WindowMsgs=2,WindowTime=100
PluginDir=/usr/local/lib/slurm
JobCheckpointDir=/mirror/source/cr
#Prolog=/usr/local/slurm/etc/prolog
MailProg=/usr/bin/mail
SchedulerType=sched/backfill
SelectType=select/linear
SlurmUser=slurm
SlurmctldLogFile=/var/tmp/slurmctld.log
SlurmctldPort=7002
SlurmctldTimeout=300
SlurmdPort=7003
SlurmdSpoolDir=/var/tmp/slurmd.spool
SlurmdTimeout=300
SlurmdLogFile=/var/tmp/slurmd.log
StateSaveLocation=/var/tmp/slurm.state
#SwitchType=switch/none
TreeWidth=50
#
# Node Configurations
#
NodeName=DEFAULT CPUs=8 RealMemory=5949 TmpDisk=64000 State=UNKNOWN
NodeName=head-node,compute-node,spare-node
NodeAddr=head-node,compute-node,spare-node SocketsPerBoard=1
CoresPerSocket=4 ThreadsPerCore=2
#
# Partition Configurations
#
PartitionName=DEFAULT State=UP
PartitionName=comeon Nodes=head-node,compute-node,spare-node
MaxTime=168:00:00 MaxNodes=32 Default=YES



What is the difference between slurmdbd and MySQL?
Based on the information on this page,
http://slurm.schedmd.com/accounting.html, slurmdbd has its own
configuration file called slurmdbd.conf.
Is there an example slurmdbd.conf file? Where should I store this file?
How do I set up Slurm to read the slurmdbd.conf file?
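
(For reference, a hedged sketch of what a minimal slurmdbd.conf for a MySQL
backend might look like; every value below is a placeholder, and the file
normally lives in the same configuration directory as slurm.conf.)

#
# slurmdbd.conf -- minimal sketch, placeholder values
#
AuthType=auth/munge
DbdHost=localhost
SlurmUser=slurm
DebugLevel=4
LogFile=/var/log/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=some_password
StorageLoc=slurm_acct_db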

I have installed MySQL and have also created the slurm_acct_db database.
I need help.

Thank you in advance

regards,


Husen


[slurm-dev] Re: Slurm checkpoint error

2016-05-17 Thread Husen R
This is the output of ls -a -l. The files that appeared in the error
message are 0 bytes in size, and they all result from processes on
the remote nodes.

drwxrwxr-x  2 necis necis  4096 Mei 17 14:44 .
drwxrwxrwx 12 root  root   4096 Mei 17 17:24 ..
-r  1 necis necis 0 Mei 17 14:42 .task.0.ckpt.tmp
-r  1 necis necis 183630160 Mei 17 14:44 task.1.ckpt
-r  1 necis necis 183630200 Mei 17 14:43 task.2.ckpt
-r  1 necis necis 0 Mei 17 14:43 .task.2.ckpt.tmp
-r  1 necis necis 0 Mei 17 14:42 .task.3.ckpt.tmp
-r  1 necis necis 0 Mei 17 14:42 .task.4.ckpt.tmp
-r  1 necis necis 183297635 Mei 17 14:43 task.5.ckpt
-r  1 necis necis 183297635 Mei 17 14:43 task.6.ckpt
-r  1 necis necis 183297635 Mei 17 14:43 task.7.ckpt
-r  1 necis necis 183301731 Mei 17 14:43 task.8.ckpt
-r  1 necis necis 183297635 Mei 17 14:43 task.9.ckpt

Regards,

Husen


On Wed, May 18, 2016 at 7:38 AM, Husen R <hus...@gmail.com> wrote:

> Hi,
>
> This is the output of ls -a:
>
>  .   .task.0.ckpt.tmp  task.2.ckpt   .task.3.ckpt.tmp  task.5.ckpt
>  task.7.ckpt  task.9.ckpt
> ..  task.1.ckpt   .task.2.ckpt.tmp  .task.4.ckpt.tmp  task.6.ckpt
>  task.8.ckpt
>
> This is the output of ls :
>
> task.1.ckpt  task.2.ckpt  task.5.ckpt  task.6.ckpt  task.7.ckpt
>  task.8.ckpt  task.9.ckpt
>
>
> The temporary files appeared when I use ls -a command. What does it mean ?
> I can create file in the directory with touch.
>
> I try to checkpoint mpi application using non root user with slurm
> checkpoint interval feature. I don't directly checkpoint using
> cr_checkpoint command.
>
> Regards,
>
> Husen
>
> On Tue, May 17, 2016 at 10:30 PM, Eric Roman <ero...@lbl.gov> wrote:
>
>>
>>
>> Are the temporary files created?
>>
>> Does ls -a on the directory show the missing files?
>>
>> Can you create files in that directory with touch?
>>
>> Finally, is cr_checkpoint being run by root?  Or some other user?  The
>> checkpoint file will be created by the user invoking cr_checkpoint.
>>
>> Eric
>>
>> On Tue, May 17, 2016 at 01:10:32AM -0700, Husen R wrote:
>> >dear all,
>> >I failed everytime I try to checkpoint MPI application using BLCR in
>> >Slurm. The following is my sbatch script :
>> >##SBATCH SCRIPT
>> >#!/bin/bash
>> >#SBATCH -J MatMul
>> >#SBATCH -o cr/mm-%j.out
>> >#SBATCH -A necis
>> >#SBATCH -N 3
>> >#SBATCH -n 24
>> >#SBATCH --checkpoint=1
>> >#SBATCH --checkpoint-dir=cr
>> >#SBATCH --time=01:30:00
>> >#SBATCH --mail-user=hus...@gmail.com
>> >#SBATCH --mail-type=begin
>> >#SBATCH --mail-type=end
>> >srun --mpi=pmi2 ./mm.o
>> >
>> >I also have tried to run directly using srun command but I failed.
>> The
>> >following is the command I use and the error message that occured.
>> >command :
>> >srun -N2 -n10 --mpi=pmi2 --checkpoint=1 ./mm.o
>> >error :
>> >Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp':
>> Permission
>> >denied
>> >Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
>> >Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp':
>> Permission
>> >denied
>> >Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
>> >Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp':
>> Permission
>> >denied
>> >Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
>> >Received results from task 6
>> >Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp':
>> Permission
>> >denied
>> >Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
>> >Unable to open file '/mirror/source/cr/275.0/.task.2.ckpt.tmp':
>> Permission
>> >denied
>> >Failed to open checkpoint file '/mirror/source/cr/275.0/task.2.ckpt'
>> >Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp':
>> Permission
>> >denied
>> >Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
>> >Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp':
>> Permission
>> >denied
>> >Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt

[slurm-dev] Slurm checkpoint error

2016-05-17 Thread Husen R
dear all,

I fail every time I try to checkpoint an MPI application using BLCR in Slurm.
The following is my sbatch script:

##SBATCH SCRIPT
#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o cr/mm-%j.out
#SBATCH -A necis
#SBATCH -N 3
#SBATCH -n 24
#SBATCH --checkpoint=1
#SBATCH --checkpoint-dir=cr
#SBATCH --time=01:30:00
#SBATCH --mail-user=hus...@gmail.com
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

srun --mpi=pmi2 ./mm.o



I have also tried to run it directly using the srun command, but it failed. The
following is the command I used and the error message that occurred.

command :

srun -N2 -n10 --mpi=pmi2 --checkpoint=1 ./mm.o

error :

Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission
denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission
denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission
denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
Received results from task 6
Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission
denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.2.ckpt.tmp': Permission
denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.2.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission
denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission
denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'


In the cr directory there are 7 .ckpt files, as follows:

task.1.ckpt, task.2.ckpt, task.5.ckpt, task.6.ckpt, task.7.ckpt,
task.8.ckpt and task.9.ckpt.

There are no checkpoint files called task.0.ckpt, task.3.ckpt and
task.4.ckpt, as mentioned in the error message.
/mirror is an NFS directory that is shared across the nodes. I set the cr
directory to permission 777 just to avoid permission issues.

Note: if I submit the command as an sbatch job, I only get a file named
script.ckpt; there is no task.[number].ckpt file.

Can anyone please tell me how to solve this?
Thank you in advance.

Regards,


Husen


[slurm-dev] Re: How to get command of a running/pending job

2016-05-13 Thread Husen R
Dear all, thanks a lot for your replies!

It really helps me.

Regards,


Husen

On Fri, May 13, 2016 at 9:20 PM, Benjamin Redling <
benjamin.ra...@uni-jena.de> wrote:

>
> On 2016-05-13 05:58, Husen R wrote:
> > Does slurm provide feature to get command that being executed/will be
> > executed by running/pending jobs ?
>
> scontrol show --detail job 
> or
> scontrol show -d job 
>
> Benjamin
> --
> FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
> vox: +49 3641 9 44323 | fax: +49 3641 9 44321
>


[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Husen R
Danny :

I'm unable to use the srun_cr command. I got this error message in the
slurmctld log file after submitting srun_cr with sbatch:

[2016-04-14T19:22:42.719] job_complete: JobID=67 State=0x1 NodeCnt=2
WEXITSTATUS 255

Any idea how to fix this?

- Yes, my job needs more than 5 minutes.

Andy :

Yes, /mirror directory is shared across my cluster. I have configured it
using NFS.

Regards,



Husen



On Thu, Apr 14, 2016 at 6:15 PM, Danny Rotscher <
danny.rotsc...@tu-dresden.de> wrote:

> I've found two things, first you could try srun_cr instead of srun and the
> second is, do your job needs more than 5 minutes?!
> But I'm not sure, so you may try it and post the result.
>
>
> Am 14.04.2016 um 12:56 schrieb Husen R:
>
>> Hello Danny,
>>
>> I have tried to restart using "scontrol checkpoint restart " but it
>> doesn't work.
>> In addition, ".0" directory and its content are doesn't exist in my
>> --checkpoint-dir.
>> The following is my batch job :
>>
>> =batch job===
>>
>> #!/bin/bash
>> #SBATCH -J MatMul
>> #SBATCH -o mm-%j.out
>> #SBATCH -A pro
>> #SBATCH -N 3
>> #SBATCH -n 24
>> #SBATCH --checkpoint=5
>> #SBATCH --checkpoint-dir=/mirror/source/cr
>> #SBATCH --time=01:30:00
>> #SBATCH --mail-user=hus...@gmail.com
>> #SBATCH --mail-type=begin
>> #SBATCH --mail-type=end
>>
>> srun --mpi=pmi2 ./mm.o
>>
>> ===end batch job
>>
>> is there something that prevents me from getting the right directory
>> structure ?
>>
>>
>> Regards,
>>
>>
>>
>> Husen
>>
>>
>>
>>
>> On Thu, Apr 14, 2016 at 5:36 PM, Danny Rotscher <
>> danny.rotsc...@tu-dresden.de> wrote:
>>
>> Hello,
>>>
>>> usually the directory, which is specified by --checkpoint-dir, should
>>> have
>>> the following structure:
>>> 
>>> |__ script.ckpt
>>> |__ .0
>>>   |__ task.0.ckpt
>>>   |__ task.1.ckpt
>>>   |__ ...
>>>
>>> But you only have to run the following command to restart your batch job:
>>> scontrol checkpoint restart 
>>>
>>> I tried only batch jobs and currently I try to build MVAPICH2 with BLCR
>>> and Slurm support, because that mpi library is explicitly mentioned in
>>> the
>>> Slurm documentation.
>>>
>>> A colleague also tested DMTCP but no success.
>>>
>>> Kind reagards
>>> Danny
>>> TU Dresden
>>> Germany
>>>
>>>
>>> Am 14.04.2016 um 11:01 schrieb Husen R:
>>>
>>> Hi all,
>>>> Thank you for your reply
>>>>
>>>> Danny :
>>>> I have installed BLCR and SLURM successfully.
>>>> I also have configured CheckpointType, --checkpoint, --checkpoint-dir
>>>> and
>>>> JobCheckpointDir in order for slurm to support checkpoint.
>>>>
>>>> I have tried to checkpoint a simple MPI parallel application many times
>>>> in
>>>> my small cluster, and like you said, after checkpoint is completed there
>>>> is
>>>> a directory named with jobid in  --checkpoint-dir. in that directory
>>>> there
>>>> is a file named "script.ckpt". I tried to restart directly using srun
>>>> command below :
>>>>
>>>> srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o
>>>>
>>>> where --restart-dir is directory that contains "script.ckpt".
>>>> Unfortunately, I got the following error :
>>>>
>>>> Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file
>>>> or
>>>> directory
>>>> srun: error: compute-node: task 0: Exited with exit code 255
>>>>
>>>> As we can see from the error message above, there was no "task.0.ckpt"
>>>> file. I don't know how to get such file. The files that I got from
>>>> checkpoint operation is a file named "script.ckpt" in --checkpoint-dir
>>>> and
>>>> two files in JobCheckpointDir named ".ckpt" and
>>>> ".ckpt.old".
>>>>
>>>> According to the information in section srun in this link
>>>> http://slurm.schedmd.com/checkpoint_blcr.html, after checkpoint is
>>>> completed there should be checkpoint files of the form ".ckpt"
>>>> and
>>>> "..ckpt&qu

[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Husen R
Hello Danny,

I have tried to restart using "scontrol checkpoint restart <jobid>", but it
doesn't work.
In addition, the "<jobid>.0" directory and its contents do not exist in my
--checkpoint-dir.
The following is my batch job:

=batch job===

#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o mm-%j.out
#SBATCH -A pro
#SBATCH -N 3
#SBATCH -n 24
#SBATCH --checkpoint=5
#SBATCH --checkpoint-dir=/mirror/source/cr
#SBATCH --time=01:30:00
#SBATCH --mail-user=hus...@gmail.com
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

srun --mpi=pmi2 ./mm.o

===end batch job

Is there something that prevents me from getting the right directory
structure?


Regards,



Husen




On Thu, Apr 14, 2016 at 5:36 PM, Danny Rotscher <
danny.rotsc...@tu-dresden.de> wrote:

> Hello,
>
> usually the directory, which is specified by --checkpoint-dir, should have
> the following structure:
> <jobid>
> |__ script.ckpt
> |__ <jobid>.0
>  |__ task.0.ckpt
>  |__ task.1.ckpt
>  |__ ...
>
> But you only have to run the following command to restart your batch job:
> scontrol checkpoint restart <jobid>
>
> I have only tried batch jobs, and I am currently trying to build MVAPICH2 with BLCR
> and Slurm support, because that MPI library is explicitly mentioned in the
> Slurm documentation.
>
> A colleague also tested DMTCP but no success.
>
> Kind regards
> Danny
> TU Dresden
> Germany
>
>
> Am 14.04.2016 um 11:01 schrieb Husen R:
>
>> Hi all,
>> Thank you for your reply
>>
>> Danny :
>> I have installed BLCR and SLURM successfully.
>> I also have configured CheckpointType, --checkpoint, --checkpoint-dir and
>> JobCheckpointDir in order for slurm to support checkpoint.
>>
>> I have tried to checkpoint a simple MPI parallel application many times in
>> my small cluster, and like you said, after checkpoint is completed there
>> is
>> a directory named with jobid in  --checkpoint-dir. in that directory there
>> is a file named "script.ckpt". I tried to restart directly using srun
>> command below :
>>
>> srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o
>>
>> where --restart-dir is directory that contains "script.ckpt".
>> Unfortunately, I got the following error :
>>
>> Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file
>> or
>> directory
>> srun: error: compute-node: task 0: Exited with exit code 255
>>
>> As we can see from the error message above, there was no "task.0.ckpt"
>> file. I don't know how to get such a file. The files that I got from the
>> checkpoint operation are a file named "script.ckpt" in --checkpoint-dir and
>> two files in JobCheckpointDir named "<jobid>.ckpt" and "<jobid>.ckpt.old".
>>
>> According to the information in the srun section at this link,
>> http://slurm.schedmd.com/checkpoint_blcr.html, after the checkpoint is
>> completed there should be checkpoint files of the form "<jobid>.ckpt" and
>> "<jobid>.<stepid>.ckpt" in --checkpoint-dir.
>>
>> Any idea to solve this ?
>>
>> Manuel :
>>
>> Yes, BLCR doesn't support checkpoint/restart parallel/distributed
>> application by itself (
>> https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi).
>> But it can be used by other software to do that (I hope the software is
>> SLURM..huhu)
>>
>> I have ever tried to restart mpi application using DMTCP but it doesn't
>> work.
>> Would you please tell me how to do that ?
>>
>>
>> Thank you in advance,
>>
>> Regards,
>>
>>
>> Husen
>>
>>
>>
>>
>>
>> On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <
>> danny.rotsc...@tu-dresden.de> wrote:
>>
>>> I forgot to add something: you have to create a directory for the
>>> checkpoint metadata, which is by default located in
>>> /var/slurm/checkpoint:
>>> mkdir -p /var/slurm/checkpoint
>>> chown -R slurm /var/slurm
>>> or you define your own directory in slurm.conf:
>>> JobCheckpointDir=<directory>
>>>
>>> You can check the parameters with:
>>> scontrol show config | grep checkpoint
>>>
>>> Kind regards,
>>> Danny
>>> TU Dresden
>>> Germany
>>>
>>> Am 14.04.2016 um 06:41 schrieb Danny Rotscher:
>>>
>>> Hello,
>>>>
>>>> we haven't gotten it to work either, but we have already built Slurm with BLCR.
>>>>
>>>> You first have to install the BLCR library, which is described on the
>>>> following website:
>>>>

[slurm-dev] Re: Slurm Checkpoint/Restart example

2016-04-14 Thread Husen R
Hi all,
Thank you for your reply

Danny :
I have installed BLCR and SLURM successfully.
I have also configured CheckpointType, --checkpoint, --checkpoint-dir and
JobCheckpointDir in order for Slurm to support checkpointing.

I have tried to checkpoint a simple MPI parallel application many times in
my small cluster, and like you said, after the checkpoint is completed there is
a directory named with the jobid in --checkpoint-dir. In that directory there
is a file named "script.ckpt". I tried to restart directly using the srun
command below:

srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o

where --restart-dir is the directory that contains "script.ckpt".
Unfortunately, I got the following error:

Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or
directory
srun: error: compute-node: task 0: Exited with exit code 255

As we can see from the error message above, there was no "task.0.ckpt"
file. I don't know how to get such a file. The files that I got from the
checkpoint operation are a file named "script.ckpt" in --checkpoint-dir and
two files in JobCheckpointDir named "<jobid>.ckpt" and "<jobid>.ckpt.old".

According to the information in the srun section at this link,
http://slurm.schedmd.com/checkpoint_blcr.html, after the checkpoint is
completed there should be checkpoint files of the form "<jobid>.ckpt" and
"<jobid>.<stepid>.ckpt" in --checkpoint-dir.

Any idea to solve this ?

Manuel :

Yes, BLCR doesn't support checkpointing/restarting parallel/distributed
applications by itself ( https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi).
But it can be used by other software to do that (I hope the software is
SLURM..huhu)

I have tried to restart an MPI application using DMTCP but it didn't
work.
Would you please tell me how to do that?


Thank you in advance,

Regards,


Husen





On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher <
danny.rotsc...@tu-dresden.de> wrote:

> I forgot to add something: you have to create a directory for the
> checkpoint metadata, which is by default located in /var/slurm/checkpoint:
> mkdir -p /var/slurm/checkpoint
> chown -R slurm /var/slurm
> or you define your own directory in slurm.conf:
> JobCheckpointDir=<directory>
>
> You can check the parameters with:
> scontrol show config | grep checkpoint
>
> Kind regards,
> Danny
> TU Dresden
> Germany
>
> Am 14.04.2016 um 06:41 schrieb Danny Rotscher:
>
>> Hello,
>>
>> we haven't gotten it to work either, but we have already built Slurm with BLCR.
>>
>> You first have to install the BLCR library, which is described on the
>> following website:
>> https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html
>>
>> Then we built and installed Slurm from source, and BLCR checkpointing was
>> included.
>>
>> After that you have to set at least one parameter in the file
>> "slurm.conf":
>> CheckpointType=checkpoint/blcr
>>
>> There are two ways to create checkpoints: you can either make a
>> checkpoint with the following command from outside your job:
>> scontrol checkpoint create <jobid>
>> or you can let Slurm do periodic checkpoints with the following
>> sbatch parameter:
>> #SBATCH --checkpoint <interval>
>> We also tried:
>> #SBATCH --checkpoint <hours>:<minutes>
>> e.g.
>> #SBATCH --checkpoint 0:10
>> to test it, but it doesn't work for us.
>>
>> We also set the parameter for the checkpoint directory:
>> #SBATCH --checkpoint-dir <directory>
>>
>> After you create a checkpoint and a directory named after your jobid has
>> been created in your checkpoint directory, you can restart the job with the
>> following command:
>> scontrol checkpoint restart <jobid>
>>
>> We tested some sequential and OpenMP programs with different parameters
>> and it works (checkpoint creation and restarting),
>> but *we can't get any MPI library to work*; we have already tested some
>> programs built with Open MPI and Intel MPI.
>> The checkpoint is created, but we get the following error when we
>> try to restart them:
>> - Failed to open file '/'
>> - cr_restore_all_files [28534]:  Unable to restore fd 3 (type=1,err=-21)
>> - cr_rstrt_child [28534]:  Unable to restore files!  (err=-21)
>> Restart failed: Is a directory
>> srun: error: taurusi4010: task 0: Exited with exit code 21
>>
>> So, it would be great if you could confirm our problems; maybe then
>> SchedMD will raise the priority of such mails ;-)
>> If you get it to work, please help us to understand how.
>>
>> Kind regards,
>> Danny
>> TU Dresden
>> Germany
>>
>> Am 11.04.2016 um 10:09 schrieb Husen R:
>>
>>> Hi all,
>>>
>>> Based on the information in this link
>>> http://slurm.schedmd.com/checkpoint_blcr.html,
>>> Slurm is able to checkpoint whole batch jobs and then restart execution
>>> of
>>> batch jobs and job steps from checkpoint files.
>>>
>>> Could anyone please tell me how to do that?
>>> I need help.
>>>
>>> Thank you in advance.
>>>
>>> Regards,
>>>
>>>
>>> Husen Rusdiansyah
>>> University of Indonesia
>>>
>>
>>
>
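
Pulling Danny's steps together, a minimal BLCR checkpoint/restart sketch based
only on the messages above would be (the interval, the directories and <jobid>
are illustrative, not a verified recipe):

# slurm.conf
CheckpointType=checkpoint/blcr
JobCheckpointDir=/var/slurm/checkpoint

# in the batch script
#SBATCH --checkpoint=5
#SBATCH --checkpoint-dir=/mirror/source/cr
srun --mpi=pmi2 ./mm.o

# from outside the job
scontrol checkpoint create <jobid>    # take a checkpoint now
scontrol checkpoint restart <jobid>   # restart the batch job from its checkpoint files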


[slurm-dev] Slurm Checkpoint/Restart example

2016-04-11 Thread Husen R
Hi all,

Based on the information in this link
http://slurm.schedmd.com/checkpoint_blcr.html,
Slurm is able to checkpoint whole batch jobs and then restart execution of
batch jobs and job steps from checkpoint files.

Could anyone please tell me how to do that?
I need help.

Thank you in advance.

Regards,


Husen Rusdiansyah
University of Indonesia


[slurm-dev] Re: sbatch always produces pending jobs

2016-04-08 Thread Husen R
Hi Emily,

Thank you for the information.

How do I prevent the node from entering the DRAIN state?
I didn't set its state to DRAIN.

Thank you in advance

Regards,

Husen
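
In case it helps others reading the archive: a node stuck in the DRAIN state
can usually be returned to service by an administrator with scontrol, once the
reason reported by 'sinfo -R' has been dealt with. A sketch, using the node
name head-node from the sinfo output quoted further down in this message:

scontrol update NodeName=head-node State=RESUME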



On Fri, Apr 8, 2016 at 8:16 PM, E.M. Dragowsky <dragow...@case.edu> wrote:

> Hi, Husen --
>
> The DRAIN state means the node is not available for jobs, at least as far
> as I understand from the documentation describing scontrol:
>
> If you want to remove a node from service, you typically want to set its
> state to "DRAIN".
>
> Cheers,
> ~ Emily
>
> --
> E.M. Dragowsky, Ph.D.
> ITS -- Research Computing
> Case Western Reserve University
> (216) 368-0082
>
> On Fri, Apr 8, 2016 at 8:47 AM, Husen R <hus...@gmail.com> wrote:
>
>> Hello Remi,
>>
>> Thank you for your reply.
>>
>> here is the output of 'sinfo' and 'sinfo -R' respectively:
>>
>> pro@head-node:~$ sinfo
>> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
>> comeon*  up  30:00  1  drain head-node
>> pro@head-node:~$ sinfo -R
>> REASON   USER  TIMESTAMP   NODELIST
>> batch job complete f root  2016-04-08T16:16:38 head-node
>>
>> The state of my node is drain. I don't understand why the resources are
>> not available. Currently, I am not running any resource-hungry application
>> on that node.
>>
>> Regards,
>>
>>
>> Husen
>>
>>
>> On Fri, Apr 8, 2016 at 7:23 PM, Rémi Palancher <r...@rezib.org> wrote:
>>
>>>
>>> Le 08/04/2016 13:39, Husen R a écrit :
>>>
>>>> [...]
>>>> pro@head-node:/mirror/source$ squeue
>>>>   JOBID   PARTITIONNAME  USER ST   TIME
>>>>   NODES NODELIST(REASON)
>>>>  70comeon MatMul  pro PD   0:00
>>>>   1(Resources)
>>>>  71comeon MatMul  pro PD   0:00
>>>>   1(Resources)
>>>>  72comeon MatMul  pro PD   0:00
>>>>   1(Resources)
>>>>
>>>
>>> In the last column, squeue gives you the reason why the jobs are pending.
>>> "Resources" means there are not enough resources available to run the jobs.
>>>
>>> Check the state of your nodes using `sinfo`.
>>>
>>> Best,
>>> Rémi
>>>
>>
>>
>


[slurm-dev] Re: sbatch always produces pending jobs

2016-04-08 Thread Husen R
Hello Remi,

Thank you for your reply.

here is the output of 'sinfo' and 'sinfo -R' respectively:

pro@head-node:~$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
comeon*  up  30:00  1  drain head-node
pro@head-node:~$ sinfo -R
REASON   USER  TIMESTAMP   NODELIST
batch job complete f root  2016-04-08T16:16:38 head-node

The state of my node is drain. I don't understand why the resources are not
available. Currently, I am not running any resource-hungry application on that
node.

Regards,


Husen


On Fri, Apr 8, 2016 at 7:23 PM, Rémi Palancher <r...@rezib.org> wrote:

>
> Le 08/04/2016 13:39, Husen R a écrit :
>
>> [...]
>> pro@head-node:/mirror/source$ squeue
>>   JOBID   PARTITIONNAME  USER ST   TIME
>>   NODES NODELIST(REASON)
>>  70comeon MatMul  pro PD   0:00
>>   1(Resources)
>>  71comeon MatMul  pro PD   0:00
>>   1(Resources)
>>  72comeon MatMul  pro PD   0:00
>>   1(Resources)
>>
>
> In the last column, squeue gives you the reason why the jobs are pending.
> "Resources" means there are not enough resources available to run the jobs.
>
> Check the state of your nodes using `sinfo`.
>
> Best,
> Rémi
>


[slurm-dev] sbatch always produces pending jobs

2016-04-08 Thread Husen R
Hello all,

Every time I use sbatch, the job is always in pending status, so it is never
executed.
I have tried to find the solution in the mail archive but I didn't find a match.
For debugging simplicity, I run slurmctld and slurmd in one machine.

Following is the output of squeue command :

pro@head-node:/mirror/source$ squeue
 JOBID   PARTITIONNAME  USER ST   TIME
 NODES NODELIST(REASON)
70comeon MatMul  pro PD   0:00
 1(Resources)
71comeon MatMul  pro PD   0:00
 1(Resources)
72comeon MatMul  pro PD   0:00
 1(Resources)

here is the control machine and compute node configuration in slurm.conf:

ControlMachine=head-node
ControlAddr=head-node
#BackupController=
#BackupAddr=
...
...
...
# COMPUTE NODES
NodeName=DEFAULT CPUs=8 RealMemory=5949 TmpDisk=281483 State=UNKNOWN
NodeName=head-node NodeAddr=head-node SocketsPerBoard=1 CoresPerSocket=4
ThreadsPerCore=2

PartitionName=DEFAULT State=UP
PartitionName=comeon Nodes=head-node MaxTime=30 MaxNodes=2 Default=YES


and here is my sbatch script :

#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o myMM.%j.out
#SBATCH -A pro
#SBATCH -N 1
#SBATCH -n 2
#SBATCH --time=00:30:00
#SBATCH --mail-user=hus...@gmail.com
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

salloc mpiexec ./mm.o
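
One detail that may be worth checking (a guess based on the rest of this
archive, not a confirmed diagnosis): sbatch already creates the allocation, so
the program is normally launched inside the script with srun (or mpiexec)
rather than with salloc, which requests a second allocation. Elsewhere in this
archive the same binary is launched like this:

srun --mpi=pmi2 ./mm.o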




Could anyone please tell me how to solve this?
Is there something misconfigured?

Thank you in advance

Regards,

Husen
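
For completeness, two commands that help when jobs sit pending with
(Resources); sinfo -R appears in the replies archived above, and head-node is
this cluster's only node:

sinfo -R                       # lists down/drained nodes and the recorded reason
scontrol show node head-node   # full node record, including State= and Reason=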


[slurm-dev] Re: Failed to access munge.socket.2

2016-04-08 Thread Husen R
Hello Lachlan,

Thank you for your reply.

Yes, you're right. munge.socket.2 is created when the system runs..huhu
I use Slurm version 15.08.10. The OS is Ubuntu 14.04 LTS 64 bit.

Currently, I have installed munge and slurm successfully.
Previously, munge.socket.2 could not be accessed because at configure time
I used a customized location in the "--prefix=" option. So I decided to
reinstall munge using the instructions available at this link:
https://github.com/dun/munge/wiki/Installation-Guide.

Regards,

Husen


On Fri, Apr 8, 2016 at 9:00 AM, Simpson Lachlan <
lachlan.simp...@petermac.org> wrote:

> Husen,
>
>
>
> You won’t be able to find the file – it’s created when the system runs so
> that the system knows something is running :)
>
>
>
> Everything in /var/run is ephemeral.
>
>
>
> Ok, what version of slurm are you running, which bits have you installed
> and what OS are you installing it onto?
>
>
>
> Yes, there isn’t a munge in /var/run yet – that’s why you should create
> the tmpfiles.d like I said – that will create it on boot. In the meantime,
> you can just create the directory in /var/run/, you will need to chown
> munge:munge after you have created it.
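
For reference, the tmpfiles.d entry mentioned here is normally a one-line file
such as /etc/tmpfiles.d/munge.conf (a sketch for systemd-based systems; adjust
the path and ownership if your munge build uses a different runtime directory):

d /var/run/munge 0755 munge munge -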
>
>
>
> Cheers
>
> L.
>
>
>
> *From:* Husen R [mailto:hus...@gmail.com]
> *Sent:* Thursday, 7 April 2016 12:07 PM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: Failed to access munge.socket.2
>
>
>
> Hello Lachlan,Chris
>
>
>
> Thank you for your reply.
>
>
>
> I don't know why "/usr/local" is appended to the path..
>
> I tried to locate munge.socket.2 manually using the locate command and the
> file indeed does not exist.
>
> The directory /usr/local/var/run/munge is empty.
>
>
>
> There is no munge directory in /var/run. I don't know why the munge
> directory is located in /usr/local/var/run instead of in /var/run.
>
>
>
> I previously installed slurm-llnl from the repository before installing it
> from source. Could this be the cause of the problem?
>
>
>
> Regards,
>
>
>
> Husen
>
>
>
> On Thu, Apr 7, 2016 at 8:26 AM, Christopher Samuel <sam...@unimelb.edu.au>
> wrote:
>
>
> On 06/04/16 19:50, Husen R wrote:
>
> > however, when I tried to run sbatch I get the following error message:
> >
> > Failed to access "/usr/local/var/run/munge/munge.socket.2": No such file
> > or directory
>
> Is that path really correct?
>
> On our systems it's: /var/run/munge/munge.socket.2
>
> Best of luck,
> Chris
> --
>  Christopher Samuel    Senior Systems Administrator
>  VLSCI - Victorian Life Sciences Computation Initiative
>  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>  http://www.vlsci.org.au/  http://twitter.com/vlsci
>
>


[slurm-dev] Re: Failed to access munge.socket.2

2016-04-07 Thread Husen R
Hello all,

Currently I have installed and configured slurm-15.08.10 and munge-0.5.12
successfully.

-- It seems "/usr/local" appeared in the path to munge.socket.2
(/usr/local/var/run/munge/munge.socket.2) because at the configure step I
included it as the "--prefix" value. So I decided to reinstall munge using
the instructions available at this link
https://github.com/dun/munge/wiki/Installation-Guide and the problem is
solved.
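
For anyone hitting the same prefix problem: the linked guide builds munge with
system paths, so the socket ends up under /var/run/munge instead of
/usr/local/var/run/munge. Roughly (a sketch; check the guide for the exact
flags for your version):

./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var
make && sudo make install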

-- In my experience installing munge, a machine reboot was required once it
was installed; otherwise, an error message regarding permissions appeared
when I attempted to use it. (I don't know if this is normal
behavior or not.)

Regards,


Husen



On Thu, Apr 7, 2016 at 8:26 AM, Christopher Samuel <sam...@unimelb.edu.au>
wrote:

>
> On 06/04/16 19:50, Husen R wrote:
>
> > however, when I tried to run sbatch I get the following error message:
> >
> > Failed to access "/usr/local/var/run/munge/munge.socket.2": No such file
> > or directory
>
> Is that path really correct?
>
> On our systems it's: /var/run/munge/munge.socket.2
>
> Best of luck,
> Chris
> --
>  Christopher Samuel    Senior Systems Administrator
>  VLSCI - Victorian Life Sciences Computation Initiative
>  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>  http://www.vlsci.org.au/  http://twitter.com/vlsci
>


[slurm-dev] Re: Failed to access munge.socket.2

2016-04-06 Thread Husen R
Hello Lachlan,Chris

Thank you for your reply.

I don't know why "/usr/local" is appended to the path..
I tried to locate munge.socket.2 manually using the locate command and the file
indeed does not exist.
The directory /usr/local/var/run/munge is empty.

There is no munge directory in /var/run. I don't know why the munge
directory is located in /usr/local/var/run instead of in /var/run.

I previously installed slurm-llnl from the repository before installing it
from source. Could this be the cause of the problem?

Regards,

Husen

On Thu, Apr 7, 2016 at 8:26 AM, Christopher Samuel <sam...@unimelb.edu.au>
wrote:

>
> On 06/04/16 19:50, Husen R wrote:
>
> > however, when I tried to run sbatch I get the following error message:
> >
> > Failed to access "/usr/local/var/run/munge/munge.socket.2": No such file
> > or directory
>
> Is that path really correct?
>
> On our systems it's: /var/run/munge/munge.socket.2
>
> Best of luck,
> Chris
> --
>  Christopher Samuel    Senior Systems Administrator
>  VLSCI - Victorian Life Sciences Computation Initiative
>  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>  http://www.vlsci.org.au/  http://twitter.com/vlsci
>


[slurm-dev] Failed to access munge.socket.2

2016-04-06 Thread Husen R
Hello everyone,

I have installed slurm-15.08.9 successfully.
however, when I tried to run sbatch I get the following error message:

Failed to access "/usr/local/var/run/munge/munge.socket.2": No such file or
directory

I tried to solve this problem by reinstalling munge and recreating
munge.key.
I have also propagated the munge key to every node in my cluster (
https://github.com/dun/munge/wiki/Installation-Guide#starting-the-daemon).
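
As a side note, a quick way to verify that the munge key works across nodes
(the standard test from the linked guide; compute-node is just an example
hostname from this cluster) is:

munge -n | ssh compute-node unmunge

If unmunge reports STATUS: Success, the credential and the key are fine on
both ends.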

However, the error still appears.

Could anyone please tell me how to solve this problem?
Sorry for this very basic question.
Thank you in advance.

Regards,


Husen


[slurm-dev] checkpoint/restart feature in SLURM

2016-03-19 Thread Husen R
Dear Slurm-dev,


Is the checkpoint/restart feature available in SLURM able to relocate an MPI
application from one node to another node while it is running?

For example, I run an MPI application on nodes A, B and C in a cluster and I
want to migrate/relocate the process running on node A to another node, let's
say node C, while it is running.

Is there a way to do this with SLURM? Thank you.


Regards,

Husen