[slurm-dev] Re: Trying to get a simple slurm cluster going
On Mon, Jul 18, 2016 at 5:52 AM, P. Larry Nelson wrote:
>
> Hello,
>
> While I am in search of real hardware on which to build/test Slurm,
> I am attempting to just play around with it on a test VM (Scientific
> Linux 6.8), which, of course, is using NATted networking and is a
> standalone system protected from the outside world.
>
> I downloaded the latest (16.05.2) tarball, ran rpmbuild,
> and then installed all the RPMs. I ran the Easy Configurator
> and gave it the hostname of the VM for the ControlMachine
> and the loopback address of 127.0.0.1 for the ControlAddr.

Try filling ControlMachine and ControlAddr with the same value (e.g. the hostname).

> I made a munge key and munge started just fine.
>
> When I do a 'service slurm start', it responds "OK" for both slurmctld
> and slurmd, but slurmctld dies right away.
>
> If I do a 'slurmctld -Dvvv', I get:
>
> slurmctld: pidfile not locked, assuming no running daemon
> slurmctld: debug: creating clustername file: /var/spool/clustername
> slurmctld: fatal: _create_clustername_file: failed to create file
> /var/spool/clustername
>
> slurm.conf has this for ClusterName:
> ClusterName=SlurmCluster
>
> So why is slurmctld trying to create the file /var/spool/clustername
> instead of /var/spool/SlurmCluster?

clustername is just a filename; you can see your ClusterName inside that file.

> slurmd and slurmctld are started as root.
> I'm obviously missing something here.
>
> Thanks!
> - Larry
>
> --
> P. Larry Nelson (217-244-9855) | IT Administrator
> 457 Loomis Lab                 | High Energy Physics Group
> 1110 W. Green St., Urbana, IL  | Physics Dept., Univ. of Ill.
> MailTo: lnel...@illinois.edu
> http://hep.physics.illinois.edu/home/lnelson/
>
> "Information without accountability is just noise." - P.L. Nelson

--
Post Graduate Student
Faculty of Computer Science
University of Indonesia
Depok
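The fatal error at the end of this post is typically a permissions problem rather than a naming one: slurmctld writes a file literally named `clustername` inside StateSaveLocation (here apparently /var/spool), and since the daemon normally drops privileges to SlurmUser even when started by root, it fails if that directory is not writable by that user. A minimal sketch of the check, using a temporary directory for illustration (a real setup might point StateSaveLocation at something like /var/spool/slurmctld owned by the slurm user):

```shell
# Verify/create the state directory slurmctld will write into.
# STATE_DIR is illustrative; match it to StateSaveLocation in slurm.conf.
STATE_DIR="${STATE_DIR:-${TMPDIR:-/tmp}/slurmctld-state}"
mkdir -p "$STATE_DIR"
# In production, as root: chown slurm:slurm "$STATE_DIR" && chmod 755 "$STATE_DIR"
touch "$STATE_DIR/clustername" \
  && echo "state dir writable: $STATE_DIR" \
  || echo "state dir NOT writable: $STATE_DIR" >&2
```

If the directory is writable by SlurmUser, `slurmctld -Dvvv` should get past the `_create_clustername_file` step.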
[slurm-dev] Re: number of processes in slurm job
On Tue, Jul 12, 2016 at 3:04 PM, David Ramírez <drami...@sie.es> wrote:
> Hi Husen
>
> I haven't used MVAPICH for some years, but I remember that when you
> compile MVAPICH2 you must indicate:
>
> ./configure --with-pm=no --with-pmi=slurm

Hi David,

Thanks for the information.

> I hope you can fix it ;)
>
> I used srun --mpi=pmi2 in some programs and it works too.
>
> *From:* Husen R [mailto:hus...@gmail.com]
> *Sent:* Tuesday, July 12, 2016 9:59
> *To:* slurm-dev <slurm-dev@schedmd.com>
> *Subject:* [slurm-dev] Re: number of processes in slurm job
>
> On Tue, Jul 12, 2016 at 2:53 PM, David Ramírez <drami...@sie.es> wrote:
>
>> Hi Husen.
>>
>> Did you compile OpenMPI with Slurm? Indicating, for example,
>> --with-slurm and --with-pmi?
>
> Hi David,
>
> I use MVAPICH2.
> I found that I have to use srun instead of mpirun.
> I use this command: srun --mpi=pmi2 ./mm.o 6000
> It works.
>
>> I use OpenMPI and it works fine without indicating the MPI process
>> count in my batch file.
>>
>> *From:* Husen R [mailto:hus...@gmail.com]
>> *Sent:* Tuesday, July 12, 2016 9:41
>> *To:* slurm-dev <slurm-dev@schedmd.com>
>> *Subject:* [slurm-dev] Re: number of processes in slurm job
>>
>> On Tue, Jul 12, 2016 at 2:24 PM, Loris Bennett
>> <loris.benn...@fu-berlin.de> wrote:
>>
>>> Husen R <hus...@gmail.com> writes:
>>>
>>> > Re: [slurm-dev] Re: number of processes in slurm job
>>> >
>>> > Hi,
>>> >
>>> > Thanks for your reply !
>>> >
>>> > I use this sbatch script:
>>> >
>>> > #!/bin/bash
>>> > #SBATCH -J mm6kn2_03
>>> > #SBATCH -o 6kn203-%j.out
>>> > #SBATCH -A necis
>>> > #SBATCH -N 3
>>> > #SBATCH -n 16
>>> > #SBATCH --time=05:30:00
>>> >
>>> > mpirun ./mm.o 6000
>>>
>>> You need to tell 'mpirun' how many processes to start. If you do not,
>>> probably all available cores will be used. So it looks like you have 6
>>> cores per node and thus 'mpirun' starts 18 processes. You should write
>>> something like
>>>
>>> mpirun -np ${SLURM_NTASKS} ./mm.o 6000
>>
>> Without specifying the -np value, and with "#SBATCH -n 16" as written
>> in my sbatch script, I hoped mpirun would use 16 as the number of
>> processes. However, I just realized mpirun doesn't read the sbatch
>> script.
>>
>>> Cheers,
>>>
>>> Loris
>>>
>>> > regards,
>>> >
>>> > Husen
>>> >
>>> > On Tue, Jul 12, 2016 at 1:21 PM, Loris Bennett
>>> > <loris.benn...@fu-berlin.de> wrote:
>>> >
>>> > Husen R <hus...@gmail.com> writes:
>>> >
>>> > > number of processes in slurm job
>>> > >
>>> > > Hi all,
>>> > >
>>> > > I tried to run a job on 3 nodes (N=3) with 16 processes (n=16),
>>> > > but Slurm automatically changes that n value to 18 (n=18).
>>> > >
>>> > > I also tried other combinations of n values that are not equally
>>> > > divided by N, but Slurm automatically changes those n values to
>>> > > values that are equally divided by N.
>>> > >
>>> > > How do I change this behavior?
>>> > > I need to use a specific value of n for experimental purposes.
>>> > >
>>> > > Thank you in advance.
>>> > >
>>> > > Regards,
>>> > >
>>> > > Husen
>>> >
>>> > You need to give more details about what you did. How did you set
>>> > the number of processes?
>>> >
>>> > Cheers,
>>> >
>>> > Loris
>>> >
>>> > --
>>> > Dr. Loris Bennett (Mr.)
>>> > ZEDAT, Freie Universität Berlin   Email loris.benn...@fu-berlin.de
>>>
>>> --
>>> Dr. Loris Bennett (Mr.)
>>> ZEDAT, Freie Universität Berlin   Email loris.benn...@fu-berlin.de
[slurm-dev] Re: number of processes in slurm job
On Tue, Jul 12, 2016 at 2:53 PM, David Ramírez <drami...@sie.es> wrote:
> Hi Husen.
>
> Did you compile OpenMPI with Slurm? Indicating, for example,
> --with-slurm and --with-pmi?

Hi David,

I use MVAPICH2.
I found that I have to use srun instead of mpirun.
I use this command: srun --mpi=pmi2 ./mm.o 6000
It works.

> I use OpenMPI and it works fine without indicating the MPI process
> count in my batch file.
>
> *From:* Husen R [mailto:hus...@gmail.com]
> *Sent:* Tuesday, July 12, 2016 9:41
> *To:* slurm-dev <slurm-dev@schedmd.com>
> *Subject:* [slurm-dev] Re: number of processes in slurm job
>
> On Tue, Jul 12, 2016 at 2:24 PM, Loris Bennett
> <loris.benn...@fu-berlin.de> wrote:
>
>> Husen R <hus...@gmail.com> writes:
>>
>> > Re: [slurm-dev] Re: number of processes in slurm job
>> >
>> > Hi,
>> >
>> > Thanks for your reply !
>> >
>> > I use this sbatch script:
>> >
>> > #!/bin/bash
>> > #SBATCH -J mm6kn2_03
>> > #SBATCH -o 6kn203-%j.out
>> > #SBATCH -A necis
>> > #SBATCH -N 3
>> > #SBATCH -n 16
>> > #SBATCH --time=05:30:00
>> >
>> > mpirun ./mm.o 6000
>>
>> You need to tell 'mpirun' how many processes to start. If you do not,
>> probably all available cores will be used. So it looks like you have 6
>> cores per node and thus 'mpirun' starts 18 processes. You should write
>> something like
>>
>> mpirun -np ${SLURM_NTASKS} ./mm.o 6000
>
> Without specifying the -np value, and with "#SBATCH -n 16" as written
> in my sbatch script, I hoped mpirun would use 16 as the number of
> processes. However, I just realized mpirun doesn't read the sbatch
> script.
>
>> Cheers,
>>
>> Loris
>>
>> > regards,
>> >
>> > Husen
>> >
>> > On Tue, Jul 12, 2016 at 1:21 PM, Loris Bennett
>> > <loris.benn...@fu-berlin.de> wrote:
>> >
>> > Husen R <hus...@gmail.com> writes:
>> >
>> > > number of processes in slurm job
>> > >
>> > > Hi all,
>> > >
>> > > I tried to run a job on 3 nodes (N=3) with 16 processes (n=16),
>> > > but Slurm automatically changes that n value to 18 (n=18).
>> > >
>> > > I also tried other combinations of n values that are not equally
>> > > divided by N, but Slurm automatically changes those n values to
>> > > values that are equally divided by N.
>> > >
>> > > How do I change this behavior?
>> > > I need to use a specific value of n for experimental purposes.
>> > >
>> > > Thank you in advance.
>> > >
>> > > Regards,
>> > >
>> > > Husen
>> >
>> > You need to give more details about what you did. How did you set
>> > the number of processes?
>> >
>> > Cheers,
>> >
>> > Loris
>> >
>> > --
>> > Dr. Loris Bennett (Mr.)
>> > ZEDAT, Freie Universität Berlin   Email loris.benn...@fu-berlin.de
>>
>> --
>> Dr. Loris Bennett (Mr.)
>> ZEDAT, Freie Universität Berlin   Email loris.benn...@fu-berlin.de
[slurm-dev] Re: number of processes in slurm job
On Tue, Jul 12, 2016 at 2:37 PM, Carlos Fenoy <mini...@gmail.com> wrote:
> If you do not specify the number of nodes, does it work as expected?

It ran with 2 nodes, not 3 nodes.
I just realized that I have to use srun instead of mpirun in order for
Slurm to run my job as I expected.

> On Tue, 12 Jul 2016, 09:25 Loris Bennett, <loris.benn...@fu-berlin.de>
> wrote:
>
>> Husen R <hus...@gmail.com> writes:
>>
>> > Re: [slurm-dev] Re: number of processes in slurm job
>> >
>> > Hi,
>> >
>> > Thanks for your reply !
>> >
>> > I use this sbatch script:
>> >
>> > #!/bin/bash
>> > #SBATCH -J mm6kn2_03
>> > #SBATCH -o 6kn203-%j.out
>> > #SBATCH -A necis
>> > #SBATCH -N 3
>> > #SBATCH -n 16
>> > #SBATCH --time=05:30:00
>> >
>> > mpirun ./mm.o 6000
>>
>> You need to tell 'mpirun' how many processes to start. If you do not,
>> probably all available cores will be used. So it looks like you have 6
>> cores per node and thus 'mpirun' starts 18 processes. You should write
>> something like
>>
>> mpirun -np ${SLURM_NTASKS} ./mm.o 6000
>>
>> Cheers,
>>
>> Loris
>>
>> > regards,
>> >
>> > Husen
>> >
>> > On Tue, Jul 12, 2016 at 1:21 PM, Loris Bennett
>> > <loris.benn...@fu-berlin.de> wrote:
>> >
>> > Husen R <hus...@gmail.com> writes:
>> >
>> > > number of processes in slurm job
>> > >
>> > > Hi all,
>> > >
>> > > I tried to run a job on 3 nodes (N=3) with 16 processes (n=16),
>> > > but Slurm automatically changes that n value to 18 (n=18).
>> > >
>> > > I also tried other combinations of n values that are not equally
>> > > divided by N, but Slurm automatically changes those n values to
>> > > values that are equally divided by N.
>> > >
>> > > How do I change this behavior?
>> > > I need to use a specific value of n for experimental purposes.
>> > >
>> > > Thank you in advance.
>> > >
>> > > Regards,
>> > >
>> > > Husen
>> >
>> > You need to give more details about what you did. How did you set
>> > the number of processes?
>> >
>> > Cheers,
>> >
>> > Loris
>> >
>> > --
>> > Dr. Loris Bennett (Mr.)
>> > ZEDAT, Freie Universität Berlin   Email loris.benn...@fu-berlin.de
>>
>> --
>> Dr. Loris Bennett (Mr.)
>> ZEDAT, Freie Universität Berlin   Email loris.benn...@fu-berlin.de
[slurm-dev] Re: number of processes in slurm job
On Tue, Jul 12, 2016 at 2:24 PM, Loris Bennett <loris.benn...@fu-berlin.de>
wrote:
>
> Husen R <hus...@gmail.com> writes:
>
> > Re: [slurm-dev] Re: number of processes in slurm job
> >
> > Hi,
> >
> > Thanks for your reply !
> >
> > I use this sbatch script:
> >
> > #!/bin/bash
> > #SBATCH -J mm6kn2_03
> > #SBATCH -o 6kn203-%j.out
> > #SBATCH -A necis
> > #SBATCH -N 3
> > #SBATCH -n 16
> > #SBATCH --time=05:30:00
> >
> > mpirun ./mm.o 6000
>
> You need to tell 'mpirun' how many processes to start. If you do not,
> probably all available cores will be used. So it looks like you have 6
> cores per node and thus 'mpirun' starts 18 processes. You should write
> something like
>
> mpirun -np ${SLURM_NTASKS} ./mm.o 6000

Without specifying the -np value, and with "#SBATCH -n 16" as written in my
sbatch script, I hoped mpirun would use 16 as the number of processes.
However, I just realized mpirun doesn't read the sbatch script.

> Cheers,
>
> Loris
>
> > regards,
> >
> > Husen
> >
> > On Tue, Jul 12, 2016 at 1:21 PM, Loris Bennett
> > <loris.benn...@fu-berlin.de> wrote:
> >
> > Husen R <hus...@gmail.com> writes:
> >
> > > number of processes in slurm job
> > >
> > > Hi all,
> > >
> > > I tried to run a job on 3 nodes (N=3) with 16 processes (n=16),
> > > but Slurm automatically changes that n value to 18 (n=18).
> > >
> > > I also tried other combinations of n values that are not equally
> > > divided by N, but Slurm automatically changes those n values to
> > > values that are equally divided by N.
> > >
> > > How do I change this behavior?
> > > I need to use a specific value of n for experimental purposes.
> > >
> > > Thank you in advance.
> > >
> > > Regards,
> > >
> > > Husen
> >
> > You need to give more details about what you did. How did you set the
> > number of processes?
> >
> > Cheers,
> >
> > Loris
> >
> > --
> > Dr. Loris Bennett (Mr.)
> > ZEDAT, Freie Universität Berlin   Email loris.benn...@fu-berlin.de
>
> --
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin   Email loris.benn...@fu-berlin.de
[slurm-dev] Re: number of processes in slurm job
Hi,

Thanks for your reply !

I use this sbatch script:

#!/bin/bash
#SBATCH -J mm6kn2_03
#SBATCH -o 6kn203-%j.out
#SBATCH -A necis
#SBATCH -N 3
#SBATCH -n 16
#SBATCH --time=05:30:00

mpirun ./mm.o 6000

regards,

Husen

On Tue, Jul 12, 2016 at 1:21 PM, Loris Bennett <loris.benn...@fu-berlin.de>
wrote:
>
> Husen R <hus...@gmail.com> writes:
>
> > number of processes in slurm job
> >
> > Hi all,
> >
> > I tried to run a job on 3 nodes (N=3) with 16 processes (n=16),
> > but Slurm automatically changes that n value to 18 (n=18).
> >
> > I also tried other combinations of n values that are not equally
> > divided by N, but Slurm automatically changes those n values to
> > values that are equally divided by N.
> >
> > How do I change this behavior?
> > I need to use a specific value of n for experimental purposes.
> >
> > Thank you in advance.
> >
> > Regards,
> >
> > Husen
>
> You need to give more details about what you did. How did you set the
> number of processes?
>
> Cheers,
>
> Loris
>
> --
> Dr. Loris Bennett (Mr.)
> ZEDAT, Freie Universität Berlin   Email loris.benn...@fu-berlin.de
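Pulling together the fixes suggested elsewhere in this thread, the script could look like the sketch below. The program mm.o, the account necis, and the job geometry are taken from the original script; the srun line is the MVAPICH2/PMI2 launch the thread converges on, and the commented mpirun line shows the alternative of binding mpirun to Slurm's task count:

```
#!/bin/bash
#SBATCH -J mm6kn2_03
#SBATCH -o 6kn203-%j.out
#SBATCH -A necis
#SBATCH -N 3
#SBATCH -n 16
#SBATCH --time=05:30:00

# Let Slurm launch the MPI tasks (MVAPICH2 built with PMI2 support):
srun --mpi=pmi2 ./mm.o 6000

# Alternative: mpirun does not read #SBATCH directives, so the task
# count must be passed explicitly:
# mpirun -np "${SLURM_NTASKS}" ./mm.o 6000
```

With srun, the launched task count follows -n directly, which avoids the 16-vs-18 mismatch described in the original question.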
[slurm-dev] Node order in squeue nodelist column
Hi,

I'm wondering: how does Slurm arrange nodes in the squeue NODELIST column?
Are the nodes just arranged in alphabetical order (based on node name) from
left to right, or is it priority order?

The following is the output of the squeue command:

JOBID PARTITION     NAME  USER ST   TIME NODES NODELIST(REASON)
 1194     part1 md50n2_0 necis  R  30:26     2 compute-node,head-node

In the NODELIST column, compute-node always appears before head-node.
I have tried to set a weight for each node in slurm.conf, but the node
order is still unchanged.

It seems that the first node in the NODELIST is responsible for computing
MPI rank 0, and this affects my experiment results. I want MPI rank 0 to
run on head-node, because that node is the submit node.

Thank you in advance,

Husen
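As a side note, squeue's NODELIST column is a compressed host list, which is printed in sorted (essentially alphabetical) order, so it does not necessarily reflect which node runs rank 0. One way to make the allocation itself deterministic is sbatch's --nodelist option, which requires specific nodes to be part of the job; whether rank 0 then lands on head-node still depends on the MPI launcher's task distribution, so treat this as a sketch:

```
#SBATCH -N 2
#SBATCH --nodelist=head-node,compute-node   # require these exact nodes
```

Checking where rank 0 actually ran (e.g. by printing the hostname from the rank-0 process) is the reliable test, rather than reading the order in squeue.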
[slurm-dev] Re: How to setup node sequence
Hi,

I tried to set a weight for each node, but the order of the nodes still has
not changed. The following is part of my slurm.conf:

NodeName=node1 weight=1
NodeName=node2 weight=10 ...
NodeName=node3 weight=20 ...

With that configuration I want Slurm to choose node1 when a job requests
one node, but the selected node is always node2.

Any idea how to solve this?

thank you in advance,

Husen

On Tue, Jun 14, 2016 at 12:57 PM, Husen R <hus...@gmail.com> wrote:
> Thanks !
>
> I'll check it out.
>
> Regards,
>
> Husen
>
> On Mon, Jun 13, 2016 at 5:40 PM, Benjamin Redling <
> benjamin.ra...@uni-jena.de> wrote:
>
>> On 06/13/2016 09:50, Husen R wrote:
>> > Hi all,
>> >
>> > How to setup node sequence/order in slurm ?
>> > I configured nodes in slurm.conf like this -> Nodes=head,compute,spare.
>> >
>> > Using that configuration, if I use one node in my job, I hope Slurm
>> > will choose head as the computing node (as it is first in the order).
>> > However, Slurm always chooses compute, not head.
>> >
>> > how to fix this ?
>>
>> http://slurm.schedmd.com/slurm.conf.html
>>
>> "
>> Weight
>> The priority of the node for scheduling purposes. All things being
>> equal, jobs will be allocated the nodes with the lowest weight which
>> satisfies their requirements. For example, a heterogeneous collection of
>> nodes might be placed into a single partition for greater system
>> utilization, responsiveness and capability. It would be preferable to
>> allocate smaller memory nodes rather than larger memory nodes if either
>> will satisfy a job's requirements. The units of weight are arbitrary,
>> but larger weights should be assigned to nodes with more processors,
>> memory, disk space, higher processor speed, etc. Note that if a job
>> allocation request can not be satisfied using the nodes with the lowest
>> weight, the set of nodes with the next lowest weight is added to the set
>> of nodes under consideration for use (repeat as needed for higher weight
>> values). If you absolutely want to minimize the number of higher weight
>> nodes allocated to a job (at a cost of higher scheduling overhead), give
>> each node a distinct Weight value and they will be added to the pool of
>> nodes being considered for scheduling individually. The default value
>> is 1.
>> "
>>
>> Benjamin
>> --
>> FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
>> vox: +49 3641 9 44323 | fax: +49 3641 9 44321
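The quoted documentation can be condensed into a sketch: Weight only breaks ties among nodes that all satisfy the request, so for it to drive the choice the nodes must otherwise be eligible and equivalent. The CPU counts below are hypothetical; lower Weight is allocated first:

```
# slurm.conf sketch (assumed CPU counts): lower Weight is preferred
# when every listed node can satisfy the job's requirements.
NodeName=node1 CPUs=8 Weight=1
NodeName=node2 CPUs=8 Weight=10
NodeName=node3 CPUs=8 Weight=20
PartitionName=part1 Nodes=node1,node2,node3 Default=YES State=UP
```

If node1 is still skipped, it is worth checking `scontrol show node node1` for a reason the node cannot satisfy the request (state, memory, CPUs), since weights are only consulted among eligible nodes.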
[slurm-dev] RE: working directory of a completed job
Hello Giovanni,

Thank you for your reply!

The job completion log is not yet configured in my Slurm config.

On Tue, Jun 21, 2016 at 8:48 PM, Giovanni Torres <torre...@helix.nih.gov>
wrote:
> You could get this from the job completion log:
>
> $ scontrol show config | grep JobComp
>
> Giovanni
>
> *From:* Husen R [mailto:hus...@gmail.com]
> *Sent:* Tuesday, June 21, 2016 1:59 AM
> *To:* slurm-dev <slurm-dev@schedmd.com>
> *Subject:* [slurm-dev] working directory of a completed job
>
> Hi,
>
> How do I get the workdir of a completed job?
>
> The command "scontrol show job JOBID" works only for running jobs.
>
> If I use that command for a completed job, the following error message
> appears:
> "slurm_load_jobs error: Invalid job id specified"
>
> thank you in advance
>
> Regards,
>
> Husen
[slurm-dev] Re: working directory of a completed job
Hello Jason,

Thank you for your reply !

I use slurmdbd as my AccountingStorageType... and in the end I just handle
the jobs that are in the running/pending state. I use the bash command
"scontrol show job jobid | grep WorkDir | cut -d "=" -f2" to get the
WorkDir.

Regards,

Husen

On Tue, Jun 21, 2016 at 9:48 PM, Jason Bacon <bacon4...@gmail.com> wrote:
>
> Hello Husen,
>
> See http://slurm.schedmd.com/sacct.html.
>
> If you're using AccountingStorageType=accounting_storage/filetxt, you can
> also grep/awk/more the JobComp file.
>
> Jason
>
> On 06/21/16 00:56, Husen R wrote:
>
>> working directory of a completed job
>>
>> Hi,
>>
>> How do I get the workdir of a completed job?
>> The command "scontrol show job JOBID" works only for running jobs.
>>
>> If I use that command for a completed job, the following error message
>> appears:
>> "slurm_load_jobs error: Invalid job id specified"
>>
>> thank you in advance
>> Regards,
>>
>> Husen
>
> --
> All wars are civil wars, because all men are brothers ... Each one owes
> infinitely more to the human race than to the particular country in
> which he was born.
> -- Francois Fenelon
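The parsing step in that one-liner can be sketched offline: `scontrol show job` prints Key=Value fields, so grep/cut pulls the value out. The sample text below stands in for live scontrol output, and the path in it is made up:

```shell
# Stand-in for: scontrol show job "$JOBID"
sample='   Command=/home/husen/job.sh
   WorkDir=/home/husen/experiments'

# Same pipeline as in the post: select the WorkDir field, keep the value.
workdir=$(printf '%s\n' "$sample" | grep WorkDir | cut -d '=' -f2)
echo "$workdir"   # -> /home/husen/experiments
```

Note this only works while the job is still known to slurmctld (running/pending), which is exactly the limitation discussed in this thread; completed jobs have to come from accounting (sacct) instead.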
[slurm-dev] working directory of a completed job
Hi,

How do I get the workdir of a completed job?
The command "scontrol show job JOBID" works only for running jobs.

If I use that command for a completed job, the following error message
appears:
"slurm_load_jobs error: Invalid job id specified"

thank you in advance

Regards,

Husen
[slurm-dev] Re: How to setup slurm database accounting feature
Hi Chris,

Thank you for your reply !

I followed your suggestion: I killed the slurmdbd process and dropped all
tables in slurm_acct_db. However, when I try to run "sudo sacctmgr add
cluster hpctesis", the error message "Database is busy or waiting for lock
from other user." still appears. I therefore decided to change my cluster
name to something else, and that works! I don't know why my first cluster
name doesn't work.

In addition, I have a new question. Does the sacct command only display
jobs from the current day? I use sacct to display all jobs, but what I get
is only the jobs from the current day; jobs executed before the current day
do not appear. I know I can use "sacct -c" to display job completions, but
that command doesn't display jobs in the RUNNING state.

So, is there a way to change sacct's behavior, so that I can display all
jobs (RUNNING, FAILED, CANCELLED, COMPLETED, etc.) from every day available
in the Slurm database at once?

Thank you in advance.

Regards,

Husen

On Mon, May 23, 2016 at 12:25 PM, Christopher Samuel
<sam...@unimelb.edu.au> wrote:
>
> On 23/05/16 14:08, Husen R wrote:
>
> > anyone can tell me how to solve this please ?
>
> Kill all your slurmdbd's first, and check that all the associated
> processes are gone.
>
> Then as long as you've not got any important data there (and that seems
> unlikely if you can't create your first cluster) drop all the tables in
> that database by hand and then start one slurmdbd in debugging mode with:
>
> slurmdbd -D -v
>
> and see what happens.
>
> By the way, is this standard MySQL or MariaDB, or is it a clustered
> version (Galera/Percona-xtradb/etc)?
>
> All the best,
> Chris
> --
> Christopher Samuel    Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/ http://twitter.com/vlsci
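On the sacct question: by default sacct only reports jobs since the start of the current day. The --starttime/-S option (optionally with --endtime/-E) widens the window, and --state filters on job state, so something along these lines should list jobs from earlier days as well (the date is illustrative):

```
sacct --starttime=2016-05-01 \
      --state=RUNNING,PENDING,COMPLETED,FAILED,CANCELLED \
      --format=JobID,JobName,State,Start,End
```

Leaving out --state entirely, with just a wide --starttime, lists jobs in every state over that window.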
[slurm-dev] Re: How to setup slurm database accounting feature
Hi,

In order to solve the error message "Database is busy or waiting for lock
from other user.", I tried to look at the process list in MySQL. This is
the output of the "show processlist \G;" command:

*** 1. row ***
  Id: 10  User: root  Host: localhost  db: slurm_acct_db
  Command: Sleep  Time: 2556  State:  Info: NULL
*** 2. row ***
  Id: 11  User: root  Host: localhost  db: slurm_acct_db
  Command: Sleep  Time: 153  State:  Info: NULL
*** 3. row ***
  Id: 39  User: root  Host: localhost  db: slurm_jobcomp_db
  Command: Sleep  Time: 2556  State:  Info: NULL
*** 4. row ***
  Id: 41  User: root  Host: localhost  db: slurm_acct_db
  Command: Query  Time: 2468  State: Waiting for table metadata lock
  Info: create table if not exists "hpctesis_assoc_table" (`creation_time` int unsigned not null, `mod_time`
*** 5. row ***
  Id: 42  User: root  Host: localhost  db: slurm_acct_db
  Command: Query  Time: 2405  State: Waiting for table metadata lock
  Info: create table if not exists "hpctesis_assoc_table" (`creation_time` int unsigned not null, `mod_time`
*** 6. row ***
  Id: 43  User: root  Host: localhost  db: slurm_acct_db
  Command: Query  Time: 2372  State: Waiting for table metadata lock
  Info: create table if not exists "hpctesis_assoc_table" (`creation_time` int unsigned not null, `mod_time`
*** 7. row ***
  Id: 44  User: root  Host: localhost  db: slurm_acct_db
  Command: Query  Time: 2357  State: Waiting for table metadata lock
  Info: create table if not exists "hpctesis_assoc_table" (`creation_time` int unsigned not null, `mod_time`
*** 8. row ***
  Id: 46  User: root  Host: localhost  db: slurm_acct_db
  Command: Query  Time: 2099  State: Waiting for table metadata lock
  Info: create table if not exists "hpctesis_assoc_table" (`creation_time` int unsigned not null, `mod_time`
*** 9. row ***
  Id: 47  User: root  Host: localhost  db: NULL
  Command: Query  Time: 0  State: NULL  Info: show processlist
9 rows in set (0.00 sec)

ERROR: No query specified

Based on the output above, I see several processes with the info "create
table if not exists "hpctesis_assoc_table" (`creation_time` int unsigned
not null, `mod_time`" and the state "Waiting for table metadata lock".

I checked the slurm_acct_db database: the table named hpctesis_assoc_table
exists, but when I try to run a SQL SELECT command the MySQL server seems
to be unresponsive. I guess this is the cause of the problem that prevents
me from running the sacctmgr command.

Can anyone tell me how to solve this, please?

Thank you in advance,

Regards,

Husen

On Mon, May 23, 2016 at 10:26 AM, Husen R <hus...@gmail.com> wrote:
> Hi,
>
> Yes, I can connect as the slurmdbd 'storageuser'. I can also create and
> drop tables.
> I don't know how to solve this.
> The message "Database is busy or waiting for lock from other user." keeps
> appearing every time I try to add a cluster using sacctmgr.
>
> I need help
>
> Regards,
>
> Husen
>
> On Sun, May 22, 2016 at 2:05 PM, Daniel Letai <d...@letai.org.il> wrote:
>
>> It might be a permissions issue - can you connect as the slurmdbd
>> 'storageuser' to your db and create and drop tables?
>> From http://slurm.schedmd.com/accounting.html :
>>
>> - *StorageUser*: Define the name of the user we are going to connect
>>   to the database with to store the job accounting data.
>>
>> MySQL Configuration
>>
>> While Slurm will create the database tables automatically you will need
>> to make sure the StorageUser is given permissions in the MySQL or
>> MariaDB database to do so. As the *mysql* user grant privileges to that
>> user using a command such as:
>>
>> GRANT ALL ON StorageLoc.* TO 'StorageUser'@'StorageHost';
>> (The ticks are needed)
>>
>> (You need to be root to do this. Also in the info for password usage
>> there is a line that starts with '->'. This is a continuation prompt
>> since the previous mysql statement did not end with a ';'. It assumes
>> that you wish to input more info.)
>>
>> If you want Slurm to create the database itself, and any future
>> databases, you can change your grant line to be *.* instead of
>> StorageLoc.*
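One way to clear a "Waiting for table metadata lock" pile-up like the one shown above is to terminate the idle connections that still hold the tables open, then restart slurmdbd so it can re-run its CREATE TABLE statements cleanly. The session ids below are taken from the processlist in this post and are only illustrative; adjust them to your own output:

```sql
-- Re-check who is blocking, then kill the long-sleeping sessions that
-- still hold the accounting tables open (ids 10 and 39 had been idle
-- ~2556 s in the processlist above).
SHOW PROCESSLIST;
KILL 10;
KILL 39;
```

After the sleeping sessions are gone, the queued `create table if not exists ...` statements should either complete or error out, and `sacctmgr add cluster` can be retried.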
[slurm-dev] Re: How to setup slurm database accounting feature
Hi,

Yes, I can connect as the slurmdbd 'storageuser'. I can also create and
drop tables.
I don't know how to solve this.
The message "Database is busy or waiting for lock from other user." keeps
appearing every time I try to add a cluster using sacctmgr.

I need help

Regards,

Husen

On Sun, May 22, 2016 at 2:05 PM, Daniel Letai <d...@letai.org.il> wrote:
> It might be a permissions issue - can you connect as the slurmdbd
> 'storageuser' to your db and create and drop tables?
> From http://slurm.schedmd.com/accounting.html :
>
> - *StorageUser*: Define the name of the user we are going to connect
>   to the database with to store the job accounting data.
>
> MySQL Configuration
>
> While Slurm will create the database tables automatically you will need
> to make sure the StorageUser is given permissions in the MySQL or MariaDB
> database to do so. As the *mysql* user grant privileges to that user
> using a command such as:
>
> GRANT ALL ON StorageLoc.* TO 'StorageUser'@'StorageHost';
> (The ticks are needed)
>
> (You need to be root to do this. Also in the info for password usage
> there is a line that starts with '->'. This is a continuation prompt
> since the previous mysql statement did not end with a ';'. It assumes
> that you wish to input more info.)
>
> If you want Slurm to create the database itself, and any future
> databases, you can change your grant line to be *.* instead of
> StorageLoc.*
>
> On 05/22/2016 06:16 AM, Husen R wrote:
>
>> Hi,
>>
>> The following is the error message I got from slurmdbd.log. I got this
>> error message after I tried to add my cluster name hpctesis to slurmdbd
>> using the command "sudo sacctmgr add cluster hpctesis".
>>
>> [2016-05-22T10:04:33.047] error: We should have gotten a new id: Table
>> 'slurm_acct_db.hpctesis_job_table' doesn't exist
>> [2016-05-22T10:04:33.047] error: couldn't add job 386 at job completion
>> [2016-05-22T10:04:33.047] DBD_JOB_COMPLETE: cluster not registered
>>
>> Should I create a table named hpctesis_job_table manually?
>>
>> As far as I understood, Slurm should be able to do this by itself. Am I
>> right? How do I solve this?
>>
>> I need help.
>> Thank you in advance,
>>
>> Regards,
>>
>> Husen
>>
>> On Sat, May 21, 2016 at 7:31 PM, Husen R <hus...@gmail.com> wrote:
>>
>>> Hi Daniel,
>>>
>>> Thank you for your reply !
>>>
>>> The error regarding the mysql socket has been solved.
>>> I forgot to run the slurmdbd daemon prior to running the slurmctld
>>> daemon.
>>>
>>> However, I got this error message when I try to add a cluster using
>>> the sacctmgr command:
>>>
>>> $ sudo sacctmgr add cluster comeon
>>>
>>> Adding Cluster(s)
>>>   Name = comeon
>>> Would you like to commit changes? (You have 30 seconds to decide)
>>> (N/y): y
>>> Database is busy or waiting for lock from other user.
>>>
>>> How to fix this?
>>> Thank you in advance.
>>>
>>> Regards,
>>>
>>> Husen
>>>
>>> On Sat, May 21, 2016 at 6:28 PM, Daniel Letai <d...@letai.org.il>
>>> wrote:
>>>
>>>> Does the socket file exist?
>>>> What's in your /etc/my.cnf (or my.cnf.d/some other config file) under
>>>> [mysqld]?
>>>>
>>>> [mysqld]
>>>> socket=/path/to/datadir/mysql/mysql.sock
>>>>
>>>> If a socket value doesn't exist, either create one, or create a link
>>>> between the actual socket file and /var/run/mysqld/mysqld.sock
>>>> BTW - either you have a typo in your mail, or your socket is
>>>> misconfigured - I never saw mysqld.soc (without 'k' at the end) as
>>>> the name of the socket, although it's certainly legal.
>>>>
>>>> The other option is that the mysql server is not running - did you
>>>> start the daemon?
>>>>
>>>> On 05/21/2016 01:45 PM, Husen R wrote:
>>>>
>>>>> Re: [slurm-dev] How to setup slurm database accounting feature
>>>>> I checked slurmctld.log, I got this error message. How do I solve
>>>>> this?
>>>>>
>>>>> [2016-05-21T17:37:40.589] error: mysql_real_connect failed: 2002
>>>>> Can't connect to local MySQL server through socket
>>>>> '/var/run/mysqld/mysqld.soc$
>>>>> [2016-05-21T17:37:40.589] fatal: You haven't inited this storage yet.
>>>>>
>>>>> Thank you in advance
[slurm-dev] Re: How to setup slurm database accounting feature
Hi, The following is the error message I got from slurmdbd.log. I got this error message after I try to add my clustername=hpctesis to slurmdbd using command "sudo sacctmgr add cluster hpctesis". [2016-05-22T10:04:33.047] error: We should have gotten a new id: Table 'slurm_acct_db.hpctesis_job_table' doesn't exist [2016-05-22T10:04:33.047] error: couldn't add job 386 at job completion [2016-05-22T10:04:33.047] DBD_JOB_COMPLETE: cluster not registered Should I create a table named hpctesis_job_table manually ? as far as I understood, slurm should able to do this by it self..am I right ? how to solve this ? I need help. Thank you in advance, Regards, Husen On Sat, May 21, 2016 at 7:31 PM, Husen R <hus...@gmail.com> wrote: > Hi daniel, > > Thank you for your reply ! > > The error regarding mysql socket has been solved. > I forget to run slurmdbd daemon prior to running slurmctld daemon. > > however, I got this error message when I try to add cluster using sacctmgr > command : > > > -- > > $ sudo sacctmgr add cluster comeon > > Adding Cluster(s) > Name = comeon > Would you like to commit changes? (You have 30 seconds to decide) > (N/y): y > Database is busy or waiting for lock from other user. > > --- > > How to fix this ? > Thank you in advance. > > Regards, > > > Husen > > On Sat, May 21, 2016 at 6:28 PM, Daniel Letai <d...@letai.org.il> wrote: > >> >> Does the socket file exists? >> What's in your /etc/my.cnf (or my.cnf.d/some other config file) under >> [mysqld]? >> [mysqld] >> socket=/path/to/datadir/mysql/mysql.sock >> >> If a socket value doesn't exist, either create one, or create a link >> between the actual socket file and /var/run/mysqld/mysqld.sock >> BTW - either you have a typo in your mail, or your socket is >> misconfigured - never saw mysqld.soc (without 'k' at end) as the name of >> the socket, although it's certainly legal. >> >> Other option is that the mysql server is not running - did you start the >> daemon? 
>> >> On 05/21/2016 01:45 PM, Husen R wrote: >> >>> Re: [slurm-dev] How to setup slurm database accounting feature >>> I checked slurmctld.log, I got this error message. how to solve this ? >>> >>> [2016-05-21T17:37:40.589] error: mysql_real_connect failed: 2002 Can't >>> connect to local MySQL server through socket '/var/run/mysqld/mysqld.soc$ >>> [2016-05-21T17:37:40.589] fatal: You haven't inited this storage yet. >>> >>> Thank you in advance >>> oe >>> Regards, >>> >>> >>> Husen >>> >>> On Sat, May 21, 2016 at 3:16 PM, Husen R <hus...@gmail.com >> hus...@gmail.com>> wrote: >>> >>> dear all, >>> >>> I tried to configure slurm accounting feature using database. >>> I already read the instruction available in this page >>> http://slurm.schedmd.com/accounting.html, but the accounting >>> feature still not working. >>> I got this error message when I try to execute sacct command : >>> >>> sacct: error: Problem talking to the database: Connection refused >>> >>> the following is my slurm.conf: >>> >>> >>> --Slurm.conf >>> >>> # >>> # Sample /etc/slurm.conf for mcr.llnl.gov <http://mcr.llnl.gov> >>> >>> # >>> ControlMachine=head-node >>> ControlAddr=head-node >>> #BackupController=mcrj >>> #BackupAddr=emcrj >>> # >>> AuthType=auth/munge >>> CheckpointType=checkpoint/blcr >>> #Epilog=/usr/local/slurm/etc/epilog >>> FastSchedule=1 >>> #JobCompLoc=/var/tmp/jette/slurm.job.log >>> JobCompType=jobcomp/mysql >>> #AccountingStorageType=accounting_storage/mysql >>> AccountingStorageType=accounting_storage/slurmdbd >>> AccountingStorageHost=localhost >>> AccountingStoragePass=/var/run/munge/munge.socket.2 >>> ClusterName=comeon >>> JobCompHost=head-node >>> JobCompPass=password >>> JobCompPort=3306 >>> JobCompUser=root >>> JobCredentialPrivateKey=/usr/loca
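[Editor's note] The "cluster not registered" and missing hpctesis_job_table errors in this thread typically appear when jobs complete before the cluster has been registered in the accounting database; the per-cluster tables are created by slurmdbd at registration time, not by hand. A possible recovery sequence, as a sketch only (the cluster name is taken from the thread; it assumes a running slurmdbd reachable by sacctmgr and passwordless sudo):

```shell
# Register the cluster in slurmdbd; this is what creates the
# per-cluster tables (e.g. hpctesis_job_table) in slurm_acct_db.
# -i answers "yes" immediately instead of waiting at the prompt.
sudo sacctmgr -i add cluster hpctesis

# Confirm the cluster is now known to the accounting database.
sacctmgr list cluster

# Restart slurmctld so it re-registers with slurmdbd and can
# flush the pending job completion records.
sudo service slurm restart
```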
[slurm-dev] Re: How to setup slurm database accounting feature
Hi Daniel,

Thank you for your reply!

The error regarding the mysql socket has been solved. I forgot to run the slurmdbd daemon prior to running the slurmctld daemon.

However, I got this error message when I tried to add a cluster using the sacctmgr command:

$ sudo sacctmgr add cluster comeon
Adding Cluster(s)
Name = comeon
Would you like to commit changes? (You have 30 seconds to decide) (N/y): y
Database is busy or waiting for lock from other user.

How do I fix this? Thank you in advance.

Regards,

Husen

On Sat, May 21, 2016 at 6:28 PM, Daniel Letai <d...@letai.org.il> wrote: > > Does the socket file exists? > What's in your /etc/my.cnf (or my.cnf.d/some other config file) under > [mysqld]? > [mysqld] > socket=/path/to/datadir/mysql/mysql.sock > > If a socket value doesn't exist, either create one, or create a link > between the actual socket file and /var/run/mysqld/mysqld.sock > BTW - either you have a typo in your mail, or your socket is misconfigured > - never saw mysqld.soc (without 'k' at end) as the name of the socket, > although it's certainly legal. > > Other option is that the mysql server is not running - did you start the > daemon? > > On 05/21/2016 01:45 PM, Husen R wrote: > >> Re: [slurm-dev] How to setup slurm database accounting feature >> I checked slurmctld.log, I got this error message. how to solve this ? >> >> [2016-05-21T17:37:40.589] error: mysql_real_connect failed: 2002 Can't >> connect to local MySQL server through socket '/var/run/mysqld/mysqld.soc$ >> [2016-05-21T17:37:40.589] fatal: You haven't inited this storage yet. >> >> Thank you in advance >> Regards, >> >> >> Husen >> >> On Sat, May 21, 2016 at 3:16 PM, Husen R <hus...@gmail.com > hus...@gmail.com>> wrote: >> >> dear all, >> >> I tried to configure slurm accounting feature using database. >> I already read the instruction available in this page >> http://slurm.schedmd.com/accounting.html, but the accounting >> feature still not working. 
>> I got this error message when I try to execute sacct command : >> >> sacct: error: Problem talking to the database: Connection refused >> >> the following is my slurm.conf: >> >> >> --Slurm.conf >> >> # >> # Sample /etc/slurm.conf for mcr.llnl.gov <http://mcr.llnl.gov> >> >> # >> ControlMachine=head-node >> ControlAddr=head-node >> #BackupController=mcrj >> #BackupAddr=emcrj >> # >> AuthType=auth/munge >> CheckpointType=checkpoint/blcr >> #Epilog=/usr/local/slurm/etc/epilog >> FastSchedule=1 >> #JobCompLoc=/var/tmp/jette/slurm.job.log >> JobCompType=jobcomp/mysql >> #AccountingStorageType=accounting_storage/mysql >> AccountingStorageType=accounting_storage/slurmdbd >> AccountingStorageHost=localhost >> AccountingStoragePass=/var/run/munge/munge.socket.2 >> ClusterName=comeon >> JobCompHost=head-node >> JobCompPass=password >> JobCompPort=3306 >> JobCompUser=root >> JobCredentialPrivateKey=/usr/local/etc/slurm.key >> JobCredentialPublicCertificate=/usr/local/etc/slurm.cert >> MsgAggregationParams=WindowMsgs=2,WindowTime=100 >> PluginDir=/usr/local/lib/slurm >> JobCheckpointDir=/mirror/source/cr >> #Prolog=/usr/local/slurm/etc/prolog >> MailProg=/usr/bin/mail >> SchedulerType=sched/backfill >> SelectType=select/linear >> SlurmUser=slurm >> SlurmctldLogFile=/var/tmp/slurmctld.log >> SlurmctldPort=7002 >> SlurmctldTimeout=300 >> SlurmdPort=7003 >> SlurmdSpoolDir=/var/tmp/slurmd.spool >> SlurmdTimeout=300 >> SlurmdLogFile=/var/tmp/slurmd.log >> StateSaveLocation=/var/tmp/slurm.state >> #SwitchType=switch/none >> TreeWidth=50 >> # >> # Node Configurations >> # >> NodeName=DEFAULT CPUs=8 RealMemory=5949 TmpDisk=64000 State=UNKNOWN >> NodeName=head-node,compute-node,spare-node >> NodeAddr=head-node,compute-node,spare-node SocketsPerBoard=1 >> CoresPerSocket=4 ThreadsPerCore=2 >> # >> # Partition Configurations >&g
[slurm-dev] Re: How to setup slurm database accounting feature
I checked slurmctld.log, I got this error message. how to solve this ? [2016-05-21T17:37:40.589] error: mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.soc$ [2016-05-21T17:37:40.589] fatal: You haven't inited this storage yet. Thank you in advance Regards, Husen On Sat, May 21, 2016 at 3:16 PM, Husen R <hus...@gmail.com> wrote: > dear all, > > I tried to configure slurm accounting feature using database. > I already read the instruction available in this page > http://slurm.schedmd.com/accounting.html, but the accounting feature > still not working. > I got this error message when I try to execute sacct command : > > sacct: error: Problem talking to the database: Connection refused > > the following is my slurm.conf: > > > --Slurm.conf > > # > # Sample /etc/slurm.conf for mcr.llnl.gov > # > ControlMachine=head-node > ControlAddr=head-node > #BackupController=mcrj > #BackupAddr=emcrj > # > AuthType=auth/munge > CheckpointType=checkpoint/blcr > #Epilog=/usr/local/slurm/etc/epilog > FastSchedule=1 > #JobCompLoc=/var/tmp/jette/slurm.job.log > JobCompType=jobcomp/mysql > #AccountingStorageType=accounting_storage/mysql > AccountingStorageType=accounting_storage/slurmdbd > AccountingStorageHost=localhost > AccountingStoragePass=/var/run/munge/munge.socket.2 > ClusterName=comeon > JobCompHost=head-node > JobCompPass=password > JobCompPort=3306 > JobCompUser=root > JobCredentialPrivateKey=/usr/local/etc/slurm.key > JobCredentialPublicCertificate=/usr/local/etc/slurm.cert > MsgAggregationParams=WindowMsgs=2,WindowTime=100 > PluginDir=/usr/local/lib/slurm > JobCheckpointDir=/mirror/source/cr > #Prolog=/usr/local/slurm/etc/prolog > MailProg=/usr/bin/mail > SchedulerType=sched/backfill > SelectType=select/linear > SlurmUser=slurm > SlurmctldLogFile=/var/tmp/slurmctld.log > SlurmctldPort=7002 > SlurmctldTimeout=300 > SlurmdPort=7003 > SlurmdSpoolDir=/var/tmp/slurmd.spool > SlurmdTimeout=300 > 
SlurmdLogFile=/var/tmp/slurmd.log > StateSaveLocation=/var/tmp/slurm.state > #SwitchType=switch/none > TreeWidth=50 > # > # Node Configurations > # > NodeName=DEFAULT CPUs=8 RealMemory=5949 TmpDisk=64000 State=UNKNOWN > NodeName=head-node,compute-node,spare-node > NodeAddr=head-node,compute-node,spare-node SocketsPerBoard=1 > CoresPerSocket=4 ThreadsPerCore=2 > # > # Partition Configurations > # > PartitionName=DEFAULT State=UP > PartitionName=comeon Nodes=head-node,compute-node,spare-node > MaxTime=168:00:00 MaxNodes=32 Default=YES > > > > > what is the difference between slurmdbd and mysql ? > based on the information in this page, > http://slurm.schedmd.com/accounting.html, slurmdbd has its own > configuration file called slurmdbd.conf. > is there any example of slurmdbd.conf file ? where should I store this > file ? how do I setup slurm to read slurmdbd.conf file ? > > I have installed mysql. I also have created slurm_acct_db database. > I need help. > > Thank you in advance > > regards, > > > Husen > > > >
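[Editor's note] When slurmctld reports `mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket ...`, a quick way to find the socket path the server is really using, as suggested above, is to ask the server itself. A sketch, assuming a local MySQL client that can log in as root:

```shell
# Ask the running server where its socket actually lives.
mysql -u root -p -e "SHOW VARIABLES LIKE 'socket';"

# If the reported path differs from what the Slurm side expects,
# either fix the socket= line under [mysqld] in my.cnf, or link it:
#   sudo ln -s /path/to/actual/mysql.sock /var/run/mysqld/mysqld.sock
```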
[slurm-dev] How to setup slurm database accounting feature
Dear all,

I tried to configure the Slurm accounting feature using a database. I have already read the instructions available on this page, http://slurm.schedmd.com/accounting.html, but the accounting feature is still not working. I got this error message when I try to execute the sacct command:

sacct: error: Problem talking to the database: Connection refused

The following is my slurm.conf:

--Slurm.conf

#
# Sample /etc/slurm.conf for mcr.llnl.gov
#
ControlMachine=head-node
ControlAddr=head-node
#BackupController=mcrj
#BackupAddr=emcrj
#
AuthType=auth/munge
CheckpointType=checkpoint/blcr
#Epilog=/usr/local/slurm/etc/epilog
FastSchedule=1
#JobCompLoc=/var/tmp/jette/slurm.job.log
JobCompType=jobcomp/mysql
#AccountingStorageType=accounting_storage/mysql
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
AccountingStoragePass=/var/run/munge/munge.socket.2
ClusterName=comeon
JobCompHost=head-node
JobCompPass=password
JobCompPort=3306
JobCompUser=root
JobCredentialPrivateKey=/usr/local/etc/slurm.key
JobCredentialPublicCertificate=/usr/local/etc/slurm.cert
MsgAggregationParams=WindowMsgs=2,WindowTime=100
PluginDir=/usr/local/lib/slurm
JobCheckpointDir=/mirror/source/cr
#Prolog=/usr/local/slurm/etc/prolog
MailProg=/usr/bin/mail
SchedulerType=sched/backfill
SelectType=select/linear
SlurmUser=slurm
SlurmctldLogFile=/var/tmp/slurmctld.log
SlurmctldPort=7002
SlurmctldTimeout=300
SlurmdPort=7003
SlurmdSpoolDir=/var/tmp/slurmd.spool
SlurmdTimeout=300
SlurmdLogFile=/var/tmp/slurmd.log
StateSaveLocation=/var/tmp/slurm.state
#SwitchType=switch/none
TreeWidth=50
#
# Node Configurations
#
NodeName=DEFAULT CPUs=8 RealMemory=5949 TmpDisk=64000 State=UNKNOWN
NodeName=head-node,compute-node,spare-node NodeAddr=head-node,compute-node,spare-node SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2
#
# Partition Configurations
#
PartitionName=DEFAULT State=UP
PartitionName=comeon Nodes=head-node,compute-node,spare-node MaxTime=168:00:00 MaxNodes=32 Default=YES

What is the 
difference between slurmdbd and mysql? Based on the information on this page, http://slurm.schedmd.com/accounting.html, slurmdbd has its own configuration file called slurmdbd.conf. Is there an example of a slurmdbd.conf file? Where should I store this file? How do I set up Slurm to read the slurmdbd.conf file?

I have installed mysql. I have also created the slurm_acct_db database. I need help.

Thank you in advance

Regards,

Husen
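[Editor's note] For the question above about an example slurmdbd.conf: a minimal sketch follows. slurmdbd looks for slurmdbd.conf in the same directory as slurm.conf (e.g. /etc/slurm or the configured sysconfdir); the host names, user, and password below are placeholders to adapt, not values from this thread:

```
# Minimal slurmdbd.conf sketch (all values are placeholders)
AuthType=auth/munge
DbdHost=localhost                    # host where slurmdbd runs
SlurmUser=slurm
DebugLevel=4
LogFile=/var/log/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost                # MySQL server host
StorageUser=slurm                    # MySQL account slurmdbd logs in as
StoragePass=password                 # MySQL password for that account
StorageLoc=slurm_acct_db             # database name
```

In slurm.conf, AccountingStorageType=accounting_storage/slurmdbd then points slurmctld at this daemon rather than directly at MySQL, which is the practical difference between the two.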
[slurm-dev] Re: Slurm checkpoint error
This is the output of ls -a -l. The files that appear in the error message are 0 bytes in size, and they all result from processes on the remote nodes.

drwxrwxr-x 2 necis necis 4096 Mei 17 14:44 .
drwxrwxrwx 12 root root 4096 Mei 17 17:24 ..
-r 1 necis necis 0 Mei 17 14:42 .task.0.ckpt.tmp
-r 1 necis necis 183630160 Mei 17 14:44 task.1.ckpt
-r 1 necis necis 183630200 Mei 17 14:43 task.2.ckpt
-r 1 necis necis 0 Mei 17 14:43 .task.2.ckpt.tmp
-r 1 necis necis 0 Mei 17 14:42 .task.3.ckpt.tmp
-r 1 necis necis 0 Mei 17 14:42 .task.4.ckpt.tmp
-r 1 necis necis 183297635 Mei 17 14:43 task.5.ckpt
-r 1 necis necis 183297635 Mei 17 14:43 task.6.ckpt
-r 1 necis necis 183297635 Mei 17 14:43 task.7.ckpt
-r 1 necis necis 183301731 Mei 17 14:43 task.8.ckpt
-r 1 necis necis 183297635 Mei 17 14:43 task.9.ckpt

Regards,

Husen

On Wed, May 18, 2016 at 7:38 AM, Husen R <hus...@gmail.com> wrote: > Hi, > > This is the output of ls -a: > > . .task.0.ckpt.tmp task.2.ckpt .task.3.ckpt.tmp task.5.ckpt > task.7.ckpt task.9.ckpt > .. task.1.ckpt .task.2.ckpt.tmp .task.4.ckpt.tmp task.6.ckpt > task.8.ckpt > > This is the output of ls : > > task.1.ckpt task.2.ckpt task.5.ckpt task.6.ckpt task.7.ckpt > task.8.ckpt task.9.ckpt > > > The temporary files appeared when I use ls -a command. What does it mean ? > I can create file in the directory with touch. > > I try to checkpoint mpi application using non root user with slurm > checkpoint interval feature. I don't directly checkpoint using > cr_checkpoint command. > > Regards, > > Husen > > On Tue, May 17, 2016 at 10:30 PM, Eric Roman <ero...@lbl.gov> wrote: > >> >> >> Are the temporary files created? >> >> Does ls -a on the directory show the missing files? >> >> Can you create files in that directory with touch? >> >> Finally, is cr_checkpoint being run by root? Or some other user? The >> checkpoint file will be created by the user invoking cr_checkpoint. 
>> >> Eric >> >> On Tue, May 17, 2016 at 01:10:32AM -0700, Husen R wrote: >> >dear all, >> >I failed everytime I try to checkpoint MPI application using BLCR in >> >Slurm. The following is my sbatch script : >> >##SBATCH SCRIPT >> >#!/bin/bash >> >#SBATCH -J MatMul >> >#SBATCH -o cr/mm-%j.out >> >#SBATCH -A necis >> >#SBATCH -N 3 >> >#SBATCH -n 24 >> >#SBATCH --checkpoint=1 >> >#SBATCH --checkpoint-dir=cr >> >#SBATCH --time=01:30:00 >> >#SBATCH --mail-user=[1]hus...@gmail.com >> >#SBATCH --mail-type=begin >> >#SBATCH --mail-type=end >> >srun --mpi=pmi2 ./mm.o >> > >> >I also have tried to run directly using srun command but I failed. >> The >> >following is the command I use and the error message that occured. >> >command :� >> >srun -N2 -n10 --mpi=pmi2 --checkpoint=1 ./mm.o� >> >error : >> >Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': >> Permission >> >denied >> >Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt' >> >Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': >> Permission >> >denied >> >Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt' >> >Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': >> Permission >> >denied >> >Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt' >> >Received results from task 6 >> >Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': >> Permission >> >denied >> >Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt' >> >Unable to open file '/mirror/source/cr/275.0/.task.2.ckpt.tmp': >> Permission >> >denied >> >Failed to open checkpoint file '/mirror/source/cr/275.0/task.2.ckpt' >> >Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': >> Permission >> >denied >> >Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt' >> >Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': >> Permission >> >denied >> >Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt
[slurm-dev] Slurm checkpoint error
Dear all,

I fail every time I try to checkpoint an MPI application using BLCR in Slurm. The following is my sbatch script:

##SBATCH SCRIPT
#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o cr/mm-%j.out
#SBATCH -A necis
#SBATCH -N 3
#SBATCH -n 24
#SBATCH --checkpoint=1
#SBATCH --checkpoint-dir=cr
#SBATCH --time=01:30:00
#SBATCH --mail-user=hus...@gmail.com
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
srun --mpi=pmi2 ./mm.o

I have also tried to run it directly using the srun command, but it failed. The following is the command I used and the error message that occurred.

command: srun -N2 -n10 --mpi=pmi2 --checkpoint=1 ./mm.o

error:
Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'
Received results from task 6
Unable to open file '/mirror/source/cr/275.0/.task.4.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.4.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.2.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.2.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.3.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.3.ckpt'
Unable to open file '/mirror/source/cr/275.0/.task.0.ckpt.tmp': Permission denied
Failed to open checkpoint file '/mirror/source/cr/275.0/task.0.ckpt'

In the cr directory there are 7 .ckpt files, as follows: task.1.ckpt, task.2.ckpt, task.5.ckpt, task.6.ckpt, task.7.ckpt, task.8.ckpt and task.9.ckpt. There are no checkpoint files called task.0.ckpt, task.3.ckpt and task.4.ckpt, as mentioned in the error message. 
/mirror is an NFS directory that is shared across the nodes. I set the cr directory to permission 777 just to avoid permission issues.

Note: if I execute the command via an sbatch job, I just get a file named script.ckpt. There is no task.[number].ckpt file.

Can anyone tell me how to solve this? Thank you in advance.

Regards,

Husen
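[Editor's note] Zero-byte `.task.N.ckpt.tmp` files plus "Permission denied" only for tasks on remote nodes often point to an NFS export option (such as root_squash) or a uid/gid mismatch between nodes, rather than the directory mode; 777 on the directory does not help if the writer's identity is squashed or mapped differently. A quick check to run on a compute node, as a sketch (the user name and path are taken from this thread):

```shell
# On a compute node, verify the job's user can really create files
# in the shared checkpoint directory over NFS.
sudo -u necis touch /mirror/source/cr/write-test && echo "writable"

# Compare the user's uid/gid with the head node; NFS matches on ids,
# not names, so they must agree on every node.
id necis

# Confirm the mount and its options (look for root_squash behavior
# on the server side, and rw on the client side).
grep mirror /proc/mounts
```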
[slurm-dev] Re: How to get command of a running/pending job
Dear all,

Thanks a lot for your reply! It really helps me.

Regards,

Husen

On Fri, May 13, 2016 at 9:20 PM, Benjamin Redling < benjamin.ra...@uni-jena.de> wrote: > > On 2016-05-13 05:58, Husen R wrote: > > Does slurm provide feature to get command that being executed/will be > > executed by running/pending jobs ? > > scontrol show --detail job > or > scontrol show -d job > > Benjamin > -- > FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html > vox: +49 3641 9 44323 | fax: +49 3641 9 44321 >
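[Editor's note] As a concrete example of Benjamin's suggestion: the submitted command appears in the Command field of the job record. A sketch (job id 70 is borrowed from a later thread in this digest and is only illustrative):

```shell
# Print the full, detailed record for job 70, including its command.
scontrol show -d job 70

# Or extract just the command line from the record.
scontrol show -d job 70 | grep -o 'Command=.*'
```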
[slurm-dev] Re: Slurm Checkpoint/Restart example
Danny : I'm unable to use srun_cr command. I got this error message from slurmctld log file after submitting srun_cr with sbatch: [2016-04-14T19:22:42.719] job_complete: JobID=67 State=0x1 NodeCnt=2 WEXITSTATUS 255 Any idea to fix this ? - yes, my job needs more than 5 minutes. Andy : Yes, /mirror directory is shared across my cluster. I have configured it using NFS. Regards, Husen On Thu, Apr 14, 2016 at 6:15 PM, Danny Rotscher < danny.rotsc...@tu-dresden.de> wrote: > I've found two things, first you could try srun_cr instead of srun and the > second is, do your job needs more than 5 minutes?! > But I'm not sure, so you may try it and post the result. > > > Am 14.04.2016 um 12:56 schrieb Husen R: > >> Hello Danny, >> >> I have tried to restart using "scontrol checkpoint restart " but it >> doesn't work. >> In addition, ".0" directory and its content are doesn't exist in my >> --checkpoint-dir. >> The following is my batch job : >> >> =batch job=== >> >> #!/bin/bash >> #SBATCH -J MatMul >> #SBATCH -o mm-%j.out >> #SBATCH -A pro >> #SBATCH -N 3 >> #SBATCH -n 24 >> #SBATCH --checkpoint=5 >> #SBATCH --checkpoint-dir=/mirror/source/cr >> #SBATCH --time=01:30:00 >> #SBATCH --mail-user=hus...@gmail.com >> #SBATCH --mail-type=begin >> #SBATCH --mail-type=end >> >> srun --mpi=pmi2 ./mm.o >> >> ===end batch job >> >> is there something that prevents me from getting the right directory >> structure ? >> >> >> Regards, >> >> >> >> Husen >> >> >> >> >> On Thu, Apr 14, 2016 at 5:36 PM, Danny Rotscher < >> danny.rotsc...@tu-dresden.de> wrote: >> >> Hello, >>> >>> usually the directory, which is specified by --checkpoint-dir, should >>> have >>> the following structure: >>> >>> |__ script.ckpt >>> |__ .0 >>> |__ task.0.ckpt >>> |__ task.1.ckpt >>> |__ ... 
>>> >>> But you only have to run the following command to restart your batch job: >>> scontrol checkpoint restart >>> >>> I tried only batch jobs and currently I try to build MVAPICH2 with BLCR >>> and Slurm support, because that mpi library is explicitly mentioned in >>> the >>> Slurm documentation. >>> >>> A colleague also tested DMTCP but no success. >>> >>> Kind reagards >>> Danny >>> TU Dresden >>> Germany >>> >>> >>> Am 14.04.2016 um 11:01 schrieb Husen R: >>> >>> Hi all, >>>> Thank you for your reply >>>> >>>> Danny : >>>> I have installed BLCR and SLURM successfully. >>>> I also have configured CheckpointType, --checkpoint, --checkpoint-dir >>>> and >>>> JobCheckpointDir in order for slurm to support checkpoint. >>>> >>>> I have tried to checkpoint a simple MPI parallel application many times >>>> in >>>> my small cluster, and like you said, after checkpoint is completed there >>>> is >>>> a directory named with jobid in --checkpoint-dir. in that directory >>>> there >>>> is a file named "script.ckpt". I tried to restart directly using srun >>>> command below : >>>> >>>> srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o >>>> >>>> where --restart-dir is directory that contains "script.ckpt". >>>> Unfortunately, I got the following error : >>>> >>>> Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file >>>> or >>>> directory >>>> srun: error: compute-node: task 0: Exited with exit code 255 >>>> >>>> As we can see from the error message above, there was no "task.0.ckpt" >>>> file. I don't know how to get such file. The files that I got from >>>> checkpoint operation is a file named "script.ckpt" in --checkpoint-dir >>>> and >>>> two files in JobCheckpointDir named ".ckpt" and >>>> ".ckpt.old". >>>> >>>> According to the information in section srun in this link >>>> http://slurm.schedmd.com/checkpoint_blcr.html, after checkpoint is >>>> completed there should be checkpoint files of the form ".ckpt" >>>> and >>>> "..ckpt&qu
[slurm-dev] Re: Slurm Checkpoint/Restart example
Hello Danny,

I have tried to restart using "scontrol checkpoint restart ", but it doesn't work. In addition, the ".0" directory and its contents do not exist in my --checkpoint-dir. The following is my batch job:

=batch job===
#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o mm-%j.out
#SBATCH -A pro
#SBATCH -N 3
#SBATCH -n 24
#SBATCH --checkpoint=5
#SBATCH --checkpoint-dir=/mirror/source/cr
#SBATCH --time=01:30:00
#SBATCH --mail-user=hus...@gmail.com
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
srun --mpi=pmi2 ./mm.o
===end batch job

Is there something that prevents me from getting the right directory structure?

Regards,

Husen

On Thu, Apr 14, 2016 at 5:36 PM, Danny Rotscher < danny.rotsc...@tu-dresden.de> wrote: > Hello, > > usually the directory, which is specified by --checkpoint-dir, should have > the following structure: > > |__ script.ckpt > |__ .0 > |__ task.0.ckpt > |__ task.1.ckpt > |__ ... 
I tried to restart directly using srun >> command below : >> >> srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o >> >> where --restart-dir is directory that contains "script.ckpt". >> Unfortunately, I got the following error : >> >> Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file >> or >> directory >> srun: error: compute-node: task 0: Exited with exit code 255 >> >> As we can see from the error message above, there was no "task.0.ckpt" >> file. I don't know how to get such file. The files that I got from >> checkpoint operation is a file named "script.ckpt" in --checkpoint-dir and >> two files in JobCheckpointDir named ".ckpt" and ".ckpt.old". >> >> According to the information in section srun in this link >> http://slurm.schedmd.com/checkpoint_blcr.html, after checkpoint is >> completed there should be checkpoint files of the form ".ckpt" and >> "..ckpt" in --checkpoint-dir. >> >> Any idea to solve this ? >> >> Manuel : >> >> Yes, BLCR doesn't support checkpoint/restart parallel/distributed >> application by itself ( >> https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi). >> But it can be used by other software to do that (I hope the software is >> SLURM..huhu) >> >> I have ever tried to restart mpi application using DMTCP but it doesn't >> work. >> Would you please tell me how to do that ? 
>> >> >> Thank you in advance, >> >> Regards, >> >> >> Husen >> >> >> >> >> >> On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher < >> danny.rotsc...@tu-dresden.de> wrote: >> >> I forgot something to add, you have to create a directory for the >>> checkpoint meta data, which is for default located in >>> /var/slurm/checkpoint: >>> mkdir -p /var/slurm/checkpoint >>> chown -R slurm /var/slurm >>> or you define your own directory in slurm.conf: >>> JobCheckpointDir= >>> >>> The parameters you could check with: >>> scontrol show config | grep checkpoint >>> >>> Kind regards, >>> Danny >>> TU Dresden >>> Germany >>> >>> Am 14.04.2016 um 06:41 schrieb Danny Rotscher: >>> >>> Hello, >>>> >>>> we don't get it to work too, but we already build Slurm with the BLCR. >>>> >>>> You first have to install the BLCR library, which is described on the >>>> following website: >>>>
[slurm-dev] Re: Slurm Checkpoint/Restart example
Hi all,

Thank you for your replies!

Danny: I have installed BLCR and Slurm successfully. I have also configured CheckpointType, --checkpoint, --checkpoint-dir and JobCheckpointDir in order for Slurm to support checkpointing.

I have tried to checkpoint a simple MPI application many times on my small cluster and, like you said, after the checkpoint completes there is a directory named with the jobid in --checkpoint-dir. In that directory there is a file named "script.ckpt". I tried to restart directly using the srun command below:

srun --mpi=pmi2 --restart-dir=/mirror/source/cr/51 ./mm.o

where --restart-dir is the directory that contains "script.ckpt". Unfortunately, I got the following error:

Failed to open(/mirror/source/cr/51/task.0.ckpt, O_RDONLY): No such file or directory
srun: error: compute-node: task 0: Exited with exit code 255

As we can see from the error message above, there was no "task.0.ckpt" file. I don't know how to get such a file. The files that I got from the checkpoint operation are a file named "script.ckpt" in --checkpoint-dir and two files in JobCheckpointDir named ".ckpt" and ".ckpt.old".

According to the information in the srun section of this link, http://slurm.schedmd.com/checkpoint_blcr.html, after a checkpoint completes there should be checkpoint files of the form ".ckpt" and "..ckpt" in --checkpoint-dir.

Any idea how to solve this?

Manuel:

Yes, BLCR doesn't support checkpoint/restart of parallel/distributed applications by itself ( https://upc-bugs.lbl.gov/blcr/doc/html/FAQ.html#mpi). But it can be used by other software to do that (I hope the software is SLURM..huhu)

I have tried to restart an MPI application using DMTCP, but it doesn't work. Would you please tell me how to do that? 
Thank you in advance, Regards, Husen On Thu, Apr 14, 2016 at 12:03 PM, Danny Rotscher < danny.rotsc...@tu-dresden.de> wrote: > I forgot something to add, you have to create a directory for the > checkpoint meta data, which is for default located in /var/slurm/checkpoint: > mkdir -p /var/slurm/checkpoint > chown -R slurm /var/slurm > or you define your own directory in slurm.conf: > JobCheckpointDir= > > The parameters you could check with: > scontrol show config | grep checkpoint > > Kind regards, > Danny > TU Dresden > Germany > > Am 14.04.2016 um 06:41 schrieb Danny Rotscher: > >> Hello, >> >> we don't get it to work too, but we already build Slurm with the BLCR. >> >> You first have to install the BLCR library, which is described on the >> following website: >> https://upc-bugs.lbl.gov/blcr/doc/html/BLCR_Admin_Guide.html >> >> Then we build and installed Slurm from source and BLCR checkpointing has >> been included. >> >> After that you have to set at least one Parameter in the file >> "slurm.conf": >> CheckpointType=checkpoint/blcr >> >> It exists two ways to create ceckpointing, you could either make a >> checkpoint by the following command from outside your job: >> scontrol checkpoint create >> or you could let Slurm do some periodical checkpoints with the following >> sbatch parameter: >> #SBATCH --checkpoint >> We also tried: >> #SBATCH --checkpoint : >> e.g. >> #SBATCH --checkpoint 0:10 >> to test it, but it doesn't work for us. 
>> >> We also set the parameter for the checkpoint directory: >> #SBATCH --checkpoint-dir >> >> After you create a checkpoint and in your checkpoint directory is created >> a directory with name of your jobid, you could restart the job by the >> following command: >> scontrol checkpoint restart >> >> We tested some sequential and openmp programs with different parameters >> and it works (checkpoint creation and restarting), >> but *we don't get any mpi library to work*, we already tested some >> programs build with openmpi and intelmpi. >> The checkpoint will be created but we get the following error when we >> want to restart them: >> - Failed to open file '/' >> - cr_restore_all_files [28534]: Unable to restore fd 3 (type=1,err=-21) >> - cr_rstrt_child [28534]: Unable to restore files! (err=-21) >> Restart failed: Is a directory >> srun: error: taurusi4010: task 0: Exited with exit code 21 >> >> So, it would be great if you could confirm our problems, maybe then >> schedmd higher up the priority of such mails;-) >> If you get it to work, please help us to understand how. >> >> Kind reagards, >> Danny >> TU Dresden >> Germany >> >> Am 11.04.2016 um 10:09 schrieb Husen R: >> >>> Hi all, >>> >>> Based on the information in this link >>> http://slurm.schedmd.com/checkpoint_blcr.html, >>> Slurm able to checkpoint the whole batch jobs and then Restart execution >>> of >>> batch jobs and job steps from checkpoint files. >>> >>> Anyone please tell me how to do that ? >>> I need help. >>> >>> Thank you in advance. >>> >>> Regards, >>> >>> >>> Husen Rusdiansyah >>> University of Indonesia >>> >> >> >
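[Editor's note] Danny's steps above can be condensed into one sketch of the whole cycle. This is only a summary of what the thread describes, not a verified recipe; the job id 1234 is a placeholder, and slurm.conf must already contain CheckpointType=checkpoint/blcr plus a JobCheckpointDir writable by SlurmUser:

```shell
# Submit with periodic checkpoints, written under ./cr
# (the thread uses both --checkpoint=<interval> and --checkpoint m:s forms).
sbatch --checkpoint=10 --checkpoint-dir=cr job.sh

# Or trigger a checkpoint manually from outside the job:
scontrol checkpoint create 1234

# Later, restart the batch job from its checkpoint files:
scontrol checkpoint restart 1234
```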
[slurm-dev] Slurm Checkpoint/Restart example
Hi all,

Based on the information in this link, http://slurm.schedmd.com/checkpoint_blcr.html, Slurm is able to checkpoint whole batch jobs and then restart execution of batch jobs and job steps from checkpoint files.

Can anyone tell me how to do that? I need help.

Thank you in advance.

Regards,

Husen Rusdiansyah
University of Indonesia
[slurm-dev] Re: sbatch always produces pending jobs
Hi Emily, Thank you for the information. How to avoid node from having DRAIN state ? I didn't set its state to DRAIN. Thank you in advance Regards, Husen On Fri, Apr 8, 2016 at 8:16 PM, E.M. Dragowsky <dragow...@case.edu> wrote: > Hi, Husen -- > > The DRAIN state means the node is not available for jobs, at least as far > as I understand from the documentation describing scontrol: > > If you want to remove a node from service, you typically want to set it's > state to "DRAIN". > > Cheers, > ~ Emily > > -- > E.M. Dragowsky, Ph.D. > ITS -- Research Computing > Case Western Reserve University > (216) 368-0082 > > On Fri, Apr 8, 2016 at 8:47 AM, Husen R <hus...@gmail.com> wrote: > >> Hello Remi, >> >> Thank you for your reply. >> >> here is the output of 'sinfo' and 'sinfo -R' respectively: >> >> pro@head-node:~$ sinfo >> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST >> comeon* up 30:00 1 drain head-node >> pro@head-node:~$ sinfo -R >> REASON USER TIMESTAMP NODELIST >> batch job complete f root 2016-04-08T16:16:38 head-node >> >> The state of my node is drain. I don't understand why the resources is >> not available. Currently, I don't run any resource-hungry application on >> that node. >> >> Regards, >> >> >> Husen >> >> >> On Fri, Apr 8, 2016 at 7:23 PM, Rémi Palancher <r...@rezib.org> wrote: >> >>> >>> Le 08/04/2016 13:39, Husen R a écrit : >>> >>>> [...] >>>> pro@head-node:/mirror/source$ squeue >>>> JOBID PARTITIONNAME USER ST TIME >>>> NODES NODELIST(REASON) >>>> 70comeon MatMul pro PD 0:00 >>>> 1(Resources) >>>> 71comeon MatMul pro PD 0:00 >>>> 1(Resources) >>>> 72comeon MatMul pro PD 0:00 >>>> 1(Resources) >>>> >>> >>> In the last column, squeue gives you the reason why the job are pending. >>> "Resources" means there is not enough resources available to run the jobs. >>> >>> Check the state of your nodes using `sinfo`. >>> >>> Best, >>> Rémi >>> >> >> >
[slurm-dev] Re: sbatch always produces pending jobs
Hello Remi,

Thank you for your reply.

Here is the output of 'sinfo' and 'sinfo -R', respectively:

pro@head-node:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
comeon*   up    30:00     1     drain head-node
pro@head-node:~$ sinfo -R
REASON               USER TIMESTAMP           NODELIST
batch job complete f root 2016-04-08T16:16:38 head-node

The state of my node is drain. I don't understand why the resources are not available. Currently, I don't run any resource-hungry application on that node.

Regards,

Husen

On Fri, Apr 8, 2016 at 7:23 PM, Rémi Palancher <r...@rezib.org> wrote:
>
> On 08/04/2016 13:39, Husen R wrote:
>
>> [...]
>> pro@head-node:/mirror/source$ squeue
>> JOBID PARTITION NAME   USER ST TIME NODES NODELIST(REASON)
>> 70    comeon    MatMul pro  PD 0:00 1     (Resources)
>> 71    comeon    MatMul pro  PD 0:00 1     (Resources)
>> 72    comeon    MatMul pro  PD 0:00 1     (Resources)
>
> In the last column, squeue gives you the reason why the jobs are pending.
> "Resources" means there are not enough resources available to run the jobs.
>
> Check the state of your nodes using `sinfo`.
>
> Best,
> Rémi
[slurm-dev] sbatch always produces pending jobs
Hello all,

Every time I use sbatch, the job is always in pending status, so it is never executed. I have tried to find the solution in the mail archive but didn't find a match. For debugging simplicity, I run slurmctld and slurmd on one machine.

Following is the output of the squeue command:

pro@head-node:/mirror/source$ squeue
JOBID PARTITION NAME   USER ST TIME NODES NODELIST(REASON)
70    comeon    MatMul pro  PD 0:00 1     (Resources)
71    comeon    MatMul pro  PD 0:00 1     (Resources)
72    comeon    MatMul pro  PD 0:00 1     (Resources)

Here is the control machine and compute node configuration in slurm.conf:

ControlMachine=head-node
ControlAddr=head-node
#BackupController=
#BackupAddr=
...
...
...
# COMPUTE NODES
NodeName=DEFAULT CPUs=8 RealMemory=5949 TmpDisk=281483 State=UNKNOWN
NodeName=head-node NodeAddr=head-node SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2
PartitionName=DEFAULT State=UP
PartitionName=comeon Nodes=head-node MaxTime=30 MaxNodes=2 Default=YES

And here is my sbatch script:

#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o myMM.%j.out
#SBATCH -A pro
#SBATCH -N 1
#SBATCH -n 2
#SBATCH --time=00:30:00
#SBATCH --mail-user=hus...@gmail.com
#SBATCH --mail-type=begin
#SBATCH --mail-type=end

salloc mpiexec ./mm.o

Could anyone tell me how to solve this? Is something misconfigured?

Thank you in advance.

Regards,
Husen
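[Editor's sketch, not part of the original post: apart from the drained node diagnosed later in this thread, the script above also wraps the launch in `salloc`, which is unnecessary inside a batch script since sbatch has already created the allocation. A minimal version of the same job might look like this; `srun --mpi=pmi2` is the launcher another thread in this archive reports working with MVAPICH2.]

```shell
#!/bin/bash
#SBATCH -J MatMul
#SBATCH -o myMM.%j.out
#SBATCH -N 1
#SBATCH -n 2
#SBATCH --time=00:30:00

# Launch the MPI program directly; sbatch already holds the allocation,
# so no salloc wrapper is needed.
srun --mpi=pmi2 ./mm.o
```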
[slurm-dev] Re: Failed to access munge.socket.2
Hello Lachlan,

Thank you for your reply. Yes, you're right: munge.socket.2 is created when the system runs.

I use Slurm version 15.08.10. The OS is Ubuntu 14.04 LTS 64-bit.

Currently, I have installed munge and slurm successfully. Previously, munge.socket.2 could not be accessed because at configure time I had used a customized location in the "--prefix=" option. So I decided to reinstall munge using the instructions available at https://github.com/dun/munge/wiki/Installation-Guide.

Regards,
Husen

On Fri, Apr 8, 2016 at 9:00 AM, Simpson Lachlan <lachlan.simp...@petermac.org> wrote:
> Husen,
>
> You won't be able to find the file -- it's created when the system runs
> so that the system knows something is running :)
>
> Everything in /var/run is ephemeral.
>
> OK, what version of slurm are you running, which bits have you installed
> and what OS are you installing it onto?
>
> Yes, there isn't a munge in /var/run yet -- that's why you should create
> the tmpfiles.d entry like I said -- that will create it on boot. In the
> meantime, you can just create the directory in /var/run/; you will need
> to chown munge:munge after you have created it.
>
> Cheers
> L.
>
> *From:* Husen R [mailto:hus...@gmail.com]
> *Sent:* Thursday, 7 April 2016 12:07 PM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: Failed to access munge.socket.2
>
> Hello Lachlan, Chris
>
> Thank you for your reply.
>
> I don't know why "/usr/local" is appended to the path.
> I tried to locate munge.socket.2 manually using the locate command and
> the file indeed does not exist. The directory /usr/local/var/run/munge
> is empty.
>
> There is no munge directory in /var/run. I don't know why the munge
> directory is located in /usr/local/var/run instead of in /var/run.
>
> I had previously installed slurm-llnl from the repository before
> installing it from source. Is this possibly the cause of the problem?
>
> Regards,
>
> Husen
>
> On Thu, Apr 7, 2016 at 8:26 AM, Christopher Samuel <sam...@unimelb.edu.au> wrote:
>
>> On 06/04/16 19:50, Husen R wrote:
>>
>>> however, when I tried to run sbatch I get the following error message:
>>>
>>> Failed to access "/usr/local/var/run/munge/munge.socket.2": No such file
>>> or directory
>>
>> Is that path really correct?
>>
>> On our systems it's: /var/run/munge/munge.socket.2
>>
>> Best of luck,
>> Chris
>> --
>> Christopher Samuel        Senior Systems Administrator
>> VLSCI - Victorian Life Sciences Computation Initiative
>> Email: sam...@unimelb.edu.au    Phone: +61 (0)3 903 55545
>> http://www.vlsci.org.au/        http://twitter.com/vlsci
>
> This email (including any attachments or links) may contain confidential
> and/or legally privileged information and is intended only to be read or
> used by the addressee. If you are not the intended addressee, any use,
> distribution, disclosure or copying of this email is strictly prohibited.
> Confidentiality and legal privilege attached to this email (including any
> attachments) are not waived or lost by reason of its mistaken delivery to
> you. If you have received this email in error, please delete it and notify
> us immediately by telephone or email. Peter MacCallum Cancer Centre
> provides no guarantee that this transmission is free of virus or that it
> has not been intercepted or altered and will not be liable for any delay
> in its receipt.
[slurm-dev] Re: Failed to access munge.socket.2
Hello all,

Currently I have installed and configured slurm-15.08.10 and munge-0.5.12 successfully.

It seems "/usr/local" appeared in the path to munge.socket.2 (/usr/local/var/run/munge/munge.socket.2) because at the configure step I had included it as the "--prefix" value. So I decided to reinstall munge using the instructions available at https://github.com/dun/munge/wiki/Installation-Guide, and the problem is solved.

In my experience installing munge, a machine reboot was required after installation; otherwise an error message regarding permissions appeared when I attempted to use it. (I don't know whether this is normal behavior or not.)

Regards,
Husen

On Thu, Apr 7, 2016 at 8:26 AM, Christopher Samuel <sam...@unimelb.edu.au> wrote:
>
> On 06/04/16 19:50, Husen R wrote:
>
>> however, when I tried to run sbatch I get the following error message:
>>
>> Failed to access "/usr/local/var/run/munge/munge.socket.2": No such file
>> or directory
>
> Is that path really correct?
>
> On our systems it's: /var/run/munge/munge.socket.2
>
> Best of luck,
> Chris
> --
> Christopher Samuel        Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au    Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/        http://twitter.com/vlsci
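[Editor's note for later readers: the Installation Guide's configure invocation that keeps munge's runtime files under the standard system paths, so the socket is created at /var/run/munge/munge.socket.2 rather than under /usr/local, looks like this:]

```shell
# Build munge with its state, config and runtime files under the
# standard FHS locations instead of the /usr/local default:
./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var
make
sudo make install
```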
[slurm-dev] Re: Failed to access munge.socket.2
Hello Lachlan, Chris

Thank you for your reply.

I don't know why "/usr/local" is appended to the path. I tried to locate munge.socket.2 manually using the locate command and the file indeed does not exist. The directory /usr/local/var/run/munge is empty.

There is no munge directory in /var/run. I don't know why the munge directory is located in /usr/local/var/run instead of in /var/run.

I had previously installed slurm-llnl from the repository before installing it from source. Is this possibly the cause of the problem?

Regards,

Husen

On Thu, Apr 7, 2016 at 8:26 AM, Christopher Samuel <sam...@unimelb.edu.au> wrote:
>
> On 06/04/16 19:50, Husen R wrote:
>
>> however, when I tried to run sbatch I get the following error message:
>>
>> Failed to access "/usr/local/var/run/munge/munge.socket.2": No such file
>> or directory
>
> Is that path really correct?
>
> On our systems it's: /var/run/munge/munge.socket.2
>
> Best of luck,
> Chris
> --
> Christopher Samuel        Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au    Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/        http://twitter.com/vlsci
[slurm-dev] Failed to access munge.socket.2
Hello everyone,

I have installed slurm-15.08.9 successfully. However, when I try to run sbatch I get the following error message:

Failed to access "/usr/local/var/run/munge/munge.socket.2": No such file or directory

I tried to solve this problem by reinstalling munge and recreating munge.key. I have also propagated the munge key to every node in my cluster (https://github.com/dun/munge/wiki/Installation-Guide#starting-the-daemon). However, the error still appears.

Could anyone tell me how to solve this problem? Sorry for this very basic question.

Thank you in advance.

Regards,
Husen
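[Editor's note: as a first isolation step, before involving Slurm at all, the munge Installation Guide suggests checking that munged itself is up and its socket is usable. The commands below are from that guide; `remote-node` is a placeholder hostname.]

```shell
# Generate and locally validate a credential. This fails with the same
# "Failed to access ... munge.socket.2" error if munged is not running
# or was built with a different socket path:
munge -n | unmunge

# Check that a credential encoded here decodes on another node
# (requires an identical munge.key on both machines):
munge -n | ssh remote-node unmunge
```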
[slurm-dev] checkpoint/restart feature in SLURM
Dear slurm-dev,

Is the checkpoint/restart feature available in Slurm able to relocate an MPI application from one node to another while it is running?

For example, I run an MPI application on nodes A, B and C in a cluster, and I want to migrate/relocate the process running on node A to another node, say node C, while it is running. Is there a way to do this with Slurm?

Thank you.

Regards,
Husen