[slurm-dev] AUKS, AFS; SLURM

2017-07-10 Thread Glenn (Gedaliah) Wolosh
Hello;

I’ve installed slurm 16.05 on SL 7.3 using ohpc. I also have the latest version 
of AUKS. I was able to hack auks so that aklog successfully runs when either 
obtaining or renewing a krb5 ticket.

For example —
p-slogin.p-stheno.tartan.njit.edu-77 guest24>: kinit
Password for gues...@njit.edu:
p-slogin.p-stheno.tartan.njit.edu-78 guest24>: tokens

Tokens held by the Cache Manager:

   --End of list--
p-slogin.p-stheno.tartan.njit.edu-79 guest24>: auks -g
Auks API request succeed
p-slogin.p-stheno.tartan.njit.edu-80 guest24>: tokens

Tokens held by the Cache Manager:

User's (AFS ID 22967) tokens for a...@cad.njit.edu [Expires Jul 10 22:34]
   --End of list--

Works just as well with auks -R loop

I also set up a function slurm_spank_task_init() in the auks
spank plugin to call aklog. Unfortunately, this does not work.
I get the following error:
p-slogin.p-stheno.tartan.njit.edu-81 guest24>: srun hostname
aklog: Couldn't determine realm of user:aklog: unknown RPC error (-1765328189)  
while getting realm

My guess is that in this case the user running aklog is not “guest24”.

Here are some relevant lines from the log:
[2017-07-10T16:19:34.763] [78.0] debug3: Entering _handle_request
[2017-07-10T16:19:34.763] [78.0] debug3: Leaving  _handle_accept
[2017-07-10T16:19:34.773] [78.0] debug:  mpi type = (null)
[2017-07-10T16:19:34.773] [78.0] debug:  Using mpi/none
[2017-07-10T16:19:34.773] [78.0] debug:  task_p_pre_launch: 78.0, task 0
[2017-07-10T16:19:34.773] [78.0] spank-auks: running aklog
[2017-07-10T16:19:34.781] [78.0] debug2: spank: auks.so: task_init = 0
[2017-07-10T16:19:34.781] [78.0] debug:  [job 78] attempting to run slurm 
task_prolog [/opt/local/bin/TaskProlog]
[2017-07-10T16:19:34.813] [78.0] debug2: _set_limit: conf setrlimit RLIMIT_CPU 
no change in value: 18446744073709551615
[2017-07-10T16:19:34.813] [78.0] debug2: _set_limit: conf setrlimit 
RLIMIT_FSIZE no change in value: 18446744073709551615
[2017-07-10T16:19:34.813] [78.0] debug2: _set_limit: conf setrlimit RLIMIT_DATA 
no change in value: 18446744073709551615
[2017-07-10T16:19:34.813] [78.0] debug2: _set_limit: RLIMIT_STACK  : max:inf 
cur:inf req:8388608
[2017-07-10T16:19:34.813] [78.0] debug2: _set_limit: conf setrlimit 
RLIMIT_STACK succeeded
[2017-07-10T16:19:34.813] [78.0] debug2: _set_limit: conf setrlimit RLIMIT_CORE 
no change in value: 0
[2017-07-10T16:19:34.813] [78.0] debug2: _set_limit: conf setrlimit RLIMIT_RSS 
no change in value: 18446744073709551615
[2017-07-10T16:19:34.813] [78.0] debug2: _set_limit: conf setrlimit 
RLIMIT_NPROC no change in value: 4096
[2017-07-10T16:19:34.813] [78.0] debug2: _set_limit: RLIMIT_NOFILE : max:51200 
cur:51200 req:1024
[2017-07-10T16:19:34.813] [78.0] debug2: _set_limit: conf setrlimit 
RLIMIT_NOFILE succeeded
[2017-07-10T16:19:34.813] [78.0] debug:  Couldn't find SLURM_RLIMIT_MEMLOCK in 
environment
[2017-07-10T16:19:34.813] [78.0] debug2: _set_limit: conf setrlimit RLIMIT_AS 
no change in value: 18446744073709551615
[2017-07-10T16:19:34.815] [78.0] task 0 (5305) exited with exit code 0.

Note that the TaskProlog also calls aklog. This will get me a token using srun 
but will not get me a token when using sbatch. 
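A minimal TaskProlog along these lines might look like the sketch below. This is an illustration, not the actual /opt/local/bin/TaskProlog: it assumes aklog is in PATH, and it relies on Slurm's TaskProlog stdout protocol (lines beginning with "export NAME=value" are injected into the task's environment; lines beginning with "print " go to the job's stdout), as documented in slurm.conf(5).

```shell
#!/bin/bash
# TaskProlog sketch: Slurm runs this as the job user on each node just
# before the task starts, so aklog acquires the AFS token in the user's
# own credential context.
get_afs_token() {
    if command -v aklog >/dev/null 2>&1; then
        # Convert the krb5 ticket (obtained via auks) into an AFS token.
        aklog && echo "export AFS_TOKEN_SET=1"
    else
        # Surface the problem in the job's stdout rather than failing silently.
        echo "print TaskProlog: aklog not found on $(hostname)"
    fi
}
get_afs_token
```

Note that anything this script does happens per task, which is why the srun and sbatch code paths can behave differently.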

I also have “UsePAM=1” in my slurm.conf, with the following Slurm PAM file:

auth     required pam_localuser.so
account  required pam_unix.so
session  required pam_limits.so
session  required pam_afs_session.so

This doesn’t work either.

Any advice would be greatly appreciated.
___
Gedaliah Wolosh
IST Academic and Research Computing Systems (ARCS)
NJIT
GITC 2203
973 596 5437
gwol...@njit.edu



[slurm-dev] slurm + openmpi + suspend problem

2017-07-10 Thread Eugene Dedits

Hello SLURM-DEV


I have a problem with slurm, openmpi, and “scontrol suspend”. 

My setup is:
96-node cluster with IB, running rhel 6.8
slurm 17.02.1
openmpi 2.0.0 (built using Intel 2016 compiler)


I am running an application (HPL in this particular case) using a batch script
similar to:
-
#!/bin/bash
#SBATCH --partition=standard
#SBATCH -N 10
#SBATCH --ntasks-per-node=16

mpirun -np 160 xhpl | tee LOG
-

So I am running it on 160 cores across 10 nodes.

Once job is submitted to the queue and is running I suspend it using
~# scontrol suspend JOBID

I see that indeed my job stopped producing output. I go to each of the 10
nodes that were assigned for my job and see if the xhpl processes are running
there with :

~# for i in {10..19}; do ssh node$i "top -b -n 1 | head -n 50 | grep xhpl | wc -l"; done
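Independent of Slurm, the effect of a suspend can be probed more directly than by grepping top: `scontrol suspend` delivers SIGSTOP to the job's processes, and a stopped process stays visible in ps/top but with state `T`. A quick local sketch (plain Linux shell, no Slurm involved):

```shell
# Start a throwaway process, stop it, and read its state from ps.
sleep 30 &
pid=$!
kill -STOP "$pid"        # what a suspend ultimately delivers per process
sleep 0.2                # give the kernel a moment to reflect the change
state=$(ps -o stat= -p "$pid" | cut -c1)
echo "state after SIGSTOP: $state"    # "T" = stopped
kill -CONT "$pid"        # resume, as "scontrol resume" would
kill "$pid" 2>/dev/null  # clean up
```

Replacing the `top | head | grep | wc` pipeline with `ps -o stat= -C xhpl` on each node would count stopped versus running xhpl processes unambiguously.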

I expect this little script to return 0 from every node (because suspend sent
SIGSTOP and the processes shouldn’t show up in top). However, I see that
processes are reliably suspended only on node10. I get:
0
16
16
…
16

So 9 out of 10 nodes still have 16 MPI threads of my xhpl application running 
at 100%. 

If I run “scontrol resume JOBID” and then suspend it again, I see that
(sometimes) more nodes have “xhpl” processes properly suspended. Every time I
resume and suspend the job, I see different nodes returning 0 in my
“ssh-run-top” script.

So altogether it looks like the suspend mechanism doesn’t work properly in
SLURM with OpenMPI. I’ve tried compiling OpenMPI with “--with-slurm
--with-pmi=/path/to/my/slurm” and observed the same behavior.

I would appreciate any help.   


Thanks,
Eugene. 



 

[slurm-dev] Re: knl_generic plugin on non-KNL node

2017-07-10 Thread Victor Gamayunov
Hi Gilles,

On Fri, Jul 7, 2017 at 1:16 AM Gilles Gouaillardet 
wrote:

> in your slurm.conf, you should have a line like this one
> NodeName=n[1-4] Feature=knl Sockets=1 CoresPerSocket=68 State=UNKNOWN
> at first, make sure your regular Xeon nodes do *not* have the 'knl' feature
>

I do have that, but it doesn't seem to make any difference.

> I guess another option is not to have the
> NodeFeaturesPlugins=knl_generic
> line on your regular Xeon nodes
> (note that unless you specify an option, you will get some warnings
> since all your slurm.conf are not identical)
>

I was thinking about doing that, but I didn't like the idea of having
different conf files. I'll try this anyway.

Am I right thinking that the decision to reboot the node is made based
on the bitmask which is modified by the plugin?

Thanks,
Victor


[slurm-dev] Re: srun can't use variables in a batch script after upgrade

2017-07-10 Thread Dennis Tants
Hello list,

thank you for your thoughts and help.
I found my problem to be caused by myself (as always).

Using srun to copy my files, I also copied the output file,
which somehow resulted in SLURM no longer being able to log
to it, except for the end.
I switched cp to rsync and excluded my .sbatch and .out files.
Now running the script works like a charm.

Thank you guys.

Best regards,
Dennis

Am 10.07.2017 um 10:13 schrieb Carlos Fenoy:
> Re: [slurm-dev] Re: srun can't use variables in a batch script after
> upgrade
> Hi,
>
> any idea why the output of your job is not complete? There is nothing
> after "Copying files...". Does the /work/tants directory exists in all
> the nodes? The variable $SLURM_JOB_NAME is interpreted by bash so srun
> only sees "srun -N2 -n2 rm -rf /work/tants/mpicopytest"
>
> Regards,
> Carlos
>
> On Mon, Jul 10, 2017 at 10:02 AM, Dennis Tants
>  > wrote:
>
>
> Hello Loris,
>
> Am 10.07.2017 um 07:39 schrieb Loris Bennett:
> > Hi Dennis,
> >
> > Dennis Tants  > writes:
> >
> >> Hi list,
> >>
> >> I am a little bit lost right now and would appreciate your help.
> >> We have a little cluster with 16 nodes running with SLURM and it is
> >> doing everything we want, except a few
> >> little things I want to improve.
> >>
> >> So that is why I wanted to upgrade our old SLURM 15.X (don't
> know the
> >> exact version) to 17.02.4 on my test machine.
> >> I just deleted the old version completely with 'yum erase slurm-*'
> >> (CentOS 7 btw.) and build the new version with rpmbuild.
> >> Everything went fine so I started configuring a new
> slurm[dbd].conf.
> >> This time I also wanted to integrate backfill instead of FIFO
> >> and also use accounting (just to know which person uses the most
> >> resources). Because we had no databases yet I started
> >> slurmdbd and slurmctld without problems.
> >>
> >> Everything seemed fine with a simple mpi hello world test on
> one and two
> >> nodes.
> >> Now I wanted to enhance the script a bit more and include
> working in the
> >> local directory of the nodes which is /work.
> >> To get everything up and running I used the script which I
> attached for
> >> you (it also includes the output after running the script).
> >> It should basically just copy all data to
> /work/tants/$SLURM_JOB_NAME
> >> before doing the mpi hello world.
> >> But it seems that srun does not know $SLURM_JOB_NAME even
> though it is
> >> there.
> >> /work/tants belongs to the correct user and has rwx permissions.
> >>
> >> So did I just configure something wrong or what happened here?
> Nearly
> >> the same example is working on our cluster with
> >> 15.X. The script is only for testing purposes, thats why there
> are so
> >> many echo commands in there.
> >> If you see any mistake or can recommend better configurations I
> would
> >> glady hear them.
> >> Should you need any more information I will provide them.
> >> Thank you for your time!
> > Shouldn't the variable be $SBATCH_JOB_NAME?
> >
> > Cheers,
> >
> > Loris
> >
>
> when I use "echo $SLURM_JOB_NAME" it will tell me the name I specified
> with #SBATCH -J.
> It is not working with srun in this version (it was working in 15.x).
>
> However, when I now use "echo $SBATCH_JOB_NAME" it is just a blank
> variable. As told by someone from the list,
> I used the command "env" to verify which variables are available. This
> list includes SLURM_JOB_NAME
> with the name I specified. So $SLURM_JOB_NAME shouldn't be a problem.
>
> Thank you for your suggestion though.
> Any other hints?
>
> Best regards,
> Dennis
>
> --
> Dennis Tants
> Auszubildender: Fachinformatiker für Systemintegration
>
> ZARM - Zentrum für angewandte Raumfahrttechnologie und
> Mikrogravitation
> ZARM - Center of Applied Space Technology and Microgravity
>
> Universität Bremen
> Am Fallturm
> 28359 Bremen, Germany
>
> Telefon: 0421 218 57940
> E-Mail: ta...@zarm.uni-bremen.de 
>
> www.zarm.uni-bremen.de 
>
>
>
>
> -- 
> --
> Carles Fenoy

-- 
Dennis Tants
Auszubildender: Fachinformatiker für Systemintegration

ZARM - Zentrum für angewandte Raumfahrttechnologie und Mikrogravitation
ZARM - Center of Applied Space Technology and Microgravity

Universität Bremen
Am Fallturm
28359 Bremen, Germany

Telefon: 0421 218 57940
E-Mail: ta...@zarm.uni-bremen.de

www.zarm.uni-bremen.de



[slurm-dev] Re: slurm 17.2.06 min memory problem

2017-07-10 Thread Diego Zuccato

Il 10/07/2017 10:53, Roe Zohar ha scritto:

> Adding DefMemPerCpu was the only solution but I don't understand this
> behavior.
I've had to do the same (and gave a default of just 200MB, actually
forcing users to request the RAM they need).
IIUC, that's because if the user does not request a specific amount of
RAM and no default is given, how could SLURM know how much memory the
job needs? How can it pack different jobs on the same node w/o knowing
they won't interfere?
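For reference, the kind of slurm.conf fragment this amounts to might be the sketch below (the 200 MB figure is the default mentioned above; the SelectType and parameters must of course match the site's actual setup):

```
# Pack jobs by CPU and memory; the deliberately small default memory
# per CPU effectively forces users to request what they actually need.
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
DefMemPerCPU=200
```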


-- 
Diego Zuccato
Servizi Informatici
Dip. di Fisica e Astronomia (DIFA) - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786
mail: diego.zucc...@unibo.it


[slurm-dev] Re: slurm 17.2.06 min memory problem

2017-07-10 Thread Roe Zohar
Hi Loris,

I am using SelectType=select/cons_res.
Still, when sending jobs, every server gets only one job, with
memory=MaxServerMemory inside the job.

Adding DefMemPerCpu was the only solution but I don't understand this
behavior.

Thanks,
Roy

On Jul 10, 2017 8:47 AM, "Loris Bennett"  wrote:


Hi Roy,

Roe Zohar  writes:

> slurm 17.2.06 min memory problem
>
> Hi all,
> I have installed the last Slurm version and I have noticed a strange
behavior with the memory allocated for jobs.
> In my slurm conf I am having:
> SelectTypeParameters=CR_LLN,CR_CPU_Memory
>
> Now, when I am sending a new job with out giving it a --mem amount, it
automatically assign it all the server memory, which mean I am getting only
one job per server.
>
> I had to add DefMemPerCPU in order to get around that.
>
> Any body know why is that?
>
> Thanks,
> Roy

What value of SelectType are you using?  Note also that CR_LLN schedules
jobs to the least-loaded nodes, so until all nodes have one job, you
will not get more than one job per node.  See 'man slurm.conf'.

Regards

Loris

--
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: srun can't use variables in a batch script after upgrade

2017-07-10 Thread Loris Bennett

Hi Dennis,

Dennis Tants  writes:

> Hello Loris,
>
> Am 10.07.2017 um 07:39 schrieb Loris Bennett:
>> Hi Dennis,
>>
>> Dennis Tants  writes:
>>
>>> Hi list,
>>>
>>> I am a little bit lost right now and would appreciate your help.
>>> We have a little cluster with 16 nodes running with SLURM and it is
>>> doing everything we want, except a few
>>> little things I want to improve.
>>>
>>> So that is why I wanted to upgrade our old SLURM 15.X (don't know the
>>> exact version) to 17.02.4 on my test machine.
>>> I just deleted the old version completely with 'yum erase slurm-*'
>>> (CentOS 7 btw.) and build the new version with rpmbuild.
>>> Everything went fine so I started configuring a new slurm[dbd].conf.
>>> This time I also wanted to integrate backfill instead of FIFO
>>> and also use accounting (just to know which person uses the most
>>> resources). Because we had no databases yet I started
>>> slurmdbd and slurmctld without problems.
>>>
>>> Everything seemed fine with a simple mpi hello world test on one and two
>>> nodes.
>>> Now I wanted to enhance the script a bit more and include working in the
>>> local directory of the nodes which is /work.
>>> To get everything up and running I used the script which I attached for
>>> you (it also includes the output after running the script).
>>> It should basically just copy all data to /work/tants/$SLURM_JOB_NAME
>>> before doing the mpi hello world.
>>> But it seems that srun does not know $SLURM_JOB_NAME even though it is
>>> there.
>>> /work/tants belongs to the correct user and has rwx permissions.
>>>
>>> So did I just configure something wrong or what happened here? Nearly
>>> the same example is working on our cluster with
>>> 15.X. The script is only for testing purposes, thats why there are so
>>> many echo commands in there.
>>> If you see any mistake or can recommend better configurations I would
>>> glady hear them.
>>> Should you need any more information I will provide them.
>>> Thank you for your time!
>> Shouldn't the variable be $SBATCH_JOB_NAME?
>>
>> Cheers,
>>
>> Loris
>>
>
> when I use "echo $SLURM_JOB_NAME" it will tell me the name I specified
> with #SBATCH -J.
> It is not working with srun in this version (it was working in 15.x).
>
> However, when I now use "echo $SBATCH_JOB_NAME" it is just a blank
> variable. As told by someone from the list,
> I used the command "env" to verify which variables are available. This
> list includes SLURM_JOB_NAME
> with the name I specified. So $SLURM_JOB_NAME shouldn't be a problem.
>
> Thank you for your suggestion though.
> Any other hints?
>
> Best regards,
> Dennis

The manpage of srun says the following:

  SLURM_JOB_NAME    Same as -J, --job-name except within an existing
                    allocation, in which case it is ignored to avoid
                    using the batch job’s name as the name of each
                    job step.

This sounds like it might mean that if you submit a job script via
sbatch and in this script call srun, the variable might not be defined.
However, the wording is a bit unclear and I have never tried this
myself.

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: srun can't use variables in a batch script after upgrade

2017-07-10 Thread Carlos Fenoy
Hi,

any idea why the output of your job is not complete? There is nothing after
"Copying files...". Does the /work/tants directory exist on all the nodes?
The variable $SLURM_JOB_NAME is interpreted by bash, so srun only sees "srun
-N2 -n2 rm -rf /work/tants/mpicopytest"
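That expansion order is easy to check without Slurm: with double quotes the submitting shell substitutes the variable before the command line is built, while single quotes keep the literal text. A plain bash sketch (SLURM_JOB_NAME is set by hand here to stand in for the value sbatch exports):

```shell
SLURM_JOB_NAME="mpicopytest"   # stand-in for the variable sbatch exports
echo "target: /work/tants/$SLURM_JOB_NAME"   # expanded by the current shell
echo 'target: /work/tants/$SLURM_JOB_NAME'   # left literal for a later shell
```

The first line prints the substituted path, the second the unexpanded `$SLURM_JOB_NAME`; so if the variable is set when sbatch interprets the script, srun receives the already-expanded path.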

Regards,
Carlos

On Mon, Jul 10, 2017 at 10:02 AM, Dennis Tants <
dennis.ta...@zarm.uni-bremen.de> wrote:

>
> Hello Loris,
>
> Am 10.07.2017 um 07:39 schrieb Loris Bennett:
> > Hi Dennis,
> >
> > Dennis Tants  writes:
> >
> >> Hi list,
> >>
> >> I am a little bit lost right now and would appreciate your help.
> >> We have a little cluster with 16 nodes running with SLURM and it is
> >> doing everything we want, except a few
> >> little things I want to improve.
> >>
> >> So that is why I wanted to upgrade our old SLURM 15.X (don't know the
> >> exact version) to 17.02.4 on my test machine.
> >> I just deleted the old version completely with 'yum erase slurm-*'
> >> (CentOS 7 btw.) and build the new version with rpmbuild.
> >> Everything went fine so I started configuring a new slurm[dbd].conf.
> >> This time I also wanted to integrate backfill instead of FIFO
> >> and also use accounting (just to know which person uses the most
> >> resources). Because we had no databases yet I started
> >> slurmdbd and slurmctld without problems.
> >>
> >> Everything seemed fine with a simple mpi hello world test on one and two
> >> nodes.
> >> Now I wanted to enhance the script a bit more and include working in the
> >> local directory of the nodes which is /work.
> >> To get everything up and running I used the script which I attached for
> >> you (it also includes the output after running the script).
> >> It should basically just copy all data to /work/tants/$SLURM_JOB_NAME
> >> before doing the mpi hello world.
> >> But it seems that srun does not know $SLURM_JOB_NAME even though it is
> >> there.
> >> /work/tants belongs to the correct user and has rwx permissions.
> >>
> >> So did I just configure something wrong or what happened here? Nearly
> >> the same example is working on our cluster with
> >> 15.X. The script is only for testing purposes, thats why there are so
> >> many echo commands in there.
> >> If you see any mistake or can recommend better configurations I would
> >> glady hear them.
> >> Should you need any more information I will provide them.
> >> Thank you for your time!
> > Shouldn't the variable be $SBATCH_JOB_NAME?
> >
> > Cheers,
> >
> > Loris
> >
>
> when I use "echo $SLURM_JOB_NAME" it will tell me the name I specified
> with #SBATCH -J.
> It is not working with srun in this version (it was working in 15.x).
>
> However, when I now use "echo $SBATCH_JOB_NAME" it is just a blank
> variable. As told by someone from the list,
> I used the command "env" to verify which variables are available. This
> list includes SLURM_JOB_NAME
> with the name I specified. So $SLURM_JOB_NAME shouldn't be a problem.
>
> Thank you for your suggestion though.
> Any other hints?
>
> Best regards,
> Dennis
>
> --
> Dennis Tants
> Auszubildender: Fachinformatiker für Systemintegration
>
> ZARM - Zentrum für angewandte Raumfahrttechnologie und Mikrogravitation
> ZARM - Center of Applied Space Technology and Microgravity
>
> Universität Bremen
> Am Fallturm
> 28359 Bremen, Germany
>
> Telefon: 0421 218 57940
> E-Mail: ta...@zarm.uni-bremen.de
>
> www.zarm.uni-bremen.de
>
>


-- 
--
Carles Fenoy


[slurm-dev] Re: srun can't use variables in a batch script after upgrade

2017-07-10 Thread Dennis Tants

Hello Loris,

Am 10.07.2017 um 07:39 schrieb Loris Bennett:
> Hi Dennis,
>
> Dennis Tants  writes:
>
>> Hi list,
>>
>> I am a little bit lost right now and would appreciate your help.
>> We have a little cluster with 16 nodes running with SLURM and it is
>> doing everything we want, except a few
>> little things I want to improve.
>>
>> So that is why I wanted to upgrade our old SLURM 15.X (don't know the
>> exact version) to 17.02.4 on my test machine.
>> I just deleted the old version completely with 'yum erase slurm-*'
>> (CentOS 7 btw.) and build the new version with rpmbuild.
>> Everything went fine so I started configuring a new slurm[dbd].conf.
>> This time I also wanted to integrate backfill instead of FIFO
>> and also use accounting (just to know which person uses the most
>> resources). Because we had no databases yet I started
>> slurmdbd and slurmctld without problems.
>>
>> Everything seemed fine with a simple mpi hello world test on one and two
>> nodes.
>> Now I wanted to enhance the script a bit more and include working in the
>> local directory of the nodes which is /work.
>> To get everything up and running I used the script which I attached for
>> you (it also includes the output after running the script).
>> It should basically just copy all data to /work/tants/$SLURM_JOB_NAME
>> before doing the mpi hello world.
>> But it seems that srun does not know $SLURM_JOB_NAME even though it is
>> there.
>> /work/tants belongs to the correct user and has rwx permissions.
>>
>> So did I just configure something wrong or what happened here? Nearly
>> the same example is working on our cluster with
>> 15.X. The script is only for testing purposes, thats why there are so
>> many echo commands in there.
>> If you see any mistake or can recommend better configurations I would
>> glady hear them.
>> Should you need any more information I will provide them.
>> Thank you for your time!
> Shouldn't the variable be $SBATCH_JOB_NAME?
>
> Cheers,
>
> Loris
>

when I use "echo $SLURM_JOB_NAME" it will tell me the name I specified
with #SBATCH -J.
It is not working with srun in this version (it was working in 15.x).

However, when I now use "echo $SBATCH_JOB_NAME" it is just a blank
variable. As told by someone from the list,
I used the command "env" to verify which variables are available. This
list includes SLURM_JOB_NAME
with the name I specified. So $SLURM_JOB_NAME shouldn't be a problem.

Thank you for your suggestion though.
Any other hints?

Best regards,
Dennis

-- 
Dennis Tants
Auszubildender: Fachinformatiker für Systemintegration

ZARM - Zentrum für angewandte Raumfahrttechnologie und Mikrogravitation
ZARM - Center of Applied Space Technology and Microgravity

Universität Bremen
Am Fallturm
28359 Bremen, Germany

Telefon: 0421 218 57940
E-Mail: ta...@zarm.uni-bremen.de

www.zarm.uni-bremen.de