[slurm-dev] Re: Jobs submitted simultaneously go on the same GPU

2017-04-11 Thread Christopher Samuel

On 10/04/17 21:08, Oliver Grant wrote:

> We did not have a gres.conf file. I've created one:
> cat /cm/shared/apps/slurm/var/etc/gres.conf
> # Configure support for our four GPU
> NodeName=node[001-018] Name=gpu File=/dev/nvidia[0-3]
> 
> I've read about "global" and "per-node" gres.conf, but I don't know how
> to implement them or if I need to?

Yes you do.

Here's an (anonymised) example from a cluster that I help with that has
both GPUs and MICs on various nodes.

# We will have GPU & KNC nodes so add the GPU & MIC GresType to manage them
GresTypes=gpu,mic
# Node definitions for nodes with GPUs
NodeName=thing-gpu[001-005] Weight=3000 NodeAddr=thing-gpu[001-005] RealMemory=254000 CoresPerSocket=6 Sockets=2 Gres=gpu:k80:4
# Node definitions for nodes with Xeon Phi
NodeName=thing-knc[01-03] Weight=2000 NodeAddr=thing-knc[01-03] RealMemory=126000 CoresPerSocket=10 Sockets=2 ThreadsPerCore=2 Gres=mic:5110p:2
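
For the gres.conf side, the matching entries just list the device files.
A sketch assuming four K80s per GPU node -- either as a local gres.conf on
each GPU node:

Name=gpu Type=k80 File=/dev/nvidia[0-3]

or as one "global" gres.conf shared by all nodes, using the NodeName prefix:

NodeName=thing-gpu[001-005] Name=gpu Type=k80 File=/dev/nvidia[0-3]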

You'll also need to restart slurmctld & all the slurmd's to pick up
this new config; I don't think "scontrol reconfigure" will deal
with this.
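
Something along these lines (assuming systemd and pdsh; adjust for
whatever init system and parallel shell you actually use):

# on the management node
systemctl restart slurmctld
# on the compute nodes
pdsh -w node[001-018] systemctl restart slurmd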

Best of luck,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Distinguishing past jobs that waited due to dependencies vs resources?

2017-04-11 Thread Christopher Samuel

Hi folks,

We're looking at wait times on our clusters historically but would like
to be able to distinguish jobs that had long wait times due to
dependencies rather than just waiting for resources (or because the user
had too many other jobs in the queue at that time).

A quick 'git grep' of the source code -- after reading 'man sacct' and not
finding anything, and running 'sacct -e' without seeing anything useful
there either -- doesn't offer much hope.

Anyone else dealing with this?

We're on 16.05.x at the moment with slurmdbd.
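
For what it's worth, the Submit and Eligible timestamps that sacct does
record might get part of the way there -- a dependency-held job should only
become eligible once its dependency clears -- e.g.:

sacct -a -X -S 2017-01-01 -E 2017-04-01 -o JobID,Submit,Eligible,Start

but that wouldn't separate out jobs that waited because the user had too
many other jobs queued at the time.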

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Randomly jobs failures

2017-04-11 Thread Christopher Samuel

On 11/04/17 17:42, Andrea del Monaco wrote:

> [2017-04-11T08:22:03+02:00] error: Error opening file
> /cm/shared/apps/slurm/var/cm/statesave/job.830332/script, No such file
> or directory
> [2017-04-11T08:22:03+02:00] error: Error opening file
> /cm/shared/apps/slurm/var/cm/statesave/job.830332/environment, No such
> file or directory

I would suggest that you are looking at transient NFS failures (which
may not be logged).

Are you using NFSv3 or v4 to talk to the NFS server, and which OSes are
you running on each end?
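
Something like this on one of the nodes should show which protocol
version each mount is actually using:

nfsstat -m
# or
grep nfs /proc/mounts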

cheers,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 Melbourne Bioinformatics - The University of Melbourne
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545


[slurm-dev] Re: Best Way to Schedule Jobs based on predetermined Lists

2017-04-11 Thread maviko.wagner


Hello Thomas and others,

thanks again for the feedback. I agree, I don't actually need Slurm for
my small-scale cluster.
However, it's part of the baseline assignment I'm working on to use as
much HPC-established software as possible.


For now I have settled on expanding the standard sched/builtin plugin to
support taking advice regarding node selection based on my lists.
I'm aware that this will be working against the established scheduler,
and part of my project is to see how and why.


Your suggestion regarding "features" as suggestions is interesting
though, and I believe it could actually be integrated with the shipped
scheduler.
However, this won't be my focus for now and might be something worth
looking into for a follow-up project.


Currently I'm looking for advice on two topics, going a bit deeper into
the code:


a) Where/how does Slurm store whatever information I passed with "srun
--export"?
I found both the job-desc and job-details structs, but I can't seem
to find any info regarding the "export" environment-variable strings.

It has to be stored somewhere internally, right?
I'd like to be able to access/modify it from within the scheduler if
possible.


b) On some of my machines I need to set power caps with RAPL (using a
binary I usually call via ssh) from within Slurm.
I suppose I could either try to set up custom prologs or edit the
existing one; I'm unsure whether that's possible, however.

Any advice on how I could realize that behaviour?
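
To make the question concrete, what I have in mind is roughly the
following (the script path and the rapl binary are placeholders for my
setup):

# slurm.conf
Prolog=/etc/slurm/prolog.d/powercap.sh

and /etc/slurm/prolog.d/powercap.sh itself, running on each allocated
node just before the job starts:

#!/bin/bash
# set_rapl_cap stands in for the binary I currently invoke via ssh
/usr/local/bin/set_rapl_cap --watts 95 \
    || logger -t powercap "failed to set cap for job $SLURM_JOB_ID"
# exit 0 so a failed cap doesn't drain the node
exit 0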

Thanks in advance,

M. Wagner

On 2017-04-10 20:28, Thomas M. Payerle wrote:

On 2017-04-05 16:00, maviko.wag...@fau.de wrote:

Hello Dani and Thomas,

[ ...]

However, I think I did not specify clearly enough what my cluster looks
like and what I'm trying to achieve.
Compared to a regular HPC cluster, my testing cluster consists of as
little as 5 nodes (each having the same "grand-scale" features, so no
IB nodes etc., and only differing in hardware details like CPU, RAM
etc., including some MCUs akin to a Raspberry Pi).
The purpose of this cluster is to investigate how smart distribution
of workloads based on predetermined performance and energy data can
benefit HPC clusters that consist of heterogeneous systems that differ
greatly regarding energy consumption and performance.
It's just a small research project.


If your only intent is to do this on your 5 node test cluster, you
probably do not need Slurm.  If you are looking to have something expand
to real clusters, then you really should be using something like features
and the scheduler.  The scheduler is already taking into account what
nodes provide what resources, what resources on the nodes are currently
available for use, and handling the complex (and generally, at least I
found it to be much more complex than I initially and naively thought it
should be when I first started thinking about it) task of scheduling the
jobs.

My IB example was just an example.  You could just as easily assign
Slurm "features" based on the feature set of the CPU, etc.  E.g., if
only some nodes have CPUs with AVX support, label those nodes as "avx"
and jobs requiring AVX can restrict themselves to such nodes.
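
Concretely, assuming the AVX-capable nodes are tagged in slurm.conf, that
restriction is just a constraint on the job, e.g.:

NodeName=node[01-04] ... Feature=avx     # slurm.conf, node side
sbatch --constraint=avx job.sh           # at submission time
#SBATCH --constraint=avx                 # or inside the batch script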

If you start specifying specific nodes in requests to the scheduler, you
are going to find yourself working against the scheduler, and that is not
likely to have a good outcome.  You are better off telling Slurm which
nodes have which features (essentially a one-time configuration of the
cluster) and then have your code translate the requirements into a list
of "features" requested for the job under Slurm.

The only part that I see as a potential major problem is that, as I tried
to explain previously, the "features" requested by a job are REQUIREMENTS
to Slurm, not SUGGESTIONS.  E.g., if a job can run either with or without
AVX support, but runs better with AVX support, requiring the "avx" feature
will force the job to wait for a node supporting AVX, even if all the AVX
nodes are in use and there are plenty of non-AVX nodes which are idle.

I am not aware of anything in Slurm which handles such requests as
SUGGESTIONS, and doing so I believe would greatly complicate an already
complex algorithm.  I believe anything done would need to modify the
actual C code for the scheduler.
It probably is not _too_ bad to have a situation wherein, when a job
"suggesting" avx starts to run, it picks any avx nodes currently available
to it first.  But that is likely to have only limited success.  The closest
thing currently in the Slurm code base is the stuff for attempting to keep
the nodes for a job on the same leaf switch (e.g.
https://slurm.schedmd.com/topology.html), but I suspect that would be quite
complicated to handle across a number of "features".


[slurm-dev] Re: LDAP required?

2017-04-11 Thread Uwe Sauter

On modern systems, nscd or nslcd should have been replaced by sssd. sssd has
much better caching than the older services.


Am 11.04.2017 um 17:17 schrieb Benjamin Redling:
> 
> AFAIK most requests never hit the LDAP servers.
> In production there is always a cache on the client side -- nscd might
> have issues, but that's another story.
> 
> Regards,
> Benjamin
> 
> On 2017-04-11 15:32, Grigory Shamov wrote:
>> On a larger cluster, deploying NIS, LDAP etc. might require some
>> thought, because you will be testing the performance of your LDAP servers
>> in the worst case with a few hundred simultaneous requests, no? That's why
>> many specialized cluster tools like ROCKS, Perceus etc. would rather
>> synchronize files than use LDAP.


[slurm-dev] Re: LDAP required?

2017-04-11 Thread Benjamin Redling

AFAIK most requests never hit the LDAP servers.
In production there is always a cache on the client side -- nscd might
have issues, but that's another story.

Regards,
Benjamin

On 2017-04-11 15:32, Grigory Shamov wrote:
> On a larger cluster, deploying NIS, LDAP etc. might require some
> thought, because you will be testing the performance of your LDAP servers
> in the worst case with a few hundred simultaneous requests, no? That's why
> many specialized cluster tools like ROCKS, Perceus etc. would rather
> synchronize files than use LDAP.

-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
☎ +49 3641 9 44323


[slurm-dev] Re: LDAP required?

2017-04-11 Thread Daniel Kidger
Are you sure you need /etc/passwd, NIS, LDAP or whatever at all?
For the simple case you can often get away with users only having UIDs on
the compute nodes, with no matching username.

By the way, if using /etc/passwd then I would suggest you use clush useradd
or equivalent rather than copying /etc/passwd and /etc/shadow.  One reason
is that a management node will have usernames associated with daemons that
do not exist on compute nodes.
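
For example, something like this (node names and UID are just placeholders):

clush -w node[01-05] useradd -u 5001 -m alice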

Dan Kidger, IBM


On 11 April 2017 at 14:32, Grigory Shamov wrote:

> On a larger cluster, deploying NIS, LDAP etc. might require some thought,
> because you will be testing the performance of your LDAP servers in the worst
> case with a few hundred simultaneous requests, no? That's why many specialized
> cluster tools like ROCKS, Perceus etc. would rather synchronize files than
> use LDAP.
>
> --
> Grigory Shamov
>
>
>
>
> From: Marcin Stolarek 
> Reply-To: slurm-dev 
> Date: Tuesday, April 11, 2017 at 12:48 AM
> To: slurm-dev 
> Subject: [slurm-dev] Re: LDAP required?
>
> but... is LDAP such a big issue?
>
> 2017-04-10 22:03 GMT+02:00 Jeff White :
>
>> Using Salt/Ansible/Chef/Puppet/Engine is another way to get it done.
>> Define your users in states/playbooks/whatever and don't bother with
>> painful LDAP or ancient NIS solutions.
>>
>> --
>> Jeff White
>> HPC Systems Engineer
>> Information Technology Services - WSU
>>
>> On 04/10/2017 09:39 AM, Alexey Safonov wrote:
>>
>> If you don't want to share passwd and setup LDAP which is complex task
>> you can setup NIS. It will take 30 minutes of your time
>>
>> Alex
>>
>> On 11 Apr 2017 at 0:35, "Raymond Wan" wrote:
>>
>>>
>>> Dear all,
>>>
>>> I'm trying to set up a small cluster of computers (i.e., less than 5
>>> nodes).  I don't expect the number of nodes to ever get larger than
>>> this.
>>>
>>> For SLURM to work, I understand from web pages such as
>>> https://slurm.schedmd.com/accounting.html
>>> 
>>> that UIDs need to be shared
>>> across nodes.  Based on this web page, it seems sharing /etc/passwd
>>> between nodes appears sufficient.  The word LDAP is mentioned at the
>>> end of the paragraph as an alternative.
>>>
>>> I guess what I would like to know is whether it is acceptable to
>>> completely avoid LDAP and use the approach mentioned there?  The
>>> reason I'm asking is that I seem to be having a very nasty time
>>> setting up LDAP.  It doesn't seem as "easy" as I thought it would be
>>> [perhaps it was my fault for thinking it would be easy...].
>>>
>>> If I can set up a small cluster without LDAP, that would be great.
>>> But beyond this web page, I am wondering if there are suggestions for
>>> "best practices".  For example, in practice, do most administrators
>>> use LDAP?  If so and if it'll pay off in the end, then I can consider
>>> continuing with setting it up...
>>>
>>> Thanks a lot!
>>>
>>> Ray
>>>
>>
>>
>


[slurm-dev] Slurm license management

2017-04-11 Thread mercanca


Hi;

We are using slurm-16.05.5. I am trying to set up dynamic licenses
according to the "Licenses Guide" as follows:


sacctmgr add resource name=matlab count=10 server=flex5 servertype=flexlm type=license percentallowed=100




sacctmgr shows the license:

sacctmgr show resource
  Name Server Type  Count % Allocated ServerType
-- --  -- --- --
matlab  flex5  License 10   0 flexlm



But when I try scontrol, it says:

scontrol show lic
No licenses configured in Slurm.



Also srun with "-L matlab@flex5" says:

srun: error: Unable to allocate resources: Invalid license specification



Is it a bug, or did I miss something? Thank you.


Ahmet Mercan.


[slurm-dev] Re: LDAP required?

2017-04-11 Thread Markus Koeberl

On Tuesday 11 April 2017 08:17:00 Raymond Wan wrote:
> 
> Dear all,
> 
> Thank you all of you for the many helpful alternatives!
> 
> Unfortunately, system administration isn't my main responsibility so
> I'm (regrettably) not very good at it and have found LDAP on Ubuntu to
> be very unfriendly to set up.  I do understand that it must be a good
> solution for a larger setup with a full-time system administrator.
> But, if I can get away with something simpler for a cluster of just a
> few nodes, then I might try that instead.
> 
> So far, no one seems to discourage me from simply copying /etc/passwd
> between servers.  I can understand that this solution seems a bit
> ad-hoc, but if it works and there are no "significant" downsides, I
> might give that a try.  In fact, perhaps I'll give this a try now, get
> the cluster up (since others are waiting for it) and while it is
> running play with one of the options that have been mentioned and see
> if it is worth swapping out /etc/passwd for this alternative...  I
> guess this should work?

Be careful: depending on how you set up your hosts, copying /etc/passwd
might be dangerous if you use Ubuntu.
I am using Debian wheezy, set up all my hosts automatically using FAI, and
use LDAP for managing user accounts.
It seems that, depending on when I did the setup of a host, the contents of
/etc/passwd differ. The difference is in the order in which the system
groups get created during the installation, and therefore the UID and GID
they get.

On Debian-based systems (I don't know if it is the same for others) only a
small number of system users and groups have a fixed UID and GID. All the
others get assigned dynamically during installation.

As long as you always clone a master image and install new software in the
same order on all hosts, it might work to copy /etc/passwd.

A short script to create all missing users with fixed UIDs and GIDs might be
a much better and less error-prone solution for your small setup...
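
A sketch of what I mean (usernames and IDs are made up):

#!/bin/bash
# create missing groups/users with fixed UIDs and GIDs on this host
while IFS=: read -r name uid gid; do
    getent group  "$gid"  >/dev/null || groupadd -g "$gid" "$name"
    getent passwd "$name" >/dev/null || useradd -u "$uid" -g "$gid" -m "$name"
done <<'EOF'
alice:5001:5001
bob:5002:5002
EOF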

> I suppose this isn't "urgent", but yes...getting the cluster set up
> with SLURM soon will allow others to use it.  Then, I can take my time
> with other options.  I guess I was worried if copying /etc/passwd will
> limit what I can do later.  I guess if Linux-based UIDs and GIDs
> match, then I shouldn't have any surprises?
> 
> Thank you for your replies!  They were most helpful!  I thought I had
> only two options for SLURM:  /etc/passwd vs LDAP.  I didn't realise of
> other choices available to me.  Thank you!


-- 
Markus Koeberl
Graz University of Technology
Signal Processing and Speech Communication Laboratory
E-mail: markus.koeb...@tugraz.at


[slurm-dev] Re: LDAP required?

2017-04-11 Thread Benjamin Redling

Am 11. April 2017 08:21:31 MESZ, schrieb Uwe Sauter :
>
>Ray,
>
>if you're going with the easy "copy" method just be sure that the nodes
>are all in the same state (user management-wise) before
>you do your first copy. Otherwise you might accidentally delete already
>existing users.
>
>I also encourage you to have a look into Ansible which makes it easy to
>copy files between nodes (and which helps not to forget a
>node when updating the files).
>
>
>Regards,
>
>   Uwe
>

Indeed, look into Ansible and avoid "ad-hoc".
Ansible has a "user" module to handle that case with grace -- no accidental
overwriting.
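
A one-line ad-hoc sketch (the inventory group name and UID are placeholders):

ansible nodes -b -m user -a "name=alice uid=5001 state=present"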

Regards,
Benjamin
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321


[slurm-dev] Randomly jobs failures

2017-04-11 Thread Andrea del Monaco
Hello there,

Some of the jobs crash without any apparent valid reason.
The logs are the following:
Controller:
[2017-04-11T08:22:03+02:00] debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE
uid=0
[2017-04-11T08:22:03+02:00] debug2: _slurm_rpc_epilog_complete JobId=830468
Node=cnode001 usec=60
[2017-04-11T08:22:03+02:00] debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE
uid=0
[2017-04-11T08:22:03+02:00] debug2: _slurm_rpc_epilog_complete JobId=830468
Node=cnode007 usec=25
[2017-04-11T08:22:03+02:00] debug:  sched: Running job scheduler
[2017-04-11T08:22:03+02:00] debug2: found 92 usable nodes from config
containing cnode[001-100]
[2017-04-11T08:22:03+02:00] debug2: select_p_job_test for job 830332
[2017-04-11T08:22:03+02:00] sched: Allocate JobId=830332
NodeList=cnode[001,007,022,030-033,041-044,047-048,052-054,058-061]
#CPUs=320
[2017-04-11T08:22:03+02:00] debug2: prolog_slurmctld job 830332 prolog
completed
[2017-04-11T08:22:03+02:00] error: Error opening file
/cm/shared/apps/slurm/var/cm/statesave/job.830332/script, No such file or
directory
[2017-04-11T08:22:03+02:00] error: Error opening file
/cm/shared/apps/slurm/var/cm/statesave/job.830332/environment, No such file
or directory
[2017-04-11T08:22:03+02:00] debug2: Spawning RPC agent for msg_type 4005
[2017-04-11T08:22:03+02:00] debug2: got 1 threads to send out
[2017-04-11T08:22:03+02:00] debug2: Tree head got back 0 looking for 1
[2017-04-11T08:22:03+02:00] debug2: Tree head got back 1
[2017-04-11T08:22:03+02:00] debug2: Tree head got them all
[2017-04-11T08:22:03+02:00] debug2: node_did_resp cnode001
[2017-04-11T08:22:03+02:00] debug2: Processing RPC:
REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=830332
[2017-04-11T08:22:03+02:00] error: slurmd error running JobId=830332 on
node(s)=cnode001: Slurmd could not create a batch directory or file
[2017-04-11T08:22:03+02:00] update_node: node cnode001 reason set to: batch
job complete failure
[2017-04-11T08:22:03+02:00] update_node: node cnode001 state set to DRAINING
[2017-04-11T08:22:03+02:00] completing job 830332
[2017-04-11T08:22:03+02:00] Batch job launch failure, JobId=830332
[2017-04-11T08:22:03+02:00] debug2: Spawning RPC agent for msg_type 6011
[2017-04-11T08:22:03+02:00] sched: job_complete for JobId=830332 successful

Node:
[2017-04-11T08:22:03+02:00] debug2: Processing RPC: REQUEST_BATCH_JOB_LAUNCH
[2017-04-11T08:22:03+02:00] debug:  task_slurmd_batch_request: 830332
[2017-04-11T08:22:03+02:00] debug:  Calling
/cm/shared/apps/slurm/2.5.7/sbin/slurmstepd spank prolog
[2017-04-11T08:22:03+02:00] Reading slurm.conf file: /etc/slurm/slurm.conf
[2017-04-11T08:22:03+02:00] Running spank/prolog for jobid [830332] uid
[40281]
[2017-04-11T08:22:03+02:00] spank: opening plugin stack
/etc/slurm/plugstack.conf
[2017-04-11T08:22:03+02:00] debug:  [job 830332] attempting to run prolog
[/cm/local/apps/cmd/scripts/prolog]
[2017-04-11T08:22:03+02:00] Launching batch job 830332 for UID 40281
[2017-04-11T08:22:03+02:00] debug level is 6.
[2017-04-11T08:22:03+02:00] Job accounting gather LINUX plugin loaded
[2017-04-11T08:22:03+02:00] WARNING: We will use a much slower algorithm
with proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other
proctrack when using jobacct_gather/linux
[2017-04-11T08:22:03+02:00] switch NONE plugin loaded
[2017-04-11T08:22:03+02:00] Received cpu frequency information for 16 cpus
[2017-04-11T08:22:03+02:00] setup for a batch_job
[2017-04-11T08:22:03+02:00] [830332] _make_batch_script: called with NULL
script
[2017-04-11T08:22:03+02:00] [830332] batch script setup failed for job
830332.4294967294
[2017-04-11T08:22:03+02:00] [830332] sending REQUEST_COMPLETE_BATCH_SCRIPT,
error:4010
[2017-04-11T08:22:03+02:00] [830332] auth plugin for Munge (
http://code.google.com/p/munge/) loaded
[2017-04-11T08:22:03+02:00] [830332] _step_setup: no job returned
[2017-04-11T08:22:03+02:00] [830332] done with job
[2017-04-11T08:22:03+02:00] debug2: got this type of message 6011
[2017-04-11T08:22:03+02:00] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2017-04-11T08:22:03+02:00] debug:  _rpc_terminate_job, uid = 450
[2017-04-11T08:22:03+02:00] debug:  task_slurmd_release_resources: 830332
[2017-04-11T08:22:03+02:00] debug:  credential for job 830332 revoked
[2017-04-11T08:22:03+02:00] debug2: No steps in jobid 830332 to send signal
18
[2017-04-11T08:22:03+02:00] debug2: No steps in jobid 830332 to send signal
15
[2017-04-11T08:22:03+02:00] debug2: set revoke expiration for jobid 830332
to 1491892923 UTS
[2017-04-11T08:22:03+02:00] debug:  Waiting for job 830332's prolog to
complete
[2017-04-11T08:22:03+02:00] debug:  Finished wait for job 830332's prolog
to complete


I have already checked that /cm/shared/apps/slurm/var/cm/statesave is
accessible, and it is, from both the node and the master node.

What I wonder is: what triggers this behaviour? Is it that the master is not
able to create the files, so the slurm daemon on the compute node fails, or
is it the opposite?

The issue happens randomly and it is not 

[slurm-dev] Re: LDAP required?

2017-04-11 Thread Uwe Sauter

Ray,

if you're going with the easy "copy" method just be sure that the nodes are all 
in the same state (user management-wise) before
you do your first copy. Otherwise you might accidentally delete already 
existing users.

I also encourage you to have a look into Ansible, which makes it easy to copy
files between nodes (and which helps you not to forget a
node when updating the files).


Regards,

Uwe

Am 11.04.2017 um 08:17 schrieb Raymond Wan:
> 
> Dear all,
> 
> Thank you all of you for the many helpful alternatives!
> 
> Unfortunately, system administration isn't my main responsibility so
> I'm (regrettably) not very good at it and have found LDAP on Ubuntu to
> be very unfriendly to set up.  I do understand that it must be a good
> solution for a larger setup with a full-time system administrator.
> But, if I can get away with something simpler for a cluster of just a
> few nodes, then I might try that instead.
> 
> So far, no one seems to discourage me from simply copying /etc/passwd
> between servers.  I can understand that this solution seems a bit
> ad-hoc, but if it works and there are no "significant" downsides, I
> might give that a try.  In fact, perhaps I'll give this a try now, get
> the cluster up (since others are waiting for it) and while it is
> running play with one of the options that have been mentioned and see
> if it is worth swapping out /etc/passwd for this alternative...  I
> guess this should work?
> 
> I suppose this isn't "urgent", but yes...getting the cluster set up
> with SLURM soon will allow others to use it.  Then, I can take my time
> with other options.  I guess I was worried if copying /etc/passwd will
> limit what I can do later.  I guess if Linux-based UIDs and GIDs
> match, then I shouldn't have any surprises?
> 
> Thank you for your replies!  They were most helpful!  I thought I had
> only two options for SLURM:  /etc/passwd vs LDAP.  I didn't realise there
> were other choices available to me.  Thank you!
> 
> Ray
> 
> 
> 
> 
> On Tue, Apr 11, 2017 at 2:05 PM, Lachlan Musicman  wrote:
>> On 11 April 2017 at 02:36, Raymond Wan  wrote:
>>>
>>>
>>> For SLURM to work, I understand from web pages such as
>>> https://slurm.schedmd.com/accounting.html that UIDs need to be shared
>>> across nodes.  Based on this web page, it seems sharing /etc/passwd
>>> between nodes appears sufficient.  The word LDAP is mentioned at the
>>> end of the paragraph as an alternative.
>>>
>>> I guess what I would like to know is whether it is acceptable to
>>> completely avoid LDAP and use the approach mentioned there?  The
>>> reason I'm asking is that I seem to be having a very nasty time
>>> setting up LDAP.  It doesn't seem as "easy" as I thought it would be
>>> [perhaps it was my fault for thinking it would be easy...].
>>>
>>> If I can set up a small cluster without LDAP, that would be great.
>>> But beyond this web page, I am wondering if there are suggestions for
>>> "best practices".  For example, in practice, do most administrators
>>> use LDAP?  If so and if it'll pay off in the end, then I can consider
>>> continuing with setting it up...
>>
>>
>>
>> We have had success with a FreeIPA installation to manage auth - every node
>> is enrolled in a domain and each node runs SSSD (the FreeIPA client).
>>
>> Our auth actually backs onto an Active Directory domain - I don't even have
>> to manage the users. Which, to be honest, is quite a relief.
>>
>> cheers
>> L.
>>
>> --
>> The most dangerous phrase in the language is, "We've always done it this
>> way."
>>
>> - Grace Hopper
>>


[slurm-dev] Re: LDAP required?

2017-04-11 Thread Raymond Wan

Dear all,

Thank you all of you for the many helpful alternatives!

Unfortunately, system administration isn't my main responsibility so
I'm (regrettably) not very good at it and have found LDAP on Ubuntu to
be very unfriendly to set up.  I do understand that it must be a good
solution for a larger setup with a full-time system administrator.
But, if I can get away with something simpler for a cluster of just a
few nodes, then I might try that instead.

So far, no one seems to discourage me from simply copying /etc/passwd
between servers.  I can understand that this solution seems a bit
ad-hoc, but if it works and there are no "significant" downsides, I
might give that a try.  In fact, perhaps I'll give this a try now, get
the cluster up (since others are waiting for it) and while it is
running play with one of the options that have been mentioned and see
if it is worth swapping out /etc/passwd for this alternative...  I
guess this should work?

I suppose this isn't "urgent", but yes...getting the cluster set up
with SLURM soon will allow others to use it.  Then, I can take my time
with other options.  I guess I was worried if copying /etc/passwd will
limit what I can do later.  I guess if Linux-based UIDs and GIDs
match, then I shouldn't have any surprises?

Thank you for your replies!  They were most helpful!  I thought I had
only two options for SLURM:  /etc/passwd vs LDAP.  I didn't realise there
were other choices available to me.  Thank you!

Ray




On Tue, Apr 11, 2017 at 2:05 PM, Lachlan Musicman  wrote:
> On 11 April 2017 at 02:36, Raymond Wan  wrote:
>>
>>
>> For SLURM to work, I understand from web pages such as
>> https://slurm.schedmd.com/accounting.html that UIDs need to be shared
>> across nodes.  Based on this web page, it seems sharing /etc/passwd
>> between nodes appears sufficient.  The word LDAP is mentioned at the
>> end of the paragraph as an alternative.
>>
>> I guess what I would like to know is whether it is acceptable to
>> completely avoid LDAP and use the approach mentioned there?  The
>> reason I'm asking is that I seem to be having a very nasty time
>> setting up LDAP.  It doesn't seem as "easy" as I thought it would be
>> [perhaps it was my fault for thinking it would be easy...].
>>
>> If I can set up a small cluster without LDAP, that would be great.
>> But beyond this web page, I am wondering if there are suggestions for
>> "best practices".  For example, in practice, do most administrators
>> use LDAP?  If so and if it'll pay off in the end, then I can consider
>> continuing with setting it up...
>
>
>
> We have had success with a FreeIPA installation to manage auth - every node
> is enrolled in a domain and each node runs SSSD (the FreeIPA client).
>
> Our auth actually backs onto an Active Directory domain - I don't even have
> to manage the users. Which, to be honest, is quite a relief.
>
> cheers
> L.
>
> --
> The most dangerous phrase in the language is, "We've always done it this
> way."
>
> - Grace Hopper
>


[slurm-dev] Re: LDAP required?

2017-04-11 Thread Lachlan Musicman
On 11 April 2017 at 02:36, Raymond Wan  wrote:

>
> For SLURM to work, I understand from web pages such as
> https://slurm.schedmd.com/accounting.html that UIDs need to be shared
> across nodes.  Based on this web page, it seems sharing /etc/passwd
> between nodes appears sufficient.  The word LDAP is mentioned at the
> end of the paragraph as an alternative.
>
> I guess what I would like to know is whether it is acceptable to
> completely avoid LDAP and use the approach mentioned there?  The
> reason I'm asking is that I seem to be having a very nasty time
> setting up LDAP.  It doesn't seem as "easy" as I thought it would be
> [perhaps it was my fault for thinking it would be easy...].
>
> If I can set up a small cluster without LDAP, that would be great.
> But beyond this web page, I am wondering if there are suggestions for
> "best practices".  For example, in practice, do most administrators
> use LDAP?  If so and if it'll pay off in the end, then I can consider
> continuing with setting it up...
>


We have had success with a FreeIPA installation to manage auth - every node
is enrolled in a domain and each node runs SSSD (the FreeIPA client).

Our auth actually backs onto an Active Directory domain - I don't even have
to manage the users. Which, to be honest, is quite a relief.

cheers
L.

--
The most dangerous phrase in the language is, "We've always done it this
way."

- Grace Hopper