[slurm-dev] Re: On the need for slurm uid/gid consistency

2017-09-13 Thread Janne Blomqvist


On 2017-09-12 21:52, Phil K wrote:

I'm hoping someone can provide an explanation as to why slurm requires
uid/gid consistency
across nodes, with emphasis on the need for the 'SlurmUser' to be
uid/gid-consistent.   I know
that slurmctld and slurmdbd can run as user `slurm` and that this would
be safer than running
as root.  slurmd must run as root in any case, to my knowledge.  Is the
need for uid consistency,
esp with the SlurmUser a difficult barrier to overcome?  Please clarify
for me.  Thanks.  Phil



Yes, this is tedious. Either you need to create the slurm user with a consistent 
uid/gid when provisioning a node, or ldap/nis/whatever needs to be up and running 
before you start any slurm daemons. It would be nicer if the RPMs could just create 
a slurm user when installing the slurmctld/slurmdbd packages, letting the system 
allocate a uid/gid that doesn't conflict with any other local uid/gid, without 
having to ensure the slurm uid/gid is globally unique.
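
For example, a minimal provisioning-time sketch (the uid/gid value 450 is just a 
made-up example, pick one that is free on all your nodes):

# run on every node before starting any slurm daemons
groupadd --system --gid 450 slurm
useradd --system --uid 450 --gid slurm --shell /sbin/nologin \
        --home-dir /var/lib/slurm --comment "Slurm workload manager" slurm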

Anyway, I think the reason behind this is that slurmd needs to ensure that 
control messages coming from slurmctld really come from the slurmctld daemon, 
and not from some random unprivileged process. And since munge is needed anyway 
to verify that end-user uids/gids are correct, it is also used to verify that 
the control messages really come from the SlurmUser.

I guess in principle you could get rid of the requirement for a SlurmUser with 
a consistent uid/gid, say by using certificates (TLS or SSH keys). But then you'd 
have to provision the certificates as part of the deployment, so I'm not sure 
that buys you any more ease of use in the end. Or does anybody have a better 
idea?

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: Fair resource scheduling

2017-09-08 Thread Janne Blomqvist

On 2017-09-07 17:07, Patrick Goetz wrote:
> 
> On 08/25/2017 09:27 AM, Peter A Ruprecht wrote:
>> Regarding #2, we use https://slurm.schedmd.com/fair_tree.html and it
>> has really exceeded my expectations.
>>
> 
> 
> I was distracted by another project and am just now coming back to this.
>  It's not at all clear to me how the Fair Tree Fairshare Algorithm
> solves the problem I was asking about, namely giving higher priority to
> shorter jobs submitted by people who don't run jobs that often.  As far
> as I can tell from https://slurm.schedmd.com/fair_tree.html, this
> algorithm mostly maintains scheduling priorities across child tasks?  Am
> I missing something?
> 

Please read the description of the slurm fairshare algorithm, which is
explained in more detail at

https://slurm.schedmd.com/priority_multifactor.html#fairshare

Fair tree (which we also use and are pretty happy with) is a
modification of the above basic fairshare algorithm that works "better"
when you have a hierarchy of accounts. The hierarchy might or might not
be relevant for your setup, but the basic fairshare idea seems to be
exactly what you're asking for, if I understand you correctly.
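
For reference, the relevant slurm.conf knobs look roughly like this (a sketch; 
the weights are made-up examples and need tuning for your site):

PriorityType=priority/multifactor
PriorityFlags=FAIR_TREE           # enables fair tree; omit for the classic fairshare
PriorityDecayHalfLife=7-0
PriorityWeightFairshare=100000
PriorityWeightAge=1000
PriorityWeightJobSize=0

With the fairshare weight dominating the other weights, users with little recent 
usage get a higher priority, which sounds like what you are after.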

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: SLURM 16.05.10-2 jobacct_gather/linux inconsistencies?

2017-09-07 Thread Janne Blomqvist
3)
> [2017-09-05T13:19:28.109] [10215464] debug:  jag_common_poll_data: Task 
> average frequency = 2527 pid
> 22050 mem size 4228 230384 time 0.01(0+0)
> [2017-09-05T13:19:28.281] [10215464.0] debug:  jag_common_poll_data: Task 
> average frequency = 2495
> pid 22126 mem size 3560932 7655316 time 239.55(234+4)
> [2017-09-05T13:19:58.113] [10215464] debug:  jag_common_poll_data: Task 
> average frequency = 2527 pid
> 22050 mem size 4228 230384 time 0.01(0+0)
> [2017-09-05T13:19:58.286] [10215464.0] debug:  jag_common_poll_data: Task 
> average frequency = 2493
> pid 22126 mem size 3560936 7655316 time 299.51(294+5)
>
>
>
> Has anyone else noticed anything similar on their cluster(s)?  I cannot 
> confirm if this was
> happening before we upgraded from 15.08.4 to 16.05.10-2.
>
> Thanks,
> John DeSantis
>
Not knowing the specifics of your setup, AFAIK the general advice remains that 
unless you're launching the job via srun instead of mpirun (that is, using the 
slurm MPI integration), all bets are off w.r.t. accounting. Since you seem to be 
using mvapich2(?), make sure you're compiling mvapich2 with

./configure --with-pmi=pmi2 --with-pm=slurm


and then launch the application with

srun --mpi=pmi2 vasp_std

instead of mpirun.

(from https://slurm.schedmd.com/mpi_guide.html#mvapich2 )
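
In a batch script that would look something like this (a sketch; the node/task 
counts and the vasp_std binary are just examples):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=01:00:00

srun --mpi=pmi2 vasp_std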



-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: Building slurm from source - munge plugin not getting build

2017-05-11 Thread Janne Blomqvist

On 2017-05-11 02:34, Dhiraj Reddy wrote:
> Hi,
>
> I am trying to build latest release version of slurm from source. I have
> used the following procedure to do configure, make and make install
>
> ./configure --enable-debug --prefix=~/opt/slurm-17 --sysconfdir=~/etc/slurm
> make -j4
> make install
>
> Now, running slurmctld and slurmd with verbose logging enabled, I get the
> following
>
> ~/opt/slurm-17/sbin/slurmctld -Dv
>
> slurmctld: error: Couldn't find the specified plugin name for crypto/munge
> looking at all files
> slurmctld: error: cannot find crypto plugin for crypto/munge
> slurmctld: error: cannot create crypto context for crypto/munge
> slurmctld: fatal: slurm_cred_creator_ctx_create((null)): Operation not
> permitted
>
>
> ~/opt/slurm-17/sbin/slurmd -Dv
>
> slurmd: error: Couldn't find the specified plugin name for auth/munge
> looking at all files
> slurmd: error: cannot find auth plugin for auth/munge
> slurmd: error: cannot create auth context for auth/munge
> slurmd: error: slurmd initialization failed
>
> I can see that the above errors are because the libraries auth_munge.so and
> crypto_munge.so are not available in ~/opt/slurm-17/lib directory.
>
> Can somebody explain what is wrong with my configuration and build process
> and how the auth_munge.so and crypto_munge.so libraries can be built.
>
> I have tried the following way of confirurign with no success
>
> 1. With munge binary present at /usr/bin/munge
> ./configure --enable-debug --prefix=~/opt/slurm-17 --sysconfdir=~/etc/slurm
> --with-munge=/usr/bin
>
The munge plugins are only built if configure finds the munge headers. Do you have 
them installed? E.g. the package "munge-devel" on RHEL/CentOS with the EPEL repo, or 
"libmunge-dev" on Debian/Ubuntu.

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1

2017-05-09 Thread Janne Blomqvist


On 2017-05-09 10:27, Ole Holm Nielsen wrote:


On 05/09/2017 09:14 AM, Janne Blomqvist wrote:


On 2017-05-07 15:29, Ole Holm Nielsen wrote:


I'm announcing an initial version 0.1 of an Infiniband topology tool
"slurmibtopology.sh" for Slurm.


I have also created one, at

https://github.com/jabl/ibtopotool

You need the python networkx library (python-networkx package on centos
& Ubuntu, or install via pip).

Run with --help option to get some usage instructions. In addition to
generating slurm topology.conf, it can also generate graphviz dot files
for visualization.


Thanks for providing this tool to the Slurm community.  It seems that
tools for generating topology.conf have been developed in many places,
probably because it's an important task.

I installed python-networkx 1.8.1-12.el7 from EPEL on our CentOS 7.3
system and then executed ibtopotool.py, but it gives an error message:

# ./ibtopotool.py
Traceback (most recent call last):
  File "./ibtopotool.py", line 216, in 
graph = parse_ibtopo(args[0], options.shortlabels)
IndexError: list index out of range

Could you help solving this?


Duh.. I just pushed a fix, thanks for reporting.


--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: Creating init script in /etc/init.d while building from source

2017-05-09 Thread Janne Blomqvist


On 2017-05-09 09:09, Dhiraj Reddy wrote:

Hi,

How to create slurmd and slurmctld init scripts in the directory
/etc/init.d while building and installing slurm from source.

I think something should be done with the files ./init.d.slurm in /etc
directory but I don't know what do.

I am using Ubuntu 16.04.

Thanks
Dhiraj


Hi,

for Ubuntu 16.04 you should be using the systemd service files instead 
of init.d scripts. They are part of the rpm file when building for Red 
Hat based systems; I don't know about Ubuntu packaging, but you can find 
them in the etc/ directory of the source tree.
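
Something like this should work (a sketch, assuming a source tree where configure 
has already generated the .service files under etc/; adjust paths as needed):

# from the top of the built slurm source tree
sudo cp etc/slurmd.service etc/slurmctld.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable slurmd       # on compute nodes
sudo systemctl enable slurmctld    # on the controller node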


--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1

2017-05-09 Thread Janne Blomqvist


On 2017-05-07 15:29, Ole Holm Nielsen wrote:


I'm announcing an initial version 0.1 of an Infiniband topology tool
"slurmibtopology.sh" for Slurm.


I have also created one, at

https://github.com/jabl/ibtopotool

You need the python networkx library (python-networkx package on centos 
& Ubuntu, or install via pip).


Run with --help option to get some usage instructions. In addition to 
generating slurm topology.conf, it can also generate graphviz dot files 
for visualization.



--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: LDAP required?

2017-04-18 Thread Janne Blomqvist


On 2017-04-13 21:41, Kilian Cavalotti wrote:


Hi Janne,

On Thu, Apr 13, 2017 at 1:32 AM, Janne Blomqvist
<janne.blomqv...@aalto.fi> wrote:

Should work as of 16.05 unless you have some very peculiar setup. IIRC I
submitted some patch to get rid of the enumeration entirely, but
apparently SchedMD has customers who have multiple groups with the same
GID, and for that to work (whatever "work" means in that context) the
enumeration is necessary. But if you don't have crazy stuff like that it
should all work with enumeration disabled.


Well, even without dwelling into crazy stuff, enumeration is necessary
for things like getting a comprehensive list of all the members of a
primary group.

The way group membership usually works, users have:
* a primary group that is stored in the user record (either in
/etc/passwd or ou=accounts in LDAP)
* one or more secondary group(s), that are managed in a completely
separate branch (/etc/group or ou=groups in LDAP)

It's pretty easy to list all the members of a secondary group, because
they look like this: "secondary_group:user1,user2,..."
But for primary groups, they are in the form of "user1:primary_group",
so you have to be able to get the full list of users (through
enumeration) to be able to identify all the users that are part of
"primary_group"

And that's true, sssd is not reliable for enumeration, but it's still
required for some basic things.

Cheers,



The way slurm handles this without enumeration, e.g. for checking whether a 
user is allowed to use a partition with an AllowGroups= specifier, is that it 
checks the membership of all the groups listed in AllowGroups, and additionally 
checks whether the user's primary group is in AllowGroups. So it does not need 
enumeration in this case.
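
Conceptually it only needs per-name lookups like these, neither of which requires 
enumerating the whole directory (the user and group names are placeholders):

id -gn someuser          # the user's primary group
getent group somegroup   # explicit members of a secondary group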


I'd even go as far as saying that software intended for use in large 
environments shouldn't rely on enumeration, period.


--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: LDAP required?

2017-04-13 Thread Janne Blomqvist

On 2017-04-13 15:09, Diego Zuccato wrote:
> 
> Il 12/04/2017 08:52, Janne Blomqvist ha scritto:
> 
>> BTW, do you have some kind of trust relationship between your FreeIPA
>> domain and the AD domain, or how do you do it? I did play around with
>> using FreeIPA for our cluster as well and somehow synchronizing it with
>> the university AD domain, but in the end we managed to convince the
>> university IT to allow us to join our nodes directly to AD, so we were
>> able to skip FreeIPA entirely.
> What are you using to join nodes to AD?
> 
> I've used samba-winbind in the past but it was very fragile, and am
> currently using PBIS-Open but it's having problems with colliding UIDs
> and GIDs (multi-domain forest with quite a lot of [100k+] users and even
> more groups).
> 

We use adcli (there's an rpm package called adcli in EL7, FWIW; upstream
seems to be http://cgit.freedesktop.org/realmd/adcli ).

For node provisioning, adcli allows pre-creating multiple machine
accounts with one command (with the help of python-hostlist you can
expand hostlist syntax), and then when a node first boots it joins AD
with a one-time password (run via ansible-pull).
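
Roughly like this (a sketch from memory, check the adcli man page for the exact 
option names; the domain, password and node names are placeholders):

# on an admin host: pre-create the machine accounts with a one-time password
adcli preset-computer --domain=ad.example.com --one-time-password=SECRET \
        node01 node02 node03

# on each node at first boot (we run this via ansible-pull):
adcli join --domain=ad.example.com --one-time-password=SECRET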

A minor caveat is that we have some Samba gateway nodes to give laptops
and Windows workstations access to Lustre, and samba isn't happy with
the domain join that adcli does, and for these we use the samba "net ads
join ..." command to join them.

Not sure how any of this would work with colliding UIDs/GIDs.

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: LDAP required?

2017-04-13 Thread Janne Blomqvist

On 2017-04-13 02:30, Christopher Samuel wrote:
> 
> On 13/04/17 01:47, Jeff White wrote:
> 
>> +1 for Active Directory bashing. 
> 
> I wasn't intending to "bash" AD here, just that the AD that we were
> trying to use (and I suspect that Lachlan might be talking to) has tens
> of thousands of accounts in it and we just could not get the
> Slurm->sssd->AD chain to work reliably to be able to run a production
> system.
> 
> This was with both sssd trying to enumerate the whole domain

Err, yeah, with a large domain sssd will keel over if you don't disable
enumeration. Or well, perhaps not technically keel over, but in our
testing IIRC it hit some timeout before completing the enumeration and
then worked, err, somewhat erratically.
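
(Disabling it is a one-line change in the domain section of sssd.conf, something 
like this sketch with a made-up domain name:

[domain/ad.example.com]
id_provider = ad
enumerate = false

after which sssd only resolves the users/groups it is actually asked about.)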

> and also
> (before that) trying to get Slurm to work without sssd enumeration.

Should work as of 16.05 unless you have some very peculiar setup. IIRC I
submitted some patch to get rid of the enumeration entirely, but
apparently SchedMD has customers who have multiple groups with the same
GID, and for that to work (whatever "work" means in that context) the
enumeration is necessary. But if you don't have crazy stuff like that it
should all work with enumeration disabled.

15.08 should also work with enumeration disabled, except for
AllowGroups/DenyGroups partition specifications.

> Smaller AD domains might work more reliably, but that's not where we sit

Our AD has around 30k user accounts, IIRC.

> so we fell back to using our own LDAP server with Karaage to manage
> project/account applications, adding people to slurmdbd, etc.

So how do you manage user accounts? Just curious if someone has a sane
middle ground between integrating with the organization user account
system (AD or whatever) and DIY.

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: LDAP required?

2017-04-12 Thread Janne Blomqvist
On 2017-04-11 09:04, Lachlan Musicman wrote:
> On 11 April 2017 at 02:36, Raymond Wan <rwan.w...@gmail.com
> <mailto:rwan.w...@gmail.com>> wrote:
> 
> 
> For SLURM to work, I understand from web pages such as
> https://slurm.schedmd.com/accounting.html
> <https://slurm.schedmd.com/accounting.html> that UIDs need to be shared
> across nodes.  Based on this web page, it seems sharing /etc/passwd
> between nodes appears sufficient.  The word LDAP is mentioned at the
> end of the paragraph as an alternative.
> 
> I guess what I would like to know is whether it is acceptable to
> completely avoid LDAP and use the approach mentioned there?  The
> reason I'm asking is that I seem to be having a very nasty time
> setting up LDAP.  It doesn't seem as "easy" as I thought it would be
> [perhaps it was my fault for thinking it would be easy...].
> 
> If I can set up a small cluster without LDAP, that would be great.
> But beyond this web page, I am wondering if there are suggestions for
> "best practices".  For example, in practice, do most administrators
> use LDAP?  If so and if it'll pay off in the end, then I can consider
> continuing with setting it up...
> 
> 
> 
> We have had success with a FreeIPA installation to manage auth - every
> node is enrolled in a domain and each node runs SSSD (the FreeIPA client).

+1. Setting up an LDAP + krb5 infrastructure by hand is quite a chore
(been there, done that), but FreeIPA more or less automates all of that.

> Our auth actually backs onto an Active Directory domain - I don't even
> have to manage the users. Which, to be honest, is quite a relief.

+1. Or rather, make that +1000. Before, there would be a constant stream
of users coming to our office requesting accounts, or wanting to reset a
forgotten password, or reactivate an expired account etc.; now all of
this is offloaded to the university IT helpdesk.

BTW, do you have some kind of trust relationship between your FreeIPA
domain and the AD domain, or how do you do it? I did play around with
using FreeIPA for our cluster as well and somehow synchronizing it with
the university AD domain, but in the end we managed to convince the
university IT to allow us to join our nodes directly to AD, so we were
able to skip FreeIPA entirely.

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi



signature.asc
Description: OpenPGP digital signature


[slurm-dev] Re: Slurm & CGROUP

2017-03-16 Thread Janne Blomqvist
On 2017-03-15 17:52, Wensheng Deng wrote:
> No, it does not help:
>
> $ scontrol show config |grep -i jobacct
>
> JobAcctGatherFrequency  = 30
>
> JobAcctGatherType   = jobacct_gather/cgroup
>
> JobAcctGatherParams = NoShared
>
>
>
>
>
> On Wed, Mar 15, 2017 at 11:45 AM, Wensheng Deng <w...@nyu.edu
> <mailto:w...@nyu.edu>> wrote:
>
> I think I tried that. let me try it again. Thank you!
>
> On Wed, Mar 15, 2017 at 11:43 AM, Chris Read <cr...@drw.com
> <mailto:cr...@drw.com>> wrote:
>
>
> We explicitly exclude shared usage from our measurement:
>
>
> JobAcctGatherType=jobacct_gather/cgroup
> JobAcctGatherParams=NoShare?
>
> Chris
>
>
> 
> From: Wensheng Deng <w...@nyu.edu <mailto:w...@nyu.edu>>
> Sent: 15 March 2017 10:28
> To: slurm-dev
> Subject: [ext] [slurm-dev] Re: Slurm & CGROUP
>
> It should be (sorry):
> we 'cp'ed a 5GB file from scratch to node local disk
>
>
> On Wed, Mar 15, 2017 at 11:26 AM, Wensheng Deng <w...@nyu.edu
> <mailto:w...@nyu.edu><mailto:w...@nyu.edu
> <mailto:w...@nyu.edu>>> wrote:
> Hello experts:
>
> We turn on TaskPlugin=task/cgroup. In one Slurm job, we 'cp'ed a
> 5GB job from scratch to node local disk, declared 5 GB memory
> for the job, and saw error message as below although the file
> was copied okay:
>
> slurmstepd: error: Exceeded job memory limit at some point.
>
> srun: error: [nodenameXXX]: task 0: Out Of Memory
>
> srun: Terminating job step 41.0
>
> slurmstepd: error: Exceeded job memory limit at some point.
>
>
> From the cgroup document
> https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
> <https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt>
> Features:
> - accounting anonymous pages, file caches, swap caches usage and
> limiting them.
>
> It seems that cgroup charges memory "RSS + file caches" to user
> process like 'cp', in our case, charged to user's jobs. swap is
> off in this case. The file cache can be small or very big, and
> it should not be charged to users'  batch jobs in my opinion.
> How do other sites circumvent this issue? The Slurm version is
> 16.05.4.
>
> Thank you and Best Regards.
>
>
>
>

Could you set AllowedRAMSpace/AllowedSwapSpace in /etc/slurm/cgroup.conf to 
some big number? That way the job memory limit will be the cgroup soft limit, 
and the cgroup hard limit (at which the kernel OOM-kills the job) would be 
job_memory_limit * AllowedRAMSpace, i.e. some large value.

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi



signature.asc
Description: OpenPGP digital signature


[slurm-dev] Re: problem configuring mvapich + slurm: "error: mpi/pmi2: failed to send temp kvs to compute nodes"

2016-11-18 Thread Janne Blomqvist
debug2: failed connecting to specified socket
> '/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 5029
> slurmd: debug3: Entering _rpc_forward_data, address:
> /home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
> slurmd: debug2: failed connecting to specified socket
> '/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 6004
> slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
> slurmd: debug:  _rpc_signal_tasks: sending signal 9 to step 754.0 flag 0
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 6004
> slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS
> slurmd: debug:  _rpc_signal_tasks: sending signal 9 to step 754.0 flag 0
> slurmd: debug3: in the service_connection
> slurmd: debug2: got this type of message 6011
> slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB
> slurmd: debug:  _rpc_terminate_job, uid = 500
> slurmd: debug:  task_p_slurmd_release_resources: 754
> slurmd: debug3: state for jobid 744: ctime:1479371687 revoked:0 expires:0
> slurmd: debug3: state for jobid 745: ctime:1479371707 revoked:0 expires:0
> slurmd: debug3: state for jobid 746: ctime:1479371733 revoked:0 expires:0
> slurmd: debug3: state for jobid 747: ctime:1479371785 revoked:0 expires:0
> slurmd: debug3: state for jobid 748: ctime:1479374028 revoked:0 expires:0
> slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:1479374387
> expires:1479374387
> slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:0 expires:0
> slurmd: debug3: state for jobid 754: ctime:1479374425 revoked:0 expires:0
> slurmd: debug3: state for jobid 754: ctime:1479374425 revoked:0 expires:0
> slurmd: debug4: unable to create link for
> /home/localsoft/slurm/spool//cred_state ->
> /home/localsoft/slurm/spool//cred_state.old: File exists
> slurmd: debug4: unable to create link for
> /home/localsoft/slurm/spool//cred_state.new ->
> /home/localsoft/slurm/spool//cred_state: File exists
> slurmd: debug:  credential for job 754 revoked
> slurmd: debug2: No steps in jobid 754 to send signal 18
> slurmd: debug2: No steps in jobid 754 to send signal 15
> slurmd: debug4: sent SUCCESS
> slurmd: debug2: set revoke expiration for jobid 754 to 1479374560 UTS
> slurmd: debug:  Waiting for job 754's prolog to complete
> slurmd: debug:  Finished wait for job 754's prolog to complete
> slurmd: debug:  Calling /home/localsoft/slurm/sbin/slurmstepd spank epilog
> spank-epilog: debug:  Reading slurm.conf file:
> /home/localsoft/slurm/etc/slurm.conf
> spank-epilog: debug:  Running spank/epilog for jobid [754] uid [500]
> spank-epilog: debug:  spank: opening plugin stack
> /home/localsoft/slurm/etc/plugstack.conf
> slurmd: debug:  completed epilog for jobid 754
> slurmd: debug3: slurm_send_only_controller_msg: sent 192
> slurmd: debug:  Job 754: sent epilog complete msg: rc = 0
>
>
> As you can see, the problem seems to be these lines:
> slurmd: debug2: got this type of message 5029
> slurmd: debug3: Entering _rpc_forward_data, address:
> /home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66
> slurmd: debug2: failed connecting to specified socket
> '/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle
> slurmd: debug3: in the service_connection
>
>
> I have checked that these files exist in the shared storage and are
> accesible by the node complaining. They are however empty. Is this
> normal? What should I expect?
>
> $ ssh acme11 'ls -plah /home/localsoft/slurm/spool/'
> total 160K
> drwxr-xr-x  2 slurm slurm 4,0K nov 17 10:26 ./
> drwxr-xr-x 12 slurm slurm 4,0K nov 16 16:20 ../
> srwxrwxrwx  1 root  root 0 nov 17 10:26 acme11_755.0
> srwxrwxrwx  1 root  root 0 nov 17 10:26 acme12_755.0
> -rw---  1 root  root   284 nov 17 10:26 cred_state.old
> -rw---  1 slurm slurm 141K nov 16 14:24 slurmdbd.log
> -rw-r--r--  1 slurm slurm5 nov 16 14:24 slurmdbd.pid
> srwxr-xr-x  1 root  root 0 nov 17 10:26 sock.pmi2.755.0
>
>
> So any ideas?
>
> thanks for your help,
>
> Manuel
>
>
> PS: About mvapich compilation.
>
> I made quite a few tests, and I ended up compiling with:
> ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast
> --with-slurm=/home/localsoft/slurm
>
> Before that I tried the instructions
> in http://slurm.schedmd.com/mpi_guide.html#mvapich2
> <http://slurm.schedmd.com/mpi_guide.html#mvapich2> but if fails:
> ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast
>  --with-pmi=pmi2  --with-pm=slurm
> (...)
> checking for slurm/pmi2.h... no
> configure: error: could not find slurm/pmi2.h.  Configure aborted
>
> I also tried
> ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast
> --with-slurm=/home/localsoft/slurm --with-pmi=pmi2  --with-pm=slurm
> (...)
> checking whether we are cross compiling... configure: error: in
> `/root/mvapich2-2.2/src/mpi/romio':
> configure: error: cannot run C compiled programs.
> If you meant to cross compile, use `--host'.
> See `config.log' for more details
> configure: error: src/mpi/romio configure failed
>
>
Hi,

I think you really need both the "--with-pmi=pmi2 --with-pm=slurm" parameters to 
the configure command when building mvapich2. So you need to fix whatever issue is 
preventing it from finding slurm/pmi2.h (I have a vague recollection that at some 
point there was a problem with the slurm makefiles not installing that file, or 
something like that).

On another note, it doesn't make sense to put special files like pipes or 
sockets on a network filesystem. At best it does no harm, but there might be 
problems if several nodes want to create, say, a socket special file at the 
same shared path.
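
For example, in slurm.conf something like this keeps the slurmd state (the 
cred_state and pmi2 socket files you list above) on local disk; the paths are 
just examples:

SlurmdSpoolDir=/var/spool/slurmd            # node-local, not on NFS
StateSaveLocation=/var/spool/slurmctld      # only the controller needs this one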


-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi



signature.asc
Description: OpenPGP digital signature


[slurm-dev] Re: TmpFS directive getting ignored?

2016-11-18 Thread Janne Blomqvist
On 2016-11-17 15:21, Holger Naundorf wrote:
> Hello,
> I am currently setting up a test environment with SLURM to evaluate it
> for regular use.
>
> It looks as if the 'TmpFS' setting is not used - I set
>
> TmpFS=/scratch
>
> in slurm.conf, but if I run 'slurmd -C' on my nodes I get back the size
> of the partition containing /tmp for my 'TmpDisk' space. Within Slurm
> jobs $TMPDIR also gets set to /tmp.
>
>
> $ df -h /tmp/
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/sda296G  8.0G   84G   9% /
>
> $ df -h /scratch/
> Filesystem  Size  Used Avail Use% Mounted on
> /dev/sdb1   1.8T   68M  1.7T   1% /scratch
>
> $ ./sbin/slurmd -C
> ClusterName=X NodeName= CPUs=12 Boards=1 SocketsPerBoard=2
> CoresPerSocket=6 ThreadsPerCore=1 RealMemory=48258 TmpDisk=98301
>
> $ grep TmpFS ./etc/slurm.conf
> TmpFS=/scratch
>
>
> I am using Slurm v16.05.5.
>
> I am just starting with this, so maybe I am overlooking something
> obvious, but I did not find other tuning option for this in the
> documentation.
>
> Regards,
> Holger Naundorf
>
I've run into the same thing. I think the issue is that "slurmd -C" doesn't read 
the config file (since its purpose is to print a line you can use when creating 
your config file). Once you start slurmd for real it reads the config file, and 
TmpFS is handled like it should be. The calculation it does is not that 
difficult; you can do it yourself in the shell with something like

echo $(($(df --output=size /scratch|tail -1) / 1024))

And no, TmpFS has no effect on setting TMPDIR in jobs. If you want that, you 
have to do it yourself, e.g. with a TaskProlog script (see the slurm.conf man 
page).
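
A minimal TaskProlog sketch (lines printed as "export NAME=value" end up in the 
task environment; the /scratch layout and the script path are just examples):

#!/bin/bash
# referenced from slurm.conf as TaskProlog=/etc/slurm/taskprolog.sh
tmp=/scratch/$(id -un)/$SLURM_JOB_ID
mkdir -p "$tmp"
echo "export TMPDIR=$tmp"

(and then clean the directory up in an Epilog script.)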

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi



signature.asc
Description: OpenPGP digital signature


[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?

2016-09-29 Thread Janne Blomqvist
On 2016-09-27 10:39, Philippe wrote:
> If I can't use logrotate, what must I use ?

You can log via syslog, and let your syslog daemon handle the rotation
(plus rate limiting, handling a full disk, logging to a central log host, and
all the other nice things that syslog can do for you).
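
Concretely: leave SlurmctldLogFile/SlurmdLogFile unset in slurm.conf (the default 
is then syslog) and add a rule on the syslog side; a sketch for rsyslog:

# /etc/rsyslog.d/slurm.conf -- collect all slurm daemons into one file
:programname, startswith, "slurm"    /var/log/slurm.log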


-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi



signature.asc
Description: OpenPGP digital signature


[slurm-dev] Re: Struggling with QOS?

2016-09-29 Thread Janne Blomqvist
On 2016-09-29 04:11, Lachlan Musicman wrote:
> Hi,
> 
> After some fun incidents with accidental monopolization of the cluster,
> we decided to enforce some QOS.
[snip]
> What have I done wrong? I re-read the documentation this AM, but I can't
> see anything that might be preventing QOS from being applied except for
> maybe a qos hierarchy issue, but I've only set the two qos and they
> apply to distinct associations and partitions.

What you actually want here is GrpTRESRunMins, see
http://tech.ryancox.net/2014/04/scheduler-limit-remaining-cputime-per.html
for an explanation of how it works.

Also, if you do this you'll probably want to add "safe" to your
AccountingStorageEnforce flags.


-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi



signature.asc
Description: OpenPGP digital signature


[slurm-dev] Re: Problem when adding user to secondary group

2016-05-26 Thread Janne Blomqvist
, which version of Slurm do you have? Did you update Slurm
>>>>> recently? Did you always had this problem or you discovered that
>>>>> problem recently? Check if a newer version of Slurm solves this
>>>>> problem otherwise report it here.
>>>>>
>>>>> Best Regards,
>>>>> Valantis
>>>>>
>>>>>
>>>>> On 05/25/2016 02:59 PM, Thekla Loizou wrote:
>>>>>>
>>>>>> Hi Valanti! :)
>>>>>>
>>>>>> We are using nslcd on the compute nodes.
>>>>>> We have indeed changed the default behavior/command of salloc but
>>>>>> I don't think that this is the issue because the same happens when
>>>>>> we submit jobs via sbatch. So I believe that this is not related
>>>>>> to the new command we are using.
>>>>>>
>>>>>> When logging in as root or as user on the compute nodes via ssh we
>>>>>> get all groups after running the "id" command,
>>>>>> but when logging in through a SLURM job (interactive with salloc
>>>>>> or not interactive with sbatch) we face the problem I described.
>>>>>>
>>>>>> We have also checked the environment of the user in both cases
>>>>>> (ssh or SLURM) and the only differences are the SLURM environment
>>>>>> variables and nothing else.
>>>>>>
>>>>>> Thanks,
>>>>>> Thekla
>>>>>>
>>>>>>
>>>>>> On 25/05/2016 02:07 μμ, Chrysovalantis Paschoulas wrote:
>>>>>>>
>>>>>>> Hi Thekla! :)
>>>>>>>
>>>>>>> For me it looks like it's a configuration issue of the client
>>>>>>> LDAP name
>>>>>>> service on the compute nodes. Which service are you using? nslcd or
>>>>>>> sssd? I can see that you have change the default behavior/command of
>>>>>>> salloc and the command gives you a prompt on the compute node
>>>>>>> directly
>>>>>>> (by default salloc will return a shell on the login node where it
>>>>>>> was
>>>>>>> called). Check and be sure that you are not doing something wrong
>>>>>>> in the
>>>>>>> new salloc command that you defined in slurm.conf
>>>>>>> (SallocDefaultCommand
>>>>>>> option).
>>>>>>>
>>>>>>> Can you try to go as root on the compute nodes and try to resolve
>>>>>>> a uid
>>>>>>> with the id command? What does it give you there, all groups or some
>>>>>>> secondary groups are missing? If the secondary groups are missing
>>>>>>> then
>>>>>>> it's not a problem of Slurm but the config of the ID resolving
>>>>>>> service.
>>>>>>> As far as I know Slurm changes the environment after salloc (e.g.
>>>>>>> exports SLURM_ env vars) but shouldn't change the behavior of
>>>>>>> commands
>>>>>>> like id..
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Chrysovalantis Paschoulas
>>>>>>>
>>>>>>>
>>>>>>> On 05/25/2016 10:32 AM, Thekla Loizou wrote:
>>>>>>>>
>>>>>>>> Dear all,
>>>>>>>>
>>>>>>>> We have noticed a very strange problem every time we add an
>>>>>>>> existing
>>>>>>>> user to a secondary group.
>>>>>>>> We manage our users in LDAP. When we add a user to a new group and
>>>>>>>> then type the "id" and "groups" commands we see that the user was
>>>>>>>> indeed added to the new group. The same happens when running the
>>>>>>>> command "getent groups".
>>>>>>>>
>>>>>>>> For example,  for a user "thekla" whose primary group was
>>>>>>>> "cstrc" and
>>>>>>>> now was also added to the group "build" we get:
>>>>>>>> [thekla@node01 ~]$ id
>>>>>>>> uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc),10257(build)
>>>>>>>> [thekla@node01 ~]$ groups
>>>>>>>> cstrc build
>>>>>>>> [thekla@node01 ~]$ getent group | grep build
>>>>>>>> build:*:10257:thekla
>>>>>>>>
>>>>>>>> The above output is the correct one and it is given to us when
>>>>>>>> we ssh
>>>>>>>> to one of the compute nodes.
>>>>>>>>
>>>>>>>> But, when we submit a job on the nodes (so getting access through
>>>>>>>> SLURM and not with direct ssh), we cannot see the new group the
>>>>>>>> user
>>>>>>>> was added to:
>>>>>>>> [thekla@prometheus ~]$ salloc -N1
>>>>>>>> salloc: Granted job allocation 8136
>>>>>>>> [thekla@node01 ~]$ id
>>>>>>>> uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc)
>>>>>>>> [thekla@node01 ~]$ groups
>>>>>>>> cstrc
>>>>>>>>
>>>>>>>> While, the following output shows the correct result:
>>>>>>>> [thekla@node01 ~]$ getent group | grep build
>>>>>>>> build:*:10257:thekla
>>>>>>>>
>>>>>>>> This problem appears only when we get access through SLURM i.e.
>>>>>>>> when
>>>>>>>> we run a job.
>>>>>>>>
>>>>>>>> Has anyone faced this problem before? The only way we found for
>>>>>>>> solving this is to restart the SLURM service on the compute nodes
>>>>>>>> every time we add a user to a new group.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Thekla
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> 
>>>>>>>
>>>>>>> 
>>>>>>>
>>>>>>> Forschungszentrum Juelich GmbH
>>>>>>> 52425 Juelich
>>>>>>> Sitz der Gesellschaft: Juelich
>>>>>>> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
>>>>>>> Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
>>>>>>> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt
>>>>>>> (Vorsitzender),
>>>>>>> Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
>>>>>>> Prof. Dr. Sebastian M. Schmidt
>>>>>>> 
>>>>>>>
>>>>>>> 
>>>>>>>


-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi



signature.asc
Description: OpenPGP digital signature


[slurm-dev] Re: slurmd

2016-05-11 Thread Janne Blomqvist


On 2016-05-10 19:14, kevin@gmail.com wrote:


Background:
I’m trying to write a slurmd task plugin to bind mount /tmp to 
/tmp/USERID/JOBID.

Question 1: should I be using a task plugin or a spank plugin to do this?


There are a number of options here. You can look at the slides from the past 
few slurm user group meetings, there's some info there.

What we ended up with was using pam_namespace to set up various bind-mounted 
dirs (in our case /tmp, /var/tmp, and /dev/shm), and an epilog script to clean 
them up afterwards. Below is our /etc/security/namespace.d/site.conf:

/tmp /l/tmp-inst/  user  root,bin,adm
/var/tmp /l/vartmp_inst/  user  root,bin,adm
/dev/shm /dev/shm/inst/  user  root,bin,adm

See the pam_namespace(8) and namespace.conf(5) man pages for more info. And 
then in /etc/pam.d/slurm add

session   required  pam_namespace.so

Finally, slurm epilog script below, modify as appropriate.

#!/bin/bash

# If pam_namespace is used to create per-job /tmp/, /var/tmp, /dev/shm,
# clean it here in the epilog when no jobs are running on the node.
# Annoyingly, squeue always exits with status 0, so we must check that
# the output is empty, that is no jobs by the user running on the node
# and no error occurred (timeout etc.)
userlist=$(/usr/bin/squeue -w $HOSTNAME -o%u -h -u $SLURM_JOB_USER -t R,S,CF 2>&1)
if [ -z "$userlist" ]; then
    /bin/rm -rf /l/tmp-inst/$SLURM_JOB_USER /l/vartmp_inst/$SLURM_JOB_USER /dev/shm/$SLURM_JOB_USER
fi




Question 2:
I’m launching slurmd with the following line
sudo slurmd -N linux0 -D -vv

but the debug statements in slurmstepd code aren’t being printed to screen.
I assume that the slurmstepd code is being run in a fork of slurmd.
Where can I find debug output from slurmstepd?



Nowhere, see https://bugs.schedmd.com/show_bug.cgi?id=2631 . The workaround is to just 
run slurmd without "-D" and tail -f the syslog.


--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: Slurm on CentOS 7.x

2016-03-14 Thread Janne Blomqvist


On 2016-03-13 12:42, Rémi Palancher wrote:


Le 12/03/2016 18:32, Jagga Soorma a écrit :

Any ideas why the slurm service in the client might be throwing those
timed out errors?


`systemctl status` shows you're using SYSV init script for slurm:

 > # systemctl status slurm
 > slurm.service - LSB: slurm daemon management
 > Loaded: loaded (/etc/rc.d/init.d/slurm)
^
I guess in this case "timeout" means the init script didn't terminate
quickly enough for systemd. Maybe an interactive run of this init script
with `set -x` may help to see what's going on?

However, I would definitely recommand using native *.service files when
using systemd. Slurm provides those files in etc/ dir:

https://github.com/SchedMD/slurm/tree/master/etc

You can install them with the RPMs. This way, systemd will be much more
precise when reporting errors.


Indeed, but note that the slurm rpm spec has a bug where even if one 
builds it on a systemd system, it still includes the old init.d scripts 
in addition to the systemd service files.




--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: EL6 clusters with cgroups enabled

2016-02-11 Thread Janne Blomqvist


On 2016-02-11 07:06, Christopher Samuel wrote:


On 11/02/16 06:15, Christopher B Coffey wrote:

> I’m curious which kernel you are running on your el6 clusters that
> have cgroups enabled in slurm.  I have an issue where some workloads
> cause 100’s-1000’s of flocks to occur relating to the memory cleanup
> portion in the cgroup.

This is kernel code, or userspace?

My understanding of the kernel developers concerns over memory cgroups
was around the extra overhead in memory allocation inside the kernel.

Here's a write up from LWN from the 2012 mm minisummit at the Kernel
Summit on the issue:

https://lwn.net/Articles/516533/

Interestingly the RHEL page mentions a memory overhead on x86-64. but
not a performance issue, so whether they backported later patches to
reduce the impact of memory cgroups I cannot tell right now.

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-memory.html

I did benchmarking a few years back when were transitioning to
RHEL6 and Slurm with memory cgroups enabled and couldn't see any
significant difference in performance.  Unfortunately I suspect I
cleaned all that up some time ago. :-(

We use them and haven't noticed any issues yet.

All the best,
Chris


See http://slurm.schedmd.com/slurm_ug_2012/SUG-2012-Cgroup.pdf

slides 31-35.

I don't know if RedHat has backported the 2.6.38 memcg changes to the 2.6.32 
version they use in RHEL6.



--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: AllowGroups and AD

2016-02-10 Thread Janne Blomqvist


On 2016-02-10 15:12, Diego Zuccato wrote:


Hello all.

I think I'm doing something wrong, but I don't understand what.

I'm trying to limit users allowed to use a partition (that, coming from
Torque, I think is the equivalent of a queue), but obviously I'm failing. :(

Frontend and work nodes are all Debians joined to AD via Winbind (that
ensures consistent UID/GID mapping, at the expense of having many groups
and a bit of slowness while looking 'em up).
On every node I can run 'id' and it says (redacted):
uid=108036(diego.zuccato) gid=100013(domain_users)
gruppi=100013(domain_users),[...],242965(str957.tecnici),[...]

(it takes about 10s to get the complete list of groups).

Linux ACLs work as expected (if I set a file to be readable only by
Str957.tecnici I can read it), but when I do
scontrol update PartitionName=pp_base AllowGroups=str957.tecnici
or even
scontrol update PartitionName=pp_base AllowGroups=242965

when I try to sbath a job I get:
diego.zuccato@Str957-cluster:~$ sbatch aaa.sh
sbatch: error: Batch job submission failed: User's group not permitted
to use this partition
diego.zuccato@Str957-cluster:~$ newgrp Str957.tecnici
diego.zuccato@Str957-cluster:~$ sbatch aaa.sh
sbatch: error: Batch job submission failed: User's group not permitted
to use this partition

So I won't get recognized even if I change my primary GID :(

I've been in that group since way before installing the cluster, and I
already tried rebooting everyting to refresh the cache.

Another detail that can be useful:
diego.zuccato@Str957-cluster:~$ time getent group str957.tecnici
str957.tecnici:x:242965:[...],diego.zuccato,[...]

real0m0.012s
user0m0.000s
sys 0m0.000s

Any hints?

TIA


Hi,

do you have user and group enumeration enabled in winbind? I.e. does

$ getent passwd

and

$ getent group

return nothing, or the entire user and group lists?

FWIW, slurm 16.05 will have some changes to work better in environments with 
enumeration disabled, see http://bugs.schedmd.com/show_bug.cgi?id=1629

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: User management

2016-01-14 Thread Janne Blomqvist

On 2016-01-13 23:54, Trey Dockendorf wrote:
> We use 389 Directory Server for our LDAP and SSSD for clients, works well.
>
We also use sssd on the clients, but instead of running our own LDAP(+kerberos) 
infrastructure, we use the university Active Directory.

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: slurmd can't mount cpuacct cgroup namespace on RHEL 7.2 ?

2015-12-22 Thread Janne Blomqvist


On 2015-12-21 07:32, Christopher Samuel wrote:


Hi folks,

I'm helping bring up a new cluster with Slurm 15.08.5 on RHEL 7.2
and I've run into an odd case where trying to launch a process with
srun triggers this failure:

[2015-12-21T15:38:00.213] unable to mount cpuacct cgroup namespace: Device or 
resource busy
[2015-12-21T15:38:00.213] jobacct_gather/cgroup: unable to create cpuacct 
namespace

I suspect this might be systemd related, but as I've limited
experience with it so far I'm not certain.

This is what is failing according to strace:

12725 mount("cgroup", "/cgroup/cpuacct", "cgroup", MS_NOSUID|MS_NODEV|MS_NOEXEC, 
"cpuacct") = -1 EBUSY (Device or resource busy)

...and it might be related to this existing mount courtesy
of systemd in /proc/mounts:

cgroup /sys/fs/cgroup/cpu,cpuacct cgroup 
rw,nosuid,nodev,noexec,relatime,cpuacct,cpu 0 0

Anyone else seen this, or got any ideas?


1. When using systemd, or some other tool that mounts the cgroup file 
systems early in the boot process (e.g. cgconfig), you should not try to 
mount the cgroup filesystems from slurmd. That is, in 
/etc/slurm/cgroup.conf put "CgroupAutomount=no".


2. Modern distros use /sys/fs/cgroup for the defacto standard root mount 
point for cgroups filesystems. Previously there was some distro-specific 
variations (/cgroup, /dev/cgroups, whatever). Thus, put into 
/etc/slurm/cgroup.conf the line "CgroupMountpoint=/sys/fs/cgroup". (At 
some point it might make sense to change the slurm default 
cgroupmountpoint to match the modern standard..)


3. For details, see 
https://wiki.freedesktop.org/www/Software/systemd/PaxControlGroups/
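
Putting points 1 and 2 together, /etc/slurm/cgroup.conf ends up containing 
something like this (a sketch):

CgroupAutomount=no
CgroupMountpoint=/sys/fs/cgroup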



--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] RE: multifactor2 priority weighting

2015-05-21 Thread Janne Blomqvist

Hi,

I'm the author of the multifactor2 algorithm. The multifactor2/ticket based 
fairshare algorithm, in the end, produces fair-share values normalized between 
0 and 1, like the other available algorithms. The algorithm has 3 phases:

1. Mark which users have jobs in the queue
2. Starting from the root of the account tree, distribute tickets, weighted by 
a fair-share factor (which is between 0 and 100, as Don mentioned).
3. Calculate the final fair-share priority, such that the user with the most 
tickets gets the priority 1.0, the others proportional to that depending on how 
many tickets they have.

The goal of the algorithm was to balance fair-share across an account 
hierarchy better than the original fair-share algorithm does. If there is no 
hierarchy, I don't think it provides much benefit.

Also, if the OP is already on 14.11, I recommend looking at the fair-tree 
algorithm

http://slurm.schedmd.com/fair_tree.html

This is what we currently are running. It solves some issues with the ticket 
based algorithm, and I think currently it's the best choice for ensuring 
fairness between account hierarchies.

On 2015-05-13 18:21, Lipari, Don wrote:
 I need to retract my comments below.  While what I wrote is consistent with 
 some of the text from the priority_multifactor2.html page, the writers 
 introduced a secondary term, fair-share priority which looks to be a 
 normalized fair-share factor.  And I am not directly familiar with this 
 nuance.

 Don

  -Original Message-
  From: Lipari, Don [mailto:lipa...@llnl.gov]
  Sent: Wednesday, May 13, 2015 7:56 AM
  To: slurm-dev
  Subject: [slurm-dev] RE: multifactor2 priority weighting
 
  The original multi-factor plugin generated fair-share factors ranging
  between 0.0 and 1.0.  Under this formula, 0.5 was the factor a
  user/account would see if their usage was commensurate with their
  shares.  The multi-factor2 plugin fair-share factors range between 0.0
  and 100.0, with 1.0 indicating a perfectly serviced user/account.
 
  Don
 
  -Original Message-
  From: gareth.willi...@csiro.au [mailto:gareth.willi...@csiro.au]
  Sent: Tuesday, May 12, 2015 10:05 PM
  To: slurm-dev
  Subject: [slurm-dev] multifactor2 priority weighting
 
 
  Hi All,
 
  We've just switched to multifactor2 priority as it seemed like a
  good
  idea (and low risk) and it is working but the priority factors are
  much lower (maybe 100x). The final paragraph on
  http://slurm.schedmd.com/priority_multifactor2.html says we might
  need
  to reweight but this magnitude seems odd.
 
  Does anyone have any insight that they might like to share?
 
  Regards,
 
  Gareth
 
  BTW. The following may matter.
  We currently do not have a tree as such, all users are direct
  children
  of root. This probably makes using the ticket based algorithm less
  compelling.
  Also, we have set FairShareDampeningFactor=40 as we have many
  relatively inactive users and this may have an impact though I'd
  expect it to be ignored with the multifactor2 scheme.
  We have PriorityWeightFairshare=1


-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: Slurm versions 14.11.6 is now available

2015-04-27 Thread Janne Blomqvist

On 2015-04-25 01:59, David Bigagli wrote:


Hi all,
the reason to compile without optimization is to be able to have
a meaningful stack when attaching gdb to the daemons or when analysing
core files. If the optimization is on crucial variables in the stack are
optimized out preventing exact diagnoses of issues. This of course is
configurable, we only changed the default, if sites wish to compile with
optimization use the config option --disable-debug. So this is not about
sweeping bugs under the carpet it is exactly the opposite it is a tool
to debug more efficiently.


FWIW, newer gcc versions have an option -Og, which enables optimizations which don't 
interfere with debugging. Might be worth adding a configure check if one uses a recent enough gcc, 
and enable that option then? IIRC the optimizations are roughly similar to what -O1 
gives.

Anyway, is there a way to enable optimization but keep the debug symbols? For our 
production builds, I think we'd like to have -O2 -g.
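
(Presumably something like

./configure --disable-debug CFLAGS="-O2 -g"

would do the trick, but that's just a sketch, I haven't tried it against the 
slurm build system.)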


The reasons for using statvfs versus statfs is that statfs is deprecated
and replaced by the POSIX statvfs, so it is portable across platforms,
indeed NetBSD and Solaris do not have statfs.
Since all platforms have stavfs the code in get_tmp_disk() under the
#define (HAVE_STATFS) is obsolete and will possibly be removed in the
next major release.

Indeed, statvfs is in POSIX and should work everywhere on a decently new 
system. However, as I mentioned in my previous message, on Linux prior to 
kernel 2.6.36 and glibc 2.13 it's not as robust as the (non-standard) statfs. 
Hence I would prefer that statfs be used on Linux in preference to statvfs, as 
most slurm clusters are presumably still running older kernel/glibc versions. 
Something like the attached patch?


--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi
diff --git a/src/slurmd/slurmd/get_mach_stat.c b/src/slurmd/slurmd/get_mach_stat.c
index d7c5eb1..81994a6 100644
--- a/src/slurmd/slurmd/get_mach_stat.c
+++ b/src/slurmd/slurmd/get_mach_stat.c
@@ -216,57 +216,42 @@ extern int
 get_tmp_disk(uint32_t *tmp_disk, char *tmp_fs)
 {
 	int error_code = 0;
-
-#if defined(HAVE_STATVFS)
-	struct statvfs stat_buf;
-	uint64_t total_size = 0;
+	unsigned long long total_size = 0;
 	char *tmp_fs_name = tmp_fs;
 
-	*tmp_disk = 0;
-	total_size = 0;
-
 	if (tmp_fs_name == NULL)
 		tmp_fs_name = "/tmp";
-	if (statvfs(tmp_fs_name, &stat_buf) == 0) {
-		total_size = stat_buf.f_blocks * stat_buf.f_frsize;
-		total_size /= 1024 * 1024;
-	}
-	else if (errno != ENOENT) {
-		error_code = errno;
-		error ("get_tmp_disk: error %d executing statvfs on %s",
-			errno, tmp_fs_name);
-	}
-	*tmp_disk += (uint32_t)total_size;
 
-#elif defined(HAVE_STATFS)
+#if defined(__linux__)
+	/* Prior to Linux 2.6.36 and glibc 2.13, statvfs() can get
+	 * stuck if ANY mount in the system is hung, so use the
+	 * non-standard statfs() instead. Furthermore, as of Linux
+	 * 2.6+ struct statfs contains the f_frsize field which gives
+	 * the size of the blocks reported in the f_blocks field. */
 	struct statfs stat_buf;
-	long   total_size;
-	float page_size;
-	char *tmp_fs_name = tmp_fs;
-
-	*tmp_disk = 0;
-	total_size = 0;
-	page_size = (sysconf(_SC_PAGE_SIZE) / 1048576.0); /* MG per page */
 
-	if (tmp_fs_name == NULL)
-		tmp_fs_name = "/tmp";
-#if defined (__sun)
-	if (statfs(tmp_fs_name, &stat_buf, 0, 0) == 0) {
-#else
 	if (statfs(tmp_fs_name, &stat_buf) == 0) {
-#endif
-		total_size = (long)stat_buf.f_blocks;
+		total_size = stat_buf.f_blocks * stat_buf.f_frsize;
+		total_size /= 1024 * 1024;
+	} else if (errno != ENOENT) {
+		error_code = errno;
+		error ("get_tmp_disk: error %d executing statfs on %s",
+		   errno, tmp_fs_name);
+}
+#elif defined(HAVE_STATVFS)
+	struct statvfs stat_buf;
+
+	if (statvfs(tmp_fs_name, &stat_buf) == 0) {
+		total_size = stat_buf.f_blocks * stat_buf.f_frsize;
+		total_size /= 1024 * 1024;
 	}
 	else if (errno != ENOENT) {
 		error_code = errno;
-		error ("get_tmp_disk: error %d executing statfs on %s",
+		error ("get_tmp_disk: error %d executing statvfs on %s",
 			errno, tmp_fs_name);
 	}
-
-	*tmp_disk += (uint32_t)(total_size * page_size);
-#else
-	*tmp_disk = 1;
 #endif
+	*tmp_disk = (uint32_t)total_size;
 	return error_code;
 }
 


[slurm-dev] Re: Slurm versions 14.11.6 is now available

2015-04-24 Thread Janne Blomqvist

On 2015-04-24 02:03, Moe Jette wrote:


Slurm version 14.11.6 is now available with quite a few bug fixes as
listed below.

Slurm downloads are available from
http://slurm.schedmd.com/download.html

* Changes in Slurm 14.11.6
==


[snip]


  -- Enable compiling without optimizations and with debugging symbols by
 default. Disable this by configuring with --disable-debug.


Always including debug symbols is good (the only cost is a little bit of 
disk space, which should never really be a problem), but disabling 
optimization by default?? In our environment slurmctld consumes a 
decent chunk of cpu time, and I would hate to see it get a lot (?) 
slower.


Typically, problems which are "fixed" by disabling optimization are due 
to violations of the C standard or suchlike which for some reason just 
don't happen to trigger with -O0. Perhaps I'm being needlessly harsh 
here, but I'd prefer the bugs to be fixed properly rather than being 
papered over like this.



  -- Use standard statvfs(2) syscall if available, in preference to
 non-standard statfs.


This is not actually such a good idea. Prior to Linux kernel 2.6.36 and 
glibc 2.13, the implementation of statvfs required checking all entries 
in /proc/mounts. If any of those other filesystems are not available 
(e.g. a hung NFS mount), the statvfs call would thus hang. See e.g.


http://man7.org/linux/man-pages/man2/statvfs.2.html

Not directly related to this change, there is also a bit of silliness in 
the statfs() code for get_tmp_disk(), namely that it assumes that the fs 
record size is the same as the memory page size. As of Linux 2.6 the 
struct statfs contains a field f_frsize which contains the correct 
record size. I suggest the attached patch which should fix both of these 
issues.




--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi
diff --git a/src/slurmd/slurmd/get_mach_stat.c b/src/slurmd/slurmd/get_mach_stat.c
index d7c5eb1..6493755 100644
--- a/src/slurmd/slurmd/get_mach_stat.c
+++ b/src/slurmd/slurmd/get_mach_stat.c
@@ -217,7 +217,28 @@ get_tmp_disk(uint32_t *tmp_disk, char *tmp_fs)
 {
 	int error_code = 0;
 
-#if defined(HAVE_STATVFS)
+#if defined(__linux__)
+	/* Prior to Linux 2.6.36 and glibc 2.13, statvfs() can get
+	 * stuck if ANY mount in the system is hung, so use the
+	 * non-standard statfs() instead. Furthermore, as of Linux
+	 * 2.6+ struct statfs contains the f_frsize field which gives
+	 * the size of the blocks reported in the f_blocks field. */
+	struct statfs stat_buf;
+	unsigned long total_size;
+	char *tmp_fs_name = tmp_fs;
+
+	if (tmp_fs_name == NULL)
+		tmp_fs_name = "/tmp";
+	if (statfs(tmp_fs_name, &stat_buf) == 0) {
+		total_size = stat_buf.f_blocks / 1024;
+		total_size *= stat_buf.f_frsize;
+		total_size /= 1024;
+	} else if (errno != ENOENT) {
+		error_code = errno;
+		error ("get_tmp_disk: error %d executing statvfs on %s",
+		   errno, tmp_fs_name);
+} *tmp_disk = (uint32_t)total_size;  
+#elif defined(HAVE_STATVFS)
 	struct statvfs stat_buf;
 	uint64_t total_size = 0;
 	char *tmp_fs_name = tmp_fs;


[slurm-dev] Re: Slurm versions 14.11.6 is now available

2015-04-24 Thread Janne Blomqvist

On 24 April 2015 12:41:38 EEST, Janne Blomqvist <janne.blomqv...@aalto.fi> 
wrote:
On 2015-04-24 02:03, Moe Jette wrote:

 Slurm version 14.11.6 is now available with quite a few bug fixes as
 listed below.

 Slurm downloads are available from
 http://slurm.schedmd.com/download.html

 * Changes in Slurm 14.11.6
 ==

[snip]

   -- Enable compiling without optimizations and with debugging
symbols by
  default. Disable this by configuring with --disable-debug.

Always including debug symbols is good (the only cost is a little bit
of 
disk space, should never really be a problem), but disabling 
optimization by default?? In our environment, slurmctld consumes a 
decent chunk of cpu time, I would loathe to see it getting a lot (?) 
slower.

Typically, problems which are fixed by disabling optimization are due

to violations of the C standard or such which for some reason just 
doesn't happen to trigger with -O0. Perhaps I'm being needlessly harsh 
here, but I'd prefer if the bugs were fixed properly rather than being 
papered over like this.

   -- Use standard statvfs(2) syscall if available, in preference to
  non-standard statfs.

This is not actually such a good idea. Prior to Linux kernel 2.6.36 and

glibc 2.13, the implementation of statvfs required checking all entries

in /proc/mounts. If any of those other filesystems are not available 
(e.g. a hung NFS mount), the statvfs call would thus hang. See e.g.

http://man7.org/linux/man-pages/man2/statvfs.2.html

Not directly related to this change, there is also a bit of silliness
in 
the statfs() code for get_tmp_disk(), namely that it assumes that the
fs 
record size is the same as the memory page size. As of Linux 2.6 the 
struct statfs contains a field f_frsize which contains the correct 
record size. I suggest the attached patch which should fix both of
these 
issues.

Hi,

come to think of it, in my patch the type of total_size should be unsigned 
long long to avoid potential overflows on 32-bit Linux targets.
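
Something like this, as a minimal illustration (not the actual patch, and
the function name is made up):

#include <stdint.h>
#include <sys/vfs.h>	/* statfs(), struct statfs on Linux */

/* Size of the filesystem holding tmp_fs in MiB, computed with a 64-bit
 * intermediate so the multiplication cannot overflow on 32-bit targets. */
uint32_t tmp_disk_mib(const char *tmp_fs)
{
	struct statfs sb;
	unsigned long long total = 0;	/* 64 bits even on 32-bit Linux */

	if (statfs(tmp_fs, &sb) == 0) {
		total = sb.f_blocks;		/* number of f_frsize blocks */
		total *= sb.f_frsize;		/* bytes */
		total /= (1024 * 1024);		/* MiB */
	}
	return (uint32_t)total;
}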

Cheers, 
-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.


[slurm-dev] Re: submitting 100k job array causes slurmctld to socket timeout

2015-03-17 Thread Janne Blomqvist
On 2015-03-15 15:46, Daniel Letai wrote:
 Hi,
 
 Testing a new slurm cluster (14.11.4) on a 1k nodes cluster.
 
 Several things we've tried:
 Increase slurmctld threads (8 ports range)
 Increase munge threads (threads=10)
 Increase messageTimeout to 30
 
 
 We are using accounting (db on different server)
 
 Thanks for any help

Take a look at

http://slurm.schedmd.com/high_throughput.html

For us, setting somaxconn to 4096 fixed the socket timeout issues
(sysctl net.core.somaxconn=4096). Check with netstat -s | grep
LISTEN for listen queue overflows, does the number increase, and if it
does, does bumping somaxconn fix it?
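
As a concrete checklist (plain commands, adjust to taste):

netstat -s | grep -i listen         # run a couple of times; growing overflow counts are the symptom
sysctl net.core.somaxconn           # show the current backlog limit
sysctl -w net.core.somaxconn=4096   # raise it until the next reboot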

Put a line like

net.core.somaxconn = 4096

in /etc/sysctl.conf if you want the setting to survive a reboot.

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || janne.blomqv...@aalto.fi





[slurm-dev] acct_gather_energy/ipmi configuration

2014-09-03 Thread Janne Blomqvist


Hi,

has anyone got the acct_gather_energy/ipmi plugin to work correctly? In 
acct_gather.conf I have the lines


EnergyIPMIFrequency=30
EnergyIPMICalcAdjustment=yes

and in slurm.conf

DebugFlags=Profile
AcctGatherNodeFreq=30
AcctGatherEnergyType=acct_gather_energy/ipmi

However, the end result is that in the slurmd logs when starting slurmd 
a line like


[2014-08-28T10:44:52.179] Power sensor not found.

appears.

I suspect that the reason is related to the fact that I cannot retrieve 
the power readings with the ipmi-sensors command. With ipmi-sensors 
-W discretereading I can get a reading for the power supplies, but it 
seems to be the nameplate capacity rather than the current consumption. 
Same for using ipmitool and ipmiutil rather than ipmi-sensors.


However, using ipmi-dcmi --get-system-power-statistics (part of 
freeipmi) does appear to work.


So my question, I guess, is whether there is some way to configure the 
acct_gather_energy/ipmi plugin to retrieve these DCMI power values 
instead of whatever it tries to do now? I looked briefly into the source 
code and there is a big bunch of undocumented EnergyIPMI 
configuration parameters, but I didn't figure out whether any of those could 
be used to switch to DCMI.


(The hardware in question is various HP Proliant servers somewhere 
between 1 and 4 years old)


--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: acct_gather_energy/ipmi configuration

2014-09-03 Thread Janne Blomqvist

Hi,

thanks for confirming that my configuration is correct. Indeed "ipmi-sensors 
--non-abbreviated-units | grep Watts" doesn't show anything, so I guess I'm 
out of luck until DCMI support is added to the plugin.

On 2014-09-03 15:23, Thomas Cadeau wrote:
 Hi Janne,

 The configuration is correct. Please try the command :
 $ ipmi-sensors --non-abbreviated-units | grep Watts
 If you have nothing, then the IPMI plugin cannot be used on this hardware.

 We plan to add an option to used the DCMI power to support more hardwares.

 The undocumented options EnergyIPMI are wrappers for
 ipmi-sensors options.
 I never have reason to use them.
 Except timeout and reflush, when I used BMC in unstable dev state,
 but in this case I had troubles with ipmi-sensors too.

 Thomas


 Le 03/09/2014 11:44, Janne Blomqvist a écrit :
 
  Hi,
 
  has anyone got the acct_gather_energy/ipmi plugin to work correctly?
  In acct_gather.conf I have the lines
 
  EnergyIPMIFrequency=30
  EnergyIPMICalcAdjustment=yes
 
  and in slurm.conf
 
  DebugFlags=Profile
  AcctGatherNodeFreq=30
  AcctGatherEnergyType=acct_gather_energy/ipmi
 
  However, the end result is that in the slurmd logs when starting
  slurmd a line like
 
  [2014-08-28T10:44:52.179] Power sensor not found.
 
  appears.
 
  I suspect that the reason is related to the fact that I cannot
  retrieve the power readings with the ipmi-sensors command. With
  ipmi-sensors -W discretereading I can get a reading for the power
  supplies, but it seems to be the nameplate capacity rather than the
  current consumption. Same for using ipmitool and ipmiutil rather than
  ipmi-sensors.
 
  However, using ipmi-dcmi --get-system-power-statistics (part of
  freeipmi) does appear to work.
 
  So my question, I guess, is that is there some way to configure the
  acct_gather_energy/ipmi plugin to retrieve these DCMI power values
  instead of whatever it tries to do now? I looked briefly into the
  source code and there is a big bunch of undocumented EnergyIPMI
  configuration parameters, but I didn't figure out if any of those
  could be used to use DCMI.
 
  (The hardware in question is various HP Proliant servers somewhere
  between 1 and 4 years old)
 



-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: Feedback on integration tests systemd/slurm and questions

2014-08-29 Thread Janne Blomqvist

On 2014-08-28 19:17, Rémi Palancher wrote:
 I would be glad to have your insightful lights on this matter :) I would
 also appreciate to get feedback from other people who have done other
 tests with slurm and systemd!

Haven't tested anything yet, but with RHEL/CentOS 7 already available, I 
suspect it won't be long before people are starting to roll out clusters based 
on those OS'es. So the topic certainly deserves some attention, thanks for 
bringing it up!

 The funny thing about all of this is that it will become totally
 irrelevant with the upcoming releases of the linux kernel (3.16+) and
 the ongoing effort on the cgroup unified hierarchy[3][4]! So if
 modifications should be done in Slurm on cgroup management, it would be
 wise to take this into account.

 [3] http://lwn.net/Articles/601840/
 [4]
 https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroups/unified-hierarchy.txt

It seems that in the brave new unified-hierarchy cgroup world, cgroups must be 
controlled by communicating with the cgroup controller process (which would be 
systemd on systemd-using systems, i.e. most of them), rather than by 
manipulating the cgroup fs directly. Systemd provides a D-Bus API for this, see

http://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/

That implies quite a lot of changes in the slurm cgroups support..
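
As a very rough illustration of the direction (not something slurm does
today, and the unit name and binary below are made-up examples), instead of
writing into /sys/fs/cgroup directly one would ask systemd for a transient
unit per job step, roughly

systemd-run --scope --unit=slurm-job42-step0 ./my_step_binary

with the resource limits then set as unit properties over the same D-Bus
API (e.g. the Manager's StartTransientUnit/SetUnitProperties calls) rather
than by echoing values into the cgroup filesystem.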

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: Slurm versions 14.03.5 and 14.11.0-pre2 are now available

2014-07-15 Thread Janne Blomqvist


On Fri 11 Jul 2014 12:24:48 AM EEST, je...@schedmd.com wrote:


Slurm versions 14.03.5 and 14.11.0-pre2 are now available. Version
14.03.5 includes about 40 relatively minor bug fixes and enhancements
as described below.

Highlights of changes in Slurm version 14.03.5 include:
 -- Added extra index's into the database for better performance when
deleting users.


It would have been nice if this change would have been mentioned in big 
blinking colorful letters or something like that.


As it is, we routinely updated to 14.03.5 from 14.03.4 without any 
maintenance break or such, after all just a bugfix release, what could 
go wrong? Well, what did go wrong was that slurmdbd was offline for 40 
minutes while it added those extra indexes.. :(


Other than that, 14.03.5 seems to be running fine here.

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: Shared TmpFS

2014-04-24 Thread Janne Blomqvist


On 2014-04-23T17:26:36 EEST, Alfonso Pardo wrote:

Hi,

I had some errors from a premature terminate jobs with this message:

slurmd[bd-p14-01]: error: unlink(/tmp/slurmd/job60560/slurm_script):
No such file or directory
slurmd[bd-p14-01]: error: rmdir(/tmp/slurmd/job60560): No such file or
directory


Should “TmpFS” location be a shared file system?


No. Or maybe it's possible, but why? Typically /tmp is considered a 
machine-local directory.


That being said, the error messages you quote have nothing to do with 
the slurm.conf TmpFS setting but rather tell that your SlurmdSpoolDir 
is set to /tmp/slurmd. That is likely a bad idea, as there might be 
various /tmp cleaner scripts such as tmpwatch emptying /tmp regularly, 
leading to errors like you see (been there, done that). Just leave it 
at the default value unless you have good reasons to do otherwise. Note 
that it requires some trickery to move the contents of the 
SlurmdSpoolDir if you want to do it on the fly without losing track of 
running jobs.



We don’t have TmpDisk parameter established (default value). How many
space is reasonable for this parameter?


Depends on how large disks you have on your nodes, no? However, the 
trend seems to be that /tmp is a relatively small space, frequently on 
a ram disk (tmpfs) rather than backed by a real disk [1]. So you might 
not want to encourage your users to write code assuming a large /tmp is 
available. A large machine-local space is probably better to place at 
/var/tmp or something site-specific such as /local.



[1] http://0pointer.de/blog/projects/tmp.html
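
For concreteness, the relevant slurm.conf knobs would look something like
this (the values are made-up examples, see slurm.conf(5) for the real
defaults):

SlurmdSpoolDir=/var/spool/slurmd        # keep it out of /tmp
TmpFS=/tmp                              # the directory TmpDisk is measured against
NodeName=DEFAULT TmpDisk=100000         # MB of local scratch advertised per node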

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: xmalloc in Slurm

2014-03-31 Thread Janne Blomqvist
 xmalloc is trying
to accomplish handling OOM conditions calloc will eliminate the memset
call that can't be handled. Not only *could* calloc be used for
efficiency, it *should* be used so xmalloc functions as intended (and
more gracefully handles a memory allocation error).

Now, all the way back to why this originally came up: I think a more
appropriate change is to fix xmalloc to use calloc which should on its
own give a speedup to cases like the one Cao was seeing without changing
its semantics.

If necessary, another malloc wrapper *could* be added that just used
malloc, without zeroing the memory. However, the only benefit over the
calloc-based wrapper would be when a system is under sufficient load
that the kernel has not had time to collect some all-zero pages and
there is a slight delay to calloc. I suspect the system load is really
going to be a bigger slowdown than the small delay in calloc and there
is no benefit to a non-zeroing malloc wrapper.


A proper fix would IMHO be to (in addition to the int/size_t issues 
discussed above)


1) Create a calloc wrapper (xcalloc), and update all current users of 
xmalloc to use that instead.


2) Remove the memset from xmalloc, so that xmalloc properly is a malloc 
wrapper.


3) Audit users of xcalloc, change back to use xmalloc where zeroing 
isn't needed.
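
A minimal sketch of what steps 1) and 2) would amount to (illustrative
only, not the actual slurm wrappers, which also record file/line and go
through slurm's own error handling):

#include <stdlib.h>

/* zeroing wrapper for the callers that actually need zeroed memory */
void *xcalloc(size_t nmemb, size_t size)
{
	void *p = calloc(nmemb, size);	/* fresh pages from the kernel are already zero */
	if (!p)
		abort();		/* the real wrapper would log and handle OOM */
	return p;
}

/* plain malloc wrapper, no memset(), for callers that initialize the
 * memory themselves anyway */
void *xmalloc(size_t size)
{
	void *p = malloc(size);
	if (!p)
		abort();
	return p;
}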




--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: SLURM on FreeBSD

2013-11-27 Thread Janne Blomqvist


On 2013-11-25 18:38, Jason Bacon wrote:

A few patches are attached.  I'm withholding additional patches to
task_cgroup_cpuset.c pending further testing with FreeBSD's hwloc.


Not that it's my decision to make, but IMHO since cgroups are 
(extremely) Linux-specific, rather than making an #ifdef mess out of the 
cgroup plugin, it would be better to create a separate task/hwloc plugin.



--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: Slurmctld dies after restart: Address already in use

2013-09-24 Thread Janne Blomqvist

On 2013-09-24 03:12, Moe Jette wrote:


The SlurmctldProlog/Epilog don't need these open files. I've copied
logic from slurmd to close most files when the daemon starts and set the
close_on_exec flag for files the daemon opens:
https://github.com/SchedMD/slurm/commit/29094e33fcbb4f29e9512059bbdd18ba3504134c


That fixes several of the problems. I'm not sure why the job_state.new
file is reported by lsof, but will probably investigate further at a
later time.


Recent Unixes support an O_CLOEXEC flag to open(), which avoids the 
potential race condition between opening a file and setting close-on-exec 
with fcntl() (and saves one syscall). The attached patch does this 
for a few cases. There are still many more places where this approach 
could be used; for sockets there is also the Linux-specific SOCK_CLOEXEC 
flag.


--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || janne.blomqv...@aalto.fi
diff --git a/src/common/daemonize.c b/src/common/daemonize.c
index 62545ed..4cc8e7a 100644
--- a/src/common/daemonize.c
+++ b/src/common/daemonize.c
@@ -159,17 +159,19 @@ int
 create_pidfile(const char *pidfile, uid_t uid)
 {
 	FILE *fp;
-	int fd = -1;
+	int fd;
 
 	xassert(pidfile != NULL);
 	xassert(pidfile[0] == '/');
 
-	if (!(fp = fopen(pidfile, "w"))) {
+	fd = creat_cloexec(pidfile, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP
+			   | S_IROTH | S_IWOTH);
+	if (fd < 0) {
 		error("Unable to open pidfile `%s': %m", pidfile);
 		return -1;
 	}
 
-	fd = fileno(fp);
+	fp = fdopen(fd, "w");
 
 	if (fd_get_write_lock(fd) < 0) {
 		error ("Unable to lock pidfile `%s': %m", pidfile);
diff --git a/src/common/fd.c b/src/common/fd.c
index 0bcf353..9c4a413 100644
--- a/src/common/fd.c
+++ b/src/common/fd.c
@@ -82,6 +82,33 @@ void fd_set_noclose_on_exec(int fd)
 	return;
 }
 
+
+int open_cloexec(const char *pathname, int flags)
+{
+#ifdef O_CLOEXEC
+	return open(pathname, flags | O_CLOEXEC);
+#else
+	int fd = open(pathname, flags);
+	if (fd >= 0)
+		fd_set_close_on_exec(fd);
+	return fd;
+#endif
+}
+
+
+int creat_cloexec(const char *pathname, mode_t mode)
+{
+#ifdef O_CLOEXEC
+	return open(pathname, O_CREAT|O_WRONLY|O_TRUNC|O_CLOEXEC, mode);
+#else
+	int fd = creat(pathname, mode);
+	if (fd >= 0)
+		fd_set_close_on_exec(fd);
+	return fd;
+#endif
+}
+
+
 int fd_is_blocking(int fd)
 {
 	int val = 0;
diff --git a/src/common/fd.h b/src/common/fd.h
index 704c0e3..2a1dcc2 100644
--- a/src/common/fd.h
+++ b/src/common/fd.h
@@ -58,6 +58,15 @@ static inline void closeall(int fd)
 		close(fd++);
 }
 
+/* Open a fd with close-on-exec (POSIX 2008, Linux 2.6.23+), emulating
+ * it on systems that lack it.  */
+int open_cloexec(const char *pathname, int flags);
+
+/* Create a fd with close-on-exec (POSIX 2008, Linux 2.6.23+),
+ * emulating it on systems that lack it.  */
+int creat_cloexec(const char *pathname, mode_t mode);
+
+
 void fd_set_close_on_exec(int fd);
 /*
  *  Sets the file descriptor (fd) to be closed on exec().
diff --git a/src/slurmctld/controller.c b/src/slurmctld/controller.c
index 1040f91..b098d93 100644
--- a/src/slurmctld/controller.c
+++ b/src/slurmctld/controller.c
@@ -2089,8 +2089,6 @@ static void _init_pidfile(void)
 	 * fd open to maintain the write lock */
 	pid_fd = create_pidfile(slurmctld_conf.slurmctld_pidfile,
 slurmctld_conf.slurm_user_id);
-	if (pid_fd >= 0)
-		fd_set_close_on_exec(pid_fd);
 }
 
 /*
diff --git a/src/slurmd/slurmd/slurmd.c b/src/slurmd/slurmd/slurmd.c
index 997a2d9..f08fcfc 100644
--- a/src/slurmd/slurmd/slurmd.c
+++ b/src/slurmd/slurmd/slurmd.c
@@ -314,8 +314,6 @@ main (int argc, char *argv[])
 	   so we keep the write lock of the pidfile.
 	*/
 	pidfd = create_pidfile(conf->pidfile, 0);
-	if (pidfd >= 0)
-		fd_set_close_on_exec(pidfd);
 
 	rfc2822_timestamp(time_stamp, sizeof(time_stamp));
 	info("%s started on %s", slurm_prog_name, time_stamp);
@@ -1506,11 +1504,10 @@ _slurmd_init(void)
 		init_gids_cache(0);
 	slurm_conf_unlock();
 
-	if ((devnull = open("/dev/null", O_RDWR)) < 0) {
+	if ((devnull = open_cloexec("/dev/null", O_RDWR)) < 0) {
 		error("Unable to open /dev/null: %m");
 		return SLURM_FAILURE;
 	}
-	fd_set_close_on_exec(devnull);
 
 	/* make sure we have slurmstepd installed */
 	if (stat(conf->stepd_loc, &stat_buf))


[slurm-dev] Re: AllowGroups broken in Slurm 2.6.0?

2013-08-23 Thread Janne Blomqvist


On 2013-08-22T18:08:23 EEST, Moe Jette wrote:

getgrnam_r() does not return any users in the group jette below,
it does work for the other groups (e.g. admin):

$ uname -a
Linux jette 3.5.0-37-generic #58-Ubuntu SMP Mon Jul 8 22:07:55 UTC
2013 x86_64 x86_64 x86_64 GNU/Linux

$ id
uid=1001(jette) gid=1001(jette)
groups=1001(jette),4(adm),20(dialout),21(fax),24(cdrom),25(floppy),26(tape),29(audio),30(dip),44(video),46(plugdev),104(fuse),110(netdev),112(lpadmin),120(admin),122(sambashare)


$ grep jette /etc/group
admin:x:120:jette,da
jette:x:1001:
...


Yeah, primary groups are often special and ought to be handled properly 
rather than discarded as broken configuration, my bad. I suppose if 
one wants to ask whether some user belongs to some group without 
enumeration, one would need to first check the user's primary group 
(getpw{nam,uid}_r()), and then, if that doesn't match, use getgrnam_r(). But 
that would need some changes in the surrounding code as well..
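
Roughly along these lines (a sketch only, with fixed-size buffers and no
error reporting, and not what the patch I posted does):

#include <grp.h>
#include <pwd.h>
#include <stdbool.h>
#include <string.h>

bool user_in_group(const char *user, const char *group)
{
	char pwbuf[16384], grbuf[16384];
	struct passwd pw, *pwp = NULL;
	struct group gr, *grp = NULL;

	if (getgrnam_r(group, &gr, grbuf, sizeof(grbuf), &grp) != 0 || !grp)
		return false;

	/* 1. primary group: compare the user's gid from passwd with the
	 * group's gid */
	if (getpwnam_r(user, &pw, pwbuf, sizeof(pwbuf), &pwp) == 0 && pwp &&
	    pwp->pw_gid == grp->gr_gid)
		return true;

	/* 2. supplementary membership: scan the member list that
	 * getgrnam_r() already returned */
	for (char **m = grp->gr_mem; *m; m++)
		if (strcmp(*m, user) == 0)
			return true;

	return false;
}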


Cheers,




Quoting Janne Blomqvist janne.blomqv...@aalto.fi:


On 2013-08-22T00:22:07 EEST, Moe Jette wrote:

This does not work on AIX, Darwin, Cygwin or even some Linux
configurations. If it works for you that is great, but we probably
need to stay with the belts and suspenders approach.


Although I haven't got access to all those systems to test on, I'm
not sure I agree with your assertion. All the ifdef dancing in the
code is about enumerating the user and group databases (variants of
setpwent/getpwent/endpwent and setgrent/getgrent/endgrent), but since
the point of the patch was to get rid of the enumeration and instead
use the group member list returned by getgrnam_r(), I removed the
ifdef stuff as no longer needed.

Thus, AFAICS, my patched version fails if getgrnam_r() doesn't return
the (correct) member list, but such a serious bug in the C library
sounds quite unlikely (except for the cases of broken user/group
databases I mentioned previously).




Quoting Janne Blomqvist janne.blomqv...@aalto.fi:


On 2013-08-20 03:23, John Thiltges wrote:

These are two AllowGroups snags we've run into:

   Unlike nscd, sssd doesn't allow enumeration by default (and is
   case-sensitive). We add this to /etc/sssd/sssd.conf on the
slurmctld
   node:
enumerate = True
case_sensitive = False


Looking at src/slurmctld/groups.c:get_group_members it seems to be a
case of using both belts and suspenders in order to work around
broken configurations (multiple groups with the same GID, or user
primary groups not being listed in the group data entry), resulting
in something that turns out to not work in case enumeration is
disabled. As many systems such as sssd and winbind disable
enumeration by default (for performance, and maybe
security-by-obscurity), IMHO it would be better to avoid relying on
such a feature. getgrnam_r() already returns all the members of the
group, there's no need to iterate over both the entire user and group
databases to see if entries match.

The attached patch simplifies the code to just use the group member
list returned by the getgrnam_r() call, without enumerating all users
or groups.



   If you have large groups, you might run into the buffer size
limit,
   which is 65k characters. (PW_BUF_SIZE in uid.h)


The patch also fixes this, by calling getgrnam_r() in a loop,
increasing the buffer size if it was too small.



--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || janne.blomqv...@aalto.fi






--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || janne.blomqv...@aalto.fi









--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: AllowGroups broken in Slurm 2.6.0?

2013-08-20 Thread Janne Blomqvist

On 2013-08-20 03:23, John Thiltges wrote:

These are two AllowGroups snags we've run into:

Unlike nscd, sssd doesn't allow enumeration by default (and is
case-sensitive). We add this to /etc/sssd/sssd.conf on the slurmctld
node:
 enumerate = True
 case_sensitive = False


Looking at src/slurmctld/groups.c:get_group_members it seems to be a 
case of using both belts and suspenders in order to work around broken 
configurations (multiple groups with the same GID, or user primary 
groups not being listed in the group data entry), resulting in something 
that turns out to not work in case enumeration is disabled. As many 
systems such as sssd and winbind disable enumeration by default (for 
performance, and maybe security-by-obscurity), IMHO it would be better 
to avoid relying on such a feature. getgrnam_r() already returns all the 
members of the group, there's no need to iterate over both the entire 
user and group databases to see if entries match.


The attached patch simplifies the code to just use the group member list 
returned by the getgrnam_r() call, without enumerating all users or groups.




If you have large groups, you might run into the buffer size limit,
which is 65k characters. (PW_BUF_SIZE in uid.h)


The patch also fixes this, by calling getgrnam_r() in a loop, increasing 
the buffer size if it was too small.




--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || janne.blomqv...@aalto.fi
diff --git a/src/slurmctld/groups.c b/src/slurmctld/groups.c
index 7728484..fd7d758 100644
--- a/src/slurmctld/groups.c
+++ b/src/slurmctld/groups.c
@@ -83,21 +83,13 @@ struct group_cache_rec {
  * NOTE: User root has implicitly access to every group
  * NOTE: The caller must xfree non-NULL return values
  */
-extern uid_t *get_group_members(char *group_name)
+extern uid_t *get_group_members(const char *group_name)
 {
-	char grp_buffer[PW_BUF_SIZE];
+	char *grp_buffer;
+	long buflen;
   	struct group grp,  *grp_result = NULL;
-	struct passwd *pwd_result = NULL;
-	uid_t *group_uids = NULL, my_uid;
-	gid_t my_gid;
-	int i, j, uid_cnt;
-#ifdef HAVE_AIX
-	FILE *fp = NULL;
-#elif defined (__APPLE__) || defined (__CYGWIN__)
-#else
-	char pw_buffer[PW_BUF_SIZE];
-	struct passwd pw;
-#endif
+	uid_t *group_uids = NULL;
+	size_t i, j, uid_cnt;
 
 	group_uids = _get_group_cache(group_name);
 	if (group_uids)	{	/* We found in cache */
@@ -105,84 +97,46 @@ extern uid_t *get_group_members(char *group_name)
 		return group_uids;
 	}
 
-	/* We need to check for !grp_result, since it appears some
-	 * versions of this function do not return an error on failure.
-	 */
-	if (getgrnam_r(group_name, &grp, grp_buffer, PW_BUF_SIZE,
-		   &grp_result) || (grp_result == NULL)) {
+	buflen = sysconf(_SC_GETGR_R_SIZE_MAX);
+	if (buflen <= 0)
+		buflen = 1024;
+	grp_buffer = xmalloc(buflen);
+
+	/* Call getgrnam_r in a loop, increasing the buffer size if it
+	 * turned out it was too small. */
+	while (1) {
+		int res = getgrnam_r(group_name, &grp, grp_buffer, buflen,
+				     &grp_result);
+		if (res) {
+			switch(errno) {
+			case ERANGE:
+				buflen *= 2;
+				xrealloc(grp_buffer, buflen);
+				break;
+			default:
+				error("getgrnam_r failed");
+				xfree(grp_buffer);
+				return NULL;
+			}
+		} else
+			break;
+	}
+	if (grp_result == NULL) {
 		error("Could not find configured group %s", group_name);
+		xfree(grp_buffer);
 		return NULL;
 	}
-	my_gid = grp_result->gr_gid;
-
-	j = 0;
 	uid_cnt = 0;
-#ifdef HAVE_AIX
-	setgrent_r(&fp);
-	while (!getgrent_r(&grp, grp_buffer, PW_BUF_SIZE, &fp)) {
-		grp_result = &grp;
-#elif defined (__APPLE__) || defined (__CYGWIN__)
-	setgrent();
-	while ((grp_result = getgrent()) != NULL) {
-#else
-	setgrent();
-	while (getgrent_r(&grp, grp_buffer, PW_BUF_SIZE,
-			  &grp_result) == 0 && grp_result != NULL) {
-#endif
-	if (grp_result->gr_gid == my_gid) {
-			if (strcmp(grp_result->gr_name, group_name)) {
-				debug("including members of group '%s' as it "
-				      "corresponds to the same gid as group"
-				      " '%s'",grp_result->gr_name,group_name);
-			}
-
-		for (i=0; grp_result->gr_mem[i]; i++) {
-			if (uid_from_string(grp_result->gr_mem[i],
-					    &my_uid) < 0) {
-				/* Group member without valid login */
-				continue;
-			}
-			if (my_uid == 0)
-				continue;
-			if (j+1 >= uid_cnt) {
-				uid_cnt += 100;
-				xrealloc(group_uids,
-					 (sizeof(uid_t) * uid_cnt));
-			}
-			group_uids[j++] = my_uid;
-			}
-		}
-	}
-#ifdef HAVE_AIX
-	endgrent_r(&fp);
-	setpwent_r(&fp);
-	while (!getpwent_r(&pw, pw_buffer, PW_BUF_SIZE, &fp)) {
-		pwd_result = &pw;
-#else
-	endgrent();
-	setpwent();
-#if defined (__sun)
-	while ((pwd_result = getpwent_r(&pw, pw_buffer, PW_BUF_SIZE)) != NULL) {
-#elif defined (__APPLE__) || defined (__CYGWIN__)
-	while ((pwd_result = getpwent()) != NULL) {
-#else
-	while (!getpwent_r(&pw, pw_buffer, PW_BUF_SIZE, &pwd_result)) {
-#endif
-#endif

[slurm-dev] Re: slurm-dev Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slowed than with mpirun)

2013-08-07 Thread Janne Blomqvist


On 2013-08-07 09:19, Christopher Samuel wrote:



On 23/07/13 17:06, Christopher Samuel wrote:


Bringing up a new IBM SandyBridge cluster I'm running a NAMD test
case and noticed that if I run it with srun rather than mpirun it
goes over 20% slower.


Following on from this issue, we've found that whilst mpirun gives
acceptable performance the memory accounting doesn't appear to be correct.

Anyone seen anything similar, or any ideas on what could be going on?


See my message from yesterday

https://groups.google.com/d/msg/slurm-devel/BlZ2-NwwCCg/03DnMEWYHqUJ

for what I think is the reason. That is, the memory accounting is per 
task, and when launching using mpirun the number of tasks does not 
correspond to the number of MPI processes, but rather to the number of 
orted processes (1 per node).




Here are two identical NAMD jobs running over 69 nodes using 16 cores
per node, this one launched with mpirun (Open-MPI 1.6.5):


== slurm-94491.out ==
WallClock: 101.176193  CPUTime: 101.176193  Memory: 1268.554688 MB
End of program

[samuel@barcoo-test Mem]$ sacct -j 94491 -o JobID,MaxRSS,MaxVMSize
JobID MaxRSS  MaxVMSize
-  -- --
94491
94491.batch6504068K  11167820K
94491.05952048K   9028060K


This one launched with srun (about 60% slower):

== slurm-94505.out ==
WallClock: 163.314163  CPUTime: 163.314163  Memory: 1253.511719 MB
End of program

[samuel@barcoo-test Mem]$ sacct -j 94505 -o JobID,MaxRSS,MaxVMSize
JobID MaxRSS  MaxVMSize
-  -- --
94505
94505.batch   7248K   1582692K
94505.01022744K   1307112K



cheers!
Chris
--
  Christopher SamuelSenior Systems Administrator
  VLSCI - Victorian Life Sciences Computation Initiative
  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
  http://www.vlsci.org.au/  http://twitter.com/vlsci





--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: RLIMIT_DATA effectively a no-op on Linux

2013-07-22 Thread Janne Blomqvist

On 2013-07-20T15:23:41 EEST, Chris Samuel wrote:

 Hi there,

 On Sat, 20 Jul 2013 02:53:52 AM Bjørn-Helge Mevik wrote:

 With the recent changes in glibc in how virtual memory is allocated for
 threaded applications, limiting virtual memory usage for threaded
 applications is IMO not a good idea.  (One example: our slurcltd has
 allocated 16.1 GiB virtual memory, but is only using 104 MiB resident.)

 Would you have a pointer to these changes please?

 From a recent message by yours truly to a slurm-dev thread about 
slurmctld memory consumption:


Yes, this is what we're seeing as well. 6.5 GiB VMEM, 376 MB RSS. The
change was that as of glibc 2.10 a more scalable malloc() implementation
is used. The new implementation creates up to 8 (2 on 32-bit) pools per
core, each 64 MB in size. Thus in our case, where slurmctld runs on a
machine with 12 cores, we have up to 12*8*64=6144 MB in those malloc 
pools.


See http://udrepper.livejournal.com/20948.html


I would go even further than Bjørn-Helge and claim that limiting 
virtual memory is, in general, the wrong thing to do. Address space is 
essentially free and doesn't impact other applications so IMHO the 
workload manager has no business limiting that. The glibc malloc() 
behavior is just one situation where trying to limit virtual memory 
goes wrong. There are other situations where allocating lots of virtual 
memory is common. E.g. garbage collected runtimes such as Java often 
allocate huge heaps to use as the garbage collection arena but only a 
small fraction of that is actually used.

 I would suggest looking at cgroups for limiting memory usage.

 Unfortunately cgroups doesn't limit usage (i.e. cause malloc() to fail should
 it have reached its limit); if I understand it correctly it just invokes the
 OOM killer on a candidate process within the cgroup once the limit is reached.
 :-(

Yes, that's my understanding as well. On the positive side, few 
applications can sensibly handle malloc() failures anyway. Often the 
best that can be done without heroic effort is to just print an error 
message to stderr and abort(), which is not terribly different from 
being killed by the OOM killer anyway..

There are a few efforts in the Linux kernel community to do something 
about this that roughly go in a couple slightly different directions:

- Provide some notification to applications that you're exceeding your 
memory limit, release some memory quickly or face the wrath of the OOM 
killer. See

https://lwn.net/Articles/552789/

https://lwn.net/Articles/548180/

- Provide a mechanism for applications to mark memory ranges as 
volatile, where the kernel can drop them if memory gets tight instead 
of going on an OOM killer spree.

https://lwn.net/Articles/522135/

https://lwn.net/Articles/554098/


That being said, AFAIK nothing of the above yet exists in the upstream 
kernel today. So for now IMHO the least bad approach is to just limit 
RSS as slurm already does (either with cgroups or by polling), and 
killing jobs if the limit is exceeded.

--
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & BECS
+358503841576 || janne.blomqv...@aalto.fi


[slurm-dev] Re: slurmctld consuming tons of memory

2013-06-27 Thread Janne Blomqvist

On 2013-06-26 13:46, Bjørn-Helge Mevik wrote:

 Hongjia Cao hj...@nudt.edu.cn writes:

 I have encountered that slurmctld uses more than 20GB of virtual memory.
 But the RSS is less than 1GB. I am not sure whether this is OK or there
 is some leakage.

 On Linux boxes with newer versions of glibc, slurmctld (as well as any
 other process that uses a lot of threads) will use a lot of VMEM.
 There was a change in glibc 2.something (I think it was) in how VMEM is
 allocated for threads.  For instance, our slurmctld right now uses 16
 GiB VMEM, but only 117 MiB RSS.

Yes, this is what we're seeing as well. 6.5 GiB VMEM, 376 MB RSS. The 
change was that as of glibc 2.10 a more scalable malloc() implementation 
is used. The new implementation creates up to 8 (2 on 32-bit) pools per 
core, each 64 MB in size. Thus in our case, where slurmctld runs on a 
machine with 12 cores, we have up to 12*8*64=6144 MB in those malloc pools.


See http://udrepper.livejournal.com/20948.html
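
(If the large address space figure is a cosmetic problem, the number of
arenas can be capped; if I remember correctly glibc >= 2.10 honours the
MALLOC_ARENA_MAX environment variable, so starting the daemon with e.g.

MALLOC_ARENA_MAX=4 slurmctld

should shrink the VMEM number without much effect on RSS. The value 4 is
just an example.)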

-- 
Janne Blomqvist


[slurm-dev] Re: slurm 2.5 upgrade experiences

2012-12-31 Thread Janne Blomqvist

On 2012-12-28T22:21:04 EET, Moe Jette wrote:

 Sorry for the lost jobs. There were two new RPCs that needed to be
 added to a table, but were not. This prevented slurmd daemons running
 version 2.4. from communicating with the slurmctld daemon running
 version 2.5. The fix is here at the location shown below and will be
 in version 2.5.1:

 https://github.com/SchedMD/slurm/commit/844f70a2c233e57f55948494bbd9c377163813fb

 I have already changed the code for version 2.6 to only list the RPCs
 that can not work between releases (job step creation and task
 launch). That is a much smaller list and will rarely change, which
 should prevent this problem in future releases.

Hi,

thanks for fixing this!

I wonder, would it be possible to make the job killing logic a bit less 
aggressive? In both cases where we have lost jobs during a slurm 
upgrade, it has been due to communication failure between slurmctld and 
slurmd's causing jobs to be killed after the timeout, even though in 
both situations the jobs were actually running just fine.

Currently it seems that once slurmctld-slurmd communication fails, a 
timer starts, and after SlurmdTimeout seconds jobs running on the node 
are killed. I understand that this kind of timeout-based killing logic 
is necessary to handle the case when one node in a multi-node job dies, 
and the job itself hangs rather than dies. Otherwise the hung job would 
occupy the healthy nodes until the job was killed by the job time limit.

I'm thinking that one could do something different in the case when all 
the nodes for a job disappear. E.g. for serial or small parallel jobs 
that fit within a single node, or somebody tripping over the network 
cable connecting the slurmctld node to the rest of the cluster, or some 
supposedly quick maintenance task taking longer than expected, etc. In 
such a case, couldn't slurmctld wait with killing the job until 1) a 
node returns to service and slurmd (or is it slurmstepd?)  reports that 
the job is no longer present, in which case the job can be immediately 
killed 2) a node returns to service and slurmd/slurmstepd reports that 
the job is still there, start the timer and wait until SlurmdTimeout 
for the other nodes to return, and if they don't, kill the job.


 Moe


 Quoting Janne Blomqvist janne.blomqv...@aalto.fi:


 Hi,

 just a few quick notes about our experience upgrading from 2.4.x (2.4.3
 IIRC) to 2.5.0 yesterday.

 First, the bad news. Some, but not all, jobs were lost. Luckily the
 cluster was mostly idle after the Christmas holidays, but still. First,
 slurmdbd was upgraded, that went fine although it took a few minutes to
 update the DB tables. Next the slurmctld master, followed by the
 slurmctld backup, and then the compute nodes. AFAICT the reason for the
 job loss was that after the master slurmctld was updated it was unable
 to communicate with the compute nodes, and after the timeout expired it
 killed the jobs. After starting slurmctld 2.5.0 in the log file we had
 lots of messages like

 [2012-12-27T10:22:19+02:00] error: Invalid Protocol Version 6144 from
 uid=0 at 10.10.253.52:6818
 [2012-12-27T10:22:19+02:00] error: slurm_receive_msgs: Protocol version
 has changed, re-link your code

 followed by

 [2012-12-27T10:27:16+02:00] error: Nodes
 cn[01-18,20-64,68-224],fn[01-02],gpu[001-008],tb[003,005-008] not responding

 and then finally lots of messages like

 [2012-12-27T10:28:59+02:00] Killing job_id 2722550 on failed node cn29

 I suspect that the jobs that were not lost happened to run on nodes
 where the slurmd upgrade finished before the timeout. Previously we have
 successfully done on-the-fly upgrades between slurm major versions, and
 also the release notes said it should work, so it was a bit surprising
 that it would fail now. Oh well.


 On a more positive note, we took the new multifactor2 plugin into use,
 and so far it seems to work as designed, although due to the holidays
 there's not much action in the queue so it's still too early for further
 conclusions.

 --
 Janne Blomqvist





--
Janne Blomqvist


[slurm-dev] RE: Making hierarchical fair share hierarchical

2012-08-24 Thread Janne Blomqvist

On 2012-08-24 03:05, Lipari, Don wrote:

Janne,

As one of those who designed the fair-share component to the multi-factor 
priority plugin, I'm open to your suggestions below.  I would recommend you 
create it as a separate multi-factor plugin so that we could have the 
opportunity to switch back and forth and examine the differences in behavior.  
If it winds up delivering more fairness across all cases, then I'm sure we 
would abandon the current implementation someday.


A separate plugin sounds like a good approach, yes. I'll look into that. 
For the time being, in case anyone is interested, the core algorithm 
change can be tested with the attached patch (so far only compile-tested..).



I'm not sure how best to handle the sshare utility with regard to the fields it 
displays.  It might make sense to create a new sshare that displays the 
pertinent components of the new algorithm.

Would you mind citing the reference(s) that motivated the formula you presented 
below?


There's a lot of papers on fair share scheduling, but most of them are 
related to scheduling network packets, and while one can get some idea 
from them the terminology is often somewhat different.


One paper on fair share CPU scheduling which is reasonably easy to 
understand is


J. Kay and P. Lauder [1988], `A Fair Share Scheduler', Communications of 
the ACM, 31(1):44-55 (1988),

http://sydney.edu.au/engineering/it/~judy/Research_fair/89_share.pdf

and an addendum

http://sydney.edu.au/engineering/it/~piers/papers/Share/share2.html


The particular algorithm was (re-)invented by yours truly; later I found 
something quite close to it in 


www.cs.umu.se/~elmroth/papers/fsgrid.pdf


MOAB also uses a similar algorithm, except that the tree hierarchy is 
fixed, and the scale factor for each level of the tree is configurable.


www.eecs.harvard.edu/~chaki/bib/papers/jackson01maui.pdf



 While I understand the shortcomings in the scenario of the A, B, and C 
accounts you describe, coming up with an elegant design that supports a 
multi-level hierarchy is not a trivial problem, particularly when one of the 
requirements is to create something that users will understand intuitively.


Indeed! FWIW, one way in which the proposed algorithm can fail is if 
several accounts have used much more than their fair share; then users 
in those accounts with little usage can clobber the parent FS 
contribution. Might not be a huge problem in practice, though.


An algorithm which I think is quite robust, and which also does not suffer 
from the deep-tree problems of the algorithm I suggested, is the 
one used by SGE (or whatever it's called nowadays). There you start with 
a number of tickets at the root of the tree, and then distribute those 
tickets to the child nodes in proportion to their fair share, recursively, 
so at the end of this phase the tickets have been distributed down to the 
queued jobs. Then you give the job with the highest number of tickets the 
priority 1.0, and the other jobs priorities proportional to their ticket 
counts relative to that first job. This gives a fairly robust algorithm 
which assigns priorities hierarchically like we want, but at the cost of 
adding state (its ticket count) to each queued job.
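
A toy sketch of the ticket distribution, just to make the idea concrete
(illustrative C, not SGE's or slurm's code; "weight" here stands for
whatever per-child fair-share factor one computes):

#include <stddef.h>

struct node {
	double weight;		/* fair-share factor of this child */
	double tickets;		/* filled in by distribute() */
	struct node **child;
	size_t nchild;
};

/* Hand a node's tickets down to its children in proportion to their
 * weights; the leaves correspond to the queued jobs. */
void distribute(struct node *n, double tickets)
{
	double total = 0.0;

	n->tickets = tickets;
	for (size_t i = 0; i < n->nchild; i++)
		total += n->child[i]->weight;
	if (total <= 0.0)
		return;
	for (size_t i = 0; i < n->nchild; i++)
		distribute(n->child[i],
			   tickets * n->child[i]->weight / total);
}

The job with the most tickets would then get priority 1.0 and the others
their ticket count divided by that maximum, as described above.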


Google finds a number of papers on things like lottery scheduling or 
stride scheduling, which are slightly relevant to the ticket 
algorithm. E.g.


www.waldspurger.org/carl/papers/phd-mit-tr667.pdf

I'll try to find time to see whether such an algorithm can be 
incorporated into slurm, and if it would be better than the one I 
originally suggested..




Don


-Original Message-
From: Janne Blomqvist [mailto:janne.blomqv...@aalto.fi]
Sent: Thursday, August 23, 2012 12:05 AM
To: slurm-dev
Subject: [slurm-dev] Making hierarchical fair share hierarchical


Hello,

we are seeing an issue with the hierarchical fair share algorithm which
we'd like to get fixed.  We think our suggested improvements would be of
benefit to others as well, but we'd like some inputs on how, or indeed
whether, to proceed.

For some background, we have a cluster shared between several
departments, and we have configured the fair-share such that each
department account has a share corresponding to its financial
contribution (so far it's simple, 3 departments with equal shares each).
Thus it's politically important for us that the fair-share priority
reflects the usage factor of each department as well as between users
belonging to an account. And yes, we're certainly aware that fair share
cannot guarantee that departments get their fair share, but it should at
least push in that direction.

Thus, we'd like the fair-share factor of a user reflect the account
hierarchy, such that users belonging to underserved accounts always have
a higher fair-share priority than users belonging to an account which
has used more than its fair share.

The problem we're seeing is that the current hierarchical fair-share
algorithm does