[slurm-dev] Re: On the need for slurm uid/gid consistency
On 2017-09-12 21:52, Phil K wrote:
> I'm hoping someone can provide an explanation as to why slurm requires uid/gid consistency across nodes, with emphasis on the need for the 'SlurmUser' to be uid/gid-consistent. I know that slurmctld and slurmdbd can run as user `slurm` and that this would be safer than running as root. slurmd must run as root in any case, to my knowledge. Is the need for uid consistency, esp. with the SlurmUser, a difficult barrier to overcome? Please clarify for me. Thanks. Phil

Yes, this is tedious. Either you need to create the slurm user with a consistent uid/gid when provisioning a node, or else LDAP/NIS/whatever needs to be up and running before you start any slurm daemons. It would be nicer if the RPMs could just create a slurm user when installing the packages for slurmctld/slurmdbd and let the system allocate the uid/gid, so that it doesn't conflict with any other local uids/gids but you don't have to ensure the slurm uid/gid is globally unique.

Anyway, I think the reason behind this is that slurmd needs to ensure that control messages coming from slurmctld really come from the slurmctld daemon, and not from some random unprivileged process. And as munge is needed anyway to ensure that end-user uids/gids are correct, it's also used to ensure that the control messages really come from the SlurmUser.

I guess in principle you could get rid of the requirement for a SlurmUser with a consistent uid/gid, say by using TLS or SSH certificates. But then you'd have to provision the certificates as part of the deployment, so I'm not sure that buys you any more ease of use in the end. Or does anybody have a better idea?

-- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
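For the provisioning route, the user-creation step could look something like the fragment below. This is only a sketch: the uid/gid value 450, the home directory and the shell are arbitrary site choices, not anything Slurm mandates; the one hard requirement is that the same numbers are used on every node.

```
# Run as root during node provisioning, before any slurm daemon starts.
# 450 is a hypothetical site-wide choice; it must be identical on all
# nodes and must not collide with any existing local account.
groupadd --system --gid 450 slurm
useradd --system --uid 450 --gid slurm \
        --home-dir /var/lib/slurm --shell /sbin/nologin \
        --comment "Slurm workload manager" slurm
```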
[slurm-dev] Re: Fair resource scheduling
On 2017-09-07 17:07, Patrick Goetz wrote: > > On 08/25/2017 09:27 AM, Peter A Ruprecht wrote: >> Regarding #2, we use https://slurm.schedmd.com/fair_tree.html and it >> has really exceeded my expectations. >> > > > I was distracted by another project and am just now coming back to this. > It's not at all clear to me how the Fair Tree Fairshare Algorithm > solves the problem I was asking about, namely giving higher priority to > shorter jobs submitted by people who don't run jobs that often. As far > as I can tell from https://slurm.schedmd.com/fair_tree.html, this > algorithm mostly maintains scheduling priorities across child tasks? Am > I missing something? > Please read the description of the slurm fairshare algorithm, which is explained in more detail at https://slurm.schedmd.com/priority_multifactor.html#fairshare Fair tree (which we also use and are pretty happy with) is a modification of the above basic fairshare algorithm that works "better" when you have a hierarchy of accounts. The hierarchy might or might not be relevant for your setup, but the basic fairshare idea seems to be exactly what you're asking for, if I understand you correctly. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
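The basic fair-share idea from that page can be sketched numerically. This is a simplified illustration, not Slurm's actual code: the factor is 1.0 for an association with no recent usage and halves each time effective usage grows by another multiple of the normalized share, so infrequent users get higher priority. The `dampening` argument stands in for FairShareDampeningFactor.

```python
def fairshare_factor(effective_usage, norm_shares, dampening=1.0):
    """Simplified sketch of the classic fair-share factor:
    F = 2 ** (-usage / (shares * dampening)).

    effective_usage and norm_shares are both fractions of the whole
    cluster, so an association that has consumed exactly its share
    gets a factor of 0.5, and one that has consumed nothing gets 1.0.
    """
    if norm_shares <= 0.0:
        return 0.0
    return 2.0 ** (-effective_usage / (norm_shares * dampening))
```

A user who rarely runs jobs thus sits near 1.0 and is prioritized over heavy recent users, which is the behaviour the question asks for; Fair Tree changes how this is applied down an account hierarchy, not the underlying idea.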
[slurm-dev] Re: SLURM 16.05.10-2 jobacct_gather/linux inconsistencies?
3) > [2017-09-05T13:19:28.109] [10215464] debug: jag_common_poll_data: Task > average frequency = 2527 pid > 22050 mem size 4228 230384 time 0.01(0+0) > [2017-09-05T13:19:28.281] [10215464.0] debug: jag_common_poll_data: Task > average frequency = 2495 > pid 22126 mem size 3560932 7655316 time 239.55(234+4) > [2017-09-05T13:19:58.113] [10215464] debug: jag_common_poll_data: Task > average frequency = 2527 pid > 22050 mem size 4228 230384 time 0.01(0+0) > [2017-09-05T13:19:58.286] [10215464.0] debug: jag_common_poll_data: Task > average frequency = 2493 > pid 22126 mem size 3560936 7655316 time 299.51(294+5) > > > > Has anyone else noticed anything similar on their cluster(s)? I cannot > confirm if this was > happening before we upgraded from 15.08.4 to 16.05.10-2. > > Thanks, > John DeSantis > Not knowing the specifics of your setup, AFAIK the general advice remains that unless you're launching the job via srun instead of mpirun (that is, you're using the slurm MPI integration) all bets are off w.r.t. accounting. Since you seem to be using mvapich2(?), make sure you're compiling mvapich2 with ./configure --with-pmi=pmi2 --with-pm=slurm and then launch the application with srun --mpi=pmi2 vasp_std instead of mpirun. (from https://slurm.schedmd.com/mpi_guide.html#mvapich2 ) -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: Building slurm from source - munge plugin not getting build
On 2017-05-11 02:34, Dhiraj Reddy wrote:
> Hi,
>
> I am trying to build the latest release version of slurm from source. I have used the following procedure to do configure, make and make install:
>
> ./configure --enable-debug --prefix=~/opt/slurm-17 --sysconfdir=~/etc/slurm
> make -j4
> make install
>
> Now, running slurmctld and slurmd with verbose logging enabled, I get the following:
>
> ~/opt/slurm-17/sbin/slurmctld -Dv
>
> slurmctld: error: Couldn't find the specified plugin name for crypto/munge looking at all files
> slurmctld: error: cannot find crypto plugin for crypto/munge
> slurmctld: error: cannot create crypto context for crypto/munge
> slurmctld: fatal: slurm_cred_creator_ctx_create((null)): Operation not permitted
>
> ~/opt/slurm-17/sbin/slurmd -Dv
>
> slurmd: error: Couldn't find the specified plugin name for auth/munge looking at all files
> slurmd: error: cannot find auth plugin for auth/munge
> slurmd: error: cannot create auth context for auth/munge
> slurmd: error: slurmd initialization failed
>
> I can see that the above errors are because the libraries auth_munge.so and crypto_munge.so are not available in the ~/opt/slurm-17/lib directory.
>
> Can somebody explain what is wrong with my configuration and build process, and how the auth_munge.so and crypto_munge.so libraries can be built?
>
> I have tried the following way of configuring with no success:
>
> 1. With the munge binary present at /usr/bin/munge
> ./configure --enable-debug --prefix=~/opt/slurm-17 --sysconfdir=~/etc/slurm --with-munge=/usr/bin

The munge plugin is built only if configure finds the munge headers. Do you have them installed? E.g. the package "munge-devel" on RHEL/CentOS with the EPEL repo, or "libmunge-dev" on Debian/Ubuntu.

-- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1
On 2017-05-09 10:27, Ole Holm Nielsen wrote: On 05/09/2017 09:14 AM, Janne Blomqvist wrote: On 2017-05-07 15:29, Ole Holm Nielsen wrote: I'm announcing an initial version 0.1 of an Infiniband topology tool "slurmibtopology.sh" for Slurm. I have also created one, at https://github.com/jabl/ibtopotool You need the python networkx library (python-networkx package on centos & Ubuntu, or install via pip). Run with --help option to get some usage instructions. In addition to generating slurm topology.conf, it can also generate graphviz dot files for visualization. Thanks for providing this tool to the Slurm community. It seems that tools for generating topology.conf have been developed in many places, probably because it's an important task. I installed python-networkx 1.8.1-12.el7 from EPEL on our CentOS 7.3 system and then executed ibtopotool.py, but it gives an error message: # ./ibtopotool.py Traceback (most recent call last): File "./ibtopotool.py", line 216, in graph = parse_ibtopo(args[0], options.shortlabels) IndexError: list index out of range Could you help solving this? Duh.. I just pushed a fix, thanks for reporting. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: Creating init script in /etc/init.d while building from source
On 2017-05-09 09:09, Dhiraj Reddy wrote:
> Hi, how do I create slurmd and slurmctld init scripts in the directory /etc/init.d while building and installing slurm from source? I think something should be done with the file ./etc/init.d.slurm, but I don't know what to do. I am using Ubuntu 16.04. Thanks, Dhiraj

Hi, for Ubuntu 16.04 you should be using the systemd service files instead of init.d scripts. They are part of the rpm file when building for Red Hat-based systems; I don't know about Ubuntu, but presumably you can find them somewhere in the source tree.

-- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: Announce: Infiniband topology tool "slurmibtopology.sh" version 0.1
On 2017-05-07 15:29, Ole Holm Nielsen wrote: I'm announcing an initial version 0.1 of an Infiniband topology tool "slurmibtopology.sh" for Slurm. I have also created one, at https://github.com/jabl/ibtopotool You need the python networkx library (python-networkx package on centos & Ubuntu, or install via pip). Run with --help option to get some usage instructions. In addition to generating slurm topology.conf, it can also generate graphviz dot files for visualization. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: LDAP required?
On 2017-04-13 21:41, Kilian Cavalotti wrote: Hi Janne, On Thu, Apr 13, 2017 at 1:32 AM, Janne Blomqvist <janne.blomqv...@aalto.fi> wrote: Should work as of 16.05 unless you have some very peculiar setup. IIRC I submitted some patch to get rid of the enumeration entirely, but apparently SchedMD has customers who have multiple groups with the same GID, and for that to work (whatever "work" means in that context) the enumeration is necessary. But if you don't have crazy stuff like that it should all work with enumeration disabled. Well, even without dwelling into crazy stuff, enumeration is necessary for things like getting a comprehensive list of all the members of a primary group. The way group membership usually works, users have: * a primary group that is stored in the user record (either in /etc/passwd or ou=accounts in LDAP) * one or more secondary group(s), that are managed in a completely separate branch (/etc/group or ou=groups in LDAP) It's pretty easy to list all the members of a secondary group, because they look like this: "secondary_group:user1,user2,..." But for primary groups, they are in the form of "user1:primary_group", so you have to be able to get the full list of users (through enumeration) to be able to identify all the users that are part of "primary_group" And that's true, sssd is not reliable for enumeration, but it's still required for some basic things. Cheers, The way slurm handles it without enumeration e.g. for checking whether a user is allowed to use a partition with an AllowGroups= specifier is that it checks all the groups listed in AllowGroups, and it also checks whether the user primary group is in AllowGroups. So it does not need enumeration in this case. I'd even go as far as saying that software intended for use in large environments shouldn't rely on enumeration, period. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
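The enumeration-free check described above (scan only the allowed groups' member lists, then resolve the user's own primary gid directly) can be sketched like this. This is an illustrative model, not Slurm's actual code; the lookup functions are injectable parameters only so the example stays self-contained and testable, and in real use they default to the standard `grp`/`pwd` NSS calls.

```python
import grp
import pwd

def user_in_allowed_groups(username, allowed_groups,
                           getgrnam=grp.getgrnam,
                           getpwnam=pwd.getpwnam,
                           getgrgid=grp.getgrgid):
    """AllowGroups-style membership test without enumerating all users."""
    # Secondary membership: look only inside the allowed groups'
    # explicit member lists -- no user enumeration needed.
    for name in allowed_groups:
        try:
            if username in getgrnam(name).gr_mem:
                return True
        except KeyError:
            continue
    # Primary membership: resolve this one user's gid to a group name
    # and check it against the allowed list -- again no enumeration.
    try:
        primary = getgrgid(getpwnam(username).pw_gid).gr_name
    except KeyError:
        return False
    return primary in allowed_groups
```

The key point is that both paths start from known names (the allowed groups and the one user being checked), so no "list every user in the domain" query is ever issued.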
[slurm-dev] Re: LDAP required?
On 2017-04-13 15:09, Diego Zuccato wrote: > > Il 12/04/2017 08:52, Janne Blomqvist ha scritto: > >> BTW, do you have some kind of trust relationship between your FreeIPA >> domain and the AD domain, or how do you do it? I did play around with >> using FreeIPA for our cluster as well and somehow synchronizing it with >> the university AD domain, but in the end we managed to convince the >> university IT to allow us to join our nodes directly to AD, so we were >> able to skip FreeIPA entirely. > What are you using to join nodes to AD? > > I've used samba-winbind in the past but it was very fragile, and am > currently using PBIS-Open but it's having problems with colliding UIDs > and GIDs (multi-domain forest with quite a lot of [100k+] users and even > more groups). > We use adcli (there's an rpm package called adcli in EL7, FWIW; upstream seems to be http://cgit.freedesktop.org/realmd/adcli ). For node provisioning, adcli allows pre-creating multiple machine accounts with one command (with the help of python-hostlist you can expand hostlist syntax), and then when the node first boots the node joins to AD with a one-time password (run via ansible-pull). A minor caveat is that we have some Samba gateway nodes to give laptops and Windows workstations access to Lustre, and samba isn't happy with the domain join that adcli does, and for these we use the samba "net ads join ..." command to join them. Not sure how any of this would work with colliding UID's/GID's. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
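The hostlist-expansion step mentioned above can be sketched as below. This is a deliberately minimal re-implementation for illustration; the real python-hostlist package handles much richer syntax (multiple ranges, comma lists, nesting). The expanded names would then be fed one by one to the adcli pre-creation command.

```python
import re

def expand_hostlist(expr):
    """Expand a simple hostlist expression, e.g.
    "node[01-03]" -> ["node01", "node02", "node03"].
    Expressions without a [lo-hi] range are returned as-is.
    """
    m = re.fullmatch(r"(.*)\[(\d+)-(\d+)\](.*)", expr)
    if not m:
        return [expr]
    prefix, lo, hi, suffix = m.groups()
    width = len(lo)  # keep zero-padding of the lower bound
    return ["%s%0*d%s" % (prefix, width, i, suffix)
            for i in range(int(lo), int(hi) + 1)]
```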
[slurm-dev] Re: LDAP required?
On 2017-04-13 02:30, Christopher Samuel wrote: > > On 13/04/17 01:47, Jeff White wrote: > >> +1 for Active Directory bashing. > > I wasn't intending to "bash" AD here, just that the AD that we were > trying to use (and I suspect that Lachlan might me talking to) has tens > of thousands of accounts in it and we just could not get the > Slurm->sssd->AD chain to work reliably to be able to run a production > system. > > This was with both sssd trying to enumarate the whole domain Err, yeah, with a large domain sssd will keel over if you don't disable enumeration. Or well, perhaps not technically keel over, but in our testing IIRC it hit some timeout before completing the enumeration and then worked, err, somewhat erratically. > and also > (before that) trying to get Slurm to work without sssd enumeration. Should work as of 16.05 unless you have some very peculiar setup. IIRC I submitted some patch to get rid of the enumeration entirely, but apparently SchedMD has customers who have multiple groups with the same GID, and for that to work (whatever "work" means in that context) the enumeration is necessary. But if you don't have crazy stuff like that it should all work with enumeration disabled. 15.08 should also work with enumeration disabled, except for AllowGroups/DenyGroups partition specifications. > Smaller AD domains might work more reliably, but that's not where we sit Our AD has around 30k user accounts, IIRC. > so we fell back to using our own LDAP server with Karaage to manage > project/account applications, adding people to slurmdbd, etc. So how do you manage user accounts? Just curious if someone has a sane middle ground between integrating with the organization user account system (AD or whatever) and DIY. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
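For reference, disabling enumeration in sssd is a per-domain setting; a sketch (the domain name is a placeholder for your own):

```
# /etc/sssd/sssd.conf (fragment)
[domain/example.com]
id_provider = ad
enumerate = false
```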
[slurm-dev] Re: LDAP required?
On 2017-04-11 09:04, Lachlan Musicman wrote: > On 11 April 2017 at 02:36, Raymond Wan <rwan.w...@gmail.com > <mailto:rwan.w...@gmail.com>> wrote: > > > For SLURM to work, I understand from web pages such as > https://slurm.schedmd.com/accounting.html > <https://slurm.schedmd.com/accounting.html> that UIDs need to be shared > across nodes. Based on this web page, it seems sharing /etc/passwd > between nodes appears sufficient. The word LDAP is mentioned at the > end of the paragraph as an alternative. > > I guess what I would like to know is whether it is acceptable to > completely avoid LDAP and use the approach mentioned there? The > reason I'm asking is that I seem to be having a very nasty time > setting up LDAP. It doesn't seem as "easy" as I thought it would be > [perhaps it was my fault for thinking it would be easy...]. > > If I can set up a small cluster without LDAP, that would be great. > But beyond this web page, I am wondering if there are suggestions for > "best practices". For example, in practice, do most administrators > use LDAP? If so and if it'll pay off in the end, then I can consider > continuing with setting it up... > > > > We have had success with a FreeIPA installation to manage auth - every > node is enrolled in a domain and each node runs SSSD (the FreeIPA client). +1. Setting up a LDAP + krb5 infrastructure by hand is quite a chore (been there, done that), but FreeIPA more or less automates all that. > Our auth actually backs onto an Active Directory domain - I don't even > have to manage the users. Which, to be honest, is quite a relief. +1. Or rather, make that +1000. Before, there would be a constant stream of users coming to our office requesting accounts, or wanting to reset a forgotten password, or reactivate an expired account etc.; now all of this is offloaded to the university IT helpdesk. BTW, do you have some kind of trust relationship between your FreeIPA domain and the AD domain, or how do you do it? 
I did play around with using FreeIPA for our cluster as well and somehow synchronizing it with the university AD domain, but in the end we managed to convince the university IT to allow us to join our nodes directly to AD, so we were able to skip FreeIPA entirely. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: Slurm & CGROUP
On 2017-03-15 17:52, Wensheng Deng wrote: > No, it does not help: > > $ scontrol show config |grep -i jobacct > > *JobAcct*GatherFrequency = 30 > > *JobAcct*GatherType = *jobacct*_gather/cgroup > > *JobAcct*GatherParams = NoShared > > > > > > On Wed, Mar 15, 2017 at 11:45 AM, Wensheng Deng <w...@nyu.edu > <mailto:w...@nyu.edu>> wrote: > > I think I tried that. let me try it again. Thank you! > > On Wed, Mar 15, 2017 at 11:43 AM, Chris Read <cr...@drw.com > <mailto:cr...@drw.com>> wrote: > > > We explicitly exclude shared usage from our measurement: > > > JobAcctGatherType=jobacct_gather/cgroup > JobAcctGatherParams=NoShare? > > Chris > > > > From: Wensheng Deng <w...@nyu.edu <mailto:w...@nyu.edu>> > Sent: 15 March 2017 10:28 > To: slurm-dev > Subject: [ext] [slurm-dev] Re: Slurm & CGROUP > > It should be (sorry): > we 'cp'ed a 5GB file from scratch to node local disk > > > On Wed, Mar 15, 2017 at 11:26 AM, Wensheng Deng <w...@nyu.edu > <mailto:w...@nyu.edu><mailto:w...@nyu.edu > <mailto:w...@nyu.edu>>> wrote: > Hello experts: > > We turn on TaskPlugin=task/cgroup. In one Slurm job, we 'cp'ed a > 5GB job from scratch to node local disk, declared 5 GB memory > for the job, and saw error message as below although the file > was copied okay: > > slurmstepd: error: Exceeded job memory limit at some point. > > srun: error: [nodenameXXX]: task 0: Out Of Memory > > srun: Terminating job step 41.0 > > slurmstepd: error: Exceeded job memory limit at some point. > > > From the cgroup document > https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt > <https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt> > Features: > - accounting anonymous pages, file caches, swap caches usage and > limiting them. > > It seems that cgroup charges memory "RSS + file caches" to user > process like 'cp', in our case, charged to user's jobs. swap is > off in this case. The file cache can be small or very big, and > it should not be charged to users' batch jobs in my opinion. 
> How do other sites circumvent this issue? The Slurm version is 16.05.4.
>
> Thank you and Best Regards.

Could you set AllowedRAMSpace/AllowedSwapSpace in /etc/slurm/cgroup.conf to some big number? That way the job memory limit will be the cgroup soft limit, and the cgroup hard limit (at which the kernel OOM-kills the job) would be job_memory_limit * AllowedRAMSpace, that is, some large value?

-- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
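A sketch of what that could look like in cgroup.conf follows; the values are arbitrary examples, and the Allowed* parameters are percentages of the job's allocated memory, so 400 makes the hard limit four times the job's requested memory:

```
# /etc/slurm/cgroup.conf (fragment, values are site choices)
ConstrainRAMSpace=yes
# hard limit = job allocation * AllowedRAMSpace / 100
AllowedRAMSpace=400
AllowedSwapSpace=100
```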
[slurm-dev] Re: problem configuring mvapich + slurm: "error: mpi/pmi2: failed to send temp kvs to compute nodes"
debug2: failed connecting to specified socket > '/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle > slurmd: debug3: in the service_connection > slurmd: debug2: got this type of message 5029 > slurmd: debug3: Entering _rpc_forward_data, address: > /home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66 > slurmd: debug2: failed connecting to specified socket > '/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle > slurmd: debug3: in the service_connection > slurmd: debug2: got this type of message 6004 > slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS > slurmd: debug: _rpc_signal_tasks: sending signal 9 to step 754.0 flag 0 > slurmd: debug3: in the service_connection > slurmd: debug2: got this type of message 6004 > slurmd: debug2: Processing RPC: REQUEST_SIGNAL_TASKS > slurmd: debug: _rpc_signal_tasks: sending signal 9 to step 754.0 flag 0 > slurmd: debug3: in the service_connection > slurmd: debug2: got this type of message 6011 > slurmd: debug2: Processing RPC: REQUEST_TERMINATE_JOB > slurmd: debug: _rpc_terminate_job, uid = 500 > slurmd: debug: task_p_slurmd_release_resources: 754 > slurmd: debug3: state for jobid 744: ctime:1479371687 revoked:0 expires:0 > slurmd: debug3: state for jobid 745: ctime:1479371707 revoked:0 expires:0 > slurmd: debug3: state for jobid 746: ctime:1479371733 revoked:0 expires:0 > slurmd: debug3: state for jobid 747: ctime:1479371785 revoked:0 expires:0 > slurmd: debug3: state for jobid 748: ctime:1479374028 revoked:0 expires:0 > slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:1479374387 > expires:1479374387 > slurmd: debug3: state for jobid 753: ctime:1479374372 revoked:0 expires:0 > slurmd: debug3: state for jobid 754: ctime:1479374425 revoked:0 expires:0 > slurmd: debug3: state for jobid 754: ctime:1479374425 revoked:0 expires:0 > slurmd: debug4: unable to create link for > /home/localsoft/slurm/spool//cred_state -> > /home/localsoft/slurm/spool//cred_state.old: File exists > slurmd: 
debug4: unable to create link for > /home/localsoft/slurm/spool//cred_state.new -> > /home/localsoft/slurm/spool//cred_state: File exists > slurmd: debug: credential for job 754 revoked > slurmd: debug2: No steps in jobid 754 to send signal 18 > slurmd: debug2: No steps in jobid 754 to send signal 15 > slurmd: debug4: sent SUCCESS > slurmd: debug2: set revoke expiration for jobid 754 to 1479374560 UTS > slurmd: debug: Waiting for job 754's prolog to complete > slurmd: debug: Finished wait for job 754's prolog to complete > slurmd: debug: Calling /home/localsoft/slurm/sbin/slurmstepd spank epilog > spank-epilog: debug: Reading slurm.conf file: > /home/localsoft/slurm/etc/slurm.conf > spank-epilog: debug: Running spank/epilog for jobid [754] uid [500] > spank-epilog: debug: spank: opening plugin stack > /home/localsoft/slurm/etc/plugstack.conf > slurmd: debug: completed epilog for jobid 754 > slurmd: debug3: slurm_send_only_controller_msg: sent 192 > slurmd: debug: Job 754: sent epilog complete msg: rc = 0 > > > As you can see, the problem seems to be these lines: > slurmd: debug2: got this type of message 5029 > slurmd: debug3: Entering _rpc_forward_data, address: > /home/localsoft/slurm/spool//sock.pmi2.754.0, len: 66 > slurmd: debug2: failed connecting to specified socket > '/home/localsoft/slurm/spool//sock.pmi2.754.0': Stale file handle > slurmd: debug3: in the service_connection > > > I have checked that these files exist in the shared storage and are > accesible by the node complaining. They are however empty. Is this > normal? What should I expect? 
> > $ ssh acme11 'ls -plah /home/localsoft/slurm/spool/' > total 160K > drwxr-xr-x 2 slurm slurm 4,0K nov 17 10:26 ./ > drwxr-xr-x 12 slurm slurm 4,0K nov 16 16:20 ../ > srwxrwxrwx 1 root root 0 nov 17 10:26 acme11_755.0 > srwxrwxrwx 1 root root 0 nov 17 10:26 acme12_755.0 > -rw--- 1 root root 284 nov 17 10:26 cred_state.old > -rw--- 1 slurm slurm 141K nov 16 14:24 slurmdbd.log > -rw-r--r-- 1 slurm slurm5 nov 16 14:24 slurmdbd.pid > srwxr-xr-x 1 root root 0 nov 17 10:26 sock.pmi2.755.0 > > > So any ideas? > > thanks for your help, > > Manuel > > > PS: About mvapich compilation. > > I made quite a few tests, and I ended up compiling with: > ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast > --with-slurm=/home/localsoft/slurm > > Before that I tried the instructions > in http://slurm.schedmd.com/mpi_guide.html#mvapich2 > <http://slurm.schedmd.com/mpi_guide.html#mvapich2> but if fails: > ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast > --with-pmi=pmi2 --with-pm=slurm > (...) > checking for slurm/pmi2.h... no > configure: error: could not find slurm/pmi2.h. Configure aborted > > I also tried > ./configure --prefix=/home/localsoft/mvapich2 --disable-mcast > --with-slurm=/home/localsoft/slurm --with-pmi=pmi2 --with-pm=slurm > (...) > checking whether we are cross compiling... configure: error: in > `/root/mvapich2-2.2/src/mpi/romio': > configure: error: cannot run C compiled programs. > If you meant to cross compile, use `--host'. > See `config.log' for more details > configure: error: src/mpi/romio configure failed > > Hi, I think you really need both "--with-pmi=pmi2 --with-pm=slurm" parameters to the configure command when building mvapich2. So you need to fix whatever issues is preventing it from finding slurm/pmi2.h (I have a vague recollection that at some point there was some problem with slurm makefiles not installing that file, or something like that). 
On another note, it doesn't make sense to put special files like pipes or sockets on a network filesystem. At best it does no harm, but there might be problems if several nodes want to create, say, a socket special file at the same shared path. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
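Concretely, since the PMI2 sockets and credential state live under the slurmd spool directory, the fix for the shared-path problem is to point that directory at node-local disk. A minimal slurm.conf sketch (the path is a typical example, not a requirement):

```
# slurm.conf (fragment): slurmd state, cred_state files and the
# sock.pmi2.* sockets live here, so keep it on node-local disk,
# not on NFS or another shared filesystem
SlurmdSpoolDir=/var/spool/slurmd
```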
[slurm-dev] Re: TmpFS directive getting ignored?
On 2016-11-17 15:21, Holger Naundorf wrote:
> Hello,
> I am currently setting up a test environment with SLURM to evaluate it for regular use.
>
> It looks as if the 'TmpFS' setting is not used - I set
>
> TmpFS=/scratch
>
> in slurm.conf, but if I run 'slurmd -C' on my nodes I get back the size of the partition containing /tmp for my 'TmpDisk' space. Within Slurm jobs $TMPDIR also gets set to /tmp.
>
> $ df -h /tmp/
> Filesystem Size Used Avail Use% Mounted on
> /dev/sda2 96G 8.0G 84G 9% /
>
> $ df -h /scratch/
> Filesystem Size Used Avail Use% Mounted on
> /dev/sdb1 1.8T 68M 1.7T 1% /scratch
>
> $ ./sbin/slurmd -C
> ClusterName=X NodeName= CPUs=12 Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=48258 TmpDisk=98301
>
> $ grep TmpFS ./etc/slurm.conf
> TmpFS=/scratch
>
> I am using Slurm v16.05.5.
>
> I am just starting with this, so maybe I am overlooking something obvious, but I did not find another tuning option for this in the documentation.
>
> Regards,
> Holger Naundorf

I've run into the same. I think the issue is that "slurmd -C" doesn't read the config file (since its purpose is to print a line you can use when creating your config file). Once you start slurmd for real, it reads the config file and TmpFS is handled like it should be. The calculation it does is not that difficult per se; you can do it yourself in the shell with something like

echo $(($(df --output=size /scratch|tail -1) / 1024))

And no, TmpFS has no effect on setting TMPDIR in jobs. If you want to do that, you have to do it yourself, e.g. with a TaskProlog script (see the slurm.conf man page).

-- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
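A TaskProlog sketch for setting TMPDIR per job (the script path and the /scratch layout are site-specific examples; point TaskProlog= in slurm.conf at the script). slurmd injects any line the script prints in the form "export NAME=VALUE" into the task's environment:

```shell
#!/bin/bash
# Hypothetical /etc/slurm/taskprolog.sh: give each job its own TMPDIR
# under /scratch.  Output lines of the form "export NAME=VALUE" are
# applied to the task environment by slurmd.
echo "export TMPDIR=/scratch/${SLURM_JOB_USER}/${SLURM_JOB_ID}"
```

Creating the directory itself (and cleaning it up) would still need a prolog/epilog pair run by root; this fragment only exports the variable.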
[slurm-dev] Re: Slurmctld auto restart and kill running job, why ?
On 2016-09-27 10:39, Philippe wrote:
> If I can't use logrotate, what must I use?

You can log via syslog, and let your syslog daemon handle the rotation (and rate limiting, disk full, logging to a central log host, and all the other nice things that syslog can do for you).

-- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
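If I remember the semantics right, logging via syslog just means leaving the daemons' own log-file options unset, roughly:

```
# slurm.conf (fragment): with the *LogFile options left unset, the
# daemons log through syslog and the syslog daemon owns rotation
#SlurmctldLogFile=
#SlurmdLogFile=
SlurmctldDebug=info
SlurmdDebug=info
```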
[slurm-dev] Re: Struggling with QOS?
On 2016-09-29 04:11, Lachlan Musicman wrote:
> Hi,
>
> After some fun incidents with accidental monopolization of the cluster, we decided to enforce some QOS.
[snip]
> What have I done wrong? I re-read the documentation this AM, but I can't see anything that might be preventing QOS from being applied except for maybe a qos hierarchy issue, but I've only set the two qos and they apply to distinct associations and partitions.

What you actually want here is GrpTRESRunMins; see http://tech.ryancox.net/2014/04/scheduler-limit-remaining-cputime-per.html for an explanation of how it works. Also, if you do this you'll probably want to add "safe" to your AccountingStorageEnforce flags.

-- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
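The idea behind GrpTRESRunMins, as described at that link, can be modelled in a few lines: sum cpu times remaining-minutes over the group's running jobs, and hold a new job while admitting it (counted at its full time limit) would push the sum over the limit. This is an illustrative model only, not Slurm's implementation:

```python
def cpu_run_mins(running_jobs):
    # running_jobs: iterable of (cpus, remaining_minutes) per running job
    return sum(cpus * mins for cpus, mins in running_jobs)

def would_exceed_limit(running_jobs, new_cpus, new_mins, grp_limit):
    # A new job is counted at its full time limit; it is held while
    # admitting it would push the group's cpu*run-minutes over the cap.
    return cpu_run_mins(running_jobs) + new_cpus * new_mins > grp_limit
```

Because each running job's contribution shrinks as it approaches its time limit, short jobs free up "budget" quickly, which is why this limit throttles monopolization more gracefully than a plain CPU-count cap.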
[slurm-dev] Re: Problem when adding user to secondary group
, which version of Slurm do you have? Did you update Slurm >>>>> recently? Did you always had this problem or you discovered that >>>>> problem recently? Check if a newer version of Slurm solves this >>>>> problem otherwise report it here. >>>>> >>>>> Best Regards, >>>>> Valantis >>>>> >>>>> >>>>> On 05/25/2016 02:59 PM, Thekla Loizou wrote: >>>>>> >>>>>> Hi Valanti! :) >>>>>> >>>>>> We are using nslcd on the compute nodes. >>>>>> We have indeed changed the default behavior/command of salloc but >>>>>> I don't think that this is the issue because the same happens when >>>>>> we submit jobs via sbatch. So I believe that this is not related >>>>>> to the new command we are using. >>>>>> >>>>>> When logging in as root or as user on the compute nodes via ssh we >>>>>> get all groups after running the "id" command, >>>>>> but when logging in through a SLURM job (interactive with salloc >>>>>> or not interactive with sbatch) we face the problem I described. >>>>>> >>>>>> We have also checked the environment of the user in both cases >>>>>> (ssh or SLURM) and the only differences are the SLURM environment >>>>>> variables and nothing else. >>>>>> >>>>>> Thanks, >>>>>> Thekla >>>>>> >>>>>> >>>>>> On 25/05/2016 02:07 μμ, Chrysovalantis Paschoulas wrote: >>>>>>> >>>>>>> Hi Thekla! :) >>>>>>> >>>>>>> For me it looks like it's a configuration issue of the client >>>>>>> LDAP name >>>>>>> service on the compute nodes. Which service are you using? nslcd or >>>>>>> sssd? I can see that you have change the default behavior/command of >>>>>>> salloc and the command gives you a prompt on the compute node >>>>>>> directly >>>>>>> (by default salloc will return a shell on the login node where it >>>>>>> was >>>>>>> called). Check and be sure that you are not doing something wrong >>>>>>> in the >>>>>>> new salloc command that you defined in slurm.conf >>>>>>> (SallocDefaultCommand >>>>>>> option). 
>>>>>>> >>>>>>> Can you try to go as root on the compute nodes and try to resolve >>>>>>> a uid >>>>>>> with the id command? What does it give you there, all groups or some >>>>>>> secondary groups are missing? If the secondary groups are missing >>>>>>> then >>>>>>> it's not a problem of Slurm but the config of the ID resolving >>>>>>> service. >>>>>>> As far as I know Slurm changes the environment after salloc (e.g. >>>>>>> exports SLURM_ env vars) but shouldn't change the behavior of >>>>>>> commands >>>>>>> like id.. >>>>>>> >>>>>>> Best Regards, >>>>>>> Chrysovalantis Paschoulas >>>>>>> >>>>>>> >>>>>>> On 05/25/2016 10:32 AM, Thekla Loizou wrote: >>>>>>>> >>>>>>>> Dear all, >>>>>>>> >>>>>>>> We have noticed a very strange problem every time we add an >>>>>>>> existing >>>>>>>> user to a secondary group. >>>>>>>> We manage our users in LDAP. When we add a user to a new group and >>>>>>>> then type the "id" and "groups" commands we see that the user was >>>>>>>> indeed added to the new group. The same happens when running the >>>>>>>> command "getent groups". >>>>>>>> >>>>>>>> For example, for a user "thekla" whose primary group was >>>>>>>> "cstrc" and >>>>>>>> now was also added to the group "build" we get: >>>>>>>> [thekla@node01 ~]$ id >>>>>>>> uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc),10257(build) >>>>>>>> [thekla@node01 ~]$ groups >>>>>>>> cstrc build >>>>>>>> [thekla@node01 ~]$ getent group | grep build >>>>>>>> build:*:10257:thekla >>>>>>>> >>>>>>>> The above output is the correct one and it is given to us when >>>>>>>> we ssh >>>>>>>> to one of the compute nodes. 
>>>>>>>>
>>>>>>>> But when we submit a job on the nodes (so getting access through
>>>>>>>> SLURM and not with direct ssh), we cannot see the new group the
>>>>>>>> user was added to:
>>>>>>>> [thekla@prometheus ~]$ salloc -N1
>>>>>>>> salloc: Granted job allocation 8136
>>>>>>>> [thekla@node01 ~]$ id
>>>>>>>> uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc)
>>>>>>>> [thekla@node01 ~]$ groups
>>>>>>>> cstrc
>>>>>>>>
>>>>>>>> Meanwhile, the following output shows the correct result:
>>>>>>>> [thekla@node01 ~]$ getent group | grep build
>>>>>>>> build:*:10257:thekla
>>>>>>>>
>>>>>>>> This problem appears only when we get access through SLURM,
>>>>>>>> i.e. when we run a job.
>>>>>>>>
>>>>>>>> Has anyone faced this problem before? The only way we have found
>>>>>>>> to solve it is to restart the SLURM service on the compute nodes
>>>>>>>> every time we add a user to a new group.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Thekla

-- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: slurmd
On 2016-05-10 19:14, kevin@gmail.com wrote: Background: I'm trying to write a slurmd task plugin to bind mount /tmp to /tmp/USERID/JOBID. Question 1: should I be using a task plugin or a spank plugin to do this?

There are a number of options here. You can look at the slides from the past few Slurm User Group meetings; there's some info there. What we ended up with was to use pam_namespace to set up various bind-mounted dirs (in our case /tmp, /var/tmp, and /dev/shm), and then an epilog script to clean them up afterwards. Below is our /etc/security/namespace.d/site.conf:

/tmp /l/tmp-inst/ user root,bin,adm
/var/tmp /l/vartmp_inst/ user root,bin,adm
/dev/shm /dev/shm/inst/ user root,bin,adm

See the pam_namespace(8) and namespace.conf(5) man pages for more info. Then in /etc/pam.d/slurm add

session required pam_namespace.so

Finally, the slurm epilog script is below; modify as appropriate.

#!/bin/bash
# If pam_namespace is used to create per-job /tmp, /var/tmp, /dev/shm,
# clean them here in the epilog when no jobs are running on the node.
# Annoyingly, squeue always exits with status 0, so we must check that
# the output is empty, that is, no jobs by the user running on the node
# and no error occurred (timeout etc.)
userlist=$(/usr/bin/squeue -w $HOSTNAME -o%u -h -u $SLURM_JOB_USER -t R,S,CF 2>&1)
if [ -z "$userlist" ]; then
    /bin/rm -rf /l/tmp-inst/$SLURM_JOB_USER /l/vartmp_inst/$SLURM_JOB_USER /dev/shm/$SLURM_JOB_USER
fi

Question 2: I'm launching slurmd with the following line

sudo slurmd -N linux0 -D -vv

but the debug statements in the slurmstepd code aren't being printed to screen. I assume that the slurmstepd code is being run in a fork of slurmd. Where can I find debug output from slurmstepd?

Nowhere, see https://bugs.schedmd.com/show_bug.cgi?id=2631 . The workaround is to just run slurmd without "-D" and tail -f the syslog. -- Janne Blomqvist, D.Sc. 
(Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: Slurm on CentOS 7.x
On 2016-03-13 12:42, Rémi Palancher wrote: Le 12/03/2016 18:32, Jagga Soorma a écrit : Any ideas why the slurm service on the client might be throwing those timed-out errors? `systemctl status` shows you're using the SYSV init script for slurm: > # systemctl status slurm > slurm.service - LSB: slurm daemon management > Loaded: loaded (/etc/rc.d/init.d/slurm) ^ I guess in this case "timeout" means the init script didn't terminate quickly enough for systemd. Maybe an interactive run of this init script with `set -x` would help to see what's going on? However, I would definitely recommend using native *.service files when using systemd. Slurm provides those files in the etc/ dir: https://github.com/SchedMD/slurm/tree/master/etc You can install them with the RPMs. This way, systemd will be much more precise when reporting errors.

Indeed, but note that the slurm rpm spec has a bug where, even if one builds it on a systemd system, it still includes the old init.d scripts in addition to the systemd service files. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: EL6 clusters with cgroups enabled
On 2016-02-11 07:06, Christopher Samuel wrote: On 11/02/16 06:15, Christopher B Coffey wrote: > I'm curious which kernel you are running on your el6 clusters that > have cgroups enabled in slurm. I have an issue where some workloads > cause 100's-1000's of flocks to occur relating to the memory cleanup > portion in the cgroup. Is this kernel code, or userspace? My understanding of the kernel developers' concerns over memory cgroups was around the extra overhead in memory allocation inside the kernel. Here's a write-up from LWN on the issue, from the 2012 mm minisummit at the Kernel Summit: https://lwn.net/Articles/516533/ Interestingly, the RHEL page mentions a memory overhead on x86-64, but not a performance issue, so whether they backported later patches to reduce the impact of memory cgroups I cannot tell right now. https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-memory.html I did benchmarking a few years back when we were transitioning to RHEL6 and Slurm with memory cgroups enabled and couldn't see any significant difference in performance. Unfortunately I suspect I cleaned all that up some time ago. :-( We use them and haven't noticed any issues yet. All the best, Chris

See http://slurm.schedmd.com/slurm_ug_2012/SUG-2012-Cgroup.pdf slides 31-35. I don't know if RedHat has backported the 2.6.38 memcg changes to the 2.6.32 version they use in RHEL6. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: AllowGroups and AD
On 2016-02-10 15:12, Diego Zuccato wrote: Hello all. I think I'm doing something wrong, but I don't understand what. I'm trying to limit the users allowed to use a partition (which, coming from Torque, I think is the equivalent of a queue), but obviously I'm failing. :( The frontend and work nodes are all Debians joined to AD via Winbind (which ensures consistent UID/GID mapping, at the expense of having many groups and a bit of slowness while looking 'em up). On every node I can run 'id' and it says (redacted): uid=108036(diego.zuccato) gid=100013(domain_users) gruppi=100013(domain_users),[...],242965(str957.tecnici),[...] (it takes about 10s to get the complete list of groups). Linux ACLs work as expected (if I set a file to be readable only by Str957.tecnici, I can read it), but when I do

scontrol update PartitionName=pp_base AllowGroups=str957.tecnici

or even

scontrol update PartitionName=pp_base AllowGroups=242965

and then try to sbatch a job, I get:

diego.zuccato@Str957-cluster:~$ sbatch aaa.sh
sbatch: error: Batch job submission failed: User's group not permitted to use this partition
diego.zuccato@Str957-cluster:~$ newgrp Str957.tecnici
diego.zuccato@Str957-cluster:~$ sbatch aaa.sh
sbatch: error: Batch job submission failed: User's group not permitted to use this partition

So I don't get recognized even if I change my primary GID. :( I've been in that group since way before installing the cluster, and I already tried rebooting everything to refresh the cache. Another detail that may be useful:

diego.zuccato@Str957-cluster:~$ time getent group str957.tecnici
str957.tecnici:x:242965:[...],diego.zuccato,[...]
real    0m0.012s
user    0m0.000s
sys     0m0.000s

Any hints? TIA

Hi, do you have user and group enumeration enabled in winbind? I.e., do

$ getent passwd

and

$ getent group

return nothing, or the entire user and group lists? 
FWIW, slurm 16.05 will have some changes to work better in environments with enumeration disabled, see http://bugs.schedmd.com/show_bug.cgi?id=1629 -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: User management
On 2016-01-13 23:54, Trey Dockendorf wrote:
> We use 389 Directory Server for our LDAP and SSSD for clients, works well.

We also use sssd on the clients, but instead of running our own LDAP(+Kerberos) infrastructure, we use the university Active Directory. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: slurmd can't mount cpuacct cgroup namespace on RHEL 7.2 ?
On 2015-12-21 07:32, Christopher Samuel wrote: Hi folks, I'm helping bring up a new cluster with Slurm 15.08.5 on RHEL 7.2 and I've run into an odd case where trying to launch a process with srun triggers this failure:

[2015-12-21T15:38:00.213] unable to mount cpuacct cgroup namespace: Device or resource busy
[2015-12-21T15:38:00.213] jobacct_gather/cgroup: unable to create cpuacct namespace

I suspect this might be systemd related, but as I've limited experience with it so far I'm not certain. This is what is failing according to strace:

12725 mount("cgroup", "/cgroup/cpuacct", "cgroup", MS_NOSUID|MS_NODEV|MS_NOEXEC, "cpuacct") = -1 EBUSY (Device or resource busy)

...and it might be related to this existing mount, courtesy of systemd, in /proc/mounts:

cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpuacct,cpu 0 0

Anyone else seen this, or got any ideas?

1. When using systemd, or some other tool that mounts the cgroup file systems early in the boot process (e.g. cgconfig), you should not try to mount the cgroup filesystems from slurmd. That is, in /etc/slurm/cgroup.conf put "CgroupAutomount=no".
2. Modern distros use /sys/fs/cgroup as the de facto standard root mount point for cgroup filesystems. Previously there were some distro-specific variations (/cgroup, /dev/cgroups, whatever). Thus, put into /etc/slurm/cgroup.conf the line "CgroupMountpoint=/sys/fs/cgroup". (At some point it might make sense to change the Slurm default cgroup mount point to match the modern standard.)
3. For details, see https://wiki.freedesktop.org/www/Software/systemd/PaxControlGroups/

-- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
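Taken together, points 1 and 2 above amount to this /etc/slurm/cgroup.conf fragment (a sketch of the advice, not a complete cgroup.conf):

```
CgroupAutomount=no                 # systemd/cgconfig already mounted the hierarchies
CgroupMountpoint=/sys/fs/cgroup    # the modern standard root mount point
```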
[slurm-dev] RE: multifactor2 priority weighting
Hi, I'm the author of the multifactor2 algorithm. The multifactor2/ticket-based fairshare algorithm, in the end, produces fair-share values normalized between 0 and 1, like the other available algorithms. The algorithm has 3 phases:

1. Mark which users have jobs in the queue.
2. Starting from the root of the account tree, distribute tickets, weighted by a fair-share factor (which is between 0 and 100, as Don mentioned).
3. Calculate the final fair-share priority, such that the user with the most tickets gets the priority 1.0, and the others get priorities proportional to that, depending on how many tickets they have.

The goal of the algorithm was to better balance fair-share in an account hierarchy than the original fair-share algorithm does. If there is no hierarchy, I don't think it provides much benefit. Also, if the OP is already on 14.11, I recommend looking at the fair-tree algorithm, http://slurm.schedmd.com/fair_tree.html, which is what we are currently running. It solves some issues with the ticket-based algorithm, and I think it's currently the best choice for ensuring fairness between account hierarchies.

On 2015-05-13 18:21, Lipari, Don wrote: I need to retract my comments below. While what I wrote is consistent with some of the text from the priority_multifactor2.html page, the writers introduced a secondary term, "fair-share priority", which looks to be a normalized fair-share factor. And I am not directly familiar with this nuance. Don -Original Message- From: Lipari, Don [mailto:lipa...@llnl.gov] Sent: Wednesday, May 13, 2015 7:56 AM To: slurm-dev Subject: [slurm-dev] RE: multifactor2 priority weighting The original multi-factor plugin generated fair-share factors ranging between 0.0 and 1.0. Under this formula, 0.5 was the factor a user/account would see if their usage was commensurate with their shares. The multi-factor2 plugin fair-share factors range between 0.0 and 100.0, with 1.0 indicating a perfectly serviced user/account. 
Don -Original Message- From: gareth.willi...@csiro.au [mailto:gareth.willi...@csiro.au] Sent: Tuesday, May 12, 2015 10:05 PM To: slurm-dev Subject: [slurm-dev] multifactor2 priority weighting Hi All, We've just switched to multifactor2 priority as it seemed like a good idea (and low risk) and it is working but the priority factors are much lower (maybe 100x). The final paragraph on http://slurm.schedmd.com/priority_multifactor2.html says we might need to reweight but this magnitude seems odd. Does anyone have any insight that they might like to share? Regards, Gareth BTW. The following may matter. We currently do not have a tree as such, all users are direct children of root. This probably makes using the ticket based algorithm less compelling. Also, we have set FairShareDampeningFactor=40 as we have many relatively inactive users and this may have an impact though I'd expect it to be ignored with the multifactor2 scheme. We have PriorityWeightFairshare=1 -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS NBE +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: Slurm versions 14.11.6 is now available
On 2015-04-25 01:59, David Bigagli wrote: Hi all, the reason to compile without optimization is to be able to get a meaningful stack when attaching gdb to the daemons or when analysing core files. If optimization is on, crucial variables in the stack are optimized out, preventing exact diagnosis of issues. This of course is configurable; we only changed the default. If sites wish to compile with optimization, use the config option --disable-debug. So this is not about sweeping bugs under the carpet; it is exactly the opposite, a tool to debug more efficiently.

FWIW, newer gcc versions have an option -Og, which enables optimizations that don't interfere with debugging. Might it be worth adding a configure check for a recent enough gcc, and enabling that option then? IIRC the optimizations are roughly similar to what -O1 gives. Anyway, is there a way to enable optimization but keep the debug symbols? For our production builds, I think we'd like to have -O2 -g.

The reason for using statvfs versus statfs is that statfs is deprecated and replaced by the POSIX statvfs, so it is portable across platforms; indeed, NetBSD and Solaris do not have statfs. Since all platforms have statvfs, the code in get_tmp_disk() under the #define (HAVE_STATFS) is obsolete and will possibly be removed in the next major release.

Indeed, statvfs is in POSIX and should work everywhere on a decently new system. However, as I mentioned in my previous message, on Linux prior to kernel 2.6.36 and glibc 2.13 it's not as robust as the (non-standard) statfs. Hence I would prefer that statfs be used on Linux in preference to statvfs, as most slurm clusters are presumably still running on older kernel/glibc versions. Something like the attached patch? -- Janne Blomqvist, D.Sc. 
(Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi

diff --git a/src/slurmd/slurmd/get_mach_stat.c b/src/slurmd/slurmd/get_mach_stat.c
index d7c5eb1..81994a6 100644
--- a/src/slurmd/slurmd/get_mach_stat.c
+++ b/src/slurmd/slurmd/get_mach_stat.c
@@ -216,57 +216,42 @@ extern int
 get_tmp_disk(uint32_t *tmp_disk, char *tmp_fs)
 {
 	int error_code = 0;
-
-#if defined(HAVE_STATVFS)
-	struct statvfs stat_buf;
-	uint64_t total_size = 0;
+	unsigned long long total_size = 0;
 	char *tmp_fs_name = tmp_fs;
-	*tmp_disk = 0;
-	total_size = 0;
-
 	if (tmp_fs_name == NULL)
 		tmp_fs_name = "/tmp";
-	if (statvfs(tmp_fs_name, &stat_buf) == 0) {
-		total_size = stat_buf.f_blocks * stat_buf.f_frsize;
-		total_size /= 1024 * 1024;
-	}
-	else if (errno != ENOENT) {
-		error_code = errno;
-		error ("get_tmp_disk: error %d executing statvfs on %s",
-		       errno, tmp_fs_name);
-	}
-	*tmp_disk += (uint32_t)total_size;
-#elif defined(HAVE_STATFS)
+#if defined(__linux__)
+	/* Prior to Linux 2.6.36 and glibc 2.13, statvfs() can get
+	 * stuck if ANY mount in the system is hung, so use the
+	 * non-standard statfs() instead.  Furthermore, as of Linux
+	 * 2.6+ struct statfs contains the f_frsize field which gives
+	 * the size of the blocks reported in the f_blocks field. */
 	struct statfs stat_buf;
-	long total_size;
-	float page_size;
-	char *tmp_fs_name = tmp_fs;
-
-	*tmp_disk = 0;
-	total_size = 0;
-	page_size = (sysconf(_SC_PAGE_SIZE) / 1048576.0); /* MG per page */
-	if (tmp_fs_name == NULL)
-		tmp_fs_name = "/tmp";
-#if defined (__sun)
-	if (statfs(tmp_fs_name, &stat_buf, 0, 0) == 0) {
-#else
 	if (statfs(tmp_fs_name, &stat_buf) == 0) {
-#endif
-		total_size = (long)stat_buf.f_blocks;
+		total_size = stat_buf.f_blocks * stat_buf.f_frsize;
+		total_size /= 1024 * 1024;
+	} else if (errno != ENOENT) {
+		error_code = errno;
+		error ("get_tmp_disk: error %d executing statfs on %s",
+		       errno, tmp_fs_name);
+	}
+#elif defined(HAVE_STATVFS)
+	struct statvfs stat_buf;
+
+	if (statvfs(tmp_fs_name, &stat_buf) == 0) {
+		total_size = stat_buf.f_blocks * stat_buf.f_frsize;
+		total_size /= 1024 * 1024;
 	} else if (errno != ENOENT) {
 		error_code = errno;
-		error ("get_tmp_disk: error %d executing statfs on %s",
+		error ("get_tmp_disk: error %d executing statvfs on %s",
 		       errno, tmp_fs_name);
 	}
-
-	*tmp_disk += (uint32_t)(total_size * page_size);
-#else
-	*tmp_disk = 1;
 #endif
+	*tmp_disk = (uint32_t)total_size;
 	return error_code;
 }
[slurm-dev] Re: Slurm versions 14.11.6 is now available
On 2015-04-24 02:03, Moe Jette wrote: Slurm version 14.11.6 is now available with quite a few bug fixes as listed below. Slurm downloads are available from http://slurm.schedmd.com/download.html * Changes in Slurm 14.11.6 == [snip] -- Enable compiling without optimizations and with debugging symbols by default. Disable this by configuring with --disable-debug.

Always including debug symbols is good (the only cost is a little bit of disk space, which should never really be a problem), but disabling optimization by default?? In our environment slurmctld consumes a decent chunk of CPU time; I would loathe to see it getting a lot (?) slower. Typically, problems which are "fixed" by disabling optimization are due to violations of the C standard or similar issues that just happen not to trigger at -O0. Perhaps I'm being needlessly harsh here, but I'd prefer that such bugs be fixed properly rather than papered over like this.

-- Use standard statvfs(2) syscall if available, in preference to non-standard statfs.

This is not actually such a good idea. Prior to Linux kernel 2.6.36 and glibc 2.13, the implementation of statvfs required checking all entries in /proc/mounts. If any of those other filesystems is not available (e.g. a hung NFS mount), the statvfs call will thus hang. See e.g. http://man7.org/linux/man-pages/man2/statvfs.2.html Not directly related to this change, there is also a bit of silliness in the statfs() code for get_tmp_disk(), namely that it assumes that the fs record size is the same as the memory page size. As of Linux 2.6, struct statfs contains a field f_frsize which holds the correct record size. I suggest the attached patch, which should fix both of these issues. -- Janne Blomqvist, D.Sc. 
(Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi

diff --git a/src/slurmd/slurmd/get_mach_stat.c b/src/slurmd/slurmd/get_mach_stat.c
index d7c5eb1..6493755 100644
--- a/src/slurmd/slurmd/get_mach_stat.c
+++ b/src/slurmd/slurmd/get_mach_stat.c
@@ -217,7 +217,28 @@
 get_tmp_disk(uint32_t *tmp_disk, char *tmp_fs)
 {
 	int error_code = 0;
-#if defined(HAVE_STATVFS)
+#if defined(__linux__)
+	/* Prior to Linux 2.6.36 and glibc 2.13, statvfs() can get
+	 * stuck if ANY mount in the system is hung, so use the
+	 * non-standard statfs() instead.  Furthermore, as of Linux
+	 * 2.6+ struct statfs contains the f_frsize field which gives
+	 * the size of the blocks reported in the f_blocks field. */
+	struct statfs stat_buf;
+	unsigned long total_size;
+	char *tmp_fs_name = tmp_fs;
+
+	if (tmp_fs_name == NULL)
+		tmp_fs_name = "/tmp";
+	if (statfs(tmp_fs_name, &stat_buf) == 0) {
+		total_size = stat_buf.f_blocks / 1024;
+		total_size *= stat_buf.f_frsize;
+		total_size /= 1024;
+	} else if (errno != ENOENT) {
+		error_code = errno;
+		error ("get_tmp_disk: error %d executing statfs on %s",
+		       errno, tmp_fs_name);
+	}
+	*tmp_disk = (uint32_t)total_size;
+#elif defined(HAVE_STATVFS)
 	struct statvfs stat_buf;
 	uint64_t total_size = 0;
 	char *tmp_fs_name = tmp_fs;
[slurm-dev] Re: Slurm versions 14.11.6 is now available
On 24 April 2015 12:41:38 EEST, Janne Blomqvist janne.blomqv...@aalto.fi wrote:
> [...]
> I suggest the attached patch which should fix both of these issues.

Hi, come to think of it, in my patch the type of total_size should be unsigned long long to avoid potential overflows on 32-bit Linux targets. 
Cheers, -- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
[slurm-dev] Re: submitting 100k job array causes slurmctld to socket timeout
On 2015-03-15 15:46, Daniel Letai wrote: Hi, Testing a new slurm cluster (14.11.4) on a 1k-node cluster. Several things we've tried:

Increase slurmctld threads (8 ports range)
Increase munge threads (threads=10)
Increase MessageTimeout to 30

We are using accounting (db on a different server). Thanks for any help

Take a look at http://slurm.schedmd.com/high_throughput.html For us, setting somaxconn to 4096 fixed the socket timeout issues (sysctl net.core.somaxconn=4096). Check with "netstat -s | grep LISTEN" for listen queue overflows: does the number increase, and if it does, does bumping somaxconn fix it? Put a line like "net.core.somaxconn = 4096" in /etc/sysctl.conf if you want the setting to survive a reboot. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS & NBE +358503841576 || janne.blomqv...@aalto.fi
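For reference, the persistent form of that setting is a one-line sysctl fragment; 4096 is simply the value reported to work in this thread, not a universal recommendation:

```
# /etc/sysctl.conf (or a drop-in under /etc/sysctl.d/)
# Raise the socket accept backlog limit; apply immediately with `sysctl -p`.
net.core.somaxconn = 4096
```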
[slurm-dev] acct_gather_energy/ipmi configuration
Hi, has anyone got the acct_gather_energy/ipmi plugin to work correctly? In acct_gather.conf I have the lines

EnergyIPMIFrequency=30
EnergyIPMICalcAdjustment=yes

and in slurm.conf

DebugFlags=Profile
AcctGatherNodeFreq=30
AcctGatherEnergyType=acct_gather_energy/ipmi

However, the end result is that when starting slurmd, a line like

[2014-08-28T10:44:52.179] Power sensor not found.

appears in the slurmd logs. I suspect that the reason is related to the fact that I cannot retrieve the power readings with the ipmi-sensors command. With "ipmi-sensors -W discretereading" I can get a reading for the power supplies, but it seems to be the nameplate capacity rather than the current consumption. The same goes for using ipmitool and ipmiutil rather than ipmi-sensors. However, using "ipmi-dcmi --get-system-power-statistics" (part of freeipmi) does appear to work. So my question, I guess, is: is there some way to configure the acct_gather_energy/ipmi plugin to retrieve these DCMI power values instead of whatever it tries to do now? I looked briefly into the source code and there is a big bunch of undocumented EnergyIPMI configuration parameters, but I didn't figure out whether any of those could be used for DCMI. (The hardware in question is various HP ProLiant servers between 1 and 4 years old.) -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS BECS +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: acct_gather_energy/ipmi configuration
Hi, thanks for confirming that my configuration is correct. Indeed, "ipmi-sensors --non-abbreviated-units | grep Watts" doesn't show anything, so I guess I'm out of luck until DCMI support is added to the plugin.

On 2014-09-03 15:23, Thomas Cadeau wrote:
> Hi Janne,
> The configuration is correct. Please try the command:
> $ ipmi-sensors --non-abbreviated-units | grep Watts
> If you have nothing, then the IPMI plugin cannot be used on this hardware. We plan to add an option to use the DCMI power reading to support more hardware. The undocumented EnergyIPMI options are wrappers for ipmi-sensors options. I have never had reason to use them, except timeout and reflush when I used a BMC in an unstable dev state, but in that case I had trouble with ipmi-sensors too.
> Thomas
> [...]

-- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS BECS +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: Feedback on integration tests systemd/slurm and questions
On 2014-08-28 19:17, Rémi Palancher wrote: I would be glad to have your insightful lights on this matter :) I would also appreciate to get feedback from other people who have done other tests with slurm and systemd! Haven't tested anything yet, but with RHEL/CentOS 7 already available, I suspect it won't be long before people start rolling out clusters based on those OSes. So the topic certainly deserves some attention, thanks for bringing it up! The funny thing about all of this is that it will become totally irrelevant with the upcoming releases of the Linux kernel (3.16+) and the ongoing effort on the cgroup unified hierarchy[3][4]! So if modifications should be done in Slurm on cgroup management, it would be wise to take this into account. [3] http://lwn.net/Articles/601840/ [4] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/cgroups/unified-hierarchy.txt Seems that in the brave new unified-hierarchy cgroup world, cgroups must be controlled by communicating with the cgroup controller process (which would be systemd on systemd-using systems, i.e. most of them), rather than by manipulating the cgroup fs directly. Systemd provides a D-Bus API for this, see http://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/ That implies quite a lot of changes in the slurm cgroups support.. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS BECS +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: Slurm versions 14.03.5 and 14.11.0-pre2 are now available
On Fri 11 Jul 2014 12:24:48 AM EEST, je...@schedmd.com wrote: Slurm versions 14.03.5 and 14.11.0-pre2 are now available. Version 14.03.5 includes about 40 relatively minor bug fixes and enhancements as described below. Highlights of changes in Slurm version 14.03.5 include: -- Added extra index's into the database for better performance when deleting users.

It would have been nice if this change had been mentioned in big blinking colorful letters or something like that. As it is, we routinely updated to 14.03.5 from 14.03.4 without any maintenance break or such; after all, it's just a bugfix release, what could go wrong? Well, what did go wrong was that slurmdbd was offline for 40 minutes while it added those extra indexes. :( Other than that, 14.03.5 seems to be running fine here. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS BECS +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: Shared TmpFS
On 2014-04-23T17:26:36 EEST, Alfonso Pardo wrote: Hi, I get some errors from prematurely terminated jobs, with this message:

slurmd[bd-p14-01]: error: unlink(/tmp/slurmd/job60560/slurm_script): No such file or directory
slurmd[bd-p14-01]: error: rmdir(/tmp/slurmd/job60560): No such file or directory

Should the "TmpFS" location be a shared file system?

No. Or maybe it's possible, but why? Typically /tmp is considered a machine-local directory. That being said, the error messages you quote have nothing to do with the slurm.conf TmpFS setting; rather, they tell us that your SlurmdSpoolDir is set to /tmp/slurmd. That is likely a bad idea, as there might be various /tmp cleaner scripts such as tmpwatch emptying /tmp regularly, leading to errors like you see (been there, done that). Just leave it at the default value unless you have good reasons to do otherwise. Note that it requires some trickery to move the contents of the SlurmdSpoolDir on the fly without losing track of running jobs.

We don't have the TmpDisk parameter established (default value). How much space is reasonable for this parameter?

That depends on how large the disks on your nodes are, no? However, the trend seems to be that /tmp is a relatively small space, frequently on a ram disk (tmpfs) rather than backed by a real disk [1]. So you might not want to encourage your users to write code assuming a large /tmp is available. A large machine-local space is probably better placed at /var/tmp or somewhere site-specific such as /local. [1] http://0pointer.de/blog/projects/tmp.html -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS BECS +358503841576 || janne.blomqv...@aalto.fi
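As a sketch, the advice above translates into slurm.conf lines roughly like the following (the /local path and the TmpDisk size are illustrative assumptions, not taken from the poster's setup):

```
SlurmdSpoolDir=/var/spool/slurmd   # the default; keep it out of /tmp and away from tmp cleaners
TmpFS=/local                       # point TmpFS at the large machine-local space, if any
TmpDisk=102400                     # MB of usable space there, sized per node
```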
[slurm-dev] Re: xmalloc in Slurm
xmalloc is trying to handle OOM conditions gracefully; using calloc would eliminate the memset call. Not only *could* calloc be used for efficiency, it *should* be used so that xmalloc functions as intended (and more gracefully handles a memory allocation error). Now, all the way back to why this originally came up: I think a more appropriate change is to fix xmalloc to use calloc, which should on its own give a speedup in cases like the one Cao was seeing, without changing its semantics. If necessary, another malloc wrapper *could* be added that just uses malloc, without zeroing the memory. However, the only benefit over the calloc-based wrapper would be when a system is under sufficient load that the kernel has not had time to collect some all-zero pages and calloc incurs a slight delay. I suspect the system load is really going to be a bigger slowdown than the small delay in calloc, so there is no benefit to a non-zeroing malloc wrapper. A proper fix would IMHO be to (in addition to the int/size_t issues discussed above):

1) Create a calloc wrapper (xcalloc), and update all current users of xmalloc to use that instead.
2) Remove the memset from xmalloc, so that xmalloc properly is a malloc wrapper.
3) Audit users of xcalloc, and change back to xmalloc where zeroing isn't needed.

-- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS BECS +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: SLURM on FreeBSD
On 2013-11-25 18:38, Jason Bacon wrote: A few patches are attached. I'm withholding additional patches to task_cgroup_cpuset.c pending further testing with FreeBSD's hwloc. Not that it's my decision to make, but IMHO since cgroups are (extremely) Linux-specific, rather than making an #ifdef mess out of the cgroup plugin, it would be better to create a separate task/hwloc plugin. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS BECS +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: Slurmctld dies after restart: Address already in use
On 2013-09-24 03:12, Moe Jette wrote: The SlurmctldProlog/Epilog don't need these open files. I've copied logic from slurmd to close most files when the daemon starts and set the close_on_exec flag for files the daemon opens: https://github.com/SchedMD/slurm/commit/29094e33fcbb4f29e9512059bbdd18ba3504134c That fixes several of the problems. I'm not sure why the job_state.new file is reported by lsof, but will probably investigate further at a later time.

Recent Unixes support an O_CLOEXEC flag to open(), which avoids the potential race condition between opening the file and setting close-on-exec with fcntl() (and saves one syscall as well). The attached patch does this for a few cases. There are still many more places where this approach could be used; for sockets there is also the Linux-specific SOCK_CLOEXEC flag.

-- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS BECS +358503841576 || janne.blomqv...@aalto.fi

diff --git a/src/common/daemonize.c b/src/common/daemonize.c
index 62545ed..4cc8e7a 100644
--- a/src/common/daemonize.c
+++ b/src/common/daemonize.c
@@ -159,17 +159,19 @@
 int create_pidfile(const char *pidfile, uid_t uid)
 {
 	FILE *fp;
-	int fd = -1;
+	int fd;
 
 	xassert(pidfile != NULL);
 	xassert(pidfile[0] == '/');
 
-	if (!(fp = fopen(pidfile, "w"))) {
+	fd = creat_cloexec(pidfile, S_IRUSR | S_IWUSR | S_IRGRP | S_IWGRP
+			   | S_IROTH | S_IWOTH);
+	if (fd < 0) {
 		error("Unable to open pidfile `%s': %m", pidfile);
 		return -1;
 	}
 
-	fd = fileno(fp);
+	fp = fdopen(fd, "w");
 
 	if (fd_get_write_lock(fd) < 0) {
 		error("Unable to lock pidfile `%s': %m", pidfile);
diff --git a/src/common/fd.c b/src/common/fd.c
index 0bcf353..9c4a413 100644
--- a/src/common/fd.c
+++ b/src/common/fd.c
@@ -82,6 +82,33 @@
 void fd_set_noclose_on_exec(int fd)
 	...
 	return;
 }
 
+
+int open_cloexec(const char *pathname, int flags)
+{
+#ifdef O_CLOEXEC
+	return open(pathname, flags | O_CLOEXEC);
+#else
+	int fd = open(pathname, flags);
+	if (fd >= 0)
+		fd_set_close_on_exec(fd);
+	return fd;
+#endif
+}
+
+
+int creat_cloexec(const char *pathname, mode_t mode)
+{
+#ifdef O_CLOEXEC
+	return open(pathname, O_CREAT|O_WRONLY|O_TRUNC|O_CLOEXEC, mode);
+#else
+	int fd = creat(pathname, mode);
+	if (fd >= 0)
+		fd_set_close_on_exec(fd);
+	return fd;
+#endif
+}
+
+
 int fd_is_blocking(int fd)
 {
 	int val = 0;
diff --git a/src/common/fd.h b/src/common/fd.h
index 704c0e3..2a1dcc2 100644
--- a/src/common/fd.h
+++ b/src/common/fd.h
@@ -58,6 +58,15 @@
 static inline void closeall(int fd)
 	...
 		close(fd++);
 }
 
+/* Open a fd with close-on-exec (POSIX 2008, Linux 2.6.23+), emulating
+ * it on systems that lack it. */
+int open_cloexec(const char *pathname, int flags);
+
+/* Create a fd with close-on-exec (POSIX 2008, Linux 2.6.23+),
+ * emulating it on systems that lack it. */
+int creat_cloexec(const char *pathname, mode_t mode);
+
+
 void fd_set_close_on_exec(int fd);
 /*
  * Sets the file descriptor (fd) to be closed on exec().
diff --git a/src/slurmctld/controller.c b/src/slurmctld/controller.c
index 1040f91..b098d93 100644
--- a/src/slurmctld/controller.c
+++ b/src/slurmctld/controller.c
@@ -2089,8 +2089,6 @@
 	 * fd open to maintain the write lock */
 	pid_fd = create_pidfile(slurmctld_conf.slurmctld_pidfile,
 				slurmctld_conf.slurm_user_id);
-	if (pid_fd >= 0)
-		fd_set_close_on_exec(pid_fd);
 }
 
 /*
diff --git a/src/slurmd/slurmd/slurmd.c b/src/slurmd/slurmd/slurmd.c
index 997a2d9..f08fcfc 100644
--- a/src/slurmd/slurmd/slurmd.c
+++ b/src/slurmd/slurmd/slurmd.c
@@ -314,8 +314,6 @@
 main (int argc, char *argv[])
 	   so we keep the write lock of the pidfile.
 	 */
 	pidfd = create_pidfile(conf->pidfile, 0);
-	if (pidfd >= 0)
-		fd_set_close_on_exec(pidfd);
 
 	rfc2822_timestamp(time_stamp, sizeof(time_stamp));
 	info("%s started on %s", slurm_prog_name, time_stamp);
@@ -1506,11 +1504,10 @@
 _slurmd_init(void)
 	init_gids_cache(0);
 	slurm_conf_unlock();
 
-	if ((devnull = open("/dev/null", O_RDWR)) < 0) {
+	if ((devnull = open_cloexec("/dev/null", O_RDWR)) < 0) {
 		error("Unable to open /dev/null: %m");
 		return SLURM_FAILURE;
 	}
-	fd_set_close_on_exec(devnull);
 
 	/* make sure we have slurmstepd installed */
 	if (stat(conf->stepd_loc, &stat_buf))
[slurm-dev] Re: AllowGroups broken in Slurm 2.6.0?
On 2013-08-22T18:08:23 EEST, Moe Jette wrote: getgrnam_r() does not return any users in the group "jette" below; it does work for the other groups (e.g. admin): $ uname -a Linux jette 3.5.0-37-generic #58-Ubuntu SMP Mon Jul 8 22:07:55 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux $ id uid=1001(jette) gid=1001(jette) groups=1001(jette),4(adm),20(dialout),21(fax),24(cdrom),25(floppy),26(tape),29(audio),30(dip),44(video),46(plugdev),104(fuse),110(netdev),112(lpadmin),120(admin),122(sambashare) $ grep jette /etc/group admin:x:120:jette,da jette:x:1001: ... Yeah, primary groups are often special and ought to be handled properly rather than discarded as broken configuration, my bad. I suppose if one wants to ask whether some user belongs to some group without enumeration, one would need to first check the user's primary group (getpw{nam,uid}_r()), and then, if that doesn't match, use getgrnam_r(). But that would need some changes in the surrounding code as well.. Cheers, Quoting Janne Blomqvist janne.blomqv...@aalto.fi: On 2013-08-22T00:22:07 EEST, Moe Jette wrote: This does not work on AIX, Darwin, Cygwin or even some Linux configurations. If it works for you that is great, but we probably need to stay with the belts and suspenders approach. Although I haven't got access to all those systems to test on, I'm not sure I agree with your assertion. All the ifdef dancing in the code is about enumerating the user and group databases (variants of setpwent/getpwent/endpwent and setgrent/getgrent/endgrent), but since the point of the patch was to get rid of the enumeration and instead use the group member list returned by getgrnam_r(), I removed the ifdef stuff as no longer needed. Thus, AFAICS, my patched version fails only if getgrnam_r() doesn't return the (correct) member list, but such a serious bug in the C library sounds quite unlikely (except for the cases of broken user/group databases I mentioned previously).
Quoting Janne Blomqvist janne.blomqv...@aalto.fi: On 2013-08-20 03:23, John Thiltges wrote: These are two AllowGroups snags we've run into: Unlike nscd, sssd doesn't allow enumeration by default (and is case-sensitive). We add this to /etc/sssd/sssd.conf on the slurmctld node: enumerate = True case_sensitive = False Looking at src/slurmctld/groups.c:get_group_members it seems to be a case of using both belts and suspenders in order to work around broken configurations (multiple groups with the same GID, or user primary groups not being listed in the group data entry), resulting in something that turns out to not work in case enumeration is disabled. As many systems such as sssd and winbind disable enumeration by default (for performance, and maybe security-by-obscurity), IMHO it would be better to avoid relying on such a feature. getgrnam_r() already returns all the members of the group, there's no need to iterate over both the entire user and group databases to see if entries match. The attached patch simplifies the code to just use the group member list returned by the getgrnam_r() call, without enumerating all users or groups. If you have large groups, you might run into the buffer size limit, which is 65k characters. (PW_BUF_SIZE in uid.h) The patch also fixes this, by calling getgrnam_r() in a loop, increasing the buffer size if it was too small. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS BECS +358503841576 || janne.blomqv...@aalto.fi -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS BECS +358503841576 || janne.blomqv...@aalto.fi -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS BECS +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: AllowGroups broken in Slurm 2.6.0?
On 2013-08-20 03:23, John Thiltges wrote: These are two AllowGroups snags we've run into: Unlike nscd, sssd doesn't allow enumeration by default (and is case-sensitive). We add this to /etc/sssd/sssd.conf on the slurmctld node: enumerate = True case_sensitive = False Looking at src/slurmctld/groups.c:get_group_members it seems to be a case of using both belts and suspenders in order to work around broken configurations (multiple groups with the same GID, or user primary groups not being listed in the group data entry), resulting in something that turns out to not work in case enumeration is disabled. As many systems such as sssd and winbind disable enumeration by default (for performance, and maybe security-by-obscurity), IMHO it would be better to avoid relying on such a feature. getgrnam_r() already returns all the members of the group, there's no need to iterate over both the entire user and group databases to see if entries match. The attached patch simplifies the code to just use the group member list returned by the getgrnam_r() call, without enumerating all users or groups. If you have large groups, you might run into the buffer size limit, which is 65k characters. (PW_BUF_SIZE in uid.h) The patch also fixes this, by calling getgrnam_r() in a loop, increasing the buffer size if it was too small. -- Janne Blomqvist, D.Sc. 
(Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS BECS +358503841576 || janne.blomqv...@aalto.fi

diff --git a/src/slurmctld/groups.c b/src/slurmctld/groups.c
index 7728484..fd7d758 100644
--- a/src/slurmctld/groups.c
+++ b/src/slurmctld/groups.c
@@ -83,21 +83,13 @@
  * NOTE: User root has implicitly access to every group
  * NOTE: The caller must xfree non-NULL return values
  */
-extern uid_t *get_group_members(char *group_name)
+extern uid_t *get_group_members(const char *group_name)
 {
-	char grp_buffer[PW_BUF_SIZE];
+	char *grp_buffer;
+	long buflen;
 	struct group grp, *grp_result = NULL;
-	struct passwd *pwd_result = NULL;
-	uid_t *group_uids = NULL, my_uid;
-	gid_t my_gid;
-	int i, j, uid_cnt;
-#ifdef HAVE_AIX
-	FILE *fp = NULL;
-#elif defined (__APPLE__) || defined (__CYGWIN__)
-#else
-	char pw_buffer[PW_BUF_SIZE];
-	struct passwd pw;
-#endif
+	uid_t *group_uids = NULL;
+	size_t i, j, uid_cnt;
 
 	group_uids = _get_group_cache(group_name);
 	if (group_uids) {	/* We found in cache */
@@ -105,84 +97,46 @@
 		return group_uids;
 	}
 
-	/* We need to check for !grp_result, since it appears some
-	 * versions of this function do not return an error on failure.
-	 */
-	if (getgrnam_r(group_name, &grp, grp_buffer, PW_BUF_SIZE,
-		       &grp_result) || (grp_result == NULL)) {
+	buflen = sysconf(_SC_GETGR_R_SIZE_MAX);
+	if (buflen <= 0)
+		buflen = 1024;
+	grp_buffer = xmalloc(buflen);
+
+	/* Call getgrnam_r in a loop, increasing the buffer size if it
+	 * turned out it was too small. */
+	while (1) {
+		int res = getgrnam_r(group_name, &grp, grp_buffer, buflen,
+				     &grp_result);
+		if (res) {
+			switch (errno) {
+			case ERANGE:
+				buflen *= 2;
+				xrealloc(grp_buffer, buflen);
+				break;
+			default:
+				error("getgrnam_r failed");
+				xfree(grp_buffer);
+				return NULL;
+			}
+		} else
+			break;
+	}
+	if (grp_result == NULL) {
 		error("Could not find configured group %s", group_name);
+		xfree(grp_buffer);
 		return NULL;
 	}
-	my_gid = grp_result->gr_gid;
-
-	j = 0;
 	uid_cnt = 0;
-#ifdef HAVE_AIX
-	setgrent_r(&fp);
-	while (!getgrent_r(&grp, grp_buffer, PW_BUF_SIZE, &fp)) {
-		grp_result = &grp;
-#elif defined (__APPLE__) || defined (__CYGWIN__)
-	setgrent();
-	while ((grp_result = getgrent()) != NULL) {
-#else
-	setgrent();
-	while (getgrent_r(&grp, grp_buffer, PW_BUF_SIZE,
-			  &grp_result) == 0 && grp_result != NULL) {
-#endif
-		if (grp_result->gr_gid == my_gid) {
-			if (strcmp(grp_result->gr_name, group_name)) {
-				debug("including members of group '%s' as it "
-				      "corresponds to the same gid as group "
-				      "'%s'", grp_result->gr_name, group_name);
-			}
-
-			for (i=0; grp_result->gr_mem[i]; i++) {
-				if (uid_from_string(grp_result->gr_mem[i],
-						    &my_uid) < 0) {
-					/* Group member without valid login */
-					continue;
-				}
-				if (my_uid == 0)
-					continue;
-				if (j+1 >= uid_cnt) {
-					uid_cnt += 100;
-					xrealloc(group_uids,
-						 (sizeof(uid_t) * uid_cnt));
-				}
-				group_uids[j++] = my_uid;
-			}
-		}
-	}
-#ifdef HAVE_AIX
-	endgrent_r(&fp);
-	setpwent_r(&fp);
-	while (!getpwent_r(&pw, pw_buffer, PW_BUF_SIZE, &fp)) {
-		pwd_result = &pw;
-#else
-	endgrent();
-	setpwent();
-#if defined (__sun)
-	while ((pwd_result = getpwent_r(&pw, pw_buffer, PW_BUF_SIZE)) != NULL) {
-#elif defined (__APPLE__) || defined (__CYGWIN__)
-	while ((pwd_result = getpwent()) != NULL) {
-#else
-	while (!getpwent_r(&pw, pw_buffer, PW_BUF_SIZE, &pwd_result)) {
-#endif
-#endif
[slurm-dev] Re: slurm-dev Memory accounting issues with mpirun (was Re: Open-MPI build of NAMD launched from srun over 20% slower than with mpirun)
On 2013-08-07 09:19, Christopher Samuel wrote: On 23/07/13 17:06, Christopher Samuel wrote: Bringing up a new IBM SandyBridge cluster I'm running a NAMD test case and noticed that if I run it with srun rather than mpirun it goes over 20% slower. Following on from this issue, we've found that whilst mpirun gives acceptable performance the memory accounting doesn't appear to be correct. Anyone seen anything similar, or any ideas on what could be going on? See my message from yesterday https://groups.google.com/d/msg/slurm-devel/BlZ2-NwwCCg/03DnMEWYHqUJ for what I think is the reason. That is, the memory accounting is per task, and when launching using mpirun the number of tasks does not correspond to the number of MPI processes, but rather to the number of orted processes (1 per node). Here are two identical NAMD jobs running over 69 nodes with 16 cores per node, this one launched with mpirun (Open-MPI 1.6.5):

== slurm-94491.out ==
WallClock: 101.176193  CPUTime: 101.176193  Memory: 1268.554688 MB
End of program
[samuel@barcoo-test Mem]$ sacct -j 94491 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
       94491
 94491.batch   6504068K  11167820K
     94491.0   5952048K   9028060K

This one launched with srun (about 60% slower):

== slurm-94505.out ==
WallClock: 163.314163  CPUTime: 163.314163  Memory: 1253.511719 MB
End of program
[samuel@barcoo-test Mem]$ sacct -j 94505 -o JobID,MaxRSS,MaxVMSize
       JobID     MaxRSS  MaxVMSize
------------ ---------- ----------
       94505
 94505.batch      7248K   1582692K
     94505.0   1022744K   1307112K

cheers!
Chris
-- Christopher Samuel, Senior Systems Administrator VLSCI - Victorian Life Sciences Computation Initiative Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545 http://www.vlsci.org.au/ http://twitter.com/vlsci
-- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS BECS +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: RLIMIT_DATA effectively a no-op on Linux
On 2013-07-20T15:23:41 EEST, Chris Samuel wrote: Hi there, On Sat, 20 Jul 2013 02:53:52 AM Bjørn-Helge Mevik wrote: With the recent changes in glibc in how virtual memory is allocated for threaded applications, limiting virtual memory usage for threaded applications is IMO not a good idea. (One example: our slurmctld has allocated 16.1 GiB virtual memory, but is only using 104 MiB resident.) Would you have a pointer to these changes please? From a recent message by yours truly to a slurm-dev thread about slurmctld memory consumption: Yes, this is what we're seeing as well. 6.5 GiB VMEM, 376 MB RSS. The change was that as of glibc 2.10 a more scalable malloc() implementation is used. The new implementation creates up to 8 (2 on 32-bit) pools per core, each 64 MB in size. Thus in our case, where slurmctld runs on a machine with 12 cores, we have up to 12*8*64=6144 MB in those malloc pools. See http://udrepper.livejournal.com/20948.html I would go even further than Bjørn-Helge and claim that limiting virtual memory is, in general, the wrong thing to do. Address space is essentially free and doesn't impact other applications, so IMHO the workload manager has no business limiting it. The glibc malloc() behavior is just one situation where trying to limit virtual memory goes wrong; there are other situations where allocating lots of virtual memory is common. E.g. garbage-collected runtimes such as Java often allocate huge heaps to use as the garbage collection arena, but only a small fraction of that is actually used. I would suggest looking at cgroups for limiting memory usage. Unfortunately cgroups doesn't limit usage (i.e. cause malloc() to fail should it have reached its limit); if I understand it correctly it just invokes the OOM killer on a candidate process within the cgroup once the limit is reached. :-( Yes, that's my understanding as well. On the positive side, few applications can sensibly handle malloc() failures anyway.
Often the best that can be done without heroic effort is to just print an error message to stderr and abort(), which is not terribly different from being killed by the OOM killer anyway.. There are a few efforts in the Linux kernel community to do something about this; they go in a couple of slightly different directions: - Provide some notification to applications that they are exceeding their memory limit: release some memory quickly or face the wrath of the OOM killer. See https://lwn.net/Articles/552789/ https://lwn.net/Articles/548180/ - Provide a mechanism for applications to mark memory ranges as volatile, where the kernel can drop them if memory gets tight instead of going on an OOM killer spree. See https://lwn.net/Articles/522135/ https://lwn.net/Articles/554098/ That being said, AFAIK nothing of the above exists in the upstream kernel today. So for now IMHO the least bad approach is to just limit RSS as slurm already does (either with cgroups or by polling), and kill jobs if the limit is exceeded. -- Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist Aalto University School of Science, PHYS BECS +358503841576 || janne.blomqv...@aalto.fi
[slurm-dev] Re: slurmctld consuming tons of memory
On 2013-06-26 13:46, Bjørn-Helge Mevik wrote: Hongjia Cao hj...@nudt.edu.cn writes: I have encountered that slurmctld uses more than 20GB of virtual memory. But the RSS is less than 1GB. I am not sure whether this is OK or there is some leakage. On Linux boxes with newer versions of glibc, slurmctld (as well as any other process that uses a lot of threads) will use a lot of VMEM. There was a change in glibc 2.something (I think it was) in how VMEM is allocated for threads. For instance, our slurmctld right now uses 16 GiB VMEM, but only 117 MiB RSS. Yes, this is what we're seeing as well. 6.5 GiB VMEM, 376 MB RSS. The change was that as of glibc 2.10 a more scalable malloc() implementation is used. The new implementation creates up to 8 (2 on 32-bit) pools per core, each 64 MB in size. Thus in our case, where slurmctld runs on a machine with 12 cores, we have up to 12*8*64=6144 MB in those malloc pools. See http://udrepper.livejournal.com/20948.html -- Janne Blomqvist
[slurm-dev] Re: slurm 2.5 upgrade experiences
On 2012-12-28T22:21:04 EET, Moe Jette wrote: Sorry for the lost jobs. There were two new RPCs that needed to be added to a table, but were not. This prevented slurmd daemons running version 2.4 from communicating with the slurmctld daemon running version 2.5. The fix is at the location shown below and will be in version 2.5.1: https://github.com/SchedMD/slurm/commit/844f70a2c233e57f55948494bbd9c377163813fb I have already changed the code for version 2.6 to only list the RPCs that cannot work between releases (job step creation and task launch). That is a much smaller list and will rarely change, which should prevent this problem in future releases. Hi, thanks for fixing this! I wonder, would it be possible to make the job killing logic a bit less aggressive? In both cases where we have lost jobs during a slurm upgrade, it has been due to communication failure between slurmctld and the slurmd's causing jobs to be killed after the timeout, even though in both situations the jobs were actually running just fine. Currently it seems that once slurmctld-slurmd communication fails, a timer starts, and after SlurmdTimeout seconds jobs running on the node are killed. I understand that this kind of timeout-based killing logic is necessary to handle the case where one node in a multi-node job dies and the job itself hangs rather than dies. Otherwise the hung job would occupy the healthy nodes until the job was killed by the job time limit. I'm thinking that one could do something different in the case when all the nodes for a job disappear. E.g. for serial or small parallel jobs that fit within a single node, or somebody tripping over the network cable connecting the slurmctld node to the rest of the cluster, or some supposedly quick maintenance task taking longer than expected, etc. In such a case, couldn't slurmctld hold off killing the job until 1) a node returns to service and slurmd (or is it slurmstepd?)
reports that the job is no longer present, in which case the job can be immediately killed 2) a node returns to service and slurmd/slurmstepd reports that the job is still there, start the timer and wait until SlurmdTimeout for the other nodes to return, and if they don't, kill the job. Moe Quoting Janne Blomqvist janne.blomqv...@aalto.fi: Hi, just a few quick notes about our experience upgrading from 2.4.x (2.4.3 IIRC) to 2.5.0 yesterday. First, the bad news. Some, but not all, jobs were lost. Luckily the cluster was mostly idle after the Christmas holidays, but still. First, slurmdbd was upgraded, that went fine although it took a few minutes to update the DB tables. Next the slurmctld master, followed by the slurmctld backup, and then the compute nodes. AFAICT the reason for the job loss was that after the master slurmctld was updated it was unable to communicate with the compute nodes, and after the timeout expired it killed the jobs. After starting slurmctld 2.5.0 in the log file we had lots of messages like [2012-12-27T10:22:19+02:00] error: Invalid Protocol Version 6144 from uid=0 at 10.10.253.52:6818 [2012-12-27T10:22:19+02:00] error: slurm_receive_msgs: Protocol version has changed, re-link your code followed by [2012-12-27T10:27:16+02:00] error: Nodes cn[01-18,20-64,68-224],fn[01-02],gpu[001-008],tb[003,005-008] not responding and then finally lots of messages like [2012-12-27T10:28:59+02:00] Killing job_id 2722550 on failed node cn29 I suspect that the jobs that were not lost happened to run on nodes where the slurmd upgrade finished before the timeout. Previously we have successfully done on-the-fly upgrades between slurm major versions, and also the release notes said it should work, so it was a bit surprising that it would fail now. Oh well. 
On a more positive note, we took the new multifactor2 plugin into use, and so far it seems to work as designed, although due to the holidays there's not much action in the queue so it's still too early for further conclusions. -- Janne Blomqvist -- Janne Blomqvist
[slurm-dev] RE: Making hierarchical fair share hierarchical
On 2012-08-24 03:05, Lipari, Don wrote: Janne, As one of those who designed the fair-share component of the multi-factor priority plugin, I'm open to your suggestions below. I would recommend you create it as a separate multi-factor plugin so that we could have the opportunity to switch back and forth and examine the differences in behavior. If it winds up delivering more fairness across all cases, then I'm sure we would abandon the current implementation someday. A separate plugin sounds like a good approach, yes. I'll look into that. For the time being, in case anyone is interested, the core algorithm change can be tested with the attached patch (so far only compile-tested..). I'm not sure how best to handle the sshare utility with regard to the fields it displays. It might make sense to create a new sshare that displays the pertinent components of the new algorithm. Would you mind citing the reference(s) that motivated the formula you presented below? There are a lot of papers on fair share scheduling, but most of them are related to scheduling network packets, and while one can get some ideas from them, the terminology is often somewhat different. One paper on fair share CPU scheduling which is reasonably easy to understand is J. Kay and P. Lauder [1988], `A Fair Share Scheduler', Communications of the ACM, 31(1):44-55 (1988), http://sydney.edu.au/engineering/it/~judy/Research_fair/89_share.pdf and an addendum http://sydney.edu.au/engineering/it/~piers/papers/Share/share2.html The particular algorithm was (re-)invented by yours truly; later I found something quite close to it in www.cs.umu.se/~elmroth/papers/fsgrid.pdf MOAB also uses a similar algorithm, except that the tree hierarchy is fixed, and the scale factor for each level of the tree is configurable.
www.eecs.harvard.edu/~chaki/bib/papers/jackson01maui.pdf While I understand the shortcomings in the scenario of the A, B, and C accounts you describe, coming up with an elegant design that supports a multi-level hierarchy is not a trivial problem, particularly when one of the requirements is to create something that users will understand intuitively. Indeed! FWIW, one way in which the proposed algorithm can fail is if several accounts have used much more than their fair share; then users in those accounts with little usage can clobber the parent FS contribution. Might not be a huge problem in practice, though. An algorithm which I think is quite robust, and does not suffer from the problems with deep trees in the algorithm I suggested, is the one used by SGE (or whatever it's called nowadays). There you start with a number of tickets at the root of the tree, then distribute those tickets to the child nodes in proportion to their fair share. At the end of this phase the tickets will have been distributed to the queued jobs. Then you give the job with the highest number of tickets the priority 1.0, and the other jobs priorities proportional to their ticket counts relative to that first job. This gives a fairly robust algorithm which assigns priorities hierarchically like we want, but at the cost of adding state to each queued job (the number of tickets it has). Google finds a number of papers on things like lottery scheduling or stride scheduling, which are slightly relevant to the ticket algorithm. E.g. www.waldspurger.org/carl/papers/phd-mit-tr667.pdf I'll try to find time to see whether such an algorithm can be incorporated into slurm, and if it would be better than the one I originally suggested..
Don -Original Message- From: Janne Blomqvist [mailto:janne.blomqv...@aalto.fi] Sent: Thursday, August 23, 2012 12:05 AM To: slurm-dev Subject: [slurm-dev] Making hierarchical fair share hierarchical Hello, we are seeing an issue with the hierarchical fair share algorithm which we'd like to get fixed. We think our suggested improvements would be of benefit to others as well, but we'd like some inputs on how, or indeed whether, to proceed. For some background, we have a cluster shared between several departments, and we have configured the fair-share such that each department account has a share corresponding to its financial contribution (so far it's simple, 3 departments with equal shares each). Thus it's politically important for us that the fair-share priority reflects the usage factor of each department as well as between users belonging to an account. And yes, we're certainly aware that fair share cannot guarantee that departments get their fair share, but it should at least push in that direction. Thus, we'd like the fair-share factor of a user reflect the account hierarchy, such that users belonging to underserved accounts always have a higher fair-share priority than users belonging to an account which has used more than it's fair share. The problem we're seeing is that the current hierarchical fair-share algorithm does