[slurm-dev] Re: Cluster nodes

2017-11-05 Thread Chris Samuel
On Sunday, 5 November 2017 11:09:29 AM AEDT ايمان wrote: > I want to run parallel Java code on more than one node, but it executed > only on one node? Java is not magically able to span nodes; you need to ensure your program can handle that and has the necessary supporting

[slurm-dev] Re: How to strictly limit the memory per CPU

2017-11-02 Thread Chris Samuel
On Thursday, 2 November 2017 8:02:47 PM AEDT Rajiv Nishtala wrote: > And also using cgroups; https://slurm.schedmd.com/cgroup.conf.html That will constrain the memory a job can use to what it has asked for, but I think that the original poster was asking how to stop a user asking for that

[slurm-dev] Re: question within SLURM

2017-10-24 Thread Chris Samuel
On Tuesday, 24 October 2017 7:51:23 PM AEDT Rajiv Nishtala wrote: > I'm trying to play with the part of the code that is responsible for killing > a job if it exceeds a memory limit, for instance via cgroups or so. With cgroups it is the Linux kernel, not Slurm, that is responsible for killing

[slurm-dev] Re: Qos limits associations and AD auth

2017-10-20 Thread Chris Samuel
On Friday, 20 October 2017 9:53:06 AM AEDT Lachlan Musicman wrote: > Latest version of sssd can take shortnames and search through domains. I'm not sure if that works if you've got two different people with the same username in different domains, though. cheers, Chris -- Christopher

[slurm-dev] Re: Qos limits associations and AD auth

2017-10-19 Thread Chris Samuel
On Thursday, 19 October 2017 7:41:37 PM AEDT Nadav Toledo wrote: > running : id -u domain_name\\username , does return its uid So your system is not finding users as just "username", but instead only as domain_name\\username which is probably not ideal. You probably want to see if you can

[slurm-dev] Re: Qos limits associations and AD auth

2017-10-19 Thread Chris Samuel
On Thursday, 19 October 2017 4:43:13 PM AEDT Nadav Toledo wrote: > so adding manually only works if I don't restart slurmctld... That usually points to a communication problem for slurmdbd trying to tell the slurmctld about these changes via an RPC. What does this say? sacctmgr list clusters

[slurm-dev] Re: Tasks distribution

2017-10-09 Thread Chris Samuel
On Monday, 9 October 2017 8:46:21 PM AEDT Sysadmin CAOS wrote: > Mmmm, yes... CentOS only offers PMIX packages and I don't know where I > can find PMI{1,2} packages... How should I compile SLURM? To compile Slurm to support PMIx you need to have this in your configure:

[slurm-dev] Re: Tasks distribution

2017-10-09 Thread Chris Samuel
On Monday, 9 October 2017 8:11:29 PM AEDT Chris Samuel wrote: > Do you mean --with-pmix=${PATH_TO_PMIX} instead? Sorry, I thought you were configuring Slurm with PMIx support there! -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics - The Univers

[slurm-dev] Re: Tasks distribution

2017-10-09 Thread Chris Samuel
On Monday, 9 October 2017 7:11:06 PM AEDT Sysadmin CAOS wrote: > I have compiled OpenMPI 1.8.1 with --with-pmi=/usr/lib64 (where is > located libpmix.so file) Do you mean --with-pmix=${PATH_TO_PMIX} instead? -- Christopher Samuel Senior Systems Administrator Melbourne Bioinformatics

[slurm-dev] Re: How to set %SLURM_JOB_ID as an argument of python in sbatch script?

2017-08-22 Thread Chris Samuel
On Tuesday, 22 August 2017 9:46:17 PM AEST 刘科 wrote: > python ~/opt/script/auto_run_te_by_steps.py {} `%SLURM_JOB_ID` Try this instead: python ~/opt/script/auto_run_te_by_steps.py '{}' ${SLURM_JOB_ID} That uses single quotes to stop the shell from trying to expand whatever that is (I'm not a
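The quoting advice in this reply can be tried outside Slurm with a small sketch. Inside a real job Slurm exports SLURM_JOB_ID; the fallback value 12345 and the helper show_args below are made up for illustration and are not part of the original thread.

```shell
#!/bin/bash
# Inside a real job Slurm exports SLURM_JOB_ID; the fallback 12345 is a
# made-up value so the sketch also runs outside a job.
SLURM_JOB_ID="${SLURM_JOB_ID:-12345}"

# Hypothetical helper: print each argument it receives on its own line.
show_args() {
    printf '[%s]\n' "$@"
}

# Single quotes pass {} through literally; ${SLURM_JOB_ID} is expanded
# by the shell before the command runs.
show_args '{}' "${SLURM_JOB_ID}"
```

The first argument arrives as the two literal characters {} and the second as the job ID, which is what the Python script would then see in its argument list.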

[slurm-dev] Re: CGroups, Threads as CPUs, TaskPlugins

2017-08-15 Thread Chris Samuel
On Tuesday, 15 August 2017 4:34:55 PM AEST John Hearns wrote: > For the /proc/self you need to start an interactive job under Slurm. You can actually use srun to join an existing job via the --jobid option. [samuel@barcoo ~]$ srun --jobid 6821761 --pty -u /bin/bash -i -l [samuel@barcoo033 ~]$

[slurm-dev] Re: CGroups, Threads as CPUs, TaskPlugins

2017-08-15 Thread Chris Samuel
On Tuesday, 15 August 2017 1:16:08 PM AEST Lachlan Musicman wrote: > Oh, that explains more. > > Now it looks like: > [...] OK, so that looks better! So what does this say: cat /sys/fs/cgroup/cpuset/slurm/uid_1506/job_1998/step_batch/cpuset.cpus > I seem to have a lot of guff in there that

[slurm-dev] Re: Compute nodes going to drained/draining state

2017-05-25 Thread Chris Samuel
On Thursday, 25 May 2017 6:51:26 PM AEST Baker D. J. wrote: > Thank you for your response to my email. I've taken a look at one of the > compute nodes that has been drained by the SLURM system -- please see > below. It appears to suggest the node was drained due to a job failing > (running out

[slurm-dev] Re: reporting used memory with job Accounting or Completion plugins?

2017-03-10 Thread Chris Samuel
On Friday, 10 March 2017 2:26:08 PM AEDT Grigory Shamov wrote: > Another newbie question: does SLURM report any used memory (as well as > other resource usage, other than wall time) statistics for jobs, as part > of either Accounting or Completion records ? If you use the slurmdbd accounting

[slurm-dev] Re: error: Node localhost has low socket*core*thread count

2016-11-05 Thread Chris Samuel
On Saturday, 5 November 2016 12:04:03 AM AEDT Peter van Heusden wrote: > Thanks! That was the problem - I misunderstood from my NICD colleagues what > the machine config is and it has only a single CPU: No worries, glad that helped! -- Christopher Samuel Senior Systems Administrator

[slurm-dev] Re: error: Node localhost has low socket*core*thread count

2016-11-05 Thread Chris Samuel
On Friday, 4 November 2016 11:39:30 PM AEDT Peter van Heusden wrote: > Nov 5 02:12:09 bio-linux slurmctld[27239]: error: Node localhost has low > socket*core*thread count (4 < 8) > Nov 5 02:12:09 bio-linux slurmctld[27239]: error: Node localhost has low cpu > count (4 < 8) So Slurm is
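A rough sketch of the comparison behind that log message, reading "(4 < 8)" as "4 detected, 8 configured". The helper names and the 1 x 4 x 1 topology are hypothetical and chosen only to reproduce the numbers in the log. On a real node, `slurmd -C` prints the hardware configuration that slurmd detects, which can be compared with the node line in slurm.conf.

```shell
#!/bin/bash
# Hypothetical helper: product of sockets, cores per socket and threads per core.
cpu_product() {
    echo $(( $1 * $2 * $3 ))
}

# Hypothetical helper: succeed only if the detected count covers the configured one.
check_node() {
    local detected=$1 configured=$2
    if [ "$detected" -lt "$configured" ]; then
        echo "low count ($detected < $configured)"
        return 1
    fi
    echo "ok ($detected >= $configured)"
}

# Illustrative topology only: 1 socket x 4 cores x 1 thread = 4 detected,
# against the 8 from the log message.
detected=$(cpu_product 1 4 1)
check_node "$detected" 8 || true
```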

[slurm-dev] Re: Slurmdbd

2016-09-23 Thread Chris Samuel
On Friday, 23 September 2016 12:17:26 AM AEST Lachlan Musicman wrote: > Is there a description of what each field is in the slurmdbd? I don't think the Slurm developers support direct access to the database like that and fields are liable to change on major releases. Tools like XDMoD get

[slurm-dev] Re: slurm_job_reason_string() not callable from outside Slurm?

2016-05-27 Thread Chris Samuel
On Thursday, 26 May 2016 11:37:02 PM AEST Christopher Samuel wrote: > Which is really odd as the code already calls into the Slurm libraries > for other functions such as slurm_hostlist_create(), etc. Solved! The slurm_showq code I'm hacking on is C++, not straight C, so I had to change my

[slurm-dev] Re: Regards Postgres Plugin for SLURM

2016-03-18 Thread Chris Samuel
On Fri, 18 Mar 2016 06:53:38 AM Doguparthi, Subramanyam wrote: > We are from Hewlett Packard Enterprise and evaluating SLURM > for one of our requirements. Database our application uses is Postgres and > we don’t see any working plugin available. Is it possible to help us with >

[slurm-dev] Re: What cluster provisioning system do you use?

2016-03-15 Thread Chris Samuel
On Tue, 15 Mar 2016 05:40:29 AM Bjørn-Helge Mevik wrote: > I apologize for the slightly off-topic subject, but I could not think of > a better forum to ask. If you know of a more proper place to ask this, > I'd be happy to know about it. http://beowulf.org/ There's actually a very recent

[slurm-dev] Re: User education tools for fair share

2016-03-01 Thread Chris Samuel
Hi Loris, On Tue, 1 Mar 2016 12:29:12 AM Loris Bennett wrote: > To help the user understand their current fairshare/priority status, I > usually point them to 'sprio', generally in the following incantation: > > sprio -l | sort -nk3 > > to get the jobs sorted by priority. Thanks for that,

[slurm-dev] Re: scontrol reboot won't reboot reserved nodes?

2016-03-01 Thread Chris Samuel
On Mon, 29 Feb 2016 11:50:18 PM Uwe Sauter wrote: > Did you configure the RebootProgram parameter in slurm.conf and is that > script working? Remember: this script is run on the compute node, therefore > it must be available on the compute node and must be executable. Yes, all our clusters are

[slurm-dev] RE: max user jobs limit?

2016-02-18 Thread Chris Samuel
On Thu, 18 Feb 2016 02:46:20 PM Skouson, Gary B wrote: > As far as I can tell, there isn’t really a way to set a default combined > limit across the cluster for the jobs a particular user can run You can! We do: sacctmgr modify cluster ${CLUSTER} set maxjobs=192 We thought we had to use

[slurm-dev] Re: Slurm accounting recommendations?

2015-12-28 Thread Chris Samuel
On Mon, 28 Dec 2015 08:25:04 PM Simpson Lachlan wrote: > Out of interest – and because I don’t really see any docs about this – is > XDMod just a front end to slurmdbd? > > If I go with XDMod do I even need slurmdbd? One of our staff is playing with XDMod at the moment and what you do is

[slurm-dev] Re: slurmd can't mount cpuacct cgroup namespace on RHEL 7.2 ?

2015-12-22 Thread Chris Samuel
On Tue, 22 Dec 2015 12:57:04 AM Janne Blomqvist wrote: > 1. When using systemd, or some other tool that mounts the cgroup file > systems early in the boot process (e.g. cgconfig), you should not try to > mount the cgroup filesystems from slurmd. That is, in > /etc/slurm/cgroup.conf put

[slurm-dev] Re: [slurm-devel] update SLURM 2.6.7 to SLURM 15.08.4

2015-11-15 Thread Chris Samuel
On Sat, 14 Nov 2015 04:09:03 PM Apolinar Martinez Melchor wrote: > We want to update SLURM 2.6.7 to SLURM 15.08.4 Be aware that you cannot upgrade directly from 2.6 to 15.08, it's too big a jump. A version of Slurm only supports the last two minor releases, so 15.08 only supports upgrades
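The "last two releases" rule can be sketched as a lookup in an ordered release list. The list below is written from memory of the Slurm release history and is an assumption, as are the helper names; check the official upgrade notes before relying on it.

```shell
#!/bin/bash
# Release order as recalled from the Slurm release history; treat this list as
# an assumption and check it against the official release notes.
RELEASES="2.5 2.6 14.03 14.11 15.08"

# Hypothetical helper: print the position of a release in RELEASES, or fail.
release_index() {
    local i=0 r
    for r in $RELEASES; do
        if [ "$r" = "$1" ]; then echo "$i"; return 0; fi
        i=$((i + 1))
    done
    return 1
}

# Hypothetical helper: succeed if FROM is one of the two releases before TO.
can_upgrade() {
    local from to gap
    from=$(release_index "$1") || return 2
    to=$(release_index "$2") || return 2
    gap=$((to - from))
    [ "$gap" -ge 1 ] && [ "$gap" -le 2 ]
}

for from in 2.6 14.03 14.11; do
    if can_upgrade "$from" 15.08; then
        echo "$from -> 15.08: direct upgrade"
    else
        echo "$from -> 15.08: needs an intermediate release"
    fi
done
```

Under these assumptions a 2.6 installation would first move to an intermediate release such as 14.03 or 14.11, and only then to 15.08.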

[slurm-dev] Re: Problem to run a job with more memory only on the node where the job start

2015-10-20 Thread Chris Samuel
On Mon, 19 Oct 2015 12:44:59 PM Ghislain LE MEUR wrote: > Is it possible to start a job on one of these 2 nodes with 512G and also > compute on others nodes with 128G of memory ? The job needs only more > memory on the first node where the job start. I suspect what you are after is the support

[slurm-dev] Re: Problem using --ntasks (slurm 14.11.9)

2015-10-07 Thread Chris Samuel
On Wed, 7 Oct 2015 12:50:53 AM g...@cines.fr wrote: > #!/bin/bash > #SBATCH --nodes=4 > #SBATCH --ntasks=7 > #SBATCH --ntasks-per-node=2 [...] > With slurm version 14.11.9, when the job is submitted with sbatch command we > get : > > SLURM_NTASKS=8 > SLURM_NPROCS=8 >

[slurm-dev] Re: Torque's routing queue like mechanism in Slurm

2015-10-01 Thread Chris Samuel
On Thu, 1 Oct 2015 02:23:30 AM vaibhav pol wrote: > In Slurm, can we route jobs between partitions? Previously I was > working on PBS (Torque); it has routing queue functionality, and these queues > are able to route jobs to different queues. In Slurm you can submit to multiple partitions at
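As a minimal sketch of submitting to several partitions: sbatch accepts a comma-separated list for --partition. The partition names short and long and the helper report_partition are hypothetical. SLURM_JOB_PARTITION is set by Slurm inside a job, so the script falls back to a placeholder message when run by hand.

```shell
#!/bin/bash
# Hypothetical partition names; replace with partitions defined on your cluster.
#SBATCH --partition=short,long
#SBATCH --ntasks=1

# Hypothetical helper: report which partition the job ended up in.
# SLURM_JOB_PARTITION is unset outside a job, hence the fallback text.
report_partition() {
    echo "running in partition: ${SLURM_JOB_PARTITION:-not inside a Slurm job}"
}

report_partition
```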

[slurm-dev] Adding a new cluster to slurmdbd - how to get existing accounts/associations to be recognised?

2015-08-20 Thread Chris Samuel
Hi there, We've upgraded our slurmdbd to 14.11.8 as part of our prep for a new cluster and I'm having issues with the creation of the new cluster. We can add the new cluster snowy to slurmdbd with sacctmgr, but any changes to accounts after that now fail with: # sacctmgr modify user set

[slurm-dev] Re: Adding a new cluster to slurmdbd - how to get existing accounts/associations to be recognised?

2015-08-20 Thread Chris Samuel
On Fri, 21 Aug 2015 02:59:56 AM Danny Auble wrote: > You need to add the accounts to the cluster. If you want it like your other cluster an easy way to do that is use sacctmgr to dump the cluster and then change the cluster name in the file and load it in with sacctmgr. Thanks Danny - it's odd

[slurm-dev] Re: Adding a new cluster to slurmdbd - how to get existing accounts/associations to be recognised?

2015-08-20 Thread Chris Samuel
On Fri, 21 Aug 2015 02:59:56 AM Danny Auble wrote: > You need to add the accounts to the cluster. If you want it like your other cluster an easy way to do that is use sacctmgr to dump the cluster and then change the cluster name in the file and load it in with sacctmgr. Dumped another cluster,
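The dump, rename and load steps Danny describes might look roughly like the sketch below. It assumes the dump file starts with a line of the form Cluster - 'name':..., as shown in the sacctmgr man page. The helper rename_cluster, the old cluster name oldcluster, the file names and the sample data are all made up; snowy is the new cluster named earlier in the thread. The sacctmgr calls are left as comments so the sketch runs without Slurm.

```shell
#!/bin/bash
# Hypothetical helper: rewrite the cluster name on the "Cluster - '...'" line
# of a sacctmgr dump read from stdin.
rename_cluster() {
    local old=$1 new=$2
    sed "s/^Cluster - '$old'/Cluster - '$new'/"
}

# On a real system the surrounding steps would be along these lines:
#   sacctmgr dump oldcluster file=oldcluster.cfg
#   rename_cluster oldcluster snowy < oldcluster.cfg > snowy.cfg
#   sacctmgr load file=snowy.cfg

# Made-up sample data standing in for a dump file.
sample="Cluster - 'oldcluster':Fairshare=1
Parent - 'root'
Account - 'proj1':Description='example':Fairshare=1"
printf '%s\n' "$sample" | rename_cluster oldcluster snowy
```

Only the Cluster line changes, so the accounts and associations in the rest of the file are loaded unchanged under the new cluster name.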

[slurm-dev] Re: Off-topic: What accounting system do you use?

2015-06-25 Thread Chris Samuel
On Thu, 25 Jun 2015 02:03:47 AM Bjørn-Helge Mevik wrote: > Christopher Samuel sam...@unimelb.edu.au writes: > > http://karaage.readthedocs.org/en/latest/introduction.html > Karaage looks interesting for managing projects and users. Can it manage usage limits? Sadly that's one thing it doesn't

[slurm-dev] Re: cgroups support in slurm (sbatch vs salloc)

2015-05-07 Thread Chris Samuel
On Thu, 7 May 2015 04:01:25 AM Igor Kozin wrote: > My real question is why running salloc --mem-per-cpu=1000 --ntasks=1 bash does not create cgroups and therefore gets you an unlimited interactive session? My understanding is that salloc will give you a session on the same node you run it,

[slurm-dev] Slurm and docker/containers

2015-05-07 Thread Chris Samuel
Hi folks, There's been discussion in the past about Slurm and docker and for the first time today I was asked by a user if it was possible yet to run docker containers inside Slurm. Their use case is they want to distribute bioinformatics tools inside a docker container and want to be able

[slurm-dev] Re: OpenMPI and 14.3 to 14.11 upgrade

2015-02-05 Thread Chris Samuel
On Thu, 5 Feb 2015 03:27:25 PM Peter A Ruprecht wrote: > I ask because some of our users have started reporting a 10x increase in run-times of OpenMPI jobs since we upgraded to 14.11.3 from 14.3. It's possible there is some other problem going on in our cluster, but all of our hardware checks

[slurm-dev] Re: Unable to use Spank-X11 plugin

2015-01-27 Thread Chris Samuel
On Tue, 27 Jan 2015 10:11:23 PM Dennis Zheleznyak wrote: > Anyone? Don't run it in an sbatch as it will be disconnected from your login session and any $DISPLAY you may have won't make any sense. I believe you need to run the srun from the login node itself. -- Christopher Samuel

[slurm-dev] Re: Accounting RawUsage on systems with HyperThreading

2015-01-09 Thread Chris Samuel
On Fri, 9 Jan 2015 05:11:39 AM Hendryk Bockelmann wrote: > Do you have any idea if this is an intended behaviour or a bug? We hit this same issue when we were bringing up our Intel Sandybridge cluster back in 2013 (as SMT for SB was better than previous generations) and it was a significant

[slurm-dev] Re: User's primary group lost when submitting with different GID

2014-12-11 Thread Chris Samuel
Hi Andrew, On Thu, 11 Dec 2014 07:06:13 AM Andrew J. Prout wrote: > Looks like the groups and id commands actually mix reporting of the current process and what's in /etc/group, so they don't work to debug this issue. Ah, sorry! I should have strace'd groups first to confirm it wasn't talking to

[slurm-dev] Re: User's primary group lost when submitting with different GID

2014-12-10 Thread Chris Samuel
On Wed, 10 Dec 2014 09:20:27 AM Andrew J. Prout wrote: > Below are some examples of what I'm seeing with SLURM 14.11.0. Notice that group 1000 disappears in the sleep process. I can't replicate this on Slurm 14.03.10, FWIW, it seems to do the right thing. Also remember that you can grab your

[slurm-dev] Re: Build error on CentOS 5.10

2014-10-02 Thread Chris Samuel
On Thu, 2 Oct 2014 12:15:10 AM Dennis Zheleznyak wrote: > Hi Chris, Hiya, Those all look fine, but I've just noticed you're building RPMs there and that's something we don't do with Slurm so I'm afraid I'm not sure I can help there, sorry! :-( All the best, Chris -- Christopher Samuel

[slurm-dev] Re: shellshock patch uses a different function export, caused some errors on our Slurm cluster

2014-09-29 Thread Chris Samuel
On Mon, 29 Sep 2014 02:10:07 AM Alan Orth wrote: > Other users wouldn't have noticed because we updated all of our infrastructure in one go using ansible[0] last Friday. We use xCAT to manage our clusters and whilst we could have done that if we had wished it would have caused any jobs queued

[slurm-dev] RE: Feedback on integration tests systemd/slurm and questions

2014-08-29 Thread Chris Samuel
On Fri, 29 Aug 2014 11:03:10 AM Martin Perry wrote: > Thanks for investigating this. It looks like some work will be required to fully integrate Slurm cgroups with systemd. I suspect that's going to depend mightily on the version of systemd and the kernel you are using (for points already

[slurm-dev] Re: Feedback on integration tests systemd/slurm and questions

2014-08-29 Thread Chris Samuel
On Fri, 29 Aug 2014 06:02:10 AM Janne Blomqvist wrote: > Haven't tested anything yet, but with RHEL/CentOS 7 already available, I suspect it won't be long before people are starting to roll out clusters based on those OS'es. So the topic certainly deserves some attention, thanks for bringing

[slurm-dev] Re: Upgrading and not losing jobs

2014-08-24 Thread Chris Samuel
On Sun, 24 Aug 2014 01:58:05 AM Dennis Zheleznyak wrote: > I'm upgrading Slurm from 2.4.4 to the latest 14.X version, when I tried to simulate it in a virtual environment the running jobs were deleted every single time. As Uwe said I suspect that's too large a jump to be supported, you might

[slurm-dev] Re: Using the same (mounted) slurm installation on all nodes

2014-07-25 Thread Chris Samuel
On Fri, 25 Jul 2014 01:34:09 AM Bastian Krüger wrote: > I recently began working with a cluster that consists of 1 control node and several computation nodes and it was set up a couple of years ago by someone else. In this current setup, there is only one actual slurm installation, which is

[slurm-dev] Re: Requeue and resubmit after networking issue

2014-05-19 Thread Chris Samuel
On Mon, 19 May 2014 04:37:03 AM Teshome Dagne Mulugeta wrote: > Is there a way to keep running jobs going after a networking issue between the slurm daemon and nodes? I suspect the answer is the --no-kill option for sbatch. Best of luck! Chris -- Christopher Samuel Senior Systems

[slurm-dev] Default values for GrpCPURunMins, GrpCPUs and GrpJobs in slurm.conf?

2013-08-22 Thread Chris Samuel
Hi folks, As we are transitioning from Torque+Moab to a pure Slurm setup I'm trying to wrap my head around how we can implement similar scheduling limits on our x86 systems to the rules we were using with Moab. One thing that I've found is that GrpCPURunMins, GrpCPUs and GrpJobs are what we

[slurm-dev] Re: RLIMIT_DATA effectively a no-op on Linux

2013-07-20 Thread Chris Samuel
Hi there, On Sat, 20 Jul 2013 02:53:52 AM Bjørn-Helge Mevik wrote: > With the recent changes in glibc in how virtual memory is allocated for threaded applications, limiting virtual memory usage for threaded applications is IMO not a good idea. (One example: our slurmctld has allocated 16.1