[slurm-dev] Re: How can I run multi job on one gpu

2017-10-23 Thread Wensheng Deng
There is an earlier thread related to this: https://groups.google.com/forum/#!searchin/slurm-devel/gres$20gpu$20oversubscribe%7Csort:date/slurm-devel/WPmkNPedKeM/r7EDvX7jujgJ On Sat, Oct 21, 2017 at 10:58 PM, Chaofeng Zhang wrote: > CUDA support it, gpu is shared mode by
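A common workaround, assuming the linked thread reaches a similar conclusion, is not to share one GPU across separate Slurm jobs at all, but to hold the GPU in a single allocation and pack several processes onto it; CUDA allows multiple processes to use one device concurrently. A rough sketch (the script run.sh and the resource numbers are hypothetical):

    #!/bin/bash
    #SBATCH --gres=gpu:1           # one GPU held by this single job
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    # Every backgrounded process inherits the same CUDA_VISIBLE_DEVICES,
    # so all of them run on the one GPU allocated to this job.
    ./run.sh input1 &
    ./run.sh input2 &
    ./run.sh input3 &
    wait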

[slurm-dev] Re: srun: error: _server_read: fd 18 error reading header: Connection reset by peer

2017-10-17 Thread Wensheng Deng
It probably could not find the singularity executable? On Tue, Oct 17, 2017 at 8:04 AM, Chaofeng Zhang wrote: > I met the error when using slurm. > > > > srun: error: _server_read: fd 18 error reading header: Connection reset by > peer > srun: error:
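A quick check, not from the original thread, is to ask the compute node itself whether the binary is on PATH there (the full path below is hypothetical):

    $ srun -N1 which singularity                       # no output means it is not on PATH on the node
    $ srun -N1 /usr/local/bin/singularity --version    # or invoke it by its full path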

[slurm-dev] Re: How to set 'future' node state?

2017-07-14 Thread Wensheng Deng
Make sure the FUTURE node is not included in any partition. On Fri, Jul 14, 2017 at 5:27 PM, Robbert Eggermont wrote: > > Hello, > > We're adding some nodes to our cluster (17.02.5). In preparation, we've > defined the nodes in our slurm.conf with "State=FUTURE" (as
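For reference, a minimal sketch of defining FUTURE nodes while keeping them out of every partition (node names and hardware values are hypothetical):

    # slurm.conf: the new nodes are defined but not listed in any partition
    NodeName=node[41-44] CPUs=28 RealMemory=128000 State=FUTURE
    PartitionName=batch Nodes=node[01-40] Default=YES State=UP
    # When the hardware is ready, one possible procedure is to change
    # State=FUTURE to State=UNKNOWN, add the nodes to the partition,
    # then restart slurmctld and start slurmd on the new nodes.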

[slurm-dev] Re: Running job, finished workload

2017-06-08 Thread Wensheng Deng
Hi Alden, The CPU time is probably the sum over all 8 CPU cores. By any chance, is there a runaway process from the job left on the node, such as an epilogue? I am guessing... On Thu, Jun 8, 2017 at 2:39 PM Stradling, Alden Reid (ars9ac) < ars...@virginia.edu> wrote: > I have a
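To look for leftover processes, something along these lines could be run on the node itself (the job ID and user name are hypothetical):

    $ scontrol listpids 1234                  # PIDs slurmd still associates with job 1234 on this node
    $ ps -u someuser -o pid,ppid,etime,cmd    # any old, unexpected processes still running as the user?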

[slurm-dev] Re: slurmdb error

2017-04-26 Thread Wensheng Deng
Also check to see if munge is functioning properly. On Wed, Apr 26, 2017 at 10:00 AM, Jeff Tan wrote: > Hi Mahmood > > > [root@cluster ~]# ps aux | grep slurmdb > > root 3406 0.0 0.0 338636 2672 ?Sl 00:26 0:01 > > /usr/sbin/slurmdbd > > root 17146
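The standard munge self-test looks like this (the hostname is hypothetical):

    $ munge -n | unmunge              # local encode/decode should report STATUS: Success
    $ munge -n | ssh node01 unmunge   # the same credential should also decode on other cluster nodes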

[slurm-dev] Re: How to allow Epilog script to run for job that is cancelled

2017-04-13 Thread Wensheng Deng
Hi, several months ago, when I started learning Slurm and reading through the web pages, I made this picture to help myself understand the *prolog and *epilog interactions with job steps. Please see the attachment. If you see any corrections necessary, please let me know. Thank you! Best Regards.
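The attached picture is not reproduced here; as a rough substitute, a sketch of the hooks and where each one runs, as I read the slurm.conf documentation (all paths are hypothetical):

    PrologSlurmctld=/etc/slurm/prolog_ctld.sh  # on the controller, at job allocation
    Prolog=/etc/slurm/prolog.sh                # by slurmd on each allocated node, before the job
    SrunProlog=/etc/slurm/srun_prolog.sh       # by srun on the submit host, before the step
    TaskProlog=/etc/slurm/task_prolog.sh       # by slurmstepd, before each task of a step
    TaskEpilog=/etc/slurm/task_epilog.sh       # by slurmstepd, after each task of a step
    SrunEpilog=/etc/slurm/srun_epilog.sh       # by srun on the submit host, after the step
    Epilog=/etc/slurm/epilog.sh                # by slurmd on each node, at job completion
    EpilogSlurmctld=/etc/slurm/epilog_ctld.sh  # on the controller, after the job finishes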

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Wensheng Deng
sn't terminated. Note: other cgroup files like > > > memory.memsw.xxx are also in play if you are using swap space > > > > > > As to how to manage this. You can either not use cgroup and use an > > > alternative plugin, you could also try the JobAcctGatherParam

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Wensheng Deng
ote: other cgroup files like memory.memsw.xxx are also in play if you are > using swap space > > > > As to how to manage this. You can either not use cgroup and use an > alternative plugin, you could also try the JobAcctGatherParams parameter > NoOverMemoryKill (the documentati

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Wensheng Deng
he JobAcctGatherParams parameter > NoOverMemoryKill (the documentation say use this with caution, see > https://slurm.schedmd.com/slurm.conf.html), or you can try and account > for the cache by using the jobacct_gather/cgroup. Unfortunately, because > of a bug this plugin does report c
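For reference, the parameter discussed in this (truncated) quote would go into slurm.conf roughly as below; the documentation advises caution with it, so treat this as a sketch only:

    # Keep cgroup enforcement, but stop the accounting sampler from killing
    # jobs whose usage momentarily exceeds the requested memory.
    JobAcctGatherParams=NoOverMemoryKill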

[slurm-dev] Re: Slurm & CGROUP

2017-03-17 Thread Wensheng Deng
The file is copied fine; it is just the error message that is annoying. On Thu, Mar 16, 2017 at 8:55 AM, Janne Blomqvist <janne.blomqv...@aalto.fi> wrote: > On 2017-03-15 17:52, Wensheng Deng wrote: > > No, it does not help: > > > > $ scontrol show config |grep

[slurm-dev] Re: Slurm & CGROUP

2017-03-15 Thread Wensheng Deng
No, it does not help: $ scontrol show config |grep -i jobacct JobAcctGatherFrequency = 30 JobAcctGatherType = jobacct_gather/cgroup JobAcctGatherParams = NoShared On Wed, Mar 15, 2017 at 11:45 AM, Wensheng Deng <w...@nyu.edu> wrote: > I think I tried that. l

[slurm-dev] Re: Slurm & CGROUP

2017-03-15 Thread Wensheng Deng
arams=NoShare? > > Chris > > > ____ > From: Wensheng Deng <w...@nyu.edu> > Sent: 15 March 2017 10:28 > To: slurm-dev > Subject: [ext] [slurm-dev] Re: Slurm & CGROUP > > It should be (sorry): > we 'cp'ed a 5GB file from scratch to node local disk

[slurm-dev] Re: Slurm & CGROUP

2017-03-15 Thread Wensheng Deng
It should be (sorry): we 'cp'ed a 5GB file from scratch to node local disk On Wed, Mar 15, 2017 at 11:26 AM, Wensheng Deng <w...@nyu.edu> wrote: > Hello experts: > > We turn on TaskPlugin=task/cgroup. In one Slurm job, we 'cp'ed a 5GB job > from scratch to node local disk, de

[slurm-dev] Slurm & CGROUP

2017-03-15 Thread Wensheng Deng
Hello experts: We turn on TaskPlugin=task/cgroup. In one Slurm job, we 'cp'ed a 5GB job from scratch to node local disk, declared 5 GB memory for the job, and saw error message as below although the file was copied okay: slurmstepd: error: Exceeded job memory limit at some point. srun: error:
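The page cache generated by cp is charged to the job's memory cgroup, which is why the limit is reported as exceeded even though the process itself uses very little memory; this reading is consistent with the replies quoted above. A hypothetical job that reproduces the situation:

    #!/bin/bash
    #SBATCH --mem=5G              # becomes the memory limit of the job's cgroup
    # The kernel charges the page cache created by cp to the same cgroup,
    # so copying a 5 GB file can push usage past the 5 GB limit.
    cp /scratch/bigfile.dat /tmp/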

[slurm-dev] Re: job arrays, fifo queueing not wanted

2016-12-14 Thread Wensheng Deng
Hello Michael, Have you tried this? I am learning, and curious to know... https://slurm.schedmd.com/scontrol.html top job_id Move the specified job ID to the top of the queue of jobs belonging to the identical user ID, partition name, account, and QOS. Any job not matching all of those
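A usage sketch (the job ID is hypothetical):

    $ squeue -u $USER        # list your pending jobs and their current order
    $ scontrol top 1234567   # move job 1234567 ahead of your other pending jobs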

[slurm-dev] Re: X11 plugin problems

2016-11-07 Thread Wensheng Deng
Hi, I am relatively new to Slurm and facing the same issue. As a user, I can ssh back from the compute node to the login node without being asked for a password. My OS is CentOS 7.2, with Slurm 16.05.4. In slurmd.log, there are messages like the following: [2016-11-07T11:04:31.958] [10004626.0] debug2:
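Since the message above suggests that passwordless ssh back to the login node matters for the X11 plugin, one non-interactive way to verify that path from a compute node (hostname hypothetical):

    $ ssh -o BatchMode=yes login-node true && echo "key auth OK"   # fails instead of prompting if keys are not set up
    $ echo $DISPLAY     # on the login node: must be set, or there is no X11 session to forward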