There is an earlier thread related to this:
https://groups.google.com/forum/#!searchin/slurm-devel/gres$20gpu$20oversubscribe%7Csort:date/slurm-devel/WPmkNPedKeM/r7EDvX7jujgJ
On Sat, Oct 21, 2017 at 10:58 PM, Chaofeng Zhang
wrote:
> CUDA supports it, the gpu is in shared mode by
Is it possible that it could not find the singularity executable?
On Tue, Oct 17, 2017 at 8:04 AM, Chaofeng Zhang wrote:
> I met the following error when using Slurm.
>
>
>
> srun: error: _server_read: fd 18 error reading header: Connection reset by
> peer
> srun: error:
Make sure the FUTURE node is not included in any partition.
On Fri, Jul 14, 2017 at 5:27 PM, Robbert Eggermont
wrote:
>
> Hello,
>
> We're adding some nodes to our cluster (17.02.5). In preparation, we've
> defined the nodes in our slurm.conf with "State=FUTURE" (as
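The setup being described can be sketched as a slurm.conf fragment. This is a hedged illustration: the node names, counts, and partition name below are hypothetical, not taken from the original message.

```
# Define the new hardware now, but keep it out of scheduling:
NodeName=node[101-116] CPUs=32 RealMemory=128000 State=FUTURE
# Per the advice above, do not list the FUTURE nodes in any
# PartitionName= line until they actually go live:
PartitionName=batch Nodes=node[001-100] Default=YES State=UP
```

When the nodes are ready, change their State and add them to a partition, then reconfigure the daemons.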
Hi Alden,
The CPU time is probably the sum of the time from all 8 CPU cores. By any
chance do you have any runaway processes from the job on the node, such as
an epilog, etc.? I am guessing...
On Thu, Jun 8, 2017 at 2:39 PM Stradling, Alden Reid (ars9ac) <
ars...@virginia.edu> wrote:
> I have a
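The point above about summing across cores can be shown with a bit of arithmetic. The numbers below are made up for illustration, not taken from the original job.

```python
# Hedged sketch: why a job's accounted CPU time can be ~8x its
# wall-clock time when it has 8 CPU cores allocated.
elapsed_sec = 3600       # 1 hour of wall-clock time (illustrative)
cores = 8                # CPU cores allocated to the job

# If all cores are busy for the whole run, the accounted CPU time
# is roughly the per-core time summed over all cores:
cpu_time_sec = elapsed_sec * cores
print(cpu_time_sec)      # 28800 seconds, i.e. 8 hours of CPU time
```

A runaway process left behind on the node (e.g. by an epilog) would inflate this further, which is what the reply is hinting at.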
Also check to see if munge is functioning properly.
On Wed, Apr 26, 2017 at 10:00 AM, Jeff Tan wrote:
> Hi Mahmood
>
> > [root@cluster ~]# ps aux | grep slurmdb
> > root 3406 0.0 0.0 338636 2672 ? Sl 00:26 0:01
> > /usr/sbin/slurmdbd
> > root 17146
Hi, several months ago when I started learning Slurm and reading through
the web pages, I made this picture to help myself understand the *prolog
and *epilog interactions with job steps. Please see the attachment. If you
see any corrections necessary, please let me know. Thank you!
Best Regards.
sn't terminated. Note: other cgroup files like
> > > memory.memsw.xxx are also in play if you are using swap space
> > >
> > > As to how to manage this. You can either not use cgroup and use an
> > > alternative plugin, you could also try the JobAcctGatherParams parameter
> > > NoOverMemoryKill (the documentation says use this with caution, see
> > > https://slurm.schedmd.com/slurm.conf.html), or you can try and account
> > > for the cache by using the jobacct_gather/cgroup. Unfortunately, because
> > > of a bug this plugin does report c
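A minimal slurm.conf sketch of the workaround mentioned above (use with caution, as the documentation warns; the gather plugin choice here is illustrative):

```
# Stop a step from being killed solely because cached file pages
# push accounted memory usage over the job's limit:
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherParams=NoOverMemoryKill
```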
The file is copied fine. It is just that the error message is annoying.
On Thu, Mar 16, 2017 at 8:55 AM, Janne Blomqvist <janne.blomqv...@aalto.fi>
wrote:
> On 2017-03-15 17:52, Wensheng Deng wrote:
> > No, it does not help:
> >
> > $ scontrol show config |grep
No, it does not help:
$ scontrol show config |grep -i jobacct
*JobAcct*GatherFrequency = 30
*JobAcct*GatherType = *jobacct*_gather/cgroup
*JobAcct*GatherParams = NoShared
On Wed, Mar 15, 2017 at 11:45 AM, Wensheng Deng <w...@nyu.edu> wrote:
> I think I tried that. l
arams=NoShare?
>
> Chris
>
>
> ____
> From: Wensheng Deng <w...@nyu.edu>
> Sent: 15 March 2017 10:28
> To: slurm-dev
> Subject: [ext] [slurm-dev] Re: Slurm & CGROUP
>
> It should be (sorry):
> we 'cp'ed a 5GB file from scratch to node local disk
It should be (sorry):
we 'cp'ed a 5GB file from scratch to node local disk
On Wed, Mar 15, 2017 at 11:26 AM, Wensheng Deng <w...@nyu.edu> wrote:
> Hello experts:
>
> We turn on TaskPlugin=task/cgroup. In one Slurm job, we 'cp'ed a 5GB job
> from scratch to node local disk, de
Hello experts:
We turn on TaskPlugin=task/cgroup. In one Slurm job, we 'cp'ed a 5GB job
from scratch to node local disk, declared 5 GB memory for the job, and saw
error message as below although the file was copied okay:
slurmstepd: error: Exceeded job memory limit at some point.
srun: error:
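The behavior reported here can be illustrated with some arithmetic: under cgroup v1, page cache from file I/O counts toward the cgroup's memory usage, so a large `cp` can peak above the job's limit even though the process itself uses little memory. The numbers below are assumptions for illustration.

```python
# Hedged sketch: why copying a 5 GB file can trip a 5 GB job memory
# limit under cgroup accounting, even though the copy succeeds.
limit = 5 * 1024**3        # declared job memory limit: 5 GiB
rss   = 200 * 1024**2      # actual process memory (illustrative)
cache = 5 * 1024**3        # page cache from copying the 5 GB file

# The cgroup's reported usage includes the page cache:
usage = rss + cache
print(usage > limit)       # True: peak usage exceeds the limit,
                           # hence "Exceeded job memory limit at some point"
```

The kernel reclaims cache under pressure rather than OOM-killing the job, which is why the file copies fine while the accounting check still fires.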
Hello Michael,
Have you tried this? I am learning, and curious to know...
https://slurm.schedmd.com/scontrol.html
top job_id
Move the specified job ID to the top of the queue of jobs belonging to the
identical user ID, partition name, account, and QOS. Any job not matching
all of those
Hi,
I am relatively new to Slurm, and I am facing the same issue. As a user I
could ssh back from the compute node to the login node without being asked
for a password.
My OS is CentOS 7.2, with Slurm 16.05.4.
In slurmd.log, there are messages like the following:
[2016-11-07T11:04:31.958] [10004626.0] debug2: