On Tuesday, 12 June 2018 6:13:49 PM AEST Kilian Cavalotti wrote:
> Slurm has a scheduler option that could probably help with that:
> https://slurm.schedmd.com/slurm.conf.html#OPT_pack_serial_at_end
Ah, I knew I'd seen something like that before! I got fixated on CR_Pack_Nodes,
which is not for
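For reference, the option Kilian points to is one of the SchedulerParameters values in slurm.conf. A minimal sketch (the option name is real and documented for use with the cons_res selection plugin; the surrounding lines are just illustrative context):

```
# slurm.conf
SchedulerType=sched/backfill
# pack_serial_at_end: place serial (single-CPU) jobs at the end of the
# available node list instead of best-fit, keeping whole nodes free
# elsewhere for parallel work.
SchedulerParameters=pack_serial_at_end
```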
On Tue, Jun 12, 2018 at 6:33 AM, Chris Samuel wrote:
> However, I do think Scott's approach is potentially very useful, by directing
> jobs < full node to one end of a list of nodes and jobs that want full nodes
> to the other end of the list (especially if you use the partition idea to
> ensure that not all nodes are accessible to small jobs).
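One way to approximate that end-of-list placement in Slurm is node Weight plus a restricted partition. Weight is the real slurm.conf knob (lower-weight nodes are allocated first, for all jobs); the node names, ranges, and values below are made up, and it is the partition restriction that actually keeps small jobs off part of the machine:

```
# slurm.conf
# Lower Weight = allocated first, so jobs fill node[001-064] before
# touching node[065-128]. Names and values are hypothetical.
NodeName=node[001-064] Weight=10
NodeName=node[065-128] Weight=100
# The partition trick: small jobs can only reach a subset of the nodes,
# leaving the rest unfragmented for full-node work.
PartitionName=small Nodes=node[001-064] MaxNodes=1 Default=YES
PartitionName=large Nodes=node[001-128]
```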
Hi Prentice!
On Tuesday, 12 June 2018 4:11:55 AM AEST Prentice Bisbal wrote:
> To make this work, I will be using job_submit.lua to apply this logic
> and assign a job to a partition. If a user requests a specific partition
> not in line with these specifications, job_submit.lua will reassign
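A sketch of that kind of routing logic as a job_submit/lua plugin. The partition names and the size threshold are hypothetical; the function signatures, job_desc fields, and slurm.SUCCESS come from Slurm's job_submit.lua interface:

```lua
-- job_submit.lua sketch: route jobs to a partition by size and override
-- a user-requested partition that conflicts with the policy.
function slurm_job_submit(job_desc, part_list, submit_uid)
    local requested = job_desc.partition
    local nodes = job_desc.min_nodes

    -- Pick a partition from job size; unset numeric fields may arrive
    -- as nil (or a NO_VAL sentinel, depending on Slurm version).
    local target = "serial"
    if nodes ~= nil and nodes > 1 then
        target = "parallel"
    end

    -- Reassign if the request doesn't match the policy.
    if requested ~= target then
        slurm.log_info("job_submit: moving job from %s to %s",
                       tostring(requested), target)
        job_desc.partition = target
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end
```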
On Sunday, 10 June 2018 1:48:18 AM AEST Skylar Thompson wrote:
> Unfortunately we don't have a mechanism to limit
> network usage or local scratch usage
Our trick in Slurm is to use the slurmd prolog script to set an XFS project
quota for that job ID on the per-job directory (created by a plugin
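A sketch of that prolog idea, assuming the per-job directory already exists under a local XFS mount with project quotas (prjquota) enabled. The paths, the 100g cap, and the dry-run switch are all assumptions; the xfs_quota invocations themselves need root:

```shell
#!/bin/sh
# Sketch of a slurmd prolog that caps per-job local scratch with an XFS
# project quota. SCRATCH, the limit, and the job directory layout are
# assumptions to adapt per site.
SCRATCH="${SCRATCH:-/local/scratch}"
JOB="${SLURM_JOB_ID:-0}"
JOBDIR="$SCRATCH/job_$JOB"
LIMIT="${SCRATCH_LIMIT:-100g}"

# Register the per-job directory as an XFS project named after the job
# ID, then apply a hard block limit to that project.
SETUP_CMD="xfs_quota -x -c 'project -s -p $JOBDIR $JOB' $SCRATCH"
LIMIT_CMD="xfs_quota -x -c 'limit -p bhard=$LIMIT $JOB' $SCRATCH"

if [ "${PROLOG_EXEC:-0}" = "1" ]; then
    eval "$SETUP_CMD"
    eval "$LIMIT_CMD"
else
    # Dry run by default so the sketch is safe to inspect.
    echo "$SETUP_CMD"
    echo "$LIMIT_CMD"
fi
```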
Chris,
I'm dealing with this problem myself right now. We use Slurm here. We
really have one large, very heterogeneous cluster that's treated as
multiple smaller clusters through creating multiple partitions, each
with their own QOS. We also have some users who don't understand the
On Mon, Jun 11, 2018 at 02:36:14PM +0200, John Hearns via Beowulf wrote:
> Skylar Thompson wrote:
> >Unfortunately we don't have a mechanism to limit
> >network usage or local scratch usage, but the former is becoming less of a
> >problem with faster edge networking, and we have an opt-in bookkeeping
> >mechanism for the latter that isn't enforced but works well enough to keep
On Sunday, 10 June 2018 10:33:22 PM AEST Scott Atchley wrote:
[lists]
> Yes. It may be specific to Cray/Moab.
No, I think that applies quite nicely to Slurm too.
> Good luck. If you want to discuss, please do not hesitate to ask. We have
> another paper pending along the same lines.
Thanks!
On Sunday, 10 June 2018 1:22:07 AM AEST Scott Atchley wrote:
> Hi Chris,
Hey Scott,
> We have looked at this _a_ _lot_ on Titan:
>
> A Multi-faceted Approach to Job Placement for Improved Performance on
> Extreme-Scale Systems
>
> https://ieeexplore.ieee.org/document/7877165/
Thanks! IEEE has
On Sunday, 10 June 2018 1:48:18 AM AEST Skylar Thompson wrote:
> We're a Grid Engine shop, and we have the execd/shepherds place each job in
> its own cgroup with CPU and memory limits in place.
Slurm supports cgroups as well (and we use it extensively); the idea here
is more to try and
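For completeness, Slurm's cgroup enforcement is wired up through two config files. The option names below are real; the selection of constraints is just a typical minimal sketch:

```
# slurm.conf
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup

# cgroup.conf
ConstrainCores=yes        # pin tasks to their allocated CPUs
ConstrainRAMSpace=yes     # enforce the job's memory request
ConstrainDevices=yes      # restrict device (e.g. GPU) access to allocations
```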
We're a Grid Engine shop, and we have the execd/shepherds place each job in
its own cgroup with CPU and memory limits in place. This lets our users
make efficient use of our HPC resources whether they're running single-slot
jobs, or multi-node jobs. Unfortunately we don't have a mechanism to limit
Hi Chris,
We have looked at this _a_ _lot_ on Titan:
A Multi-faceted Approach to Job Placement for Improved Performance on
Extreme-Scale Systems
https://ieeexplore.ieee.org/document/7877165/
The issue we have is small jobs "inside" large jobs interfering with the
larger jobs. The item that is
On Saturday, 9 June 2018 12:39:02 AM AEST Bill Abbott wrote:
> We set PriorityFavorSmall=NO and PriorityWeightJobSize to some
> appropriately large value in slurm.conf, which helps.
I guess that helps getting jobs going (and we use something similar), but my
question was more about placement.
On Saturday, 9 June 2018 12:16:16 AM AEST Paul Edmon wrote:
> Yeah, this one is tricky. In general we take the wild-west approach here, but
> I've had users use --contiguous and their job takes forever to run.
:-)
> I suppose one method would be to enforce that each job take a full node
> and
This isn't quite the same issue, but several times I have observed a
large multi-CPU machine lock up because the accounting records associated
with a zillion tiny, rapidly launched jobs made an enormous
/var/account/pacct file and filled the small root filesystem. Actually
it wasn't usually
We set PriorityFavorSmall=NO and PriorityWeightJobSize to some
appropriately large value in slurm.conf, which helps.
We also used to limit the number of total jobs a single user could run
to something like 30% of the cluster, so a user could run a single mpi
job that takes all nodes, but
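A sketch of the slurm.conf options Bill names, plus one way the per-user ceiling could be expressed. The option and limit names are real; the numeric values are made up:

```
# slurm.conf
PriorityType=priority/multifactor
PriorityFavorSmall=NO
PriorityWeightJobSize=100000   # illustrative; tune against the other weights

# A per-user cap like the "30% of the cluster" rule can be set as a QOS
# limit via sacctmgr (CPU count hypothetical):
#   sacctmgr modify qos normal set MaxTRESPerUser=cpu=1200
```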
Yeah, this one is tricky. In general we take the wild-west approach here,
but I've had users use --contiguous and their job takes forever to run.
I suppose one method would be to enforce that each job take a full
node and parallel jobs always have contiguous. As I recall Slurm will
Hi Chris,
> Message: 2
> Date: Fri, 08 Jun 2018 17:21:56 +1000
> From: Chris Samuel
> To: beowulf@beowulf.org
> Subject: [Beowulf] Avoiding/mitigating fragmentation of systems by
> small jobs?
Chris, good question. I can't give a direct answer there, but let me share
my experiences.
In the past I managed SGI ICE clusters and a large-memory UV system with
PBSPro queuing.
The engineers submitted CFD solver jobs using scripts, and we only allowed
them to use a multiple of N CPUs,
in fact