Hi Aaron,

Thanks for the advice. Indeed, SLURM correctly recognizes the socket and
core counts. However, when I allow jobs to share the node, it allocates
the processors randomly across the machine, and we see no scaling. When I
run with --mem_bind=sockets, every job starts from the first socket,
without recognizing that it may already be occupied. This means we have to
specify the socket explicitly for each srun to get the desired behavior.
Instead, we would like a node-per-socket setup and let SLURM decide which
socket is free.

If there is a workaround that avoids this, we would gladly accept that
too.
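For reference, this is the kind of manual bookkeeping we end up doing (a
sketch only; the core numbering, an 8-cores-per-socket layout, and the
./app binary are assumptions for illustration):

```shell
# Hypothetical manual placement: pin each srun to one socket's cores and
# to memory local to those cores, assuming socket N owns cores 8N..8N+7
# (20 sockets x 8 cores = 160). This is exactly the per-socket
# bookkeeping we would like SLURM to handle for us.
srun -n8 --cpu_bind=map_cpu:0,1,2,3,4,5,6,7       --mem_bind=local ./app &   # socket 0
srun -n8 --cpu_bind=map_cpu:8,9,10,11,12,13,14,15 --mem_bind=local ./app &   # socket 1
```

With many jobs, keeping track of which sockets are free by hand quickly
becomes error-prone, which is why we would prefer SLURM to do the
accounting.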

Kind regards,
Artem


On 3 June 2012 18:49, Aaron Knister <[email protected]> wrote:

> Hi Artem,
>
> Granted I have never used SLURM on a large NUMA system but I would think
> that having one node is fine as long as SLURM properly recognizes your
> socket and core count. If you want to allocate sockets not cores then I
> would suggest using the consumable resources select plugin with either
> CR_Socket or CR_Socket_Memory for the value of SelectTypeParameter.
> Hopefully some other folks will chime in :)
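>
> In slurm.conf that would look roughly like this (a sketch only; whether
> you also want memory tracked per socket decides between the two values):
>
> ```
> # Allocate whole sockets (and their memory) as the consumable unit,
> # rather than individual cores or whole nodes.
> SelectType=select/cons_res
> SelectTypeParameters=CR_Socket_Memory   # or CR_Socket without memory tracking
> ```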
>
> -Aaron
>
> Sent from my iPhone
>
> On Jun 3, 2012, at 12:20 PM, Artem Kulachenko <[email protected]>
> wrote:
>
> Hi Everyone,
>
> We have one middle size SMP machine with NUMA architectures having 20
> sockets and 160 cores. We would like to use SLURM, which currently sees it
> as a single node. I wonder if there is a way to split it into 20 nodes (one
> per socket) in order to run the jobs locally at neighboring cores using
> local memory without the need to assign the processors manually.
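>
> For concreteness, the layout we are after would look something like this
> in slurm.conf (node names, ports, and the per-socket memory figure are
> invented; as far as I understand, running several slurmd daemons on one
> host also requires SLURM built with --enable-multiple-slurmd):
>
> ```
> # Twenty logical one-socket "nodes" on the same physical host,
> # each slurmd listening on its own port.
> NodeName=numa[0-19] NodeHostname=bigsmp Port=[17000-17019] \
>     Sockets=1 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=32000
> PartitionName=smp Nodes=numa[0-19] Default=YES State=UP
> ```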
>
> I would greatly appreciate any advice.
>
> Kind regards,
> Artem
>
>
>
