Unfortunately, I don't recall the details. I did find an article on the web, but this was back around February.

In a nutshell, our slurmctld was mysteriously crashing on CentOS 6.5. I think someone on this list pointed me to the Linux kernel issue, so it might be in the archives.

After I increased the memory limit from 2G to 10G, the problem ceased.
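
For anyone who wants to see the gap between what the limit counts and what the daemon really uses, here is a minimal sketch in Python. It assumes a Linux /proc/<pid>/status file with VmSize (virtual address space) and VmRSS (resident memory) fields reported in kB. The function names are made up for illustration and are not part of Slurm. The "as" limit is enforced against address space, so VmSize is the number that hits the limit, not VmRSS.

```python
import sys

def parse_status(text):
    """Pull VmSize and VmRSS (both in kB) out of /proc/<pid>/status text."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            # The value looks like "  9000000 kB"; keep only the number.
            fields[key] = int(rest.split()[0])
    return fields

def read_status(pid):
    """Read and parse the status file for a pid (or "self")."""
    with open(f"/proc/{pid}/status") as f:
        return parse_status(f.read())

if __name__ == "__main__":
    # Pass the slurmctld pid as the first argument; defaults to this process.
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    vm = read_status(pid)
    print(f"VmSize: {vm['VmSize'] / 1024**2:.2f} GiB (counted against 'as')")
    print(f"VmRSS:  {vm['VmRSS'] / 1024**2:.2f} GiB (actually resident)")
```

A VmSize far above VmRSS would match the behavior described here, though this sketch only reports the numbers and says nothing about the cause.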

I now have the following in /etc/security/limits.d/91-as.conf on our controller nodes:

*   soft    as      16777216
*   hard    as      16777216
*   soft    memlock 16777216
*   hard    memlock 16777216
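
To confirm what a process actually inherits, the limits can be read back. A minimal sketch in Python, assuming the limits.conf values for "as" and "memlock" are in KiB (as the pam_limits documentation states) while getrlimit() reports bytes. The helper names are hypothetical. Note that 16777216 KiB works out to 16 GiB.

```python
import resource

def kib_to_gib(kib):
    """Convert a limits.conf value (KiB) to GiB."""
    return kib / (1024 * 1024)

def format_limit(value_bytes):
    """Render an rlimit value (bytes); RLIM_INFINITY means no limit."""
    if value_bytes == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value_bytes / 1024**3:.1f} GiB"

def current_limits():
    """Return {name: (soft, hard)} for address space and locked memory."""
    limits = {}
    for name, res in (("as", resource.RLIMIT_AS),
                      ("memlock", resource.RLIMIT_MEMLOCK)):
        soft, hard = resource.getrlimit(res)
        limits[name] = (format_limit(soft), format_limit(hard))
    return limits

if __name__ == "__main__":
    print("16777216 KiB =", kib_to_gib(16777216), "GiB")
    for name, (soft, hard) in current_limits().items():
        print(f"{name}: soft={soft} hard={hard}")
```

This reports the limits of the shell it is run from. For the running daemon itself, /proc/<pid>/limits shows the values that process is really subject to.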

Slurmctld has been rock solid since this change. This cluster has 1136 cores, BTW.

On 08/18/14 14:58, Marcin Stolarek wrote:
Re: [slurm-dev] How to size the controller systems

On Monday, 18 August 2014, Jason Bacon <[email protected]> wrote:


    The controller generally shouldn't require much, but if you're
    running Linux, be aware that the way memory use is measured in
    recent kernels makes it look like slurmctld is using a lot of RAM


Can you point me to detailed information about that? How is the memory measured?

    when multiple threads are active.  I had to up the per-process
    limit to 10G on our CentOS 6.5 controller nodes, even though
    slurmctld was using less than 1G in reality.

    Regards,

        Jason

    On 8/18/14 1:08 PM, Louis Capps wrote:

    Hi,
    We are looking at using SLURM for a large 6000 node cluster and
    need more info on the support systems.  Can you point me to a
    sizing guide or info on the requirements for the primary and
    backup controllers for SLURM including CPU, memory and local disk
    requirements?

    Thx,
    Louis


    
*******************************************************************************************
    Louis Capps     ([email protected])
      --- Systems Architect - Federal High Performance Computing - US
    Federal IMT - IBM Corporation
      --- Office (512)286-5556, t/l 363-5556 --- fax 678-6146 ---
    cell (512)796-4501
          --- Bld 045, 3C80, Austin, TX
    http://www-1.ibm.com/servers/deepcomputing/
    http://www-03.ibm.com/systems/clusters/
    
*******************************************************************************************




-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
       Jason W. Bacon
       [email protected]

       Circumstances don't make a man:
       They reveal him.
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Jason W. Bacon
  [email protected]

  Circumstances don't make a man:
  They reveal him.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
