[slurm-dev] Re: Controlling the output of 'scontrol show hostlist'?

2017-06-26 Thread Michael Jennings

On Friday, 23 June 2017, at 00:46:58 (-0600),
Ole Holm Nielsen wrote:

> The requirement of removing "=" must be a bug since "rpmbuild --help" says:
> 
>   --with=<option>           enable configure <option> for build
>   --without=<option>        disable configure <option> for build

RPM binaries handle certain command-line options via popt aliases; see
/usr/lib/rpm/rpmpopt-*.  It very well may be a popt bug... I'm not sure
how much exercise that particular code path gets.
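
For what it's worth, the aliases themselves can be inspected directly,
and the space-separated form is the one that works in practice; a rough
sketch (the "slurm" feature name is just an example of a --with flag,
not something confirmed in this thread):

  # show the popt aliases that implement --with/--without
  grep -h -- '--with' /usr/lib/rpm/rpmpopt-*

  # the space-separated form handled by those aliases
  rpmbuild --with slurm --rebuild pdsh-2.29-1.el7.src.rpm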

> The PDSH homepage has apparently moved recently to
> https://github.com/grondo/pdsh and now offers version pdsh-2.32
> (2017-06-22).  However, I'm currently unable to build an RPM from
> this version.

Yeah, it looks like the (new?) maintainer is making some invalid build
assumptions (such as assuming "git describe" will work, thereby
breaking builds from tarballs) that will take some time to work
through.  I will hammer through it eventually, but at the moment I
have too much other stuff going on to spare the cycles.  Maybe someone
else will beat me to it.  :-)

> Michael: From where can one download pdsh-2.29-1.el7.src.rpm?  I can
> only find version 2.31 at 
> https://dl.fedoraproject.org/pub/epel/7/SRPMS/p/pdsh-2.31-1.el7.src.rpm,
> and this version won't build on EL6.

It may or may not still be there, but I pulled it from:

https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/pdsh/pdsh-2.29.tar.bz2

HTH,
Michael

-- 
Michael E. Jennings 
HPC Systems Team, Los Alamos National Laboratory
Bldg. 03-2327, Rm. 2341 W: +1 (505) 606-0605


[slurm-dev] Call For ABSTRACTS - Slurm User Group Meeting 2017

2017-06-26 Thread Jacob Jenson
As a reminder, Slurm User Group Meeting Abstract submissions are *due at 
the end of this week* on *June 30, 2017*.


You are invited to submit an abstract for a tutorial, technical 
presentation, or site report to be given at the Slurm User Group Meeting 
2017. This event is sponsored and organized by the National Energy 
Research Scientific Computing Center (NERSC) and SchedMD. It will be 
held at NERSC on 24-26 September 2017.


Everyone who wants to present their own usage, developments, site 
report, or tutorial about Slurm is invited to send an abstract to 
sl...@schedmd.com


*Important Dates:*
30 June 2017: Abstracts due
15 July 2017: Notification of acceptance
24-26 September 2017: Slurm User Group Meeting



[slurm-dev] Re: slurm-dev Announce: Node status tool "pestat" for Slurm updated to version 0.50

2017-06-26 Thread Adrian Sevcenco


On 06/22/2017 01:34 PM, Ole Holm Nielsen wrote:


I'm announcing an updated version 0.50 of the node status tool "pestat" 
for Slurm.  I discovered how to obtain the node Free Memory with sinfo, 
so now we can do nice things with memory usage!


Hi! Thank you for the great tool! I don't know if this is intended, but:

[Monday 26.06.17 18:12] adrian@sev : ~  $
sinfo -N -t idle -o "%N %P %C %O %m %e %t" | column -t
NODELIST   PARTITION  CPUS(A/I/O/T)  CPU_LOAD  MEMORY  FREE_MEM  STATE
localhost  local*     0/8/0/8        0.03      14984   201       idle

[Monday 26.06.17 18:13] adrian@sev : ~  $
free -m
              total        used        free      shared  buff/cache   available
Mem:          14984         392         182         134       14409       14081
Swap:          8191           0        8191

[Monday 26.06.17 18:13] adrian@sev : ~  $
pestat
Hostname   Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                         State Use/Tot           (MB)     (MB)    JobId User ...
localhost  local*         idle   0   8    0.03    14984     201*


While it is clear that the reported free memory is what "free" reports 
in its "free" column, one might argue that buffers/cache is also memory 
available for use, since it shrinks as applications need it ...


Maybe FREE_MEM should be reported as (free + cached)?
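
For illustration, such a figure can be computed directly from the same
"free -m" output (a minimal sketch, assuming the procps column layout
shown above):

  # free + buff/cache, i.e. memory that applications could still claim (MB)
  free -m | awk '/^Mem:/ {print $4 + $6}'

  # newer procps already estimates this as the "available" column (MB)
  free -m | awk '/^Mem:/ {print $7}'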

Thank you!!
Adrian




New features:

1. The "pestat -f" will flag nodes with less than 20% free memory.

2. Now "pestat -m 1000" will print nodes with less than 1000 MB free 
memory.


3. Use "pestat -M 20" to print nodes with greater than 20 MB 
free memory.  Jobs on such under-utilized nodes might better be 
submitted to lower-memory nodes.


Download the tool (a short bash script) from 
https://ftp.fysik.dtu.dk/Slurm/pestat. If your commands do not live in 
/usr/bin, please make appropriate changes in the CONFIGURE section at 
the top of the script.


Usage: pestat [-p partition(s)] [-u username] [-q qoslist] [-s statelist]
       [-f | -m free_mem | -M free_mem ] [-V] [-h]
where:
 -p partition: Select only this partition
 -u username: Print only this user
 -q qoslist: Print only QOS in the qoslist
 -s statelist: Print only nodes with a state in the statelist
 -f: Print only nodes that are flagged by * (unexpected load etc.)
 -m free_mem: Print only nodes with free memory LESS than free_mem MB
 -M free_mem: Print only nodes with free memory GREATER than free_mem MB (under-utilized)
 -h: Print this help information
 -V: Version information


I use "pestat -f" all the time because it prints and flags (in color) 
only the nodes which have an unexpected CPU load or node status, for 
example:


# pestat  -f
Print only nodes that are flagged by *
Hostname   Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                         State Use/Tot           (MB)     (MB)    JobId User ...
 a066      xeon8*        alloc   8   8    8.04    23900     173*  91683 user01
 a067      xeon8*        alloc   8   8    8.07    23900     181*  91683 user01
 a083      xeon8*        alloc   8   8    8.06    23900     172*  91683 user01



The -s option is useful for checking on possibly unusual node states, 
for example:


# pestat -s mixed




--
--
Adrian Sevcenco, Ph.D.   |
Institute of Space Science - ISS, Romania|
adrian.sevcenco at {cern.ch,spacescience.ro} |
--


[slurm-dev] Re: Announce: Node status tool "pestat" for Slurm updated to version 0.50

2017-06-26 Thread Belgin, Mehmet

Hi Ole,


It's possible that it was a temporary glitch, because everything looks OK to me now.

Hostname      Partition    Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                           State Use/Tot           (MB)     (MB)    JobId User ...
devel-pcomp1  vtest*        idle   0  12    0.06   129080   124674
devel-vcomp1  vtest*        idle   0   2    0.00     5845     4371
...

I don't really know what caused the zero values before, but then again I was 
playing with several components at a time, including HA.

Thank you!
-Mehmet



From: Ole Holm Nielsen 
Sent: Monday, June 26, 2017 6:06:46 AM
To: slurm-dev
Subject: [slurm-dev] Re: Announce: Node status tool "pestat" for Slurm updated 
to version 0.50


On 23-06-2017 17:20, Belgin, Mehmet wrote:
> One thing I noticed is that pestat reports zero Freemem until a job is
> allocated on nodes. I’d expect it to report the same value as Memsize if
> no jobs are running. I wanted to offer this as a suggestion since zero
> free memory on idle nodes may be a bit confusing for users.
...
> Before Job allocation
> # pestat -p vtest
> Print only nodes in partition vtest
> Hostname      Partition    Node Num_CPU  CPUload  Memsize  Freemem  Joblist
>                            State Use/Tot           (MB)     (MB)    JobId User ...
> devel-pcomp1  vtest*        idle   0  12    0.02   129080       *0*
> devel-vcomp1  vtest*        idle   0   2    0.02     5845       *0*
> devel-vcomp2  vtest*        idle   0   2    0.00     5845       *0*
> devel-vcomp3  vtest*        idle   0   2    0.03     5845       *0*
> devel-vcomp4  vtest*        idle   0   2    0.01     5845       *0*

I'm not seeing the incorrect Freemem that you report.  I get sensible
numbers for Freemem:

# pestat -s idle
Select only nodes with state=idle
Hostname   Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                         State Use/Tot           (MB)     (MB)    JobId User ...
 a017      xeon8*         idle   0   8    4.25*   23900    21590
 a077      xeon8*         idle   0   8    3.47*   23900    22964
 b003      xeon8*         idle   0   8    8.01*   23900    16839
 b046      xeon8*         idle   0   8    0.01    23900    22393
 b066      xeon8*         idle   0   8    2.84*   23900    18610
 b081      xeon8*         idle   0   8    0.01    23900    21351
 g021      xeon16         idle   0  16    0.01    64000    52393
 g022      xeon16         idle   0  16    0.01    64000    60717
 g039      xeon16         idle   0  16    0.01    64000    61795
 g048      xeon16         idle   0  16    0.01    64000    62338
 g074      xeon16         idle   0  16    0.01    64000    62274
 g076      xeon16         idle   0  16    0.01    64000    58854

You should use sinfo directly to verify Slurm's data:

  sinfo -N -t idle -o "%N %P %C %O %m %e %t"

FYI: We run Slurm 16.05 and have configured Cgroups.

/Ole


[slurm-dev] Re: Dry run upgrade procedure for the slurmdbd database

2017-06-26 Thread Ole Holm Nielsen


Thanks Paul!  Would you know the answer to:

Question 2: Can anyone confirm that the output "slurmdbd: debug2: 
Everything rolled up" indeed signifies that conversion is complete?


Thanks,
Ole

On 06/26/2017 04:02 PM, Paul Edmon wrote:


Yeah, we keep around a test cluster environment for that purpose to vet 
slurm upgrades before we roll them on the production cluster.


Thus far no problems.  However, paranoia is usually a good thing for 
cases like this.


-Paul Edmon-


On 06/26/2017 07:30 AM, Ole Holm Nielsen wrote:


On 06/26/2017 01:24 PM, Loris Bennett wrote:

We have also upgraded in place continuously from 2.2.4 to currently
16.05.10 without any problems.  As I mentioned previously, it can be
handy to make a copy of the statesave directory, once the daemons have
been stopped.

However, if you want to know how long the upgrade might take, then yours
is a good approach.  What is your use case here?  Do you want to inform
the users about the length of the outage with regard to job submission?


I want to be 99.9% sure that upgrading (my first one) will actually 
work.  I also want to know roughly how long the slurmdbd will be down 
so that the cluster doesn't kill all jobs due to timeouts.  Better to 
be safe than sorry.


I don't expect to inform the users, since the operation is expected to 
run smoothly without troubles for user jobs.


Thanks,
Ole


Re: [slurm-dev] Dry run upgrade procedure for the slurmdbd database
We did it in place, worked as noted on the tin. It was less painful

than I expected. TBH, your procedures are admirable, but you shouldn't
worry - it's a relatively smooth process.

cheers
L.

--
"Mission Statement: To provide hope and inspiration for collective 
action, to build collective power, to achieve collective 
transformation, rooted in grief and rage but pointed towards vision 
and dreams."


- Patrisse Cullors, Black Lives Matter founder

On 26 June 2017 at 20:04, Ole Holm Nielsen 
 wrote:


  We're planning to upgrade Slurm 16.05 to 17.02 soon. The most 
critical step seems to me to be the upgrade of the slurmdbd 
database, which may also take tens of minutes.


  I thought it's a good idea to test the slurmdbd database upgrade 
locally on a drained compute node in order to verify both 
correctness and the time required.


  I've developed the dry run upgrade procedure documented in the 
Wiki page 
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm


  Question 1: Would people who have real-world Slurm upgrade 
experience kindly offer comments on this procedure?


  My testing was actually successful, and the database conversion 
took less than 5 minutes in our case.


  A crucial step is starting the slurmdbd manually after the 
upgrade. But how can we be sure that the database conversion has 
been 100% completed?


  Question 2: Can anyone confirm that the output "slurmdbd: debug2: 
Everything rolled up" indeed signifies that conversion is complete?


  Thanks,
  Ole








--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620


[slurm-dev]

2017-06-26 Thread Somayah H
Hi,

I'm currently doing a comparison of SLURM and PBS/TORQUE, looking in
particular at their strengths and weaknesses, at the services that one
provides and the other does not, and, most importantly, at the
difference in their performance.
I would be very grateful if anyone who has experience with both of the
above schedulers could provide me with the information I need.

Thanks in advance.
Regards,
Soma.


[slurm-dev] Re: Dry run upgrade procedure for the slurmdbd database

2017-06-26 Thread Paul Edmon


Yeah, we keep around a test cluster environment for that purpose to vet 
slurm upgrades before we roll them on the production cluster.


Thus far no problems.  However, paranoia is usually a good thing for 
cases like this.


-Paul Edmon-


On 06/26/2017 07:30 AM, Ole Holm Nielsen wrote:


On 06/26/2017 01:24 PM, Loris Bennett wrote:

We have also upgraded in place continuously from 2.2.4 to currently
16.05.10 without any problems.  As I mentioned previously, it can be
handy to make a copy of the statesave directory, once the daemons have
been stopped.

However, if you want to know how long the upgrade might take, then yours
is a good approach.  What is your use case here?  Do you want to inform
the users about the length of the outage with regard to job submission?


I want to be 99.9% sure that upgrading (my first one) will actually 
work.  I also want to know roughly how long the slurmdbd will be down 
so that the cluster doesn't kill all jobs due to timeouts.  Better to 
be safe than sorry.


I don't expect to inform the users, since the operation is expected to 
run smoothly without troubles for user jobs.


Thanks,
Ole


Re: [slurm-dev] Dry run upgrade procedure for the slurmdbd database

We did it in place, worked as noted on the tin. It was less painful
than I expected. TBH, your procedures are admirable, but you shouldn't
worry - it's a relatively smooth process.

cheers
L.

--
"Mission Statement: To provide hope and inspiration for collective 
action, to build collective power, to achieve collective 
transformation, rooted in grief and rage but pointed towards vision 
and dreams."


- Patrisse Cullors, Black Lives Matter founder

On 26 June 2017 at 20:04, Ole Holm Nielsen 
 wrote:


  We're planning to upgrade Slurm 16.05 to 17.02 soon. The most 
critical step seems to me to be the upgrade of the slurmdbd 
database, which may also take tens of minutes.


  I thought it's a good idea to test the slurmdbd database upgrade 
locally on a drained compute node in order to verify both 
correctness and the time required.


  I've developed the dry run upgrade procedure documented in the 
Wiki page 
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm


  Question 1: Would people who have real-world Slurm upgrade 
experience kindly offer comments on this procedure?


  My testing was actually successful, and the database conversion 
took less than 5 minutes in our case.


  A crucial step is starting the slurmdbd manually after the 
upgrade. But how can we be sure that the database conversion has 
been 100% completed?


  Question 2: Can anyone confirm that the output "slurmdbd: debug2: 
Everything rolled up" indeed signifies that conversion is complete?


  Thanks,
  Ole








[slurm-dev] Re: Dry run upgrade procedure for the slurmdbd database

2017-06-26 Thread Ole Holm Nielsen


On 06/26/2017 01:24 PM, Loris Bennett wrote:

We have also upgraded in place continuously from 2.2.4 to currently
16.05.10 without any problems.  As I mentioned previously, it can be
handy to make a copy of the statesave directory, once the daemons have
been stopped.

However, if you want to know how long the upgrade might take, then yours
is a good approach.  What is your use case here?  Do you want to inform
the users about the length of the outage with regard to job submission?


I want to be 99.9% sure that upgrading (my first one) will actually 
work.  I also want to know roughly how long the slurmdbd will be down so 
that the cluster doesn't kill all jobs due to timeouts.  Better to be 
safe than sorry.


I don't expect to inform the users, since the operation is expected to 
run smoothly without troubles for user jobs.


Thanks,
Ole


Re: [slurm-dev] Dry run upgrade procedure for the slurmdbd database

We did it in place, worked as noted on the tin. It was less painful
than I expected. TBH, your procedures are admirable, but you shouldn't
worry - it's a relatively smooth process.

cheers
L.

--
"Mission Statement: To provide hope and inspiration for collective action, to build 
collective power, to achieve collective transformation, rooted in grief and rage but 
pointed towards vision and dreams."

- Patrisse Cullors, Black Lives Matter founder

On 26 June 2017 at 20:04, Ole Holm Nielsen  wrote:

  We're planning to upgrade Slurm 16.05 to 17.02 soon. The most critical step 
seems to me to be the upgrade of the slurmdbd database, which may also take 
tens of minutes.

  I thought it's a good idea to test the slurmdbd database upgrade locally on a 
drained compute node in order to verify both correctness and the time required.

  I've developed the dry run upgrade procedure documented in the Wiki page 
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm

  Question 1: Would people who have real-world Slurm upgrade experience kindly 
offer comments on this procedure?

  My testing was actually successful, and the database conversion took less 
than 5 minutes in our case.

  A crucial step is starting the slurmdbd manually after the upgrade. But how 
can we be sure that the database conversion has been 100% completed?

  Question 2: Can anyone confirm that the output "slurmdbd: debug2: Everything 
rolled up" indeed signifies that conversion is complete?

  Thanks,
  Ole






--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620


[slurm-dev] Re: Dry run upgrade procedure for the slurmdbd database

2017-06-26 Thread Loris Bennett

Hi Ole,

We have also upgraded in place continuously from 2.2.4 to currently
16.05.10 without any problems.  As I mentioned previously, it can be
handy to make a copy of the statesave directory, once the daemons have
been stopped.

However, if you want to know how long the upgrade might take, then yours
is a good approach.  What is your use case here?  Do you want to inform
the users about the length of the outage with regard to job submission?

Cheers,

Loris

Lachlan Musicman  writes:

> Re: [slurm-dev] Dry run upgrade procedure for the slurmdbd database 
>
> We did it in place, worked as noted on the tin. It was less painful
> than I expected. TBH, your procedures are admirable, but you shouldn't
> worry - it's a relatively smooth process.
>
> cheers
> L.
>
> --
> "Mission Statement: To provide hope and inspiration for collective action, to 
> build collective power, to achieve collective transformation, rooted in grief 
> and rage but pointed towards vision and dreams."
>
> - Patrisse Cullors, Black Lives Matter founder
>
> On 26 June 2017 at 20:04, Ole Holm Nielsen  wrote:
>
>  We're planning to upgrade Slurm 16.05 to 17.02 soon. The most critical step 
> seems to me to be the upgrade of the slurmdbd database, which may also take 
> tens of minutes.
>
>  I thought it's a good idea to test the slurmdbd database upgrade locally on 
> a drained compute node in order to verify both correctness and the time 
> required.
>
>  I've developed the dry run upgrade procedure documented in the Wiki page 
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
>
>  Question 1: Would people who have real-world Slurm upgrade experience kindly 
> offer comments on this procedure?
>
>  My testing was actually successful, and the database conversion took less 
> than 5 minutes in our case.
>
>  A crucial step is starting the slurmdbd manually after the upgrade. But how 
> can we be sure that the database conversion has been 100% completed?
>
>  Question 2: Can anyone confirm that the output "slurmdbd: debug2: Everything 
> rolled up" indeed signifies that conversion is complete?
>
>  Thanks,
>  Ole
>
>

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: Dry run upgrade procedure for the slurmdbd database

2017-06-26 Thread Lachlan Musicman
We did it in place, worked as noted on the tin. It was less painful than I
expected. TBH, your procedures are admirable, but you shouldn't worry -
it's a relatively smooth process.

cheers
L.

--
"Mission Statement: To provide hope and inspiration for collective action,
to build collective power, to achieve collective transformation, rooted in
grief and rage but pointed towards vision and dreams."

 - Patrisse Cullors, *Black Lives Matter founder*

On 26 June 2017 at 20:04, Ole Holm Nielsen 
wrote:

>
> We're planning to upgrade Slurm 16.05 to 17.02 soon.  The most critical
> step seems to me to be the upgrade of the slurmdbd database, which may also
> take tens of minutes.
>
> I thought it's a good idea to test the slurmdbd database upgrade locally
> on a drained compute node in order to verify both correctness and the time
> required.
>
> I've developed the dry run upgrade procedure documented in the Wiki page
> https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
>
> Question 1: Would people who have real-world Slurm upgrade experience
> kindly offer comments on this procedure?
>
> My testing was actually successful, and the database conversion took less
> than 5 minutes in our case.
>
> A crucial step is starting the slurmdbd manually after the upgrade.  But
> how can we be sure that the database conversion has been 100% completed?
>
> Question 2: Can anyone confirm that the output "slurmdbd: debug2:
> Everything rolled up" indeed signifies that conversion is complete?
>
> Thanks,
> Ole
>


[slurm-dev] Re: Controlling the output of 'scontrol show hostlist'?

2017-06-26 Thread Ole Holm Nielsen


Hi Kilian,

Thanks for explaining how to configure ClusterShell correctly for Slurm! 
 I've updated my Wiki information in 
https://wiki.fysik.dtu.dk/niflheim/SLURM#clustershell now.


I would suggest you add your examples to the ClusterShell 
documentation, where I feel this information may be hidden or missing.


/Ole

On 06/23/2017 06:37 PM, Kilian Cavalotti wrote:

But how do I configure it for Slurm?  I've copied the example file to
/etc/clustershell/groups.conf.d/slurm.conf, but this doesn't enable Slurm
partitions (here: xeon24) as ClusterShell groups:

# clush -g xeon24 date
Usage: clush [options] command
clush: error: No node to run on.

Could you kindly explain this (and perhaps add examples to the
documentation)?
> 
Cheers,



--
Ole Holm Nielsen
PhD, Manager of IT services
Department of Physics, Technical University of Denmark,
Building 307, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Tel: (+45) 4525 3187 / Mobile (+45) 5180 1620

Sure! That's because the groups.conf.d/slurm.conf file defines new
group sources [1]. ClusterShell supports multiple group sources, i.e.
multiple sources of information to define groups. There is a default
one, defined in groups.conf, which is used when a group name is given
without specifying anything else, as in your "clush -g xeon24
date" command. But since the "slurm" group source is not the default,
it's not used to map the "xeon24" group to the corresponding Slurm
partition.
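
For reference, a group source of this kind is just an INI-style section
whose entries shell out to sinfo.  A minimal sketch of what
/etc/clustershell/groups.conf.d/slurm.conf can contain (section names
and fields here are illustrative and may differ from the shipped
example file):

[slurm,sp]
map: sinfo -h -o "%N" -p $GROUP
all: sinfo -h -o "%N"
list: sinfo -h -o "%R"

[slurmstate,st]
map: sinfo -h -o "%N" -t $GROUP
list: sinfo -h -o "%T"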

So, you can either:

* use the -s option to specify a group source, or prefix the group
name with the group source name in the command line, like this:

 $ clush -s slurm -g xeon24 date

or, more compactly:

$ clush -w@slurm:xeon24 date

* or if you don't plan to use any other group source than "slurm", you
can make it the default with the following in
/etc/clustershell/groups.conf:

[Main]
# Default group source
default: slurm


With the example Slurm group source, you can easily execute commands
on all the nodes from a given partition, but also on nodes based on
their Slurm state, like:

$ clush -w@slurmstate:drained date


Hope this makes things a bit clearer.

[1] 
https://clustershell.readthedocs.io/en/latest/config.html#external-group-sources


[slurm-dev] Re: Announce: Node status tool "pestat" for Slurm updated to version 0.50

2017-06-26 Thread Ole Holm Nielsen


On 23-06-2017 17:20, Belgin, Mehmet wrote:
One thing I noticed is that pestat reports zero Freemem until a job is 
allocated on nodes. I’d expect it to report the same value as Memsize if 
no jobs are running. I wanted to offer this as a suggestion since zero 
free memory on idle nodes may be a bit confusing for users.

...

Before Job allocation
# pestat -p vtest
Print only nodes in partition vtest
Hostname      Partition    Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                           State Use/Tot           (MB)     (MB)    JobId User ...
devel-pcomp1  vtest*        idle   0  12    0.02   129080       *0*
devel-vcomp1  vtest*        idle   0   2    0.02     5845       *0*
devel-vcomp2  vtest*        idle   0   2    0.00     5845       *0*
devel-vcomp3  vtest*        idle   0   2    0.03     5845       *0*
devel-vcomp4  vtest*        idle   0   2    0.01     5845       *0*


I'm not seeing the incorrect Freemem that you report.  I get sensible 
numbers for Freemem:


# pestat -s idle
Select only nodes with state=idle
Hostname   Partition     Node Num_CPU  CPUload  Memsize  Freemem  Joblist
                         State Use/Tot           (MB)     (MB)    JobId User ...
 a017      xeon8*         idle   0   8    4.25*   23900    21590
 a077      xeon8*         idle   0   8    3.47*   23900    22964
 b003      xeon8*         idle   0   8    8.01*   23900    16839
 b046      xeon8*         idle   0   8    0.01    23900    22393
 b066      xeon8*         idle   0   8    2.84*   23900    18610
 b081      xeon8*         idle   0   8    0.01    23900    21351
 g021      xeon16         idle   0  16    0.01    64000    52393
 g022      xeon16         idle   0  16    0.01    64000    60717
 g039      xeon16         idle   0  16    0.01    64000    61795
 g048      xeon16         idle   0  16    0.01    64000    62338
 g074      xeon16         idle   0  16    0.01    64000    62274
 g076      xeon16         idle   0  16    0.01    64000    58854

You should use sinfo directly to verify Slurm's data:

 sinfo -N -t idle -o "%N %P %C %O %m %e %t"
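
As a quick cross-check on a single node (a sketch; "a017" is just a
placeholder hostname, and note that the units differ):

  sinfo -N -n a017 -o "%N %m %e"        # MEMORY and FREE_MEM in MB, as Slurm sees them
  ssh a017 grep MemFree /proc/meminfo   # MemFree in kB, as the node's kernel reports it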

FYI: We run Slurm 16.05 and have configured Cgroups.

/Ole


[slurm-dev] Dry run upgrade procedure for the slurmdbd database

2017-06-26 Thread Ole Holm Nielsen


We're planning to upgrade Slurm 16.05 to 17.02 soon.  The most critical 
step seems to me to be the upgrade of the slurmdbd database, which may 
also take tens of minutes.


I thought it's a good idea to test the slurmdbd database upgrade locally 
on a drained compute node in order to verify both correctness and the 
time required.


I've developed the dry run upgrade procedure documented in the Wiki page 
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm
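
In rough outline, such a dry run can look like this (a sketch only, not
the exact Wiki procedure; the database name is the Slurm default and the
paths are illustrative):

  # on the production slurmdbd host: dump the accounting database
  mysqldump slurm_acct_db > /tmp/slurm_acct_db.sql

  # on the drained test node: install the new Slurm packages plus a local
  # MariaDB, load the dump, and point a copy of slurmdbd.conf at localhost
  mysql -e 'CREATE DATABASE slurm_acct_db'
  mysql slurm_acct_db < /tmp/slurm_acct_db.sql

  # run the new slurmdbd in the foreground with verbose logging and watch
  # the schema conversion messages until they settle
  slurmdbd -D -vvv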


Question 1: Would people who have real-world Slurm upgrade experience 
kindly offer comments on this procedure?


My testing was actually successful, and the database conversion took 
less than 5 minutes in our case.


A crucial step is starting the slurmdbd manually after the upgrade.  But 
how can we be sure that the database conversion has been 100% completed?


Question 2: Can anyone confirm that the output "slurmdbd: debug2: 
Everything rolled up" indeed signifies that conversion is complete?


Thanks,
Ole