Re: [slurm-users] Slurm Upgrade

2020-11-02 Thread Ole Holm Nielsen

On 11/2/20 2:25 PM, navin srivastava wrote:

Currently we are running slurm version 17.11.x and wanted to move to 20.x.

We are building the New server with Slurm 20.2 version and planning to 
upgrade the client nodes from 17.x to 20.x.


wanted to check if we can upgrade the Client from 17.x to 20.x directly or 
we need to go through 17.x to 18.x and 19.x then 20.x


I have described the Slurm upgrade process in my Wiki page:
https://wiki.fysik.dtu.dk/niflheim/Slurm_installation#upgrading-slurm

It's based upon our experiences and the Slurm documentation, and it seems to work 
correctly.


/Ole



Re: [slurm-users] Reserving a GPU (Christopher Benjamin Coffey)

2020-11-02 Thread Christopher Benjamin Coffey
Hi All,

Anyone know if it's possible yet to reserve a GPU?  Maybe in 20.02? Thanks!

Best,
Chris
 
-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167
 
 

On 5/19/20, 3:04 PM, "slurm-users on behalf of Christopher Benjamin Coffey" 
 wrote:

Hi Lisa,

I'm actually referring to the ability to create a reservation that includes 
a GPU resource. It doesn't seem to be possible, which seems strange. A floating 
GPU reservation would be very helpful for us.

Best,
Chris

-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167



On 5/19/20, 1:47 PM, "slurm-users on behalf of Lisa Kay Weihl" 
 wrote:


I am a newbie at Slurm setup, but if by reservable you also mean a 
consumable resource, I am able to request GPUs with Slurm 20.02.1 and CUDA 
10.2.  I just set this up within the last month.
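
As a concrete illustration of that kind of per-job request (this assumes 
GresTypes=gpu is set in slurm.conf and gres.conf lists the devices; the 
counts and commands below are only examples):

    srun --gres=gpu:1 nvidia-smi
    sbatch --gres=gpu:2 --wrap "hostname"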




***
Lisa Weihl Systems Administrator
Computer Science, Bowling Green State University
Tel: (419) 372-0116   |Fax: (419) 372-8061
lwe...@bgsu.edu

http://www.bgsu.edu/











Message: 1
Date: Tue, 19 May 2020 18:19:26 +
From: Christopher Benjamin Coffey 
To: Slurm User Community List 
Subject: Re: [slurm-users] Reserving a GPU
Message-ID: <387dee1d-f060-47c3-afb9-0309684c2...@nau.edu>
Content-Type: text/plain; charset="utf-8"

Hi All,

Can anyone confirm that GPU is still not a reservable resource? It 
still doesn't seem to be possible in 19.05.6. I haven't tried the 20.02 series.

Best,
Chris

-- 
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167



On 11/11/18, 1:19 AM, "slurm-users on behalf of Chris Samuel" 
 wrote:

On Tuesday, 6 November 2018 5:30:31 AM AEDT Christopher Benjamin 
Coffey wrote:

> Can anyone else confirm that it is not possible to reserve a GPU? Seems a
> bit strange.

This looks like the bug that was referred to previously.



https://bugs.schedmd.com/show_bug.cgi?id=5771


Although looking at the manual page for scontrol in the current 
master it only says:

       TRES=
           Comma-separated list of TRES required for the reservation.
           Current supported TRES types with reservations are: CPU,
           Node, License and BB.

But it's early days yet for that release...
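
As a sketch of the TRES= form the man page describes (the reservation 
name, user, and CPU count here are made up, and only the TRES types 
listed above are accepted at this point):

    scontrol create reservation ReservationName=test_resv StartTime=now \
        Duration=120 Users=alice TRES=cpu=64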

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC











Re: [slurm-users] Slurm Upgrade

2020-11-02 Thread Paul Edmon
We have hit this when we naively ran it using the service and it timed out 
and borked the database.  Fortunately we had a backup to go back to.  
Since then we have run it straight from the command line.  Like yours, 
our production DB is now 23 GB for 6 months' worth of data, so major 
schema updates take roughly 1-2 hours for us.


-Paul Edmon-

On 11/2/2020 11:15 AM, Chris Samuel wrote:

On 11/2/20 7:31 am, Paul Edmon wrote:

e. Run slurmdbd -Dv to do the database upgrade. Depending on the 
upgrade this can take a while because of database schema changes.


I'd like to emphasise the importance of doing the DB upgrade in this 
way: do not use systemctl for this, because if systemd runs out of patience 
waiting for slurmdbd to finish the migration and start up, it can kill 
it partway through the migration.


Fortunately not something I've run into myself, but as our mysqldump 
of our production DB is approaching 100GB now it's not something we 
want to run into!


All the best,
Chris




Re: [slurm-users] Slurm Upgrade

2020-11-02 Thread Chris Samuel

On 11/2/20 7:31 am, Paul Edmon wrote:

e. Run slurmdbd -Dv to do the database upgrade. Depending on the 
upgrade this can take a while because of database schema changes.


I'd like to emphasise the importance of doing the DB upgrade in this way: 
do not use systemctl for this, because if systemd runs out of patience waiting 
for slurmdbd to finish the migration and start up, it can kill it partway 
through the migration.


Fortunately not something I've run into myself, but as our mysqldump of 
our production DB is approaching 100GB now it's not something we want to 
run into!
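
A minimal sketch of that foreground migration (the unit name, dump file, 
and slurm_acct_db database name are just the common defaults, and DB 
credentials are omitted):

    systemctl stop slurmdbd
    mysqldump --single-transaction slurm_acct_db > slurm_acct_db.pre-upgrade.sql
    slurmdbd -D -vvv    # foreground, verbose; the schema conversion runs at startup
    # once it reports the conversion is finished and it is serving requests,
    # stop it (Ctrl-C) and hand it back to systemd:
    systemctl start slurmdbd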


All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA



Re: [slurm-users] Slurm Upgrade

2020-11-02 Thread Paul Edmon
We haven't really had MPI ugliness with the latest versions. Plus we've 
been rolling our own PMIx and building against that, which seems to have 
solved most of the cross-compatibility issues.


-Paul Edmon-

On 11/2/2020 10:38 AM, Fulcomer, Samuel wrote:
Our strategy is a bit simpler. We're migrating compute nodes to a new 
cluster running 20.x. This isn't an upgrade. We'll keep the old 
slurmdbd running for at least enough time to suck the remaining 
accounting data into XDMoD.


The old cluster will keep running jobs until there are no more to run. 
We'll drain and move nodes to the new cluster as we start seeing more 
and more idle nodes in the old cluster. This avoids MPI ugliness and 
we move directly to 20.x.




On Mon, Nov 2, 2020 at 9:28 AM Paul Edmon wrote:


In general  I would follow this:

https://slurm.schedmd.com/quickstart_admin.html#upgrade


Namely:

Almost every new major release of Slurm (e.g. 19.05.x to 20.02.x)
involves changes to the state files with new data structures, new
options, etc. Slurm permits upgrades to a new major release from
the past two major releases, which happen every nine months (e.g.
18.08.x or 19.05.x to 20.02.x) without loss of jobs or other state
information. State information from older versions will not be
recognized and will be discarded, resulting in loss of all running
and pending jobs. State files are *not* recognized when
downgrading (e.g. from 19.05.x to 18.08.x) and will be discarded,
resulting in loss of all running and pending jobs. For this
reason, creating backup copies of state files (as described below)
can be of value. Therefore when upgrading Slurm (more precisely,
the slurmctld daemon), saving the /StateSaveLocation/ (as defined
in /slurm.conf/) directory contents with all state information is
recommended. If you need to downgrade, restoring that directory's
contents will let you recover the jobs. Jobs submitted under the
new version will not be in those state files, but it can let you
recover most jobs. An exception to this is that jobs may be lost
when installing new pre-release versions (e.g. 20.02.0-pre1 to
20.02.0-pre2). Developers will try to note these cases in the NEWS
file. Contents of major releases are also described in the
RELEASE_NOTES file.

So I wouldn't go directly to 20.x, instead I would go from 17.x to
19.x and then to 20.x

-Paul Edmon-

On 11/2/2020 8:55 AM, Fulcomer, Samuel wrote:

We're doing something similar. We're continuing to run production
on 17.x and have set up a new server/cluster  running 20.x for
testing and MPI app rebuilds.

Our plan had been to add recently purchased nodes to the new
cluster, and at some point turn off submission on the old cluster
and switch everyone to submission on the new cluster (new
login/submission hosts). That way previously submitted MPI apps
would continue to run properly. As the old cluster partitions
started to clear out we'd mark ranges of nodes to drain and move
them to the new cluster.

We've since decided to wait until January, when we've scheduled
some downtime. The process will remain the same wrt moving nodes
from the old cluster to the new, _except_ that everything will be
drained, so we can move big blocks of nodes and avoid slurm.conf
Partition line ugliness.

We're starting with a fresh database to get rid of the bug
induced corruption that prevents GPUs from being fenced with cgroups.

regards,
s

On Mon, Nov 2, 2020 at 8:28 AM navin srivastava <navin.alt...@gmail.com> wrote:

Dear All,

Currently we are running slurm version 17.11.x and wanted to
move to 20.x.

We are building the New server with Slurm 20.2 version and
planning to upgrade the client nodes from 17.x to 20.x.

wanted to check if we can upgrade the Client from 17.x to
20.x directly or we need to go through 17.x to 18.x and 19.x
then 20.x

Regards
Navin.





Re: [slurm-users] Slurm Upgrade

2020-11-02 Thread Fulcomer, Samuel
Our strategy is a bit simpler. We're migrating compute nodes to a new
cluster running 20.x. This isn't an upgrade. We'll keep the old slurmdbd
running for at least enough time to suck the remaining accounting data into
XDMoD.

The old cluster will keep running jobs until there are no more to run.
We'll drain and move nodes to the new cluster as we start seeing more and
more idle nodes in the old cluster. This avoids MPI ugliness and we move
directly to 20.x.



On Mon, Nov 2, 2020 at 9:28 AM Paul Edmon  wrote:

> In general  I would follow this:
>
> https://slurm.schedmd.com/quickstart_admin.html#upgrade
>
> Namely:
>
> Almost every new major release of Slurm (e.g. 19.05.x to 20.02.x) involves
> changes to the state files with new data structures, new options, etc.
> Slurm permits upgrades to a new major release from the past two major
> releases, which happen every nine months (e.g. 18.08.x or 19.05.x to
> 20.02.x) without loss of jobs or other state information. State information
> from older versions will not be recognized and will be discarded, resulting
> in loss of all running and pending jobs. State files are *not* recognized
> when downgrading (e.g. from 19.05.x to 18.08.x) and will be discarded,
> resulting in loss of all running and pending jobs. For this reason,
> creating backup copies of state files (as described below) can be of value.
> Therefore when upgrading Slurm (more precisely, the slurmctld daemon),
> saving the *StateSaveLocation* (as defined in *slurm.conf*) directory
> contents with all state information is recommended. If you need to
> downgrade, restoring that directory's contents will let you recover the
> jobs. Jobs submitted under the new version will not be in those state
> files, but it can let you recover most jobs. An exception to this is that
> jobs may be lost when installing new pre-release versions (e.g.
> 20.02.0-pre1 to 20.02.0-pre2). Developers will try to note these cases in
> the NEWS file. Contents of major releases are also described in the
> RELEASE_NOTES file.
>
> So I wouldn't go directly to 20.x, instead I would go from 17.x to 19.x
> and then to 20.x
>
> -Paul Edmon-
> On 11/2/2020 8:55 AM, Fulcomer, Samuel wrote:
>
> We're doing something similar. We're continuing to run production on 17.x
> and have set up a new server/cluster  running 20.x for testing and MPI app
> rebuilds.
>
> Our plan had been to add recently purchased nodes to the new cluster, and
> at some point turn off submission on the old cluster and switch everyone
> to  submission on the new cluster (new login/submission hosts). That way
> previously submitted MPI apps would continue to run properly. As the old
> cluster partitions started to clear out we'd mark ranges of nodes to drain
> and move them to the new cluster.
>
> We've since decided to wait until January, when we've scheduled some
> downtime. The process will remain the same wrt moving nodes from the old
> cluster to the new, _except_ that everything will be drained, so we can
> move big blocks of nodes and avoid slurm.conf Partition line ugliness.
>
> We're starting with a fresh database to get rid of the bug
> induced corruption that prevents GPUs from being fenced with cgroups.
>
> regards,
> s
>
> On Mon, Nov 2, 2020 at 8:28 AM navin srivastava 
> wrote:
>
>> Dear All,
>>
>> Currently we are running slurm version 17.11.x and wanted to move to 20.x.
>>
>> We are building the New server with Slurm 20.2 version and planning to
>> upgrade the client nodes from 17.x to 20.x.
>>
>> wanted to check if we can upgrade the Client from 17.x to 20.x directly
>> or we need to go through 17.x to 18.x and 19.x then 20.x
>>
>> Regards
>> Navin.
>>
>>
>>
>>


Re: [slurm-users] Slurm Upgrade

2020-11-02 Thread Paul Edmon
We don't follow the recommended procedure here but rather build RPMs and 
upgrade using those.  We haven't had any issues.  Here is our procedure:


1. Build rpms from source using a version of the slurm.spec file that we 
maintain. It's the version SchedMD provides, but modified with some 
specific stuff for our env and to disable automatic restarts on upgrade, 
which can cause problems, especially for upgrading the Slurm database.


2. We test the upgrade on our test cluster using the following sequence.

a. Pause all jobs and stop all scheduling.

b. Stop slurmctld and slurmdbd.

c. Backup spool and the database.

d. Upgrade slurm rpms (note that you need to make sure that the upgrade 
will not automatically restart the dbd or the ctld, else you may end up 
in a world of hurt).


e. Run slurmdbd -Dv to do the database upgrade. Depending on the 
upgrade this can take a while because of database schema changes.


f. Restart slurmdbd using the service

g. Upgrade slurm rpms across the cluster using salt.

h. Global restart of slurmd and slurmctld.

3. If that all looks good we rinse and repeat on our production cluster.
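
A rough command-line sketch of steps 1 and d-h above (versions, paths, and 
the salt targets are placeholders; the database backup and foreground 
slurmdbd run are as described elsewhere in this thread):

    rpmbuild -ta slurm-20.02.7.tar.bz2                  # build RPMs with the locally maintained slurm.spec
    systemctl stop slurmctld slurmdbd                   # on the controller/dbd host
    rpm -Uvh ~/rpmbuild/RPMS/x86_64/slurm-*20.02*.rpm   # make sure no scriptlet restarts the daemons
    slurmdbd -D -vvv                                    # foreground database schema upgrade
    systemctl start slurmdbd slurmctld
    salt '*' cmd.run 'yum -y update slurm-*'            # example rollout of the new rpms to the nodes
    salt '*' service.restart slurmd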

The rpms have worked fine for us.  The main hitch is the automatic 
restart on upgrade, which I do not recommend.  You should neuter that 
portion of the provided spec file, especially for the slurmdbd upgrades.


We generally prefer the RPM method as it is the normal method for 
interaction with the OS and works well with Puppet.


-Paul Edmon-

On 11/2/2020 10:13 AM, Jason Simms wrote:

Hello all,

I am going to reveal the degree of my inexperience here, but am I 
perhaps the only one who thinks that Slurm's upgrade procedure is too 
complex? Or, at least maybe not explained in enough detail?


I'm running a CentOS 8 cluster, and to me, I should be able simply to 
update the Slurm package and any of its dependencies, and that's it. 
When I looked at the notes from the recent Slurm Users' Group meeting, 
however, I see that while that mode is technically supported, it is 
not recommended, and instead one should always rebuild from source. 
Really?


So, OK, regardless of whether that's the case, the upgrade notes linked 
to in the prior post don't, in my opinion, go into enough detail. It 
tells you broadly what to do, but not necessarily how to do it. I'd 
welcome example commands for each step (understanding that changes 
might be needed to account for local configurations). There are no 
examples in that section, for example, addressing recompiling from source.


Now, I suspect a chorus of "if you don't understand it well enough, 
you shouldn't be managing it." OK. Perhaps that's fair enough. But I 
came into this role via a non-traditional route and am constantly 
trying to improve my admin skills, and I may not have the complete 
mastery of all aspects quite yet. But I would also say that 
documentation should be clear and complete, and not written solely for 
experts. To be honest, I've had to go to lots of documentation 
external to SchedMD to see good examples of actually working with 
Slurm, or even ask the helpful people on this group. And I firmly 
believe that if there is a packaged version of your software - as 
there is for Slurm - that should be the default, fully-working way to 
upgrade.


Warmest regards,
Jason

On Mon, Nov 2, 2020 at 9:28 AM Paul Edmon wrote:


In general  I would follow this:

https://slurm.schedmd.com/quickstart_admin.html#upgrade


Namely:

Almost every new major release of Slurm (e.g. 19.05.x to 20.02.x)
involves changes to the state files with new data structures, new
options, etc. Slurm permits upgrades to a new major release from
the past two major releases, which happen every nine months (e.g.
18.08.x or 19.05.x to 20.02.x) without loss of jobs or other state
information. State information from older versions will not be
recognized and will be discarded, resulting in loss of all running
and pending jobs. State files are *not* recognized when
downgrading (e.g. from 19.05.x to 18.08.x) and will be discarded,
resulting in loss of all running and pending jobs. For this
reason, creating backup copies of state files (as described below)
can be of value. Therefore when upgrading Slurm (more precisely,
the slurmctld daemon), saving the /StateSaveLocation/ (as defined
in /slurm.conf/) directory contents with all state information is
recommended. If you need to downgrade, restoring that directory's
contents will let you recover the jobs. Jobs submitted under the
new version will not be in those state files, but it can let you
recover most jobs. An exception to this is that jobs may be lost
when installing new pre-release versions (e.g. 20.02.0-pre1 to
20.02.0-pre2). Developers will try to note these cases in the NEWS
file. Contents of major releases are also described in the RELEASE_NOTES file.

Re: [slurm-users] Slurm Upgrade

2020-11-02 Thread Paul Edmon

In general  I would follow this:

https://slurm.schedmd.com/quickstart_admin.html#upgrade

Namely:

Almost every new major release of Slurm (e.g. 19.05.x to 20.02.x) 
involves changes to the state files with new data structures, new 
options, etc. Slurm permits upgrades to a new major release from the 
past two major releases, which happen every nine months (e.g. 18.08.x or 
19.05.x to 20.02.x) without loss of jobs or other state information. 
State information from older versions will not be recognized and will be 
discarded, resulting in loss of all running and pending jobs. State 
files are *not* recognized when downgrading (e.g. from 19.05.x to 
18.08.x) and will be discarded, resulting in loss of all running and 
pending jobs. For this reason, creating backup copies of state files (as 
described below) can be of value. Therefore when upgrading Slurm (more 
precisely, the slurmctld daemon), saving the /StateSaveLocation/ (as 
defined in /slurm.conf/) directory contents with all state information 
is recommended. If you need to downgrade, restoring that directory's 
contents will let you recover the jobs. Jobs submitted under the new 
version will not be in those state files, but it can let you recover 
most jobs. An exception to this is that jobs may be lost when installing 
new pre-release versions (e.g. 20.02.0-pre1 to 20.02.0-pre2). Developers 
will try to note these cases in the NEWS file. Contents of major 
releases are also described in the RELEASE_NOTES file.


So I wouldn't go directly to 20.x; instead, I would go from 17.x to 19.x 
and then to 20.x.
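
As a small sketch of the state-file backup the quoted text recommends (the 
tarball name is arbitrary; stop slurmctld first so the state files are 
quiescent):

    STATESAVE=$(scontrol show config | awk '/^StateSaveLocation/ {print $3}')
    systemctl stop slurmctld
    tar -czf slurmctld-state-$(date +%F).tar.gz "$STATESAVE"
    # ...perform the upgrade, then restart slurmctld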


-Paul Edmon-

On 11/2/2020 8:55 AM, Fulcomer, Samuel wrote:
We're doing something similar. We're continuing to run production on 
17.x and have set up a new server/cluster running 20.x for testing and 
MPI app rebuilds.


Our plan had been to add recently purchased nodes to the new cluster, 
and at some point turn off submission on the old cluster and switch 
everyone to  submission on the new cluster (new login/submission 
hosts). That way previously submitted MPI apps would continue to run 
properly. As the old cluster partitions started to clear out we'd mark 
ranges of nodes to drain and move them to the new cluster.


We've since decided to wait until January, when we've scheduled some 
downtime. The process will remain the same wrt moving nodes from the 
old cluster to the new, _except_ that everything will be drained, so 
we can move big blocks of nodes and avoid slurm.conf Partition line 
ugliness.


We're starting with a fresh database to get rid of the bug 
induced corruption that prevents GPUs from being fenced with cgroups.


regards,
s

On Mon, Nov 2, 2020 at 8:28 AM navin srivastava <navin.alt...@gmail.com> wrote:


Dear All,

Currently we are running slurm version 17.11.x and wanted to move
to 20.x.

We are building the New server with Slurm 20.2 version and
planning to upgrade the client nodes from 17.x to 20.x.

wanted to check if we can upgrade the Client from 17.x to 20.x
directly or we need to go through 17.x to 18.x and 19.x then 20.x

Regards
Navin.





Re: [slurm-users] Slurm Upgrade

2020-11-02 Thread Fulcomer, Samuel
We're doing something similar. We're continuing to run production on 17.x
and have set up a new server/cluster  running 20.x for testing and MPI app
rebuilds.

Our plan had been to add recently purchased nodes to the new cluster, and
at some point turn off submission on the old cluster and switch everyone
to  submission on the new cluster (new login/submission hosts). That way
previously submitted MPI apps would continue to run properly. As the old
cluster partitions started to clear out we'd mark ranges of nodes to drain
and move them to the new cluster.

We've since decided to wait until January, when we've scheduled some
downtime. The process will remain the same wrt moving nodes from the old
cluster to the new, _except_ that everything will be drained, so we can
move big blocks of nodes and avoid slurm.conf Partition line ugliness.

We're starting with a fresh database to get rid of the bug-induced
corruption that prevents GPUs from being fenced with cgroups.

regards,
s

On Mon, Nov 2, 2020 at 8:28 AM navin srivastava 
wrote:

> Dear All,
>
> Currently we are running slurm version 17.11.x and wanted to move to 20.x.
>
> We are building the New server with Slurm 20.2 version and planning to
> upgrade the client nodes from 17.x to 20.x.
>
> wanted to check if we can upgrade the Client from 17.x to 20.x directly or
> we need to go through 17.x to 18.x and 19.x then 20.x
>
> Regards
> Navin.
>
>
>
>


Re: [slurm-users] Slurm Upgrade

2020-11-02 Thread Christopher J Cawley
Depending on how large the database is, the database backend upgrades 
can take a while.

Chris

Christopher J. Cawley

Systems Engineer/Linux Engineer, Information Technology Services

223 Aquia Building, Ffx, MSN: 1B5

George Mason University


Phone: (703) 993-6397

Email: ccawl...@gmu.edu



From: slurm-users  on behalf of 
Christopher J Cawley 
Sent: Monday, November 2, 2020 8:33 AM
To: Slurm User Community List 
Subject: Re: [slurm-users] Slurm Upgrade

I do not think so.

In any case, make sure that you stop services
and make a backup of the database.

Chris

Christopher J. Cawley

Systems Engineer/Linux Engineer, Information Technology Services

223 Aquia Building, Ffx, MSN: 1B5

George Mason University


Phone: (703) 993-6397

Email: ccawl...@gmu.edu



From: slurm-users  on behalf of navin 
srivastava 
Sent: Monday, November 2, 2020 8:25 AM
To: Slurm User Community List 
Subject: [slurm-users] Slurm Upgrade

Dear All,

Currently we are running slurm version 17.11.x and wanted to move to 20.x.

We are building the New server with Slurm 20.2 version and planning to upgrade 
the client nodes from 17.x to 20.x.

wanted to check if we can upgrade the Client from 17.x to 20.x directly or we 
need to go through 17.x to 18.x and 19.x then 20.x

Regards
Navin.





Re: [slurm-users] Slurm Upgrade

2020-11-02 Thread Christopher J Cawley
I do not think so.

In any case, make sure that you stop services
and make a backup of the database.

Chris

Christopher J. Cawley

Systems Engineer/Linux Engineer, Information Technology Services

223 Aquia Building, Ffx, MSN: 1B5

George Mason University


Phone: (703) 993-6397

Email: ccawl...@gmu.edu



From: slurm-users  on behalf of navin 
srivastava 
Sent: Monday, November 2, 2020 8:25 AM
To: Slurm User Community List 
Subject: [slurm-users] Slurm Upgrade

Dear All,

Currently we are running slurm version 17.11.x and wanted to move to 20.x.

We are building the New server with Slurm 20.2 version and planning to upgrade 
the client nodes from 17.x to 20.x.

wanted to check if we can upgrade the Client from 17.x to 20.x directly or we 
need to go through 17.x to 18.x and 19.x then 20.x

Regards
Navin.





[slurm-users] Slurm Upgrade

2020-11-02 Thread navin srivastava
Dear All,

Currently we are running Slurm version 17.11.x and want to move to 20.x.

We are building a new server with Slurm 20.2 and planning to
upgrade the client nodes from 17.x to 20.x.

We wanted to check whether we can upgrade the client nodes from 17.x to 20.x
directly, or whether we need to go through 18.x and 19.x first.

Regards
Navin.


[slurm-users] Reset all accounting data

2020-11-02 Thread Diego Zuccato
Hello all.

I'm going to change (a lot!) our cluster config, including accounting
weights.
Since keeping historic data would cause a mess, how can I reset usage
for all accounts/users?

I already tried the obvious:
# sacctmgr modify account set rawusage=0
You didn't set any conditions with 'WHERE'.
Are you sure you want to continue? (You have 30 seconds to decide)
(N/y): y
sacctmgr: error: An association name is required to remove usage

Should I iterate all the accounts or is there a better/faster method?

TIA!
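
One way to iterate, as a sketch (lists the accounts headerless and in 
parsable form, then resets each one; -i commits without prompting, so 
please try it against a test copy of the database first):

    for acct in $(sacctmgr -nP list account format=Account); do
        sacctmgr -i modify account name="$acct" set RawUsage=0
    done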

-- 
Diego Zuccato
DIFA - Dip. di Fisica e Astronomia
Servizi Informatici
Alma Mater Studiorum - Università di Bologna
V.le Berti-Pichat 6/2 - 40127 Bologna - Italy
tel.: +39 051 20 95786