[slurm-dev] Re: Slurm Upgrade from 14.11.3 to 16.5.3 - Instructions needed

2016-08-18 Thread Christopher Samuel

On 18/08/16 20:32, Ole Holm Nielsen wrote:

> Chris Samuel in a previous posting had some more cautious advice about
> upgrading slurmd daemons!  I hope that Chris may offer additional insights.

It's just that if I don't have to upgrade nodes running jobs I'd really
rather avoid it.

I know it's supposed to work, and I'm sure it really does, just, well,
paranoia (and too many years using Torque where it could be a really Bad
Thing(tm)). :-)

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Slurm Upgrade from 14.11.3 to 16.5.3 - Instructions needed

2016-08-18 Thread Christopher Samuel

On 18/08/16 21:07, Barbara Krasovec wrote:

> scontrol reconfigure

"scontrol reconfigure" will do most, but not all parameters. For
instance adding/removing nodes requires daemon restarts.
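
A minimal illustration of the distinction (the restart commands are an
example only and assume the stock systemd unit names):

  # most slurm.conf changes, after copying the file to all nodes:
  scontrol reconfigure
  # but after adding or removing NodeName= lines, restart the daemons instead:
  systemctl restart slurmctld     # on the controller
  systemctl restart slurmd        # on each compute node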

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Slurm Upgrade from 14.11.3 to 16.5.3 - Instructions needed

2016-08-18 Thread Christopher Samuel

On 17/08/16 23:36, Ole Holm Nielsen wrote:

> Obviously upgrading slurmd's which are running jobs is quite tricky!  I
> have some questions:
> 
> 1. Can't you replace the health check by a global scontrol like this?
>scontrol update NodeName= State=drain Reason="Upgrading
> slurmd"

Yes you could, but as we keep the configs for our healthchecks in git it
means we have a historical record of the changes made.

> 2. Do you really have to wait for *all* nodes to become drained before
> starting to upgrade?  This could take weeks!

No, that's the opposite of what we do.

We update the healthcheck, the nodes drain themselves, then as nodes
become idle we upgrade the slurmd on them; the healthcheck sees it's the
new version and the node puts itself back into service.

So it's a rolling upgrade.
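
As a minimal sketch of the idea only (the version string, the use of
"slurmd -V" and the scontrol calls are illustrative, not our actual
healthcheck script):

  WANTED="16.05.3"
  CURRENT=$(slurmd -V | awk '{print $2}')   # "slurm 16.05.3" -> "16.05.3"
  if [ "$CURRENT" != "$WANTED" ]; then
      scontrol update NodeName=$(hostname -s) State=DRAIN Reason="Upgrading slurmd"
  else
      scontrol update NodeName=$(hostname -s) State=RESUME
  fi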

Our next outage does require us to have no nodes running jobs, but
that's just because we're changing a parameter in slurm.conf that
requires there to be no running jobs first.

cheers,
Chris
-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Re: Slurm Upgrade from 14.11.3 to 16.5.3 - Instructions needed

2016-08-18 Thread Christopher Samuel

On 17/08/16 04:42, Ole Holm Nielsen wrote:

> Question: Can anyone provide slurmdbd upgrade instructions which work
> correctly on CentOS 7 (and other OSes using systemd)?

We don't tend to start slurmdbd via an init script when doing an
upgrade; instead we run it by hand, adding "-D -v -v", so it doesn't put
itself into the background and prints what it's doing to the terminal
as well as logging it.

Call us paranoid, but that's just how we roll.. ;-)
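
In practice that looks something like the following (the systemd unit name
and stopping it with Ctrl-C are assumptions; adjust to your installation):

  systemctl stop slurmdbd     # or "service slurmdbd stop" on init-script systems
  slurmdbd -D -v -v           # stays in the foreground, verbose; watch the DB conversion
  # once the conversion/log output settles, stop it (Ctrl-C) and start it normally:
  systemctl start slurmdbd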

-- 
 Christopher Samuel        Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci


[slurm-dev] Preemption stats

2016-08-18 Thread Jeff White


Does Slurm have a way of showing historical data of how many jobs have 
been preempted by other jobs?


--
Jeff White
HPC Systems Engineer
Information Technology Services - WSU


[slurm-dev] Re: Remote Visualization and Slurm

2016-08-18 Thread Andrew Elwell

> If anyone has a working remote visualization cluster that integrates well
> with slurm, I would love to hear from you.

We're using 'strudel'
https://www.massive.org.au/userguide/cluster-instructions/strudel
and our local instructions are
https://support.pawsey.org.au/documentation/display/US/Getting+started%3A+Remote+visualisation+with+Strudel

Andrew


[slurm-dev] Re: Slurm Upgrade from 14.11.3 to 16.5.3 - Instructions needed

2016-08-18 Thread Barbara Krasovec

Well, if you're doing the upgrade of already installed packages, you can do:

yum update slurm-sql slurm-munge slurm-slurmdbd

or

rpm -Uvh 
(the -U switch upgrades packages that are already installed)
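
With the RPMs listed in your message below that would be, for example:

  cd /root/rpmbuild/RPMS/x86_64
  rpm -Uvh slurm-slurmdbd-15.08.12-1.el6.x86_64.rpm \
           slurm-munge-15.08.12-1.el6.x86_64.rpm \
           slurm-sql-15.08.12-1.el6.x86_64.rpm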

Cheers,
Barbara

On 18/08/16 19:50, Balaji Deivam wrote:
Re: [slurm-dev] Re: Slurm Upgrade from 14.11.3 to 16.5.3 - 
Instructions needed

Thanks for your response.

I have built the RPMs and the files below were generated. Then I installed 
only the 3 RPMs you mentioned. Is this right?


-rw-r- 1 root root 25680316 Aug 18 12:44 slurm-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root   451160 Aug 18 12:44 slurm-perlapi-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root   117380 Aug 18 12:44 slurm-devel-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root    16024 Aug 18 12:44 slurm-munge-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root  1061536 Aug 18 12:44 slurm-slurmdbd-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root   281868 Aug 18 12:44 slurm-sql-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root  1140588 Aug 18 12:44 slurm-plugins-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root    35760 Aug 18 12:44 slurm-torque-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root     7100 Aug 18 12:44 slurm-sjobexit-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root     6260 Aug 18 12:44 slurm-slurmdb-direct-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root    10420 Aug 18 12:44 slurm-sjstat-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root    35696 Aug 18 12:44 slurm-pam_slurm-15.08.12-1.el6.x86_64.rpm

[root@cloudlg017223 x86_64]# pwd
/root/rpmbuild/RPMS/x86_64
[root@cloudlg017223 x86_64]#


[root@cloudlg017223 x86_64]# rpm -ivh slurm-slurmdbd-15.08.12-1.el6.x86_64.rpm slurm-munge-15.08.12-1.el6.x86_64.rpm slurm-sql-15.08.12-1.el6.x86_64.rpm

Preparing...  ### [100%]
file /apps/slurm/lib64/slurm/accounting_storage_mysql.so from 
install of slurm-sql-15.08.12-1.el6.x86_64 conflicts with file from 
package slurm-sql-14.11.3-1.el6.x86_64
file /apps/slurm/lib64/slurm/jobcomp_mysql.so from install of 
slurm-sql-15.08.12-1.el6.x86_64 conflicts with file from package 
slurm-sql-14.11.3-1.el6.x86_64
file /apps/slurm/sbin/slurmdbd from install of 
slurm-slurmdbd-15.08.12-1.el6.x86_64 conflicts with file from package 
slurm-slurmdbd-14.11.3-1.el6.x86_64
file /apps/slurm/share/man/man5/slurmdbd.conf.5 from install 
of slurm-slurmdbd-15.08.12-1.el6.x86_64 conflicts with file from 
package slurm-slurmdbd-14.11.3-1.el6.x86_64
file /apps/slurm/share/man/man8/slurmdbd.8 from install of 
slurm-slurmdbd-15.08.12-1.el6.x86_64 conflicts with file from package 
slurm-slurmdbd-14.11.3-1.el6.x86_64
file /apps/slurm/lib64/slurm/auth_munge.so from install of 
slurm-munge-15.08.12-1.el6.x86_64 conflicts with file from package 
slurm-munge-14.11.3-1.el6.x86_64
file /apps/slurm/lib64/slurm/crypto_munge.so from install of 
slurm-munge-15.08.12-1.el6.x86_64 conflicts with file from package 
slurm-munge-14.11.3-1.el6.x86_64

[root@cloudlg017223 x86_64]#




Thanks & Regards,
Balaji Deivam
Staff Analyst - Business Data Center
Seagate Technology - 389 Disc Drive, Longmont, CO 80503 | 720-684-3395


On Thu, Aug 18, 2016 at 4:57 AM, Barbara Krasovec wrote:


Hello!


On 17/08/16 23:59, Balaji Deivam wrote:


Hi,

Can someone give me the detailed steps on "Upgrade the
slurmdbd daemon"?



I have downloaded the Slurm source tar file and am looking for how
to upgrade only the slurmdbd from that tar file.



Thanks & Regards,
Balaji Deivam
Staff Analyst - Business Data Center
Seagate Technology - 389 Disc Drive, Longmont, CO 80503 |
720-684-3395


Did you mention the operating system you are using? In my case, I
used RPMs, so I built RPMs for the new version and then
upgraded the following packages: slurm-slurmdbd, slurm-munge,
slurm-sql

Instructions on how to build/install SLURM:
http://slurm.schedmd.com/quickstart_admin.html



Cheers,

Barbara






[slurm-dev] Re: Slurm Upgrade from 14.11.3 to 16.5.3 - Instructions needed

2016-08-18 Thread Balaji Deivam
Thanks for your response.

I have built the RPMs and the files below were generated. Then I installed only the 3
RPMs you mentioned. Is this right?

-rw-r- 1 root root 25680316 Aug 18 12:44 slurm-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root   451160 Aug 18 12:44 slurm-perlapi-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root   117380 Aug 18 12:44 slurm-devel-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root    16024 Aug 18 12:44 slurm-munge-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root  1061536 Aug 18 12:44 slurm-slurmdbd-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root   281868 Aug 18 12:44 slurm-sql-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root  1140588 Aug 18 12:44 slurm-plugins-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root    35760 Aug 18 12:44 slurm-torque-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root     7100 Aug 18 12:44 slurm-sjobexit-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root     6260 Aug 18 12:44 slurm-slurmdb-direct-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root    10420 Aug 18 12:44 slurm-sjstat-15.08.12-1.el6.x86_64.rpm
-rw-r- 1 root root    35696 Aug 18 12:44 slurm-pam_slurm-15.08.12-1.el6.x86_64.rpm
[root@cloudlg017223 x86_64]# pwd
/root/rpmbuild/RPMS/x86_64
[root@cloudlg017223 x86_64]#


[root@cloudlg017223 x86_64]# rpm -ivh slurm-slurmdbd-15.08.12-1.el6.x86_64.rpm slurm-munge-15.08.12-1.el6.x86_64.rpm slurm-sql-15.08.12-1.el6.x86_64.rpm
Preparing...                ### [100%]
file /apps/slurm/lib64/slurm/accounting_storage_mysql.so from
install of slurm-sql-15.08.12-1.el6.x86_64 conflicts with file from package
slurm-sql-14.11.3-1.el6.x86_64
file /apps/slurm/lib64/slurm/jobcomp_mysql.so from install of
slurm-sql-15.08.12-1.el6.x86_64 conflicts with file from package
slurm-sql-14.11.3-1.el6.x86_64
file /apps/slurm/sbin/slurmdbd from install of
slurm-slurmdbd-15.08.12-1.el6.x86_64 conflicts with file from package
slurm-slurmdbd-14.11.3-1.el6.x86_64
file /apps/slurm/share/man/man5/slurmdbd.conf.5 from install of
slurm-slurmdbd-15.08.12-1.el6.x86_64 conflicts with file from package
slurm-slurmdbd-14.11.3-1.el6.x86_64
file /apps/slurm/share/man/man8/slurmdbd.8 from install of
slurm-slurmdbd-15.08.12-1.el6.x86_64 conflicts with file from package
slurm-slurmdbd-14.11.3-1.el6.x86_64
file /apps/slurm/lib64/slurm/auth_munge.so from install of
slurm-munge-15.08.12-1.el6.x86_64 conflicts with file from package
slurm-munge-14.11.3-1.el6.x86_64
file /apps/slurm/lib64/slurm/crypto_munge.so from install of
slurm-munge-15.08.12-1.el6.x86_64 conflicts with file from package
slurm-munge-14.11.3-1.el6.x86_64
[root@cloudlg017223 x86_64]#




Thanks & Regards,
Balaji Deivam
Staff Analyst - Business Data Center
Seagate Technology - 389 Disc Drive, Longmont, CO 80503 | 720-684-3395

On Thu, Aug 18, 2016 at 4:57 AM, Barbara Krasovec  wrote:

> Hello!
>
>
> On 17/08/16 23:59, Balaji Deivam wrote:
>
>
> Hi,
>
>
>> Can someone give me the detailed steps on "Upgrade the slurmdbd daemon"?
>
>
> I have downloaded the Slurm source tar file and am looking for how to upgrade
> only the slurmdbd from that tar file.
>
>
>
> Thanks & Regards,
> Balaji Deivam
> Staff Analyst - Business Data Center
> Seagate Technology - 389 Disc Drive, Longmont, CO 80503 | 720-684-3395
>
> Did you mention the operating system you are using? In my case, I used
> RPMs, so I built RPMs for the new version and then upgraded the
> following packages: slurm-slurmdbd, slurm-munge, slurm-sql
>
> Instructions on how to build/install SLURM:
> http://slurm.schedmd.com/quickstart_admin.html
>
> Cheers,
>
> Barbara
>


[slurm-dev] Re: jobs rejected for no reason

2016-08-18 Thread Ade Fewings
Hi Antonia

Hmm... we set the Default Memory Per CPU for the partitions as well as having 
the node memory populated, so I may be unable to help here, but I'm curious as 
to why your node reports "RealMemory=1" - I don't know if that is 'normal' 
in some cases, to be honest.
If I request an impossible memory configuration I get the same message you do at job 
submission.  I wonder whether passing '-v -v -v' to sbatch provides useful 
debugging output on what Slurm assesses the job requirements to be, and 
whether specifying compliant memory requirements for the job helps.
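
Something along these lines, for example (the script name and memory value 
are placeholders only):

  sbatch -v -v -v --mem=4096 jobscript.sh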

Sorry I can't be more help.
~~
Ade





From: Antonia Mey 
Sent: 18 August 2016 16:30:14
To: slurm-dev
Subject: [slurm-dev] Re: jobs rejected for no reason

Hi Ade,

Looking at the node configuration, this should be OK:
NodeName=node011 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=4 CPUErr=0 CPUTot=32 CPULoad=2.79 Features=(null)
   Gres=gpu:4
   NodeAddr=node011 NodeHostName=node011 Version=14.11
   OS=Linux RealMemory=1 AllocMem=0 Sockets=32 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2015-10-30T09:30:14 SlurmdStartTime=2015-10-30T12:02:36
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
On 18 Aug 2016, at 16:18, Ade Fewings wrote:

Hi Antonia

I think it's quite likely to be something to do with the nodes in that 
partition, does 'scontrol show node=' show all the requested 
capabilities?

~~
Ade


From: Antonia Mey
Sent: 18 August 2016 15:55:12
To: slurm-dev
Subject: [slurm-dev] jobs rejected for no reason

Hi all,

I am a bit out of my depth here and apologies if this is a very trivial 
problem. Slurm rejects jobs due to insufficient resources (sbatch: error: Batch 
job submission failed: Requested node configuration is not available), when the 
partition should definitely accept the following job

#!/bin/bash --login
#SBATCH -n 8
#SBATCH -N 1
#SBATCH -o d66a.out
#SBATCH -e d66a.err
#SBATCH -p GTX
#SBATCH --gres=gpu:1
#SBATCH --time 8:00:00
# Switch to current working directory

scontrol show partition GTX gives:
PartitionName=GTX
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=1 LLN=NO 
MaxCPUsPerNode=UNLIMITED
   Nodes=node0[10-13]
   Priority=1 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=128 TotalNodes=4 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

I have a similar partition configured in the same way, where the job is happily 
running. Am I missing something essential in the way the partition is 
configured?
Any thoughts? And thanks for any help.

Best,
Antonia

--
Dr Antonia Mey
University of Edinburgh
Department of Chemistry
Joseph Black Building
Edinburgh
EH9 3FJ

Tel: +44 1316507748
Email: antonia@gmail.com




--
Dr Antonia Mey
University of Edinburgh
Department of Chemistry
Joseph Black Building
Edinburgh
EH9 3FJ

Tel: +44 1316507748
Email: antonia@gmail.com





[slurm-dev] Re: jobs rejected for no reason

2016-08-18 Thread Antonia Mey
Hi Ade,

Looking at the node configuration, this should be OK:
NodeName=node011 Arch=x86_64 CoresPerSocket=1
   CPUAlloc=4 CPUErr=0 CPUTot=32 CPULoad=2.79 Features=(null)
   Gres=gpu:4
   NodeAddr=node011 NodeHostName=node011 Version=14.11
   OS=Linux RealMemory=1 AllocMem=0 Sockets=32 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2015-10-30T09:30:14 SlurmdStartTime=2015-10-30T12:02:36
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
> On 18 Aug 2016, at 16:18, Ade Fewings wrote:
> 
> Hi Antonia
> 
> I think it's quite likely to be something to do with the nodes in that 
> partition, does 'scontrol show node=' show all the requested 
> capabilities?
> 
> ~~
> Ade
> 
> From: Antonia Mey
> Sent: 18 August 2016 15:55:12
> To: slurm-dev
> Subject: [slurm-dev] jobs rejected for no reason
> 
> Hi all,
> 
> I am a bit out of my depth here and apologies if this is a very trivial 
> problem. Slurm rejects jobs due to insufficient resources (sbatch: error: 
> Batch job submission failed: Requested node configuration is not available), 
> when the partition should definitely accept the following job
> 
> #!/bin/bash --login
> #SBATCH -n 8
> #SBATCH -N 1
> #SBATCH -o d66a.out
> #SBATCH -e d66a.err
> #SBATCH -p GTX
> #SBATCH --gres=gpu:1
> #SBATCH --time 8:00:00
> # Switch to current working directory
> 
> scontrol show partition GTX gives:
> PartitionName=GTX
>AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
>AllocNodes=ALL Default=NO
>DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
>MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=1 LLN=NO 
> MaxCPUsPerNode=UNLIMITED
>Nodes=node0[10-13]
>Priority=1 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
>State=UP TotalCPUs=128 TotalNodes=4 SelectTypeParameters=N/A
>DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED
> 
> I have a similar partition configured in the same way, where the job is 
> happily running. Am I missing something essential in the way the partition is 
> configured?
> Any thoughts? And thanks for any help.
> 
> Best,
> Antonia
> 
> --
> Dr Antonia Mey
> University of Edinburgh
> Department of Chemistry
> Joseph Black Building
> Edinburgh
> EH9 3FJ
> 
> Tel: +44 1316507748
> Email: antonia@gmail.com 
> 
> 
> 

--
Dr Antonia Mey
University of Edinburgh
Department of Chemistry
Joseph Black Building
Edinburgh
EH9 3FJ

Tel: +44 1316507748
Email: antonia@gmail.com






[slurm-dev] jobs rejected for no reason

2016-08-18 Thread Antonia Mey
Hi all,

I am a bit out of my depth here and apologies if this is a very trivial 
problem. Slurm rejects jobs due to insufficient resources (sbatch: error: Batch 
job submission failed: Requested node configuration is not available), even 
though the partition should definitely accept the following job:

#!/bin/bash --login
#SBATCH -n 8
#SBATCH -N 1
#SBATCH -o d66a.out
#SBATCH -e d66a.err
#SBATCH -p GTX
#SBATCH --gres=gpu:1
#SBATCH --time 8:00:00
# Switch to current working directory

scontrol show partition GTX gives:
PartitionName=GTX
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=NO
   DefaultTime=NONE DisableRootJobs=NO GraceTime=0 Hidden=NO
   MaxNodes=UNLIMITED MaxTime=2-00:00:00 MinNodes=1 LLN=NO 
MaxCPUsPerNode=UNLIMITED
   Nodes=node0[10-13]
   Priority=1 RootOnly=NO ReqResv=NO Shared=NO PreemptMode=OFF
   State=UP TotalCPUs=128 TotalNodes=4 SelectTypeParameters=N/A
   DefMemPerNode=UNLIMITED MaxMemPerNode=UNLIMITED

I have a similar partition configured in the same way, where the job is happily 
running. Am I missing something essential in the way the partition is 
configured?
Any thoughts? And thanks for any help.

Best,
Antonia

--
Dr Antonia Mey
University of Edinburgh
Department of Chemistry
Joseph Black Building
Edinburgh
EH9 3FJ

Tel: +44 1316507748
Email: antonia@gmail.com






[slurm-dev] Re: SLURM job's email notification does not work

2016-08-18 Thread Fatih Öztürk
Dear Doug and Christian,

Thank you very much for your help.

I was able to solve the problem when I made the changes as you said 
(applied on the head node, then scontrol reconfigure). Now the emails are sent 
successfully.

Best Regards,

Fatih




Fatih Öztürk
Engineering Information Technologies
Information Technologies
fatih.ozt...@tai.com.tr

TAI - Turkish Aerospace Industries, Inc.   www.tai.com.tr / www.taigermany.com
Address: Fethiye Mh. Havacilik Blv. No:17 06980 Kazan – ANKARA / TURKEY
Tel: +90 (312) 811 1800 / 810 8000-3179 // Fax: +90 312 811 1425

From: Douglas Jacobsen [mailto:dmjacob...@lbl.gov]
Sent: Thursday, August 18, 2016 4:28 PM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM job's email notification does not work

Email is only sent by slurmctld, you'll need to change slurm.conf there and at 
least do an `scontrol reconfigure`, then perhaps it'll start working.

-Doug


Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center

- __o
-- _ '\<,_
--(_)/  (_)__


On Thu, Aug 18, 2016 at 6:23 AM, Fatih Öztürk wrote:
Dear Christian,

What i did now;

1) Changed computenode3 status to DRAIN
2) Changed slurm.conf only on computenode3. Added MailProg=/usr/bin/mailx at 
the end of the slurm.conf
3)Restarted munged and slurmd services
4) Changed computenode3 status to RESUME

Both with root and my own user id, email is still not sent again from 
computenode3.

Regards.



Fatih Öztürk
Engineering Information Technologies
Information Technologies

fatih.ozt...@tai.com.tr


TAI - Turkish Aerospace Industries, Inc.
Address: Fethiye Mh. Havacılık Blv. No:17 06980 Kazan – ANKARA / TURKEY Tel: 
+90 312 811 1800-3179 // Fax: +90 312 811 1425  
www.tai.com.tr

Legal Notice :
The Terms and Conditions which this e-mail is subject to, can be accessed from 
this link:
http://www.tai.com.tr/tr/yasal-uyari






-Original Message-
From: Christian Goll 
[mailto:christian.g...@h-its.org]
Sent: Thursday, August 18, 2016 3:59 PM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM job's email notification does not work


Hello Fatih,
did you set the variable MailProg to the right value, e.g.
MailProg=/usr/bin/mailx
in slurm.conf?

kind regards,
Christian
On 18.08.2016 14:44, Fatih Öztürk wrote:
> Hello,
>
>
> I have a problem about email notification with jobs. I would be
> appreciate if you could help me.
>
>
> We have a SLURM cluster: 1 Head Node and about 20 Compute Nodes.
> User's run their jobs within only on head node with their own credentials.
>
>
> As an example, if i run a job like below on the head node;
>
>
> [t15247@headnode ] # srun -n1 -w "computenode3" --mail-type=ALL
> --mail-user=fatih.ozt...@tai.com.tr 
> /bin/hostname
>
>
> it runs successfully, however no slurm notification email sent to user.
>
>
> However, this user can send email manually with "mailx" successfully
> from both on headnode and computenode3.
>
>
> Would you please help me about this problem?
>
>
> Note: Remote microsoft exchange smtp server informations are set in
> /etc/mail.rc
>
>
> Regards,
>
>
> *@*
>
> *Fatih Öztürk*
>
> TAI, Turkish Aerospace Industries Inc.
>
> Engineering IT
>
> +90 312 811 18 00/ 3179
>
>
>
>
>
>
> *Fatih Öztürk*
>
> Engineering Information Technologies
>
> Information Technologies
>
> fatih.ozt...@tai.com.tr 
> >
>
> TAI - Turkish Aerospace Industries, Inc.   
> *www.tai.com.tr* /
> *www.taigermany.com* 
> >
> 
> 
> 
> Address: Fethiye Mh. Havacilik Blv. No:17 06980 Kazan – ANKARA /
> TURKEY Tel: +90 (312) 811 1800 / 810 
> 8000-3179 //
> Fax: +90 312 811 1425
>
> 
>
> Legal Notice :
> *The Terms and Conditions which this e-mail is subject to, can be
> accessed from this link.* 

[slurm-dev] Re: SLURM job's email notification does not work

2016-08-18 Thread Douglas Jacobsen
Email is only sent by slurmctld; you'll need to change slurm.conf there and
at least do an `scontrol reconfigure`, then perhaps it'll start working.
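
Concretely, something like this on the slurmctld host (the MailProg value is
the one suggested earlier in the thread; the path is site-specific):

  # in slurm.conf on the head node
  MailProg=/usr/bin/mailx
  # then tell the running daemons to re-read the config
  scontrol reconfigure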

-Doug


Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center 

- __o
-- _ '\<,_
--(_)/  (_)__


On Thu, Aug 18, 2016 at 6:23 AM, Fatih Öztürk wrote:

> Dear Christian,
>
> What i did now;
>
> 1) Changed computenode3 status to DRAIN
> 2) Changed slurm.conf only on computenode3. Added MailProg=/usr/bin/mailx
> at the end of the slurm.conf
> 3)Restarted munged and slurmd services
> 4) Changed computenode3 status to RESUME
>
> Both with root and my own user id, email is still not sent again from
> computenode3.
>
> Regards.
>
>
>
> Fatih Öztürk
> Engineering Information Technologies
> Information Technologies
>
> fatih.ozt...@tai.com.tr
>
>
> TAI - Turkish Aerospace Industries, Inc.
> Address: Fethiye Mh. Havacılık Blv. No:17 06980 Kazan – ANKARA / TURKEY
> Tel: +90 312 811 1800-3179 // Fax: +90 312 811 1425  www.tai.com.tr
>
> Legal Notice :
> The Terms and Conditions which this e-mail is subject to, can be accessed
> from this link:
> http://www.tai.com.tr/tr/yasal-uyari
>
>
>
>
>
>
> -Original Message-
> From: Christian Goll [mailto:christian.g...@h-its.org]
> Sent: Thursday, August 18, 2016 3:59 PM
> To: slurm-dev
> Subject: [slurm-dev] Re: SLURM job's email notification does not work
>
>
> Hello Fatih,
> did you set the variable MailProg to the right value, e.g.
> MailProg=/usr/bin/mailx
> in slurm.conf?
>
> kind regards,
> Christian
> On 18.08.2016 14:44, Fatih Öztürk wrote:
> > Hello,
> >
> >
> > I have a problem about email notification with jobs. I would be
> > appreciate if you could help me.
> >
> >
> > We have a SLURM cluster: 1 Head Node and about 20 Compute Nodes.
> > User's run their jobs within only on head node with their own
> credentials.
> >
> >
> > As an example, if i run a job like below on the head node;
> >
> >
> > [t15247@headnode ] # srun -n1 -w "computenode3" --mail-type=ALL
> > --mail-user=fatih.ozt...@tai.com.tr /bin/hostname
> >
> >
> > it runs successfully, however no slurm notification email sent to user.
> >
> >
> > However, this user can send email manually with "mailx" successfully
> > from both on headnode and computenode3.
> >
> >
> > Would you please help me about this problem?
> >
> >
> > Note: Remote microsoft exchange smtp server informations are set in
> > /etc/mail.rc
> >
> >
> > Regards,
> >
> >
> > *@*
> >
> > *Fatih Öztürk*
> >
> > TAI, Turkish Aerospace Industries Inc.
> >
> > Engineering IT
> >
> > +90 312 811 18 00/ 3179
> >
> >
> >
> >
> >
> >
> > *Fatih Öztürk*
> >
> > Engineering Information Technologies
> >
> > Information Technologies
> >
> > fatih.ozt...@tai.com.tr 
> >
> > TAI - Turkish Aerospace Industries, Inc.   *www.tai.com.tr* /
> > *www.taigermany.com* 
> > 
> > 
> > 
> > Address: Fethiye Mh. Havacilik Blv. No:17 06980 Kazan – ANKARA /
> > TURKEY Tel: +90 (312) 811 1800 / 810 8000-3179 //
> > Fax: +90 312 811 1425
> >
> > 
> >
> > Legal Notice :
> > *The Terms and Conditions which this e-mail is subject to, can be
> > accessed from this link.* 
> >
> >
>
> --
> Dr. Christian Goll
> HITS gGmbH
> Schloss-Wolfsbrunnenweg 35
> 69118 Heidelberg
> Germany
> Phone: +49 6221 533 230
> Fax: +49 6221 533 230
> 
> Amtsgericht Mannheim / HRB 337446
> Managing Director: Dr. Gesa Schönberger
>


[slurm-dev] Re: SLURM job's email notification does not work

2016-08-18 Thread Fatih Öztürk
Dear Christian,

What I did now:

1) Changed computenode3 status to DRAIN
2) Changed slurm.conf only on computenode3. Added MailProg=/usr/bin/mailx at 
the end of the slurm.conf
3) Restarted the munged and slurmd services
4) Changed computenode3 status to RESUME

With both root and my own user ID, email is still not sent from 
computenode3.

Regards.



Fatih Öztürk
Engineering Information Technologies
Information Technologies

fatih.ozt...@tai.com.tr


TAI - Turkish Aerospace Industries, Inc.
Address: Fethiye Mh. Havacılık Blv. No:17 06980 Kazan – ANKARA / TURKEY Tel: 
+90 312 811 1800-3179 // Fax: +90 312 811 1425  www.tai.com.tr

Legal Notice :
The Terms and Conditions which this e-mail is subject to, can be accessed from 
this link:
http://www.tai.com.tr/tr/yasal-uyari






-Original Message-
From: Christian Goll [mailto:christian.g...@h-its.org]
Sent: Thursday, August 18, 2016 3:59 PM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM job's email notification does not work


Hello Fatih,
did you set the variable MailProg to the right value, e.g.
MailProg=/usr/bin/mailx
in slurm.conf?

kind regards,
Christian
On 18.08.2016 14:44, Fatih Öztürk wrote:
> Hello,
>
>
> I have a problem about email notification with jobs. I would be
> appreciate if you could help me.
>
>
> We have a SLURM cluster: 1 Head Node and about 20 Compute Nodes.
> User's run their jobs within only on head node with their own credentials.
>
>
> As an example, if i run a job like below on the head node;
>
>
> [t15247@headnode ] # srun -n1 -w "computenode3" --mail-type=ALL
> --mail-user=fatih.ozt...@tai.com.tr /bin/hostname
>
>
> it runs successfully, however no slurm notification email sent to user.
>
>
> However, this user can send email manually with "mailx" successfully
> from both on headnode and computenode3.
>
>
> Would you please help me about this problem?
>
>
> Note: Remote microsoft exchange smtp server informations are set in
> /etc/mail.rc
>
>
> Regards,
>
>
> *@*
>
> *Fatih Öztürk*
>
> TAI, Turkish Aerospace Industries Inc.
>
> Engineering IT
>
> +90 312 811 18 00/ 3179
>
>
>
>
>
>
> *Fatih Öztürk*
>
> Engineering Information Technologies
>
> Information Technologies
>
> fatih.ozt...@tai.com.tr 
>
> TAI - Turkish Aerospace Industries, Inc.   *www.tai.com.tr* /
> *www.taigermany.com* 
> 
> 
> 
> Address: Fethiye Mh. Havacilik Blv. No:17 06980 Kazan – ANKARA /
> TURKEY Tel: +90 (312) 811 1800 / 810 8000-3179 //
> Fax: +90 312 811 1425
>
> 
>
> Legal Notice :
> *The Terms and Conditions which this e-mail is subject to, can be
> accessed from this link.* 
>
>

--
Dr. Christian Goll
HITS gGmbH
Schloss-Wolfsbrunnenweg 35
69118 Heidelberg
Germany
Phone: +49 6221 533 230
Fax: +49 6221 533 230

Amtsgericht Mannheim / HRB 337446
Managing Director: Dr. Gesa Schönberger


[slurm-dev] Re: SLURM job's email notification does not work

2016-08-18 Thread Christian Goll

Hello Fatih,
did you set the variable MailProg to the right value, e.g.
MailProg=/usr/bin/mailx
in slurm.conf?

kind regards,
Christian
On 18.08.2016 14:44, Fatih Öztürk wrote:
> Hello,
>
>
> I have a problem about email notification with jobs. I would be
> appreciate if you could help me.
>
>
> We have a SLURM cluster: 1 Head Node and about 20 Compute Nodes. User's
> run their jobs within only on head node with their own credentials.
>
>
> As an example, if i run a job like below on the head node;
>
>
> [t15247@headnode ] # srun -n1 -w "computenode3" --mail-type=ALL
> --mail-user=fatih.ozt...@tai.com.tr /bin/hostname
>
>
> it runs successfully, however no slurm notification email sent to user.
>
>
> However, this user can send email manually with "mailx" successfully
> from both on headnode and computenode3.
>
>
> Would you please help me about this problem?
>
>
> Note: Remote microsoft exchange smtp server informations are set in
> /etc/mail.rc
>
>
> Regards,
>
>
> *@*
>
> *Fatih Öztürk*
>
> TAI, Turkish Aerospace Industries Inc.
>
> Engineering IT
>
> +90 312 811 18 00/ 3179
>
>  
>
> 
>
>
> *Fatih Öztürk*
>
> Engineering Information Technologies
>
> Information Technologies
>
> fatih.ozt...@tai.com.tr 
>
> TAI - Turkish Aerospace Industries, Inc.   *www.tai.com.tr* /
> *www.taigermany.com*   
>  
>  
> 
> Address: Fethiye Mh. Havacilik Blv. No:17 06980
> Kazan – ANKARA / TURKEY Tel: +90 (312) 811 1800 / 810 8000-3179 //
> Fax: +90 312 811 1425
>
> 
>
> Legal Notice :
> *The Terms and Conditions which this e-mail is subject to, can be
> accessed from this link.* 
>
>

-- 
Dr. Christian Goll
HITS gGmbH
Schloss-Wolfsbrunnenweg 35
69118 Heidelberg
Germany
Phone: +49 6221 533 230
Fax: +49 6221 533 230

Amtsgericht Mannheim / HRB 337446
Managing Director: Dr. Gesa Schönberger


[slurm-dev] SLURM job's email notification does not work

2016-08-18 Thread Fatih Öztürk
Hello,


I have a problem with email notifications for jobs. I would appreciate it if 
you could help me.


We have a SLURM cluster: 1 head node and about 20 compute nodes. Users run 
their jobs only on the head node with their own credentials.


As an example, if I run a job like the one below on the head node:


[t15247@headnode ] # srun -n1 -w "computenode3" --mail-type=ALL 
--mail-user=fatih.ozt...@tai.com.tr /bin/hostname


it runs successfully; however, no Slurm notification email is sent to the user.


However, this user can successfully send email manually with "mailx" from both 
the head node and computenode3.


Would you please help me about this problem?


Note: The remote Microsoft Exchange SMTP server settings are configured in /etc/mail.rc.


Regards,

@
Fatih Öztürk
TAI, Turkish Aerospace Industries Inc.
Engineering IT
+90 312 811 18 00/ 3179



Fatih Öztürk
Engineering Information Technologies
Information Technologies
fatih.ozt...@tai.com.tr

TAI - Turkish Aerospace Industries, Inc.   www.tai.com.tr / www.taigermany.com
Address: Fethiye Mh. Havacilik Blv. No:17 06980 Kazan – ANKARA / TURKEY
Tel: +90 (312) 811 1800 / 810 8000-3179 // Fax: +90 312 811 1425



[slurm-dev] Re: Slurm Upgrade from 14.11.3 to 16.5.3 - Instructions needed

2016-08-18 Thread Barbara Krasovec


Hello!

On 18/08/16 12:33, Ole Holm Nielsen wrote:



On 08/17/2016 03:49 PM, Barbara Krasovec wrote:

I upgraded SLURM from 15.08 to 16.05 without draining the nodes and
without losing any jobs; this was my procedure:

I increased timeouts in slurm.conf:
SlurmctldTimeout=3600
SlurmdTimeout=3600


Question: When you change parameters in slurm.conf, how do you force 
all daemons to reconfigure this?  Can you reload the slurm.conf 
without restarting the daemons (which is what you want to avoid)?

scontrol reconfigure



Did mysqldump of slurm database and copied slurmstate dir (just in


What do you mean by "slurmstate"?

StateSaveLocation=/var/spool/slurmstate
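
So the pre-upgrade backup amounts to something like this (a sketch only;
the database name slurm_acct_db is the usual default and the backup paths
are just examples):

  mysqldump -p slurm_acct_db > /root/slurm_acct_db-backup.sql
  cp -a /var/spool/slurmstate /var/spool/slurmstate.backup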



case), I increased innodb_buffer_pool_size in my.cnf to 128M, then I followed


The 128 MB seems to be the default!  But in the SLURM accounting page 
I found some recommendations for the MySQL/Mariadb configuration.  How 
to implement these recommendations has now been added to my Wiki:

https://wiki.fysik.dtu.dk/niflheim/SLURM#mysql-configuration


the instructions on slurm page:


Shutdown the slurmdbd daemon
Upgrade the slurmdbd daemon
Restart the slurmdbd daemon
Shutdown the slurmctld daemon(s)
Shutdown the slurmd daemons on the compute nodes
Upgrade the slurmctld and slurmd daemons
Restart the slurmd daemons on the compute nodes
Restart the slurmctld daemon(s)


Chris Samuel in a previous posting had some more cautious advice about 
upgrading slurmd daemons!  I hope that Chris may offer additional insights.


/Ole

Cheers,
Barbara


[slurm-dev] Re: Slurm Upgrade from 14.11.3 to 16.5.3 - Instructions needed

2016-08-18 Thread Barbara Krasovec

Hello!

On 17/08/16 23:59, Balaji Deivam wrote:
Re: [slurm-dev] Re: Slurm Upgrade from 14.11.3 to 16.5.3 - 
Instructions needed


Hi,

Can someone give me the detailed steps on "Upgrade the slurmdbd
daemon"?



I have downloaded the Slurm source tar file and am looking for how to 
upgrade only the slurmdbd from that tar file.




Thanks & Regards,
Balaji Deivam
Staff Analyst - Business Data Center
Seagate Technology - 389 Disc Drive, Longmont, CO 80503 | 720-684-3395


Did you mention the operating system you are using? In my case, I used 
RPMs, so I built RPMs for the new version and then upgraded the 
following packages: slurm-slurmdbd, slurm-munge, slurm-sql


Instructions on how to build/install SLURM: 
http://slurm.schedmd.com/quickstart_admin.html


Cheers,

Barbara



[slurm-dev] Re: Slurm Upgrade from 14.11.3 to 16.5.3 - Instructions needed

2016-08-18 Thread Ole Holm Nielsen



On 08/17/2016 03:49 PM, Barbara Krasovec wrote:

I upgraded SLURM from 15.08 to 16.05 without draining the nodes and
without losing any jobs; this was my procedure:

I increased timeouts in slurm.conf:
SlurmctldTimeout=3600
SlurmdTimeout=3600


Question: When you change parameters in slurm.conf, how do you force all 
daemons to reconfigure this?  Can you reload the slurm.conf without 
restarting the daemons (which is what you want to avoid)?



Did mysqldump of slurm database and copied slurmstate dir (just in


What do you mean by "slurmstate"?


case), I increased innodb_buffer_pool_size in my.cnf to 128M, then I followed


The 128 MB seems to be the default!  But in the SLURM accounting page I 
found some recommendations for the MySQL/Mariadb configuration.  How to 
implement these recommendations has now been added to my Wiki:

https://wiki.fysik.dtu.dk/niflheim/SLURM#mysql-configuration
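
The recommendations boil down to a small my.cnf fragment along these lines
(the values are illustrative only; see the accounting page and the wiki above
for the actual guidance):

  [mysqld]
  innodb_buffer_pool_size = 1024M
  innodb_log_file_size = 64M
  innodb_lock_wait_timeout = 900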


the instructions on slurm page:


Shutdown the slurmdbd daemon
Upgrade the slurmdbd daemon
Restart the slurmdbd daemon
Shutdown the slurmctld daemon(s)
Shutdown the slurmd daemons on the compute nodes
Upgrade the slurmctld and slurmd daemons
Restart the slurmd daemons on the compute nodes
Restart the slurmctld daemon(s)
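
A command-level sketch of that sequence (a hedged example only: it assumes
the systemd unit names slurmdbd/slurmctld/slurmd from the stock packaging,
yum-managed RPMs as discussed elsewhere in the thread, and pdsh or similar
to reach the compute nodes):

  systemctl stop slurmdbd                            # shut down slurmdbd
  yum update slurm-slurmdbd slurm-sql slurm-munge    # upgrade the slurmdbd packages
  systemctl start slurmdbd                           # restart slurmdbd (watch the DB conversion)
  systemctl stop slurmctld                           # shut down slurmctld
  pdsh -a 'systemctl stop slurmd'                    # shut down slurmd on the compute nodes
  yum update 'slurm*'                                # upgrade the remaining packages (head node)
  pdsh -a "yum -y update 'slurm*'"                   # ...and on the compute nodes
  pdsh -a 'systemctl start slurmd'                   # restart slurmd on the compute nodes
  systemctl start slurmctld                          # restart slurmctld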


Chris Samuel in a previous posting had some more cautious advice about 
upgrading slurmd daemons!  I hope that Chris may offer additional insights.


/Ole