[slurm-dev] SLURM terminating jobs before they finish

2017-04-17 Thread Batsirai Mabvakure

Hi,

Slurm had been running okay until recently, when my jobs started being
terminated before they finish running. At first I thought it was the memory,
so I allocated --mem=1 and then moved to --mem=2, but the jobs still run
halfway and stop without an error in the slurm.out file. I then tried a job
that ran and completed a week ago, and it also terminated halfway. Has anyone
ever experienced and rectified this? I also tried:

scontrol show config | grep InactiveLimit

InactiveLimit   = 0 sec


Regards,

Batsirai


The views expressed in this email are, unless otherwise stated, those of the 
author and not those of the National Health Laboratory Service or its 
management. The information in this e-mail is confidential and is intended 
solely for the addressee.
Access to this e-mail by anyone else is unauthorized. If you are not the 
intended recipient, any disclosure, copying, distribution or any action taken 
or omitted in reliance on this, is prohibited and may be unlawful.
Whilst all reasonable steps are taken to ensure the accuracy and integrity of 
information and data transmitted electronically and to preserve the 
confidentiality thereof, no liability or responsibility whatsoever is accepted 
if information or data is, for whatever reason, corrupted or does not reach its 
intended destination.


[slurm-dev] Re: Error messages: find_node_record: lookup failure when setting FQDN for compute nodes

2017-04-17 Thread Jianwen Wei
Thank you, Ryan. I read through the "NodeName" section in
https://slurm.schedmd.com/slurm.conf.html and found no clue on
setting a host list like "node[001-512].yourdomain.com". As quoted below, Slurm
seems to support "node[001-512]" only, which gives rise to lookup failures if
the FQDN (node300.yourdomain.com) is used on the compute node.

> NodeName
> Name that Slurm uses to refer to a node (or base partition for BlueGene 
> systems). Typically this would be the string that "/bin/hostname -s" returns. 
> It may also be the fully qualified domain name as returned by "/bin/hostname 
> -f" (e.g. "foo1.bar.com"), or any valid domain name associated with the host 
> through the host database (/etc/hosts) or DNS, depending on the resolver 
> settings. Note that if the short form of the hostname is not used, it may 
> prevent use of hostlist expressions (the numeric portion in brackets must be 
> at the end of the string). Only short hostname forms are compatible with the 
> switch/nrt plugin at this time. It may also be an arbitrary string if 
> NodeHostname is specified. If the NodeName is "DEFAULT", the values specified 
> with that record will apply to subsequent node specifications unless 
> explicitly set to other values in that node record or replaced with a 
> different set of default values. Each line where NodeName is "DEFAULT" will 
> replace or add to previous default values and not reinitialize the default 
> values. For architectures in which the node order is significant, nodes will 
> be considered consecutive in the order defined. For example, if the 
> configuration for "NodeName=charlie" immediately follows the configuration 
> for "NodeName=baker" they will be considered adjacent in the computer.
> NodeHostname
> Typically this would be the string that "/bin/hostname -s" returns. It may 
> also be the fully qualified domain name as returned by "/bin/hostname -f" 
> (e.g. "foo1.bar.com"), or any valid domain name associated with the host 
> through the host database (/etc/hosts) or DNS, depending on the resolver 
> settings. Note that if the short form of the hostname is not used, it may 
> prevent use of hostlist expressions (the numeric portion in brackets must be 
> at the end of the string). Only short hostname forms are compatible with the 
> switch/nrt plugin at this time. A node range expression can be used to 
> specify a set of nodes. If an expression is used, the number of nodes 
> identified by NodeHostname on a line in the configuration file must be 
> identical to the number of nodes identified by NodeName. By default, the 
> NodeHostname will be identical in value to NodeName.
> NodeAddr
> Name that a node should be referred to in establishing a communications path. 
> This name will be used as an argument to the gethostbyname() function for 
> identification. If a node range expression is used to designate multiple 
> nodes, they must exactly match the entries in the NodeName (e.g. 
> "NodeName=lx[0-7] NodeAddr=elx[0-7]"). NodeAddr may also contain IP 
> addresses. By default, the NodeAddr will be identical in value to 
> NodeHostname.
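
Reading the passage above, the numeric range must sit at the very end of the
string, so a domain suffix rules out hostlist expressions. A slurm.conf sketch
built from this thread's example names (the hardware figures are copied from
the config quoted later in the thread and are assumptions here):

```
# Short NodeName keeps the bracketed hostlist expression usable:
NodeName=node[001-512] CPUs=16 SocketsPerBoard=2 CoresPerSocket=8 ThreadsPerCore=1 RealMemory=64100
# If the FQDN must be recorded explicitly, ranges cannot carry the
# domain suffix, so such nodes would need individual entries, e.g.:
# NodeName=node300 NodeHostname=node300.yourdomain.com
```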



Best,

Jianwen

> On 16 Apr 2017, at 00:51, Ryan Novosielski  wrote:
> 
> Read this slurm.conf manual, under the parameters that start with Node. They 
> discuss this situation. 
> 
> --
> 
> || \\UTGERS,   |---*O*---
> ||_// the State | Ryan Novosielski - novos...@rutgers.edu 
> 
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
> `'
> 
> On Apr 15, 2017, at 11:47, Jianwen Wei wrote:
> 
>> Hi,
>> 
>> I used *short* hostnames (say node306) on all my compute nodes and in my 
>> SLURM settings before, and this worked well. However, error messages appear 
>> in /var/log/slurmctld.log when I set an FQDN for the compute nodes.
>> 
>> [2017-04-15T22:50:06.149] error: find_node_record: lookup failure for 
>> node306.yourdomain.com
>> 
>> On node306:
>> 
>> $ hostname
>> node306.yourdomain.com
>> $ hostname -s
>> node306
>> $ hostname -f
>> node306.yourdomain.com 
>> 
>> In /etc/slurm/slurm.conf, short hostnames are used, since an FQDN prevents 
>> use of hostlist expressions. That is, "node[001-332].yourdomain.com" 
>> is invalid.
>> 
>> NodeName=node[001-332] CPUs=16 SocketsPerBoard=2 CoresPerSocket=8 
>> ThreadsPerCore=1 RealMemory=64100
>> 
>> So far, SLURM works fine despite the error message appearing in the log 
>> every 10 minutes. I appreciate any suggestion on this issue.
>> 
>> Best,
>> 
>> Jianwen
>> 



[slurm-dev] Re: Power user sstat rights

2017-04-17 Thread Christopher Benjamin Coffey
Hello all,

In my attempt to create another “root” user, I’ve found that it is not possible 
to create another user with the ability to run “sstat jobid” on every job on 
the cluster. This must be a bug. Can anyone confirm this? Thanks!
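
(For context, the closest built-in knob appears to be the AdminLevel field in
Slurm's accounting database; whether Operator actually extends sstat to all
running jobs is exactly the open question here, so treat this as an unverified
sketch. The user name below is a placeholder.)

```
# Unverified sketch: grant Operator admin level via sacctmgr
sacctmgr modify user where name=chris set AdminLevel=Operator
# Check the result:
sacctmgr show user chris format=User,AdminLevel
```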

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167

On 3/14/17, 12:55 PM, "Christopher Benjamin Coffey"  
wrote:

Hello, anyone know if this is possible? Thanks! ☺

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167

On 3/8/17, 9:19 AM, "Christopher Benjamin Coffey"  
wrote:

Hello,

Is it possible to create a Slurm account that has read access via 
sstat to all running jobs, without granting modification privileges? 
Thank you.

Best,
Chris

—
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167







[slurm-dev] Re: SLURM terminating jobs before they finish

2017-04-17 Thread Benjamin Redling

Hi Batsirai,

On 17.04.2017 at 14:54, Batsirai Mabvakure wrote:
> SLURM has been running okay until recently my jobs are terminating before 
> they finish. 
> I have tried increasing memory using --mem, but still the jobs stop
halfway with an error in the slurm.out file.
> I then tried running again a job which once ran and completed a week
ago, it also terminated halfway. [...]

Are you allowed to post (relevant parts of) the slurm.out file?

Regards,
Benjamin
-- 
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
vox: +49 3641 9 44323 | fax: +49 3641 9 44321


[slurm-dev] SLURM terminating jobs before they finish

2017-04-17 Thread Batsirai Mabvakure
Hi,

SLURM had been running okay until recently; now my jobs are terminating before 
they finish. I have tried increasing memory using --mem, but the jobs still 
stop halfway with an error in the slurm.out file. I then re-ran a job which 
once ran and completed a week ago, and it also terminated halfway. Has anyone 
ever experienced this challenge?

Regards,

Batsirai


[slurm-dev] Re: job stats in e-mail

2017-04-17 Thread Sander Kuusemets

You can write a wrapper script to send the mail. In slurm.conf:

MailProg=/etc/slurm/MailWrapper.sh

Slurm will call this script instead of its default mail program whenever it 
sends email, and you can add any kind of information to the message there 
before sending it.
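
A minimal sketch of such a wrapper (assumptions flagged in the comments: the
argument order Slurm uses when invoking MailProg, and that sacct and mail are
in PATH; verify both on your site):

```shell
#!/bin/bash
# Hypothetical /etc/slurm/MailWrapper.sh -- a sketch, not a verified drop-in.
# Assumed invocation (check your Slurm version's actual arguments):
#   MailWrapper.sh -s "<subject>" <recipient>
subject="$2"
recipient="$3"

# Extract the numeric job id from a subject like "Slurm Job_id=12345 Name=..."
jobid=$(printf '%s' "$subject" | grep -oE 'Job_id=[0-9]+' | cut -d= -f2)

if [ -n "$recipient" ]; then
    {
        if [ -n "$jobid" ]; then
            # Append accounting stats for the job; sacct must be in PATH.
            sacct -j "$jobid" --format=JobID,JobName,Elapsed,MaxRSS,State
        fi
    } | mail -s "$subject" "$recipient"
fi
```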

Best regards,

--
Sander Kuusemets
University of Tartu, High Performance Computing, IT Specialist
Skype: sander.kuusemets1
+372 737 5694

On 04/17/2017 12:09 PM, Vladimir Daric wrote:


Hello,

I would like to automatically send some job stats by email when the 
#SBATCH --mail-user and #SBATCH --mail-type options are set for a job.

With our Slurm cluster configuration, when those options are used an 
empty mail is sent; all the useful information is in the mail subject.

Thanks in advance for any advice,
Vladimir





[slurm-dev] job stats in e-mail

2017-04-17 Thread Vladimir Daric

Hello, 

I would like to automatically send some job stats by email when the #SBATCH 
--mail-user and #SBATCH --mail-type options are set for a job. 

With our Slurm cluster configuration, when those options are used an empty 
mail is sent; all the useful information is in the mail subject. 

Thanks in advance for any advice, 
Vladimir 



[slurm-dev] Re: Slurm leaving nodes in COMPLETING state

2017-04-17 Thread Sander Kuusemets

Alright, I found the problem.

sdiag says that my agent queue size is HUGE.

[root@rocket ~]# sdiag
***
sdiag output at Mon Apr 17 11:54:02 2017
Data since  Mon Apr 17 08:57:20 2017
***
Server thread count: 3
Agent queue size:1410747


My questions are: first, can I somehow increase the server thread count?

Secondly, I think the queue size is due to one of our users currently 
submitting a large number of job steps. (For every 20-core node there are 
598 job steps; it's an embarrassingly parallel job.) We allow him to use 
2000 cores, so 2000/20*598 = 59,800 job steps. Can I somehow make Slurm 
manage the agent queue more actively (the server is responding very 
slowly), or do we have to make his jobs more cluster-friendly?
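
As for cluster-friendliness: one common pattern (a sketch, assuming the 598
tasks per node are independent single-core processes; work_unit is a
placeholder binary, not from this thread) is to run the tasks as throttled
background processes inside a single allocation rather than as 598 separate
job steps:

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=20        # one 20-core node
# Sketch: 598 independent single-core tasks with at most 20 in flight,
# as plain background processes instead of 598 job steps.
for i in $(seq 1 598); do
    ./work_unit "$i" &     # work_unit is a hypothetical task binary
    while [ "$(jobs -rp | wc -l)" -ge 20 ]; do
        wait -n            # block until one running task finishes (bash 4.3+)
    done
done
wait                       # collect the stragglers
```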


Best regards,

--
Sander Kuusemets
University of Tartu, High Performance Computing, IT Specialist
Skype: sander.kuusemets1
+372 737 5694

On 04/12/2017 03:41 PM, Burian, John wrote:
If you’re using proctrack/cgroup or task/cgroup, you may be waiting 
for cgroups cleanup to finish. On our cluster, a large batch of jobs 
that die immediately, or canceling a large batch of jobs all at once, 
leaves those jobs in CG state for some time. If I look at a node, I 
see that all the per-job cgroup cleanup scripts are trying 
simultaneously to get a filesystem lock on the cgroups state 
directory, resulting in deadlocks that take a while to work out.


John


From: Sander Kuusemets
Reply-To: slurm-dev
Date: Wed, 12 Apr 2017 04:51:53 -0700
To: slurm-dev
Subject: [slurm-dev] Re: Slurm leaving nodes in COMPLETING state

Hello,


Did you change /etc/slurm.conf?

I have, several times. But I do this with configuration management
tools, which do restart the slurmd and slurmctld daemons
afterwards.

Do you have prolog scripts running on the compute nodes that
might be stuck?

No, I do not have any epi/prolog scripts configured.

Is there any Slurm plugin that is issuing any external command
(like the nhc that runs on Cray nodes) at job termination?

We have no plugins installed, nor NHC or other tools like this.
It's quite a vanilla installation.

Just in case adding the slurm config from a random node:

https://pastebin.com/zcXYFmHB

Best regards,

-- 
Sander Kuusemets

University of Tartu, High Performance Computing, IT Specialist
Skype: sander.kuusemets1
+372 737 5694

On 04/12/2017 01:36 PM, Miguel Gila wrote:

Hello,

Do you have prolog scripts running on the compute nodes that
might be stuck (e.g. doing IO)? Is there any Slurm plugin that
is issuing any external command (like the nhc that runs on Cray
nodes) at job termination?

M.

-- 
Miguel Gila

CSCS Swiss National Supercomputing Centre
HPC Operations
Via Trevano 131 | CH-6900 Lugano | Switzerland
mg [at] cscs.ch





On 12 Apr 2017, at 11:29, Benedikt Schäfer
> wrote:

Did you change /etc/slurm.conf?
You can try:
- on clients
systemctl restart slurmd (be sure that slurm service is down and
only slurmd is running)
- do on master:
scontrol reconfigure

best regards
Benedikt

~ Benedikt Schaefer benedikt.schae...@emea.nec.com ~
~ Senior System Analyst ~
~ NEC Deutschland GmbH ~
~ HPCE Division ~
~ Raiffeisenstr. 14, 70771 Leinfelden-Echterdingen, Germany ~
~ Tel: +49 711 780 55 21  Mobile: +49 152 22851542  Fax: +49 711 780 55 25 ~
~ NEC Deutschland GmbH, Hansaallee 101, D-40549 Duesseldorf ~
~ Geschaeftsfuehrer: Yuichi Kojima ~
~ Handelsregister Duesseldorf HRB 57941; VAT ID DE129424743 ~

-----Original Message-----
From: Sander Kuusemets [mailto:sander.kuusem...@ut.ee]
Sent: Wednesday, 12 April 2017 10:22
To: 

[slurm-dev] Re: Slurm with Torque

2017-04-17 Thread Gilles Gouaillardet

Mahmood,


FWIW, Slurm provides Torque-compatible commands (qsub, qstat, pbsnodes) 
that can help your users transition from Torque to Slurm:


your users can submit Torque scripts on your Slurm cluster

qsub script.pbs

until they move to Slurm

sbatch script.slurm

Cheers


Gilles
On 4/16/2017 11:11 PM, Mahmood Naderan wrote:

Slurm with Torque
Hi,
Currently, Torque is running on our cluster. I want to know: is it 
possible to install Slurm, create some test partitions, and submit some 
test jobs to be sure it is working while Torque is still running? 
Then we would be able to tell the users to use Slurm scripts. Any feedback 
is welcome.


Regards,
Mahmood