Re: [slurm-users] Slurm Perl API use and examples

2020-03-23 Thread Thomas M. Payerle
I was never able to figure out how to use the Perl API shipped with Slurm;
instead I have written some Perl wrappers around some of the Slurm commands.
My wrappers for the sacctmgr and sshare commands are available on
CPAN:
https://metacpan.org/release/Slurm-Sacctmgr
https://metacpan.org/release/Slurm-Sshare
(I have similar wrappers for a few other commands that are not yet polished
enough for a CPAN release, but I am willing to share them if you contact me.)
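
A rough usage sketch, in case it helps (the constructor argument and method
names follow the module's synopsis as I remember it; please check the POD on
metacpan before relying on them):

  use strict;
  use warnings;
  use Slurm::Sacctmgr;
  use Slurm::Sacctmgr::Account;

  # Path to the sacctmgr binary is an assumption; point it at your install.
  my $sa = Slurm::Sacctmgr->new(sacctmgr => '/usr/bin/sacctmgr');

  # List the accounts known to slurmdbd and print their names.
  # (I expect sacctmgr_list to return a listref of Slurm::Sacctmgr::Account
  # objects with an account() accessor; verify against the module's docs.)
  my $accounts = Slurm::Sacctmgr::Account->sacctmgr_list($sa);
  foreach my $acct (@{$accounts}) {
      print $acct->account, "\n";
  }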

On Mon, Mar 23, 2020 at 3:49 PM Burian, John <
john.bur...@nationwidechildrens.org> wrote:

> I have some questions about the Slurm Perl API
> - Is it still actively supported? I see it's still in the source in Git.
> - Does anyone use it? If so, do you have a pointer to some example code?
>
> My immediate question is, for methods that take a data structure as an
> input argument, how does one define that data structure? In Perl it's just
> a hash; am I supposed to populate the keys of the hash by reading the
> matching C structure in slurm.h? Or do I only need to populate the keys
> that I care to provide a value for, with Slurm assigning defaults to the
> other keys/fields? Thanks,
>
> --
> John Burian
> Senior Systems Programmer, Technical Lead
> Institutional High Performance Computing
> Abigail Wexner Research Institute, Nationwide Children’s Hospital
>
>
>

-- 
Tom Payerle
DIT-ACIGS/Mid-Atlantic Crossroads       paye...@umd.edu
5825 University Research Park   (301) 405-6135
University of Maryland
College Park, MD 20740-3831


[slurm-users] Slurm Perl API use and examples

2020-03-23 Thread Burian, John
I have some questions about the Slurm Perl API
- Is it still actively supported? I see it's still in the source in Git.
- Does anyone use it? If so, do you have a pointer to some example code?

My immediate question is, for methods that take a data structure as an input
argument, how does one define that data structure? In Perl it's just a hash;
am I supposed to populate the keys of the hash by reading the matching C
structure in slurm.h? Or do I only need to populate the keys that I care to
provide a value for, with Slurm assigning defaults to the other keys/fields?
Thanks,

-- 
John Burian
Senior Systems Programmer, Technical Lead
Institutional High Performance Computing
Abigail Wexner Research Institute, Nationwide Children’s Hospital




[slurm-users] 19.05 not recognizing DefMemPerCPU?

2020-03-23 Thread Prentice Bisbal
Last week I upgraded from Slurm 18.08 to Slurm 19.05. Since that time, 
several users have reported to me that they can't submit jobs without 
specifying a memory requirement. In a way, this is intended: my
job_submit.lua script checks to make sure that --mem or --mem-per-cpu
is specified, and will reject a job if neither of those is specified.


But here's the thing - that check should never be needed, because I have 
set a default value of 2 GB/CPU in my slurm.conf file, and that worked
fine up until the upgrade last week. Did 19.05 change the way defaults 
are set, or is this a bug?


From my slurm.conf file:

DefMemPerCPU=2000

Any ideas why this behavior changed with the upgrade?
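
In case it is useful for comparison, this is how I would check what the
running daemons actually picked up (a sketch; it assumes scontrol is in the
PATH and that your version reports the per-partition value there):

  # value slurmctld believes is in effect cluster-wide
  scontrol show config | grep -i DefMemPerCPU

  # per-partition value, if one is set there
  scontrol show partition | grep -i DefMemPerCPU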

Prentice




Re: [slurm-users] Running an MPI job across two partitions

2020-03-23 Thread Renfro, Michael
Others might have more ideas, but anything I can think of would require a lot 
of manual steps to avoid mutual interference with jobs in the other partitions 
(allocating resources for a dummy job in the other partition, modifying the MPI 
host list to include nodes in the other partition, etc.).

So why not make another partition encompassing both sets of nodes?
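
Nodes are allowed to be listed in more than one partition, so the combined
partition is just one more line in slurm.conf. A sketch (the partition and
node names here are made up):

  PartitionName=part1    Nodes=node[01-10] State=UP
  PartitionName=part2    Nodes=node[11-20] State=UP
  # extra partition spanning both node sets, for MPI jobs that need all of them
  PartitionName=combined Nodes=node[01-20] State=UP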

> On Mar 23, 2020, at 10:58 AM, CB  wrote:
> 
> Hi Andy,
> 
> Yes, they are on the same network fabric.
> 
> Sure, creating another partition that encompasses all of the nodes of the
> two or more partitions would solve the problem.
> I am wondering if there are any other ways besides creating a new
> partition?
> 
> Thanks,
> Chansup
> 
> 
> On Mon, Mar 23, 2020 at 11:51 AM Riebs, Andy  wrote:
> When you say “distinct compute nodes,” are they at least on the same network 
> fabric?
> 
>  
> 
> If so, the first thing I’d try would be to create a new partition that 
> encompasses all of the nodes of the other two partitions.
> 
>  
> 
> Andy
> 
>  
> 
> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
> CB
> Sent: Monday, March 23, 2020 11:32 AM
> To: Slurm User Community List 
> Subject: [slurm-users] Running an MPI job across two partitions
> 
>  
> 
> Hi,
> 
>  
> 
> I'm running Slurm version 19.05.
> 
>  
> 
> Is there any way to launch an MPI job on a group of distributed  nodes from 
> two or more partitions, where each partition has distinct compute nodes?
> 
>  
> 
> I've looked at the heterogeneous job support, but it creates two separate jobs.
> 
>  
> 
> If there is no such capability with the current Slurm, I'd like to hear any 
> recommendations or suggestions.
> 
>  
> 
> Thanks,
> 
> Chansup
> 



Re: [slurm-users] Running an MPI job across two partitions

2020-03-23 Thread CB
Hi Andy,

Yes, they are on the same network fabric.

Sure, creating another partition that encompasses all of the nodes of the two
or more partitions would solve the problem.
I am wondering if there are any other ways besides creating a new
partition?

Thanks,
Chansup


On Mon, Mar 23, 2020 at 11:51 AM Riebs, Andy  wrote:

> When you say “distinct compute nodes,” are they at least on the same
> network fabric?
>
>
>
> If so, the first thing I’d try would be to create a new partition that
> encompasses all of the nodes of the other two partitions.
>
>
>
> Andy
>
>
>
> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On
> Behalf Of CB
> Sent: Monday, March 23, 2020 11:32 AM
> To: Slurm User Community List 
> Subject: [slurm-users] Running an MPI job across two partitions
>
>
>
> Hi,
>
>
>
> I'm running Slurm version 19.05.
>
>
>
> Is there any way to launch an MPI job on a group of distributed  nodes
> from two or more partitions, where each partition has distinct compute
> nodes?
>
>
>
> I've looked at the heterogeneous job support, but it creates two separate
> jobs.
>
>
>
> If there is no such capability with the current Slurm, I'd like to hear
> any recommendations or suggestions.
>
>
>
> Thanks,
>
> Chansup
>


Re: [slurm-users] Can slurm be configured to only run one job at a time?

2020-03-23 Thread Faraz Hussain
The singleton dependency seems exactly what I need!

However, does it really matter to the network if I upload five 1 GB files
sequentially or all at once? I am not too savvy on how routers operate, but
don't they already do some kind of load balancing to make sure enough
bandwidth is available to other users?






On Monday, March 23, 2020, 11:36:46 AM EDT, Renfro, Michael  
wrote: 





Rather than configure it to only run one job at a time, you can use job
dependencies to make sure only one job of a particular type runs at a time. A
singleton dependency [1, 2] should work for this. From [1]:

  #SBATCH --dependency=singleton --job-name=big-youtube-upload

in any job script would ensure that only one job with that job name runs at
a time.

[1] https://slurm.schedmd.com/sbatch.html
[2] https://hpc.nih.gov/docs/job_dependencies.html

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601    / Tennessee Tech University

> On Mar 23, 2020, at 10:00 AM, Faraz Hussain  wrote:
> 
> I have a five-node cluster of Raspberry Pis. Every hour they all have to
> upload a local 1 GB file to YouTube. I want it so only one Pi can upload at a
> time so that the network doesn't get bogged down.
> 
> Can slurm be configured to only run one job at a time? Or perhaps some other 
> way to accomplish what I want?
> 
> Thanks!
> 



Re: [slurm-users] Running an MPI job across two partitions

2020-03-23 Thread Riebs, Andy
When you say “distinct compute nodes,” are they at least on the same network 
fabric?

If so, the first thing I’d try would be to create a new partition that 
encompasses all of the nodes of the other two partitions.

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of CB
Sent: Monday, March 23, 2020 11:32 AM
To: Slurm User Community List 
Subject: [slurm-users] Running an MPI job across two partitions

Hi,

I'm running Slurm version 19.05.

Is there any way to launch an MPI job on a group of distributed  nodes from two 
or more partitions, where each partition has distinct compute nodes?

I've looked at the heterogeneous job support, but it creates two separate jobs.

If there is no such capability with the current Slurm, I'd like to hear any 
recommendations or suggestions.

Thanks,
Chansup


Re: [slurm-users] Can slurm be configured to only run one job at a time?

2020-03-23 Thread Renfro, Michael
Rather than configure it to only run one job at a time, you can use job
dependencies to make sure only one job of a particular type runs at a time. A
singleton dependency [1, 2] should work for this. From [1]:

  #SBATCH --dependency=singleton --job-name=big-youtube-upload

in any job script would ensure that only one job with that job name runs at
a time.

[1] https://slurm.schedmd.com/sbatch.html
[2] https://hpc.nih.gov/docs/job_dependencies.html
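
A minimal job script along those lines might look like this (the upload
command is a placeholder for whatever you actually run on the Pis):

  #!/bin/bash
  #SBATCH --job-name=youtube-upload   # same name on every Pi, so only one runs at a time
  #SBATCH --dependency=singleton      # wait for earlier jobs with this name to finish
  #SBATCH --ntasks=1

  # placeholder command; replace with however you actually do the upload
  ./upload_to_youtube.sh /data/hourly.mp4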

-- 
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Mar 23, 2020, at 10:00 AM, Faraz Hussain  wrote:
> 
> I have a five-node cluster of Raspberry Pis. Every hour they all have to
> upload a local 1 GB file to YouTube. I want it so only one Pi can upload at a
> time so that the network doesn't get bogged down.
> 
> Can slurm be configured to only run one job at a time? Or perhaps some other 
> way to accomplish what I want?
> 
> Thanks!
> 




[slurm-users] Running an MPI job across two partitions

2020-03-23 Thread CB
Hi,

I'm running Slurm version 19.05.

Is there any way to launch an MPI job on a group of distributed  nodes from
two or more partitions, where each partition has distinct compute nodes?

I've looked at the heterogeneous job support, but it creates two separate
jobs.

If there is no such capability with the current Slurm, I'd like to hear any
recommendations or suggestions.

Thanks,
Chansup


[slurm-users] Can slurm be configured to only run one job at a time?

2020-03-23 Thread Faraz Hussain
I have a five-node cluster of Raspberry Pis. Every hour they all have to upload
a local 1 GB file to YouTube. I want it so only one Pi can upload at a time so
that the network doesn't get bogged down.

Can slurm be configured to only run one job at a time? Or perhaps some other 
way to accomplish what I want?

Thanks!



Re: [slurm-users] sshare with usernames too long

2020-03-23 Thread Paul Edmon
--parsable2 will print full names.  You can also use -o to format your 
output.
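
For example (a sketch; the field names for -o/--format are as I remember
them, so check the sshare man page):

  sshare -A myaccount -a --parsable2
  sshare -A myaccount -a -o Account,User%30,RawShares,NormShares,FairShare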


-Paul Edmon-

On 3/23/2020 10:46 AM, Sysadmin CAOS wrote:

Hi,

when I run "sshare -A myaccount -a" and, myaccount containts usernames 
with more than 10 characters, "sshare" output shows a "+" at the 10th 
character and, then, I can't know what user is. This is a big problem 
for me because I have accounts in format "student-1, student-2, etc"...


Is there any way to show the entire username?

Thanks!





[slurm-users] sshare with usernames too long

2020-03-23 Thread Sysadmin CAOS

Hi,

when I run "sshare -A myaccount -a" and, myaccount containts usernames 
with more than 10 characters, "sshare" output shows a "+" at the 10th 
character and, then, I can't know what user is. This is a big problem 
for me because I have accounts in format "student-1, student-2, etc"...


Is there any way to show the entire username?

Thanks!



Re: [slurm-users] reseting SchedNodeList

2020-03-23 Thread Sefa Arslan
Thanks Paul.
Holding and releasing or requeueing the job didn't clear the
SchedNodeList value, due to the backfilling mechanism.  I could only clear
it by restarting slurmctld.


Sefa Arslan


Paul Edmon , on Mon, 23 Mar 2020 at 16:25, wrote:

> You could try holding the job and then releasing it.  I've inquired of
> SchedMD about this before, and this is the response they gave:
>
> https://bugs.schedmd.com/show_bug.cgi?id=8069
>
> -Paul Edmon-
> On 3/23/2020 8:05 AM, Sefa Arslan wrote:
>
> Hi,
>
> Due to a lack of resources in a partition, I updated the job to another
> partition and increased its priority to the top value. Although there are
> enough resources for the job to start, the updated jobs have not started
> yet.  When I looked using "scontrol show job <jobid>", I saw that the
> SchedNodeList value is not updated and is still pointing to nodes from the
> earlier partition. Is there a way to reset/clear the SchedNodeList value?
> Or force slurmctld to start the job immediately?
>
>
> Regards,
>
>


Re: [slurm-users] reseting SchedNodeList

2020-03-23 Thread Paul Edmon
You could try holding the job and then releasing it.  I've inquired of
SchedMD about this before, and this is the response they gave:


https://bugs.schedmd.com/show_bug.cgi?id=8069
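
That is, something along these lines (jobid being the job in question):

  scontrol hold <jobid>
  scontrol release <jobid>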

-Paul Edmon-

On 3/23/2020 8:05 AM, Sefa Arslan wrote:

Hi,

Due to a lack of resources in a partition, I updated the job to another
partition and increased its priority to the top value. Although there are
enough resources for the job to start, the updated jobs have not
started yet.  When I looked using "scontrol show job <jobid>", I saw that the
SchedNodeList value is not updated and is still pointing to nodes from
the earlier partition. Is there a way to reset/clear the SchedNodeList
value? Or force slurmctld to start the job immediately?



Regards,


[slurm-users] reseting SchedNodeList

2020-03-23 Thread Sefa Arslan
Hi,

Due to a lack of resources in a partition, I updated the job to another
partition and increased its priority to the top value. Although there are
enough resources for the job to start, the updated jobs have not started
yet.  When I looked using "scontrol show job <jobid>", I saw that the
SchedNodeList value is not updated and is still pointing to nodes from the
earlier partition. Is there a way to reset/clear the SchedNodeList value?
Or force slurmctld to start the job immediately?


Regards,


Re: [slurm-users] Accounting Information from slurmdbd does not reach slurmctld

2020-03-23 Thread Marcus Wagner

Hi Pascal,

are the slurmdbd and slurmctld running on the same host?

Best
Marcus

On 20.03.2020 at 18:12, Pascal Klink wrote:

Hi Chris,

Thanks for the quick answer! I tried the 'sacctmgr show clusters' command,
which gave:

   Cluster ControlHost ControlPort  RPC Share ... QOS    Def QOS
---------- ----------- ----------- ---- ----- --- ------ -------
iascluster   127.0.0.1        6817 8192     1     normal


I removed the columns that had no value between the 'Share' and 'QOS' columns.
You can also see the relevant output of slurmctld and slurmdbd below (it was
running in debug mode):

slurmctld:

[2020-03-18T22:59:52.441] debug:  sched: slurmctld starting
[2020-03-18T22:59:52.441] slurmctld version 17.11.2 started on cluster 
iascluster
[2020-03-18T22:59:52.442] Munge cryptographic signature plugin loaded
[2020-03-18T22:59:52.442] preempt/none loaded
[2020-03-18T22:59:52.442] debug:  Checkpoint plugin loaded: checkpoint/none
[2020-03-18T22:59:52.442] debug:  AcctGatherEnergy NONE plugin loaded
[2020-03-18T22:59:52.442] debug:  AcctGatherProfile NONE plugin loaded
[2020-03-18T22:59:52.442] debug:  AcctGatherInterconnect NONE plugin loaded
[2020-03-18T22:59:52.442] debug:  AcctGatherFilesystem NONE plugin loaded
[2020-03-18T22:59:52.442] debug:  Job accounting gather LINUX plugin loaded
[2020-03-18T22:59:52.442] ExtSensors NONE plugin loaded
[2020-03-18T22:59:52.442] debug:  switch NONE plugin loaded
[2020-03-18T22:59:52.442] debug:  power_save module disabled, SuspendTime < 0
[2020-03-18T22:59:52.442] debug:  No backup controller to shutdown
[2020-03-18T22:59:52.442] Accounting storage SLURMDBD plugin loaded with 
AuthInfo=(null)
[2020-03-18T22:59:52.442] debug:  Munge authentication plugin loaded
[2020-03-18T22:59:52.443] debug:  slurmdbd: Sent PersistInit msg
[2020-03-18T22:59:52.443] slurmdbd: recovered 0 pending RPCs
[2020-03-18T22:59:52.447] debug:  Reading slurm.conf file: 
/etc/slurm-llnl/slurm.conf
[2020-03-18T22:59:52.447] layouts: no layout to initialize
[2020-03-18T22:59:52.447] topology NONE plugin loaded
[2020-03-18T22:59:52.447] debug:  No DownNodes
[2020-03-18T22:59:52.492] debug:  Log file re-opened
[2020-03-18T22:59:52.492] error: chown(var/log/slurm/slurmctld.log, 64030, 
64030): No such file or directory
[2020-03-18T22:59:52.492] sched: Backfill scheduler plugin loaded
[2020-03-18T22:59:52.493] route default plugin loaded
[2020-03-18T22:59:52.493] layouts: loading entities/relations information
[2020-03-18T22:59:52.493] debug:  layouts: 7/7 nodes in hash table, rc=0
[2020-03-18T22:59:52.493] debug:  layouts: loading stage 1
[2020-03-18T22:59:52.493] debug:  layouts: loading stage 1.1 (restore state)
[2020-03-18T22:59:52.493] debug:  layouts: loading stage 2
[2020-03-18T22:59:52.493] debug:  layouts: loading stage 3
[2020-03-18T22:59:52.493] Recovered state of 7 nodes
[2020-03-18T22:59:52.493] Down nodes: cn[01-07]
[2020-03-18T22:59:52.493] Recovered information about 0 jobs
[2020-03-18T22:59:52.493] debug:  Updating partition uid access list
[2020-03-18T22:59:52.493] Recovered state of 0 reservations
[2020-03-18T22:59:52.493] State of 0 triggers recovered
[2020-03-18T22:59:52.493] _preserve_plugins: backup_controller not specified
[2020-03-18T22:59:52.493] Running as primary controller
[2020-03-18T22:59:52.493] debug:  No BackupController, not launching heartbeat.
[2020-03-18T22:59:52.493] Registering slurmctld at port 6817 with slurmdbd.
[2020-03-18T22:59:52.528] debug:  No feds to retrieve from state
[2020-03-18T22:59:52.572] debug:  Priority MULTIFACTOR plugin loaded
[2020-03-18T22:59:52.572] No parameter for mcs plugin, default values set
[2020-03-18T22:59:52.572] mcs: MCSParameters = (null). ondemand set.
[2020-03-18T22:59:52.572] debug:  mcs none plugin loaded
[2020-03-18T22:59:52.573] debug:  power_save mode not enabled
[2020-03-18T22:59:55.578] debug:  Spawning registration agent for cn[01-07] 7 
hosts
[2020-03-18T23:00:05.591] agent/is_node_resp: node:cn01 
RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2020-03-18T23:00:05.591] agent/is_node_resp: node:cn02 
RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2020-03-18T23:00:05.591] agent/is_node_resp: node:cn07 
RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2020-03-18T23:00:05.592] agent/is_node_resp: node:cn04 
RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2020-03-18T23:00:05.592] agent/is_node_resp: node:cn03 
RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2020-03-18T23:00:05.592] agent/is_node_resp: node:cn05 
RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2020-03-18T23:00:05.592] agent/is_node_resp: node:cn06 
RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2020-03-18T23:00:22.494] debug:  backfill: beginning
[2020-03-18T23:00:22.494] debug:  backfill: no jobs to backfill