Re: [slurm-users] Slurm Perl API use and examples
I was never able to figure out how to use the Perl API shipped with Slurm, so instead I have written Perl wrappers around some of the Slurm commands. My wrappers for the sacctmgr and sshare commands are available on CPAN:

https://metacpan.org/release/Slurm-Sacctmgr
https://metacpan.org/release/Slurm-Sshare

(I have similar wrappers for a few other commands, but they are not polished enough for CPAN release; I am willing to share them if you contact me.)

On Mon, Mar 23, 2020 at 3:49 PM Burian, John <john.bur...@nationwidechildrens.org> wrote:
> I have some questions about the Slurm Perl API:
> - Is it still actively supported? I see it's still in the source in Git.
> - Does anyone use it? If so, do you have a pointer to some example code?
>
> My immediate question is, for methods that take a data structure as an
> input argument, how does one define that data structure? In Perl it's just
> a hash; am I supposed to populate the keys of the hash by reading the
> matching C structure in slurm.h? Or do I only need to populate the keys
> that I care to provide a value for, and Slurm assigns defaults to the other
> keys/fields? Thanks,
>
> --
> John Burian
> Senior Systems Programmer, Technical Lead
> Institutional High Performance Computing
> Abigail Wexner Research Institute, Nationwide Children’s Hospital

--
Tom Payerle
DIT-ACIGS/Mid-Atlantic Crossroads   paye...@umd.edu
5825 University Research Park       (301) 405-6135
University of Maryland
College Park, MD 20740-3831
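In case a concrete picture helps, the wrapper approach boils down to something like this minimal, hand-rolled sketch (this is not the API of the CPAN modules above, just an illustration of parsing sacctmgr's parsable output):

    #!/usr/bin/perl
    # Minimal sketch of wrapping a Slurm command in Perl.
    # Not the Slurm::Sacctmgr API; see the CPAN documentation for that.
    use strict;
    use warnings;

    # --parsable2 gives pipe-delimited output with no trailing pipe
    my @cmd = qw(sacctmgr --noheader --parsable2 list user format=User,DefaultAccount);
    open(my $fh, '-|', @cmd) or die "Cannot run sacctmgr: $!";

    my @users;
    while (my $line = <$fh>) {
        chomp $line;
        my ($user, $defacct) = split /\|/, $line;
        push @users, { user => $user, default_account => $defacct };
    }
    close($fh);

    printf "%s (default account: %s)\n", $_->{user}, $_->{default_account} for @users;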
[slurm-users] Slurm Perl API use and examples
I have some questions about the Slurm Perl API:

- Is it still actively supported? I see it's still in the source in Git.
- Does anyone use it? If so, do you have a pointer to some example code?

My immediate question is, for methods that take a data structure as an input argument, how does one define that data structure? In Perl it's just a hash; am I supposed to populate the keys of the hash by reading the matching C structure in slurm.h? Or do I only need to populate the keys that I care to provide a value for, and Slurm assigns defaults to the other keys/fields?

Thanks,

--
John Burian
Senior Systems Programmer, Technical Lead
Institutional High Performance Computing
Abigail Wexner Research Institute, Nationwide Children’s Hospital
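For what it's worth, a sketch of what I believe the shipped API expects, pieced together from 'perldoc Slurm' (method and constant names should be double-checked against your Slurm version; the hash keys mirror the fields of update_node_msg_t in slurm.h, and my understanding — unverified — is that keys you leave unset keep the defaults the C init routines provide):

    #!/usr/bin/perl
    # Hedged sketch based on 'perldoc Slurm'; verify names against your version.
    use strict;
    use warnings;
    use Slurm qw(:constant);

    my $slurm = Slurm::new() or die "Cannot create Slurm object";

    # Keys mirror update_node_msg_t in slurm.h; set only what you need.
    my $rc = $slurm->update_node({
        node_names => "cn01",                  # hypothetical node name
        node_state => NODE_STATE_DRAIN,
        reason     => "testing the Perl API",
    });
    die "update_node failed: " . $slurm->strerror() if $rc != SLURM_SUCCESS;

    print "node drained\n";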
[slurm-users] 19.05 not recognizing DefMemPerCPU?
Last week I upgraded from Slurm 18.08 to Slurm 19.05. Since then, several users have reported that they can't submit jobs without specifying a memory requirement. In a way, this is intended: my job_submit.lua script checks that --mem or --mem-per-cpu is specified, and rejects a job if neither of those is specified. But here's the thing: that check should never be triggered, because I have set a default of 2 GB per CPU in my slurm.conf file, and that worked fine up until the upgrade last week. Did 19.05 change the way defaults are applied, or is this a bug?

From my slurm.conf file:

    DefMemPerCPU=2000

Any ideas why this behavior changed with the upgrade?

Prentice
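For reference, the value the running slurmctld actually loaded (as opposed to what is in the file) can be checked with:

    scontrol show config | grep -i DefMemPerCPU

One hedged guess worth ruling out: job_submit.lua runs before the controller fills in configured defaults, so a check inside the plugin can see the memory fields still unset even though DefMemPerCPU would be applied later. If so, the rejections come from the plugin itself rather than from 19.05 dropping the default.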
Re: [slurm-users] Running an MPI job across two partitions
Others might have more ideas, but anything I can think of would require a lot of manual steps to avoid mutual interference with jobs in the other partitions (allocating resources for a dummy job in the other partition, modifying the MPI host list to include nodes in the other partition, etc.). So why not make another partition encompassing both sets of nodes?

> On Mar 23, 2020, at 10:58 AM, CB wrote:
>
> Hi Andy,
>
> Yes, they are on the same network fabric.
>
> Sure, creating another partition that encompasses all of the nodes of the
> two or more partitions would solve the problem. I am wondering if there are
> any other ways, instead of creating a new partition?
>
> Thanks,
> Chansup
>
> On Mon, Mar 23, 2020 at 11:51 AM Riebs, Andy wrote:
> When you say “distinct compute nodes,” are they at least on the same
> network fabric?
>
> If so, the first thing I’d try would be to create a new partition that
> encompasses all of the nodes of the other two partitions.
>
> Andy
>
> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of CB
> Sent: Monday, March 23, 2020 11:32 AM
> To: Slurm User Community List
> Subject: [slurm-users] Running an MPI job across two partitions
>
> Hi,
>
> I'm running Slurm 19.05.
>
> Is there any way to launch an MPI job on a group of distributed nodes from
> two or more partitions, where each partition has distinct compute nodes?
>
> I've looked at the heterogeneous job support, but it creates two separate jobs.
>
> If there is no such capability in current Slurm, I'd like to hear any
> recommendations or suggestions.
>
> Thanks,
> Chansup
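For illustration, the combined partition is just one extra line in slurm.conf (node ranges here are placeholders; reuse the node lists of the two existing partitions), followed by an "scontrol reconfigure":

    PartitionName=combined Nodes=cn[01-16],gn[01-08] Default=NO MaxTime=INFINITE State=UP

Jobs submitted with -p combined can then span both sets of nodes in a single MPI job.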
Re: [slurm-users] Running an MPI job across two partitions
Hi Andy,

Yes, they are on the same network fabric.

Sure, creating another partition that encompasses all of the nodes of the two or more partitions would solve the problem. I am wondering if there are any other ways, instead of creating a new partition?

Thanks,
Chansup

On Mon, Mar 23, 2020 at 11:51 AM Riebs, Andy wrote:

> When you say “distinct compute nodes,” are they at least on the same
> network fabric?
>
> If so, the first thing I’d try would be to create a new partition that
> encompasses all of the nodes of the other two partitions.
>
> Andy
>
> From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of CB
> Sent: Monday, March 23, 2020 11:32 AM
> To: Slurm User Community List
> Subject: [slurm-users] Running an MPI job across two partitions
>
> Hi,
>
> I'm running Slurm 19.05.
>
> Is there any way to launch an MPI job on a group of distributed nodes
> from two or more partitions, where each partition has distinct compute
> nodes?
>
> I've looked at the heterogeneous job support, but it creates two separate
> jobs.
>
> If there is no such capability in current Slurm, I'd like to hear
> any recommendations or suggestions.
>
> Thanks,
> Chansup
Re: [slurm-users] Can slurm be configured to only run one job at a time?
The singleton dependency seems to be exactly what I need!

However, does it really matter to the network if I upload five 1 GB files sequentially or all at once? I am not too savvy on how routers operate, but don't they already do some kind of load balancing to make sure enough bandwidth is available to other users?

On Monday, March 23, 2020, 11:36:46 AM EDT, Renfro, Michael wrote:

Rather than configure it to run only one job at a time, you can use job dependencies to make sure only one job of a particular type runs at a time. A singleton dependency [1, 2] should work for this. From [1], adding

    #SBATCH --dependency=singleton --job-name=big-youtube-upload

to any job script would ensure that only one job with that job name runs at a time.

[1] https://slurm.schedmd.com/sbatch.html
[2] https://hpc.nih.gov/docs/job_dependencies.html

--
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Mar 23, 2020, at 10:00 AM, Faraz Hussain wrote:
>
> I have a five-node cluster of Raspberry Pis. Every hour they all have to
> upload a local 1 GB file to YouTube. I want it so that only one Pi can
> upload at a time, so the network doesn't get bogged down.
>
> Can Slurm be configured to run only one job at a time? Or is there perhaps
> some other way to accomplish what I want?
>
> Thanks!
Re: [slurm-users] Running an MPI job across two partitions
When you say “distinct compute nodes,” are they at least on the same network fabric?

If so, the first thing I’d try would be to create a new partition that encompasses all of the nodes of the other two partitions.

Andy

From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of CB
Sent: Monday, March 23, 2020 11:32 AM
To: Slurm User Community List
Subject: [slurm-users] Running an MPI job across two partitions

Hi,

I'm running Slurm 19.05.

Is there any way to launch an MPI job on a group of distributed nodes from two or more partitions, where each partition has distinct compute nodes?

I've looked at the heterogeneous job support, but it creates two separate jobs.

If there is no such capability in current Slurm, I'd like to hear any recommendations or suggestions.

Thanks,
Chansup
Re: [slurm-users] Can slurm be configured to only run one job at a time?
Rather than configure it to run only one job at a time, you can use job dependencies to make sure only one job of a particular type runs at a time. A singleton dependency [1, 2] should work for this. From [1], adding

    #SBATCH --dependency=singleton --job-name=big-youtube-upload

to any job script would ensure that only one job with that job name runs at a time.

[1] https://slurm.schedmd.com/sbatch.html
[2] https://hpc.nih.gov/docs/job_dependencies.html

--
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University

> On Mar 23, 2020, at 10:00 AM, Faraz Hussain wrote:
>
> I have a five-node cluster of Raspberry Pis. Every hour they all have to
> upload a local 1 GB file to YouTube. I want it so that only one Pi can
> upload at a time, so the network doesn't get bogged down.
>
> Can Slurm be configured to run only one job at a time? Or is there perhaps
> some other way to accomplish what I want?
>
> Thanks!
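To make that concrete, a minimal job script might look like the following sketch (the upload command is a placeholder; singleton waits for any earlier jobs with the same job name and user to finish):

    #!/bin/bash
    #SBATCH --job-name=big-youtube-upload   # every upload job shares this name
    #SBATCH --dependency=singleton          # start only after earlier jobs with this name/user end
    #SBATCH --ntasks=1

    # Placeholder for the real upload command
    /usr/local/bin/upload-to-youtube /data/hourly.mp4

Since all five Pis submit with the same job name and user, Slurm will run the uploads one at a time cluster-wide.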
[slurm-users] Running an MPI job across two partitions
Hi,

I'm running Slurm 19.05.

Is there any way to launch an MPI job on a group of distributed nodes from two or more partitions, where each partition has distinct compute nodes?

I've looked at the heterogeneous job support, but it creates two separate jobs.

If there is no such capability in current Slurm, I'd like to hear any recommendations or suggestions.

Thanks,
Chansup
[slurm-users] Can slurm be configured to only run one job at a time?
I have a five-node cluster of Raspberry Pis. Every hour they all have to upload a local 1 GB file to YouTube. I want it so that only one Pi can upload at a time, so the network doesn't get bogged down.

Can Slurm be configured to run only one job at a time? Or is there perhaps some other way to accomplish what I want?

Thanks!
Re: [slurm-users] sshare with usernames too long
--parsable2 will print full names. You can also use -o to format your output.

-Paul Edmon-

On 3/23/2020 10:46 AM, Sysadmin CAOS wrote:

Hi,

When I run "sshare -A myaccount -a" and myaccount contains usernames with more than 10 characters, the sshare output shows a "+" at the 10th character, so I can't tell which user it is. This is a big problem for me because I have accounts in the format "student-1, student-2, etc."... Is there any way to show the entire username?

Thanks!
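For example, either of these (format field names are from the sshare man page; the %25 suffix should widen the column, but double-check the syntax on your version):

    sshare -A myaccount -a --parsable2
    sshare -A myaccount -a -o Account,User%25,RawShares,NormShares,FairShare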
[slurm-users] sshare with usernames too long
Hi,

When I run "sshare -A myaccount -a" and myaccount contains usernames with more than 10 characters, the sshare output shows a "+" at the 10th character, so I can't tell which user it is. This is a big problem for me because I have accounts in the format "student-1, student-2, etc."... Is there any way to show the entire username?

Thanks!
Re: [slurm-users] resetting SchedNodeList
Thanks Paul. Holding and releasing, or requeueing the job, didn't clear the SchedNodeList value, due to the backfill mechanism. I could only clear it by restarting slurmctld.

Sefa Arslan

On Mon, 23 Mar 2020 at 16:25, Paul Edmon wrote:

> You could try holding the job and then releasing it. I've asked SchedMD
> about this before, and this is the response they gave:
>
> https://bugs.schedmd.com/show_bug.cgi?id=8069
>
> -Paul Edmon-
>
> On 3/23/2020 8:05 AM, Sefa Arslan wrote:
>
> Hi,
>
> Due to a lack of resources in a partition, I moved the job to another
> partition and raised its priority to the top value. Although there are now
> enough resources for the job to start, the updated jobs have not started
> yet. When I looked with "scontrol show job <jobid>", I saw that the
> SchedNodeList value was not updated and still points to nodes from the
> earlier partition. Is there a way to reset/clear the SchedNodeList value,
> or to force slurmctld to start the job immediately?
>
> Regards,
Re: [slurm-users] resetting SchedNodeList
You could try holding the job and then releasing it. I've asked SchedMD about this before, and this is the response they gave:

https://bugs.schedmd.com/show_bug.cgi?id=8069

-Paul Edmon-

On 3/23/2020 8:05 AM, Sefa Arslan wrote:

Hi,

Due to a lack of resources in a partition, I moved the job to another partition and raised its priority to the top value. Although there are now enough resources for the job to start, the updated jobs have not started yet. When I looked with "scontrol show job <jobid>", I saw that the SchedNodeList value was not updated and still points to nodes from the earlier partition. Is there a way to reset/clear the SchedNodeList value, or to force slurmctld to start the job immediately?

Regards,
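Concretely, that is (job id is a placeholder):

    scontrol hold 12345
    scontrol release 12345

The idea is that the scheduler re-evaluates the job from scratch once it is released, which may (or, per the bug report above, may not) refresh SchedNodeList.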
[slurm-users] resetting SchedNodeList
Hi,

Due to a lack of resources in a partition, I moved the job to another partition and raised its priority to the top value. Although there are now enough resources for the job to start, the updated jobs have not started yet. When I looked with "scontrol show job <jobid>", I saw that the SchedNodeList value was not updated and still points to nodes from the earlier partition. Is there a way to reset/clear the SchedNodeList value, or to force slurmctld to start the job immediately?

Regards,
Re: [slurm-users] Accounting Information from slurmdbd does not reach slurmctld
Hi Pascal,

are the slurmdbd and slurmctld running on the same host?

Best
Marcus

On 20.03.2020 at 18:12, Pascal Klink wrote:

Hi Chris,

Thanks for the quick answer! I tried the 'sacctmgr show clusters' command, which gave:

    Cluster     ControlHost  ControlPort   RPC  Share ...     QOS  Def QOS
    ----------  -----------  -----------  ----  ----- ...  ------  -------
    iascluster  127.0.0.1           6817  8192      1 ...  normal

(I removed the columns between 'Share' and 'QOS' that had no value.) Also, you can see the relevant output of slurmctld and slurmdbd here (it was running in debug mode):

slurmctld:

[2020-03-18T22:59:52.441] debug: sched: slurmctld starting
[2020-03-18T22:59:52.441] slurmctld version 17.11.2 started on cluster iascluster
[2020-03-18T22:59:52.442] Munge cryptographic signature plugin loaded
[2020-03-18T22:59:52.442] preempt/none loaded
[2020-03-18T22:59:52.442] debug: Checkpoint plugin loaded: checkpoint/none
[2020-03-18T22:59:52.442] debug: AcctGatherEnergy NONE plugin loaded
[2020-03-18T22:59:52.442] debug: AcctGatherProfile NONE plugin loaded
[2020-03-18T22:59:52.442] debug: AcctGatherInterconnect NONE plugin loaded
[2020-03-18T22:59:52.442] debug: AcctGatherFilesystem NONE plugin loaded
[2020-03-18T22:59:52.442] debug: Job accounting gather LINUX plugin loaded
[2020-03-18T22:59:52.442] ExtSensors NONE plugin loaded
[2020-03-18T22:59:52.442] debug: switch NONE plugin loaded
[2020-03-18T22:59:52.442] debug: power_save module disabled, SuspendTime < 0
[2020-03-18T22:59:52.442] debug: No backup controller to shutdown
[2020-03-18T22:59:52.442] Accounting storage SLURMDBD plugin loaded with AuthInfo=(null)
[2020-03-18T22:59:52.442] debug: Munge authentication plugin loaded
[2020-03-18T22:59:52.443] debug: slurmdbd: Sent PersistInit msg
[2020-03-18T22:59:52.443] slurmdbd: recovered 0 pending RPCs
[2020-03-18T22:59:52.447] debug: Reading slurm.conf file: /etc/slurm-llnl/slurm.conf
[2020-03-18T22:59:52.447] layouts: no layout to initialize
[2020-03-18T22:59:52.447] topology NONE plugin loaded
[2020-03-18T22:59:52.447] debug: No DownNodes
[2020-03-18T22:59:52.492] debug: Log file re-opened
[2020-03-18T22:59:52.492] error: chown(var/log/slurm/slurmctld.log, 64030, 64030): No such file or directory
[2020-03-18T22:59:52.492] sched: Backfill scheduler plugin loaded
[2020-03-18T22:59:52.493] route default plugin loaded
[2020-03-18T22:59:52.493] layouts: loading entities/relations information
[2020-03-18T22:59:52.493] debug: layouts: 7/7 nodes in hash table, rc=0
[2020-03-18T22:59:52.493] debug: layouts: loading stage 1
[2020-03-18T22:59:52.493] debug: layouts: loading stage 1.1 (restore state)
[2020-03-18T22:59:52.493] debug: layouts: loading stage 2
[2020-03-18T22:59:52.493] debug: layouts: loading stage 3
[2020-03-18T22:59:52.493] Recovered state of 7 nodes
[2020-03-18T22:59:52.493] Down nodes: cn[01-07]
[2020-03-18T22:59:52.493] Recovered information about 0 jobs
[2020-03-18T22:59:52.493] debug: Updating partition uid access list
[2020-03-18T22:59:52.493] Recovered state of 0 reservations
[2020-03-18T22:59:52.493] State of 0 triggers recovered
[2020-03-18T22:59:52.493] _preserve_plugins: backup_controller not specified
[2020-03-18T22:59:52.493] Running as primary controller
[2020-03-18T22:59:52.493] debug: No BackupController, not launching heartbeat.
[2020-03-18T22:59:52.493] Registering slurmctld at port 6817 with slurmdbd.
[2020-03-18T22:59:52.528] debug: No feds to retrieve from state
[2020-03-18T22:59:52.572] debug: Priority MULTIFACTOR plugin loaded
[2020-03-18T22:59:52.572] No parameter for mcs plugin, default values set
[2020-03-18T22:59:52.572] mcs: MCSParameters = (null). ondemand set.
[2020-03-18T22:59:52.572] debug: mcs none plugin loaded
[2020-03-18T22:59:52.573] debug: power_save mode not enabled
[2020-03-18T22:59:55.578] debug: Spawning registration agent for cn[01-07] 7 hosts
[2020-03-18T23:00:05.591] agent/is_node_resp: node:cn01 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2020-03-18T23:00:05.591] agent/is_node_resp: node:cn02 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2020-03-18T23:00:05.591] agent/is_node_resp: node:cn07 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2020-03-18T23:00:05.592] agent/is_node_resp: node:cn04 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2020-03-18T23:00:05.592] agent/is_node_resp: node:cn03 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2020-03-18T23:00:05.592] agent/is_node_resp: node:cn05 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2020-03-18T23:00:05.592] agent/is_node_resp: node:cn06 RPC:REQUEST_NODE_REGISTRATION_STATUS : Communication connection failure
[2020-03-18T23:00:22.494] debug: backfill: beginning
[2020-03-18T23:00:22.494] debug: backfill: no jobs to backfill
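For reference, the pieces that tie slurmctld to slurmdbd (host names below are placeholders) are ClusterName/AccountingStorageHost in slurm.conf and DbdHost in slurmdbd.conf:

    # slurm.conf (placeholder host name)
    ClusterName=iascluster
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=dbhost

    # slurmdbd.conf (placeholder host name)
    DbdHost=dbhost

If I read the 'sacctmgr show clusters' output correctly, a ControlHost of 127.0.0.1 means slurmctld registered with slurmdbd over the loopback interface, which would fit both daemons running on the same host.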