[slurm-users] "Task %d reported exit for a second time" on Slurm 17.11.9-2 and 17.11.10

2018-10-05 Thread Andy Riebs
of messages. Any thoughts about what might be going on here? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you!

Re: [slurm-users] Looking for old SLURM versions

2018-10-24 Thread Andy Riebs
Bob, you can find older versions of Slurm at the archive, at . Andy *From:* Bob Healey *Sent:* Wednesday, October 24, 2018 5:51PM *To:* Slurm-users *Cc:* *Subject:* [slurm-users] Looki

Re: [slurm-users] Can't find an address

2018-10-25 Thread Andy Riebs
Make sure that the "hostname" command returns the same name that Slurm expects on your compute nodes. *From:* Zohar Roe Mlm *Sent:* Thursday, October 25, 2018 3:02AM *To:* 'Slurm User Community List' *Cc:* *Subject:* Re:
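The check suggested above can be sketched in a few lines of Python. This is only an illustration: the node name "node01" is a hypothetical placeholder, and the exact, case-sensitive comparison is an assumption of the sketch rather than a statement about how Slurm resolves names internally.

```python
import socket
from typing import Optional


def hostname_matches(expected: str, actual: Optional[str] = None) -> bool:
    """Return True if the local hostname equals the name Slurm is configured with.

    `expected` is the node name as written in slurm.conf (hypothetical here).
    `actual` defaults to the local hostname reported by the operating system.
    """
    if actual is None:
        actual = socket.gethostname()
    # Assumption: compare exactly, without case folding or domain stripping.
    return actual == expected


if __name__ == "__main__":
    # "node01" is a placeholder; substitute the name from your own slurm.conf.
    print(hostname_matches("node01"))
```

Running this on each compute node with the configured name shows quickly whether the two disagree, for example because one is a fully qualified name and the other is not.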

Re: [slurm-users] Simple question but I can't find the answer

2019-01-10 Thread Andy Riebs
Is it following a host name, or a partition name? If the latter, it just means that it's the default partition. *From:* Jeffrey R. Lang *Sent:* Thursday, January 10, 2019 11:13AM *To:* Slurm-users *Cc:* *Subject:* [slurm-

[slurm-users] Mysterious job terminations on Slurm 17.11.10

2019-01-31 Thread Andy Riebs
a node goes down. With SlurmctldDebug=debug2 and SlurmdDebug=debug5, there's not a clue in the slurmctld or slurmd logs. Any thoughts on what might be happening, or what I might try next? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineeri

Re: [slurm-users] Mysterious job terminations on Slurm 17.11.10

2019-02-05 Thread Andy Riebs
ons on Slurm 17.11.10 On Friday, 1 February 2019 6:04:45 AM AEDT Andy Riebs wrote: Any thoughts on what might be happening, or what I might try next? Anything in dmesg on the nodes or syslog at that time? I'm wondering if you're seeing the OOM killer step in and take processes out. What does your slurm.conf look like? All the best, Chris

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Andy Riebs
Michael, are you setting time limits for the jobs? That's a huge part of a scheduler's decision about whether another job can be run. For example, if a job is running with the Slurm default of "infinite," the scheduler will likely decide that jobs that will fit in the remaining nodes will be ab

[slurm-users] Resolution! was Re: Mysterious job terminations on Slurm 17.11.10

2019-03-12 Thread Andy Riebs
to offer suggestions. Andy -------- *From:* Andy Riebs *Sent:* Thursday, January 31, 2019 2:04PM *To:* Slurm-users *Cc:* *Subject:* Mysterious job terminations on Slurm 17.11.10 Hi All, Just checking to see if this sounds familiar to anyone. Environment: - CentOS 7.5 x86_64 - Slurm 17.11.1

Re: [slurm-users] Slurm 1 CPU

2019-04-04 Thread Andy Riebs
in slurm.conf, on the line(s) starting "NodeName=", you'll want to add specs for sockets, cores, and threads/core. *From:* Chris Bateson *Sent:* Thursday, April 04, 2019 5:18PM *To:* Slurm-users *Cc:* *Subject:* [slurm-us
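As a rough illustration of what such a line might look like, the Python sketch below formats a NodeName entry from the three values mentioned. Sockets, CoresPerSocket, ThreadsPerCore and CPUs are slurm.conf parameters; the node name and the counts are made-up example values, and deriving CPUs as the product of the three is an assumption of this sketch.

```python
def node_line(name: str, sockets: int, cores_per_socket: int,
              threads_per_core: int) -> str:
    """Format a slurm.conf NodeName line from socket/core/thread counts."""
    # Assumption: CPUs is the total number of hardware threads on the node.
    cpus = sockets * cores_per_socket * threads_per_core
    return (
        f"NodeName={name} CPUs={cpus} Sockets={sockets} "
        f"CoresPerSocket={cores_per_socket} ThreadsPerCore={threads_per_core}"
    )


if __name__ == "__main__":
    # Hypothetical node: 2 sockets, 8 cores per socket, 2 threads per core.
    print(node_line("node01", 2, 8, 2))
```

The values should come from the hardware itself, not from guesswork; the generated line is only a starting point to compare against what the node reports.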

Re: [slurm-users] Scontrol update: invalid user id

2019-04-15 Thread Andy Riebs
The "invalid user id" message suggests that you need to be running as root (or possibly as the slurm user?) to update the node state. Run "slurmd -Dvv" as root on one of the compute nodes and it will show you what it thinks is the socket/core/thread configuration.

[slurm-users] job startup timeouts?

2019-04-26 Thread Andy Riebs
n be changed to accommodate a lengthy startup time like this? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you!

Re: [slurm-users] job startup timeouts?

2019-04-26 Thread Andy Riebs
ler numbers of nodes. Is there a timeout setting that we're missing that can be changed to accommodate a lengthy startup time like this? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you!

Re: [slurm-users] job startup timeouts?

2019-04-26 Thread Andy Riebs
m depending on your local configuration. > > Doug Jacobsen, Ph.D. > NERSC Computer Systems Engineer > Acting Group Lead, Computational Systems Group > National Energy Research Scientific Computing Center > dmjacob...@lbl.gov

Re: [slurm-users] missing info from sacct

2020-11-18 Thread Andy Riebs
Are you using federated clusters? If not, check slurm.conf -- do you have FirstJobId set? Andy On 11/18/2020 8:42 AM, navin srivastava wrote: While running the sacct we found that some jobid are not listing. 5535566      SYNTHLIBT+  stdg_defq   stdg_acc          1  COMPLETED      0:0 5535567

Re: [slurm-users] missing info from sacct

2020-11-18 Thread Andy Riebs
15 AM, Andy Riebs wrote: Are you using federated clusters? If not, check slurm.conf -- do you have FirstJobId set? Andy On 11/18/2020 8:42 AM, navin srivastava wrote: While running the sacct we found that some jobid are not listing. 5535566      SYNTHLIBT+  stdg_defq   stdg_acc      

Re: [slurm-users] missing info from sacct

2020-11-18 Thread Andy Riebs
                     cpu         1 How it is calculating the hour in a day. Regards Navin. On Wed, Nov 18, 2020 at 7:51 PM Andy Riebs <a...@candooz.com> wrote: I see from your subsequent post that you're using a pair of clusters with a single database, so yes, you ar

Re: [slurm-users] trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-26 Thread Andy Riebs
1. Look for a firewall on all of your Slurm nodes -- they almost always break Slurm communications. 2. Confirm that "ssh srvgridslurm01 hostname" returns, exactly, "srvgridslurm01" Andy On 11/26/2020 12:21 PM, Steve Bland wrote: Sinfo always returns nodes not responding [root@srvgridslurm03 ~
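One way to test the firewall suspicion is to try a plain TCP connection to the Slurm daemons' ports from another node. The sketch below assumes the default ports (6817 for slurmctld, 6818 for slurmd) have not been changed in slurm.conf. A refused or timed-out connection does not prove a firewall is the cause, only that the port is not reachable.

```python
import socket

# Assumption: default SlurmctldPort and SlurmdPort, not overridden in slurm.conf.
SLURMCTLD_PORT = 6817
SLURMD_PORT = 6818


def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Host name taken from the message above; run from a different node.
    print(port_reachable("srvgridslurm01", SLURMD_PORT))
```

Checking in both directions (controller to compute node and back) is worthwhile, since the daemons initiate connections both ways.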

Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-26 Thread Andy Riebs
e -s’ when configuring the slurm.conf node entries. In fact slurm seems to be case sensitive, which surprised the heck out of me *From:* slurm-users *On Behalf Of *Andy Riebs *Sent:* Thursday, November 26, 2020 12:50 *To:* slurm-users@lists.schedmd.com *Subject:* [EXTERNAL] Re: [slurm-users]

Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-27 Thread Andy Riebs
Name=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP > > about the only thing I can think of is to make one of the nodes on the > otherside of the gateway into the control node > > > *Steve Bland* > *Technical Product Manager* > > *Third Party Products* > Ross

Re: [slurm-users] pmix issue

2020-12-04 Thread Andy Riebs
Are you sure that /share/local/pmix-3.2.1 exists on the compute nodes? On 12/4/2020 2:54 PM, Yuengling, Philip J. wrote: Hi everyone, I’ve been having difficulty getting the --mpi=pmix_v3 option to work for me.  I can get --mpi=pmi2 to work ok, but I really want to understand what I’m doing

Re: [slurm-users] pmix issue

2020-12-04 Thread Andy Riebs
Also, Slurm was built with "/fs/local/pmix-3.2.1" -- does that translate well to "/share/local/pmix-3.2.1"? Andy On 12/4/2020 2:59 PM, Andy Riebs wrote: Are you sure that /share/local/pmix-3.2.1 exists on the compute nodes? On 12/4/2020 2:54 PM, Yuengling, Philip J.

Re: [slurm-users] [EXT] Re: pmix issue

2020-12-07 Thread Andy Riebs
confirmed that everything was compiled with that version of compiler. I also set LD_LIBRARY_PATH to include /share/local/pmix-3.2.1 Cheers! Phil *From: *slurm-users on behalf of Andy Riebs *Reply-To: *"a...@candooz.com" , Slurm User Community List *Date: *Friday, December 4, 2020 at 3

Re: [slurm-users] [EXT] Re: pmix issue

2020-12-08 Thread Andy Riebs
I should also note I used devtoolset-10 (gcc 10) on RHEL7 and confirmed that everything was compiled with that version of compiler. I also set LD_LIBRARY_PATH to include /share/local/pmix-3.2.1 Cheers! Phil *From: *slurm-users on behalf of Andy Riebs *Reply-To: *"a...@candooz.com" ,

Re: [slurm-users] Question about unit tests

2020-12-09 Thread Andy Riebs
Did you do the first "make check" from the top-level Slurm directory (not testsuite/slurm_unit)? On 12/8/2020 11:15 PM, Rikimaru Honjo wrote: Hi, I ran unit tests according to the following document. https://github.com/SchedMD/slurm/blob/master/testsuite/slurm_unit/README As a result, all un

Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-01-25 Thread Andy Riebs
en improve it, based in part on feedback, over time." 4. Slurm packages (and other contributions, including suggestions on this mailing list) that haven't been provided by SchedMD have probably been provisioned and tested by a volunteer -- be sure to keep the conversation civi

Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-01-25 Thread Andy Riebs
See below On 1/25/2021 9:36 AM, Ole Holm Nielsen wrote: On 1/25/21 2:59 PM, Andy Riebs wrote: Several things to keep in mind...  1. Slurm, as a product undergoing frequent, incompatible revisions, is     not well-suited for provisioning from a stable public repository! On     the other hand

Re: [slurm-users] only 1 job running

2021-01-28 Thread Andy Riebs
Hi Chandler, If the only changes to your system have been the slurm.conf configuration and the addition of a new node, the easiest way to track this down is probably to show us the diffs between the previous and current versions of slurm.conf, and a note about what's different about the new n

Re: [slurm-users] salloc: error: Error on msg accept socket: Too many open files

2021-02-02 Thread Andy Riebs
Run salloc with a smaller number of nodes or tasks, then take a look at lsof (or some other favorite means of finding IP connections). IIRC, each srun/node in the allocation needs 70-80 IP connections with the node running salloc, so a large node count can overwhelm the default allocation of fi
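The arithmetic behind that estimate can be sketched as follows. The figure of 80 connections per node is the upper end of the "IIRC" range quoted above, not a documented constant, and the sketch assumes the limit in question is the per-process open-file limit (RLIMIT_NOFILE), as the subject line suggests. It uses Python's Unix-only resource module.

```python
import resource
from typing import Optional


def estimated_connections(nodes: int, per_node: int = 80) -> int:
    """Rough number of connections salloc's host may need for an allocation."""
    # Assumption: 80 per node, the upper end of the range recalled above.
    return nodes * per_node


def soft_fd_limit() -> int:
    """Return the current soft limit on open file descriptors."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def allocation_fits(nodes: int, limit: Optional[int] = None,
                    per_node: int = 80) -> bool:
    """Return True if the estimate stays below the file-descriptor limit."""
    if limit is None:
        limit = soft_fd_limit()
    if limit == resource.RLIM_INFINITY:
        return True
    return estimated_connections(nodes, per_node) < limit


if __name__ == "__main__":
    print(soft_fd_limit(), allocation_fits(100))
```

With a common soft limit of 1024, this estimate is already exceeded at 13 nodes, which is consistent with the advice to retry with a smaller allocation and inspect the connections.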

Re: [slurm-users] Using LSPSuite with SBATCH

2017-11-13 Thread Andy Riebs
Paul, it would be incredibly helpful to reveal * What version of Slurm you are using * What Slurm commands you are using * The mpirun command(s) that do effect what you desire * Your slurm configuration -- preferably a copy of slurm.conf (with node names and IP addresses obscured for security re

Re: [slurm-users] Priority wait

2017-11-14 Thread Andy Riebs
Hi Roy, What command are you using to start the jobs? On 11/14/2017 09:58 AM, Zohar Roe MLM wrote: Hello, Trying again with the slurm.conf This time. I have a cluster name: Autobot In this cluster I have servers: Optimus[1-10] and Megatron[1-10]. I sent 3000 jobs with feature Optimus and

Re: [slurm-users] Problem with slurmctl communication with clurmdbd

2017-11-29 Thread Andy Riebs
It looks like you don't have the munged daemon running. On 11/29/2017 08:01 AM, Bruno Santos wrote: Hi everyone, I have set up slurm to use slurm_db and all was working fine. However I had to change the slurm.conf to play with user priority and upon restarting the slurmctl fails with the f

[slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

2017-11-29 Thread Andy Riebs
IN It's been a long day (for other reasons), so I'll go dig into this tomorrow. But if anyone can shine some light on where I should start looking, I shall be most obliged! Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engin

[slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

2017-11-30 Thread Andy Riebs
RAIN If anyone can shine some light on where I should start looking, I shall be most obliged! Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you!

Re: [slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

2017-12-08 Thread Andy Riebs
"error recovering username" seems likely to be at the heart of the problem here. This worked just fine with Slurm 16.05.8, and I think it was also working with Slurm 17.11.0-0pre2. Any thoughts about where I should go from here? Andy On 11/30/2017 08:40 AM, Andy Riebs wrote: We

Re: [slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

2017-12-08 Thread Andy Riebs
Answering my own question, I got private email which points to <https://bugs.schedmd.com/show_bug.cgi?id=4412>, describing both the problem and the solution. (Thanks Matthieu!) Andy On 12/08/2017 11:06 AM, Andy Riebs wrote: I've gathered more information, and I am probably hav

[slurm-users] bug 4333, "srun: fatal: step_launch.c:1036 step_launch_state_destroy"

2018-01-24 Thread Andy Riebs
? Or better yet, has anyone found a way to fix it? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you!

[slurm-users] Too many single-stream jobs?

2018-02-12 Thread Andy Riebs
but that wasn't sufficient. The environment: * CentOS 7.3 (x86_64) * Slurm 17.11.0 Does this ring any bells? Any thoughts about how we should proceed? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinion

Re: [slurm-users] Too many single-stream jobs?

2018-02-12 Thread Andy Riebs
least it seems harmful to still have that in the code. You should file a bug for that. HTH Matthieu 2018-02-12 22:42 GMT+01:00 Andy Riebs <andy.ri...@hpe.com>: We have a user who wants to run multiple instances of a single process job across a cluster, using a loo

Re: [slurm-users] slurm and dates?

2018-02-26 Thread Andy Riebs
y the system to see what the user has set. but to me this is a little muddy. i'd prefer the dates either come out in UTC or have a timezone appended to the output, though i suspect that's easier said than done... -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Perfor

[slurm-users] sbatch --immediate

2018-03-02 Thread Andy Riebs
ight not always work? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you!

[slurm-users] "allocated+" status

2018-04-16 Thread Andy Riebs
pleted. Other than coding a little loop to wait until the desired nodes are "idle" before scheduling a job, is there an automated way to say "don't start a job on a node until it reaches 'idle' status?" Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterp

Re: [slurm-users] "allocated+" status

2018-04-16 Thread Andy Riebs
Thanks Kilian! On 04/16/2018 02:15 PM, Kilian Cavalotti wrote: Hi Andy, On Mon, Apr 16, 2018 at 8:43 AM, Andy Riebs wrote: I hadn't realized that jobs can be scheduled to run on a node that is still in "completing" state from an earlier job. We occasionally use epilog script

Re: [slurm-users] "allocated+" status

2018-04-17 Thread Andy Riebs
am I missing? Andy On 04/16/2018 02:15 PM, Kilian Cavalotti wrote: Hi Andy, On Mon, Apr 16, 2018 at 8:43 AM, Andy Riebs wrote: I hadn't realized that jobs can be scheduled to run on a node that is still in "completing" state from an earlier job. We occasionally use epilog

[slurm-users] What can cause a job to get killed?

2018-04-17 Thread Andy Riebs
lurmd logs also indicate "task (Y) exited. Killed by signal 9." Any thoughts about why a job would get cancelled without getting any more detail than this? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404

Re: [slurm-users] SLURM on Ubuntu 16.04

2018-04-25 Thread Andy Riebs
1-650-723-7382 -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you!

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Andy Riebs
Administrator for Research/ Division of Radiation & Cancer  Biology Department of Radiation Oncology Stanford University School of Medicine Stanford, California 94305 Tel:1-650-498-7969   No Texting Fax:1-650-723-7382 -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise

Re: [slurm-users] network/communication failure

2018-05-21 Thread Andy Riebs
Biological Engineering http://che.eng.ua.edu University of Alabama 3448 SEC, Box 870203 Tuscaloosa, AL  35487 (205) 348-1733 (phone) (205) 561-7450 (cell) (205) 348-7558 (fax) htur...@eng.ua.edu http://turnerresearchgroup.ua.edu -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High

Re: [slurm-users] Enable SLURM Accounting

2018-05-28 Thread Andy Riebs
'scontrol reconfigure' the SLURM configuration. Is that all I have to do? Ronan -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you!