Re: [slurm-users] salloc: error: Error on msg accept socket: Too many open files

2021-02-02 Thread Andy Riebs
Run salloc with a smaller number of nodes or tasks, then take a look at lsof (or some other favorite means of finding IP connections). IIRC, each srun/node in the allocation needs 70-80 IP connections with the node running salloc, so a large node count can overwhelm the default allocation of
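As a rough illustration of the arithmetic in this reply, the Python sketch below compares an estimate of the connections an allocation might need with the open-file soft limit of the current process. The figure of 80 connections per node comes from the "70-80" recollection above and is not a measured value; the helper names and the node count of 64 are hypothetical.

```python
import resource

# Assumption: ~80 connections per node, per the "70-80" recollection in the post.
CONNECTIONS_PER_NODE = 80


def estimated_connections(node_count, per_node=CONNECTIONS_PER_NODE):
    """Rough upper estimate of sockets the host running salloc may need."""
    return node_count * per_node


def fits_in_limit(node_count, soft_limit):
    """True if the estimate stays under the given open-file soft limit."""
    return estimated_connections(node_count) < soft_limit


if __name__ == "__main__":
    # RLIMIT_NOFILE is the per-process cap on open file descriptors.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    nodes = 64  # hypothetical allocation size
    print(f"soft limit={soft}, hard limit={hard}")
    if soft == resource.RLIM_INFINITY:
        print("soft limit is unlimited")
    else:
        print(f"{nodes} nodes -> ~{estimated_connections(nodes)} connections, "
              f"fits: {fits_in_limit(nodes, soft)}")
```

If the estimate exceeds the soft limit, raising it in the launching shell (for example with `ulimit -n`, up to the hard limit) before running salloc is one thing to try.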

Re: [slurm-users] only 1 job running

2021-01-28 Thread Andy Riebs
Hi Chandler, If the only changes to your system have been the slurm.conf configuration and the addition of a new node, the easiest way to track this down is probably to show us the diffs between the previous and current versions of slurm.conf, and a note about what's different about the new

Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-01-25 Thread Andy Riebs
See below On 1/25/2021 9:36 AM, Ole Holm Nielsen wrote: On 1/25/21 2:59 PM, Andy Riebs wrote: Several things to keep in mind... 1. Slurm, as a product undergoing frequent, incompatible revisions, is not well-suited for provisioning from a stable public repository! On the other hand

Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-01-25 Thread Andy Riebs
prove it, based in part on feedback, over time." 4. Slurm packages (and other contributions, including suggestions on this mailing list) that haven't been provided by SchedMD have probably been provisioned and tested by a volunteer -- be sure to keep the conversation civil! Andy R

Re: [slurm-users] Question about unit tests

2020-12-09 Thread Andy Riebs
Did you do the first "make check" from the top-level Slurm directory (not testsuite/slurm_unit)? On 12/8/2020 11:15 PM, Rikimaru Honjo wrote: Hi, I ran unit tests according to the following document. https://github.com/SchedMD/slurm/blob/master/testsuite/slurm_unit/README As a result, all

Re: [slurm-users] [EXT] Re: pmix issue

2020-12-08 Thread Andy Riebs
uld also note I used devtoolset-10 (gcc 10) on RHEL7 and confirmed that everything was compiled with that version of compiler. I also set LD_LIBRARY_PATH to include /share/local/pmix-3.2.1 Cheers! Phil *From: *slurm-users on behalf of Andy Riebs *Reply-To: *"a...@candooz.com" , Slur

Re: [slurm-users] [EXT] Re: pmix issue

2020-12-07 Thread Andy Riebs
and confirmed that everything was compiled with that version of compiler. I also set LD_LIBRARY_PATH to include /share/local/pmix-3.2.1 Cheers! Phil *From: *slurm-users on behalf of Andy Riebs *Reply-To: *"a...@candooz.com" , Slurm User Community List *Date: *Friday, December 4, 2020

Re: [slurm-users] pmix issue

2020-12-04 Thread Andy Riebs
Also, Slurm was built with "/fs/local/pmix-3.2.1" -- does that translate well to "/share/local/pmix-3.2.1"? Andy On 12/4/2020 2:59 PM, Andy Riebs wrote: Are you sure that /share/local/pmix-3.2.1 exists on the compute nodes? On 12/4/2020 2:54 PM, Yuengling, Philip J.

Re: [slurm-users] pmix issue

2020-12-04 Thread Andy Riebs
Are you sure that /share/local/pmix-3.2.1 exists on the compute nodes? On 12/4/2020 2:54 PM, Yuengling, Philip J. wrote: Hi everyone, I’ve been having difficulty getting the --mpi=pmix_v3 option to work for me.  I can get --mpi=pmi2 to work ok, but I really want to understand what I’m doing

Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-27 Thread Andy Riebs
Nodes=ALL Default=YES MaxTime=INFINITE State=UP > about the only thing I can think of is to make one of the nodes on the other side of the gateway into the control node > *Steve Bland* > *Technical Product Manager* > *Third Party Products* > Ross Vid

Re: [slurm-users] [EXTERNAL] Re: trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-26 Thread Andy Riebs
configuring the slurm.conf node entries. In fact slurm seems to be case sensitive, which surprised the heck out of me *From:* slurm-users *On Behalf Of *Andy Riebs *Sent:* Thursday, November 26, 2020 12:50 *To:* slurm-users@lists.schedmd.com *Subject:* [EXTERNAL] Re: [slurm-users] trying

Re: [slurm-users] trying to diagnose a connectivity issue between the slurmctld process and the slurmd nodes

2020-11-26 Thread Andy Riebs
1. Look for firewalls on all of your Slurm nodes -- they almost always break Slurm communications. 2. Confirm that "ssh srvgridslurm01 hostname" returns, exactly, "srvgridslurm01" Andy On 11/26/2020 12:21 PM, Steve Bland wrote: Sinfo always returns nodes not responding [root@srvgridslurm03
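The hostname check in point 2 can be sketched in Python as below. The comparison is deliberately exact and case-sensitive, because the reply asks for the name to match "exactly". The function name `hostname_matches` is hypothetical, and the expected name would be whatever appears in the node entries of slurm.conf.

```python
import socket


def hostname_matches(expected, actual=None):
    """Exact, case-sensitive comparison of the local hostname with `expected`.

    `actual` defaults to what the operating system reports for this host.
    """
    if actual is None:
        actual = socket.gethostname()
    return actual == expected


if __name__ == "__main__":
    expected = "srvgridslurm01"  # node name taken from the thread
    print(f"local hostname: {socket.gethostname()!r}, "
          f"matches {expected!r}: {hostname_matches(expected)}")
```

A fully qualified name or a differently capitalized name counts as a mismatch here, which is the point of the check.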

Re: [slurm-users] missing info from sacct

2020-11-18 Thread Andy Riebs
       cpu         1 How it is calculating the hour in a day . Regards Navin. On Wed, Nov 18, 2020 at 7:51 PM Andy Riebs <a...@candooz.com> wrote: I see from your subsequent post that you're using a pair of clusters with a single database, so yes, you are using federation.

Re: [slurm-users] missing info from sacct

2020-11-18 Thread Andy Riebs
, Andy Riebs wrote: Are you using federated clusters? If not, check slurm.conf -- do you have FirstJobId set? Andy On 11/18/2020 8:42 AM, navin srivastava wrote: While running the sacct we found that some jobid are not listing. 5535566      SYNTHLIBT+  stdg_defq   stdg_acc          1
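One way to act on the "check slurm.conf" suggestion is to look the parameter up programmatically. The Python sketch below is a simplified reader, not Slurm's own parser: it assumes plain `Key=Value` tokens, treats `#` as the start of a comment, matches keys case-insensitively, and ignores include files and quoted values. The function name `find_setting` and the sample configuration text are hypothetical.

```python
def find_setting(conf_text, key):
    """Return the value of the first `key=value` token found, or None."""
    wanted = key.lower()
    for raw in conf_text.splitlines():
        # Drop trailing comments, then inspect each whitespace-separated token.
        line = raw.split("#", 1)[0].strip()
        for token in line.split():
            name, sep, value = token.partition("=")
            if sep and name.lower() == wanted:
                return value
    return None


if __name__ == "__main__":
    # Hypothetical configuration text, for illustration only.
    sample = "# example\nClusterName=test\nFirstJobId=5000000 # start here\n"
    print(find_setting(sample, "FirstJobId"))
```

On a live system, `scontrol show config` reports the configuration the controller is actually running with, which may be a more direct check than reading the file.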

Re: [slurm-users] job startup timeouts?

2019-04-26 Thread Andy Riebs
> Doug Jacobsen, Ph.D. > NERSC Computer Systems Engineer > Acting Group Lead, Computational Systems Group > National Energy Research Scientific Computing Center > dmjacob...@lbl.gov

Re: [slurm-users] job startup timeouts?

2019-04-26 Thread Andy Riebs
ing that we're missing that can be changed to accommodate a lengthy startup time like this? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you!

[slurm-users] job startup timeouts?

2019-04-26 Thread Andy Riebs
to accommodate a lengthy startup time like this? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you!

Re: [slurm-users] Slurm 1 CPU

2019-04-04 Thread Andy Riebs
in slurm.conf, on the line(s) starting "NodeName=", you'll want to add specs for sockets, cores, and threads/core. *From:* Chris Bateson *Sent:* Thursday, April 04, 2019 5:18PM *To:* Slurm-users *Cc:* *Subject:*
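To make the advice concrete, here is a small Python sketch that formats a node line carrying socket, core, and thread counts. `Sockets`, `CoresPerSocket`, `ThreadsPerCore`, and `CPUs` are slurm.conf node parameters; the helper name `node_line`, the node name, and the hardware figures are made up for illustration.

```python
def node_line(name, sockets, cores_per_socket, threads_per_core):
    """Build a slurm.conf NodeName line from a node's CPU topology."""
    # Total logical CPUs is the product of the three topology figures.
    cpus = sockets * cores_per_socket * threads_per_core
    return (f"NodeName={name} CPUs={cpus} Sockets={sockets} "
            f"CoresPerSocket={cores_per_socket} "
            f"ThreadsPerCore={threads_per_core}")


if __name__ == "__main__":
    # Hypothetical node: 2 sockets, 8 cores per socket, 2 threads per core.
    print(node_line("node01", 2, 8, 2))
```

Running `slurmd -C` on a compute node prints the hardware that slurmd detects, which is a convenient source for the real values.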

[slurm-users] Resolution! was Re: Mysterious job terminations on Slurm 17.11.10

2019-03-12 Thread Andy Riebs
suggestions. Andy ---- *From:* Andy Riebs *Sent:* Thursday, January 31, 2019 2:04PM *To:* Slurm-users *Cc:* *Subject:* Mysterious job terminations on Slurm 17.11.10 Hi All, Just checking to see if this sounds familiar to anyone. Environment: - CentOS 7.5 x86_64 - Slurm 17.11.10 (bu

Re: [slurm-users] Large job starvation on cloud cluster

2019-02-27 Thread Andy Riebs
Michael, are you setting time limits for the jobs? That's a huge part of a scheduler's decision about whether another job can be run. For example, if a job is running with the Slurm default of "infinite," the scheduler will likely decide that jobs that will fit in the remaining nodes will be

[slurm-users] Mysterious job terminations on Slurm 17.11.10

2019-01-31 Thread Andy Riebs
goes down. With SlurmctldDebug=debug2 and SlurmdDebug=debug5, there's not a clue in the slurmctld or slurmd logs. Any thoughts on what might be happening, or what I might try next? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 M

Re: [slurm-users] Simple question but I can't find the answer

2019-01-10 Thread Andy Riebs
Is it following a host name, or a partition name? If the latter, it just means that it's the default partition. *From:* Jeffrey R. Lang *Sent:* Thursday, January 10, 2019 11:13AM *To:* Slurm-users *Cc:* *Subject:*

Re: [slurm-users] Can't find an address

2018-10-25 Thread Andy Riebs
Make sure that the "hostname" command returns the same name that Slurm expects on your compute nodes. *From:* Zohar Roe Mlm *Sent:* Thursday, October 25, 2018 3:02AM *To:* 'Slurm User Community List' *Cc:* *Subject:* Re:

Re: [slurm-users] Looking for old SLURM versions

2018-10-24 Thread Andy Riebs
Bob, you can find older versions of Slurm at the archive, at . Andy *From:* Bob Healey *Sent:* Wednesday, October 24, 2018 5:51PM *To:* Slurm-users *Cc:* *Subject:* [slurm-users]

[slurm-users] "Task %d reported exit for a second time" on Slurm 17.11.9-2 and 17.11.10

2018-10-05 Thread Andy Riebs
ges. Any thoughts about what might be going on here? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you!

Re: [slurm-users] Enable SLURM Accounting

2018-05-28 Thread Andy Riebs
ontrol reconfigure' the SLURM configuration. Is that all I have to do? Ronan -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you!

Re: [slurm-users] network/communication failure

2018-05-21 Thread Andy Riebs
and Biological Engineering http://che.eng.ua.edu University of Alabama 3448 SEC, Box 870203 Tuscaloosa, AL  35487 (205) 348-1733 (phone) (205) 561-7450 (cell) (205) 348-7558 (fax) htur...@eng.ua.edu http://turnerresearchgroup.ua.edu -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High

Re: [slurm-users] Nodes are down after 2-3 minutes.

2018-05-07 Thread Andy Riebs
Administrator for Research/ Division of Radiation & Cancer  Biology Department of Radiation Oncology Stanford University School of Medicine Stanford, California 94305 Tel:1-650-498-7969   No Texting Fax:1-650-723-7382 -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise

Re: [slurm-users] SLURM on Ubuntu 16.04

2018-04-25 Thread Andy Riebs
1-650-723-7382 -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you!

[slurm-users] What can cause a job to get killed?

2018-04-17 Thread Andy Riebs
logs also indicate "task (Y) exited. Killed by signal 9." Any thoughts about why a job would get cancelled without getting any more detail than this? Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404

Re: [slurm-users] "allocated+" status

2018-04-17 Thread Andy Riebs
missing? Andy On 04/16/2018 02:15 PM, Kilian Cavalotti wrote: Hi Andy, On Mon, Apr 16, 2018 at 8:43 AM, Andy Riebs <andy.ri...@hpe.com> wrote: I hadn't realized that jobs can be scheduled to run on a node that is still in "completing" state from an earlier job. We occasional

Re: [slurm-users] "allocated+" status

2018-04-16 Thread Andy Riebs
Thanks Kilian! On 04/16/2018 02:15 PM, Kilian Cavalotti wrote: Hi Andy, On Mon, Apr 16, 2018 at 8:43 AM, Andy Riebs <andy.ri...@hpe.com> wrote: I hadn't realized that jobs can be scheduled to run on a node that is still in "completing" state from an earlier job. We occasio

[slurm-users] "allocated+" status

2018-04-16 Thread Andy Riebs
Other than coding a little loop to wait until the desired nodes are "idle" before scheduling a job, is there an automated way to say "don't start a job on a node until it reaches 'idle' status?" Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Com
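For reference, the "little loop" mentioned in the question might look like the Python sketch below. It is the manual workaround, not the automated mechanism being asked about. It assumes `sinfo` is on the PATH and uses its `-h` (no header), `-n` (node list), and `-o "%T"` (node state) options; the function names, polling interval, and timeout are hypothetical choices.

```python
import subprocess
import time


def sinfo_states(nodes):
    """Return the set of states sinfo reports for a node list such as 'n[1-4]'."""
    out = subprocess.run(
        ["sinfo", "-h", "-n", nodes, "-o", "%T"],
        capture_output=True, text=True, check=True,
    ).stdout
    return {line.strip() for line in out.splitlines() if line.strip()}


def wait_until_idle(nodes, get_states=sinfo_states, poll_seconds=5, timeout=600):
    """Poll until every listed node reports 'idle'; False if the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if get_states(nodes) == {"idle"}:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
```

A job script could call `wait_until_idle("n[1-4]")` before submitting work. The `get_states` parameter exists only so the loop can be exercised without a running cluster.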

Re: [slurm-users] slurm and dates?

2018-02-26 Thread Andy Riebs
course you can certainly query the system to see what the user has set. but to me this is a little muddy. i'd prefer the dates either come out in UTC or have a timezone appended to the output, though i suspect that's easier said than done... -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Ente

Re: [slurm-users] Too many single-stream jobs?

2018-02-12 Thread Andy Riebs
least it seems harmful to still have that in the code. You should file a bug for that. HTH Matthieu 2018-02-12 22:42 GMT+01:00 Andy Riebs <andy.ri...@hpe.com>: We have a user who wants to run multiple instances of a single process job across

Re: [slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

2017-12-08 Thread Andy Riebs
Answering my own question, I got private email which points to <https://bugs.schedmd.com/show_bug.cgi?id=4412>, describing both the problem and the solution. (Thanks Matthieu!) Andy On 12/08/2017 11:06 AM, Andy Riebs wrote: I've gathered more information, and I am probably having a

Re: [slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

2017-12-08 Thread Andy Riebs
error recovering username" seems likely to be at the heart of the problem here. This worked just fine with Slurm 16.05.8, and I think it was also working with Slurm 17.11.0-0pre2. Any thoughts about where I should go from here? Andy On 11/30/2017 08:40 AM, Andy Riebs wrote: We've just

[slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

2017-11-30 Thread Andy Riebs
If anyone can shine some light on where I should start looking, I shall be most obliged! Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 648 9024 My opinions are not necessarily those of HPE May the source be with you!

[slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

2017-11-29 Thread Andy Riebs
It's been a long day (for other reasons), so I'll go dig into this tomorrow. But if anyone can shine some light on where I should start looking, I shall be most obliged! Andy -- Andy Riebs andy.ri...@hpe.com Hewlett-Packard Enterprise High Performance Computing Software Engineering +1 404 6

Re: [slurm-users] Priority wait

2017-11-14 Thread Andy Riebs
Hi Roy, What command are you using to start the jobs? On 11/14/2017 09:58 AM, Zohar Roe MLM wrote: Hello, Trying again with the slurm.conf This time. I have a cluster name: Autobot In this cluster I have servers: Optimus[1-10] and Megatron[1-10]. I sent 3000 jobs with feature Optimus

Re: [slurm-users] Using LSPSuite with SBATCH

2017-11-13 Thread Andy Riebs
Paul, it would be incredibly helpful to reveal * What version of Slurm you are using * What Slurm commands you are using * The mpirun command(s) that do effect what you desire * Your slurm configuration -- preferably a copy of slurm.conf (with node names and IP addresses obscured for security