Unless you specify a partition, it should go to the partition defined as
default.
Do you mean not to run on particular nodes?
In that case, you can use the --exclude option:
-x, --exclude=
Explicitly exclude certain nodes from the resources granted to the job.
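For example (node names and the myjob.sh script are hypothetical):
    sbatch --exclude=node[03-04] myjob.sh
    srun -x node03 myjob.sh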
Brian Andrus
On 5/21
I saw you got some good answers, but a quick note on MPI. For some of
them, if you compile them yourself, they can be "slurm-aware" (e.g.,
OpenMPI). Then when you do 'mpirun' it automatically knows your
inherited hostlist and you need to do nothing extra when running.
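A minimal sketch of that, assuming a Slurm-aware OpenMPI build and a
hypothetical binary ./my_app:
    #!/bin/bash
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=8
    # no -np or hostfile needed; OpenMPI reads the Slurm allocation
    mpirun ./my_app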
Brian Andrus
On
Seems like there are better approaches.
In this situation, I would use an epilogue script and give sudo access
to the script. Check out https://slurm.schedmd.com/prolog_epilog.html
That would likely be much easier and fit into the methodology slurm uses.
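A minimal sketch, with hypothetical paths; the Epilog runs as root on the
compute node after each job:
    # slurm.conf
    Epilog=/etc/slurm/epilog.sh

    # /etc/slurm/epilog.sh
    #!/bin/bash
    # clean up the job's scratch area (path hypothetical)
    rm -rf "/scratch/job_${SLURM_JOB_ID}"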
Brian Andrus
Firstspot, Inc.
On 6/4
for scheduling individually. The default
value is 1.
Brian Andrus
On 6/26/2018 3:06 PM, Bill wrote:
Hi Everyone,
For example, I have two partitions, high and low, each with the same nodes,
node[1-10]. When we submit a job to the high partition the node order is
node1, node2, ..., node10; when we submit a job to low
You show you still have more than one partition with Default=YES.
There should be one and only one that is set to YES.
That is the one partition that is used if it is not specified.
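For example (partition names hypothetical), only one line carries Default=YES:
    PartitionName=main  Nodes=node[01-10] Default=YES
    PartitionName=debug Nodes=node[01-02] Default=NO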
Brian Andrus
On 7/27/2018 6:34 AM, valeri...@cbpf.br wrote:
Hi Merlin
Do you accidentally have more than one
All,
Is it possible to submit a job such that the memory limit is a
percentage of that on the node?
For instance, a cluster with nodes in the same partition that have varying
amounts of memory installed.
If it lands on a node with more memory, go ahead and use it.
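(The closest knob I am aware of, an assumption rather than a confirmed
answer from the thread: --mem=0 requests all of the memory on whichever
node the job lands on.)
    sbatch --mem=0 myjob.sh   # myjob.sh is hypothetical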
Brian Andrus
" - Guy Fleegman (Galaxy Quest)
Brian Andrus
On Tue, Nov 6, 2018 at 4:39 PM Christopher Samuel wrote:
> On 7/11/18 7:35 am, Brian Andrus wrote:
>
> > I am able to submit using account=projectB on cluster3. ???
> > Since 'projectB' is a child of account 'DevOps', which is only
Ah just scontrol reconfigure doesn't actually make it take effect.
Restarting slurmctld did it.
On Tue, Nov 6, 2018 at 7:07 PM Christopher Samuel wrote:
> On 7/11/18 1:57 pm, Brian Andrus wrote:
>
> > Ah. I thought I had set that.
> > So I did and now it is:
> >
We use sssd with realmd
enumeration is off.
Brian Andrus
On 11/8/2018 11:26 AM, Marcin Stolarek wrote:
I have had a very similar issue for quite some time and was unable to find
its root cause. Are you using sssd and AD as a data source with only a
subtree of entries searched? That is my case.
This is an issue in a production environment. We don't want to have to
restart all the slurmctld daemons anytime there is a change to any
associations. That could get painful.
Brian Andrus
to allocate resources: Invalid account or
account/partition combination specified
So now I don't seem to be able to run anything...
On Tue, Nov 6, 2018 at 7:53 PM Christopher Samuel wrote:
> On 7/11/18 2:44 pm, Brian Andrus wrote:
>
> > Ah just scontrol reconfigure doesn't actually
not ideal.
Brian Andrus
On 11/8/2018 1:31 PM, Chris Samuel wrote:
On Friday, 9 November 2018 5:38:22 AM AEDT Brian Andrus wrote:
Where, slurmctld is not picking up new accounts unless it is restarted.
This is usually because slurmdbd cannot connect back to the slurmctld on the
management
slurmctld[54739]: _job_complete: JobId=6 done
Is this something that cannot be done from a system that is outside a
federated cluster?
Brian Andrus
9 10:34 PM, Chris Samuel wrote:
On Tuesday, 5 March 2019 10:07:30 AM PST Brian Andrus wrote:
Does anyone have a process they use to handle empty (aka "Unknown") end
times for jobs that are not running?
What does:
sacctmgr list runawayjobs
say?
ed for
me. I don't know if this is your problem or not. If you choose this
route, be careful and good luck!
On 3/6/19 10:15 AM, Brian Andrus wrote:
It shows several jobs that all have "Unknown" for end_time. Some are
PENDING and some are RUNNING (none are truly in either state).
time keeps growing.
Does anyone have a process they use to handle empty (aka "Unknown") end
times for jobs that are not running?
Brian Andrus
On Wed, Feb 27, 2019 at 10:43 PM Chris Samuel wrote:
> On Tuesday, 26 February 2019 10:03:34 AM PST Brian Andrus wrote:
>
> > On
It seems to me that END should be filled with the time the job failed,
no? Is there a setting or something that can be done to do this? Or a
schema so I could update the table(s) myself for any job with a state of
"FAILED"?
All the Best,
Brian Andrus
If you are using MPI, it should be aware automatically if everything was
compiled with support (e.g., mpirun).
If you are looking to just get the total number of tasks, $SLURM_NTASKS is
probably what you are looking for.
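A minimal sketch inside a batch script:
    #!/bin/bash
    #SBATCH --ntasks=8
    # Slurm exports the task count into the job environment
    echo "allocated ${SLURM_NTASKS} tasks"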
Brian Andrus
On 6/8/2019 2:46 AM, Mahmood Naderan wrote:
Hi,
A genetic program
is passed to bring up at once? ResumeRate is the
default 300.
Brian Andrus
Using slurm 19.05.0-1
MinJobAge is set to 300
MaxJobCount is set to 1
There are only about 30 jobs running. However, when a job completes, it
vanishes immediately from the output of 'squeue'
Shouldn't it be staying there for 5 minutes?
Brian Andrus
Can you give the exact command/output you have from this?
I suspect a typo in your slurm.conf for nodenames or what you are typing.
Brian Andrus
On 6/18/2019 11:29 PM, nathan norton wrote:
Hi,
It just shows
"Node $NODE not found"
Whereas others all work as expected (ie, they a
All,
So I am experiencing great frustrations with the associations and
performance of slurmdbd with a mariadb backend.
A simple example is where I have a user with access to 4 partitions each
with the same 1200 account codes.
I want to retire two of the partitions, but there is no simple
All,
I know the argument passed to ResumeProgram is the node to be started,
but is there any way to access job info from within that script?
In particular, the number of nodes and cores actually requested.
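(One possible workaround, an assumption rather than a documented interface:
query the scheduler from inside the script, since the node list arrives
as $1.)
    #!/bin/bash
    # list job id, node count and CPU count for jobs on the node(s) being resumed
    squeue --nodelist="$1" -o "%i %D %C"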
Brian Andrus
I think you need a pty instead of just running bash...
try:
srun --pty bash
Or get specific on what resources you need, eg:
srun --nodes=1 --exclusive --pty bash
Brian Andrus
On 6/27/2019 2:11 PM, Micael Carvalho wrote:
Hello there,
I am having trouble with arrow keys in srun.
Example
That is because your configuration only lists node0 as the host. You can
only have one slurmctld running at a time, so you can either define
node1 as a backuphost or not bother trying to start slurmctld on it.
Brian Andrus
On 6/28/2019 6:31 AM, Pär Lundö wrote:
Hi all slurm-experts
I don't think that is possible. At least not easily
I just symlink /tmp to /scratch on systems I use.
That way folks can get used to /scratch, but if anything has hard-coded
/tmp, it will still work.
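A minimal sketch of that setup, done once per node outside of Slurm
(assumes /tmp can be safely emptied first):
    mkdir -p /scratch
    rm -rf /tmp
    ln -s /scratch /tmp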
Brian Andrus
On 7/11/2019 8:19 AM, Douglas Duckworth wrote:
Hello
I am wondering
slurm.conf your BackupController and your AccountingStorageBackupHost;
slurmctld and slurmdbd will run on each of those, respectively.
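A minimal sketch in slurm.conf, with hypothetical hostnames:
    ControlMachine=ctl1
    BackupController=ctl2
    AccountingStorageHost=dbd1
    AccountingStorageBackupHost=dbd2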
Brian Andrus
On 7/2/2019 1:48 PM, Tina Fora wrote:
Hi all,
We run mysql on a dedicated machine with slurmctld and slurmdbd running on
another machine. Now I want to add
. May not exceed 65533.
Brian Andrus
On 7/3/2019 2:45 PM, Tina Fora wrote:
Thanks Brian Andrus and Chris Samuel.
I was able to get it to work on our dev setup as primary/backup. Already
had the shared state directory. If I take primary down it takes about two
minutes for slurm commands to work
in megabytes (e.g. "2048"). The
default value is 1.
I would suggest RealMemory=191879, where I suspect you have
RealMemory=196489092.
Brian Andrus
On 7/8/2019 11:59 AM, Robert Kudyba wrote:
I’m new to Slurm and we have a 3 node + head node cluster running
Centos 7 and Bright C
upgrade.. for so many reasons.
Brian Andrus
On 7/8/2019 12:49 PM, Pariksheet Nanda wrote:
Hi SLURM devs,
TL;DR: What magic incantations are needed to preprocess the slurm.spec
file in SLURM 15?
Our cluster is currently running SLURM version 15.08.11. We are
planning some downtime to upgrade
for both.
I tend to build the rpms with a very simple method:
1) yum install munge munge-devel
2) rpmbuild -ta
If there are any special functions you need, ensure you have the -devel
packages for them (eg: openmpi-devel) and slurm will detect that and
include it in the build.
Brian Andrus
On 7
. Clusters are meant to be something that does all the work for
you while you are away (hence the batch concept). You likely want to
look at getting your code to run without human interference and send it
off to do so.
Brian Andrus
On 6/29/2019 7:48 AM, Valerio Bellizzomi wrote:
On Sat, 2019-06-29
trying to do and we may be able to advise
the best way to accomplish it.
Brian Andrus
On 6/29/2019 12:53 AM, Valerio Bellizzomi wrote:
How it gets done normally ?
Yeah, you can't do that in that fashion.
If you want to do that, I'd suggest you put the option in the sbatch
command you use to submit the script so:
sbatch --job-name=`basename $PWD` /path/to/script.sh
Brian Andrus
On 7/28/2019 10:51 PM, Verzelloni Fabio wrote:
Hi Everyone,
I'm
lease... I have
installed 18.08.0, .3, .4 and .8 on the same server and nodes since
Sep of 2018 using the same procedures and never had any issues...
Currently running 18.08.8
Thanks.
Lou
On Thu, Aug 15, 2019 at 3:07 PM Brian Andrus <toomuc...@gmail.com> wrote:
Lou,
Are you installing on the same machine you built?
Are the nvidia libraries installed by RPM or a 'make install' on the box
you compiled it on?
Brian Andrus
On 8/15/2019 7:53 AM, Lou Nicotra wrote:
I have tried running ldconfig manually as suggested with
slurm-19.05.1-2 and it fails
Have you tried adding the dependency at submit time?
sbatch --dependency=singleton fakejob.sh
Brian Andrus
On 8/21/2019 1:51 PM, Jarno van der Kolk wrote:
Hi,
I am helping a researcher who encountered an unexpected behaviour with dependencies. He uses both
"singleton"
-H --depend=afterany:${waitjob} fakejob.sh | sed 's/Submitted batch job //')
done
Of course, if you are actually running the exact same script, I would
recommend using arrays as well.
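For instance, instead of a submit loop (count hypothetical):
    sbatch --array=1-100 fakejob.sh
    # each task sees its own $SLURM_ARRAY_TASK_ID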
Brian Andrus
On 8/22/2019 6:23 AM, Jarno van der Kolk wrote:
Hi Brian,
Thanks for the suggestion
After you restart slurmctld do "scontrol reconfigure"
Brian Andrus
On 8/30/2019 6:57 AM, Robert Kudyba wrote:
I had set RealMemory to a really high number as I mis-interpreted the
recommendation.
NodeName=node[001-003] CoresPerSocket=12 RealMemory=196489092 Sockets=2 Gres=gpu:1
up a proper input file for a script, a single
submission is all it takes. Then you can control how many are currently
running (MaxArrayTask) and can change that to scale up/down.
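A sketch of that throttling; the %N suffix and ArrayTaskThrottle are, I
believe, the knobs being referred to (job id and counts hypothetical):
    sbatch --array=1-100%5 jobscript.sh                # at most 5 tasks run at once
    scontrol update JobId=12345 ArrayTaskThrottle=10   # raise it on a live array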
Brian Andrus
On 8/25/2019 11:12 PM, Guillaume Perrault Archambault wrote:
Hello,
I wrote a regression-testing
Here is where you may want to look into slurmdbd and sacct.
Then you can create a QOS that has MaxJobsPerUser to limit the total
number running on a per-user basis:
https://slurm.schedmd.com/resource_limits.html
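A minimal sketch (QOS name and limit hypothetical):
    sacctmgr add qos peruser
    sacctmgr modify qos peruser set MaxJobsPerUser=10
    # then attach the QOS via a partition's QOS= or the users' associations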
Brian Andrus
On 8/27/2019 9:38 AM, Guillaume Perrault Archambault wrote:
Hi
with
each installation.
Brian Andrus
On 9/5/2019 8:48 AM, Douglas Duckworth wrote:
Hello
We added some newer Epyc nodes, with NVMe scratch, to our cluster and
so want jobs to run on these over others. So we added "Weight=100"
to the older nodes and left the new ones blank. So indee
are in use, you can add weights to the node definitions.
This would mean users could request >192GB memory, so it has to go to
one of the updated nodes, which will only be taken if the other nodes
are used up, or a job needing > 192GB is running on them.
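A sketch of that in slurm.conf (names and sizes hypothetical; lower weight
is allocated first):
    NodeName=old[01-10] RealMemory=192000 Weight=100
    NodeName=new[01-04] RealMemory=384000 Weight=1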
Brian Andrus
On 9/4/2019 9:53 AM
.
However, there are definite use cases that make it worthwhile.
So long as you allocate enough resources for the node (be it the
controller or other) you will be fine.
Brian Andrus
On 9/12/2019 7:23 AM, Jose A wrote:
Dear all,
In the expansion of our Cluster we are considering to install SLURM
Quick question:
When I use sacct to show job stats, it always has a blank entry for the
MaxRSS field. Is there something that needs to be enabled to get that in?
I do see it if I use sstat while the job is running.
Brian Andrus
is used to collect accounting information. Supported values are
> *jobacct_gather/linux* (recommended), *jobacct_gather/cgroup* and
> *jobacct_gather/none* (no information collected).
>
> Antony
>
>
> On Mon, 16 Sep 2019, 14:07 Brian Andrus, wrote:
>
>> Yep, the ma
Hmm. We are only using allocations and have slurm.conf configured with:
AccountingStorageEnforce=associations,nosteps
Are steps required to capture Max RSS?
Brian
On 9/15/2019 1:48 PM, Mark Hahn wrote:
When I use sacct to show job stats, it always has a blank entry for
the MaxRSS field. Is
=18446744073709551614,4=1,5=4
Brian Andrus
On Mon, Sep 16, 2019 at 2:58 PM Brian Andrus wrote:
> I have
> JobAcctGatherType = jobacct_gather/linux
>
> Brian
>
> On Mon, Sep 16, 2019 at 12:40 PM Antony Cleave
The jobs have definitely completed when I try to gather the info.
Brian
On 9/15/2019 4:01 PM, Steven Dick wrote:
I don't think it shows up until the job completes.
On Sat, Sep 14, 2019 at 2:25 AM Brian Andrus wrote:
Quick question?
When I use sacct to show job stats, it always has a blank
, Christopher Samuel wrote:
On 9/15/19 4:17 PM, Brian Andrus wrote:
Are steps required to capture Max RSS?
No, you should see a MaxRSS reported for the batch step, for instance:
$ sacct -j $JOBID -o jobid,jobname,maxrss
All the best,
Chris
s includes
PPR, where the pattern would be terminated by another colon to
separate it from the modifiers.
So adding "--map-by node" would give you what you are looking for.
Of course, this syntax is for OpenMPI's mpirun command, so YMMV.
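For example (task count and binary hypothetical):
    mpirun --map-by node -np 8 ./my_app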
Brian Andrus
On 7/30/2019 5:14 AM, CB
]) for JobId=52545
I suspect this is in the saved state directory and if I were to down the
entire cluster and delete those files, it would clear up, but I
prefer not to have to down the cluster...
Is there a way to clean up "phantom" nodes and partitions that were deleted?
Brian Andrus
The jobs themselves no longer exist. They had completed before I deleted
the partition, which is odd to me.
I may have done 'reconfigure' before restarting slurmctld; it was a while
ago, so I don't recall.
Brian Andrus
On 7/26/2019 8:10 PM, Chris Samuel wrote:
On 26/7/19 8:28 am, Jeffrey
Lyn,
That was it, thanks!
sacct -o reserved
Brian
On 9/21/2019 9:26 AM, Lyn Gerner wrote:
Hey Brian,
I think the discussion was in the context of suspend/resume,
and it was the Reserved value that effectively represents that time.
Regards,
Lyn
On Sat, Sep 21, 2019 at 9:15 AM Brian Andrus
There was a command shared at the SLUG that showed how long it took a
node to go from a power_down (idle~) state to up and having a job
running on it, but I cannot remember what it was.
Does anyone recall that?
Brian Andrus
Except sstat can give you the MaxRSS without having cgroups, and it will
give you a simple MaxRSS, whereas sacct provides a MaxRSS for every
step... I have to play with that data to get the high-water mark, grrr.
I had tried to use sstat in an epilogue, but apparently that is too late...
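A sketch for squeezing one high-water mark out of the per-step values
(job id hypothetical):
    sacct -j 12345 --noheader -o MaxRSS | sort -h | tail -1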
Brian
ckages except pmix-devel. Haven't figured that one yet.
Brian Andrus
On 10/30/2019 11:18 AM, Christopher Benjamin Coffey wrote:
Yes, I'd be interested too.
Best,
Chris
t actually
sharing homes could be the cause.
Brian Andrus
On 11/17/2019 11:24 AM, Yann Bouteiller wrote:
Hello,
I am trying to do this on computecanada, which is managed by slurm:
https://ray.readthedocs.io/en/latest/deploying-on-slurm.html
However, on computecanada, you cannot inst
, I get back 41 groups I am in.
Bug?
Brian Andrus
Quick question:
Is the epilogue script run if a job exceeds its time limit and is being
cancelled?
What about a job that is just cancelled?
I need to be able to clean up some job-specific files regardless of how
the job ends and I'm not sure epilogue is sufficient.
Brian Andrus
s have had the same issue and even added comments to the
bugs, but no responses/resolution for this have been posted.
FWIW, I also see the issue with the latest slurm 20.05 pre1 code.
Brian Andrus
On 12/5/2019 11:46 PM, von St. Vieth, Benedikt wrote:
Hi again,
I answered this question on Oct 2
Tim claims it works...
I have compiled it, but when you try to run slurmd, it throws some
errors and will not start. From a previous thread:
While I can successfully build/run slurmctld, slurmd is failing because ALL
of the SelectType libraries are missing symbols.
Example from
crickets. I think in our case we were not able to ensure that the
epilog always ran for different types of job failures, so we just had
the users add some more cleanup code to the end of their jobs _and_
also run separate cleanup jobs.
Regards,
Alex
On Wed, Dec 4, 2019 at 7:29 PM Brian Andrus
depends on what best suits the specific needs.
Brian Andrus
On 12/16/2019 2:29 PM, Ransom, Geoffrey M. wrote:
Hello
I am looking into switching from Univa (sge) to slurm and am
figuring out how to implement some of our usage policy in slurm.
We have a Univa queue which uses job classes
a cleanup script run on jobs that
have timed out?
Brian Andrus
You prompted me to dig even deeper into my epilog. I was trying to
access a semaphore file in the user's home directory.
It seems that when the epilogue is run, the ~ is not expanded in any way,
so I can't even use ~${SLURM_JOB_USER} to access their semaphore file.
Potentially problematic for
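(A possible workaround: resolve the home directory explicitly instead of
relying on ~ expansion; the semaphore file name is hypothetical.)
    HOMEDIR=$(getent passwd "${SLURM_JOB_USER}" | cut -d: -f6)
    rm -f "${HOMEDIR}/job.semaphore"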
-1.el8.x86_64.rpm
slurm-slurmdbd-19.05.3-1.el8.x86_64.rpm
slurm-torque-19.05.3-1.el8.x86_64.rpm
Brian Andrus
On 10/28/2019 2:32 AM, Benjamin Redling wrote:
On 28/10/2019 08.26, Bjørn-Helge Mevik wrote:
Taras Shapovalov writes:
Do I understand correctly that Slurm19 is not compatible
:34 2019-10-01T00:00:44 00:00:10
Brian Andrus
handling
until they have it as part of their app.
Brian Andrus
On 10/14/2019 4:40 AM, Oytun Peksel wrote:
It is quite weird if slurm has no mechanism as described. I have been
digging more into it and someone suggested a workaround using mail
notifications. You use a script instead of the mail
tun Peksel
oytun.pek...@semcon.com
+46739205917
From: slurm-users On Behalf Of Brian Andrus
Sent: 15 October 2019 20:58
To: slurm-users@lists.schedmd.com
Subject: Re: [slurm-users] Execute scripts on su
IIRC, the big difference is if you want to use cgroups on the nodes. You
must use the cgroup plugin.
Brian Andrus
On 10/24/2019 3:54 PM, Christopher Benjamin Coffey wrote:
Hi Juergen,
From what I see so far, there is nothing missing from the jobacct_gather/linux
plugin vs the cgroup
I prefer building packages.
I did have to extract and change the .spec file to accommodate some of
the changes as well as set up the environment to complete the build.
Brian
On 10/29/2019 8:11 AM, Christopher Benjamin Coffey wrote:
Brian, I've actually just started attempting to build slurm 19 on
/libslurmfull.so | grep powercap_
0010f7b8 T slurm_free_powercap_info_msg
00060060 T slurm_print_powercap_info_msg
So, sure enough powercap_get_cluster_current_cap is not in there.
Methinks the linking needs to be examined.
Brian Andrus
On 10/28/2019 2:32 AM, Benjamin Redling
.
Brian Andrus
On 10/18/2019 1:03 PM, bbenede...@goodyear.com wrote:
Greetings!
I am trying to set up a partition that will only allow one job at a time to
run, regardless of who submits it.
So multiple jobs from multiple users can be in the queue. But I only want the
partition to run one
/openmpi), which forces only one
version to be able to be loaded. I also set paths so specific versions
of libraries become available depending on what environment you select
(gcc vs intel for example).
Is there something besides versioning that lmod shines at?
Brian Andrus
On 11/24/2019 12:48 AM
server you use.
The best solution, of course, is to educate the users.
You could create a job_submit plugin that removes mail options for
arrays, but you may negatively impact users that do need that.
Brian Andrus
On 11/25/2019 10:55 PM, ichebo...@univ.haifa.ac.il wrote:
I meant on the admin
FAIL apply to a job array
as a whole rather than generating individual email messages for each
task in the job array.
Brian Andrus
On 11/25/2019 1:48 AM, ichebo...@univ.haifa.ac.il wrote:
Hi,
I would like to ask if there is some options to configure the e-mail
notification of slurm job
Are you specifying memory for each of the jobs?
Can't run a small job if there isn't enough memory available for it.
Brian Andrus
On 11/1/2019 7:42 AM, c b wrote:
I have:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory
On Fri, Nov 1, 2019 at 10:39 AM Mark Hahn <mailt
You are trying to specifically run on node cn110, so you may want to
check that out with sinfo.
A quick "sinfo -R" can list any down machines and the reasons.
Brian Andrus
On 11/10/2019 11:23 PM, Sukman wrote:
Hi Brian,
I see. Thank you for your suggestion.
I definitely will try i
that are idle~ but no calls to the script.
If I restart slurmctld, the backlog starts running and things work.
Any ideas what could cause this?
Brian Andrus
So it seems nss_slurm does not play well with sudo.
If I connect to a box that uses it and try to use sudo, I get:
sudo: PAM account management error: Authentication service cannot retrieve
authentication info
Has anyone else seen this?
Is there a workaround?
Brian Andrus
Bright is not needed... for much of anything...
On 2/25/2020 12:48 PM, Robert Kudyba wrote:
I suppose I can ask Bright Computing but does anyone know what version
of Bright is needed? I would guess 8.2 or 9.0. Definitely want to dive
into this.
on that are.
Brian Andrus
I would say so.
Certainly, if you have many nodes and/or many jobs being submitted, you
will see an impact, but in my experience comparing Slurm to SGE, Slurm
has much less overhead and causes much less impact.
Brian Andrus
On 2/26/2020 1:05 PM, Joshua Baker-LePain wrote:
On Wed, 26 Feb 2020
easy to do. Just add the lines to your slurm.conf for the
backup controller, start it up and reconfigure for all running nodes to
be aware of it.
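A minimal sketch with hypothetical hostnames (the SlurmctldHost form;
older configs use ControlMachine/BackupController instead):
    SlurmctldHost=ctl1
    SlurmctldHost=ctl2   # the second entry acts as the backup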
Brian Andrus
On 2/26/2020 12:48 PM, Joshua Baker-LePain wrote:
We're planning the migration of our moderately sized cluster (~400
nodes, 40K jobs
You're trying to run bash, which, without special configuration, needs a pty.
Try
srun -v -p debug --pty bash
Brian Andrus
On 2/6/2020 10:28 PM, Hector Yuen wrote:
Hello,
I am setting up a very simple configuration: one node running slurmd
and another one running slurmctld.
In the slurmctld
Usually means you updated the slurm.conf but have not done "scontrol
reconfigure" yet.
Brian Andrus
On 2/10/2020 8:55 AM, Robert Kudyba wrote:
We are using Bright Cluster 8.1 with and just upgraded to slurm-17.11.12.
We're getting the below errors when I restart the slurmct
ster generically, so
their configs are not getting matched to the specific info in your main
config
Brian Andrus
On 1/20/2020 10:37 AM, Robert Kudyba wrote:
I've posted about this previously here
<https://groups.google.com/forum/#!searchin/slurm-users/kudyba%7Csort:date/slurm-users/mMECjerUmFE/V
Check the slurmd log file on the node.
Ensure slurmd is still running. Sounds possible that the OOM killer or such
may be killing slurmd.
Brian Andrus
On 1/20/2020 1:12 PM, Dean Schulze wrote:
If I restart slurmd the asterisk goes away. Then I can run the job
once and the asterisk is back
I think we would need to see your SuspendScript to get a better idea of
what is happening.
That error indicates the nodes are likely not running slurmd and the
control daemon thinks they are still up.
What is the output of 'sinfo -R'?
Brian Andrus
On 1/7/2020 3:42 AM, Steve Brasier wrote
. It could probably be worked around, but
not in a simple way. Easier to upgrade to the newest release :)
Brian Andrus
On 3/9/2020 10:14 AM, MrBr @ GMail wrote:
Hi Brian
The nodes work with slurm without any issues till I try the "--reboot"
option.
I can successfully allocate the no
the
next uid on any node.
The error below looks like you may have a different uid for the slurm
user on the node. What uid is slurmd running as on the bad node vs a
good node?
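For example, run this on both a good node and the bad one and compare:
    id slurm                      # uid/gid of the slurm user
    ps -C slurmd -o user=,uid=    # what slurmd is actually running as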
Brian Andrus
On 4/17/2020 2:38 PM, Dean Schulze wrote:
Just noticed this. On the problem node the munged.log file
For CentOS/RHEL, it is in the OpenFusion repo:
http://repo.openfusion.net/centos7-x86_64/
just
yum install
http://repo.openfusion.net/centos7-x86_64/openfusion-release-0.7-1.of.el7.noarch.rpm
then
yum install libjwt-devel
Brian Andrus
On 4/18/2020 2:27 PM, Daniel Letai wrote
Maybe too obvious, but have you checked your .bashrc, .bash_profile and
such?
Brian Andrus
On 5/12/2020 10:27 AM, Ellestad, Erik wrote:
Which SLURM prolog specifically?
I’m not finding that to work for me in either task-prolog or prolog.
SLURM_TMPDIR and TMPDIR are still both set to /tmp
' from the node and verify it is
able to talk to slurmctld, and that slurmd started
successfully.
Brian Andrus
On 3/9/2020 4:38 AM, MrBr @ GMail wrote:
Hi all
I'm trying to use the --reboot option of srun to reboot the nodes
before allocation.
However the nodes not been
normal users cannot use "--reboot"
Brian Andrus
On 3/9/2020 10:14 AM, MrBr @ GMail wrote:
Hi Brian
The nodes work with slurm without any issues till I try the "--reboot"
option.
I can successfully allocate the nodes or any other slurm related operation
> You may want to dou
both. I do high debug to the journal and info to the log file.
Brian Andrus
On 9/8/2020 2:41 AM, Gestió Servidors wrote:
Hello,
I don't know why, but my SLURM server (that is running fine) has its
slurmctld.log file with size 0 bytes... so... where is it writing logs?
It seems that log file has
do you have your gres.conf on the nodes also?
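A minimal gres.conf sketch for a node with two GPUs (device paths
hypothetical):
    Name=gpu File=/dev/nvidia0
    Name=gpu File=/dev/nvidia1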
Brian Andrus
On 10/8/2020 11:57 AM, Sajesh Singh wrote:
Slurm 18.08
CentOS 7.7.1908
I have 2 M500 GPUs in a compute node which is defined in the
slurm.conf and gres.conf of the cluster, but if I launch a job
requesting GPUs the environment
they will wait a relatively shorter amount of time. There are
numerous other factors you can use. If you have accounting and
associations configured, you can manipulate it all the way to the
association and qos.
Brian Andrus
On 8/17/2020 11:23 PM, Gerhard Strangar wrote:
Brian Andrus wrote:
Most likely, b