Re: [slurm-users] MIG-Slice: Unavailable GRES

2023-07-19 Thread Groner, Rob
At some point when we were experimenting with MIG, I was entirely 
frustrated trying to get it to work, until I finally removed the autodetect from 
gres.conf and explicitly listed the devices instead.  THEN it worked.  I think 
you can find the list of device files using nvidia-smi.

Here is the entry we use in our gres.conf for one of the nodes:

NodeName=p-gc-3037 Name=gpu Type=1g.5gb 
File=/dev/nvidia-caps/nvidia-cap[66,75,84,102,111,120,129,201,210,219,228,237,246,255]
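
If I remember right, something along these lines helps map the MIG instances to 
those /dev/nvidia-caps files (treat it as a sketch - the exact paths can differ 
between driver versions):

nvidia-smi -L                              # lists GPUs and their MIG device UUIDs
cat /proc/driver/nvidia-caps/mig-minors    # maps GPU/compute instances to cap minor numbers
ls /dev/nvidia-caps/                       # the device files those minors correspond to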

Something to TRY anyway.  Odd that 3g.20gb works.  You might try reconfiguring 
the node for that instead and see if it works then.  We've used 3g.20gb and 
1g.5gb on our nodes and it works fine, never tried 2g.10gb.
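
If you ever want to change the slice layout itself, nvidia-smi can list and 
recreate the profiles; roughly (check the profile names/IDs with -lgip first, 
the combination below is only an example):

nvidia-smi mig -lgip                               # list available GPU instance profiles
nvidia-smi mig -dci && nvidia-smi mig -dgi         # destroy existing compute and GPU instances
nvidia-smi mig -cgi 3g.20gb,1g.5gb,1g.5gb -C       # e.g. one 3g.20gb plus two 1g.5gb, with compute instances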

Rob



From: slurm-users on behalf of Vogt, Timon
Sent: Wednesday, July 19, 2023 3:08 PM
To: slurm-us...@schedmd.com
Subject: [slurm-users] MIG-Slice: Unavailable GRES

Dear Slurm Mailing List,

I am experiencing a problem which affects our cluster and for which I am
completely out of ideas by now, so I would like to ask the community for
hints or ideas.

We run a partition on our cluster containing multiple nodes with Nvidia
A100 GPUs (40GB), which we have sliced up using Nvidia Multi-Instance
GPUs (MIG) into one 3g.20gb slice and two 2g.10gb slices per GPU.

Now, when submitting a job to it and requesting the 3g.20gb slice (like
with "srun -p mig-partition -G 3g.20gb:1 hostname"), the job runs fine,
but when a job requests one of the 2g.10gb slices instead (like with
"srun -p mig-partition -G 2g.10gb:1 hostname"), the job does not get
scheduled and the controller repeatedly outputs the error:

slurmctld[28945]: error: _set_job_bits1: job 4780824 failed to find any
available GRES on node 1471
slurmctld[28945]: error: gres_select_filter_select_and_set job 4780824
failed to satisfy gres-per-job counter

Our cluster uses the AutoDetect=nvml feature for the nodes in the
gres.conf and both slice types are defined in "AccountingStorageTRES"
and in the GRES parameter of the node definition. The slurmd on the node
also finds both types of slices and reports the correct amounts. They
are also visible in the "Gres=" section of "scontrol show node", again
in correct amounts.

I have also ensured that the nodes are not used otherwise by creating a
reservation on them accessible only to me, and I have restarted all
slurmd's and the slurmctld.

By now, I am out of ideas. Does someone here have a suggestion on what
else I can try? Has someone already seen this error and knows more about it?

Thank you very much in advance and
best regards,
Timon

--
Timon Vogt
Working Group "Computing"
Nationales Hochleistungsrechnen (NHR)
Scientific Employee NHR
Tel.: +49 551 39-30146, E-Mail: timon.v...@gwdg.de
-
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Burckhardtweg 4, 37077 Göttingen, URL: https://gwdg.de

Support: Tel.: +49 551 39-3, URL: https://gwdg.de/support
Secretariat: Tel.: +49 551 39-30001, E-Mail: g...@gwdg.de

Managing Director: Prof. Dr. Ramin Yahyapour
Chairman of the Supervisory Board: Prof. Dr. Christian Griesinger
Registered office: Göttingen
Register court: Göttingen, commercial register no. B 598

Certified according to ISO 9001 and ISO 27001
-



[slurm-users] MIG-Slice: Unavailable GRES

2023-07-19 Thread Vogt, Timon

Dear Slurm Mailing List,

I am experiencing a problem which affects our cluster and for which I am 
completely out of ideas by now, so I would like to ask the community for 
hints or ideas.


We run a partition on our cluster containing multiple nodes with Nvidia 
A100 GPUs (40GB), which we have sliced up using Nvidia Multi-Instance 
GPUs (MIG) into one 3g.20gb slice and two 2g.10gb slices per GPU.


Now, when submitting a job to it and requesting the 3g.20gb slice (like 
with "srun -p mig-partition -G 3g.20gb:1 hostname"), the job runs fine, 
but when a job requests one of the 2g.10gb slices instead (like with 
"srun -p mig-partition -G 2g.10gb:1 hostname"), the job does not get 
scheduled and the controller repeatedly outputs the error:


slurmctld[28945]: error: _set_job_bits1: job 4780824 failed to find any 
available GRES on node 1471
slurmctld[28945]: error: gres_select_filter_select_and_set job 4780824 
failed to satisfy gres-per-job counter


Our cluster uses the AutoDetect=nvml feature for the nodes in the 
gres.conf and both slice types are defined in "AccountingStorageTRES" 
and in the GRES parameter of the node definition. The slurmd on the node 
also finds both types of slices and reports the correct amounts. They 
are also visible in the "Gres=" section of "scontrol show node", again 
in correct amounts.
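
For concreteness, the relevant configuration looks roughly like this (node name 
and counts below are placeholders rather than our literal config):

# slurm.conf
AccountingStorageTRES=gres/gpu,gres/gpu:3g.20gb,gres/gpu:2g.10gb
NodeName=node1471 Gres=gpu:3g.20gb:4,gpu:2g.10gb:8 ...
# gres.conf
AutoDetect=nvml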


I have also ensured that the nodes are not used otherwise by creating a 
reservation on them accessible only to me, and I have restarted all 
slurmd's and the slurmctld.


By now, I am out of ideas. Does someone here have a suggestion on what 
else I can try? Has someone already seen this error and knows more about it?


Thank you very much in advance and
best regards,
Timon

--
Timon Vogt
Working Group "Computing"
Nationales Hochleistungsrechnen (NHR)
Scientific Employee NHR
Tel.: +49 551 39-30146, E-Mail: timon.v...@gwdg.de
-
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
Burckhardtweg 4, 37077 Göttingen, URL: https://gwdg.de

Support: Tel.: +49 551 39-3, URL: https://gwdg.de/support
Secretariat: Tel.: +49 551 39-30001, E-Mail: g...@gwdg.de

Managing Director: Prof. Dr. Ramin Yahyapour
Chairman of the Supervisory Board: Prof. Dr. Christian Griesinger
Registered office: Göttingen
Register court: Göttingen, commercial register no. B 598

Certified according to ISO 9001 and ISO 27001
-





Re: [slurm-users] Unconfigured GPUs being allocated

2023-07-19 Thread Wilson, Steven M
I found that this is actually a known bug in Slurm, so I'll note it here in case 
anyone comes across this thread in the future:
  https://bugs.schedmd.com/show_bug.cgi?id=10598

Steve

From: slurm-users  on behalf of Wilson, 
Steven M 
Sent: Tuesday, July 18, 2023 5:32 PM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Unconfigured GPUs being allocated

Further testing and looking at the source code confirms what looks to me like a 
bug in Slurm. GPUs that are not configured in gres.conf are detected by slurmd 
in the system and discarded since they aren't found in gres.conf. That's fine 
except they should also be hidden through cgroup control so that they aren't 
visible along with allocated GPUs when a job is run. Slurm assumes that the job 
can only see the GPUs allocated to it and sets $CUDA_VISIBLE_DEVICES 
accordingly. Unfortunately, the job actually sees the allocated GPUs plus any 
unconfigured GPUs, so $CUDA_VISIBLE_DEVICES may or may not happen to correspond 
to the GPU(s) allocated by Slurm.

I was hoping that I could write a Prolog script that would adjust 
$CUDA_VISIBLE_DEVICES to remove any unconfigured GPUs, but any changes using 
"export CUDA_VISIBLE_DEVICES=..." don't seem to have any effect on the actual 
environment of the job.
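
Perhaps a TaskProlog (rather than the job Prolog) would work, since any line it 
prints of the form "export NAME=value" is applied to the task's environment. 
Something like this untested sketch, where the list of unmanaged GPU indices is 
purely illustrative:

#!/bin/bash
# TaskProlog sketch: drop any GPU index that Slurm does not manage.
# UNMANAGED is a made-up example; it would have to match the display-only GPU(s).
UNMANAGED="2"
keep=""
IFS=',' read -ra devs <<< "$CUDA_VISIBLE_DEVICES"
for d in "${devs[@]}"; do
    [[ ",$UNMANAGED," == *",$d,"* ]] || keep="${keep:+$keep,}$d"
done
echo "export CUDA_VISIBLE_DEVICES=$keep"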

Steve


From: Wilson, Steven M 
Sent: Friday, July 14, 2023 4:10 PM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Unconfigured GPUs being allocated

It's not so much whether a job may or may not access the GPU but rather which 
GPU(s) is(are) included in $CUDA_VISIBLE_DEVICES. That is what controls what 
our CUDA jobs can see and therefore use (within any cgroups constraints, of 
course). In my case, Slurm is sometimes setting $CUDA_VISIBLE_DEVICES to a GPU 
that is not in the Slurm configuration because it is intended only for driving 
the display and not GPU computations.

Thanks for your thoughts!

Steve

From: slurm-users  on behalf of 
Christopher Samuel 
Sent: Friday, July 14, 2023 1:57 PM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] Unconfigured GPUs being allocated

On 7/14/23 10:20 am, Wilson, Steven M wrote:

> I upgraded Slurm to 23.02.3 but I'm still running into the same problem.
> Unconfigured GPUs (those absent from gres.conf and slurm.conf) are still
> being made available to jobs so we end up with compute jobs being run on
> GPUs which should only be used

I think this is expected - it's not that Slurm is making them available,
it's that it's unaware of them and so doesn't control them in the way it
does for the GPUs it does know about. So you get the default behaviour
(any process can access them).

If you want to stop them being accessed from Slurm you'd need to find a
way to prevent that access via cgroups games or similar.

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




Re: [slurm-users] GRES and GPUs

2023-07-19 Thread Xaver Stiensmeier

Hi Hermann,

"Count" doesn't make a difference, but I noticed that when I reconfigure
Slurm and do reloads afterwards, the error "gpu count lower than
configured" no longer appears - so maybe a reconfigure is simply needed
after reloading slurmctld, or maybe it doesn't show the error anymore
because the node is still invalid? However, I still get the error:

    error: _slurm_rpc_node_registration node=NName: Invalid argument

If I understand correctly, this is telling me that there's something
wrong with my slurm.conf. I know that all pre-existing parameters are
correct, so I assume it must be the gpus entry, but I don't see where
it's wrong:

   NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000
   Gres=gpu:1 State=CLOUD # bibiserv

Thanks for all the help,
Xaver

On 19.07.23 15:04, Hermann Schwärzler wrote:

Hi Xaver,

I think you are missing the "Count=..." part in gres.conf

It should read

NodeName=NName Name=gpu File=/dev/tty0 Count=1

in your case.

Regards,
Hermann

On 7/19/23 14:19, Xaver Stiensmeier wrote:

Okay,

thanks to S. Zhang I was able to figure out why nothing changed.
While I did restart slurmctld at the beginning of my tests, I didn't
do so later, because I felt like it was unnecessary, but it is right
there in the fourth line of the log that this is needed. Somehow I
misread it and thought it automatically restarted slurmctld.

Given the setup:

slurm.conf
...
GresTypes=gpu
NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000
GRES=gpu:1 State=UNKNOWN
...

gres.conf
NodeName=NName Name=gpu File=/dev/tty0

When restarting, I get the following error:

error: Setting node NName state to INVAL with reason:gres/gpu count
reported lower than configured (0 < 1)

So it is still not working, but at least I get a more helpful log
message. Because I know that this /dev/tty trick works, I am still
unsure where the current error lies, but I will try to investigate it
further. I am thankful for any ideas in that regard.

Best regards,
Xaver

On 19.07.23 10:23, Xaver Stiensmeier wrote:


Alright,

I tried a few more things, but I still wasn't able to get past:
srun: error: Unable to allocate resources: Invalid generic resource
(gres) specification.

I should mention that the node I am trying to test GPU with, doesn't
really have a gpu, but Rob was so kind to find out that you do not
need a gpu as long as you just link to a file in /dev/ in the
gres.conf. As mentioned: This is just for testing purposes - in the
end we will run this on a node with a gpu, but it is not available
at the moment.

*The error isn't changing*

If I omit "GresTypes=gpu" and "Gres=gpu:1", I still get the same
error.

*Debug Info*

I added the gpu debug flag and logged the following:

[2023-07-18T14:59:45.026] restoring original state of nodes
[2023-07-18T14:59:45.026] select/cons_tres: part_data_create_array:
select/cons_tres: preparing for 2 partitions
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to
gpu ignored
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to
change GresPlugins
[2023-07-18T14:59:45.026] read_slurm_conf: backup_controller not
specified
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to
gpu ignored
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to
change GresPlugins
[2023-07-18T14:59:45.026] select/cons_tres: select_p_reconfigure:
select/cons_tres: reconfigure
[2023-07-18T14:59:45.027] select/cons_tres: part_data_create_array:
select/cons_tres: preparing for 2 partitions
[2023-07-18T14:59:45.027] No parameter for mcs plugin, default
values set
[2023-07-18T14:59:45.027] mcs: MCSParameters = (null). ondemand set.
[2023-07-18T14:59:45.028] _slurm_rpc_reconfigure_controller:
completed usec=5898
[2023-07-18T14:59:45.952]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2

I am a bit unsure what to do next to further investigate this issue.

Best regards,
Xaver

On 17.07.23 15:57, Groner, Rob wrote:

That would certainly do it.  If you look at the slurmctld log when
it comes up, it will say that it's marking that node as invalid
because it has fewer (0) gres resources than you say it should 
have.  That's because slurmd on that node will come up and say
"What gres resources??"

For testing purposes,  you can just create a dummy file on the
node, then in gres.conf, point to that file as the "graphics file"
interface.  As long as you don't try to actually use it as a
graphics file, that should be enough for that node to think it has
gres/gpu resources. That's what I do in my vagrant slurm cluster.
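
For example (the path and node name here are arbitrary, just for illustration):

# on the GPU-less test node
touch /etc/slurm/fake-gpu
# gres.conf on that node
NodeName=NName Name=gpu File=/etc/slurm/fake-gpu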

Rob



*From:* slurm-users  on
behalf of Xaver Stiensmeier 
*Sent:* Monday, July 17, 2023 9:43 AM
*To:* slurm-users@lists.schedmd.com 
*Subject:* Re: [slurm-users] GRES and GPUs
Hi Hermann,

Good idea, but we are already using `SelectType=select/cons_tres`.

[slurm-users] MCNP6.2 test

2023-07-19 Thread Ozeryan, Vladimir
Hello everyone,

Has anyone here ever run an MCNP6.2 parallel job via the Slurm scheduler?
I am looking for a simple test job to test my software compilation.

Thank you,

Vlad Ozeryan


Re: [slurm-users] configure script can't find nvml.h or libnvidia-ml.so

2023-07-19 Thread Timo Rothenpieler

On 19/07/2023 15:04, Jan Andersen wrote:
Hmm, OK - but that is the only nvml.h I can find, as shown by the find 
command. I downloaded the official NVIDIA-Linux-x86_64-535.54.03.run and 
ran it successfully; do I need to install something else besides? A 
Google search for 'CUDA SDK' leads directly to NVIDIA's page: 
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html




Yes, I'm pretty sure it's part of the CUDA SDK.

And be careful with running the .run installers from Nvidia.
They bypass the package manager and can badly clash with system 
packages, making recovery complicated.

Always prefer system packages for the drivers and SDKs.
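
For example, on a Debian/Ubuntu box with NVIDIA's CUDA repository set up, 
something along these lines should get you the header and library (the package 
name, version and install prefix below are an assumption, not exact):

apt-get install cuda-nvml-dev-12-2
./configure --with-nvml=/usr/local/cuda-12.2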



Re: [slurm-users] Notify users about job submit plugin actions

2023-07-19 Thread Jeffrey T Frey
In case you're developing the plugin in C and not LUA, behind the scenes the 
LUA mechanism is concatenating all log_user() strings into a single variable 
(user_msg).  When the LUA code completes, the C code sets the *err_msg argument 
to the job_submit()/job_modify() function to that string, then NULLs-out 
user_msg.  (There's a mutex around all of that code so slurmctld never executes 
LUA job submit/modify scripts concurrently.)  The slurmctld then communicates 
that returned string back to sbatch/salloc/srun for display to the user.


Your C plugin would do likewise — set *err_msg before returning from 
job_submit()/job_modify() — and needn't be mutex'ed if the code is reentrant.






> On Jul 19, 2023, at 08:37, Angel de Vicente  wrote:
> 
> Hello Lorenzo,
> 
> Lorenzo Bosio  writes:
> 
>> I'm developing a job submit plugin to check if some conditions are met 
>> before a job runs.
>> I'd need a way to notify the user about the plugin actions (i.e. why their 
>> job was killed and what to do), but after a lot of research I could only 
>> write to logs and not the user shell.
>> The user gets the output of slurm_kill_job but I can't find a way to add a 
>> custom note.
>> 
>> Can anyone point me to the right api/function in the code?
> 
> In our "job_submit.lua" script we have the following for that purpose:
> 
> ,
> |   slurm.log_user("%s: WARNING: [...]", log_prefix)
> `
> 
> -- 
> Ángel de Vicente
> Research Software Engineer (Supercomputing and BigData)
> Tel.: +34 922-605-747
> Web.: http://research.iac.es/proyecto/polmag/
> 
> GPG: 0x8BDC390B69033F52




Re: [slurm-users] GRES and GPUs

2023-07-19 Thread Groner, Rob
Worth a try, but the documentation says that by default the count is the same 
as the number of files specified... so it should automatically be 1.

If you want to stop the node from going to INVAL, you can always set 
config_overrides in slurm.conf.  That will tell the node what it has, instead 
of what it thinks it has.  Useful for testing.
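
If I remember right it lives under SlurmdParameters, i.e. something like this in 
slurm.conf:

SlurmdParameters=config_overrides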

Rob


From: slurm-users  on behalf of Hermann 
Schwärzler 
Sent: Wednesday, July 19, 2023 9:04 AM
To: slurm-users@lists.schedmd.com 
Subject: Re: [slurm-users] GRES and GPUs

Hi Xaver,

I think you are missing the "Count=..." part in gres.conf

It should read

NodeName=NName Name=gpu File=/dev/tty0 Count=1

in your case.

Regards,
Hermann

On 7/19/23 14:19, Xaver Stiensmeier wrote:
> Okay,
>
> thanks to S. Zhang I was able to figure out why nothing changed. While I
> did restart slurmctld at the beginning of my tests, I didn't do so
> later, because I felt like it was unnecessary, but it is right there in
> the fourth line of the log that this is needed. Somehow I misread it and
> thought it automatically restarted slurmctld.
>
> Given the setup:
>
> slurm.conf
> ...
> GresTypes=gpu
> NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000
> GRES=gpu:1 State=UNKNOWN
> ...
>
> gres.conf
> NodeName=NName Name=gpu File=/dev/tty0
>
> When restarting, I get the following error:
>
> error: Setting node NName state to INVAL with reason:gres/gpu count
> reported lower than configured (0 < 1)
>
> So it is still not working, but at least I get a more helpful log
> message. Because I know that this /dev/tty trick works, I am still
> unsure where the current error lies, but I will try to investigate it
> further. I am thankful for any ideas in that regard.
>
> Best regards,
> Xaver
>
> On 19.07.23 10:23, Xaver Stiensmeier wrote:
>>
>> Alright,
>>
>> I tried a few more things, but I still wasn't able to get past: srun:
>> error: Unable to allocate resources: Invalid generic resource (gres)
>> specification.
>>
>> I should mention that the node I am trying to test GPU with, doesn't
>> really have a gpu, but Rob was so kind to find out that you do not
>> need a gpu as long as you just link to a file in /dev/ in the
>> gres.conf. As mentioned: This is just for testing purposes - in the
>> end we will run this on a node with a gpu, but it is not available at
>> the moment.
>>
>> *The error isn't changing*
>>
>> If I omit "GresTypes=gpu" and "Gres=gpu:1", I still get the same error.
>>
>> *Debug Info*
>>
>> I added the gpu debug flag and logged the following:
>>
>> [2023-07-18T14:59:45.026] restoring original state of nodes
>> [2023-07-18T14:59:45.026] select/cons_tres: part_data_create_array:
>> select/cons_tres: preparing for 2 partitions
>> [2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to
>> gpu ignored
>> [2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to
>> change GresPlugins
>> [2023-07-18T14:59:45.026] read_slurm_conf: backup_controller not specified
>> [2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to
>> gpu ignored
>> [2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to
>> change GresPlugins
>> [2023-07-18T14:59:45.026] select/cons_tres: select_p_reconfigure:
>> select/cons_tres: reconfigure
>> [2023-07-18T14:59:45.027] select/cons_tres: part_data_create_array:
>> select/cons_tres: preparing for 2 partitions
>> [2023-07-18T14:59:45.027] No parameter for mcs plugin, default values set
>> [2023-07-18T14:59:45.027] mcs: MCSParameters = (null). ondemand set.
>> [2023-07-18T14:59:45.028] _slurm_rpc_reconfigure_controller: completed
>> usec=5898
>> [2023-07-18T14:59:45.952]
>> SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2
>>
>> I am a bit unsure what to do next to further investigate this issue.
>>
>> Best regards,
>> Xaver
>>
>> On 17.07.23 15:57, Groner, Rob wrote:
>>> That would certainly do it.  If you look at the slurmctld log when it
>>> comes up, it will say that it's marking that node as invalid because
>>> it has fewer (0) gres resources than you say it should have.  That's
>>> because slurmd on that node will come up and say "What gres resources??"
>>>
>>> For testing purposes,  you can just create a dummy file on the node,
>>> then in gres.conf, point to that file as the "graphics file"
>>> interface.  As long as you don't try to actually use it as a graphics
>>> file, that should be enough for that node to think it has gres/gpu
>>> resources. That's what I do in my vagrant slurm cluster.
>>>
>>> Rob
>>>
>>> 
>>> *From:* slurm-users  on behalf
>>> of Xaver Stiensmeier 
>>> *Sent:* Monday, July 17, 2023 9:43 AM
>>> *To:* slurm-users@lists.schedmd.com 
>>> *Subject:* Re: [slurm-users] GRES and GPUs
>>> Hi Hermann,
>>>
>>> Good idea, but we are already using `SelectType=select/cons_tres`. After
>>> setting 

Re: [slurm-users] GRES and GPUs

2023-07-19 Thread Hermann Schwärzler

Hi Xaver,

I think you are missing the "Count=..." part in gres.conf

It should read

NodeName=NName Name=gpu File=/dev/tty0 Count=1

in your case.

Regards,
Hermann

On 7/19/23 14:19, Xaver Stiensmeier wrote:

Okay,

thanks to S. Zhang I was able to figure out why nothing changed. While I 
did restart slurmctld at the beginning of my tests, I didn't do so 
later, because I felt like it was unnecessary, but it is right there in 
the fourth line of the log that this is needed. Somehow I misread it and 
thought it automatically restarted slurmctld.


Given the setup:

slurm.conf
...
GresTypes=gpu
NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000 
GRES=gpu:1 State=UNKNOWN

...

gres.conf
NodeName=NName Name=gpu File=/dev/tty0

When restarting, I get the following error:

error: Setting node NName state to INVAL with reason:gres/gpu count 
reported lower than configured (0 < 1)


So it is still not working, but at least I get a more helpful log 
message. Because I know that this /dev/tty trick works, I am still 
unsure where the current error lies, but I will try to investigate it 
further. I am thankful for any ideas in that regard.


Best regards,
Xaver

On 19.07.23 10:23, Xaver Stiensmeier wrote:


Alright,

I tried a few more things, but I still wasn't able to get past: srun: 
error: Unable to allocate resources: Invalid generic resource (gres) 
specification.


I should mention that the node I am trying to test GPU with, doesn't 
really have a gpu, but Rob was so kind to find out that you do not 
need a gpu as long as you just link to a file in /dev/ in the 
gres.conf. As mentioned: This is just for testing purposes - in the 
end we will run this on a node with a gpu, but it is not available at 
the moment.


*The error isn't changing*

If I omit "GresTypes=gpu" and "Gres=gpu:1", I still get the same error.

*Debug Info*

I added the gpu debug flag and logged the following:

[2023-07-18T14:59:45.026] restoring original state of nodes
[2023-07-18T14:59:45.026] select/cons_tres: part_data_create_array: 
select/cons_tres: preparing for 2 partitions
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to 
gpu ignored
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to 
change GresPlugins

[2023-07-18T14:59:45.026] read_slurm_conf: backup_controller not specified
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to 
gpu ignored
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to 
change GresPlugins
[2023-07-18T14:59:45.026] select/cons_tres: select_p_reconfigure: 
select/cons_tres: reconfigure
[2023-07-18T14:59:45.027] select/cons_tres: part_data_create_array: 
select/cons_tres: preparing for 2 partitions

[2023-07-18T14:59:45.027] No parameter for mcs plugin, default values set
[2023-07-18T14:59:45.027] mcs: MCSParameters = (null). ondemand set.
[2023-07-18T14:59:45.028] _slurm_rpc_reconfigure_controller: completed 
usec=5898
[2023-07-18T14:59:45.952] 
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2


I am a bit unsure what to do next to further investigate this issue.

Best regards,
Xaver

On 17.07.23 15:57, Groner, Rob wrote:
That would certainly do it.  If you look at the slurmctld log when it 
comes up, it will say that it's marking that node as invalid because 
it has fewer (0) gres resources than you say it should have.  That's 
because slurmd on that node will come up and say "What gres resources??"


For testing purposes,  you can just create a dummy file on the node, 
then in gres.conf, point to that file as the "graphics file" 
interface.  As long as you don't try to actually use it as a graphics 
file, that should be enough for that node to think it has gres/gpu 
resources. That's what I do in my vagrant slurm cluster.


Rob


*From:* slurm-users  on behalf 
of Xaver Stiensmeier 

*Sent:* Monday, July 17, 2023 9:43 AM
*To:* slurm-users@lists.schedmd.com 
*Subject:* Re: [slurm-users] GRES and GPUs
Hi Hermann,

Good idea, but we are already using `SelectType=select/cons_tres`. After
setting everything up again (in case I made an unnoticed mistake), I saw
that the node got marked STATE=inval.

To be honest, I thought I could just claim that a node has a gpu even if
it doesn't have one - just for testing purposes. Could this be the issue?

Best regards,
Xaver Stiensmeier

On 17.07.23 14:11, Hermann Schwärzler wrote:
> Hi Xaver,
>
> what kind of SelectType are you using in your slurm.conf?
>
> Per 
https://slurm.schedmd.com/gres.html

Re: [slurm-users] configure script can't find nvml.h or libnvidia-ml.so

2023-07-19 Thread Jan Andersen
Hmm, OK - but that is the only nvml.h I can find, as shown by the find 
command. I downloaded the official NVIDIA-Linux-x86_64-535.54.03.run and 
ran it successfully; do I need to install something else besides? A 
Google search for 'CUDA SDK' leads directly to NVIDIA's page: 
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html




On 19/07/2023 12:26, Timo Rothenpieler wrote:

On 19/07/2023 11:47, Jan Andersen wrote:
I'm trying to build slurm with nvml support, but configure doesn't 
find it:


root@zorn:~/slurm-23.02.3# ./configure --with-nvml
...
checking for hwloc installation... /usr
checking for nvml.h... no
checking for nvmlInit in -lnvidia-ml... yes
configure: error: unable to locate libnvidia-ml.so and/or nvml.h

But:

root@zorn:~/slurm-23.02.3# find / -xdev -name nvml.h
/usr/include/hwloc/nvml.h


It's not looking for the hwloc header, but for the nvidia one.
If you have your CUDA SDK installed in, for example, /opt/cuda, you have to 
point it there: --with-nvml=/opt/cuda



root@zorn:~/slurm-23.02.3# find / -xdev -name libnvidia-ml.so
/usr/lib32/libnvidia-ml.so
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so

I tried to figure out how to tell configure where to find them, but 
the script is a bit eye-watering; how should I do it?









Re: [slurm-users] slurmctld and slurmdbd on the server, mysql on remote

2023-07-19 Thread AMU
Oops, I found my error: I forgot to remove JobCompHost. I found it after 
reading this:

https://bugs.schedmd.com/show_bug.cgi?id=2322#c5

sorry for the noise
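
In other words, the fix was to let slurmctld talk only to slurmdbd and drop the 
jobcomp MySQL settings; the relevant part of slurm.conf now looks roughly like 
this (the host name is a placeholder):

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=dbd-host
JobCompType=jobcomp/none
# JobCompHost and the other jobcomp/mysql settings removed, so slurmctld
# no longer opens its own MySQL connection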

On 19/07/2023 14:51, Gérard Henry (AMU) wrote:

Hello all,
Is it possible to have this configuration? I installed Slurm on Ubuntu 
20 LTS, but slurmctld refuses to start with these messages:


[2023-07-19T14:37:59.563] Job completion MYSQL plugin loaded
[2023-07-19T14:37:59.563] debug:  /var/log/slurm/jobcomp doesn't look 
like a database name using slurm_jobcomp_db
[2023-07-19T14:37:59.563] debug2: mysql_connect() called for db 
slurm_jobcomp_db

[2023-07-19T14:37:59.571] debug2: Attempting to connect to localhost:3306
[2023-07-19T14:37:59.571] error: mysql_real_connect failed: 2002 Can't 
connect to local MySQL server through socket 
'/var/run/mysqld/mysqld.sock' (2)

[2023-07-19T14:37:59.572] fatal: You haven't inited this storage yet.

slurmdbd is running, and some stuff seems to be written in db:
# sacctmgr show cluster
   Cluster ControlHost ControlPort  RPC Share GrpJobs GrpTRES GrpSubmit MaxJobs MaxTRES MaxSubmit MaxWall    QOS Def QOS
---------- ----------- ----------- ---- ----- ------- ------- --------- ------- ------- --------- ------- ------ -------
   cathena                       0    0     1                                                              normal

I don't understand why slurmctld needs to connect to MySQL, since it 
connects to slurmdbd.


slurm is # slurmctld -V
slurm-wlm 19.05.5

Thanks in advance for help,




--
Gérard HENRY
Institut Fresnel - UMR 7249
+33 413945457
Aix-Marseille Université - Campus Etoile, BATIMENT FRESNEL, Avenue 
Escadrille Normandie Niemen, 13013 Marseille

Website: https://fresnel.fr/
To protect the environment, please print this email only if necessary.




[slurm-users] slurmctld and slurmdbd on the server, mysql on remote

2023-07-19 Thread AMU

Hello all,
Is it possible to have this configuration? I installed Slurm on Ubuntu 
20 LTS, but slurmctld refuses to start with these messages:


[2023-07-19T14:37:59.563] Job completion MYSQL plugin loaded
[2023-07-19T14:37:59.563] debug:  /var/log/slurm/jobcomp doesn't look 
like a database name using slurm_jobcomp_db
[2023-07-19T14:37:59.563] debug2: mysql_connect() called for db 
slurm_jobcomp_db

[2023-07-19T14:37:59.571] debug2: Attempting to connect to localhost:3306
[2023-07-19T14:37:59.571] error: mysql_real_connect failed: 2002 Can't 
connect to local MySQL server through socket 
'/var/run/mysqld/mysqld.sock' (2)

[2023-07-19T14:37:59.572] fatal: You haven't inited this storage yet.

slurmdbd is running, and some stuff seems to be written in db:
# sacctmgr show cluster
   Cluster ControlHost ControlPort  RPC Share GrpJobs GrpTRES GrpSubmit MaxJobs MaxTRES MaxSubmit MaxWall    QOS Def QOS
---------- ----------- ----------- ---- ----- ------- ------- --------- ------- ------- --------- ------- ------ -------
   cathena                       0    0     1                                                              normal

I don't understand why slurmctld needs to connect to MySQL, since it 
connects to slurmdbd.


slurm is # slurmctld -V
slurm-wlm 19.05.5

Thanks in advance for help,


--
Gérard HENRY
Institut Fresnel - UMR 7249
+33 413945457
Aix-Marseille Université - Campus Etoile, BATIMENT FRESNEL, Avenue 
Escadrille Normandie Niemen, 13013 Marseille

Website: https://fresnel.fr/
To protect the environment, please print this email only if necessary.




Re: [slurm-users] Notify users about job submit plugin actions

2023-07-19 Thread Angel de Vicente
Hello Lorenzo,

Lorenzo Bosio  writes:

> I'm developing a job submit plugin to check if some conditions are met before 
> a job runs.
> I'd need a way to notify the user about the plugin actions (i.e. why their job 
> was killed and what to do), but after a lot of research I could only write to 
> logs and not the user shell.
> The user gets the output of slurm_kill_job but I can't find a way to add a 
> custom note.
>
> Can anyone point me to the right api/function in the code?

In our "job_submit.lua" script we have the following for that purpose:

,
|   slurm.log_user("%s: WARNING: [...]", log_prefix)
`

-- 
Ángel de Vicente
 Research Software Engineer (Supercomputing and BigData)
 Tel.: +34 922-605-747
 Web.: http://research.iac.es/proyecto/polmag/

 GPG: 0x8BDC390B69033F52




Re: [slurm-users] Notify users about job submit plugin actions

2023-07-19 Thread Ole Holm Nielsen
Hi Lorenzo,

On 7/19/23 14:22, Lorenzo Bosio wrote:
> I'm developing a job submit plugin to check if some conditions are met 
> before a job runs.
> I'd need a way to notify the user about the plugin actions (i.e. why their 
> job was killed and what to do), but after a lot of research I could only 
> write to logs and not the user shell.
> The user gets the output of slurm_kill_job but I can't find a way to add a 
> custom note.
> 
> Can anyone point me to the right api/function in the code?

I've written a fairly general job submit plugin which you can copy and 
customize for your needs:

https://github.com/OleHolmNielsen/Slurm_tools/tree/master/plugins

The slurm.log_user() function prints an error message to the user's terminal.

I hope this helps.

/Ole


[slurm-users] Notify users about job submit plugin actions

2023-07-19 Thread Lorenzo Bosio

Hello everyone,

I'm developing a job submit plugin to check if some conditions are met 
before a job runs.
I'd need a way to notify the user about the plugin actions (i.e. why their 
job was killed and what to do), but after a lot of research I could 
only write to logs and not the user shell.
The user gets the output of slurm_kill_job but I can't find a way to add 
a custom note.


Can anyone point me to the right api/function in the code?

Thanks in advance,
Lorenzo

--
Dott. Mag. Lorenzo Bosio
Research Technician
Dipartimento di Informatica


Università degli Studi di Torino
Corso Svizzera, 185 - 10149 Torino
tel. +39 011 670 6836

Re: [slurm-users] GRES and GPUs

2023-07-19 Thread Xaver Stiensmeier

Okay,

thanks to S. Zhang I was able to figure out why nothing changed. While I
did restart slurmctld at the beginning of my tests, I didn't do so
later, because I felt like it was unnecessary, but it is right there in
the fourth line of the log that this is needed. Somehow I misread it and
thought it automatically restarted slurmctld.

Given the setup:

slurm.conf
...
GresTypes=gpu
NodeName=NName SocketsPerBoard=8 CoresPerSocket=1 RealMemory=8000
GRES=gpu:1 State=UNKNOWN
...

gres.conf
NodeName=NName Name=gpu File=/dev/tty0

When restarting, I get the following error:

error: Setting node NName state to INVAL with reason:gres/gpu count
reported lower than configured (0 < 1)

So it is still not working, but at least I get a more helpful log
message. Because I know that this /dev/tty trick works, I am still
unsure where the current error lies, but I will try to investigate it
further. I am thankful for any ideas in that regard.

Best regards,
Xaver

On 19.07.23 10:23, Xaver Stiensmeier wrote:


Alright,

I tried a few more things, but I still wasn't able to get past: srun:
error: Unable to allocate resources: Invalid generic resource (gres)
specification.

I should mention that the node I am trying to test GPU with, doesn't
really have a gpu, but Rob was so kind to find out that you do not
need a gpu as long as you just link to a file in /dev/ in the
gres.conf. As mentioned: This is just for testing purposes - in the
end we will run this on a node with a gpu, but it is not available at
the moment.

*The error isn't changing*

If I omit "GresTypes=gpu" and "Gres=gpu:1", I still get the same error.

*Debug Info*

I added the gpu debug flag and logged the following:

[2023-07-18T14:59:45.026] restoring original state of nodes
[2023-07-18T14:59:45.026] select/cons_tres: part_data_create_array:
select/cons_tres: preparing for 2 partitions
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to
gpu ignored
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to
change GresPlugins
[2023-07-18T14:59:45.026] read_slurm_conf: backup_controller not specified
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to
gpu ignored
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to
change GresPlugins
[2023-07-18T14:59:45.026] select/cons_tres: select_p_reconfigure:
select/cons_tres: reconfigure
[2023-07-18T14:59:45.027] select/cons_tres: part_data_create_array:
select/cons_tres: preparing for 2 partitions
[2023-07-18T14:59:45.027] No parameter for mcs plugin, default values set
[2023-07-18T14:59:45.027] mcs: MCSParameters = (null). ondemand set.
[2023-07-18T14:59:45.028] _slurm_rpc_reconfigure_controller: completed
usec=5898
[2023-07-18T14:59:45.952]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2

I am a bit unsure what to do next to further investigate this issue.

Best regards,
Xaver

On 17.07.23 15:57, Groner, Rob wrote:

That would certainly do it.  If you look at the slurmctld log when it
comes up, it will say that it's marking that node as invalid because
it has fewer (0) gres resources than you say it should have.  That's
because slurmd on that node will come up and say "What gres resources??"

For testing purposes,  you can just create a dummy file on the node,
then in gres.conf, point to that file as the "graphics file"
interface.  As long as you don't try to actually use it as a graphics
file, that should be enough for that node to think it has gres/gpu
resources. That's what I do in my vagrant slurm cluster.

Rob


*From:* slurm-users  on behalf
of Xaver Stiensmeier 
*Sent:* Monday, July 17, 2023 9:43 AM
*To:* slurm-users@lists.schedmd.com 
*Subject:* Re: [slurm-users] GRES and GPUs
Hi Hermann,

Good idea, but we are already using `SelectType=select/cons_tres`. After
setting everything up again (in case I made an unnoticed mistake), I saw
that the node got marked STATE=inval.

To be honest, I thought I could just claim that a node has a gpu even if
it doesn't have one - just for testing purposes. Could this be the issue?

Best regards,
Xaver Stiensmeier

On 17.07.23 14:11, Hermann Schwärzler wrote:
> Hi Xaver,
>
> what kind of SelectType are you using in your slurm.conf?
>
> Per https://slurm.schedmd.com/gres.html you have to consider:
> "As for the --gpu* option, these options are only supported by Slurm's
> select/cons_tres plugin."
>
> So you can use "--gpus ..." only when you state
> SelectType  = select/cons_tres

Re: [slurm-users] configure script can't find nvml.h or libnvidia-ml.so

2023-07-19 Thread Timo Rothenpieler

On 19/07/2023 11:47, Jan Andersen wrote:

I'm trying to build slurm with nvml support, but configure doesn't find it:

root@zorn:~/slurm-23.02.3# ./configure --with-nvml
...
checking for hwloc installation... /usr
checking for nvml.h... no
checking for nvmlInit in -lnvidia-ml... yes
configure: error: unable to locate libnvidia-ml.so and/or nvml.h

But:

root@zorn:~/slurm-23.02.3# find / -xdev -name nvml.h
/usr/include/hwloc/nvml.h


It's not looking for the hwloc header, but for the nvidia one.
If you have your CUDA SDK installed in, for example, /opt/cuda, you have to 
point it there: --with-nvml=/opt/cuda



root@zorn:~/slurm-23.02.3# find / -xdev -name libnvidia-ml.so
/usr/lib32/libnvidia-ml.so
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so

I tried to figure out how to tell configure where to find them, but the 
script is a bit eye-watering; how should I do it?







[slurm-users] configure script can't find nvml.h or libnvidia-ml.so

2023-07-19 Thread Jan Andersen

I'm trying to build slurm with nvml support, but configure doesn't find it:

root@zorn:~/slurm-23.02.3# ./configure --with-nvml
...
checking for hwloc installation... /usr
checking for nvml.h... no
checking for nvmlInit in -lnvidia-ml... yes
configure: error: unable to locate libnvidia-ml.so and/or nvml.h

But:

root@zorn:~/slurm-23.02.3# find / -xdev -name nvml.h
/usr/include/hwloc/nvml.h
root@zorn:~/slurm-23.02.3# find / -xdev -name libnvidia-ml.so
/usr/lib32/libnvidia-ml.so
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so

I tried to figure out how to tell configure where to find them, but the 
script is a bit eye-watering; how should I do it?





Re: [slurm-users] GRES and GPUs

2023-07-19 Thread Xaver Stiensmeier

Alright,

I tried a few more things, but I still wasn't able to get past: srun:
error: Unable to allocate resources: Invalid generic resource (gres)
specification.

I should mention that the node I am trying to test GPUs with doesn't
really have a GPU, but Rob was so kind as to find out that you do not need
a gpu as long as you just link to a file in /dev/ in the gres.conf. As
mentioned: This is just for testing purposes - in the end we will run
this on a node with a gpu, but it is not available at the moment.

*The error isn't changing*

If I omit "GresTypes=gpu" and "Gres=gpu:1", I still get the same error.

*Debug Info*

I added the gpu debug flag and logged the following:

[2023-07-18T14:59:45.026] restoring original state of nodes
[2023-07-18T14:59:45.026] select/cons_tres: part_data_create_array:
select/cons_tres: preparing for 2 partitions
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to gpu
ignored
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to change
GresPlugins
[2023-07-18T14:59:45.026] read_slurm_conf: backup_controller not specified
[2023-07-18T14:59:45.026] error: GresPlugins changed from (null) to gpu
ignored
[2023-07-18T14:59:45.026] error: Restart the slurmctld daemon to change
GresPlugins
[2023-07-18T14:59:45.026] select/cons_tres: select_p_reconfigure:
select/cons_tres: reconfigure
[2023-07-18T14:59:45.027] select/cons_tres: part_data_create_array:
select/cons_tres: preparing for 2 partitions
[2023-07-18T14:59:45.027] No parameter for mcs plugin, default values set
[2023-07-18T14:59:45.027] mcs: MCSParameters = (null). ondemand set.
[2023-07-18T14:59:45.028] _slurm_rpc_reconfigure_controller: completed
usec=5898
[2023-07-18T14:59:45.952]
SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,sched_min_interval=2

I am a bit unsure what to do next to further investigate this issue.

Best regards,
Xaver

On 17.07.23 15:57, Groner, Rob wrote:

That would certainly do it.  If you look at the slurmctld log when it
comes up, it will say that it's marking that node as invalid because
it has fewer (0) gres resources than you say it should have.  That's
because slurmd on that node will come up and say "What gres resources??"

For testing purposes,  you can just create a dummy file on the node,
then in gres.conf, point to that file as the "graphics file"
interface.  As long as you don't try to actually use it as a graphics
file, that should be enough for that node to think it has gres/gpu
resources.  That's what I do in my vagrant slurm cluster.

Rob


*From:* slurm-users  on behalf
of Xaver Stiensmeier 
*Sent:* Monday, July 17, 2023 9:43 AM
*To:* slurm-users@lists.schedmd.com 
*Subject:* Re: [slurm-users] GRES and GPUs
Hi Hermann,

Good idea, but we are already using `SelectType=select/cons_tres`. After
setting everything up again (in case I made an unnoticed mistake), I saw
that the node got marked STATE=inval.

To be honest, I thought I could just claim that a node has a gpu even if
it doesn't have one - just for testing purposes. Could this be the issue?

Best regards,
Xaver Stiensmeier

On 17.07.23 14:11, Hermann Schwärzler wrote:
> Hi Xaver,
>
> what kind of SelectType are you using in your slurm.conf?
>
> Per https://slurm.schedmd.com/gres.html you have to consider:
> "As for the --gpu* option, these options are only supported by Slurm's
> select/cons_tres plugin."
>
> So you can use "--gpus ..." only when you state
> SelectType  = select/cons_tres
> in your slurm.conf.
>
> But "--gres=gpu:1" should work always.
>
> Regards
> Hermann
>
>
> On 7/17/23 13:43, Xaver Stiensmeier wrote:
>> Hey,
>>
>> I am currently trying to understand how I can schedule a job that
>> needs a GPU.
>>
>> I read about GRES https://slurm.schedmd.com/gres.html and tried to use:
>>
>> GresTypes=gpu
>> NodeName=test Gres=gpu:1
>>
>> But calling - after a 'sudo scontrol reconfigure':
>>
>> srun --gpus 1 hostname
>>
>> didn't work:
>>
>> srun: error: Unable to allocate resources: Invalid generic resource
>> (gres) specification
>>
>> so I read more