[slurm-users] Recommendation on running multiple jobs

2023-12-10 Thread Sundaram Kumaran
Dear Users,
May I have your guidance on how to run multiple jobs on the servers?

We have 2 servers, Platinum and Cerium.

  1.  When I launch 2 jobs on Platinum, the tool launches successfully and Slurm 
distributes the jobs across the 2 servers, but when launching the 3rd job it 
stays queued waiting for resources.

Is it possible to launch multiple jobs concurrently? I used "srun -l virtuoso &".

[vlsicad3@cerium ~/PDK_TSMC013]$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    251    active virtuoso    gohks PD       0:00      1 (Resources)
    247    active virtuoso   yongcs  R 5-20:39:14      1 cerium
    250    active virtuoso vlsicad3  R       3:34      1 platinum
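
For reference, a minimal sketch of one way to launch several such jobs with 
explicit per-job resource requests, so they can share the free CPUs on a node 
(the CPU and memory values below are illustrative assumptions, not taken from 
this message):

  # each interactive Virtuoso session asks for 1 CPU and 4 GB (values are assumptions)
  srun -l --ntasks=1 --cpus-per-task=1 --mem=4G virtuoso &
  srun -l --ntasks=1 --cpus-per-task=1 --mem=4G virtuoso &
  # a 3rd job starts only when a node still has enough free CPUs/memory;
  # otherwise it waits with reason (Resources), as in the squeue output above
  squeue -u $USER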

Thanks for the support
Regards,
KumaranS



[slurm-users] slurm bank utility

2023-12-10 Thread Purvesh Parmar
Hi,

We are using Slurm 21.08. We would like to know how to use the "sbank" utility
to credit GPU hours, just as it does CPU minutes, and also to get the status of
GPU hours credited, used, etc.
However, the sbank utility from GitHub does not appear to support adding or
querying GPU hours.

Is there any other way this can be done?
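
This is not from the original message, but assuming slurmdbd accounting is
enabled and GPUs are tracked as a TRES, one alternative is to manage GPU-minute
budgets with sacctmgr and report usage with sreport; a minimal sketch
("myproject" is a hypothetical account name):

  # slurm.conf: track gres/gpu as a TRES (GPUs must already be defined as GRES)
  #   AccountingStorageTRES=gres/gpu

  # credit an account with a budget of GPU-minutes (the value is illustrative)
  sacctmgr modify account myproject set GrpTRESMins=gres/gpu=60000

  # report GPU usage per account and user, in minutes
  sreport cluster AccountUtilizationByUser start=2023-12-01 -t minutes --tres=gres/gpu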

Rg,

Purvesh


Re: [slurm-users] SlurmdSpoolDir full

2023-12-10 Thread Ole Holm Nielsen

On 10-12-2023 17:29, Ryan Novosielski wrote:

This is basically always somebody filling up /tmp, with /tmp residing on the same 
filesystem as the actual SlurmdSpoolDir.

/tmp, without modifications, is almost certainly the wrong place for temporary 
HPC files; they are too large for it.


Agreed!  That's why temporary job directories can be configured in 
Slurm; see the Wiki page for a summary:

https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#temporary-job-directories
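
For illustration, a minimal sketch of one of the approaches described there, 
the job_container/tmpfs plugin (the /localscratch path is an assumption):

  # slurm.conf
  JobContainerType=job_container/tmpfs

  # job_container.conf on the compute nodes:
  # each job gets a private /tmp bind-mounted under BasePath and removed at job end
  AutoBasePath=true
  BasePath=/localscratch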

/Ole


On Dec 8, 2023, at 10:02, Xaver Stiensmeier  wrote:

Dear slurm-user list,

during a larger cluster run (the same one I mentioned earlier, 242 nodes), I
got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a
directory on the workers that is used for job state information
(https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However,
I was unable to find more precise information on that directory. We
compute all data on another volume, so SlurmdSpoolDir has roughly 38 GB
of free space and nothing is intentionally put there during the run. This
error only occurred on very few nodes.

I would like to understand what slurmd is placing in this directory that fills
up the space. Do you have any ideas? Due to the workflow used, we have a
hard time reconstructing the exact scenario that caused this error. I
guess the "fix" is to just pick a somewhat larger disk, but I am unsure
whether Slurm behaves normally here.

Best regards
Xaver Stiensmeier




Re: [slurm-users] SlurmdSpoolDir full

2023-12-10 Thread Peter Goode
We maintain /tmp as a separate partition on all nodes to mitigate exactly this 
scenario, though it doesn't necessarily need to be part of the primary system 
RAID. There is no need for /tmp resiliency.
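
For illustration, a minimal sketch of such a layout (device name, filesystem 
and mount options are assumptions, not from this message):

  # /etc/fstab on a compute node: /tmp on its own local, non-RAID disk
  /dev/sdb1   /tmp   xfs   defaults,nodev,nosuid   0 0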


Regards,
Peter

Peter Goode
Research Computing Systems Administrator
Lafayette College

> On Dec 10, 2023, at 11:33, Ryan Novosielski  wrote:
> 
> This is basically always somebody filling up /tmp, with /tmp residing on the 
> same filesystem as the actual SlurmdSpoolDir.
> 
> /tmp, without modifications, is almost certainly the wrong place for 
> temporary HPC files; they are too large for it.
> 
> Sent from my iPhone
> 
>> On Dec 8, 2023, at 10:02, Xaver Stiensmeier  wrote:
>> 
>> Dear slurm-user list,
>> 
>> during a larger cluster run (the same one I mentioned earlier, 242 nodes), I
>> got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a
>> directory on the workers that is used for job state information
>> (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However,
>> I was unable to find more precise information on that directory. We
>> compute all data on another volume, so SlurmdSpoolDir has roughly 38 GB
>> of free space and nothing is intentionally put there during the run. This
>> error only occurred on very few nodes.
>>
>> I would like to understand what slurmd is placing in this directory that fills
>> up the space. Do you have any ideas? Due to the workflow used, we have a
>> hard time reconstructing the exact scenario that caused this error. I
>> guess the "fix" is to just pick a somewhat larger disk, but I am unsure
>> whether Slurm behaves normally here.
>> 
>> Best regards
>> Xaver Stiensmeier
>> 
>> 



Re: [slurm-users] SlurmdSpoolDir full

2023-12-10 Thread Ryan Novosielski
This is basically always somebody filling up /tmp, with /tmp residing on the same 
filesystem as the actual SlurmdSpoolDir.

/tmp, without modifications, is almost certainly the wrong place for temporary 
HPC files; they are too large for it.

Sent from my iPhone

> On Dec 8, 2023, at 10:02, Xaver Stiensmeier  wrote:
> 
> Dear slurm-user list,
> 
> during a larger cluster run (the same one I mentioned earlier, 242 nodes), I
> got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a
> directory on the workers that is used for job state information
> (https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However,
> I was unable to find more precise information on that directory. We
> compute all data on another volume, so SlurmdSpoolDir has roughly 38 GB
> of free space and nothing is intentionally put there during the run. This
> error only occurred on very few nodes.
>
> I would like to understand what slurmd is placing in this directory that fills
> up the space. Do you have any ideas? Due to the workflow used, we have a
> hard time reconstructing the exact scenario that caused this error. I
> guess the "fix" is to just pick a somewhat larger disk, but I am unsure
> whether Slurm behaves normally here.
> 
> Best regards
> Xaver Stiensmeier
> 
> 


Re: [slurm-users] SlurmdSpoolDir full

2023-12-10 Thread Xaver Stiensmeier

Hello Brian Andrus,

we ran 'df -h' to determine the amount of free space I mentioned below.
I should also add that at the time we inspected the node, there were
still around 38 GB of space left; however, we were unable to watch the
remaining space while the error occurred, so maybe the large file(s) were
removed immediately afterwards.

I will take a look at /var/log. That's a good idea. I don't think
there will be anything unusual, but it's something I haven't thought
about yet (the cause of the error being somewhere else).

Best regards
Xaver

On 10.12.23 00:41, Brian Andrus wrote:

Xaver,

It is likely your /var or /var/spool mount.
That may be a separate partition or part of your root partition. It is
the partition that is full, not the directory itself. So the cause
could very well be log files in /var/log. I would check to see what
(if any) partitions are getting filled on the node. You can run 'df
-h' and see some info that would get you started.
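
As a concrete illustration (the /var/spool/slurmd path is a common default but
an assumption here; check SlurmdSpoolDir in slurm.conf):

  # which filesystem holds the spool directory, and how full is it?
  df -h /var/spool/slurmd
  # what is actually consuming space on that filesystem?
  du -xsh /var/spool/slurmd/* /var/log /tmp 2>/dev/null | sort -h | tail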

Brian Andrus

On 12/8/2023 7:00 AM, Xaver Stiensmeier wrote:

Dear slurm-user list,

during a larger cluster run (the same one I mentioned earlier, 242 nodes), I
got the error "SlurmdSpoolDir full". The SlurmdSpoolDir is apparently a
directory on the workers that is used for job state information
(https://slurm.schedmd.com/slurm.conf.html#OPT_SlurmdSpoolDir). However,
I was unable to find more precise information on that directory. We
compute all data on another volume, so SlurmdSpoolDir has roughly 38 GB
of free space and nothing is intentionally put there during the run. This
error only occurred on very few nodes.

I would like to understand what slurmd is placing in this directory that fills
up the space. Do you have any ideas? Due to the workflow used, we have a
hard time reconstructing the exact scenario that caused this error. I
guess the "fix" is to just pick a somewhat larger disk, but I am unsure
whether Slurm behaves normally here.

Best regards
Xaver Stiensmeier