Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Ward Poelmans
Hi, We have a slightly different script to do the same. It only relies on /sys:

    # Search for infiniband devices and wait until
    # at least one reports that it is ACTIVE
    if [[ ! -d /sys/class/infiniband ]]
    then
        logger "No infiniband found"
        exit 0
    fi
    ports=$(ls
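The /sys-based script in this message is cut off in the archive listing, so what follows is not Ward's script. It is only a minimal sketch of the same idea, assuming that each port exposes a `state` file under `/sys/class/infiniband/<device>/ports/<port>/` whose content contains `ACTIVE` once the link is up. The function names `ib_any_port_active` and `wait_for_ib`, the sysfs-root parameter, and the 60-second default timeout are hypothetical and introduced only for illustration.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds if at least one port below the given sysfs
# root reports ACTIVE. The root is a parameter only so that the function
# can be exercised against a fake directory tree.
ib_any_port_active() {
    local root="${1:-/sys/class/infiniband}"
    local state_file
    for state_file in "$root"/*/ports/*/state; do
        # An unmatched glob is left as a literal pattern; skip it.
        [[ -r "$state_file" ]] || continue
        if grep -q 'ACTIVE' "$state_file"; then
            return 0
        fi
    done
    return 1
}

# Hypothetical wrapper: poll once per second until a port is ACTIVE or the
# timeout (in seconds, assumed default 60) expires. A missing sysfs
# directory is treated as "no InfiniBand present" and is not an error.
wait_for_ib() {
    local root="${1:-/sys/class/infiniband}"
    local timeout="${2:-60}"
    local waited=0
    if [[ ! -d "$root" ]]; then
        return 0
    fi
    until ib_any_port_active "$root"; do
        if (( waited >= timeout )); then
            return 1
        fi
        sleep 1
        waited=$(( waited + 1 ))
    done
    return 0
}
```

A wrapper run before slurmd starts could call `wait_for_ib /sys/class/infiniband 120 || exit 1`, so that a port which never becomes ACTIVE is reported as a failure instead of blocking the boot indefinitely.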

Re: [slurm-users] SLURM , maximum scalable instance is which one

2023-11-01 Thread Davide DelVento
Not sure if it's the largest, but LUMI is a very large one https://www.top500.org/system/180048/ https://docs.lumi-supercomputer.eu/runjobs/scheduled-jobs/partitions/ On Sun, Oct 29, 2023 at 4:16 AM John Joseph wrote: > Dear All, > Like to know that what is the maximum scaled up instance of

Re: [slurm-users] RES: RES: How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Ole Holm Nielsen
I would like to report how the Infiniband/OPA network device starts up step by step as reported by Max's Systemd service from https://github.com/maxlxl/network.target_wait-for-interfaces This is the sequence of events during boot: $ grep wait-for-interfaces.sh /var/log/messages Nov 1

Re: [slurm-users] RES: RES: multiple srun commands in the same SLURM script

2023-11-01 Thread Kevin Broch
Could this apply in your case: https://slurm.schedmd.com/faq.html#opencl_pmix ? On Wed, Nov 1, 2023 at 5:24 AM Paulo Jose Braga Estrela < paulo.estr...@petrobras.com.br> wrote: > Yeah, you are right. I don’t know why but it seems that my email client > messed with message formatting putting all

[slurm-users] RES: RES: How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Paulo Jose Braga Estrela
Ole, Look at the NetworkManager-wait-online.service man page below (from RHEL 8.8). Maybe your IB interfaces aren't properly configured in NetworkManager. The *** were added by me. " NetworkManager-wait-online.service blocks until NetworkManager logs "startup complete" and announces startup

[slurm-users] RES: RES: multiple srun commands in the same SLURM script

2023-11-01 Thread Paulo Jose Braga Estrela
Yeah, you are right. I don’t know why but it seems that my email client messed with message formatting putting all srun commands in one line. PUBLIC -Original message- From: slurm-users On behalf of Bjørn-Helge Mevik Sent: Wednesday, November 1, 2023 04:55 To:

Re: [slurm-users] how to configure correctly node and memory when a script fails with out of memory

2023-11-01 Thread Rémi Palancher
Hello Gérard, > On 30/10/2023 15:46, Gérard Henry (AMU) wrote: >> Hello all, >> … >> when it fails, sacct gives the following information: >> JobID   JobName    Elapsed  NCPUS   TotalCPU    CPUTime >> ReqMem MaxRSS  MaxDiskRead MaxDiskWrite  State ExitCode >>

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Ole Holm Nielsen
Hi Rémi, Thanks for the feedback! The patch revert[1] explains SchedMD's reason: The reasoning is that sysadmins who see nodes with Reason "Not Responding" but they can manually ping/access the node end up confused. That reason should only be set if the node is truly not responding, but not

Re: [slurm-users] How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Rémi Palancher
Hi Ole, On 30/10/2023 at 13:50, Ole Holm Nielsen wrote: > I'm fighting this strange scenario where slurmd is started before the > Infiniband/OPA network is fully up. The Node Health Check (NHC) executed > by slurmd then fails the node (as it should). This happens only on EL8 > Linux

Re: [slurm-users] RES: How to delay the start of slurmd until Infiniband/OPA network is fully up?

2023-11-01 Thread Ole Holm Nielsen
Hi Paulo, On 11/1/23 01:12, Paulo Jose Braga Estrela wrote: I think that you should use NetworkManager-wait-online.service in RHEL 8. Take a look at its man page. It only allows the system to reach network-online after all network interfaces are online. So, if your OPA interfaces are managed by

Re: [slurm-users] RES: multiple srun commands in the same SLURM script

2023-11-01 Thread Bjørn-Helge Mevik
Paulo Jose Braga Estrela writes: > Hi, > > I think that you have a syntax error in your bash script. The "&" > means that you want to send a process to background not that you want > to run many commands in parallel. To run commands in a serial fashion > you should use cmd && cmd2, then the cmd2
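The difference between `&` and `&&` discussed in this thread can be illustrated with a small bash sketch. It is not taken from either message: `step` is a hypothetical stand-in for a job step (in a real batch script it would be an `srun` command), and `run_serial` and `run_concurrent` are illustrative names.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for a job step: sleeps for $2 seconds, then
# reports that step $1 has finished.
step() {
    sleep "$2"
    echo "$1 done"
}

# Serial: with '&&', step B starts only after step A has exited with
# status 0.
run_serial() {
    step A 1 && step B 1
}

# Concurrent: each '&' puts a step in the background so both run at the
# same time; 'wait' blocks until every background step has finished.
run_concurrent() {
    step A 1 &
    step B 1 &
    wait
}
```

Without the final `wait`, the script could reach its end while the backgrounded steps are still running.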