Hello Shamim,

The error says that you're calling MPI with the wrong parameters, specifically -npernode. Since you're using Slurm, MPI should be smart enough that you don't need to pass -n or -npernode. How did you get a Runscript and Submitscript for this machine? Did you create them yourself?
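
For what it's worth, the value Open MPI rejects is 40.0, a float where it expects an integer. A minimal sketch of one way that could arise, assuming (I have not checked this on your machine) that simfactory evaluates the @( ... )@ expression in the Runscript as a Python expression after substituting the @NAME@ placeholders; the variable names below are only for illustration:

```python
# Assumption: @(@PPN_USED@ / @NUM_THREADS@)@ is evaluated as a Python
# expression once the placeholders have been replaced by numbers.
ppn_used = 40      # @PPN_USED@ in the failing run (40 procs per node)
num_threads = 1    # @NUM_THREADS@, from --num-threads=1

true_div = ppn_used / num_threads    # "/" is true division in Python 3: float
floor_div = ppn_used // num_threads  # "//" is floor division: stays an int

print("-npernode", true_div)   # Python 3 prints: -npernode 40.0
print("-npernode", floor_div)  # prints: -npernode 40
```

If that is what is happening, writing the expression with // (or wrapping it in int(...)) might give mpiexec the integer it expects, and the intermittent behaviour could come from different Python versions being picked up at submission time. That is a guess, not something I have verified.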

--Steve

On 5/1/2024 6:54 AM, Shamim Haque 1910511 wrote:
Hi all,

I am attempting to install the ETK on the KALINGA cluster at NISER, India. This cluster has 40 procs per node and uses the SLURM workload manager.

I compiled ETK with gcc-7.5 and openmpi-4.0.5 (attached the machinefile, optionlist, submitscript and runscript). The installation is mostly alright, as I can run parfiles for test TOV and BNS mergers.

I tried to run a simulation with procs=160 (nodes 4) and num-threads=1 but ran into this error (error file also attached):

+ mpiexec -n 640 -npernode 40.0 /home/kamal/simulations/dx25_r500_rg7_t30_p640-1_2/SIMFACTORY/exe/cactus_sim -L 3 /home/kamal/simulations/dx25_r500_rg7_t30_p640-1_2/output-0000/eos20_dx25_r500_rg7.par
----------------------------------------------------------------------------
Open MPI has detected that a parameter given to a command line
option does not match the expected format:

  Option: npernode
  Param:  40.0

This is frequently caused by omitting to provide the parameter
to an option that requires one. Please check the command line and try again.
----------------------------------------------------------------------------

Strangely, this error is intermittent. Most of the time it does not appear and the simulation runs fine, with no changes made to the scripts or the simfactory command. In fact, this exact simulation has worked before. Since I cannot find the source of the issue, I also cannot reproduce the error on demand, but it does occur occasionally.

My command for mpi execution in runscript looks like this:

time mpiexec -n @NUM_PROCS@ -npernode @(@PPN_USED@ / @NUM_THREADS@)@ @EXECUTABLE@ -L 3 @PARFILE@

If I replace @(@PPN_USED@ / @NUM_THREADS@)@ with the desired value written out explicitly, then the script always works. My simfactory command looks like this:

./simfactory/bin/sim create-submit dx25_r500_rg7_t30_p640-1_2 --parfile=par-smooth/scale_test/eos20_dx25_r500_rg7.par --queue=large1 --procs=640 --num-threads=1 --walltime=00:45:00

I am unable to understand how to solve this issue. Any help with this issue is appreciated. Please let me know if you need more information. Thank you.

Regards
Shamim Haque
Senior Research Fellow (SRF)
Department of Physics
IISER Bhopal
_______________________________________________
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users
