Hello everyone,
I am new to ETK. For my high school research project, I am trying to run a
BNS merger simulation on the Amarel supercomputer at my local university.

Could you please help me start my simulation under SLURM? I have followed
steps 1-5 of the ETK gallery example for the BNS simulation, but I have not
been able to set up a machine configuration that runs the simulation.

I ran the following commands:
/home/sb1554/BNS/simfactory/bin/sim create bns --parfile /home/sb1554/BNS/bns.par --machine slurmbns

srun  bns.sh -o slurm.bns.%N.%j.out

and got the error:

*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

I am attaching my machine file, submit script, run script, and log files.

I would appreciate any pointers, or a referral to the right person.

I tried to post this on the ETK forum, but I need one credit to post.

Thank you,
Maya
+ set -e
+ cd /home/sb1554/simulations/bns/output-0000-active
+ echo Checking:
+ pwd
+ hostname
+ date
+ echo Environment:
+ export CACTUS_NUM_PROCS=1
+ CACTUS_NUM_PROCS=1
+ export CACTUS_NUM_THREADS=8
+ CACTUS_NUM_THREADS=8
+ export GMON_OUT_PREFIX=gmon.out
+ GMON_OUT_PREFIX=gmon.out
+ export OMP_NUM_THREADS=8
+ OMP_NUM_THREADS=8
+ sort
+ env
+ echo Starting:
++ date +%s
+ export CACTUS_STARTTIME=1732921316
+ CACTUS_STARTTIME=1732921316
+ '[' 1 = 1 ']'
+ '[' 0 -eq 0 ']'
+ /home/sb1554/simulations/bns/SIMFACTORY/exe/cactus_sim -L 3 
/home/sb1554/simulations/bns/output-0000/bns.par
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[slepner085.amarel.rutgers.edu:07624] Local abort before MPI_INIT completed 
completed successfully, but am not able to aggregate error messages, and not 
able to guarantee that all other processes were killed!
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::Creating 
simulation bns
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::Simulation 
directory: /home/sb1554/simulations/bns
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::Simulation 
Properties:
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::[properties]
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::machine      
   = slurmbns
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::simulationid 
   = 
simulation-bns-slurmbns-amarel1.amarel.rutgers.edu-sb1554-2024.11.29-18.01.08-16276
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::sourcedir    
   = /home/sb1554/BNS
[LOG:2024-11-29 18:01:08] restart.create(simulationName, 
parfile)::configuration   = sim
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::configid     
   = config-sim-slepner088.amarel.rutgers.edu-cache-home-sb1554-BNS
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::buildid      
   = build-sim-slepner088.amarel.rutgers.edu-sb1554-2024.11.15-02.32.38-2196
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::testsuite    
   = False
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::executable   
   = /home/sb1554/simulations/bns/SIMFACTORY/exe/cactus_sim
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::optionlist   
   = /home/sb1554/simulations/bns/SIMFACTORY/cfg/OptionList
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::submitscript 
   = /home/sb1554/simulations/bns/SIMFACTORY/run/SubmitScript
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::runscript    
   = /home/sb1554/simulations/bns/SIMFACTORY/run/RunScript
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::parfile      
   = /home/sb1554/simulations/bns/SIMFACTORY/par/bns.par
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::Simulation 
bns created
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::Creating new 
properties because this is an independant run, not a run following a submit
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::Determined the 
following properties
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::[properties]
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::machine         = 
slurmbns
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::simulationid    = 
simulation-bns-slurmbns-amarel1.amarel.rutgers.edu-sb1554-2024.11.29-18.01.08-16276
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::sourcedir       = 
/home/sb1554/BNS
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::configuration   = sim
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::configid        = 
config-sim-slepner088.amarel.rutgers.edu-cache-home-sb1554-BNS
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::buildid         = 
build-sim-slepner088.amarel.rutgers.edu-sb1554-2024.11.15-02.32.38-2196
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::testsuite       = 
False
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::executable      = 
/home/sb1554/simulations/bns/SIMFACTORY/exe/cactus_sim
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::optionlist      = 
/home/sb1554/simulations/bns/SIMFACTORY/cfg/OptionList
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::submitscript    = 
/home/sb1554/simulations/bns/SIMFACTORY/run/SubmitScript
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::runscript       = 
/home/sb1554/simulations/bns/SIMFACTORY/run/RunScript
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::parfile         = 
/home/sb1554/simulations/bns/SIMFACTORY/par/bns.par
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::nodes           = 1
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::procsrequested  = 8
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::ppn             = 8
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::numprocs        = 1
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::nodeprocs       = 1
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::procs           = 1
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::numthreads      = 8
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::ppnused         = 8
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::numsmt          = 1
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::hostname        = 
amarel1.amarel.rutgers.edu
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::user            = 
sb1554
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::memory          = 
124000
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::cpufreq         =
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::pbsSimulationName= 
bns-0000
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::
[LOG:2024-11-29 18:01:56] self.makeActive()::Simulation bns with restart-id 0 
has been made active
[LOG:2024-11-29 18:01:56] self.run(debug)::Prepping for execution/run
[LOG:2024-11-29 18:01:56] checkpointing = 
self.PrepareCheckpointing(recover_id)::PrepareCheckpointing: max_restart_id: -1
[LOG:2024-11-29 18:01:56] self.run(debug)::Defined substitution properties for 
execution/run
[LOG:2024-11-29 18:01:56] self.run(debug)::{'MACHINE': 'slurmbns', 'SOURCEDIR': 
'/home/sb1554/BNS', 'SIMULATION_NAME': 'bns', 'SHORT_SIMULATION_NAME': 
'bns-0000', 'SIMULATION_ID': 
'simulation-bns-slurmbns-amarel1.amarel.rutgers.edu-sb1554-2024.11.29-18.01.08-16276',
 'RESTART_ID': 0, 'SCRIPTFILE': 
'/home/sb1554/simulations/bns/SIMFACTORY/run/SubmitScript', 'SUBMITSCRIPT': 
'/home/sb1554/simulations/bns/SIMFACTORY/run/SubmitScript', 'CONFIGURATION': 
'sim', 'EXECUTABLE': '/home/sb1554/simulations/bns/SIMFACTORY/exe/cactus_sim', 
'PARFILE': '/home/sb1554/simulations/bns/output-0000/bns.par', 'RUNDIR': 
'/home/sb1554/simulations/bns/output-0000', 'HOSTNAME': 
'amarel1.amarel.rutgers.edu', 'USER': 'sb1554', 'ALLOCATION': 'NO_ALLOCATION', 
'NODES': 1, 'PROCS_REQUESTED': 8, 'PPN': 8, 'NUM_PROCS': 1, 'NODE_PROCS': 1, 
'PROCS': 1, 'NUM_THREADS': 8, 'PPN_USED': 8, 'NUM_SMT': 1, 'MEMORY': '124000', 
'CPUFREQ': None, 'RUNDEBUG': 0}
[LOG:2024-11-29 18:01:56] self.run(debug)::Executing run command: 
/home/sb1554/simulations/bns/output-0000/SIMFACTORY/RunScript
[LOG:2024-11-29 18:05:43] restart.load(simulationName, active_id)::For 
simulation bns, loaded restart id 0, long restart id 0000
[LOG:2024-11-29 18:29:40] ret = restart.load(sim, restart_id)::For simulation 
bns, loaded restart id 0, long restart id 0000
[LOG:2024-11-29 18:29:40] ret = restart.load(sim, restart_id)::For simulation 
bns, loaded restart id 0, long restart id 0000
#! /bin/bash

export SIMFACTORY=/home/sb1554/Cactus/simafactory/bin
export SOURCE_DIR=/home/sb1554/Cactus
export CACTUS_PATH=/home/sb1554/BNS

#BATCH --partition=main          # Partition (job queue)

#SBATCH --requeue                 # Return job to the queue if preempted

#SBATCH --job-name=bnsnew       # Assign a short name to your job

#SBATCH --nodes=1               # Number of nodes you require

#SBATCH --ntasks=1              # Total # of tasks across all nodes

#SBATCH --ntasks-per-node=1

#SBATCH --cpus-per-task=1        # Cores per task (>1 if multithread tasks)

#SBATCH --mem=124000                # Real memory (RAM) required (MB)

#SBATCH --time=70:00:00           # Total run time limit (HH:MM:SS)

#SBATCH --output=slurm.bns.%N.%j.out  # STDOUT output file

#SBATCH --error=slurm.bns.%N.%j.err   # STDERR output file (optional)


module use /projects/community/modulefiles
#module load gcc/10.2.0/openmpi/4.0.5-bz186
module load gcc/11.2/openmpi/4.1.3-kholodvl
module load libnl/3.2.25-sb1554
module load rdma-core/54.0-sb1554
module load gsl/2.5-bd387



cd /home/sb1554/BNS
/home/sb1554/BNS/simfactory/mdb/runscripts/slurmbns.run
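
One thing that can be checked in the submit script above: the partition line reads `#BATCH` rather than `#SBATCH`, so Slurm would treat it as an ordinary comment. The sketch below lists such lines. It is only an illustration: `find_mistyped_directives` is a hypothetical helper name (not part of Slurm or simfactory), and nothing here establishes that this is the cause of the `MPI_Init_thread` error.

```shell
#!/bin/bash
# Sketch: report comment lines that look like Slurm directives ("#WORD --option")
# but do not start with exactly "#SBATCH".
# find_mistyped_directives is a hypothetical helper, not a Slurm or simfactory tool.
find_mistyped_directives() {
    local script="$1"
    # First grep: numbered lines of the form "#WORD --option".
    # Second grep: drop the correctly spelled "#SBATCH" lines.
    # "|| true" keeps the exit status 0 when nothing is found.
    grep -nE '^#[A-Za-z]+[[:space:]]+--' "$script" \
        | grep -vE '^[0-9]+:#SBATCH[[:space:]]' || true
}
```

It could be called as `find_mistyped_directives bns.sh`. It may also be worth knowing that, according to the `sbatch` manual, directive processing stops at the first non-comment, non-whitespace line of the script, so `#SBATCH` lines placed after the `export` statements would not be read either.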
[slurmbns]

# This machine description file is used internally by simfactory as a template
# during the sim setup and sim setup-silent commands
# Edit at your own risk
# Machine description
nickname        = slurmbns
name            = slurmbns
location        = LSU
description     = CCT
status          = production

# Access to this machine
hostname        = amarel1.amarel.rutgers.edu
aliaspattern    = ^\w+(\.amarel\.rutgers\.edu)?$

# Source tree management
sourcebasedir   = /home/sb1554
optionlist      = generic.cfg
submitscript    = slurmbns.sub
runscript       = slurmbns.run
make            = make -j@MAKEJOBS@
basedir         = /home/sb1554/simulations
ppn             = 8
max-num-threads = 128
num-threads     = 8
memory          = 124000
nodes           = 2
num-smt         = 1
#procs           = 16
submit          = sbatch /home/sb1554/BNS/simfactory/mdb/runscripts/slurmbns.run
getstatus       = squeue -j @JOB_ID@

# need to kill the whole set of processes descending from @JOB_ID@, not just the
# (simfactory) top-level process
stop            = scancel @JOB_ID@
submitpattern   = 'Submitted batch job (\d+)'
statuspattern   = '@JOB_ID@ '
queuedpattern   = ' PD '
queue           = checkpt
runningpattern  = ' (CF|CG|R|TO) '
holdingpattern  = '\(JobHeldUser\)'
exechostpattern = (.*)
stdout          = cat @[email protected]
stderr          = cat @[email protected]
stdout-follow   = sleep 10 ; sattach @[email protected]
# stdout-follow   = while ! scontrol >/dev/null wait_job @JOB_ID@ ; do sleep 5 
; done ; tail -n 100 -f @[email protected] @[email protected]
maxwalltime     = 72:00:00
disabled-thorns = CactusUtils/SystemTopology

[slurmbns]

# This machine description file is used internally by simfactory as a template
# during the sim setup and sim setup-silent commands
# Edit at your own risk
# Machine description
nickname        = slurmbns
name            = slurmbns
location        = LSU
description     = CCT
status          = production

# Access to this machine
hostname        = amarel1.amarel.rutgers.edu
aliaspattern    = ^\w+(\.amarel\.rutgers\.edu)?$

# Source tree management
sourcebasedir   = /home/sb1554
optionlist      = generic.cfg
submitscript    = slurmbns.sub
runscript       = slurmbns.run
make            = make -j@MAKEJOBS@
basedir         = /home/sb1554/simulations
ppn             = 8
max-num-threads = 128
num-threads     = 8
memory          = 124000
nodes           = 33
submit          = sbatch /home/sb1554/BNS/simfactory/mdb/runscripts/slurmbns.run
getstatus       = squeue -j @JOB_ID@

# need to kill the whole set of processes descending from @JOB_ID@, not just the
# (simfactory) top-level process
stop            = scancel @JOB_ID@
submitpattern   = 'Submitted batch job (\d+)'
statuspattern   = '@JOB_ID@ '
queuedpattern   = ' PD '
queue           = checkpt
runningpattern  = ' (CF|CG|R|TO) '
holdingpattern  = '\(JobHeldUser\)'
exechost        = hostname -s
exechostpattern = (.*)
stdout          = cat @[email protected]
stderr          = cat @[email protected]
stdout-follow   = sleep 10 ; sattach @[email protected]
# stdout-follow   = while ! scontrol >/dev/null wait_job @JOB_ID@ ; do sleep 5 
; done ; tail -n 100 -f @[email protected] @[email protected]
maxwalltime     = 72:00:00
disabled-thorns = CactusUtils/SystemTopology
#! /bin/bash

echo "Preparing:"
set -x                          # Output commands
set -e                          # Abort on errors

cd /home/sb1554/

echo "Checking:"
pwd
hostname
date

echo "Environment:"
export CACTUS_PATH=/home/sb1554/BNS
export CACTUS_NUM_PROCS=2
export CACTUS_NUM_THREADS=8
export GMON_OUT_PREFIX=gmon.out
export OMP_NUM_THREADS=8
export OMP_PLACES=cores # TODO: maybe use threads when smt is used?
# https://github.com/open-mpi/ompi/issues/4948
export OMPI_MCA_btl_vader_single_copy_mechanism=none
env | sort > /home/sb1554/BNS/simfactory/ENVIRONMENT

echo "Starting:"
export CACTUS_STARTTIME=$(date +%s)
#time srun -n ${CACTUS_NUM_PROCS} @EXECUTABLE@ -L 3 /home/sb1554/BNS/bns.par
time /home/sb1554/BNS/simfactory/bin/sim run bns --parfile /home/sb1554/BNS/bns.par --machine slurmbns
#time srun  @EXECUTABLE@ -L 3 /home/sb1554/BNS/bns.par
echo "Stopping:"
date

echo "Done."
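
The scripts and logs above set `OMP_NUM_THREADS=8` and `CACTUS_NUM_THREADS=8`, while the submit script requests `--cpus-per-task=1`. A minimal sketch for flagging that kind of mismatch from inside a job is given below. It assumes Slurm has exported `SLURM_CPUS_PER_TASK` (which it does only when `--cpus-per-task` is given) and falls back to 1 otherwise; `check_thread_count` is a hypothetical helper name. Whether this mismatch is related to the `MPI_Init_thread` error is not established here.

```shell
#!/bin/bash
# Sketch: compare the requested OpenMP thread count with the CPUs Slurm
# allocated per task. check_thread_count is a hypothetical helper name.
check_thread_count() {
    # Both values default to 1 when the variable is unset (an assumption).
    local threads="${OMP_NUM_THREADS:-1}"
    local cpus="${SLURM_CPUS_PER_TASK:-1}"
    if [ "$threads" -gt "$cpus" ]; then
        echo "mismatch: $threads threads requested, $cpus CPUs per task allocated"
        return 1
    fi
    echo "ok: $threads threads requested, $cpus CPUs per task allocated"
}
```

Calling `check_thread_count` near the top of the run script, before the executable starts, would print one line to the job's standard output and return a non-zero status on a mismatch.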

Attachment: bns.sh
Description: Bourne shell script
