Hello everyone, I am new to ETK. For my high school research project I am trying to run a BNS merger simulation on the Amarel supercomputer at my local university.
Could you please help me start my simulation under SLURM? I followed steps 1-5 of the ETK gallery example for the BNS simulation, but I have not been able to set up a machine that runs the simulation successfully. I ran the following commands:

    /home/sb1554/BNS/simfactory/bin/sim create bns --parfile /home/sb1554/BNS/bns.par --machine slurmbns
    srun bns.sh -o slurm.bns.%N.%j.out

and got this error:

    *** An error occurred in MPI_Init_thread
    *** on a NULL communicator
    *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
    ***    and potentially your MPI job)

I am attaching my machine file, submit script, run script, and log files. I would appreciate any pointers, or a referral to the right person. I tried to post this on the ETK forum, but I need one credit to post there.

Thank you,
Maya

+ set -e
+ cd /home/sb1554/simulations/bns/output-0000-active
+ echo Checking:
+ pwd
+ hostname
+ date
+ echo Environment:
+ export CACTUS_NUM_PROCS=1
+ CACTUS_NUM_PROCS=1
+ export CACTUS_NUM_THREADS=8
+ CACTUS_NUM_THREADS=8
+ export GMON_OUT_PREFIX=gmon.out
+ GMON_OUT_PREFIX=gmon.out
+ export OMP_NUM_THREADS=8
+ OMP_NUM_THREADS=8
+ sort
+ env
+ echo Starting:
++ date +%s
+ export CACTUS_STARTTIME=1732921316
+ CACTUS_STARTTIME=1732921316
+ '[' 1 = 1 ']'
+ '[' 0 -eq 0 ']'
+ /home/sb1554/simulations/bns/SIMFACTORY/exe/cactus_sim -L 3 /home/sb1554/simulations/bns/output-0000/bns.par
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[slepner085.amarel.rutgers.edu:07624] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

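One check that might help narrow down the MPI_Init_thread failure is to see which MPI shared libraries the executable was linked against, and compare that with the MPI module loaded at run time. The following is only a sketch: `mpi_libs_of` is a hypothetical helper name, not part of simfactory, and the library name patterns assume an Open MPI style installation.

```shell
#!/bin/bash
# Hypothetical helper (not part of simfactory): list the MPI-related shared
# libraries that an executable is dynamically linked against.
mpi_libs_of() {
    local exe="$1"
    if [ ! -e "$exe" ]; then
        echo "missing: $exe"
        return 1
    fi
    # ldd prints one shared library per line; keep only the MPI-related ones.
    # The patterns below are an assumption based on Open MPI library names.
    ldd "$exe" 2>/dev/null | grep -i -E 'libmpi|libopen-(rte|pal)' \
        || echo "no MPI libraries found for $exe"
}

# Example call, using the executable path from the trace above:
#   mpi_libs_of /home/sb1554/simulations/bns/SIMFACTORY/exe/cactus_sim
```

If the library paths printed here do not belong to the MPI module shown by `module list` inside the job, that difference would be worth mentioning when asking for help.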
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::Creating simulation bns
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::Simulation directory: /home/sb1554/simulations/bns
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::Simulation Properties:
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::[properties]
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::machine = slurmbns
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::simulationid = simulation-bns-slurmbns-amarel1.amarel.rutgers.edu-sb1554-2024.11.29-18.01.08-16276
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::sourcedir = /home/sb1554/BNS
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::configuration = sim
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::configid = config-sim-slepner088.amarel.rutgers.edu-cache-home-sb1554-BNS
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::buildid = build-sim-slepner088.amarel.rutgers.edu-sb1554-2024.11.15-02.32.38-2196
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::testsuite = False
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::executable = /home/sb1554/simulations/bns/SIMFACTORY/exe/cactus_sim
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::optionlist = /home/sb1554/simulations/bns/SIMFACTORY/cfg/OptionList
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::submitscript = /home/sb1554/simulations/bns/SIMFACTORY/run/SubmitScript
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::runscript = /home/sb1554/simulations/bns/SIMFACTORY/run/RunScript
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::parfile = /home/sb1554/simulations/bns/SIMFACTORY/par/bns.par
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::
[LOG:2024-11-29 18:01:08] restart.create(simulationName, parfile)::Simulation bns created
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::Creating new properties because this is an independant run, not a run following a submit
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::Determined the following properties
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::[properties]
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::machine = slurmbns
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::simulationid = simulation-bns-slurmbns-amarel1.amarel.rutgers.edu-sb1554-2024.11.29-18.01.08-16276
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::sourcedir = /home/sb1554/BNS
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::configuration = sim
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::configid = config-sim-slepner088.amarel.rutgers.edu-cache-home-sb1554-BNS
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::buildid = build-sim-slepner088.amarel.rutgers.edu-sb1554-2024.11.15-02.32.38-2196
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::testsuite = False
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::executable = /home/sb1554/simulations/bns/SIMFACTORY/exe/cactus_sim
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::optionlist = /home/sb1554/simulations/bns/SIMFACTORY/cfg/OptionList
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::submitscript = /home/sb1554/simulations/bns/SIMFACTORY/run/SubmitScript
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::runscript = /home/sb1554/simulations/bns/SIMFACTORY/run/RunScript
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::parfile = /home/sb1554/simulations/bns/SIMFACTORY/par/bns.par
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::nodes = 1
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::procsrequested = 8
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::ppn = 8
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::numprocs = 1
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::nodeprocs = 1
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::procs = 1
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::numthreads = 8
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::ppnused = 8
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::numsmt = 1
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::hostname = amarel1.amarel.rutgers.edu
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::user = sb1554
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::memory = 124000
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::cpufreq =
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::pbsSimulationName= bns-0000
[LOG:2024-11-29 18:01:56] restart.userRun(simulationName)::
[LOG:2024-11-29 18:01:56] self.makeActive()::Simulation bns with restart-id 0 has been made active
[LOG:2024-11-29 18:01:56] self.run(debug)::Prepping for execution/run
[LOG:2024-11-29 18:01:56] checkpointing = self.PrepareCheckpointing(recover_id)::PrepareCheckpointing: max_restart_id: -1
[LOG:2024-11-29 18:01:56] self.run(debug)::Defined substitution properties for execution/run
[LOG:2024-11-29 18:01:56] self.run(debug)::{'MACHINE': 'slurmbns', 'SOURCEDIR': '/home/sb1554/BNS', 'SIMULATION_NAME': 'bns', 'SHORT_SIMULATION_NAME': 'bns-0000', 'SIMULATION_ID': 'simulation-bns-slurmbns-amarel1.amarel.rutgers.edu-sb1554-2024.11.29-18.01.08-16276', 'RESTART_ID': 0, 'SCRIPTFILE': '/home/sb1554/simulations/bns/SIMFACTORY/run/SubmitScript', 'SUBMITSCRIPT': '/home/sb1554/simulations/bns/SIMFACTORY/run/SubmitScript', 'CONFIGURATION': 'sim', 'EXECUTABLE': '/home/sb1554/simulations/bns/SIMFACTORY/exe/cactus_sim', 'PARFILE': '/home/sb1554/simulations/bns/output-0000/bns.par', 'RUNDIR': '/home/sb1554/simulations/bns/output-0000', 'HOSTNAME': 'amarel1.amarel.rutgers.edu', 'USER': 'sb1554', 'ALLOCATION': 'NO_ALLOCATION', 'NODES': 1, 'PROCS_REQUESTED': 8, 'PPN': 8, 'NUM_PROCS': 1, 'NODE_PROCS': 1, 'PROCS': 1, 'NUM_THREADS': 8, 'PPN_USED': 8, 'NUM_SMT': 1, 'MEMORY': '124000', 'CPUFREQ': None, 'RUNDEBUG': 0}
[LOG:2024-11-29 18:01:56] self.run(debug)::Executing run command: /home/sb1554/simulations/bns/output-0000/SIMFACTORY/RunScript
[LOG:2024-11-29 18:05:43] restart.load(simulationName, active_id)::For simulation bns, loaded restart id 0, long restart id 0000
[LOG:2024-11-29 18:29:40] ret = restart.load(sim, restart_id)::For simulation bns, loaded restart id 0, long restart id 0000
[LOG:2024-11-29 18:29:40] ret = restart.load(sim, restart_id)::For simulation bns, loaded restart id 0, long restart id 0000

#! /bin/bash

export SIMFACTORY=/home/sb1554/Cactus/simafactory/bin
export SOURCE_DIR=/home/sb1554/Cactus
export CACTUS_PATH=/home/sb1554/BNS

#BATCH --partition=main               # Partition (job queue)
#SBATCH --requeue                     # Return job to the queue if preempted
#SBATCH --job-name=bnsnew             # Assign a short name to your job
#SBATCH --nodes=1                     # Number of nodes you require
#SBATCH --ntasks=1                    # Total # of tasks across all nodes
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1             # Cores per task (>1 if multithread tasks)
#SBATCH --mem=124000                  # Real memory (RAM) required (MB)
#SBATCH --time=70:00:00               # Total run time limit (HH:MM:SS)
#SBATCH --output=slurm.bns.%N.%j.out  # STDOUT output file
#SBATCH --error=slurm.bns.%N.%j.err   # STDERR output file (optional)

module use /projects/community/modulefiles
#module load gcc/10.2.0/openmpi/4.0.5-bz186
module load gcc/11.2/openmpi/4.1.3-kholodvl
module load libnl/3.2.25-sb1554
module load rdma-core/54.0-sb1554
module load gsl/2.5-bd387

cd /home/sb1554/BNS
/home/sb1554/BNS/simfactory/mdb/runscripts/slurmbns.run

[slurmbns]

# This machine description file is used internally by simfactory as a template
# during the sim setup and sim setup-silent commands
# Edit at your own risk

# Machine description
nickname = slurmbns
name = slurmbns
location = LSU
description = CCT
status = production

# Access to this machine
hostname = amarel1.amarel.rutgers.edu
aliaspattern = ^\w+(\.amarel\.rutgers\.edu)?$

# Source tree management
sourcebasedir = /home/sb1554
optionlist = generic.cfg
submitscript = slurmbns.sub
runscript = slurmbns.run
make = make -j@MAKEJOBS@
basedir = /home/sb1554/simulations
ppn = 8
max-num-threads = 128
num-threads = 8
memory = 124000
nodes = 2
num-smt = 1
#procs = 16
submit = sbatch /home/sb1554/BNS/simfactory/mdb/runscripts/slurmbns.run
getstatus = squeue -j @JOB_ID@
# need to kill the whole set of processes descending from @JOB_ID@, not just the
# (simfactory) top-level process
stop = scancel @JOB_ID@
submitpattern = 'Submitted batch job (\d+)'
statuspattern = '@JOB_ID@ '
queuedpattern = ' PD '
queue = checkpt
runningpattern = ' (CF|CG|R|TO) '
holdingpattern = '\(JobHeldUser\)'
[sb1554@amarel1 machines]$
exechostpattern = (.*)
stdout = cat @[email protected]
stderr = cat @[email protected]
stdout-follow = sleep 10 ; sattach @[email protected]
# stdout-follow = while ! scontrol >/dev/null wait_job @JOB_ID@ ; do sleep 5 ; done ; tail -n 100 -f @[email protected] @[email protected]
maxwalltime = 72:00:00
disabled-thorns = CactusUtils/SystemTopology

[slurmbns]

# This machine description file is used internally by simfactory as a template
# during the sim setup and sim setup-silent commands
# Edit at your own risk

# Machine description
nickname = slurmbns
name = slurmbns
location = LSU
description = CCT
status = production

# Access to this machine
hostname = amarel1.amarel.rutgers.edu
aliaspattern = ^\w+(\.amarel\.rutgers\.edu)?$

# Source tree management
sourcebasedir = /home/sb1554
optionlist = generic.cfg
submitscript = slurmbns.sub
runscript = slurmbns.run
make = make -j@MAKEJOBS@
basedir = /home/sb1554/simulations
ppn = 8
max-num-threads = 128
num-threads = 8
memory = 124000
nodes = 33
submit = sbatch /home/sb1554/BNS/simfactory/mdb/runscripts/slurmbns.run
getstatus = squeue -j @JOB_ID@
# need to kill the whole set of processes descending from @JOB_ID@, not just the
# (simfactory) top-level process
stop = scancel @JOB_ID@
submitpattern = 'Submitted batch job (\d+)'
statuspattern = '@JOB_ID@ '
queuedpattern = ' PD '
queue = checkpt
runningpattern = ' (CF|CG|R|TO) '
holdingpattern = '\(JobHeldUser\)'
exechost = hostname -s
exechostpattern = (.*)
stdout = cat @[email protected]
stderr = cat @[email protected]
stdout-follow = sleep 10 ; sattach @[email protected]
# stdout-follow = while ! scontrol >/dev/null wait_job @JOB_ID@ ; do sleep 5 ; done ; tail -n 100 -f @[email protected] @[email protected]
maxwalltime = 72:00:00
disabled-thorns = CactusUtils/SystemTopology

#! /bin/bash

echo "Preparing:"
set -x                          # Output commands
set -e                          # Abort on errors

cd /home/sb1554/

echo "Checking:"
pwd
hostname
date

echo "Environment:"
export CACTUS_PATH=/home/sb1554/BNS
export CACTUS_NUM_PROCS=2
export CACTUS_NUM_THREADS=8
export GMON_OUT_PREFIX=gmon.out
export OMP_NUM_THREADS=8
export OMP_PLACES=cores # TODO: maybe use threads when smt is used?
# https://github.com/open-mpi/ompi/issues/4948
export OMPI_MCA_btl_vader_single_copy_mechanism=none
env | sort > /home/sb1554/BNS/simfactory/ENVIRONMENT

echo "Starting:"
export CACTUS_STARTTIME=$(date +%s)
#time srun -n ${CACTUS_NUM_PROCS} @EXECUTABLE@ -L 3 /home/sb1554/BNS/bns.par
time /home/sb1554/BNS/simfactory/bin/sim run bns --parfile /home/sb1554/BNS/bns.par --machine slurmbns
#time srun @EXECUTABLE@ -L 3 /home/sb1554/BNS/bns.par
echo "Stopping:"
date

echo "Done."

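As a sanity check on the resource numbers in the scripts above, the cores the run script assumes can be compared with what the submit script requests from SLURM. This is only a sketch: the values are copied by hand from this post, and it assumes one core per OpenMP thread.

```shell
#!/bin/bash
# Values copied by hand from the scripts in this post.
cactus_num_procs=2      # CACTUS_NUM_PROCS in the run script
cactus_num_threads=8    # CACTUS_NUM_THREADS / OMP_NUM_THREADS in the run script
slurm_ntasks=1          # --ntasks in the submit script
slurm_cpus_per_task=1   # --cpus-per-task in the submit script

# Assumption: each MPI process uses one core per OpenMP thread.
cores_needed=$(( cactus_num_procs * cactus_num_threads ))
cores_requested=$(( slurm_ntasks * slurm_cpus_per_task ))

echo "cores assumed by the run script: $cores_needed"
echo "cores requested from SLURM:      $cores_requested"
if [ "$cores_needed" -gt "$cores_requested" ]; then
    echo "mismatch: SLURM is asked for fewer cores than the run script assumes"
fi
```

With the values above, the run script assumes 16 cores, which matches the commented-out `#procs = 16` in the machine file, while the submit script requests a single core.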
bns.sh
Description: Bourne shell script
