Dear users,

Hello, I'm relatively new to building Open MPI from scratch, so I'll try to describe exactly what I did. I'm attempting to run the MHD code Flash 4.2.2 on Pleiades (NASA Ames), and I also need mpi4py and CUDA functionality, which ruled out using the pre-installed MPI implementations. My code has been tested and works under a previous build of Open MPI 1.10.2 on a local cluster at Drexel University that has no job manager and uses a simple InfiniBand setup. Pleiades is a bit more complicated, but I've been following the NASA folks' setup commands, and looking at my job logs from their side they claim that nothing seems wrong communications-wise.
However, when I run just a vanilla version of Flash 4.2.2, it runs for several steps and then crashes. Here's the last part of the Flash run output:

 *** Wrote particle file to BB_hdf5_part_0008 ****
 17 1.5956E+11 5.4476E+09 (-5.031E+16, 1.969E+16, -2.188E+15) | 5.448E+09
 *** Wrote plotfile to BB_hdf5_plt_cnt_0009 ****
 WARNING: globalNumParticles = 0!!!
 iteration, no. not moved = 0 69
 iteration, no. not moved = 1 29
 iteration, no. not moved = 2 0
 refined: total leaf blocks = 120
 refined: total blocks = 137
 18 1.7046E+11 5.3814E+09 (-2.516E+16, 2.734E+16, -1.094E+15) | 5.381E+09
 WARNING: globalNumParticles = 0!!!
 *** Wrote particle file to BB_hdf5_part_0009 ****
 19 1.8122E+11 2.9425E+09 (-2.078E+16, -2.516E+16, -3.391E+16) | 2.943E+09
 *** Wrote plotfile to BB_hdf5_plt_cnt_0010 ****
 WARNING: globalNumParticles = 0!!!
 iteration, no. not moved = 0 128
 iteration, no. not moved = 1 25
 iteration, no. not moved = 2 0
 refined: total leaf blocks = 456
 refined: total blocks = 521
 Paramesh error : pe 65 needed full blk 1 57
 but could not find it or only found part of it
 in the message buffer. Contact PARAMESH developers for help.
 --------------------------------------------------------------------------
 MPI_ABORT was invoked on rank 65 in communicator MPI COMMUNICATOR 3 SPLIT FROM 0
 with errorcode 0.

 NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
 You may or may not see output from other processes, depending on
 exactly when Open MPI kills them.
 --------------------------------------------------------------------------
 Paramesh error : pe 80 needed full blk 1 72
 but could not find it or only found part of it
 in the message buffer. Contact PARAMESH developers for help.
You can see the entire output at:
https://drive.google.com/file/d/0B7Zx9zNTB3icQWZPTUlhZFQtcWs/view?usp=sharing

Okay, so I built it with (as instructed by NASA HECC):

./configure --with-tm=/PBS --with-verbs=/usr \
    --enable-mca-no-build=maffinity-libnuma --with-cuda=/nasa/cuda/7.0 \
    --enable-mpi-interface-warning --without-slurm --without-loadleveler \
    --enable-mpirun-prefix-by-default --enable-btl-openib-failover \
    --prefix=/u/jewall/ompi-1.10.2

And if I run ompi_info on 96 cores (the same number I ran the job on), I get the following output:
https://drive.google.com/file/d/0B7Zx9zNTB3icSHNZaEpZZkhPcXc/view?usp=sharing

And the job was run with the following script:

#PBS -S /bin/bash
#PBS -N cfd
#PBS -q debug
#PBS -l select=8:ncpus=12:model=has
#PBS -l walltime=0:30:00
#PBS -j oe
#PBS -W group_list=g23107
#PBS -m e

# Load a compiler you use to build your executable, for example, comp-intel/2015.0.090.
#source /usr/local/lib/global.profile
module load git/2.4.5
module load szip/2.1/gcc
module load cuda/7.0
module load gcc/4.9.3
module load cmake/2.8.12.1
module load python/2.7.10

# Add your commands here to extend your PATH, etc.
export MPIHOME=/u/jewall/ompi-1.10.2
export MPICC=${MPIHOME}/bin/mpicc
export MPIFC=${MPIHOME}/bin/mpif90
export MPICXX=${MPIHOME}/bin/mpic++
export MPIEXEC=${MPIHOME}/bin/mpiexec
export HDF5=/u/jewall/hdf5

# bash syntax (my original line used csh-style "setenv", which fails under #PBS -S /bin/bash):
export OMPI_MCA_btl_openib_if_include=mlx4_0:1

PATH=$PATH:${PYTHONPATH}:$HOME/bin # Add private commands to PATH

# By default, PBS executes your job from your home directory.
# However, you can use the environment variable
# PBS_O_WORKDIR to change to the directory where
# you submitted your job.
cd $PBS_O_WORKDIR

echo ${PBS_NODEFILE}
cat ${PBS_NODEFILE} | awk '{print $1}' > "local_host.txt"
cat local_host.txt

# use of dplace to pin processes to processors may improve performance
# Here you request to pin processes to processors 4-11 of each Sandy Bridge node.
# For other processor types, you may have to pin to different processors.
# The resource request of select=32 and mpiprocs=8 implies
# that you want to have 256 MPI processes in total.
# If this is correct, you can omit the -np 256 for mpiexec
# that you might have used before.
${MPIEXEC} --mca mpi_warn_on_fork 0 --mca mpi_cuda_support 0 \
    --mca btl self,sm,openib --mca oob_tcp_if_include ib0 \
    -hostfile local_host.txt ./flash4
# --mca oob_tcp_if_include ib0 was suggested in an Open MPI forum for running on Pleiades

# It is good practice to write stderr and stdout to a file (ex: output).
# Otherwise, they will be written to the PBS stderr and stdout in /PBS/spool,
# which has a limited amount of space. When /PBS/spool fills up, any job
# that tries to write to /PBS/spool will die.

# -end of script-

Hopefully this is enough information for someone to find an error in how I did things. I also have the outputs of make, make test, and make install if anyone would like to see those. :)

Thanks for the help!

Cordially,

Joshua Wall
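For reference, here is a small, self-contained sketch of how the hostfile step in the script determines the rank count. It fabricates a stand-in nodefile (in a real job, $PBS_NODEFILE already exists, with one line per requested slot, so select=8:ncpus=12 yields 96 lines); the node and file names here are illustrative only:

```shell
# Illustration only: fabricate a PBS-style nodefile standing in for
# $PBS_NODEFILE. PBS writes one line per requested slot, so two nodes
# with three slots each give six lines.
printf 'node1\nnode1\nnode1\nnode2\nnode2\nnode2\n' > nodefile.txt

# Same extraction step as in the job script above:
awk '{print $1}' nodefile.txt > local_host.txt

# With no -np given, mpiexec launches one rank per hostfile line:
NRANKS=$(wc -l < local_host.txt)
NNODES=$(sort -u local_host.txt | wc -l)
echo "hostfile implies $NRANKS ranks across $NNODES nodes"
```

With the real 96-line nodefile from select=8:ncpus=12, the same logic would imply 96 ranks on 8 nodes, matching the allocation.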
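One more note on the setenv line in the script: since Open MPI reads any MCA parameter from an OMPI_MCA_<param> environment variable, the bash form of that line (and env-variable equivalents of the --mca flags passed to mpiexec) would look like the sketch below. Parameter values are the ones from my script, shown here only to illustrate the bash syntax:

```shell
# bash equivalents of csh "setenv NAME value"; each OMPI_MCA_<param>
# variable is read by Open MPI at startup, with the same effect as
# passing "--mca <param> <value>" on the mpiexec command line:
export OMPI_MCA_btl_openib_if_include=mlx4_0:1
export OMPI_MCA_btl=self,sm,openib
export OMPI_MCA_oob_tcp_if_include=ib0

# Confirm the variables are set in the job environment:
env | grep '^OMPI_MCA_' | sort
```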