I am working with STAR-CCM+ 2019.1.1 Build 14.02.012.
CentOS 7.6, kernel 3.10.0-957.21.3.el7.x86_64
Intel MPI Version 2018 Update 5 Build 20190404 (this is the version shipped with STAR-CCM+).
I am also trying to make Open MPI work (more on that later).
Cisco UCS B200 and C240 cluster using the usNIC fabric over 10 GbE
Intel(R) Xeon(R) CPU E5-2698
7 nodes, 280 total cores
enic RPM version kmod-enic-22.214.171.124-738.18.centos7u7.x86_64 installed
usnic RPM kmod-usnic_verbs-126.96.36.199-738.18.rhel7u6.x86_64 installed
enic modinfo version: 188.8.131.52
enic loaded module version: 184.108.40.206
usnic_verbs modinfo version: 220.127.116.11
usnic_verbs loaded module version: 18.104.22.168
libdaplusnic RPM version 2.0.39cisco22.214.171.124 installed
libfabric RPM version
On batch runs of less than 5 hours everything works flawlessly: the jobs complete without error, and they run quite fast with DAPL, especially compared to the TCP BTL.
However, when running with n-1 cores per node (273 total cores), the longer jobs die at or around 5 hours into the run with a STAR-CCM+ floating point exception. The same job completes fine with no more than 210 cores (30 cores on each of the 7 nodes), but I would like to be able to use the 60 additional cores. I am using PBS Pro with a 99 hour walltime.
Here is the overflow error:

    ------------------
    Turbulent viscosity limited on 56 cells in Region
    A floating point exception has occurred: floating point exception
    [Overflow]. The specific cause cannot be identified. Please refer
    to the troubleshooting section of the User's Guide.
    Context: star.coupledflow.CoupledImplicitSolver
    Command: Automation.Run
    error: Server
I have not ruled out that I am missing some parameters or tuning with Intel MPI
as this is a new cluster.
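For what it's worth, the sort of thing I mean by "parameters or tuning" is fabric selection and debug output along these lines. The environment variables below are standard Intel MPI 2018 settings, but the DAPL provider name is only a placeholder and would have to match the usNIC entry in /etc/dat.conf on this cluster:

    # Sketch only: select the shm/DAPL fabric and raise the debug level so
    # the chosen fabric/provider is printed in the job output.
    export I_MPI_FABRICS=shm:dapl
    export I_MPI_DEBUG=5
    # Placeholder provider name; use the actual usNIC entry from /etc/dat.conf
    export I_MPI_DAPL_PROVIDER=ofa-v2-usnic-udp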
I am also trying to make Open MPI work. I have Open MPI compiled and it runs, and I can see it is using the usNIC fabric; however, it only runs with a very small number of CPUs. With anything over about 2 cores per node it hangs indefinitely, right after the job starts.
I have compiled Open MPI 3.1.3 from [url]https://www.open-mpi.org/[/url] because that is the version my STAR-CCM+ release supports. I am telling STAR-CCM+ to use the Open MPI that I installed so it can support the Cisco usNIC fabric, which I can verify using Cisco native tools (STAR-CCM+ ships with its own Open MPI, by the way, but I am not using it).
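For context, the launch looks roughly like the sketch below. The -mpi and -mppflags spellings are from memory and may differ in 14.02 (starccm+ -help shows the exact options for a given release), and the install path, simulation file name, and BTL list are just examples:

    # Rough sketch of pointing STAR-CCM+ at the external Open MPI build
    # and passing extra arguments through to mpirun.
    export PATH=/opt/openmpi-3.1.3/bin:$PATH
    export LD_LIBRARY_PATH=/opt/openmpi-3.1.3/lib:$LD_LIBRARY_PATH
    starccm+ -batch run.sim -np 273 \
             -mpi openmpi \
             -mppflags "--mca btl usnic,self,vader"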
I am thinking that I need to tune Open MPI, which was also required with Intel MPI in order to run without an indefinite hang.
With Intel MPI, prior to tuning, jobs with more than about 100 cores would hang forever until I added these parameters:
After adding those parameters I can scale to 273 cores and it runs very fast, right up until the point where it gets the floating point exception about 5 hours into the job.
I am struggling to find equivalent tuning parameters for Open MPI.
I have listed all of the MCA parameters available in Open MPI and have tried setting the following, with no success. I may not have found the equivalent parameters, but this is what I have tried:
    btl_max_send_size = 4096
    btl_usnic_eager_limit = 2147483647
    btl_usnic_rndv_eager_limit = 2147483647
    btl_usnic_sd_num = 8208
    btl_usnic_rd_num = 8208
    btl_usnic_prio_sd_num = 8704
    btl_usnic_prio_rd_num = 8704
    btl_usnic_pack_lazy_threshold = -1
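For completeness, I am setting these through the standard Open MPI mechanisms, i.e. either the MCA parameter file or --mca on the mpirun command line; the snippet below is just an illustration of those two forms with a couple of the values from the list above:

    # 1) In an MCA parameter file, e.g. $HOME/.openmpi/mca-params.conf
    #    or <openmpi-prefix>/etc/openmpi-mca-params.conf:
    #
    #       btl_usnic_sd_num = 8208
    #       btl_usnic_rd_num = 8208
    #
    # 2) Directly on the mpirun command line:
    mpirun --mca btl_usnic_sd_num 8208 --mca btl_usnic_rd_num 8208 ...

    # To list the usnic BTL parameters and their current values:
    ompi_info --param btl usnic --level 9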
Does anyone have any advice or ideas for:
1.) the floating point overflow issue, and
2.) equivalent tuning parameters for Open MPI?
Many thanks in advance!