Re: [OMPI users] Floating point overflow and tuning
>> Do you know what StarCCM is doing when it hangs? I.e., is it in an MPI call?

I have set FI_LOG_LEVEL="debug" and below is the excerpt where it hangs on usdf_cq_readerr, right after the last usdf_am_insert_async. I am defining a hang as 5 minutes; it might hang for longer. With Intel MPI and the USNIC or TCP BTL there is no "hang" and it starts happily running the batch job almost immediately.

libfabric-cisco:usnic:domain:usdf_am_get_distance():219
libfabric-cisco:usnic:av:usdf_am_insert_async():317
libfabric-cisco:usnic:cq:usdf_cq_readerr():93
libfabric-cisco:usnic:cq:usdf_cq_readerr():93
libfabric-cisco:usnic:cq:usdf_cq_readerr():93
libfabric-cisco:usnic:cq:usdf_cq_readerr():93
(the readerr lines above generate rapidly forever...)

On the large-core runs it happens during the first stages of MPI init, and it never gets past "Starting STAR-CCM+ parallel server". It does not reach the CPU Affinity Report (I pass the -cpubind bandwidth,v flag to STAR). Perhaps this is lower level than MPI, possibly in libfabric-cisco, or, as you point out, in StarCCM itself.

Interestingly, with a small number of cores selected the job does complete; however, we still see these libfabric-cisco:usnic:cq:usdf_cq_readerr():93 errors above. I will try to run some other app through mpirun and see if I can replicate. I briefly used fi_pingpong and can't replicate the cq_readerr, though I did get plenty of other errors related to the provider.

-Logan

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
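For reference, a standalone check along these lines can isolate the usnic provider from StarCCM and Open MPI entirely (a sketch: fi_pingpong ships with libfabric's fabtests, and the hostname is a placeholder):

```shell
# Sketch of a standalone usnic check outside of StarCCM/Open MPI.
# fi_pingpong ships with libfabric's fabtests; "node01" is a placeholder
# for the server node's hostname.
export FI_LOG_LEVEL=debug    # same debug verbosity used for the StarCCM run
export FI_PROVIDER=usnic     # restrict libfabric to the usnic provider

# On the server node:
#   fi_pingpong -p usnic
# On the client node:
#   fi_pingpong -p usnic node01

echo "FI_LOG_LEVEL=$FI_LOG_LEVEL FI_PROVIDER=$FI_PROVIDER"
```

If the same usdf_cq_readerr flood appears here, the problem is below MPI; if not, suspicion shifts back toward Open MPI or StarCCM.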
Re: [OMPI users] Floating point overflow and tuning
On Sep 6, 2019, at 2:17 PM, Logan Stonebraker via users wrote:

> I am working with star ccm+ 2019.1.1 Build 14.02.012
> CentOS 7.6 kernel 3.10.0-957.21.3.el7.x86_64
> Intel MPI Version 2018 Update 5 Build 20190404 (this is the version shipped with star ccm+)
> Also trying to make openmpi work (more on that later)

Greetings Logan. I would definitely recommend Open MPI vs. DAPL/Intel MPI.

> Cisco UCS b200 and c240 cluster using USNIC fabric over 10gbe
> Intel(R) Xeon(R) CPU E5-2698
> 7 nodes
> 280 total cores
>
> enic RPM version kmod-enic-3.2.210.22-738.18.centos7u7.x86_64 installed
> usnic RPM kmod-usnic_verbs-3.2.158.15-738.18.rhel7u6.x86_64 installed
> enic modinfo version: 3.2.210.22
> enic loaded module version: 3.2.210.22
> usnic_verbs modinfo version: 3.2.158.15
> usnic_verbs loaded module version: 3.2.158.15
> libdaplusnic RPM version 2.0.39cisco3.2.112.8 installed
> libfabric RPM version 1.6.0cisco3.2.112.9.rhel7u6 installed
>
> On batch runs less than 5 hours, everything works flawlessly: the jobs complete without error, and it is quite fast with DAPL, especially compared to the TCP BTL.
>
> However, when running with n-1 (273 total cores), at or around 5 hours into a job, the longer jobs die with a star ccm floating point exception.
> The same job completes fine with no more than 210 cores (30 cores on each of 7 nodes). I would like to be able to use the 60 additional cores.
> I am using PBS Pro with a 99 hour wall time.
>
> Here is the overflow error.
> --
> Turbulent viscosity limited on 56 cells in Region
> A floating point exception has occurred: floating point exception [Overflow]. The specific cause cannot be identified. Please refer to the troubleshooting section of the User's Guide.
> Context: star.coupledflow.CoupledImplicitSolver
> Command: Automation.Run
> error: Server Error
> --
>
> I have not ruled out that I am missing some parameters or tuning with Intel MPI, as this is a new cluster.

That's odd.
That type of error is *usually* not the MPI's fault.

> I am also trying to make Open MPI work. I have openmpi compiled and it runs, and I can see it is using the usnic fabric; however, it only runs with a very small number of CPUs. Anything over about 2 cores per node hangs indefinitely, right after the job starts.

That's also quite odd; it shouldn't *hang*.

> I have compiled Open MPI 3.1.3 from https://www.open-mpi.org/ because this is what the Star CCM version I am running supports. I am telling star to use the Open MPI that I installed so it can support the Cisco USNIC fabric, which I can verify using Cisco native tools (star ships with openmpi btw, but I'm not using it).
>
> I am thinking that I need to tune Open MPI, which was also required with Intel MPI in order to run without an indefinite hang.
>
> With Intel MPI prior to tuning, jobs with more than about 100 cores would hang forever until I added these parameters:
>
> reference: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/542591
> reference: https://software.intel.com/en-us/articles/tuning-the-intel-mpi-library-advanced-techniques
>
> export I_MPI_DAPL_UD_SEND_BUFFER_NUM=8208
> export I_MPI_DAPL_UD_RECV_BUFFER_NUM=8208
> export I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE=8704
> export I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE=8704
> export I_MPI_DAPL_UD_RNDV_EP_NUM=2
> export I_MPI_DAPL_UD_REQ_EVD_SIZE=2000
> export I_MPI_DAPL_UD_MAX_MSG_SIZE=4096
> export I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=2147483647
>
> After adding these parms I can scale to 273 cores and it runs very fast, up until the point where it gets the floating point exception about 5 hours into the job.
>
> I am struggling to find equivalent tuning parms for Open MPI.

FWIW, you shouldn't need any tuning params -- it should "just work".
> I have listed all the MCA parameters available with Open MPI, and have tried setting these parms with no success. I may not have the equivalent parms listed here; this is what I have tried:
>
> btl_max_send_size = 4096
> btl_usnic_eager_limit = 2147483647
> btl_usnic_rndv_eager_limit = 2147483647
> btl_usnic_sd_num = 8208
> btl_usnic_rd_num = 8208
> btl_usnic_prio_sd_num = 8704
> btl_usnic_prio_rd_num = 8704
> btl_usnic_pack_lazy_threshold = -1

All those look reasonable.

Do you know what StarCCM is doing when it hangs? I.e., is it in an MPI call?

--
Jeff Squyres
jsquy...@cisco.com
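One way to answer that question directly is to attach a debugger to one of the hung ranks and dump its backtraces. A sketch (the PID and the pgrep pattern are placeholders; adjust to however the solver process is actually named):

```shell
# Sketch: attach to one hung rank and dump all thread backtraces.
# The PID and process-name pattern below are placeholders.
#   pid=$(pgrep -n -f starccm)
#   gdb -p "$pid" -batch -ex 'thread apply all bt'
# Frames in MPI_* or opal_progress suggest the rank is spinning inside
# Open MPI; usdf_* frames point into the usnic libfabric provider.
echo "gdb -p <pid> -batch -ex 'thread apply all bt'"
```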
[OMPI users] Floating point overflow and tuning
I am working with star ccm+ 2019.1.1 Build 14.02.012
CentOS 7.6 kernel 3.10.0-957.21.3.el7.x86_64
Intel MPI Version 2018 Update 5 Build 20190404 (this is the version shipped with star ccm+)
Also trying to make openmpi work (more on that later)

Cisco UCS b200 and c240 cluster using USNIC fabric over 10gbe
Intel(R) Xeon(R) CPU E5-2698
7 nodes
280 total cores

enic RPM version kmod-enic-3.2.210.22-738.18.centos7u7.x86_64 installed
usnic RPM kmod-usnic_verbs-3.2.158.15-738.18.rhel7u6.x86_64 installed
enic modinfo version: 3.2.210.22
enic loaded module version: 3.2.210.22
usnic_verbs modinfo version: 3.2.158.15
usnic_verbs loaded module version: 3.2.158.15
libdaplusnic RPM version 2.0.39cisco3.2.112.8 installed
libfabric RPM version 1.6.0cisco3.2.112.9.rhel7u6 installed

On batch runs less than 5 hours, everything works flawlessly: the jobs complete without error, and it is quite fast with DAPL, especially compared to the TCP BTL.

However, when running with n-1 (273 total cores), at or around 5 hours into a job, the longer jobs die with a star ccm floating point exception. The same job completes fine with no more than 210 cores (30 cores on each of 7 nodes). I would like to be able to use the 60 additional cores. I am using PBS Pro with a 99 hour wall time.

Here is the overflow error.
--
Turbulent viscosity limited on 56 cells in Region
A floating point exception has occurred: floating point exception [Overflow]. The specific cause cannot be identified. Please refer to the troubleshooting section of the User's Guide.
Context: star.coupledflow.CoupledImplicitSolver
Command: Automation.Run
error: Server Error
--

I have not ruled out that I am missing some parameters or tuning with Intel MPI, as this is a new cluster. I am also trying to make Open MPI work. I have openmpi compiled and it runs, and I can see it is using the usnic fabric; however, it only runs with a very small number of CPUs. Anything over about 2 cores per node hangs indefinitely, right after the job starts.
I have compiled Open MPI 3.1.3 from https://www.open-mpi.org/ because this is what the Star CCM version I am running supports. I am telling star to use the Open MPI that I installed so it can support the Cisco USNIC fabric, which I can verify using Cisco native tools (star ships with openmpi btw, but I'm not using it).

I am thinking that I need to tune Open MPI, which was also required with Intel MPI in order to run without an indefinite hang.

With Intel MPI prior to tuning, jobs with more than about 100 cores would hang forever until I added these parameters:

reference: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology/topic/542591
reference: https://software.intel.com/en-us/articles/tuning-the-intel-mpi-library-advanced-techniques

export I_MPI_DAPL_UD_SEND_BUFFER_NUM=8208
export I_MPI_DAPL_UD_RECV_BUFFER_NUM=8208
export I_MPI_DAPL_UD_ACK_SEND_POOL_SIZE=8704
export I_MPI_DAPL_UD_ACK_RECV_POOL_SIZE=8704
export I_MPI_DAPL_UD_RNDV_EP_NUM=2
export I_MPI_DAPL_UD_REQ_EVD_SIZE=2000
export I_MPI_DAPL_UD_MAX_MSG_SIZE=4096
export I_MPI_DAPL_UD_DIRECT_COPY_THRESHOLD=2147483647

After adding these parms I can scale to 273 cores and it runs very fast, up until the point where it gets the floating point exception about 5 hours into the job.

I am struggling to find equivalent tuning parms for Open MPI. I have listed all the MCA parameters available with Open MPI, and have tried setting these parms with no success. I may not have the equivalent parms listed here; this is what I have tried:

btl_max_send_size = 4096
btl_usnic_eager_limit = 2147483647
btl_usnic_rndv_eager_limit = 2147483647
btl_usnic_sd_num = 8208
btl_usnic_rd_num = 8208
btl_usnic_prio_sd_num = 8704
btl_usnic_prio_rd_num = 8704
btl_usnic_pack_lazy_threshold = -1

Does anyone have any advice or ideas for:
1.) The floating point overflow issue, and
2.) Equivalent tuning parms for Open MPI?

Many thanks in advance!
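For completeness, the way I am setting those trial values is via Open MPI's per-user MCA parameter file rather than per-job command-line flags. A sketch (the file normally lives at $HOME/.openmpi/mca-params.conf; a temp path is used here purely for illustration, and the values are the ones I tried above, not recommendations):

```shell
# Sketch: write the trial MCA values to an Open MPI per-user parameter file.
# Normally this is $HOME/.openmpi/mca-params.conf; a temp path is used here.
conf="${TMPDIR:-/tmp}/mca-params.conf"
cat > "$conf" <<'EOF'
btl_max_send_size = 4096
btl_usnic_eager_limit = 2147483647
btl_usnic_rndv_eager_limit = 2147483647
btl_usnic_sd_num = 8208
btl_usnic_rd_num = 8208
btl_usnic_prio_sd_num = 8704
btl_usnic_prio_rd_num = 8704
btl_usnic_pack_lazy_threshold = -1
EOF
grep -c '^btl' "$conf"    # counts the 8 parameters written
```

The same values could equally be passed per job, e.g. `mpirun --mca btl_usnic_sd_num 8208 ...`.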
-Logan