Re: [OMPI users] Segmentation fault when using 31 or 32 ranks

2019-07-10 Thread Raymond Arter via users

Jeff and Steven,

Thanks for your help.

I downloaded the nightly snapshot and it fixes the problem. I need to
do more testing tomorrow and I will report back if any issues arise.

Thanks again.

T.


On 10/07/2019 18:44, Jeff Squyres (jsquyres) via users wrote:

It might be worth trying the latest v4.0.x nightly snapshot -- we just updated 
the internal PMIx on the v4.0.x branch:

 https://www.open-mpi.org/nightly/v4.0.x/





Re: [OMPI users] Segmentation fault when using 31 or 32 ranks

2019-07-10 Thread Jeff Squyres (jsquyres) via users
It might be worth trying the latest v4.0.x nightly snapshot -- we just updated 
the internal PMIx on the v4.0.x branch:

https://www.open-mpi.org/nightly/v4.0.x/
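
For anyone trying this, a minimal test sequence might look like the
following sketch (the tarball name is a placeholder for the current
snapshot, the install prefix is arbitrary, and any site-specific
configure flags are omitted):

    tar xf openmpi-v4.0.x-<timestamp>.tar.bz2   # snapshot from the URL above
    cd openmpi-v4.0.x-<timestamp>
    ./configure --prefix=$HOME/ompi-nightly
    make -j16 install
    export PATH=$HOME/ompi-nightly/bin:$PATH    # pick up the new mpicc/mpirun
    mpicc -o mpitest mpitest.c                  # rebuild the test program
    mpirun -np 31 ./mpitest                     # the rank count that crashed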




-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Segmentation fault when using 31 or 32 ranks

2019-07-10 Thread Steven Varga via users
Hi, I'm fighting something similar. Did you try updating PMIx to the most
recent 3.1.3 series release?
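
Concretely, that would mean building the standalone PMIx 3.1.3 release
tarball and pointing Open MPI at it at configure time, roughly like this
(paths are hypothetical, and an external PMIx generally also requires
building against an external libevent):

    # in the pmix-3.1.3 source tree
    ./configure --prefix=/opt/pmix/3.1.3 && make install

    # in the openmpi-4.0.1 source tree
    ./configure --with-pmix=/opt/pmix/3.1.3 --with-libevent=/usr

As a quicker diagnostic (not a fix), it may also be possible to steer
PMIx away from the gds/ds21 shared-memory component that appears at the
top of the trace by setting a PMIx MCA parameter in the environment
before launching:

    export PMIX_MCA_gds=hash    # assumes PMIx honors the PMIX_MCA_ env prefix
    mpirun -np 32 ./mpitest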


[OMPI users] Segmentation fault when using 31 or 32 ranks

2019-07-10 Thread Raymond Arter via users
Hi,

I have the following issue with Open MPI 4.0.1 when running on a node with
two 16-core CPUs (Intel Xeon Gold 6142) installed. Running with 30 ranks or
fewer is fine, and running with 33 or more gives the expected "not enough
slots" message.

However, using 31 or 32 ranks results in the following error:

[nodek19:391429] *** Process received signal ***
[nodek19:391429] Signal: Segmentation fault (11)
[nodek19:391429] Signal code: Address not mapped (1)
[nodek19:391429] Failing at address: 0x7fa34954d008
[nodek19:391429] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x7fa348dfc5d0]
[nodek19:391429] [ 1]
/opt/apps/libs/openmpi/4.0.1/gcc/testing/lib/pmix/mca_gds_ds21.so(pmix_gds_ds21_lock_init+0x11a)[0x7fa345ded16a]
[nodek19:391429] [ 2]
/opt/apps/libs/openmpi/4.0.1/gcc/testing/lib/libmca_common_dstore.so.1(pmix_common_dstor_init+0x833)[0x7fa3493c8df3]
[nodek19:391429] [ 3]
/opt/apps/libs/openmpi/4.0.1/gcc/testing/lib/pmix/mca_gds_ds21.so(+0x1e14)[0x7fa345dece14]
[nodek19:391429] [ 4]
/opt/apps/libs/openmpi/4.0.1/gcc/testing/lib/libpmix.so.2(OPAL_MCA_PMIX3X_pmix_gds_base_select+0x108)[0x7fa345b73fe8]
[nodek19:391429] [ 5]
/opt/apps/libs/openmpi/4.0.1/gcc/testing/lib/libpmix.so.2(OPAL_MCA_PMIX3X_pmix_rte_init+0x7c3)[0x7fa345b30f83]
[nodek19:391429] [ 6]
/opt/apps/libs/openmpi/4.0.1/gcc/testing/lib/libpmix.so.2(OPAL_MCA_PMIX3X_PMIx_Init+0x168)[0x7fa345aefd08]
[nodek19:391429] [ 7]
/opt/apps/libs/openmpi/4.0.1/gcc/testing/lib/openmpi/mca_pmix_pmix3x.so(pmix3x_client_init+0xbb)[0x7fa345bc4fdb]
[nodek19:391429] [ 8]
/opt/apps/libs/openmpi/4.0.1/gcc/testing/lib/openmpi/mca_ess_pmi.so(+0x1ad6)[0x7fa3467f2ad6]
[nodek19:391429] [ 9]
/opt/apps/libs/openmpi/4.0.1/gcc/testing/lib/libopen-rte.so.40(orte_init+0x291)[0x7fa348780b21]
[nodek19:391429] [10]
/opt/apps/libs/openmpi/4.0.1/gcc/testing/lib/libmpi.so.40(ompi_mpi_init+0x264)[0x7fa349058a24]
[nodek19:391429] [11]
/opt/apps/libs/openmpi/4.0.1/gcc/testing/lib/libmpi.so.40(MPI_Init+0x99)[0x7fa349088b89]
[nodek19:391429] [12] mpitest[0x4007fe]
[nodek19:391429] [13]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7fa348a423d5]
[nodek19:391429] [14] mpitest[0x400729]
[nodek19:391429] *** End of error message ***
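
The mpitest source isn't shown in the thread, but since the crash happens
inside MPI_Init itself (frame 11 above), a minimal program along these
lines should be enough to reproduce it (a sketch, not the actual test
program):

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal reproducer sketch: the reported segfault occurs during
     * MPI_Init, so simply initializing and finalizing suffices. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("rank %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }

Built and run as, e.g., "mpicc -o mpitest mpitest.c" followed by
"mpirun -np 31 ./mpitest".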


Furthermore, when using computers with two Intel Xeon Gold 6132 (14-core)
or 6126 (12-core) CPUs, the issue doesn't occur: I'm able to use all the
cores, 28 and 24 respectively. Version 3.1.4 works across all three
computers without issue.

Any comments would be appreciated.

Regards,

T.