Re: [OMPI devel] v2.1.5rc1 is out

2018-08-17 Thread Vallee, Geoffroy R.
I would assume so as well. The 2.x series is not really critical for these 
systems, especially since 3.x does not have the problem, so I have no 
objection to ignoring it.


> On Aug 17, 2018, at 3:48 PM, Jeff Squyres (jsquyres) via devel 
>  wrote:
> 
> Thanks for the testing.
> 
> I'm assuming the MXM failure has been around for a while, and the correct way 
> to fix it is to upgrade to a newer Open MPI and/or use UCX.
> 
> 
>> On Aug 17, 2018, at 11:01 AM, Vallee, Geoffroy R.  wrote:
>> 
>> FYI, that segfault problem did not occur when I tested 3.1.2rc1.
>> 
>> Thanks,
>> 
>>> On Aug 17, 2018, at 10:28 AM, Pavel Shamis  wrote:
>>> 
>>> It looks to me like an MXM-related failure?
>>> 
>>> On Thu, Aug 16, 2018 at 1:51 PM Vallee, Geoffroy R.  
>>> wrote:
>>> Hi,
>>> 
>>> I ran some tests on Summitdev here at ORNL:
>>> - the UCX problem is solved and I get the expected results for the tests 
>>> that I am running (netpipe and IMB).
>>> - without UCX:
>>>   * the performance numbers are below what would be expected but I 
>>> believe at this point that the slight performance deficiency is due to 
>>> other users using other parts of the system. 
>>>   * I also encountered the following problem while running IMB_EXT and 
>>> I now realize that I had the same problem with 2.4.1rc1 but did not catch 
>>> it at the time:
>>> [summitdev-login1:112517:0] Caught signal 11 (Segmentation fault)
>>> [summitdev-r0c2n13:91094:0] Caught signal 11 (Segmentation fault)
>>>  backtrace 
>>> 2 0x00073864 mxm_handle_error()  
>>> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>>> 3 0x00073fa4 mxm_error_signal_handler()  
>>> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>>> 4 0x00017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
>>> 5 0x000d4634 ompi_osc_base_select()  ??:0
>>> 6 0x00065e84 ompi_win_create()  ??:0
>>> 7 0x000a2488 PMPI_Win_create()  ??:0
>>> 8 0x1000b28c IMB_window()  ??:0
>>> 9 0x10005764 IMB_init_buffers_iter()  ??:0
>>> 10 0x10001ef8 main()  ??:0
>>> 11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
>>> 12 0x00024b74 __libc_start_main()  ??:0
>>> ===
>>>  backtrace 
>>> 2 0x00073864 mxm_handle_error()  
>>> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>>> 3 0x00073fa4 mxm_error_signal_handler()  
>>> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>>> 4 0x00017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
>>> 5 0x000d4634 ompi_osc_base_select()  ??:0
>>> 6 0x00065e84 ompi_win_create()  ??:0
>>> 7 0x000a2488 PMPI_Win_create()  ??:0
>>> 8 0x1000b28c IMB_window()  ??:0
>>> 9 0x10005764 IMB_init_buffers_iter()  ??:0
>>> 10 0x10001ef8 main()  ??:0
>>> 11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
>>> 12 0x00024b74 __libc_start_main()  ??:0
>>> ===
>>> 
>>> FYI, the 2.x series is not important to me so it can stay as is. I will 
>>> move on testing 3.1.2rc1.
>>> 
>>> Thanks,
>>> 
>>> 
 On Aug 15, 2018, at 6:07 PM, Jeff Squyres (jsquyres) via devel 
  wrote:
 
 Per our discussion over the weekend and on the weekly webex yesterday, 
 we're releasing v2.1.5.  There are only two changes:
 
 1. A trivial link issue for UCX.
 2. A fix for the vader BTL issue.  This is how I described it in NEWS:
 
 - A subtle race condition bug was discovered in the "vader" BTL
 (shared memory communications) that, in rare instances, can cause
 MPI processes to crash or incorrectly classify (or effectively drop)
 an MPI message sent via shared memory.  If you are using the "ob1"
 PML with "vader" for shared memory communication (note that vader is
 the default for shared memory communication with ob1), you need to
 upgrade to v2.1.5 to fix this issue.  You may also upgrade to the
 following versions to fix this issue:
 - Open MPI v3.0.1 (released March, 2018) or later in the v3.0.x
  series
 - Open MPI v3.1.2 (expected end of August, 2018) or later
 
 This vader fix was deemed serious enough to warrant a 2.1.5 release.  
 This really will be the end of the 2.1.x series.  Trust me; my name is Joe 
 Isuzu.
 
 2.1.5rc1 will be available from the usual location in a few minutes (the 
 website will update in about 7 minutes):
 
  https://www.open-mpi.org/software/ompi/v2.1/
 
 -- 
 Jeff Squyres
 jsquy...@cisco.com
 

Re: [OMPI devel] v2.1.5rc1 is out

2018-08-17 Thread Jeff Squyres (jsquyres) via devel
Thanks for the testing.

I'm assuming the MXM failure has been around for a while, and the correct way 
to fix it is to upgrade to a newer Open MPI and/or use UCX.
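
For anyone who wants to double-check which Open MPI a binary actually picks up 
at run time before/after upgrading, here is a minimal sketch using the standard 
MPI-3 library-version query (nothing Open MPI-specific, and not part of the fix 
itself):

#include <mpi.h>
#include <stdio.h>

/* Prints the MPI library version string (for Open MPI this includes the
 * release number), so you can confirm which build a job really uses.
 * Standard MPI-3 call; nothing here is specific to this issue. */
int main(int argc, char **argv)
{
    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len;

    MPI_Init(&argc, &argv);
    MPI_Get_library_version(version, &len);
    printf("%s\n", version);
    MPI_Finalize();
    return 0;
}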


> On Aug 17, 2018, at 11:01 AM, Vallee, Geoffroy R.  wrote:
> 
> FYI, that segfault problem did not occur when I tested 3.1.2rc1.
> 
> Thanks,
> 
>> On Aug 17, 2018, at 10:28 AM, Pavel Shamis  wrote:
>> 
>> It looks to me like an MXM-related failure?
>> 
>> On Thu, Aug 16, 2018 at 1:51 PM Vallee, Geoffroy R.  
>> wrote:
>> Hi,
>> 
>> I ran some tests on Summitdev here at ORNL:
>> - the UCX problem is solved and I get the expected results for the tests 
>> that I am running (netpipe and IMB).
>> - without UCX:
>>* the performance numbers are below what would be expected but I 
>> believe at this point that the slight performance deficiency is due to other 
>> users using other parts of the system. 
>>* I also encountered the following problem while running IMB_EXT and 
>> I now realize that I had the same problem with 2.4.1rc1 but did not catch it 
>> at the time:
>> [summitdev-login1:112517:0] Caught signal 11 (Segmentation fault)
>> [summitdev-r0c2n13:91094:0] Caught signal 11 (Segmentation fault)
>>  backtrace 
>> 2 0x00073864 mxm_handle_error()  
>> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>> 3 0x00073fa4 mxm_error_signal_handler()  
>> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>> 4 0x00017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
>> 5 0x000d4634 ompi_osc_base_select()  ??:0
>> 6 0x00065e84 ompi_win_create()  ??:0
>> 7 0x000a2488 PMPI_Win_create()  ??:0
>> 8 0x1000b28c IMB_window()  ??:0
>> 9 0x10005764 IMB_init_buffers_iter()  ??:0
>> 10 0x10001ef8 main()  ??:0
>> 11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
>> 12 0x00024b74 __libc_start_main()  ??:0
>> ===
>>  backtrace 
>> 2 0x00073864 mxm_handle_error()  
>> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>> 3 0x00073fa4 mxm_error_signal_handler()  
>> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>> 4 0x00017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
>> 5 0x000d4634 ompi_osc_base_select()  ??:0
>> 6 0x00065e84 ompi_win_create()  ??:0
>> 7 0x000a2488 PMPI_Win_create()  ??:0
>> 8 0x1000b28c IMB_window()  ??:0
>> 9 0x10005764 IMB_init_buffers_iter()  ??:0
>> 10 0x10001ef8 main()  ??:0
>> 11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
>> 12 0x00024b74 __libc_start_main()  ??:0
>> ===
>> 
>> FYI, the 2.x series is not important to me so it can stay as is. I will move 
>> on testing 3.1.2rc1.
>> 
>> Thanks,
>> 
>> 
>>> On Aug 15, 2018, at 6:07 PM, Jeff Squyres (jsquyres) via devel 
>>>  wrote:
>>> 
>>> Per our discussion over the weekend and on the weekly webex yesterday, 
>>> we're releasing v2.1.5.  There are only two changes:
>>> 
>>> 1. A trivial link issue for UCX.
>>> 2. A fix for the vader BTL issue.  This is how I described it in NEWS:
>>> 
>>> - A subtle race condition bug was discovered in the "vader" BTL
>>> (shared memory communications) that, in rare instances, can cause
>>> MPI processes to crash or incorrectly classify (or effectively drop)
>>> an MPI message sent via shared memory.  If you are using the "ob1"
>>> PML with "vader" for shared memory communication (note that vader is
>>> the default for shared memory communication with ob1), you need to
>>> upgrade to v2.1.5 to fix this issue.  You may also upgrade to the
>>> following versions to fix this issue:
>>> - Open MPI v3.0.1 (released March, 2018) or later in the v3.0.x
>>>   series
>>> - Open MPI v3.1.2 (expected end of August, 2018) or later
>>> 
>>> This vader fix was deemed serious enough to warrant a 2.1.5 release.  
>>> This really will be the end of the 2.1.x series.  Trust me; my name is Joe 
>>> Isuzu.
>>> 
>>> 2.1.5rc1 will be available from the usual location in a few minutes (the 
>>> website will update in about 7 minutes):
>>> 
>>>   https://www.open-mpi.org/software/ompi/v2.1/
>>> 
>>> -- 
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> 


-- 
Jeff Squyres
jsquy...@cisco.com


Re: [OMPI devel] v2.1.5rc1 is out

2018-08-17 Thread Vallee, Geoffroy R.
FYI, that segfault problem did not occur when I tested 3.1.2rc1.

Thanks,

> On Aug 17, 2018, at 10:28 AM, Pavel Shamis  wrote:
> 
> It looks to me like an MXM-related failure?
> 
> On Thu, Aug 16, 2018 at 1:51 PM Vallee, Geoffroy R.  wrote:
> Hi,
> 
> I ran some tests on Summitdev here at ORNL:
> - the UCX problem is solved and I get the expected results for the tests that 
> I am running (netpipe and IMB).
> - without UCX:
> * the performance numbers are below what would be expected but I 
> believe at this point that the slight performance deficiency is due to other 
> users using other parts of the system. 
> * I also encountered the following problem while running IMB_EXT and 
> I now realize that I had the same problem with 2.4.1rc1 but did not catch it 
> at the time:
> [summitdev-login1:112517:0] Caught signal 11 (Segmentation fault)
> [summitdev-r0c2n13:91094:0] Caught signal 11 (Segmentation fault)
>  backtrace 
>  2 0x00073864 mxm_handle_error()  
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>  3 0x00073fa4 mxm_error_signal_handler()  
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>  4 0x00017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
>  5 0x000d4634 ompi_osc_base_select()  ??:0
>  6 0x00065e84 ompi_win_create()  ??:0
>  7 0x000a2488 PMPI_Win_create()  ??:0
>  8 0x1000b28c IMB_window()  ??:0
>  9 0x10005764 IMB_init_buffers_iter()  ??:0
> 10 0x10001ef8 main()  ??:0
> 11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
> 12 0x00024b74 __libc_start_main()  ??:0
> ===
>  backtrace 
>  2 0x00073864 mxm_handle_error()  
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>  3 0x00073fa4 mxm_error_signal_handler()  
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>  4 0x00017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
>  5 0x000d4634 ompi_osc_base_select()  ??:0
>  6 0x00065e84 ompi_win_create()  ??:0
>  7 0x000a2488 PMPI_Win_create()  ??:0
>  8 0x1000b28c IMB_window()  ??:0
>  9 0x10005764 IMB_init_buffers_iter()  ??:0
> 10 0x10001ef8 main()  ??:0
> 11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
> 12 0x00024b74 __libc_start_main()  ??:0
> ===
> 
> FYI, the 2.x series is not important to me so it can stay as is. I will move 
> on testing 3.1.2rc1.
> 
> Thanks,
> 
> 
> > On Aug 15, 2018, at 6:07 PM, Jeff Squyres (jsquyres) via devel 
> >  wrote:
> > 
> > Per our discussion over the weekend and on the weekly webex yesterday, 
> > we're releasing v2.1.5.  There are only two changes:
> > 
> > 1. A trivial link issue for UCX.
> > 2. A fix for the vader BTL issue.  This is how I described it in NEWS:
> > 
> > - A subtle race condition bug was discovered in the "vader" BTL
> >  (shared memory communications) that, in rare instances, can cause
> >  MPI processes to crash or incorrectly classify (or effectively drop)
> >  an MPI message sent via shared memory.  If you are using the "ob1"
> >  PML with "vader" for shared memory communication (note that vader is
> >  the default for shared memory communication with ob1), you need to
> >  upgrade to v2.1.5 to fix this issue.  You may also upgrade to the
> >  following versions to fix this issue:
> >  - Open MPI v3.0.1 (released March, 2018) or later in the v3.0.x
> >series
> >  - Open MPI v3.1.2 (expected end of August, 2018) or later
> > 
> > This vader fix was deemed serious enough to warrant a 2.1.5 release.  
> > This really will be the end of the 2.1.x series.  Trust me; my name is Joe 
> > Isuzu.
> > 
> > 2.1.5rc1 will be available from the usual location in a few minutes (the 
> > website will update in about 7 minutes):
> > 
> >https://www.open-mpi.org/software/ompi/v2.1/
> > 
> > -- 
> > Jeff Squyres
> > jsquy...@cisco.com
> > 

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] v2.1.5rc1 is out

2018-08-17 Thread Pavel Shamis
It looks to me like an MXM-related failure?

On Thu, Aug 16, 2018 at 1:51 PM Vallee, Geoffroy R. 
wrote:

> Hi,
>
> I ran some tests on Summitdev here at ORNL:
> - the UCX problem is solved and I get the expected results for the tests
> that I am running (netpipe and IMB).
> - without UCX:
> * the performance numbers are below what would be expected but I
> believe at this point that the slight performance deficiency is due to
> other users using other parts of the system.
> * I also encountered the following problem while running IMB_EXT
> and I now realize that I had the same problem with 2.4.1rc1 but did not
> catch it at the time:
> [summitdev-login1:112517:0] Caught signal 11 (Segmentation fault)
> [summitdev-r0c2n13:91094:0] Caught signal 11 (Segmentation fault)
>  backtrace 
>  2 0x00073864 mxm_handle_error()
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>  3 0x00073fa4 mxm_error_signal_handler()
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>  4 0x00017b24 ompi_osc_rdma_component_query()
> osc_rdma_component.c:0
>  5 0x000d4634 ompi_osc_base_select()  ??:0
>  6 0x00065e84 ompi_win_create()  ??:0
>  7 0x000a2488 PMPI_Win_create()  ??:0
>  8 0x1000b28c IMB_window()  ??:0
>  9 0x10005764 IMB_init_buffers_iter()  ??:0
> 10 0x10001ef8 main()  ??:0
> 11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
> 12 0x00024b74 __libc_start_main()  ??:0
> ===
>  backtrace 
>  2 0x00073864 mxm_handle_error()
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>  3 0x00073fa4 mxm_error_signal_handler()
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>  4 0x00017b24 ompi_osc_rdma_component_query()
> osc_rdma_component.c:0
>  5 0x000d4634 ompi_osc_base_select()  ??:0
>  6 0x00065e84 ompi_win_create()  ??:0
>  7 0x000a2488 PMPI_Win_create()  ??:0
>  8 0x1000b28c IMB_window()  ??:0
>  9 0x10005764 IMB_init_buffers_iter()  ??:0
> 10 0x10001ef8 main()  ??:0
> 11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
> 12 0x00024b74 __libc_start_main()  ??:0
> ===
>
> FYI, the 2.x series is not important to me so it can stay as is. I will
> move on testing 3.1.2rc1.
>
> Thanks,
>
>
> > On Aug 15, 2018, at 6:07 PM, Jeff Squyres (jsquyres) via devel <
> devel@lists.open-mpi.org> wrote:
> >
> > Per our discussion over the weekend and on the weekly webex yesterday,
> we're releasing v2.1.5.  There are only two changes:
> >
> > 1. A trivial link issue for UCX.
> > 2. A fix for the vader BTL issue.  This is how I described it in NEWS:
> >
> > - A subtle race condition bug was discovered in the "vader" BTL
> >  (shared memory communications) that, in rare instances, can cause
> >  MPI processes to crash or incorrectly classify (or effectively drop)
> >  an MPI message sent via shared memory.  If you are using the "ob1"
> >  PML with "vader" for shared memory communication (note that vader is
> >  the default for shared memory communication with ob1), you need to
> >  upgrade to v2.1.5 to fix this issue.  You may also upgrade to the
> >  following versions to fix this issue:
> >  - Open MPI v3.0.1 (released March, 2018) or later in the v3.0.x
> >series
> >  - Open MPI v3.1.2 (expected end of August, 2018) or later
> >
> > This vader fix was deemed serious enough to warrant a 2.1.5 release.
> > This really will be the end of the 2.1.x series.  Trust me; my name is Joe Isuzu.
> >
> > 2.1.5rc1 will be available from the usual location in a few minutes (the
> website will update in about 7 minutes):
> >
> >https://www.open-mpi.org/software/ompi/v2.1/
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> >
>
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] v2.1.5rc1 is out

2018-08-16 Thread Vallee, Geoffroy R.
Hi,

I ran some tests on Summitdev here at ORNL:
- the UCX problem is solved, and I get the expected results for the tests that I 
am running (NetPIPE and IMB).
- without UCX:
    * the performance numbers are below what would be expected, but at this 
point I believe the slight deficiency is due to other users running on other 
parts of the system.
    * I also encountered the following problem while running IMB_EXT; I now 
realize that I had the same problem with 2.4.1rc1 but did not catch it at 
the time:
[summitdev-login1:112517:0] Caught signal 11 (Segmentation fault)
[summitdev-r0c2n13:91094:0] Caught signal 11 (Segmentation fault)
 backtrace 
 2 0x00073864 mxm_handle_error()  
/var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
 3 0x00073fa4 mxm_error_signal_handler()  
/var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
 4 0x00017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
 5 0x000d4634 ompi_osc_base_select()  ??:0
 6 0x00065e84 ompi_win_create()  ??:0
 7 0x000a2488 PMPI_Win_create()  ??:0
 8 0x1000b28c IMB_window()  ??:0
 9 0x10005764 IMB_init_buffers_iter()  ??:0
10 0x10001ef8 main()  ??:0
11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
12 0x00024b74 __libc_start_main()  ??:0
===
 backtrace 
 2 0x00073864 mxm_handle_error()  
/var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
 3 0x00073fa4 mxm_error_signal_handler()  
/var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
 4 0x00017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
 5 0x000d4634 ompi_osc_base_select()  ??:0
 6 0x00065e84 ompi_win_create()  ??:0
 7 0x000a2488 PMPI_Win_create()  ??:0
 8 0x1000b28c IMB_window()  ??:0
 9 0x10005764 IMB_init_buffers_iter()  ??:0
10 0x10001ef8 main()  ??:0
11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
12 0x00024b74 __libc_start_main()  ??:0
===
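
In case it helps anyone reproduce this outside of IMB, below is a minimal sketch 
(mine, not taken from the IMB_EXT source) that exercises the same MPI_Win_create 
path shown in the backtrace; the 4 MB window size is an arbitrary choice:

#include <mpi.h>

/* Minimal sketch of the code path in the backtrace above:
 * MPI_Win_create -> ompi_win_create -> ompi_osc_base_select ->
 * ompi_osc_rdma_component_query.  Not the actual IMB_EXT code. */
int main(int argc, char **argv)
{
    void *buf;
    MPI_Aint size = 4 * 1024 * 1024;   /* arbitrary window size */
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Alloc_mem(size, MPI_INFO_NULL, &buf);

    /* The segfault reported above happens inside this call, while the
     * one-sided (osc) components are being queried/selected. */
    MPI_Win_create(buf, size, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_free(&win);
    MPI_Free_mem(buf);
    MPI_Finalize();
    return 0;
}

Running it with two or more ranks against the same MXM-based build should go 
through the same selection path, though I have not verified that it reproduces 
the crash.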

FYI, the 2.x series is not important to me so it can stay as is. I will move on 
testing 3.1.2rc1.

Thanks,


> On Aug 15, 2018, at 6:07 PM, Jeff Squyres (jsquyres) via devel 
>  wrote:
> 
> Per our discussion over the weekend and on the weekly webex yesterday, we're 
> releasing v2.1.5.  There are only two changes:
> 
> 1. A trivial link issue for UCX.
> 2. A fix for the vader BTL issue.  This is how I described it in NEWS:
> 
> - A subtle race condition bug was discovered in the "vader" BTL
>  (shared memory communications) that, in rare instances, can cause
>  MPI processes to crash or incorrectly classify (or effectively drop)
>  an MPI message sent via shared memory.  If you are using the "ob1"
>  PML with "vader" for shared memory communication (note that vader is
>  the default for shared memory communication with ob1), you need to
>  upgrade to v2.1.5 to fix this issue.  You may also upgrade to the
>  following versions to fix this issue:
>  - Open MPI v3.0.1 (released March, 2018) or later in the v3.0.x
>series
>  - Open MPI v3.1.2 (expected end of August, 2018) or later
> 
> This vader fix was deemed serious enough to warrant a 2.1.5 release.  
> This really will be the end of the 2.1.x series.  Trust me; my name is Joe 
> Isuzu.
> 
> 2.1.5rc1 will be available from the usual location in a few minutes (the 
> website will update in about 7 minutes):
> 
>https://www.open-mpi.org/software/ompi/v2.1/
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

