Takahiro,

Sorry for the delay in answering, and thanks for the bug report and the
patch. I applied your patch and added some tougher tests so that we catch
similar issues in the future.

Thanks,
  George.


On Mon, Sep 29, 2014 at 8:56 PM, Kawashima, Takahiro <
t-kawash...@jp.fujitsu.com> wrote:

> Hi George,
>
> Thank you for attending the meeting in Kyoto. As we discussed
> there, a colleague of mine has run into a datatype problem.
>
> See attached create_resized.c. It creates a datatype with an
> LB marker using MPI_Type_create_struct and MPI_Type_create_resized.
>
> The expected contents of the output file (received_data) are:
> --------------------------------
> 0: t1 = 0.1, t2 = 0.2
> 1: t1 = 1.1, t2 = 1.2
> 2: t1 = 2.1, t2 = 2.2
> 3: t1 = 3.1, t2 = 3.2
> 4: t1 = 4.1, t2 = 4.2
> ... snip ...
> 1995: t1 = 1995.1, t2 = 1995.2
> 1996: t1 = 1996.1, t2 = 1996.2
> 1997: t1 = 1997.1, t2 = 1997.2
> 1998: t1 = 1998.1, t2 = 1998.2
> 1999: t1 = 1999.1, t2 = 1999.2
> --------------------------------
>
> But if you run the program many times with multiple BTL modules
> and with small eager_limit and max_send_size values, on some runs
> you will see:
> --------------------------------
> 0: t1 = 0.1, t2 = 0.2
> 1: t1 = 1.1, t2 = 1.2
> 2: t1 = 2.1, t2 = 2.2
> 3: t1 = 3.1, t2 = 3.2
> 4: t1 = 4.1, t2 = 4.2
> ... snip ...
> 470: t1 = 470.1, t2 = 470.2
> 471: t1 = 471.1, t2 = 471.2
> 472: t1 = 472.1, t2 = 472.2
> 473: t1 = 473.1, t2 = 473.2
> 474: t1 = 474.1, t2 = 0        <-- broken!
> 475: t1 = 0, t2 = 475.1
> 476: t1 = 0, t2 = 476.1
> 477: t1 = 0, t2 = 477.1
> ... snip ...
> 1995: t1 = 0, t2 = 1995.1
> 1996: t1 = 0, t2 = 1996.1
> 1997: t1 = 0, t2 = 1997.1
> 1998: t1 = 0, t2 = 1998.1
> 1999: t1 = 0, t2 = 1999.1
> --------------------------------
>
> The array index at which the data starts to break (474 in the
> run above) may change from run to run.
> The same result appears on both trunk and v1.8.3.
>
> If you have multiple IB HCAs, you can reproduce this with the
> following options:
>
>   -n 2
>   --mca btl self,openib
>   --mca btl_openib_eager_limit 256
>   --mca btl_openib_max_send_size 384
>
> Or, if you don't have multiple NICs, with the following options:
>
>   -n 2
>   --host localhost
>   --mca btl self,sm,vader
>   --mca btl_vader_exclusivity 65536
>   --mca btl_vader_eager_limit 256
>   --mca btl_vader_max_send_size 384
>   --mca btl_sm_exclusivity 65536
>   --mca btl_sm_eager_limit 256
>   --mca btl_sm_max_send_size 384
>
> My colleague found that the OPAL convertor on the receiving
> process seems to add the LB value twice when computing the
> receive-buffer write offset for fragments that arrive out of order.
>
> He created the patch below. Our program works fine with this
> patch, but we don't know whether it is the correct fix.
> Could you look into this issue?
>
> Index: opal/datatype/opal_convertor.c
> ===================================================================
> --- opal/datatype/opal_convertor.c      (revision 32807)
> +++ opal/datatype/opal_convertor.c      (working copy)
> @@ -362,11 +362,11 @@
>      if( OPAL_LIKELY(0 == count) ) {
>          pStack[1].type     = pElems->elem.common.type;
>          pStack[1].count    = pElems->elem.count;
> -        pStack[1].disp     = pElems->elem.disp;
> +        pStack[1].disp     = 0;
>      } else {
>          pStack[1].type  = OPAL_DATATYPE_UINT1;
>          pStack[1].count = pData->size - count;
> -        pStack[1].disp  = pData->true_lb + count;
> +        pStack[1].disp  = count;
>      }
>      pStack[1].index    = 0;  /* useless */
>
>
> Best regards,
> Takahiro Kawashima,
> MPI development team,
> Fujitsu
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/09/15939.php
>
