Re: [OMPI devel] MPI_Comm_spawn crashes with the openib btl
Thanks Ralph!

It did fix the problem.

Cheers,

Gilles

On 2014/10/01 3:04, Ralph Castain wrote:
> I fixed this in r32818 - the components shouldn't be passing back success if
> the requested info isn't found. Hope that fixes the problem.
>
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/09/15950.php
Re: [OMPI devel] MPI_Comm_spawn crashes with the openib btl
I fixed this in r32818 - the components shouldn't be passing back success if
the requested info isn't found. Hope that fixes the problem.

On Sep 30, 2014, at 1:54 AM, Gilles Gouaillardet wrote:

> Folks,
>
> the dynamic/spawn test from the ibm test suite crashes if the openib btl
> is detected
>
> Could you please advise ?
>
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/09/15942.php
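For illustration only, here is a minimal sketch of the contract described in this reply: a lookup that reports "not found" instead of success and clears its output pointer on a miss. Everything in it (demo_get, demo_kv_t, demo_store, the DEMO_* codes and their values) is a hypothetical stand-in, not the actual Open MPI code or the change made in r32818.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical return codes; the numeric values are arbitrary for this sketch. */
enum { DEMO_SUCCESS = 0, DEMO_ERR_NOT_FOUND = -13 };

/* Hypothetical key/value record, NOT the real opal_value_t. */
typedef struct {
    const char *key;
    const char *bytes;
    size_t size;
} demo_kv_t;

/* A tiny in-memory store that only knows about one key. */
static const demo_kv_t demo_store[] = {
    { "btl.tcp", "tcp-endpoint", 12 },
};

/*
 * Look up `key`. On a hit, point *kv at the record and return success.
 * On a miss, set *kv to NULL and return DEMO_ERR_NOT_FOUND, so a caller
 * that checks the return code never dereferences an unset pointer.
 */
static int demo_get(const char *key, const demo_kv_t **kv)
{
    size_t n = sizeof(demo_store) / sizeof(demo_store[0]);
    for (size_t i = 0; i < n; i++) {
        if (0 == strcmp(demo_store[i].key, key)) {
            *kv = &demo_store[i];
            return DEMO_SUCCESS;
        }
    }
    *kv = NULL;
    return DEMO_ERR_NOT_FOUND;
}
```

With this shape, a caller only has to test the return code before touching the result; a missing key can no longer look like a successful lookup.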
[OMPI devel] MPI_Comm_spawn crashes with the openib btl
Folks,

the dynamic/spawn test from the ibm test suite crashes if the openib btl
is detected
(the test can be run on one node with an IB port)

here is what happens :

in mca_btl_openib_proc_create,
the macro

    OPAL_MODEX_RECV(rc, &mca_btl_openib_component.super.btl_version,
                    proc, &message, &msg_size);

does not find any information *but*
    rc is OPAL_SUCCESS
    msg_size is not updated (i.e. left uninitialized)
    message is not updated (i.e. left uninitialized)

then, if msg_size is uninitialized with a non zero value, and if message
is uninitialized with an invalid address, a crash will occur when
accessing message.

/* i am not debating here the fact that there is no information returned,
i am simply discussing the crash */

a simple workaround is to initialize msg_size to zero.

that being said, is this the correct fix ?

one possible alternate fix is to update the OPAL_MODEX_RECV_STRING macro
like this :

    /* from opal/mca/pmix/pmix.h */
    #define OPAL_MODEX_RECV_STRING(r, s, p, d, sz) \
        do { \
            opal_value_t *kv; \
            if (OPAL_SUCCESS == ((r) = opal_pmix.get(&(p)->proc_name, \
                                                     (s), &kv))) { \
                if (NULL != kv) { \
                    *(d) = kv->data.bo.bytes; \
                    *(sz) = kv->data.bo.size; \
                    kv->data.bo.bytes = NULL; /* protect the data */ \
                    OBJ_RELEASE(kv); \
                } else { \
                    *(sz) = 0; \
                    (r) = OPAL_ERR_NOT_FOUND; \
                } \
            } \
        } while(0);

/*
*(sz) = 0; and (r) = OPAL_ERR_NOT_FOUND; can be seen as redundant, *(sz)
*or* (r) could be set
*/

and another alternate fix is to update the end of the native_get
function like this :

    /* from opal/mca/pmix/native/pmix_native.c */

    if (found) {
        return OPAL_SUCCESS;
    }
    *kv = NULL;
    if (OPAL_SUCCESS == rc) {
        if (OPAL_SUCCESS == ret) {
            rc = OPAL_ERR_NOT_FOUND;
        } else {
            rc = ret;
        }
    }
    return rc;

Could you please advise ?

Cheers,

Gilles
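As a rough, self-contained illustration of the caller-side idea in this message (always give the outputs a defined value, and turn "success but no data" into a not-found error), something like the following could be used. All names here (demo_buggy_get, demo_modex_recv, demo_blob_t, the DEMO_* codes) are hypothetical and only mimic the behaviour described above; this is not the Open MPI macro or its actual fix.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes; the numeric values are arbitrary for this sketch. */
enum { DEMO_SUCCESS = 0, DEMO_ERR_NOT_FOUND = -13 };

/* Hypothetical byte-object record standing in for the looked-up value. */
typedef struct {
    void *bytes;
    size_t size;
} demo_blob_t;

/*
 * Hypothetical lookup that mimics the reported behaviour: it returns
 * success even when nothing was found, leaving *blob set to NULL.
 */
static int demo_buggy_get(int have_data, demo_blob_t **blob)
{
    static char payload[] = "port-info";
    static demo_blob_t stored = { payload, sizeof(payload) };
    *blob = have_data ? &stored : NULL;
    return DEMO_SUCCESS;
}

/*
 * Wrapper in the spirit of the proposed macro change: the outputs are
 * initialized up front, and a NULL result is reported as not found.
 */
static int demo_modex_recv(int have_data, void **data, size_t *size)
{
    demo_blob_t *blob = NULL;
    int rc;

    *data = NULL;   /* never leave the caller's pointer uninitialized */
    *size = 0;      /* the "initialize msg_size to zero" workaround */

    rc = demo_buggy_get(have_data, &blob);
    if (DEMO_SUCCESS != rc) {
        return rc;
    }
    if (NULL == blob) {
        return DEMO_ERR_NOT_FOUND;
    }
    *data = blob->bytes;
    *size = blob->size;
    return DEMO_SUCCESS;
}
```

Setting both the size and the return code is redundant in the same sense as noted in the message, but it keeps callers safe whether they check the return code or only the size.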