Re: [OMPI users] Issues with Get/Put and IRecv
Well, mpich2 and mvapich2 are working smoothly for my app. mpich2 under GigE is also giving ~2X the performance of Open MPI in the cases where Open MPI does work. After the paper deadline, I'll attempt to package up a simple test case and send it to the list. Thanks!

-Mike
Re: [OMPI users] Issues with Get/Put and IRecv
Sadly, I've just hit this problem again, so I'll have to find another MPI implementation, as I have a paper deadline quickly approaching. I'm using single threads now, but I had very similar issues when using multiple threads, issuing send/recv on one thread and waiting on a posted MPI_Recv on another.

The issue actually seems to be with MPI_Get. I can do heavy MPI_Put traffic and things seem okay, but as soon as I have a similar communication pattern with MPI_Get, things get unstable.

-Mike
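Mike's code never made it to the list, so the contrast he describes can only be sketched. A rough illustration, as the same passive-target epoch built on a Put versus a Get (names and counts are hypothetical; the lock/unlock pairing is inferred from the ompi_osc_pt2pt_passive_unlock frames in the backtraces below):

    /* Sketch only, not the reporter's code. Per the report, the Put
     * form ran under heavy load; the Get form destabilized the target. */
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    /* heavy MPI_Put traffic reportedly ran fine: */
    MPI_Put(buf, count, MPI_BYTE, target, 0, count, MPI_BYTE, win);
    /* ...but the same pattern built on MPI_Get was unstable:
     * MPI_Get(buf, count, MPI_BYTE, target, 0, count, MPI_BYTE, win); */
    MPI_Win_unlock(target, win);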
Re: [OMPI users] Issues with Get/Put and IRecv
Mike -

In Open MPI 1.2, one-sided is implemented over point-to-point, so I would expect it to be slower. This may or may not be addressed in a future version of Open MPI (I would guess so, but don't want to commit to it). Were you using multiple threads? If so, how?

On the good news side, I think your call stack looked similar to what I was seeing, so hopefully I can make some progress on a real solution.

Brian
Re: [OMPI users] Issues with Get/Put and IRecv
Well, I've managed to get a working solution, but I'm not sure how I got there. I built a test case that looked like a nice, simple version of what I was trying to do and it worked, so I moved the test code into my implementation and, lo and behold, it works. I must have been doing something a little funky in the original pass, likely causing a stack smash somewhere or trying to do a get/put out of bounds. If I have any more problems, I'll let y'all know.

I've tested pretty heavy usage up to 128 MPI processes across 16 nodes and things seem to be behaving. I did notice that one-sided transfers seem to be a little slower than explicit send/recv, at least on GigE. Once I do some more testing, I'll bring things up on IB and see how things are going.

-Mike
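The out-of-bounds suspicion is easy to guard against. A hedged sketch (assuming windows created with MPI_Win_create, byte displacements, and equally sized windows on every rank, so the local MPI_WIN_SIZE attribute is a usable stand-in for the target's):

    /* Guard a one-sided access against the out-of-bounds get/put
     * suspected above. MPI_WIN_SIZE is the size of the *local* window,
     * which matches the target's only if all ranks expose the same
     * amount of memory. */
    #include <assert.h>
    MPI_Aint *win_size;
    int flag;
    MPI_Win_get_attr(win, MPI_WIN_SIZE, &win_size, &flag);
    assert(flag && disp + (MPI_Aint)count <= *win_size);
    MPI_Get(buf, count, MPI_BYTE, target, disp, count, MPI_BYTE, win);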
Re: [OMPI users] Issues with Get/Put and IRecv
Brian Barrett wrote:
Is it possible for you to share the code causing the problem (or some small test case)?

Well, I can give you a Linux x86 binary if that would do it. The code is huge, as it's part of a much larger system, so there is no such thing as a simple case at the moment, and the code is in pieces and largely unrunnable now with all the hacking... I basically have one thread spinning on an MPI_Test on a posted IRecv while being used as the target of the MPI_Get. I'll see if I can hack together a simple version that breaks late tonight.

I've just played with posting a send to that IRecv, issuing the MPI_Get, handshaking, and then posting another IRecv, and the MPI_Test continues to eat it, but now in a memcpy:

#0  0x001c068c in memcpy () from /lib/libc.so.6
#1  0x00e412d9 in ompi_convertor_pack (pConv=0x83c1198, iov=0xa0, out_size=0xaffc1fd8, max_data=0xaffc1fdc) at convertor.c:254
#2  0x00ea265d in ompi_osc_pt2pt_replyreq_send (module=0x856e668, replyreq=0x83c1180) at osc_pt2pt_data_move.c:411
#3  0x00ea0ebe in ompi_osc_pt2pt_component_fragment_cb (pt2pt_buffer=0x8573380) at osc_pt2pt_component.c:582
#4  0x00ea1389 in ompi_osc_pt2pt_progress () at osc_pt2pt_component.c:769
#5  0x00aa3019 in opal_progress () at runtime/opal_progress.c:288
#6  0x00ea59e5 in ompi_osc_pt2pt_passive_unlock (module=0x856e668, origin=1, count=1) at osc_pt2pt_sync.c:60
#7  0x00ea0cd2 in ompi_osc_pt2pt_component_fragment_cb (pt2pt_buffer=0x856f300) at osc_pt2pt_component.c:688
#8  0x00ea1389 in ompi_osc_pt2pt_progress () at osc_pt2pt_component.c:769
#9  0x00aa3019 in opal_progress () at runtime/opal_progress.c:288
#10 0x00e33f05 in ompi_request_test (rptr=0xaffc2430, completed=0xaffc2434, status=0xaffc23fc) at request/req_test.c:82
#11 0x00e61770 in PMPI_Test (request=0xaffc2430, completed=0xaffc2434, status=0xaffc23fc) at ptest.c:52

-Mike
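Reconstructed from that description (a sketch under assumed tags and buffer names, not the actual code), the origin side of the experiment would be roughly:

    /* Origin rank: complete the target's outstanding Irecv with a send,
     * then do the Get. The target re-posts an Irecv and resumes its
     * MPI_Test loop, which is where the memcpy crash above fires. */
    MPI_Send(msg, count, MPI_BYTE, target, TAG, MPI_COMM_WORLD);
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Get(buf, count, MPI_BYTE, target, 0, count, MPI_BYTE, win);
    MPI_Win_unlock(target, win);
    /* ...handshake with the target (details not given in the thread)... */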
Re: [OMPI users] Issues with Get/Put and IRecv
Hi Mike -

I've spent some time this afternoon looking at the problem and have some ideas on what could be happening. I don't think it's a data mismatch (the data intended for the IRecv getting delivered to the Get), but more a problem with the call to MPI_Test perturbing the progress flow of the one-sided engine. I can see one or two places where this could happen, although I'm having trouble replicating the problem with any test case I can write.

Is it possible for you to share the code causing the problem (or some small test case)? It would make me feel considerably better if I could really understand the conditions required to end up in a segfault state.

Thanks,

Brian
[OMPI users] Issues with Get/Put and IRecv
If I only do gets/puts, things seem to be working correctly with version 1.2. However, if I have a posted Irecv on the target node and issue an MPI_Get against that target, MPI_Test on the posted IRecv causes a segfault:

[expose:21249] *** Process received signal ***
[expose:21249] Signal: Segmentation fault (11)
[expose:21249] Signal code: Address not mapped (1)
[expose:21249] Failing at address: 0xa0
[expose:21249] [ 0] [0x96e440]
[expose:21249] [ 1] /usr/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_replyreq_send+0xed) [0x2c765d]
[expose:21249] [ 2] /usr/lib/openmpi/mca_osc_pt2pt.so [0x2c5ebe]
[expose:21249] [ 3] /usr/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_progress+0x119) [0x2c6389]
[expose:21249] [ 4] /usr/lib/libopen-pal.so.0(opal_progress+0x69) [0x67d019]
[expose:21249] [ 5] /usr/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_passive_unlock+0xb5) [0x2ca9e5]
[expose:21249] [ 6] /usr/lib/openmpi/mca_osc_pt2pt.so [0x2c5cd2]
[expose:21249] [ 7] /usr/lib/openmpi/mca_osc_pt2pt.so(ompi_osc_pt2pt_progress+0x119) [0x2c6389]
[expose:21249] [ 8] /usr/lib/libopen-pal.so.0(opal_progress+0x69) [0x67d019]
[expose:21249] [ 9] /usr/lib/libmpi.so.0(ompi_request_test+0x35) [0x3d6f05]
[expose:21249] [10] /usr/lib/libmpi.so.0(PMPI_Test+0x80) [0x404770]

Anyone have suggestions? Sadly, I need to have IRecv's posted. I'll attempt to find a workaround, but it looks like the posted IRecv is getting all the data of the MPI_Get from the other node. It's like the message tagging is getting ignored. I've never tried posting two different IRecv's with different message tags either...

-Mike
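No reproducer was ever posted to the thread; a best-guess sketch of the failing pattern as described (two ranks, hypothetical tag and buffer sizes) would be:

    /* Sketch of the reported failure mode, not a confirmed reproducer:
     * rank 0 posts an MPI_Irecv and polls MPI_Test (which drives the
     * progress engine) while rank 1 does an MPI_Get against rank 0's
     * window, then sends the message that completes the Irecv. */
    #include <mpi.h>

    #define N   1024
    #define TAG 42

    int main(int argc, char **argv)
    {
        int rank, done = 0;
        char winbuf[N], getbuf[N], msgbuf[N];
        MPI_Win win;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Win_create(winbuf, N, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        if (rank == 0) {
            MPI_Irecv(msgbuf, N, MPI_BYTE, 1, TAG, MPI_COMM_WORLD, &req);
            while (!done)
                MPI_Test(&req, &done, MPI_STATUS_IGNORE); /* segfaulted here per the report */
        } else if (rank == 1) {
            MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
            MPI_Get(getbuf, N, MPI_BYTE, 0, 0, N, MPI_BYTE, win);
            MPI_Win_unlock(0, win);
            MPI_Send(getbuf, N, MPI_BYTE, 0, TAG, MPI_COMM_WORLD);
        }

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }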