Scott,

Thanks for your advice! Good to know about the checksum debug functionality! Strangely enough, running with either "MX_CSUM=1" or "-mca pml cm" allows Murasaki to work normally, and also makes the test case I attached in my previous mail work. Very suspicious, but at least this gives me a functional workaround. (However, if I understand OpenMPI correctly, I shouldn't be able to use the CM PML over a network where some nodes have MX and some don't, correct?)
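For anyone following along without the attachment, here is a minimal sketch of the kind of content check the test case performs on each received message (the quoted message below describes the full flow). The fill rule and the names fill_value, make_message and first_mismatch are hypothetical, made up for this illustration; they are not the formula or identifiers used in verifycontent.cc. The only property that matters is that sender and receiver derive the same expected value from the sender's rank and the message sequence number, and that the value is never zero, so a zeroed-out message is always caught.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical fill rule (NOT the one in verifycontent.cc): one byte value
// per message, derived from the sender's rank and the sequence number.
// The result is always in 1..255, so an all-zero message never matches.
inline unsigned char fill_value(int sender_rank, int seq) {
    assert(sender_rank >= 0 && seq >= 0);
    return static_cast<unsigned char>(1 + (sender_rank * 31 + seq) % 255);
}

// Build the payload a sender would transmit: buflen copies of the fill value.
inline std::vector<unsigned char> make_message(int sender_rank, int seq,
                                               std::size_t buflen = 5000) {
    return std::vector<unsigned char>(buflen, fill_value(sender_rank, seq));
}

// Receiver-side check: return the offset of the first byte that differs
// from the expected fill value, or -1 if the whole buffer is intact.
inline long first_mismatch(const std::vector<unsigned char>& buf,
                           int sender_rank, int seq) {
    const unsigned char want = fill_value(sender_rank, seq);
    for (std::size_t i = 0; i < buf.size(); ++i) {
        if (buf[i] != want) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

Reporting the offset of the first bad byte, rather than just pass/fail, is what makes it possible to say where in a message the contents start turning into zeros.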

Scott Atchley atchley-at-myri.com |openmpi-users/Allow| wrote:
> Hi Kris,
>
> I have not run your code yet, but I will try to this weekend.
>
> You can have MX checksum its messages if you set MX_CSUM=1 and use the
> MX debug library (e.g. LD_LIBRARY_PATH to /opt/mx/lib/debug).
>
> Do you have the problem if you use the MX MTL? To test it modify your
> mpirun as follows:
>
> $ mpirun -mca pml cm ...
>
> and do not specify any BTL info.
>
> Scott
>
> On Jul 2, 2009, at 6:05 PM, 8mj6tc...@sneakemail.com wrote:
>
>> Hi. I've now spent many many hours tracking down a bug that was causing
>> my program to die, as though either its memory were getting corrupted or
>> messages were getting clobbered while going through the network, I
>> couldn't tell which. I really wish the checksum flag on btl_mx_flags
>> were working. But anyway, I think I've managed to recreate the core of
>> the problem in a small-ish test case which I've attached
>> (verifycontent.cc). This usually segfaults at MPI_Issend after sending
>> about 60-90 messages for me while using OpenMPI 1.3.2 with myricom's
>> mx-1.2.9 drivers on linux using gcc 4.3.2. Disabling the mx btl (mpirun
>> -mca btl ^mx) makes it work (likewise, the same for my own larger
>> project (Murasaki)). The MPI_Ssend using version
>> (verifycontent-ssend.cc) also works no problem over mx. So I suspect the
>> issue lies in OpenMPI 1.3.2's handling of MPI_Issend over mx, but it's
>> also possible I've horribly misunderstood something fundamental about
>> MPI and it's just my fault, so if that's the case, please let me know
>> (but both this test case and Murasaki work over mpichmx, so OpenMPI
>> is definitely doing something different).
>>
>> Here's a brief description of verifycontent.cc to make reading it easier:
>> * given -np=N, half the nodes will be sending, half will be receiving
>> some number of messages (reps)
>> * each message consists of buflen (5000) chars, set to some value based
>> on the sending node's rank and the sequence number of the message
>> * the receiving node starts an irecv for each sending node, tests each
>> request until a message arrives
>> * the receiver then checks the contents of the message to make sure it
>> matches what was supposed to be in there (this is where my real project,
>> Murasaki, fails actually. I can't seem to replicate that however).
>> * the senders meanwhile keep sending messages and dequeuing them when
>> their request tests as completed.
>>
>> Testing out the current subversion trunk version, 1.4a1r21594, that
>> seems to pass my test case, but also tends to show errors like
>> "mca_btl_mx_init: mx_open_endpoint() failed with status 20 (Busy)" on
>> start up, and Murasaki still fails (messages turn into zeros about 132KB
>> in), so something still isn't right...
>>
>> If anyone has any ideas about this test case failing, or my larger issue
>> of messages turning into zeros after 132KB (though sadly sometimes it
>> isn't at 132KB, but straight from 0KB, which is very confusing) while on
>> MX, I'd greatly appreciate it. Even a simple confirmation of "Yes,
>> MPI_Issend/Irecv with MX has issues in 1.3.2" would help my sanity.
>> --
>> Kris Popendorf
>>
>> Keio University
>> http://murasaki................... <- (Probably too cumbersome to expect
>> most people to test, but if you feel daring, try putting in some
>> Human/Mouse chromosomes over MX)
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
--Kris
叶ってしまう夢は本当の夢と言えん。
[A dream that comes true can't really be called a dream.]