[OMPI devel] BML problem?

2008-12-11 Thread Eugene Loh
I'm not exactly sure where the fix to this should be, but I think I've found a problem. Consider, for illustration, launching a multi-process job on a single node. The function mca_bml_r2_add_procs() calls mca_btl_sm_add_procs(). Each process could conceivably return a different

Re: [OMPI devel] Fwd: [OMPI users] Onesided + derived datatypes

2008-12-11 Thread George Bosilca
Fixed in r20120. george. On Dec 11, 2008, at 19:14, Brian Barrett wrote: I think that's a reasonable solution. However, the words "not it" come to mind. Sorry, but I have way too much on my plate this month. By the way, in case no one noticed, I had e-mailed my findings to devel.

Re: [OMPI devel] Fwd: [OMPI users] Onesided + derived datatypes

2008-12-11 Thread Brian Barrett
I think that's a reasonable solution. However, the words "not it" come to mind. Sorry, but I have way too much on my plate this month. By the way, in case no one noticed, I had e-mailed my findings to devel. Someone might want to reply to Dorian's e-mail on users. Brian On Dec 11,

Re: [OMPI devel] Fwd: [OMPI users] Onesided + derived datatypes

2008-12-11 Thread George Bosilca
Brian, You're right, the datatype is being too cautious with the boundaries when detecting the overlap. There is no good solution to detect the overlap except parsing the whole memory layout to check the status of every predefined type. As one can imagine this is a very expensive

Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT

2008-12-11 Thread Jeff Squyres
I'm quite sure that the CM CPC stuff (both IBCM -- which doesn't fully work anyway -- and RDMA CM) will time out and Bad Things will happen if you interrupt it in the middle of some network transactions. The (kernel-imposed) timeout for RDMACM is pretty short -- on the order of a minute or

Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT

2008-12-11 Thread Josh Hursey
I would expect that you will hit problems with timeouts throughout the codebase as Jeff mentioned, particularly with network connections. Having a 'prepare to suspend' signal followed by a 'suspend now' signal might work since it should provide enough of a window to ready the application

Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT

2008-12-11 Thread Jeff Squyres
On Dec 11, 2008, at 2:55 PM, Terry Dontje wrote: Well under SGE it allows you to have SGE send mpirun SIGUSR1 so many minutes before sending the Suspend signal. My point is that the right approach might be to work in the context of Josh's CR stuff -- he's already got hooks for "do this

Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT

2008-12-11 Thread Terry Dontje
Jeff Squyres wrote: On Dec 8, 2008, at 11:11 AM, Ralph Castain wrote: It sounds reasonable to me. I agree with Ralf W about having mpirun send a STOP to itself - that would seem to solve the problem about stopping everything. It would seem, however, that you cannot similarly STOP the

Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT

2008-12-11 Thread Jeff Squyres
On Dec 8, 2008, at 11:11 AM, Ralph Castain wrote: It sounds reasonable to me. I agree with Ralf W about having mpirun send a STOP to itself - that would seem to solve the problem about stopping everything. It would seem, however, that you cannot similarly STOP the daemons or else you

Re: [OMPI devel] BTL move - the notion

2008-12-11 Thread Jeff Squyres
(chiming in a bit after the fact) In general, I agree with most of what has been stated. 1. The BTLs should remain "owned" by Open MPI. There are OMPI member organizations in multiple projects that want to use the BTLs, but the BTLs are primarily for the Open MPI project. 2. An

Re: [OMPI devel] 1.3 staging area?

2008-12-11 Thread Jeff Squyres
I think that this sounds reasonable. It's actually not too much of a change from the existing CMR process: - if your commit is applicable to the trunk, do so *** if you intend your commit to go to the v1.3 branch, also commit it there (potentially adjusting the patch to commit cleanly in