I'm not exactly sure where the fix to this should be, but I think I've
found a problem.
Consider, for illustration, launching a multi-process job on a single
node. The function
mca_bml_r2_add_procs()
calls
mca_btl_sm_add_procs()
Each process could conceivably return a different
Fixed in r20120.
george.
On Dec 11, 2008, at 19:14, Brian Barrett wrote:
I think that's a reasonable solution. However, the words "not it"
come to mind. Sorry, but I have way too much on my plate this month.
By the way, in case no one noticed, I had e-mailed my findings to
devel. Someone might want to reply to Dorian's e-mail on users.
Brian
On Dec 11,
Brian,
You're right, the datatype is being too cautious with the boundaries
when detecting the overlap. There is no good solution to detect the
overlap except parsing the whole memory layout to check the status of
every predefined type. As one can imagine this is a very expensive
I'm quite sure that the CM CPC stuff (both IBCM -- which doesn't fully
work anyway -- and RDMA CM) will time out and Bad Things will happen if
you interrupt it in the middle of some network transactions. The
(kernel-imposed) timeout for RDMA CM is pretty short -- on the order of
a minute or
I would expect that you will hit problems with timeouts throughout the
codebase as Jeff mentioned, particularly with network connections.
Having a 'prepare to suspend' signal followed by a 'suspend now'
signal might work since it should provide enough of a window to ready
the application
On Dec 11, 2008, at 2:55 PM, Terry Dontje wrote:
Well, under SGE you can have SGE send mpirun SIGUSR1 so many
minutes before sending the Suspend signal.
My point is that the right approach might be to work in the context of
Josh's CR stuff -- he's already got hooks for "do this
Jeff Squyres wrote:
On Dec 8, 2008, at 11:11 AM, Ralph Castain wrote:
It sounds reasonable to me. I agree with Ralf W about having mpirun
send a STOP to itself - that would seem to solve the problem about
stopping everything.
It would seem, however, that you cannot similarly STOP the daemons
or else you
(chiming in a bit after the fact)
In general, I agree with most of what has been stated.
1. The BTLs should remain "owned" by Open MPI. There are OMPI member
organizations in multiple projects that want to use the BTLs, but the
BTLs are primarily for the Open MPI project.
2. An
I think that this sounds reasonable. It's actually not too much of a
change from the existing CMR process:
- if your commit is applicable to the trunk, do so
*** if you intend your commit to go to the v1.3 branch, also commit it
there (potentially adjusting the patch to commit cleanly in