Ah, that's good to know. I'm trying to wire things up through orte-server on
2.x now and am at the point where I can do an MPI_Publish_name on one node and
MPI_Lookup_name on another (both started through their own mpirun calls). Then,
when the first process does an MPI_Comm_accept and the secon
... accidentally sent before finishing.
This error happened because lookup was returning multiple published entries,
N-1 of which had already disconnected. Should this case be handled more
gracefully, or is it expected to fail like this?
Thanks,
Pieter
From: Pie
Should be handled more gracefully, of course. When a proc departs, we clean up
any published storage that wasn’t specifically marked to be retained for a
longer period of time (see pmix_common.h for the various supported options).
You’ve obviously found a race condition, and we’ll have to trac
I'll open an issue on GitHub with more details.
As far as the accept/connect issue, do you expect the processes to find each
other if they do a publish/lookup through the same orte-server if they are
started separately?
I can try with master as well. I'm working off of 2.0.1 now (using stable
Yes, they should find one another. You do have to pass the server’s URI on the
mpirun command line so that each mpirun can find the server.
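
For reference, here is a minimal sketch of the rendezvous being discussed: one
process publishes a port and accepts, the other looks it up and connects, and
each is started by its own mpirun pointed at the same orte-server. The service
name, the URI file path, and the binary name are hypothetical. The launch lines
in the header comment assume the --report-uri and --ompi-server options behave
as described in the Open MPI 2.x man pages, so treat them as a starting point
rather than a verified recipe.

```c
/* Sketch only: assumes Open MPI 2.x with an orte-server already running.
 * Launch sequence (URI file path and binary name are hypothetical):
 *   orte-server --report-uri /tmp/orte-server.uri
 *   mpirun --ompi-server file:/tmp/orte-server.uri -np 1 ./rendezvous server
 *   mpirun --ompi-server file:/tmp/orte-server.uri -np 1 ./rendezvous client
 */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define SERVICE_NAME "example-service" /* hypothetical service name */

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm peer;
    int value = 0;

    MPI_Init(&argc, &argv);

    if (argc < 2) {
        fprintf(stderr, "usage: %s server|client\n", argv[0]);
        MPI_Finalize();
        return 1;
    }

    if (strcmp(argv[1], "server") == 0) {
        /* Open a port and publish it under a name the client can look up. */
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name(SERVICE_NAME, MPI_INFO_NULL, port);

        /* Block until a client connects to the published port. */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &peer);
        MPI_Recv(&value, 1, MPI_INT, 0, 0, peer, MPI_STATUS_IGNORE);
        printf("server received %d\n", value);

        /* Unpublish explicitly so no stale entry is left for later lookups. */
        MPI_Unpublish_name(SERVICE_NAME, MPI_INFO_NULL, port);
        MPI_Comm_disconnect(&peer);
        MPI_Close_port(port);
    } else {
        /* Resolve the published name to a port, then connect to it. */
        MPI_Lookup_name(SERVICE_NAME, MPI_INFO_NULL, port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &peer);

        value = 42;
        MPI_Send(&value, 1, MPI_INT, 0, 0, peer);
        MPI_Comm_disconnect(&peer);
    }

    MPI_Finalize();
    return 0;
}
```

The explicit MPI_Unpublish_name call before disconnecting is a defensive choice
in this sketch, so that a later lookup does not depend on the cleanup that
happens when the publishing process departs.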
> On Nov 8, 2016, at 12:40 PM, Pieter Noordhuis wrote:
>
> I'll open an issue on GitHub with more details.
>
> As far as the accept/connect issue, do you expect the processes to fi