Hm, this looks actually correct. The question now is basically why the
intermediate handshake by the processes with rank 0 on the
inter-communicator is not finishing.
I am wondering whether this could be related to a problem reported in
another thread (Processes stuck after MPI_Waitall() in
I've attached gdb to the client which has just connected to the grid.
Its bt is almost exactly the same as the server's one:
#0 0x428066d7 in sched_yield () from /lib/libc.so.6
#1 0x00933cbf in opal_progress () at ../../opal/runtime/opal_progress.c:220
#2 0x00d460b8 in opal_condition_wait
Based on your output shown here, there is absolutely nothing wrong
(yet). Both processes are in the same function and do what they are
supposed to do.
However, I am fairly sure that the client process bt that you show is
already part of current_intracomm. Could you try to create a bt of the
This falls outside of my purview - I would suggest you post this question with
a different subject line, specifically mentioning the failure of intercomm_merge,
so it attracts the attention of those with knowledge of that area.
On Jul 27, 2010, at 9:30 AM, Grzegorz Maj wrote:
So now I have a new question.
When I run my server and a lot of clients on the same machine,
everything looks fine.
But when I try to run the clients on several machines, the most
frequent scenario is:
* server is started on machine A
* X (= 1, 4, 10, ..) clients are started on machine B and they
No problem at all - glad it works!
On Jul 26, 2010, at 7:58 AM, Grzegorz Maj wrote:
I'm very sorry, but the problem was on my side. My installation
process was not always taking the newest sources of openmpi. In this
case it hasn't installed the version with the latest patch. Now I
think everything works fine - I could run over 130 processes with no
I'm sorry again
We're having some trouble replicating this once my patches are applied. Can you
send us your configure cmd? Just the output from "head config.log" will do for
On Jul 20, 2010, at 9:09 AM, Grzegorz Maj wrote:
My start script looks almost exactly the same as the one published by
Edgar, i.e. the processes are starting one by one with no delay.
2010/7/20 Ralph Castain :
Grzegorz: something occurred to me. When you start all these processes, how are
you staggering their wireup? Are they flooding us, or are you time-shifting
them a little?
On Jul 19, 2010, at 10:32 AM, Edgar Gabriel wrote:
Hm, so I am not sure how to approach this. First of all, the test case
works for me. I used up to 80 clients, and for both optimized and
non-optimized compilation. I ran the tests with trunk (not with 1.4
series, but the communicator code is identical in both cases). Clearly,
the patch from Ralph
As far as I can tell, it appears the problem is somewhere in our communicator
setup. The people knowledgeable in that area are going to look into it later
I'm creating a ticket to track the problem and will copy you on it.
On Jul 13, 2010, at 6:57 AM, Ralph Castain wrote:
On Jul 13, 2010, at 3:36 AM, Grzegorz Maj wrote:
Bad news..
I've tried the latest patch with and without the prior one, but it
hasn't changed anything. I've also tried using the old code but with
the OMPI_DPM_BASE_MAXJOBIDS constant changed to 80, but it also didn't
While looking through the sources of openmpi-1.4.2 I couldn't find any
Just so you don't have to wait for 1.4.3 release, here is the patch (doesn't
include the prior patch).
On Jul 12, 2010, at 12:13 PM, Grzegorz Maj wrote:
2010/7/12 Ralph Castain :
Dug around a bit and found the problem!!
I have no idea who did this or why, but somebody set a limit of 64
separate jobids in the dynamic init called by ompi_comm_set, which builds the
intercommunicator. Unfortunately, they hard-wired the array size, but never
check that size before
1024 is not the problem: changing it to 2048 hasn't changed anything.
Following your advice I've run my process using gdb. Unfortunately I
didn't get anything more than:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf7e4c6c0 (LWP 20246)]
0xf7f39905 in ompi_comm_set ()
I would guess the #files limit of 1024. However, if it behaves the same way
when spread across multiple machines, I would suspect it is somewhere in your
program itself. Given that the segfault is in your process, can you use gdb to
look at the core file and see where and why it fails?
2010/7/7 Ralph Castain :
On Jul 6, 2010, at 8:48 AM, Grzegorz Maj wrote:
Hi Ralph,
sorry for the late response, but I couldn't find free time to play
with this. Finally I've applied the patch you prepared. I've launched
my processes in the way you've described and I think it's working as
you expected. None of my processes runs the orted daemon and they can
Actually, OMPI is distributed with a daemon that does pretty much what you want. Check out "man ompi-server". I originally wrote that code to support cross-application MPI publish/subscribe operations, but we can utilize it here too. Have to blame me for not making it more publicly known. The
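For reference, the rendezvous that ompi-server brokers is built on the standard MPI-2 name-publishing calls. A minimal sketch of the two sides - the service name "my-service" and the role selection via argc are invented for the example, and both jobs must be pointed at a running ompi-server (see "man ompi-server" for the exact flags):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Comm inter;
    char port[MPI_MAX_PORT_NAME];
    MPI_Init(&argc, &argv);

    if (argc > 1) {                       /* "server" role */
        /* Open a port, register it under a name with ompi-server,
         * and wait for a client to connect. */
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("my-service", MPI_INFO_NULL, port);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    } else {                              /* "client" role */
        /* Resolve the name through ompi-server, then connect. */
        MPI_Lookup_name("my-service", MPI_INFO_NULL, port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    }
    /* ... use the intercommunicator ... */
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}
```

This only sketches the API flow; it needs an MPI runtime plus a reachable ompi-server instance to actually run.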
I'm Krzysztof and I'm working with Grzegorz Maj on this, our small
We definitely would like to give your patch a try. But could you please
explain your solution a little more?
You still would like to start one mpirun per mpi grid, and then have
processes started by us
In thinking about this, my proposed solution won't entirely fix the problem -
you'll still wind up with all those daemons. I believe I can resolve that one
as well, but it would require a patch.
Would you like me to send you something you could try? Might take a couple of
iterations to get it
Hmmm... I -think- this will work, but I cannot guarantee it:
1. launch one process (can just be a spinner) using mpirun that includes the
mpirun -report-uri file
where file is some filename that mpirun can create and insert its contact info
into. This can be a relative or
To be more precise: by 'server process' I mean some process that I
could run once on my system and it could help in creating those
My typical scenario is:
1. run N separate processes, each without mpirun
2. connect them into MPI group
3. do some job
4. exit all N processes
5. goto 1
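Steps 1-2 of the scenario above match the "connect one by one, then merge" scheme discussed elsewhere in this thread: each new singleton connects to a port owned by the existing group, and both sides merge the resulting intercommunicator so the intracommunicator grows by one. A rough sketch - the function name and the out-of-band port exchange are assumptions, and this is an outline of the approach, not the poster's actual code:

```c
#include <mpi.h>

/* Grow the group by one process. The port string must have been
 * obtained out of band (e.g. via a file or a naming service).
 * Error handling omitted for brevity. */
MPI_Comm grow_group(MPI_Comm current_intracomm, char *port, int i_am_new)
{
    MPI_Comm inter, merged;
    if (i_am_new) {
        /* New singleton: connect to the existing group... */
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        /* ...and merge "high" so the newcomer gets the highest rank. */
        MPI_Intercomm_merge(inter, 1, &merged);
    } else {
        /* Existing group: accept collectively over the whole group,
         * with rank 0 owning the port. */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, current_intracomm, &inter);
        MPI_Intercomm_merge(inter, 0, &merged);
    }
    MPI_Comm_disconnect(&inter);
    return merged;   /* the enlarged intracommunicator */
}
```

Note that both MPI_Comm_accept and MPI_Intercomm_merge are collective, so every member of the existing group must enter this routine for each newcomer - which is also why a hang in this handshake, as reported above, stalls the whole group.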
Thank you Ralph for your explanation.
And, apart from that descriptors' issue, is there any other way to
solve my problem, i.e. to run separately a number of processes,
without mpirun and then to collect them into an MPI intracomm group?
If, for example, I would need to run some 'server process'
Yes, I know. The problem is that I need to use some special way for
running my processes provided by the environment in which I'm working
and unfortunately I can't use mpirun.
2010/4/18 Ralph Castain :
Guess I don't understand why you can't use mpirun - all it does is start
things, provide a means to forward io, etc. It mainly sits there quietly
without using any cpu unless required to support the job.
Sounds like it would solve your problem. Otherwise, I know of no way to get all
I'd like to dynamically create a group of processes communicating via
MPI. Those processes need to be run without mpirun and create
intracommunicator after the startup. Any ideas how to do this
I came up with a solution in which the processes are connecting one by