No problem at all - glad it works!

On Jul 26, 2010, at 7:58 AM, Grzegorz Maj wrote:

> Hi,
> I'm very sorry, but the problem was on my side. My installation
> process was not always taking the newest Open MPI sources, so in this
> case it hadn't installed the version with the latest patch. Now I
> think everything works fine - I could run over 130 processes with no
> problems.
> I'm sorry again for wasting your time, and thank you for the patch.
> 
> 2010/7/21 Ralph Castain <r...@open-mpi.org>:
>> We're having some trouble replicating this once my patches are applied. Can
>> you send us your configure cmd? Just the output from "head config.log" will
>> do for now.
>> 
>> Thanks!
>> 
>> On Jul 20, 2010, at 9:09 AM, Grzegorz Maj wrote:
>> 
>>> My start script looks almost exactly the same as the one published by
>>> Edgar, i.e. the processes start one by one with no delay.
>>> 
>>> 2010/7/20 Ralph Castain <r...@open-mpi.org>:
>>>> Grzegorz: something occurred to me. When you start all these processes, 
>>>> how are you staggering their wireup? Are they flooding us, or are you 
>>>> time-shifting them a little?
>>>> 
>>>> 
>>>> On Jul 19, 2010, at 10:32 AM, Edgar Gabriel wrote:
>>>> 
>>>>> Hm, so I am not sure how to approach this. First of all, the test case
>>>>> works for me. I used up to 80 clients, with both optimized and
>>>>> non-optimized builds. I ran the tests with the trunk (not with the 1.4
>>>>> series, but the communicator code is identical in both cases). Clearly,
>>>>> the patch from Ralph is necessary to make it work.
>>>>> 
>>>>> Additionally, I went through the communicator creation code for dynamic
>>>>> communicators trying to find spots that could create problems. The only
>>>>> place where I found the number 64 appearing is the Fortran-to-C mapping
>>>>> arrays (e.g. for communicators), where the initial size of the table is
>>>>> 64. I looked twice over the pointer-array code to see whether we could
>>>>> have a problem there (since it is a key piece of the CID allocation code
>>>>> for communicators), but I am fairly confident that it is correct.
>>>>> 
>>>>> Note that we have other (non-dynamic) tests where comm_set is called
>>>>> 100,000 times, and the code per se does not seem to have a problem due
>>>>> to being called too often. So I am not sure what else to look at.
>>>>> 
>>>>> Edgar
>>>>> 
>>>>> 
>>>>> 
>>>>> On 7/13/2010 8:42 PM, Ralph Castain wrote:
>>>>>> As far as I can tell, it appears the problem is somewhere in our
>>>>>> communicator setup. The people knowledgeable in that area are going to
>>>>>> look into it later this week.
>>>>>> 
>>>>>> I'm creating a ticket to track the problem and will copy you on it.
>>>>>> 
>>>>>> 
>>>>>> On Jul 13, 2010, at 6:57 AM, Ralph Castain wrote:
>>>>>> 
>>>>>>> 
>>>>>>> On Jul 13, 2010, at 3:36 AM, Grzegorz Maj wrote:
>>>>>>> 
>>>>>>>> Bad news...
>>>>>>>> I've tried the latest patch, both with and without the prior one, but
>>>>>>>> it hasn't changed anything. I've also tried using the old code with
>>>>>>>> the OMPI_DPM_BASE_MAXJOBIDS constant changed to 80, but that didn't
>>>>>>>> help either.
>>>>>>>> While looking through the sources of openmpi-1.4.2 I couldn't find any
>>>>>>>> call to the function ompi_dpm_base_mark_dyncomm.
>>>>>>> 
>>>>>>> It isn't directly called - it shows up in ompi_comm_set as
>>>>>>> ompi_dpm.mark_dyncomm. You were definitely overrunning that array, but
>>>>>>> I guess something else is also being hit. Have to look further...
>>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 2010/7/12 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>> Just so you don't have to wait for 1.4.3 release, here is the patch 
>>>>>>>>> (doesn't include the prior patch).
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Jul 12, 2010, at 12:13 PM, Grzegorz Maj wrote:
>>>>>>>>> 
>>>>>>>>>> 2010/7/12 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>> Dug around a bit and found the problem!!
>>>>>>>>>>> 
>>>>>>>>>>> I have no idea who did this or why, but somebody set a limit
>>>>>>>>>>> of 64 separate jobids in the dynamic init called by ompi_comm_set,
>>>>>>>>>>> which builds the intercommunicator. Unfortunately, they hard-wired
>>>>>>>>>>> the array size but never check that size before adding to it.
>>>>>>>>>>> 
>>>>>>>>>>> So after 64 calls to connect_accept, you are overwriting other
>>>>>>>>>>> areas of memory. As you found, hitting 66 causes it to segfault.
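>>>>>>>>>>> 
>>>>>>>>>>> Purely as an illustration of that pattern (hypothetical names, not
>>>>>>>>>>> the actual Open MPI source), the bug and its fix look roughly like:
>>>>>>>>>>> 
>>>>>>>>>>> #define MAX_JOBIDS 64                /* the hard-wired limit */
>>>>>>>>>>> static int jobids[MAX_JOBIDS];
>>>>>>>>>>> static int num_jobids = 0;
>>>>>>>>>>> 
>>>>>>>>>>> /* buggy: no bounds check, the 65th entry scribbles past the array */
>>>>>>>>>>> void add_jobid_buggy(int jobid) {
>>>>>>>>>>>     jobids[num_jobids++] = jobid;    /* undefined behaviour when full */
>>>>>>>>>>> }
>>>>>>>>>>> 
>>>>>>>>>>> /* fixed: check the size (or grow the table) before storing */
>>>>>>>>>>> int add_jobid_fixed(int jobid) {
>>>>>>>>>>>     if (num_jobids >= MAX_JOBIDS) {
>>>>>>>>>>>         return -1;                   /* or realloc to a larger table */
>>>>>>>>>>>     }
>>>>>>>>>>>     jobids[num_jobids++] = jobid;
>>>>>>>>>>>     return 0;
>>>>>>>>>>> }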
>>>>>>>>>>> 
>>>>>>>>>>> I'll fix this on the developer's trunk (I'll also add that original 
>>>>>>>>>>> patch to it). Rather than my searching this thread in detail, can 
>>>>>>>>>>> you remind me what version you are using so I can patch it too?
>>>>>>>>>> 
>>>>>>>>>> I'm using 1.4.2.
>>>>>>>>>> Thanks a lot; I'm looking forward to the patch.
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Thanks for your patience with this!
>>>>>>>>>>> Ralph
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Jul 12, 2010, at 7:20 AM, Grzegorz Maj wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> 1024 is not the problem: changing it to 2048 hasn't changed
>>>>>>>>>>>> anything.
>>>>>>>>>>>> Following your advice I ran my process under gdb. Unfortunately I
>>>>>>>>>>>> didn't get anything more than:
>>>>>>>>>>>> 
>>>>>>>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>>>>>>>> [Switching to Thread 0xf7e4c6c0 (LWP 20246)]
>>>>>>>>>>>> 0xf7f39905 in ompi_comm_set () from 
>>>>>>>>>>>> /home/gmaj/openmpi/lib/libmpi.so.0
>>>>>>>>>>>> 
>>>>>>>>>>>> (gdb) bt
>>>>>>>>>>>> #0  0xf7f39905 in ompi_comm_set () from 
>>>>>>>>>>>> /home/gmaj/openmpi/lib/libmpi.so.0
>>>>>>>>>>>> #1  0xf7e3ba95 in connect_accept () from
>>>>>>>>>>>> /home/gmaj/openmpi/lib/openmpi/mca_dpm_orte.so
>>>>>>>>>>>> #2  0xf7f62013 in PMPI_Comm_connect () from 
>>>>>>>>>>>> /home/gmaj/openmpi/lib/libmpi.so.0
>>>>>>>>>>>> #3  0x080489ed in main (argc=825832753, argv=0x34393638) at 
>>>>>>>>>>>> client.c:43
>>>>>>>>>>>> 
>>>>>>>>>>>> What's more: when I added a breakpoint on ompi_comm_set in the 66th
>>>>>>>>>>>> process and stepped a couple of instructions, one of the other
>>>>>>>>>>>> processes crashed (as usual on ompi_comm_set) earlier than the 66th
>>>>>>>>>>>> did.
>>>>>>>>>>>> 
>>>>>>>>>>>> Finally I decided to recompile Open MPI using the -g flag for gcc.
>>>>>>>>>>>> In this case the 66-process issue is gone! I was running my
>>>>>>>>>>>> applications exactly the same way as previously (without even
>>>>>>>>>>>> recompiling them) and I successfully ran over 130 processes.
>>>>>>>>>>>> When switching back to the Open MPI build without -g, it segfaults
>>>>>>>>>>>> again.
>>>>>>>>>>>> 
>>>>>>>>>>>> Any ideas? I'm really confused.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 2010/7/7 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>> I would guess the #files limit of 1024. However, if it behaves 
>>>>>>>>>>>>> the same way when spread across multiple machines, I would 
>>>>>>>>>>>>> suspect it is somewhere in your program itself. Given that the 
>>>>>>>>>>>>> segfault is in your process, can you use gdb to look at the core 
>>>>>>>>>>>>> file and see where and why it fails?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Jul 7, 2010, at 10:17 AM, Grzegorz Maj wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 2010/7/7 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Jul 6, 2010, at 8:48 AM, Grzegorz Maj wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>> sorry for the late response, but I couldn't find free time to 
>>>>>>>>>>>>>>>> play
>>>>>>>>>>>>>>>> with this. Finally I've applied the patch you prepared. I've 
>>>>>>>>>>>>>>>> launched
>>>>>>>>>>>>>>>> my processes in the way you've described and I think it's 
>>>>>>>>>>>>>>>> working as
>>>>>>>>>>>>>>>> you expected. None of my processes runs the orted daemon and 
>>>>>>>>>>>>>>>> they can
>>>>>>>>>>>>>>>> perform MPI operations. Unfortunately I'm still hitting the
>>>>>>>>>>>>>>>> 65-process issue :(
>>>>>>>>>>>>>>>> Maybe I'm doing something wrong.
>>>>>>>>>>>>>>>> I attach my source code. If anybody could have a look at this,
>>>>>>>>>>>>>>>> I would be grateful.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> When I run that code with clients_count <= 65 everything works 
>>>>>>>>>>>>>>>> fine:
>>>>>>>>>>>>>>>> all the processes create a common grid, exchange some 
>>>>>>>>>>>>>>>> information and
>>>>>>>>>>>>>>>> disconnect.
>>>>>>>>>>>>>>>> When I set clients_count > 65 the 66th process crashes on
>>>>>>>>>>>>>>>> MPI_Comm_connect (segmentation fault).
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> I didn't have time to check the code, but my guess is that you 
>>>>>>>>>>>>>>> are still hitting some kind of file descriptor or other limit. 
>>>>>>>>>>>>>>> Check to see what your limits are - usually "ulimit" will tell 
>>>>>>>>>>>>>>> you.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> My limitations are:
>>>>>>>>>>>>>> time(seconds)        unlimited
>>>>>>>>>>>>>> file(blocks)         unlimited
>>>>>>>>>>>>>> data(kb)             unlimited
>>>>>>>>>>>>>> stack(kb)            10240
>>>>>>>>>>>>>> coredump(blocks)     0
>>>>>>>>>>>>>> memory(kb)           unlimited
>>>>>>>>>>>>>> locked memory(kb)    64
>>>>>>>>>>>>>> process              200704
>>>>>>>>>>>>>> nofiles              1024
>>>>>>>>>>>>>> vmemory(kb)          unlimited
>>>>>>>>>>>>>> locks                unlimited
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Which one do you think could be responsible for that?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I tried running all 66 processes on one machine, and also
>>>>>>>>>>>>>> spreading them across several machines, and it always crashes
>>>>>>>>>>>>>> the same way on the 66th process.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Another thing I would like to know is whether it's normal that
>>>>>>>>>>>>>>>> any of my processes calling MPI_Comm_connect or MPI_Comm_accept
>>>>>>>>>>>>>>>> while the other side is not ready eats up a full CPU.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Yes - the waiting process is polling in a tight loop waiting 
>>>>>>>>>>>>>>> for the connection to be made.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Any help would be appreciated,
>>>>>>>>>>>>>>>> Grzegorz Maj
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 2010/4/24 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>>>>> Actually, OMPI is distributed with a daemon that does pretty 
>>>>>>>>>>>>>>>>> much what you
>>>>>>>>>>>>>>>>> want. Checkout "man ompi-server". I originally wrote that 
>>>>>>>>>>>>>>>>> much what you
>>>>>>>>>>>>>>>>> want. Check out "man ompi-server". I originally wrote that 
>>>>>>>>>>>>>>>>> cross-application MPI publish/subscribe operations, but we 
>>>>>>>>>>>>>>>>> can utilize it
>>>>>>>>>>>>>>>>> here too. Have to blame me for not making it more publicly 
>>>>>>>>>>>>>>>>> known.
>>>>>>>>>>>>>>>>> The attached patch upgrades ompi-server and modifies the 
>>>>>>>>>>>>>>>>> singleton startup
>>>>>>>>>>>>>>>>> to provide your desired support. This solution works in the 
>>>>>>>>>>>>>>>>> following
>>>>>>>>>>>>>>>>> manner:
>>>>>>>>>>>>>>>>> 1. launch "ompi-server -report-uri <filename>". This starts a 
>>>>>>>>>>>>>>>>> persistent
>>>>>>>>>>>>>>>>> daemon called "ompi-server" that acts as a rendezvous point 
>>>>>>>>>>>>>>>>> for
>>>>>>>>>>>>>>>>> independently started applications.  The problem with 
>>>>>>>>>>>>>>>>> starting different
>>>>>>>>>>>>>>>>> applications and wanting them to MPI connect/accept lies in 
>>>>>>>>>>>>>>>>> the need to have
>>>>>>>>>>>>>>>>> the applications find each other. If they can't discover 
>>>>>>>>>>>>>>>>> contact info for
>>>>>>>>>>>>>>>>> the other app, then they can't wire up their interconnects. 
>>>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>>> "ompi-server" tool provides that rendezvous point. I don't 
>>>>>>>>>>>>>>>>> like that
>>>>>>>>>>>>>>>>> comm_accept segfaulted - should have just error'd out.
>>>>>>>>>>>>>>>>> 2. set OMPI_MCA_orte_server=file:<filename> in the
>>>>>>>>>>>>>>>>> environment where you
>>>>>>>>>>>>>>>>> will start your processes. This will allow your singleton 
>>>>>>>>>>>>>>>>> processes to find
>>>>>>>>>>>>>>>>> the ompi-server. I automatically also set the envar to 
>>>>>>>>>>>>>>>>> connect the MPI
>>>>>>>>>>>>>>>>> publish/subscribe system for you.
>>>>>>>>>>>>>>>>> 3. run your processes. As they think they are singletons, 
>>>>>>>>>>>>>>>>> they will detect
>>>>>>>>>>>>>>>>> the presence of the above envar and automatically connect 
>>>>>>>>>>>>>>>>> themselves to the
>>>>>>>>>>>>>>>>> "ompi-server" daemon. This provides each process with the 
>>>>>>>>>>>>>>>>> ability to perform
>>>>>>>>>>>>>>>>> any MPI-2 operation.
>>>>>>>>>>>>>>>>> I tested this on my machines and it worked, so hopefully it 
>>>>>>>>>>>>>>>>> will meet your
>>>>>>>>>>>>>>>>> needs. You only need to run one "ompi-server" period, so long 
>>>>>>>>>>>>>>>>> as you locate
>>>>>>>>>>>>>>>>> it where all of the processes can find the contact file and 
>>>>>>>>>>>>>>>>> can open a TCP
>>>>>>>>>>>>>>>>> socket to the daemon. There is a way to knit multiple 
>>>>>>>>>>>>>>>>> ompi-servers into a
>>>>>>>>>>>>>>>>> broader network (e.g., to connect processes that cannot 
>>>>>>>>>>>>>>>>> directly access a
>>>>>>>>>>>>>>>>> server due to network segmentation), but it's a tad tricky - 
>>>>>>>>>>>>>>>>> let me know if
>>>>>>>>>>>>>>>>> you require it and I'll try to help.
>>>>>>>>>>>>>>>>> If you have trouble wiring them all into a single 
>>>>>>>>>>>>>>>>> communicator, you might
>>>>>>>>>>>>>>>>> ask separately about that and see if one of our MPI experts 
>>>>>>>>>>>>>>>>> can provide
>>>>>>>>>>>>>>>>> advice (I'm just the RTE grunt).
>>>>>>>>>>>>>>>>> HTH - let me know how this works for you and I'll incorporate 
>>>>>>>>>>>>>>>>> it into future
>>>>>>>>>>>>>>>>> OMPI releases.
>>>>>>>>>>>>>>>>> Ralph
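>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> A minimal sketch of the publish/lookup connect pattern the
>>>>>>>>>>>>>>>>> steps above enable (the service name is made up and error
>>>>>>>>>>>>>>>>> handling is omitted; the other side would do MPI_Open_port +
>>>>>>>>>>>>>>>>> MPI_Publish_name + MPI_Comm_accept):
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> #include <mpi.h>
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> int main(int argc, char **argv)
>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>     char port[MPI_MAX_PORT_NAME];
>>>>>>>>>>>>>>>>>     MPI_Comm inter;
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>     MPI_Init(&argc, &argv);        /* singleton init */
>>>>>>>>>>>>>>>>>     /* "my-rendezvous" is a made-up service name; the lookup
>>>>>>>>>>>>>>>>>        is resolved through ompi-server via the envar above */
>>>>>>>>>>>>>>>>>     MPI_Lookup_name("my-rendezvous", MPI_INFO_NULL, port);
>>>>>>>>>>>>>>>>>     MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>     /* ... MPI traffic over the intercommunicator ... */
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>     MPI_Comm_disconnect(&inter);
>>>>>>>>>>>>>>>>>     MPI_Finalize();
>>>>>>>>>>>>>>>>>     return 0;
>>>>>>>>>>>>>>>>> }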
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On Apr 24, 2010, at 1:49 AM, Krzysztof Zarzycki wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>>> I'm Krzysztof and I'm working with Grzegorz Maj on this small
>>>>>>>>>>>>>>>>> project/experiment of ours.
>>>>>>>>>>>>>>>>> We definitely would like to give your patch a try. But could 
>>>>>>>>>>>>>>>>> you please
>>>>>>>>>>>>>>>>> explain your solution a little more?
>>>>>>>>>>>>>>>>> You would still like to start one mpirun per MPI grid, and
>>>>>>>>>>>>>>>>> then have the processes started by us join the MPI comm?
>>>>>>>>>>>>>>>>> That is a good solution, of course.
>>>>>>>>>>>>>>>>> But it would be especially preferable to have one daemon
>>>>>>>>>>>>>>>>> running persistently on our "entry" machine that can handle
>>>>>>>>>>>>>>>>> several MPI grid starts.
>>>>>>>>>>>>>>>>> Can your patch help us this way too?
>>>>>>>>>>>>>>>>> Thanks for your help!
>>>>>>>>>>>>>>>>> Krzysztof
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> On 24 April 2010 03:51, Ralph Castain <r...@open-mpi.org> 
>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> In thinking about this, my proposed solution won't entirely 
>>>>>>>>>>>>>>>>>> fix the
>>>>>>>>>>>>>>>>>> problem - you'll still wind up with all those daemons. I 
>>>>>>>>>>>>>>>>>> believe I can
>>>>>>>>>>>>>>>>>> resolve that one as well, but it would require a patch.
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Would you like me to send you something you could try? Might 
>>>>>>>>>>>>>>>>>> take a couple
>>>>>>>>>>>>>>>>>> of iterations to get it right...
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Apr 23, 2010, at 12:12 PM, Ralph Castain wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Hmmm....I -think- this will work, but I cannot guarantee it:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 1. launch one process (can just be a spinner) using mpirun 
>>>>>>>>>>>>>>>>>>> that includes
>>>>>>>>>>>>>>>>>>> the following option:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> mpirun -report-uri file
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> where file is some filename that mpirun can create and
>>>>>>>>>>>>>>>>>>> insert its contact info into. This can be a relative or
>>>>>>>>>>>>>>>>>>> absolute path. This process must remain alive throughout
>>>>>>>>>>>>>>>>>>> your application - it doesn't matter what it does.
>>>>>>>>>>>>>>>>>>> Its purpose is solely to keep mpirun alive.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 2. set OMPI_MCA_dpm_orte_server=FILE:file in your 
>>>>>>>>>>>>>>>>>>> environment, where
>>>>>>>>>>>>>>>>>>> "file" is the filename given above. This will tell your 
>>>>>>>>>>>>>>>>>>> processes how to
>>>>>>>>>>>>>>>>>>> find mpirun, which is acting as a meeting place to handle 
>>>>>>>>>>>>>>>>>>> find mpirun, which is acting as a meeting place to handle
>>>>>>>>>>>>>>>>>>> the connect/accept operations.
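>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> For step 1, the "spinner" really can be trivial - a sketch
>>>>>>>>>>>>>>>>>>> (illustrative only; the binary name is arbitrary), launched
>>>>>>>>>>>>>>>>>>> as e.g. "mpirun -report-uri file ./spinner":
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> #include <mpi.h>
>>>>>>>>>>>>>>>>>>> #include <unistd.h>
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> int main(int argc, char **argv)
>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>     MPI_Init(&argc, &argv);
>>>>>>>>>>>>>>>>>>>     for (;;) sleep(60);   /* just stay alive to keep mpirun up;
>>>>>>>>>>>>>>>>>>>                              kill it when the run is finished */
>>>>>>>>>>>>>>>>>>>     MPI_Finalize();       /* never reached */
>>>>>>>>>>>>>>>>>>>     return 0;
>>>>>>>>>>>>>>>>>>> }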
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Now run your processes, and have them connect/accept to 
>>>>>>>>>>>>>>>>>>> each other.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> The reason I cannot guarantee this will work is that these 
>>>>>>>>>>>>>>>>>>> processes
>>>>>>>>>>>>>>>>>>> will all have the same rank && name since they all start as 
>>>>>>>>>>>>>>>>>>> singletons.
>>>>>>>>>>>>>>>>>>> Hence, connect/accept is likely to fail.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> But it -might- work, so you might want to give it a try.
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Apr 23, 2010, at 8:10 AM, Grzegorz Maj wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> To be more precise: by 'server process' I mean some 
>>>>>>>>>>>>>>>>>>>> process that I
>>>>>>>>>>>>>>>>>>>> could run once on my system and it could help in creating 
>>>>>>>>>>>>>>>>>>>> those
>>>>>>>>>>>>>>>>>>>> groups.
>>>>>>>>>>>>>>>>>>>> My typical scenario is:
>>>>>>>>>>>>>>>>>>>> 1. run N separate processes, each without mpirun
>>>>>>>>>>>>>>>>>>>> 2. connect them into an MPI group
>>>>>>>>>>>>>>>>>>>> 3. do some job
>>>>>>>>>>>>>>>>>>>> 4. exit all N processes
>>>>>>>>>>>>>>>>>>>> 5. goto 1
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 2010/4/23 Grzegorz Maj <ma...@wp.pl>:
>>>>>>>>>>>>>>>>>>>>> Thank you Ralph for your explanation.
>>>>>>>>>>>>>>>>>>>>> And, apart from that descriptor issue, is there any other
>>>>>>>>>>>>>>>>>>>>> way to solve my problem, i.e. to run a number of processes
>>>>>>>>>>>>>>>>>>>>> separately, without mpirun, and then collect them into an
>>>>>>>>>>>>>>>>>>>>> MPI intracomm group?
>>>>>>>>>>>>>>>>>>>>> If, for example, I needed to run some 'server process'
>>>>>>>>>>>>>>>>>>>>> (even using mpirun) for this task, that's OK. Any ideas?
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Grzegorz Maj
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>> 2010/4/18 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>>>>>>>>>> Okay, but here is the problem. If you don't use mpirun, 
>>>>>>>>>>>>>>>>>>>>>> and are not
>>>>>>>>>>>>>>>>>>>>>> operating in an environment we support for "direct" 
>>>>>>>>>>>>>>>>>>>>>> launch (i.e., starting
>>>>>>>>>>>>>>>>>>>>>> processes outside of mpirun), then every one of those 
>>>>>>>>>>>>>>>>>>>>>> processes thinks it is
>>>>>>>>>>>>>>>>>>>>>> a singleton - yes?
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> What you may not realize is that each singleton 
>>>>>>>>>>>>>>>>>>>>>> immediately
>>>>>>>>>>>>>>>>>>>>>> fork/exec's an orted daemon that is configured to behave 
>>>>>>>>>>>>>>>>>>>>>> just like mpirun.
>>>>>>>>>>>>>>>>>>>>>> This is required in order to support MPI-2 operations 
>>>>>>>>>>>>>>>>>>>>>> such as
>>>>>>>>>>>>>>>>>>>>>> MPI_Comm_spawn, MPI_Comm_connect/accept, etc.
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> So if you launch 64 processes that think they are
>>>>>>>>>>>>>>>>>>>>>> singletons, then you have 64 copies of orted running as
>>>>>>>>>>>>>>>>>>>>>> well. This eats up a lot of file descriptors, which is
>>>>>>>>>>>>>>>>>>>>>> probably why you are hitting this 65-process limit -
>>>>>>>>>>>>>>>>>>>>>> your system is probably running out of file descriptors.
>>>>>>>>>>>>>>>>>>>>>> You might check your system limits and see if you can
>>>>>>>>>>>>>>>>>>>>>> get them revised upward.
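>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> (The per-process descriptor limit can also be read from
>>>>>>>>>>>>>>>>>>>>>> code - a sketch using plain POSIX getrlimit, nothing Open
>>>>>>>>>>>>>>>>>>>>>> MPI specific:)
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> #include <stdio.h>
>>>>>>>>>>>>>>>>>>>>>> #include <sys/resource.h>
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> int main(void)
>>>>>>>>>>>>>>>>>>>>>> {
>>>>>>>>>>>>>>>>>>>>>>     struct rlimit rl;
>>>>>>>>>>>>>>>>>>>>>>     if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
>>>>>>>>>>>>>>>>>>>>>>         printf("open files: soft=%llu hard=%llu\n",
>>>>>>>>>>>>>>>>>>>>>>                (unsigned long long)rl.rlim_cur,
>>>>>>>>>>>>>>>>>>>>>>                (unsigned long long)rl.rlim_max);
>>>>>>>>>>>>>>>>>>>>>>     return 0;
>>>>>>>>>>>>>>>>>>>>>> }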
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>> On Apr 17, 2010, at 4:24 PM, Grzegorz Maj wrote:
>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> Yes, I know. The problem is that I need to use a special
>>>>>>>>>>>>>>>>>>>>>>> way of running my processes, provided by the environment
>>>>>>>>>>>>>>>>>>>>>>> in which I'm working, and unfortunately I can't use
>>>>>>>>>>>>>>>>>>>>>>> mpirun.
>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>> 2010/4/18 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>>>>>>>>>>>> Guess I don't understand why you can't use mpirun -
>>>>>>>>>>>>>>>>>>>>>>>> all it does is start things, provide a means to forward
>>>>>>>>>>>>>>>>>>>>>>>> I/O, etc. It mainly sits there quietly without using
>>>>>>>>>>>>>>>>>>>>>>>> any CPU unless required to support the job.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> Sounds like it would solve your problem. Otherwise, I 
>>>>>>>>>>>>>>>>>>>>>>>> know of no
>>>>>>>>>>>>>>>>>>>>>>>> way to get all these processes into comm_world.
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>> On Apr 17, 2010, at 2:27 PM, Grzegorz Maj wrote:
>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>> I'd like to dynamically create a group of processes
>>>>>>>>>>>>>>>>>>>>>>>>> communicating via MPI. Those processes need to be run
>>>>>>>>>>>>>>>>>>>>>>>>> without mpirun and create an intracommunicator after
>>>>>>>>>>>>>>>>>>>>>>>>> startup. Any ideas how to do this efficiently?
>>>>>>>>>>>>>>>>>>>>>>>>> I came up with a solution in which the processes
>>>>>>>>>>>>>>>>>>>>>>>>> connect one by one using MPI_Comm_connect, but
>>>>>>>>>>>>>>>>>>>>>>>>> unfortunately all the processes that are already in
>>>>>>>>>>>>>>>>>>>>>>>>> the group need to call MPI_Comm_accept. This means
>>>>>>>>>>>>>>>>>>>>>>>>> that when the n-th process wants to connect I need to
>>>>>>>>>>>>>>>>>>>>>>>>> collect all the n-1 processes on the MPI_Comm_accept
>>>>>>>>>>>>>>>>>>>>>>>>> call. After I run about 40 processes, every subsequent
>>>>>>>>>>>>>>>>>>>>>>>>> call takes more and more time, which I'd like to avoid.
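>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> A bare-bones sketch of that incremental pattern, as
>>>>>>>>>>>>>>>>>>>>>>>>> seen from the existing group (illustrative only, no
>>>>>>>>>>>>>>>>>>>>>>>>> error handling; "port" is assumed to come from
>>>>>>>>>>>>>>>>>>>>>>>>> MPI_Open_port on the root, and "clients_count" is
>>>>>>>>>>>>>>>>>>>>>>>>> assumed to be known):
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> MPI_Comm ring = MPI_COMM_SELF;   /* group starts as just me */
>>>>>>>>>>>>>>>>>>>>>>>>> MPI_Comm inter, merged;
>>>>>>>>>>>>>>>>>>>>>>>>> for (int n = 1; n < clients_count; n++) {
>>>>>>>>>>>>>>>>>>>>>>>>>     /* every current member takes part in the accept */
>>>>>>>>>>>>>>>>>>>>>>>>>     MPI_Comm_accept(port, MPI_INFO_NULL, 0, ring, &inter);
>>>>>>>>>>>>>>>>>>>>>>>>>     MPI_Intercomm_merge(inter, 0, &merged);
>>>>>>>>>>>>>>>>>>>>>>>>>     MPI_Comm_disconnect(&inter);
>>>>>>>>>>>>>>>>>>>>>>>>>     if (ring != MPI_COMM_SELF) MPI_Comm_free(&ring);
>>>>>>>>>>>>>>>>>>>>>>>>>     ring = merged;               /* now n+1 members */
>>>>>>>>>>>>>>>>>>>>>>>>> }
>>>>>>>>>>>>>>>>>>>>>>>>> /* the newcomer instead calls MPI_Comm_connect(port, ...,
>>>>>>>>>>>>>>>>>>>>>>>>>    MPI_COMM_SELF, &inter), merges with high=1, and then
>>>>>>>>>>>>>>>>>>>>>>>>>    joins the accept loop for the remaining arrivals */
>>>>>>>>>>>>>>>>>>>>>>>>> 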
>>>>>>>>>>>>>>>>>>>>>>>>> Another problem with this solution is that when I try
>>>>>>>>>>>>>>>>>>>>>>>>> to connect the 66th process, the root of the existing
>>>>>>>>>>>>>>>>>>>>>>>>> group segfaults on MPI_Comm_accept.
>>>>>>>>>>>>>>>>>>>>>>>>> Maybe it's my bug, but it's weird as everything works
>>>>>>>>>>>>>>>>>>>>>>>>> fine for at most 65 processes. Is there any limitation
>>>>>>>>>>>>>>>>>>>>>>>>> I don't know about?
>>>>>>>>>>>>>>>>>>>>>>>>> My last question is about MPI_COMM_WORLD. When I run
>>>>>>>>>>>>>>>>>>>>>>>>> my processes without mpirun, their MPI_COMM_WORLD is
>>>>>>>>>>>>>>>>>>>>>>>>> the same as MPI_COMM_SELF. Is there any way to change
>>>>>>>>>>>>>>>>>>>>>>>>> MPI_COMM_WORLD and set it to the intracommunicator
>>>>>>>>>>>>>>>>>>>>>>>>> that I've created?
>>>>>>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> Grzegorz Maj
>>>>>>>>>>>>>>>> <client.c><server.c>