No problem at all - glad it works!

On Jul 26, 2010, at 7:58 AM, Grzegorz Maj wrote:
> Hi,
> I'm very sorry, but the problem was on my side. My installation process was not always picking up the newest Open MPI sources, so it hadn't installed the version with the latest patch. Now I think everything works fine - I could run over 130 processes with no problems.
> I'm sorry again that I've wasted your time. And thank you for the patch.
>
> 2010/7/21 Ralph Castain <r...@open-mpi.org>:
>> We're having some trouble replicating this once my patches are applied. Can you send us your configure cmd? Just the output from "head config.log" will do for now.
>> Thanks!
>>
>> On Jul 20, 2010, at 9:09 AM, Grzegorz Maj wrote:
>>> My start script looks almost exactly the same as the one published by Edgar, i.e. the processes are started one by one with no delay.
>>>
>>> 2010/7/20 Ralph Castain <r...@open-mpi.org>:
>>>> Grzegorz: something occurred to me. When you start all these processes, how are you staggering their wireup? Are they flooding us, or are you time-shifting them a little?
>>>>
>>>> On Jul 19, 2010, at 10:32 AM, Edgar Gabriel wrote:
>>>>> Hm, so I am not sure how to approach this. First of all, the test case works for me. I used up to 80 clients, for both optimized and non-optimized builds. I ran the tests with the trunk (not with the 1.4 series, but the communicator code is identical in both cases). Clearly, the patch from Ralph is necessary to make it work.
>>>>> Additionally, I went through the communicator creation code for dynamic communicators trying to find spots that could create problems. The only place where I found the number 64 is the fortran-to-c mapping arrays (e.g. for communicators), where the initial size of the table is 64. I looked twice over the pointer-array code to see whether we could have a problem there (since it is a key piece of the cid allocation code for communicators), but I am fairly confident that it is correct.
>>>>> Note that we have other (non-dynamic) tests where comm_set is called 100,000 times, and the code per se does not seem to have a problem with being called that often. So I am not sure what else to look at.
>>>>> Edgar
>>>>>
>>>>> On 7/13/2010 8:42 PM, Ralph Castain wrote:
>>>>>> As far as I can tell, it appears the problem is somewhere in our communicator setup. The people knowledgeable in that area are going to look into it later this week.
>>>>>> I'm creating a ticket to track the problem and will copy you on it.
>>>>>>
>>>>>> On Jul 13, 2010, at 6:57 AM, Ralph Castain wrote:
>>>>>>>
>>>>>>> On Jul 13, 2010, at 3:36 AM, Grzegorz Maj wrote:
>>>>>>>> Bad news...
>>>>>>>> I've tried the latest patch with and without the prior one, but it hasn't changed anything. I've also tried using the old code with the OMPI_DPM_BASE_MAXJOBIDS constant changed to 80, but that didn't help either.
>>>>>>>> While looking through the sources of openmpi-1.4.2 I couldn't find any call of the function ompi_dpm_base_mark_dyncomm.
>>>>>>>
>>>>>>> It isn't directly called - it shows up in ompi_comm_set as ompi_dpm.mark_dyncomm. You were definitely overrunning that array, but I guess something else is also being hit. Have to look further...
>>>>>>>
>>>>>>>> 2010/7/12 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>> Just so you don't have to wait for the 1.4.3 release, here is the patch (it doesn't include the prior patch).
>>>>>>>>>
>>>>>>>>> On Jul 12, 2010, at 12:13 PM, Grzegorz Maj wrote:
>>>>>>>>>> 2010/7/12 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>> Dug around a bit and found the problem!!
>>>>>>>>>>> I have no idea who did this or why, but somebody set a limit of 64 separate jobids in the dynamic init called by ompi_comm_set, which builds the intercommunicator. Unfortunately, they hard-wired the array size but never check that size before adding to it.
>>>>>>>>>>> So after 64 calls to connect_accept, you are overwriting other areas of the code. As you found, hitting 66 causes it to segfault.
>>>>>>>>>>> I'll fix this on the developer's trunk (I'll also add that original patch to it). Rather than my searching this thread in detail, can you remind me what version you are using so I can patch it too?
>>>>>>>>>>
>>>>>>>>>> I'm using 1.4.2.
>>>>>>>>>> Thanks a lot, and I'm looking forward to the patch.
>>>>>>>>>>
>>>>>>>>>>> Thanks for your patience with this!
>>>>>>>>>>> Ralph
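For illustration, here is a minimal C sketch of the failure mode Ralph describes above: a jobid table with a hard-wired capacity of 64 entries that is appended to on each connect/accept without a bounds check. The names are hypothetical, not the actual Open MPI internals.

/* Hypothetical sketch of the bug pattern; these are not the real Open MPI names. */
#include <stdint.h>
#include <stdio.h>

#define MAX_DYN_JOBIDS 64                 /* hard-wired capacity */

static uint32_t dyn_jobids[MAX_DYN_JOBIDS];
static int num_dyn_jobids = 0;

/* Buggy pattern: after 64 connect/accept calls this writes past the array,
 * silently corrupting whatever happens to live next in memory. */
void mark_dyncomm_buggy(uint32_t jobid)
{
    dyn_jobids[num_dyn_jobids++] = jobid;
}

/* Fixed pattern: check the capacity before storing and report an error. */
int mark_dyncomm_fixed(uint32_t jobid)
{
    if (num_dyn_jobids >= MAX_DYN_JOBIDS) {
        fprintf(stderr, "dynamic jobid table full (%d entries)\n", MAX_DYN_JOBIDS);
        return -1;
    }
    dyn_jobids[num_dyn_jobids++] = jobid;
    return 0;
}

An overrun like this would also explain the observation further down the thread that rebuilding Open MPI with -g made the crash disappear: the out-of-bounds writes still happen, but with a different memory layout they land somewhere less immediately fatal.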
>>>>>>>>>>>
>>>>>>>>>>> On Jul 12, 2010, at 7:20 AM, Grzegorz Maj wrote:
>>>>>>>>>>>> 1024 is not the problem: changing it to 2048 hasn't changed anything.
>>>>>>>>>>>> Following your advice I've run my process under gdb. Unfortunately I didn't get anything more than:
>>>>>>>>>>>>
>>>>>>>>>>>> Program received signal SIGSEGV, Segmentation fault.
>>>>>>>>>>>> [Switching to Thread 0xf7e4c6c0 (LWP 20246)]
>>>>>>>>>>>> 0xf7f39905 in ompi_comm_set () from /home/gmaj/openmpi/lib/libmpi.so.0
>>>>>>>>>>>>
>>>>>>>>>>>> (gdb) bt
>>>>>>>>>>>> #0  0xf7f39905 in ompi_comm_set () from /home/gmaj/openmpi/lib/libmpi.so.0
>>>>>>>>>>>> #1  0xf7e3ba95 in connect_accept () from /home/gmaj/openmpi/lib/openmpi/mca_dpm_orte.so
>>>>>>>>>>>> #2  0xf7f62013 in PMPI_Comm_connect () from /home/gmaj/openmpi/lib/libmpi.so.0
>>>>>>>>>>>> #3  0x080489ed in main (argc=825832753, argv=0x34393638) at client.c:43
>>>>>>>>>>>>
>>>>>>>>>>>> What's more: when I added a breakpoint on ompi_comm_set in the 66th process and stepped a couple of instructions, one of the other processes crashed (as usual in ompi_comm_set) before the 66th did.
>>>>>>>>>>>> Finally I decided to recompile openmpi with the -g flag for gcc. In this case the 66-process issue is gone! I ran my applications exactly the same way as before (even without recompiling them) and successfully ran over 130 processes. When I switch back to the openmpi build without -g, it segfaults again.
>>>>>>>>>>>> Any ideas? I'm really confused.
>>>>>>>>>>>>
>>>>>>>>>>>> 2010/7/7 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>> I would guess the #files limit of 1024. However, if it behaves the same way when spread across multiple machines, I would suspect it is somewhere in your program itself. Given that the segfault is in your process, can you use gdb to look at the core file and see where and why it fails?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Jul 7, 2010, at 10:17 AM, Grzegorz Maj wrote:
>>>>>>>>>>>>>> 2010/7/7 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Jul 6, 2010, at 8:48 AM, Grzegorz Maj wrote:
>>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>> sorry for the late response, but I couldn't find free time to play with this. Finally I've applied the patch you prepared. I've launched my processes in the way you've described and I think it's working as you expected. None of my processes runs the orted daemon and they can perform MPI operations. Unfortunately I'm still hitting the 65-process issue :(
>>>>>>>>>>>>>>>> Maybe I'm doing something wrong. I attach my source code. If anybody could have a look at it, I would be grateful.
>>>>>>>>>>>>>>>> When I run that code with clients_count <= 65 everything works fine: all the processes create a common grid, exchange some information and disconnect. When I set clients_count > 65, the 66th process crashes in MPI_Comm_connect (segmentation fault).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I didn't have time to check the code, but my guess is that you are still hitting some kind of file descriptor or other limit. Check to see what your limits are - usually "ulimit" will tell you.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> My limits are:
>>>>>>>>>>>>>> time(seconds)      unlimited
>>>>>>>>>>>>>> file(blocks)       unlimited
>>>>>>>>>>>>>> data(kb)           unlimited
>>>>>>>>>>>>>> stack(kb)          10240
>>>>>>>>>>>>>> coredump(blocks)   0
>>>>>>>>>>>>>> memory(kb)         unlimited
>>>>>>>>>>>>>> locked memory(kb)  64
>>>>>>>>>>>>>> process            200704
>>>>>>>>>>>>>> nofiles            1024
>>>>>>>>>>>>>> vmemory(kb)        unlimited
>>>>>>>>>>>>>> locks              unlimited
>>>>>>>>>>>>>> Which one do you think could be responsible for that?
>>>>>>>>>>>>>> I was trying to run all 66 processes on one machine, and also to spread them across several machines, and it always crashes the same way on the 66th process.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Another thing I would like to know is whether it's normal that any of my processes calling MPI_Comm_connect or MPI_Comm_accept, while the other side is not ready, eats up a full CPU.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes - the waiting process is polling in a tight loop waiting for the connection to be made.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Any help would be appreciated,
>>>>>>>>>>>>>>>> Grzegorz Maj
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2010/4/24 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>>>>> Actually, OMPI is distributed with a daemon that does pretty much what you want. Check out "man ompi-server". I originally wrote that code to support cross-application MPI publish/subscribe operations, but we can utilize it here too. Have to blame me for not making it more publicly known.
>>>>>>>>>>>>>>>>> The attached patch upgrades ompi-server and modifies the singleton startup to provide your desired support. This solution works in the following manner:
>>>>>>>>>>>>>>>>> 1. Launch "ompi-server -report-uri <filename>". This starts a persistent daemon called "ompi-server" that acts as a rendezvous point for independently started applications. The problem with starting different applications and wanting them to MPI connect/accept lies in the need to have the applications find each other. If they can't discover contact info for the other app, then they can't wire up their interconnects. The "ompi-server" tool provides that rendezvous point. I don't like that comm_accept segfaulted - it should have just errored out.
>>>>>>>>>>>>>>>>> 2. Set OMPI_MCA_orte_server=file:<filename> in the environment where you will start your processes. This will allow your singleton processes to find the ompi-server. I also automatically set the envar that connects the MPI publish/subscribe system for you.
>>>>>>>>>>>>>>>>> 3. Run your processes. As they think they are singletons, they will detect the presence of the above envar and automatically connect themselves to the "ompi-server" daemon. This provides each process with the ability to perform any MPI-2 operation.
>>>>>>>>>>>>>>>>> I tested this on my machines and it worked, so hopefully it will meet your needs. You only need to run one "ompi-server", period, so long as you locate it where all of the processes can find the contact file and can open a TCP socket to the daemon. There is a way to knit multiple ompi-servers into a broader network (e.g., to connect processes that cannot directly access a server due to network segmentation), but it's a tad tricky - let me know if you require it and I'll try to help.
>>>>>>>>>>>>>>>>> If you have trouble wiring them all into a single communicator, you might ask separately about that and see if one of our MPI experts can provide advice (I'm just the RTE grunt).
>>>>>>>>>>>>>>>>> HTH - let me know how this works for you and I'll incorporate it into future OMPI releases.
>>>>>>>>>>>>>>>>> Ralph
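To make steps 1-3 concrete, here is a minimal sketch (not the client.c/server.c attached in this thread) of two independently started singletons finding each other through ompi-server via the standard MPI-2 name-publishing calls. The service name "grid" is made up, and a real client would have to retry MPI_Lookup_name until the accepting side has published the name.

/* Sketch only: run one instance with the argument "server" and the others
 * without it, each as a plain singleton (no mpirun), with
 * OMPI_MCA_orte_server=file:<filename> set as described in step 2 above. */
#include <mpi.h>
#include <string.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm inter;

    MPI_Init(&argc, &argv);

    if (argc > 1 && strcmp(argv[1], "server") == 0) {
        /* Accepting side: open a port and publish it at the rendezvous point. */
        MPI_Open_port(MPI_INFO_NULL, port);
        MPI_Publish_name("grid", MPI_INFO_NULL, port);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        MPI_Unpublish_name("grid", MPI_INFO_NULL, port);
        MPI_Close_port(port);
    } else {
        /* Connecting side: look the port up via ompi-server and connect. */
        MPI_Lookup_name("grid", MPI_INFO_NULL, port);
        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    }

    /* ... use the intercommunicator ... */
    MPI_Comm_disconnect(&inter);
    MPI_Finalize();
    return 0;
}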
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Apr 24, 2010, at 1:49 AM, Krzysztof Zarzycki wrote:
>>>>>>>>>>>>>>>>> Hi Ralph,
>>>>>>>>>>>>>>>>> I'm Krzysztof and I'm working with Grzegorz Maj on this small project/experiment.
>>>>>>>>>>>>>>>>> We definitely would like to give your patch a try. But could you please explain your solution a little more?
>>>>>>>>>>>>>>>>> You would still like to start one mpirun per MPI grid, and then have the processes started by us join that MPI comm? That is a good solution, of course. But it would be especially preferable to have one daemon running persistently on our "entry" machine that can handle several MPI grid starts. Can your patch help us this way too?
>>>>>>>>>>>>>>>>> Thanks for your help!
>>>>>>>>>>>>>>>>> Krzysztof
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 24 April 2010 03:51, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>>>>>>>> In thinking about this, my proposed solution won't entirely fix the problem - you'll still wind up with all those daemons. I believe I can resolve that one as well, but it would require a patch.
>>>>>>>>>>>>>>>>>> Would you like me to send you something you could try? It might take a couple of iterations to get it right...
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Apr 23, 2010, at 12:12 PM, Ralph Castain wrote:
>>>>>>>>>>>>>>>>>>> Hmmm... I -think- this will work, but I cannot guarantee it:
>>>>>>>>>>>>>>>>>>> 1. Launch one process (it can just be a spinner) using mpirun that includes the following option:
>>>>>>>>>>>>>>>>>>> mpirun -report-uri file
>>>>>>>>>>>>>>>>>>> where "file" is some filename that mpirun can create and put its contact info into. This can be a relative or absolute path. This process must remain alive throughout your application - it doesn't matter what it does. Its purpose is solely to keep mpirun alive.
>>>>>>>>>>>>>>>>>>> 2. Set OMPI_MCA_dpm_orte_server=FILE:file in your environment, where "file" is the filename given above. This will tell your processes how to find mpirun, which is acting as a meeting place to handle the connect/accept operations.
>>>>>>>>>>>>>>>>>>> Now run your processes, and have them connect/accept to each other.
>>>>>>>>>>>>>>>>>>> The reason I cannot guarantee this will work is that these processes will all have the same rank && name, since they all start as singletons. Hence, connect/accept is likely to fail.
>>>>>>>>>>>>>>>>>>> But it -might- work, so you might want to give it a try.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Apr 23, 2010, at 8:10 AM, Grzegorz Maj wrote:
>>>>>>>>>>>>>>>>>>>> To be more precise: by 'server process' I mean some process that I could run once on my system and that could help in creating those groups.
>>>>>>>>>>>>>>>>>>>> My typical scenario is:
>>>>>>>>>>>>>>>>>>>> 1. run N separate processes, each without mpirun
>>>>>>>>>>>>>>>>>>>> 2. connect them into an MPI group
>>>>>>>>>>>>>>>>>>>> 3. do some job
>>>>>>>>>>>>>>>>>>>> 4. exit all N processes
>>>>>>>>>>>>>>>>>>>> 5. goto 1
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> 2010/4/23 Grzegorz Maj <ma...@wp.pl>:
>>>>>>>>>>>>>>>>>>>>> Thank you Ralph for your explanation.
>>>>>>>>>>>>>>>>>>>>> And, apart from that descriptors issue, is there any other way to solve my problem, i.e. to run a number of processes separately, without mpirun, and then collect them into an MPI intracomm group? If I needed to run some 'server process' for this task (even using mpirun), that would be OK. Any ideas?
>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>> Grzegorz Maj
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> 2010/4/18 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>>>>>>>>>> Okay, but here is the problem. If you don't use mpirun, and are not operating in an environment we support for "direct" launch (i.e., starting processes outside of mpirun), then every one of those processes thinks it is a singleton - yes?
>>>>>>>>>>>>>>>>>>>>>> What you may not realize is that each singleton immediately fork/exec's an orted daemon that is configured to behave just like mpirun. This is required in order to support MPI-2 operations such as MPI_Comm_spawn, MPI_Comm_connect/accept, etc.
>>>>>>>>>>>>>>>>>>>>>> So if you launch 64 processes that think they are singletons, then you have 64 copies of orted running as well. This eats up a lot of file descriptors, which is probably why you are hitting this 65-process limit - your system is probably running out of file descriptors. You might check your system limits and see if you can get them revised upward.
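One way to act on that advice from inside the process itself, shown here only as a sketch, is to query the open-files limit (the "nofiles 1024" value quoted above) and raise the soft limit up to the hard limit with getrlimit/setrlimit; raising the hard limit still requires administrator action (e.g. limits.conf), and as the newer messages at the top of this thread show, the descriptor limit turned out not to be the real culprit here.

/* Sketch: inspect and, within the hard limit, raise this process's
 * file-descriptor limit. Not taken from the code attached in this thread. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("open files: soft=%lu hard=%lu\n",
           (unsigned long)rl.rlim_cur, (unsigned long)rl.rlim_max);

    if (rl.rlim_cur < rl.rlim_max) {
        rl.rlim_cur = rl.rlim_max;     /* bump the soft limit to the hard limit */
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
            perror("setrlimit");
            return 1;
        }
    }
    return 0;
}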
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Apr 17, 2010, at 4:24 PM, Grzegorz Maj wrote:
>>>>>>>>>>>>>>>>>>>>>>> Yes, I know. The problem is that I need to use a special way of running my processes, provided by the environment in which I'm working, and unfortunately I can't use mpirun.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> 2010/4/18 Ralph Castain <r...@open-mpi.org>:
>>>>>>>>>>>>>>>>>>>>>>>> Guess I don't understand why you can't use mpirun - all it does is start things, provide a means to forward io, etc. It mainly sits there quietly without using any cpu unless required to support the job.
>>>>>>>>>>>>>>>>>>>>>>>> Sounds like it would solve your problem. Otherwise, I know of no way to get all these processes into comm_world.
>>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>> On Apr 17, 2010, at 2:27 PM, Grzegorz Maj wrote:
>>>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>>> I'd like to dynamically create a group of processes communicating via MPI. Those processes need to be run without mpirun and need to create an intracommunicator after startup. Any ideas how to do this efficiently?
>>>>>>>>>>>>>>>>>>>>>>>>> I came up with a solution in which the processes connect one by one using MPI_Comm_connect, but unfortunately all the processes that are already in the group need to call MPI_Comm_accept. This means that when the n-th process wants to connect, I need to collect all n-1 processes on the MPI_Comm_accept call. After I run about 40 processes, every subsequent call takes more and more time, which I'd like to avoid.
>>>>>>>>>>>>>>>>>>>>>>>>> Another problem with this solution is that when I try to connect the 66th process, the root of the existing group segfaults in MPI_Comm_accept. Maybe it's my bug, but it's weird, as everything works fine for at most 65 processes. Is there some limitation I don't know about?
>>>>>>>>>>>>>>>>>>>>>>>>> My last question is about MPI_COMM_WORLD. When I run my processes without mpirun, their MPI_COMM_WORLD is the same as MPI_COMM_SELF. Is there any way to change MPI_COMM_WORLD and set it to the intracommunicator that I've created?
>>>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>>>> Grzegorz Maj
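The incremental wire-up described in the original question above can be expressed with MPI_Comm_accept, MPI_Comm_connect and MPI_Intercomm_merge. The sketch below is illustrative only (it is not the client.c/server.c attached earlier in the thread): error handling is omitted, and how the port name reaches each newcomer (a shared file, or MPI_Publish_name/MPI_Lookup_name through ompi-server) is left out.

/* Sketch of growing an intracommunicator one process at a time. */
#include <mpi.h>

/* Called collectively by every member of the existing group; for the very
 * first process the group starts out as MPI_COMM_SELF. */
void accept_one(char *port, MPI_Comm *group)
{
    MPI_Comm inter, merged;

    MPI_Comm_accept(port, MPI_INFO_NULL, 0, *group, &inter); /* collective over *group */
    MPI_Intercomm_merge(inter, 0, &merged);                  /* existing members rank low */
    MPI_Comm_disconnect(&inter);
    if (*group != MPI_COMM_SELF) {
        MPI_Comm_free(group);
    }
    *group = merged;
}

/* Called by a newly started singleton that wants to join the group. */
void join_group(char *port, MPI_Comm *group)
{
    MPI_Comm inter, merged;

    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
    MPI_Intercomm_merge(inter, 1, &merged);                  /* the newcomer ranks high */
    MPI_Comm_disconnect(&inter);
    *group = merged;
}

Because MPI_Comm_accept is collective over the current group, every member really does have to take part in each round, which is why each join gets slower as the group grows; one way to reduce that cost is to let a batch of newcomers first form their own intracommunicator and then join the existing group with a single connect/accept and merge per batch.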
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users