Re: [OMPI users] How to construct a datatype over two different arrays?
Hi again, Let me clarify the context of the problem. I'm implementing an MPI piggyback mechanism that should allow attaching extra data to any MPI message. The idea is to wrap MPI communication calls with the PMPI interface (or with dynamic instrumentation or whatever) and add/receive extra data in an inexpensive way. The best solution I have found so far is dynamic datatype wrapping. That is, when a user calls MPI_Send (datatype, count), I dynamically create a new structure type that contains an array [count] of datatype plus the extra data. To avoid copying the original send buffer I use absolute addresses to define the displacements in the structure. This works fine for all P2P calls and MPI_Bcast. And it definitely has performance benefits when compared to copying buffers or sending an additional message in a different communicator. Or would you expect something different? The only problem is collective calls like MPI_Gather, where a root process receives an array of data items. There is no problem wrapping the message on the sender side (for each task), but the question is how to define a datatype that points both to the original receive buffer and to an extra buffer for the piggybacked data AND has an adequate extent to work as an array element. The real problem is that a structure datatype { original data, extra data } does not have a constant displacement between the original data and the extra data (e.g., consider original data = the receive buffer in MPI_Gather and extra data = an array of ints somewhere else in memory). So it cannot be directly used as an array element datatype. Any solution? It could be complex, I don't mind ;) On 11/1/07, George Bosilca wrote: > > The MPI standard defines the lower bound and the upper bound for > similar problems. However, even with all the functions in the MPI > standard we cannot describe all types of data. There is always a > solution, but sometimes one has to ask if the performance gain is > worth the complexity introduced. 
As I said there is always a solution. In fact there are 2 solutions, > one somehow optimal, the other ... as bad as you can imagine. > > The bad approach: > 1. Use an MPI_Type_struct to create exactly what you want, element > by element (i.e., a single pair). This can work in all cases. 2. If sizeof(int) == sizeof(double) then the displacement inside > each tuple (double_i, int_i) is constant. Therefore, you can start by > creating one "single element" type and then use for each send the > correct displacement in the array (added to the send buffer, > respectively to the receive one). > > george. > > On Oct 31, 2007, at 1:40 PM, Oleg Morajko wrote: > > > Hello, > > > > I have the following problem. There are two arrays somewhere in the > > program: > > > > double weights [MAX_SIZE]; > > ... > > int values [MAX_SIZE]; > > ... > > > > I need to be able to send a single pair { weights [i], values [i] } > > with a single MPI_Send call, or receive it directly into both arrays > > at a given index i. How can I define a datatype that spans this > > pair over both arrays? > > > > The only additional constraint is the fact that the memory location > > of both arrays is fixed and cannot be changed, and I should avoid > > extra copies. > > > > Is it possible? > > > > Any help welcome, > > Oleg Morajko > > > > ___ > > users mailing list > > us...@open-mpi.org > > http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] "Hostfile" on Multicore Node?
Also see: http://www.open-mpi.org/faq/?category=tuning#paffinity-defs http://www.open-mpi.org/faq/?category=tuning#using-paffinity and http://www.open-mpi.org/projects/plpa/ On Oct 31, 2007, at 11:55 AM, ky...@neuralbs.com wrote: It will indeed, but you can have better control over the processor assignment by using processor affinity (and also get better performance), as seen here: http://www.nic.uoregon.edu/tau-wiki/Guide:Opteron_NUMA_Analysis http://www-128.ibm.com/developerworks/linux/library/l-affinity.html Eric I think if you boot the mpi on the host machine, and then run your program with 8 threads (mpirun -np 8 ), the operating system will automatically distribute it to the cores. Jeff Pummill wrote: I am doing some testing on a variety of 8-core nodes in which I just want to execute a couple of executables and have them distributed to the available cores without overlapping. Typically, this would be done with a parameter like /-machinefile machines/, but I have no idea what names to put into the /machines/ file, as this is a single node with two quad-core CPUs. As I am launching the jobs sans scheduler, I would think I need to specify which cores to run on to keep from overscheduling some cores while others receive nothing to do at all. Simple suggestions? Maybe Open MPI takes care of this detail for me? Thanks! Jeff Pummill ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems
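As a concrete illustration of the FAQ entries above: in the Open MPI 1.2 series, affinity can be switched on with a single MCA parameter. A sketch only; the program name is a placeholder, and the parameter name should be confirmed against your own build with ompi_info before relying on it.

```shell
# Ask Open MPI to bind each MPI process to its own processor.
# Only sensible when the node is not oversubscribed.
mpirun --mca mpi_paffinity_alone 1 -np 8 ./my_program

# Check that the parameter exists in this installation:
ompi_info --param mpi all | grep paffinity
```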
Re: [OMPI users] problems running parallel program
On Oct 31, 2007, at 11:47 AM, Karsten Bolding wrote: Does OpenMPI detect if processes share memory and hence do not communicate via sockets? Yes. But if you lie to Open MPI and tell it that there are more processors than there really are, we may not recognize that the machine is oversubscribed and therefore not call yield(). Hence, performance will *really* go down the drain. And if I don't give any hints - just start my 13 jobs on 4 cores where the load balancing is done based on CPU-requirement (this could also be on 4 single-core processors where jobs can't be swapped) - is that in principle OK? Yes, it should be; Open MPI should detect that you're oversubscribed and set itself to yield() during polling in the MPI library. -- Jeff Squyres Cisco Systems
Re: [OMPI users] Too many open files Error
For recent versions of Open MPI you can use the btl_tcp_disable_family MCA parameter to disable IPv6 at runtime. Unfortunately, there is no similar option allowing you to disable IPv6 for the runtime environment. george. On Oct 31, 2007, at 6:55 PM, Tim Prins wrote: Hi Clement, I seem to recall (though this may have changed) that if a system supports IPv6, we may open both IPv4 and IPv6 sockets. This can be worked around by configuring Open MPI with --disable-ipv6 Other than that, I don't know of anything else to do except raise the limit for the number of open files. I know it doesn't help you now, but we are actively working on this problem for Open MPI 1.3. This version will introduce a tree routing scheme which will dramatically reduce the number of open sockets that the runtime system needs. Hope this helps, Tim On Tuesday 30 October 2007 07:15:42 pm Clement Kam Man Chu wrote: Hi, I got a "Too many open files" error while running over 1024 processes on 512 cpus. I found the same error on http://www.open-mpi.org/community/lists/users/2006/11/2216.php, but I would like to know whether there is another solution instead of changing the limit on descriptors. Changing the descriptor limit requires root access and a system restart, which I want to avoid. Regards, Clement
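To make both workarounds concrete, here is a sketch. The program name is a placeholder, and the value accepted by btl_tcp_disable_family is an assumption on my part (I believe it takes the address family number, 6 for IPv6); verify it with ompi_info on your version before using it.

```shell
# Build-time: compile out IPv6 support entirely
./configure --disable-ipv6
make all install

# Run-time (recent versions only): stop the MPI-layer TCP BTL from
# opening IPv6 sockets; note this does not affect the sockets opened
# by the runtime environment itself
mpirun --mca btl_tcp_disable_family 6 -np 1024 ./my_program
```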
Re: [OMPI users] How to construct a datatype over two different arrays?
The MPI standard defines the lower bound and the upper bound for similar problems. However, even with all the functions in the MPI standard we cannot describe all types of data. There is always a solution, but sometimes one has to ask if the performance gain is worth the complexity introduced. As I said there is always a solution. In fact there are 2 solutions, one somehow optimal, the other ... as bad as you can imagine. The bad approach: 1. Use an MPI_Type_struct to create exactly what you want, element by element (i.e., a single pair). This can work in all cases. 2. If sizeof(int) == sizeof(double) then the displacement inside each tuple (double_i, int_i) is constant. Therefore, you can start by creating one "single element" type and then use for each send the correct displacement in the array (added to the send buffer, respectively to the receive one). george. On Oct 31, 2007, at 1:40 PM, Oleg Morajko wrote: Hello, I have the following problem. There are two arrays somewhere in the program: double weights [MAX_SIZE]; ... int values [MAX_SIZE]; ... I need to be able to send a single pair { weights [i], values [i] } with a single MPI_Send call, or receive it directly into both arrays at a given index i. How can I define a datatype that spans this pair over both arrays? The only additional constraint is the fact that the memory location of both arrays is fixed and cannot be changed, and I should avoid extra copies. Is it possible? Any help welcome, Oleg Morajko ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Too many open files Error
Hi Clement, I seem to recall (though this may have changed) that if a system supports IPv6, we may open both IPv4 and IPv6 sockets. This can be worked around by configuring Open MPI with --disable-ipv6 Other than that, I don't know of anything else to do except raise the limit for the number of open files. I know it doesn't help you now, but we are actively working on this problem for Open MPI 1.3. This version will introduce a tree routing scheme which will dramatically reduce the number of open sockets that the runtime system needs. Hope this helps, Tim On Tuesday 30 October 2007 07:15:42 pm Clement Kam Man Chu wrote: > Hi, > > I got a "Too many open files" error while running over 1024 processes > on 512 cpus. I found the same error on > http://www.open-mpi.org/community/lists/users/2006/11/2216.php, but I > would like to know whether there is another solution instead of changing > the limit on descriptors. Changing the descriptor limit requires root access > and a system restart, which I want to avoid. > > Regards, > Clement
Re: [OMPI users] mpirun udapl problem
Hi Jon, Just to make sure, running 'ompi_info' shows that you have the udapl btl installed? Tim On Wednesday 31 October 2007 06:11:39 pm Jon Mason wrote: > I am having a bit of a problem getting udapl to work via mpirun (over > open-mpi, obviously). I am running a basic pingpong test and I get the > following error. > > # mpirun --n 2 --host vic12-10g,vic20-10g -mca btl udapl,self > /usr/mpi/gcc/open*/tests/IMB*/IMB-MPI1 pingpong > -- > Process 0.1.1 is unable to reach 0.1.0 for MPI communication. > If you specified the use of a BTL component, you may have > forgotten a component (such as "self") in the list of > usable components. > -- > -- > It looks like MPI_INIT failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can > fail during MPI_INIT; some of which are due to configuration or > environment > problems. This failure appears to be an internal failure; here's some > additional information (which may only be relevant to an Open MPI > developer): > > PML add procs failed > --> Returned "Unreachable" (-12) instead of "Success" (0) > -- > *** An error occurred in MPI_Init > *** before MPI was initialized > *** MPI_ERRORS_ARE_FATAL (goodbye) > -- > Process 0.1.0 is unable to reach 0.1.1 for MPI communication. > If you specified the use of a BTL component, you may have > forgotten a component (such as "self") in the list of > usable components. > -- > -- > It looks like MPI_INIT failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can > fail during MPI_INIT; some of which are due to configuration or > environment > problems. 
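A quick way to perform Tim's check, as a sketch (the grep pattern may need adjusting to your output format):

```shell
# List the BTL components in this build; "udapl" should be among them
ompi_info | grep btl

# Or query the udapl BTL's parameters directly; no output suggests the
# component is not installed
ompi_info --param btl udapl
```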
This failure appears to be an internal failure; here's some > additional information (which may only be relevant to an Open MPI > developer): > > PML add procs failed > --> Returned "Unreachable" (-12) instead of "Success" (0) > -- > *** An error occurred in MPI_Init > *** before MPI was initialized > *** MPI_ERRORS_ARE_FATAL (goodbye) > > > > The command is successful if udapl is replaced with tcp or openib. So I > think my setup is correct. Also, dapltest successfully completes > without any problems over IB or iWARP. > > Any thoughts or suggestions would be greatly appreciated. > > Thanks, > Jon
[OMPI users] mpirun udapl problem
I am having a bit of a problem getting udapl to work via mpirun (over open-mpi, obviously). I am running a basic pingpong test and I get the following error. # mpirun --n 2 --host vic12-10g,vic20-10g -mca btl udapl,self /usr/mpi/gcc/open*/tests/IMB*/IMB-MPI1 pingpong -- Process 0.1.1 is unable to reach 0.1.0 for MPI communication. If you specified the use of a BTL component, you may have forgotten a component (such as "self") in the list of usable components. -- -- It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): PML add procs failed --> Returned "Unreachable" (-12) instead of "Success" (0) -- *** An error occurred in MPI_Init *** before MPI was initialized *** MPI_ERRORS_ARE_FATAL (goodbye) -- Process 0.1.0 is unable to reach 0.1.1 for MPI communication. If you specified the use of a BTL component, you may have forgotten a component (such as "self") in the list of usable components. -- -- It looks like MPI_INIT failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during MPI_INIT; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer): PML add procs failed --> Returned "Unreachable" (-12) instead of "Success" (0) -- *** An error occurred in MPI_Init *** before MPI was initialized *** MPI_ERRORS_ARE_FATAL (goodbye) The command is successful if udapl is replaced with tcp or openib. So I think my setup is correct. Also, dapltest successfully completes without any problems over IB or iWARP. Any thoughts or suggestions would be greatly appreciated. 
Thanks, Jon
Re: [OMPI users] How to construct a datatype over two different arrays?
I'm not sure if you understood my question. The case is not trivial at all, or I am missing something important. Try to design this derived datatype and you will understand my point. Thanks anyway. On 10/31/07, Amit Kumar Saha wrote: > > > > On 10/31/07, Oleg Morajko wrote: > > > > Hello, > > > > I have the following problem. There are two arrays somewhere in the > > program: > > > > double weights [MAX_SIZE]; > > ... > > int values [MAX_SIZE]; > > ... > > > > I need to be able to send a single pair { weights [i], values [i] } with > > a single MPI_Send call, or receive it directly into both arrays at a given > > index i. How can I define a datatype that spans this pair over both arrays? > > > Did you have a look at topics like MPI derived data types and MPI packing? > Maybe they can help. > > -- > Amit Kumar Saha > *NetBeans Community Docs > Contribution Coordinator* > me blogs@ http://amitksaha.blogspot.com > URL: http://amitsaha.in.googlepages.com
Re: [OMPI users] "Hostfile" on Multicore Node?
It will indeed, but you can have better control over the processor assignment by using processor affinity (and also get better performance), as seen here: http://www.nic.uoregon.edu/tau-wiki/Guide:Opteron_NUMA_Analysis http://www-128.ibm.com/developerworks/linux/library/l-affinity.html Eric > I think if you boot the mpi on the host machine, and then run your > program with 8 threads (mpirun -np 8 ), the operating > system will automatically distribute it to the cores. > > Jeff Pummill wrote: >> I am doing some testing on a variety of 8-core nodes in which I just >> want to execute a couple of executables and have them distributed to >> the available cores without overlapping. Typically, this would be done >> with a parameter like /-machinefile machines/, but I have no idea what >> names to put into the /machines/ file, as this is a single node with >> two quad-core CPUs. As I am launching the jobs sans scheduler, I would >> think I need to specify which cores to run on to keep from >> overscheduling some cores while others receive nothing to do at all. >> >> Simple suggestions? Maybe Open MPI takes care of this detail for me? >> >> Thanks! >> >> Jeff Pummill >> >> >> ___ >> users mailing list >> us...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] problems running parallel program
On Wed, Oct 31, 2007 at 11:13:46 -0700, Jeff Squyres wrote: > On Oct 31, 2007, at 10:45 AM, Karsten Bolding wrote: > > > In a different thread I read about a performance penalty in OpenMPI if > > more than one MPI-process is running on one processor/core - is that > > correct? I mean having max-slots>4 on a quad-core machine. > > Open MPI polls for message passing progress (to get the absolute > minimum latency -- it can be faster than blocking/waking up). If you > overload a machine, Open MPI will usually detect that and know to > call yield() in the middle of its polling so that other processes can > get swapped in and make progress. Does OpenMPI detect if processes share memory and hence do not communicate via sockets? > > But if you lie to Open MPI and tell it that there are more processors > than there really are, we may not recognize that the machine is > oversubscribed and therefore not call yield(). Hence, performance > will *really* go down the drain. And if I don't give any hints - just start my 13 jobs on 4 cores where the load balancing is done based on CPU-requirement (this could also be on 4 single-core processors where jobs can't be swapped) - is that in principle OK? > > -- > Jeff Squyres > Cisco Systems > Karsten -- Karsten Bolding, Bolding & Burchard Hydrodynamics, Strandgyden 25, DK-5466 Asperup, Denmark. Phone: +45 64422058, Fax: +45 64422068, Email: kars...@bolding-burchard.com, http://www.findvej.dk/Strandgyden25,5466,11,3
Re: [OMPI users] "Hostfile" on Multicore Node?
I think if you boot the mpi on the host machine, and then run your program with 8 threads (mpirun -np 8 ), the operating system will automatically distribute it to the cores. Jeff Pummill wrote: I am doing some testing on a variety of 8-core nodes in which I just want to execute a couple of executables and have them distributed to the available cores without overlapping. Typically, this would be done with a parameter like /-machinefile machines/, but I have no idea what names to put into the /machines/ file, as this is a single node with two quad-core CPUs. As I am launching the jobs sans scheduler, I would think I need to specify which cores to run on to keep from overscheduling some cores while others receive nothing to do at all. Simple suggestions? Maybe Open MPI takes care of this detail for me? Thanks! Jeff Pummill
Re: [OMPI users] problems running parallel program
On Oct 31, 2007, at 10:45 AM, Karsten Bolding wrote: In a different thread I read about a performance penalty in OpenMPI if more than one MPI-process is running on one processor/core - is that correct? I mean having max-slots>4 on a quad-core machine. Open MPI polls for message passing progress (to get the absolute minimum latency -- it can be faster than blocking/waking up). If you overload a machine, Open MPI will usually detect that and know to call yield() in the middle of its polling so that other processes can get swapped in and make progress. But if you lie to Open MPI and tell it that there are more processors than there really are, we may not recognize that the machine is oversubscribed and therefore not call yield(). Hence, performance will *really* go down the drain. -- Jeff Squyres Cisco Systems
Re: [OMPI users] How to construct a datatype over two different arrays?
On 10/31/07, Oleg Morajko wrote: > > Hello, > > I have the following problem. There are two arrays somewhere in the > program: > > double weights [MAX_SIZE]; > ... > int values [MAX_SIZE]; > ... > > I need to be able to send a single pair { weights [i], values [i] } with a > single MPI_Send call, or receive it directly into both arrays at a given > index i. How can I define a datatype that spans this pair over both arrays? Did you have a look at topics like MPI derived data types and MPI packing? Maybe they can help. > > -- Amit Kumar Saha *NetBeans Community Docs Contribution Coordinator* me blogs@ http://amitksaha.blogspot.com URL: http://amitsaha.in.googlepages.com
Re: [OMPI users] problems running parallel program
On Wed, Oct 31, 2007 at 09:27:48 -0700, Jeff Squyres wrote: > I think you should use the MPI_PROC_NULL constant itself, not a hard- > coded value of -1. The value -1 was in the neighbor specification file. > > Specifically: the value of MPI_PROC_NULL is not set in the MPI > standard -- so implementations are free to choose whatever value they > want. In Open MPI, MPI_PROC_NULL is -2. So using -1 is an illegal > destination, and you therefore get the error that you described. Now I check whether a neighbor is -1 and, if so, set it to MPI_PROC_NULL - and voila, everything is OK. > > In a different thread I read about a performance penalty in OpenMPI if more than one MPI-process is running on one processor/core - is that correct? I mean having max-slots>4 on a quad-core machine. Karsten -- Karsten Bolding, Bolding & Burchard Hydrodynamics, Strandgyden 25, DK-5466 Asperup, Denmark. Phone: +45 64422058, Fax: +45 64422068, Email: kars...@bolding-burchard.com, http://www.findvej.dk/Strandgyden25,5466,11,3
[OMPI users] How to construct a datatype over two different arrays?
Hello, I have the following problem. There are two arrays somewhere in the program: double weights [MAX_SIZE]; ... int values [MAX_SIZE]; ... I need to be able to send a single pair { weights [i], values [i] } with a single MPI_Send call, or receive it directly into both arrays at a given index i. How can I define a datatype that spans this pair over both arrays? The only additional constraint is the fact that the memory location of both arrays is fixed and cannot be changed, and I should avoid extra copies. Is it possible? Any help welcome, Oleg Morajko
Re: [OMPI users] problems running parallel program
I think you should use the MPI_PROC_NULL constant itself, not a hard-coded value of -1. Specifically: the value of MPI_PROC_NULL is not set in the MPI standard -- so implementations are free to choose whatever value they want. In Open MPI, MPI_PROC_NULL is -2. So using -1 is an illegal destination, and you therefore get the error that you described. On Oct 31, 2007, at 9:00 AM, Karsten Bolding wrote: Hello I've just introduced the possibility to use OpenMPI instead of MPICH in an ocean model. The code is quite well tested and has been run in various parallel setups by various groups. I've compiled the program using mpif90 (instead of ifort). When I run I get the error shown at the end of this mail. As you can see, all 13 jobs are started - but then ... One problem with ocean models using domain decomposition in relation to load balancing is that the computational burden of the equal-sized domains is not the same (the different domains have different land-fractions). To overcome this, a matlab tool has been developed that allows for assigning more sub-domains to one processor/core based on the sum of water-points in the sub-domains. Attached is a figure showing the actual setup in this case. The neighbor relation is read from a file produced by said matlab-tool. Non-existing neighbors are set to -1 - MPI_PROC_NULL in MPICH. The setup is run on a quad-core machine for testing purposes only. Any ideas what goes wrong? 
error == kb@gate:~/DK/setups/north_sea_fine$ mpirun -np 13 bin/getm_prod_IFORT.96x96 Process0 of 13 is alive on gate [gate:18564] *** An error occurred in MPI_Isend [gate:18564] *** on communicator MPI_COMM_WORLD [gate:18564] *** MPI_ERR_RANK: invalid rank [gate:18564] *** MPI_ERRORS_ARE_FATAL (goodbye) Process1 of 13 is alive on gate [gate:18565] *** An error occurred in MPI_Isend [gate:18565] *** on communicator MPI_COMM_WORLD [gate:18565] *** MPI_ERR_RANK: invalid rank [gate:18565] *** MPI_ERRORS_ARE_FATAL (goodbye) Process2 of 13 is alive on gate Process3 of 13 is alive on gate [gate:18567] *** An error occurred in MPI_Isend [gate:18567] *** on communicator MPI_COMM_WORLD [gate:18567] *** MPI_ERR_RANK: invalid rank [gate:18567] *** MPI_ERRORS_ARE_FATAL (goodbye) Process4 of 13 is alive on gate [gate:18568] *** An error occurred in MPI_Isend [gate:18568] *** on communicator MPI_COMM_WORLD [gate:18568] *** MPI_ERR_RANK: invalid rank [gate:18568] *** MPI_ERRORS_ARE_FATAL (goodbye) Process5 of 13 is alive on gate [gate:18569] *** An error occurred in MPI_Isend [gate:18569] *** on communicator MPI_COMM_WORLD [gate:18569] *** MPI_ERR_RANK: invalid rank [gate:18569] *** MPI_ERRORS_ARE_FATAL (goodbye) Process7 of 13 is alive on gate [gate:18571] *** An error occurred in MPI_Isend [gate:18571] *** on communicator MPI_COMM_WORLD [gate:18571] *** MPI_ERR_RANK: invalid rank [gate:18571] *** MPI_ERRORS_ARE_FATAL (goodbye) Process8 of 13 is alive on gate Process9 of 13 is alive on gate [gate:18573] *** An error occurred in MPI_Isend [gate:18573] *** on communicator MPI_COMM_WORLD [gate:18573] *** MPI_ERR_RANK: invalid rank [gate:18573] *** MPI_ERRORS_ARE_FATAL (goodbye) Process 10 of 13 is alive on gate [gate:18574] *** An error occurred in MPI_Isend [gate:18574] *** on communicator MPI_COMM_WORLD [gate:18574] *** MPI_ERR_RANK: invalid rank [gate:18574] *** MPI_ERRORS_ARE_FATAL (goodbye) Process 11 of 13 is alive on gate Process 12 of 13 is alive on gate [gate:18576] *** An 
error occurred in MPI_Isend [gate:18576] *** on communicator MPI_COMM_WORLD [gate:18576] *** MPI_ERR_RANK: invalid rank [gate:18576] *** MPI_ERRORS_ARE_FATAL (goodbye) [gate:18566] *** An error occurred in MPI_Isend [gate:18566] *** on communicator MPI_COMM_WORLD [gate:18566] *** MPI_ERR_RANK: invalid rank [gate:18566] *** MPI_ERRORS_ARE_FATAL (goodbye) [gate:18572] *** An error occurred in MPI_Isend [gate:18572] *** on communicator MPI_COMM_WORLD [gate:18572] *** MPI_ERR_RANK: invalid rank [gate:18572] *** MPI_ERRORS_ARE_FATAL (goodbye) [gate:18575] *** An error occurred in MPI_Isend [gate:18575] *** on communicator MPI_COMM_WORLD [gate:18575] *** MPI_ERR_RANK: invalid rank [gate:18575] *** MPI_ERRORS_ARE_FATAL (goodbye) Process6 of 13 is alive on gate [gate:18570] *** An error occurred in MPI_Isend [gate:18570] *** on communicator MPI_COMM_WORLD [gate:18570] *** MPI_ERR_RANK: invalid rank [gate:18570] *** MPI_ERRORS_ARE_FATAL (goodbye) [gate:18561] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275 [gate:18561] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166 -- ---
[OMPI users] problems running parallel program
Hello I've just introduced the possibility to use OpenMPI instead of MPICH in an ocean model. The code is quite well tested and has been run in various parallel setups by various groups. I've compiled the program using mpif90 (instead of ifort). When I run I get the error shown at the end of this mail. As you can see, all 13 jobs are started - but then ... One problem with ocean models using domain decomposition in relation to load balancing is that the computational burden of the equal-sized domains is not the same (the different domains have different land-fractions). To overcome this, a matlab tool has been developed that allows for assigning more sub-domains to one processor/core based on the sum of water-points in the sub-domains. Attached is a figure showing the actual setup in this case. The neighbor relation is read from a file produced by said matlab-tool. Non-existing neighbors are set to -1 - MPI_PROC_NULL in MPICH. The setup is run on a quad-core machine for testing purposes only. Any ideas what goes wrong? 
error == kb@gate:~/DK/setups/north_sea_fine$ mpirun -np 13 bin/getm_prod_IFORT.96x96 Process0 of 13 is alive on gate [gate:18564] *** An error occurred in MPI_Isend [gate:18564] *** on communicator MPI_COMM_WORLD [gate:18564] *** MPI_ERR_RANK: invalid rank [gate:18564] *** MPI_ERRORS_ARE_FATAL (goodbye) Process1 of 13 is alive on gate [gate:18565] *** An error occurred in MPI_Isend [gate:18565] *** on communicator MPI_COMM_WORLD [gate:18565] *** MPI_ERR_RANK: invalid rank [gate:18565] *** MPI_ERRORS_ARE_FATAL (goodbye) Process2 of 13 is alive on gate Process3 of 13 is alive on gate [gate:18567] *** An error occurred in MPI_Isend [gate:18567] *** on communicator MPI_COMM_WORLD [gate:18567] *** MPI_ERR_RANK: invalid rank [gate:18567] *** MPI_ERRORS_ARE_FATAL (goodbye) Process4 of 13 is alive on gate [gate:18568] *** An error occurred in MPI_Isend [gate:18568] *** on communicator MPI_COMM_WORLD [gate:18568] *** MPI_ERR_RANK: invalid rank [gate:18568] *** MPI_ERRORS_ARE_FATAL (goodbye) Process5 of 13 is alive on gate [gate:18569] *** An error occurred in MPI_Isend [gate:18569] *** on communicator MPI_COMM_WORLD [gate:18569] *** MPI_ERR_RANK: invalid rank [gate:18569] *** MPI_ERRORS_ARE_FATAL (goodbye) Process7 of 13 is alive on gate [gate:18571] *** An error occurred in MPI_Isend [gate:18571] *** on communicator MPI_COMM_WORLD [gate:18571] *** MPI_ERR_RANK: invalid rank [gate:18571] *** MPI_ERRORS_ARE_FATAL (goodbye) Process8 of 13 is alive on gate Process9 of 13 is alive on gate [gate:18573] *** An error occurred in MPI_Isend [gate:18573] *** on communicator MPI_COMM_WORLD [gate:18573] *** MPI_ERR_RANK: invalid rank [gate:18573] *** MPI_ERRORS_ARE_FATAL (goodbye) Process 10 of 13 is alive on gate [gate:18574] *** An error occurred in MPI_Isend [gate:18574] *** on communicator MPI_COMM_WORLD [gate:18574] *** MPI_ERR_RANK: invalid rank [gate:18574] *** MPI_ERRORS_ARE_FATAL (goodbye) Process 11 of 13 is alive on gate Process 12 of 13 is alive on gate [gate:18576] *** An 
error occurred in MPI_Isend [gate:18576] *** on communicator MPI_COMM_WORLD [gate:18576] *** MPI_ERR_RANK: invalid rank [gate:18576] *** MPI_ERRORS_ARE_FATAL (goodbye) [gate:18566] *** An error occurred in MPI_Isend [gate:18566] *** on communicator MPI_COMM_WORLD [gate:18566] *** MPI_ERR_RANK: invalid rank [gate:18566] *** MPI_ERRORS_ARE_FATAL (goodbye) [gate:18572] *** An error occurred in MPI_Isend [gate:18572] *** on communicator MPI_COMM_WORLD [gate:18572] *** MPI_ERR_RANK: invalid rank [gate:18572] *** MPI_ERRORS_ARE_FATAL (goodbye) [gate:18575] *** An error occurred in MPI_Isend [gate:18575] *** on communicator MPI_COMM_WORLD [gate:18575] *** MPI_ERR_RANK: invalid rank [gate:18575] *** MPI_ERRORS_ARE_FATAL (goodbye) Process6 of 13 is alive on gate [gate:18570] *** An error occurred in MPI_Isend [gate:18570] *** on communicator MPI_COMM_WORLD [gate:18570] *** MPI_ERR_RANK: invalid rank [gate:18570] *** MPI_ERRORS_ARE_FATAL (goodbye) [gate:18561] [0,0,0] ORTE_ERROR_LOG: Timeout in file base/pls_base_orted_cmds.c at line 275 [gate:18561] [0,0,0] ORTE_ERROR_LOG: Timeout in file pls_rsh_module.c at line 1166 -- Karsten Bolding, Bolding & Burchard Hydrodynamics, Strandgyden 25, DK-5466 Asperup, Denmark. Phone: +45 64422058, Fax: +45 64422068, Email: kars...@bolding-burchard.com, http://www.findvej.dk/Strandgyden25,5466,11,3
Re: [OMPI users] Error initializing openMPI
Hi Jeff,

Sorry I did not see your post. Attached to this email are the outputs requested by the help page. It is a compressed tar file containing the output of ./configure and the output of "make all". Please let me know if more information is needed.

Thank you for your help,
Jorge

On Tue, 30 Oct 2007, Jeff Squyres wrote:

On Oct 30, 2007, at 9:42 AM, Jorge Parra wrote:

Thank you for your reply. Linux does not freeze. The one that freezes is Open MPI. Sorry for my inaccurate choice of words that led to confusion. Therefore dmesg does not show anything abnormal (I attached to this email a full dmesg log, captured when Open MPI freezes). When Open MPI freezes I can, from another terminal, see that the node on which Open MPI is originally run (the local one) has two processes: orted and mpirun. The remote node has one: orted. This seems to be normal. However, on both nodes there is no Open MPI activity. There is only an initial "calling init" printout on the local node (I included it in the greetings.c program for testing purposes).

Unfortunately, I have not been able to compile Open MPI 1.2.4 or any of the 1.2 branch versions. Versions 1.0 and 1.1 compiled well on my system. I already opened a case for this, but I received a message that the person it was assigned to is on paternity leave. So I think I need to wait a bit for help on that :). So I am stuck with version 1.1.5.

Are you referring to this thread: http://www.open-mpi.org/community/lists/users/2007/10/4218.php

There's currently only one person on paternity leave, and although he is the powerpc guy :-), he's not really the build-system guy (I'm kinda *guessing* that either OMPI or libltdl is choosing to build or link the wrong object -- but that's a SWAG without seeing any additional information). I sent you a reply on 24 Oct asking for a bit more information: http://www.open-mpi.org/community/lists/users/2007/10/4310.php

I am running Open MPI as root because my system has some special conditions. This is an attempt to make an embedded Massively Parallel Processor (MPP), so the nodes are running embedded versions of Linux, where normally there is just one user (root). Since this is an isolated system, I did not think this could be a problem (I don't care about security issues either).

Again, thank you for all your help,
Jorge

On Tue, 30 Oct 2007, Rainer Keller wrote:

Hello Jorge,

On Monday 29 October 2007 18:27, Jorge Parra wrote:

When running Open MPI my system freezes when initializing MPI (function MPI_Init). This happens only when I try to run the process on multiple nodes in my cluster. Running multiple instances of the testing code locally (i.e. ./mpirun -np 2 greetings) is successful.

Would it be possible to repeat the tests with the latest Open MPI 1.2.4 version? Even though nothing in Open MPI should make your system freeze, could you check the logs on the nodes and possibly have a dmesg created just before the MPI_Init...

- rsh runs well, and is configured for full access (i.e. rsh "192.168.1.103 date" is successful, and so are "rsh AFRLMPPBM2 date" and "rsh AFRLMPPBM2.MPPdomain.com"). Security is not an issue in this system.
- uname -n and hostname return a valid hostname.
- The testing code (attached to this email) is run (and fails) as: ./mpirun --hostfile /root/hostfile -np 2 greetings . The hostfile has the names of the local node (first entry: AFRLMPPBM1) and the remote node (second entry: AFRLMPPBM2). This file is also attached to this email.
- The environment variables seem to be properly set (see the attached env.log file). Local MPI programs (i.e. ./mpirun -np 2 greetings) run well.
- .profile has the path information for both the executables and the libraries.
- orted runs on the remote node, however it does not print anything to the console. The only output on the remote node is:

pam_rhosts_auth[235]: user root has a `+' user entry
pam_rhosts_auth[235]: allowed to r...@afrlmppbm1.mppdomain.com as root
PAM_unix[235]: (rsh) session opened for user root by (uid=0)
in.rshd[236]: r...@afrlmppbm1.mppdomain.com as root: cmd='( ! [ -e ./.profile ] || . ./.profile; orted --bootproxy 1 --name 0.0.1 --num_procs 3

You're running as root? Why is that?

Then the remote process returns to the command prompt. However orted is in the background. The local process is frozen, and just prints "Calling init", which is just before MPI_Init (see greetings.c). I believe the COMM WORLD cannot be correctly initialized. However I can't see which part of my configuration is wrong. Any help is greatly appreciated.

With best regards,
Rainer

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

ompi-output.tar.gz
Description: GNU Zip compressed data
Re: [OMPI users] Merge blocks depending on spawn order
I would try attaching to the processes to see where things are getting stuck.

On Oct 31, 2007, at 5:51 AM, Murat Knecht wrote:

Jeff Squyres schrieb:

On Oct 31, 2007, at 1:18 AM, Murat Knecht wrote:

Yes I am (master and child 1 running on the same machine). But knowing the oversubscription issue, I am using mpi_yield_when_idle, which should fix precisely this problem, right?

It won't *fix* the problem -- you're still oversubscribing the nodes, so things will run slowly. But it should help, in that the processes will yield regularly.

Yes. By "fix" I meant "solving the blocking problem by letting others get some CPU time".

What version of OMPI are you using?

I am using 1.2.4.

I did give both machines multiple slots, so Open MPI "knows" that the possibility for more oversubscription may arise.

I'm not sure what you mean by this -- you should not "lie" to OMPI and tell it that it has more slots than it physically does. But keep in mind that, as I described in my first mail, OMPI does not currently re-compute the number of processes on a host as you spawn (which can lead to the oversubscription problem). If you're explicitly setting yield_when_idle, that *may* help, but we may or may not be explicitly propagating that value to spawned processes... I'll have to check.

In the hostfile I specified for each host the number of physically available cores. Together with the "yield" setting I hoped the oversubscription would be recognised even if the "oversubscribing" processes are dynamically started. I re-checked the high/low parameter, but it does seem alright. It would be kind of awkward for this to be the reason, as the problem seems to depend on the host and the order.

Thanks,
Murat

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] Merge blocks depending on spawn order
Jeff Squyres schrieb:
> On Oct 31, 2007, at 1:18 AM, Murat Knecht wrote:
>
>> Yes I am (master and child 1 running on the same machine).
>> But knowing the oversubscription issue, I am using
>> mpi_yield_when_idle, which should fix precisely this problem, right?
>
> It won't *fix* the problem -- you're still oversubscribing the nodes,
> so things will run slowly. But it should help, in that the processes
> will yield regularly.

Yes. By "fix" I meant "solving the blocking problem by letting others get some CPU time".

> What version of OMPI are you using?

I am using 1.2.4.

>> I did give both machines multiple slots, so Open MPI
>> "knows" that the possibility for more oversubscription may arise.
>
> I'm not sure what you mean by this -- you should not "lie" to OMPI
> and tell it that it has more slots than it physically does. But keep
> in mind that, as I described in my first mail, OMPI does not
> currently re-compute the number of processes on a host as you spawn
> (which can lead to the oversubscription problem). If you're
> explicitly setting yield_when_idle, that *may* help, but we may or
> may not be explicitly propagating that value to spawned
> processes... I'll have to check.

In the hostfile I specified for each host the number of physically available cores. Together with the "yield" setting I hoped the oversubscription would be recognised even if the "oversubscribing" processes are dynamically started. I re-checked the high/low parameter, but it does seem alright. It would be kind of awkward for this to be the reason, as the problem seems to depend on the host and the order.

Thanks,
Murat
Re: [OMPI users] OpenMP and OpenMPI Issue
THREAD_MULTIPLE support does not work in the 1.2 series. Try turning it off.

On Oct 30, 2007, at 12:17 AM, Neeraj Chourasia wrote:

Hi folks,

I have been seeing some nasty behaviour in MPI_Send/Recv with a large dataset (8 MB), when used with OpenMP and Open MPI together over an IB interconnect. Attached is a program. The code first calls MPI_Init_thread() followed by the OpenMP thread-creation API. The program works fine if we do one-directional communication [thread 0 of process 0 sending some data to any thread of process 1], but it hangs if both sides try to send some data (8 MB) over the IB interconnect. Interestingly, the program works fine if we send short data (1 MB or below).

I see this with openmpi-1.2 or openmpi-1.2.4 (compiled with --enable-mpi-threads), OFED 1.2, kernel 2.6.9-42.4sp.XCsmp, icc (Intel compiler). Compiled as:

mpicc -O3 -openmp temp.c

Run as:

mpirun -np 2 -hostfile nodelist a.out

The error I am getting is:

[0,1,1][btl_openib_component.c:1199:btl_openib_component_progress] from n129 to: n115 error polling LP CQ with status LOCAL PROTOCOL ERROR status number 4 for wr_id 6391728 opcode 0
[0,1,1][btl_openib_component.c:1199:btl_openib_component_progress] from n129 to: n115 error polling LP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 7058304 opcode 128
[0,1,0][btl_openib_component.c:1199:btl_openib_component_progress] from n115 to: n129 error polling LP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 6854256 opcode 128
[0,1,0][btl_openib_component.c:1199:btl_openib_component_progress] from n115 to: n129 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 6920112 opcode 0

Anyone else seeing something similar? Any ideas for workarounds? As a point of reference, the program works fine if we force Open MPI to select the TCP interconnect using --mca btl tcp,self.

-Neeraj

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] Merge blocks depending on spawn order
On Oct 31, 2007, at 1:18 AM, Murat Knecht wrote:

Yes I am (master and child 1 running on the same machine). But knowing the oversubscription issue, I am using mpi_yield_when_idle, which should fix precisely this problem, right?

It won't *fix* the problem -- you're still oversubscribing the nodes, so things will run slowly. But it should help, in that the processes will yield regularly.

What version of OMPI are you using?

Or is the option ignored when initially there is no second process?

No, the option should not be ignored.

I did give both machines multiple slots, so Open MPI "knows" that the possibility for more oversubscription may arise.

I'm not sure what you mean by this -- you should not "lie" to OMPI and tell it that it has more slots than it physically does. But keep in mind that, as I described in my first mail, OMPI does not currently re-compute the number of processes on a host as you spawn (which can lead to the oversubscription problem). If you're explicitly setting yield_when_idle, that *may* help, but we may or may not be explicitly propagating that value to spawned processes... I'll have to check.

Another possibility is that you might have something wrong in your algorithm. E.g., did you ensure to set high/low in the intercomm_merge properly? You might want to attach to the "frozen" processes and see where exactly they are stuck.

Confused,
Murat

Jeff Squyres schrieb:

Are you perchance oversubscribing your nodes? Open MPI does not currently handle well the case where you initially undersubscribe your nodes but then, due to spawning, oversubscribe them. In this case, OMPI will be aggressively polling in all processes, not realizing that the node is now oversubscribed and it should be yielding the processor so that other processes can run.

On Oct 30, 2007, at 10:57 AM, Murat Knecht wrote:

Hi,

Does someone know whether there is a special requirement on the order of spawning processes and the subsequent merge of the intercommunicators? I have two hosts, let's name them local and remote, and a parent process on local that goes on to spawn one process on each of the two nodes. After each spawn the parent process and all existing children participate in merging the created intercommunicator into an intracommunicator that connects - in the end - all three processes.

The weird thing is, though, that when I spawn them in the order local, then remote, all three processes block at the second (last) spawn when encountering MPI_Intercomm_merge. But when I switch the order around, spawning first the process on remote and then on local, everything works out: the two processes are spawned and the intracommunicators created from the merge. Everything goes well, too, if I decide to spawn both processes on either one of the machines. (The existing children are informed via a message that they shall participate in the spawn and merge, since these are collective operations.)

Is there some implicit developer-level knowledge that explains why the order determines the outcome? Logically, there ought to be no difference. Btw, I work with two Linux nodes and an ordinary Ethernet/TCP connection between them.

Thanks,
Murat

--
Jeff Squyres
Cisco Systems
[OMPI users] ETH BTL
Sorry if this has already been discussed; I am new to this list. I came across the ETH BTL in http://archiv.tu-chemnitz.de/pub/2006/0111/data/hoefler-CSR-06-06.pdf and was wondering whether this transport is available / integrated into Open MPI.

Kind regards,
Mattijs

--
Mattijs Janssens
Re: [OMPI users] Merge blocks depending on spawn order
Yes I am (master and child 1 running on the same machine). But knowing the oversubscription issue, I am using mpi_yield_when_idle, which should fix precisely this problem, right? Or is the option ignored when initially there is no second process? I did give both machines multiple slots, so Open MPI "knows" that the possibility for more oversubscription may arise.

Confused,
Murat

Jeff Squyres schrieb:
> Are you perchance oversubscribing your nodes?
>
> Open MPI does not currently handle well when you initially
> undersubscribe your nodes but then, due to spawning, oversubscribe
> your nodes. In this case, OMPI will be aggressively polling in all
> processes, not realizing that the node is now oversubscribed and it
> should be yielding the processor so that other processes can run.
>
> On Oct 30, 2007, at 10:57 AM, Murat Knecht wrote:
>
>> Hi,
>>
>> does someone know whether there is a special requirement on the
>> order of spawning processes and the subsequent merge of the
>> intercommunicators? I have two hosts, let's name them local and
>> remote, and a parent process on local that goes on to spawn one
>> process on each of the two nodes. After each spawn the parent
>> process and all existing children participate in merging the
>> created intercommunicator into an intracommunicator that connects
>> - in the end - all three processes.
>>
>> The weird thing is, though, that when I spawn them in the order
>> local, then remote, all three processes block at the second (last)
>> spawn when encountering MPI_Intercomm_merge. But when I switch the
>> order around, spawning first the process on remote and then on
>> local, everything works out: the two processes are spawned and the
>> intracommunicators created from the merge. Everything goes well,
>> too, if I decide to spawn both processes on either one of the
>> machines. (The existing children are informed via a message that
>> they shall participate in the spawn and merge, since these are
>> collective operations.)
>>
>> Is there some implicit developer-level knowledge that explains why
>> the order determines the outcome? Logically, there ought to be no
>> difference. Btw, I work with two Linux nodes and an ordinary
>> Ethernet/TCP connection between them.
>>
>> Thanks,
>> Murat
Re: [OMPI users] Merge blocks depending on spawn order
Are you perchance oversubscribing your nodes?

Open MPI does not currently handle well the case where you initially undersubscribe your nodes but then, due to spawning, oversubscribe them. In this case, OMPI will be aggressively polling in all processes, not realizing that the node is now oversubscribed and it should be yielding the processor so that other processes can run.

On Oct 30, 2007, at 10:57 AM, Murat Knecht wrote:

Hi,

Does someone know whether there is a special requirement on the order of spawning processes and the subsequent merge of the intercommunicators? I have two hosts, let's name them local and remote, and a parent process on local that goes on to spawn one process on each of the two nodes. After each spawn the parent process and all existing children participate in merging the created intercommunicator into an intracommunicator that connects - in the end - all three processes.

The weird thing is, though, that when I spawn them in the order local, then remote, all three processes block at the second (last) spawn when encountering MPI_Intercomm_merge. But when I switch the order around, spawning first the process on remote and then on local, everything works out: the two processes are spawned and the intracommunicators created from the merge. Everything goes well, too, if I decide to spawn both processes on either one of the machines. (The existing children are informed via a message that they shall participate in the spawn and merge, since these are collective operations.)

Is there some implicit developer-level knowledge that explains why the order determines the outcome? Logically, there ought to be no difference. Btw, I work with two Linux nodes and an ordinary Ethernet/TCP connection between them.

Thanks,
Murat

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] Error initializing openMPI
On Oct 30, 2007, at 9:42 AM, Jorge Parra wrote:

Thank you for your reply. Linux does not freeze. The one that freezes is Open MPI. Sorry for my inaccurate choice of words that led to confusion. Therefore dmesg does not show anything abnormal (I attached to this email a full dmesg log, captured when Open MPI freezes). When Open MPI freezes I can, from another terminal, see that the node on which Open MPI is originally run (the local one) has two processes: orted and mpirun. The remote node has one: orted. This seems to be normal. However, on both nodes there is no Open MPI activity. There is only an initial "calling init" printout on the local node (I included it in the greetings.c program for testing purposes).

Unfortunately, I have not been able to compile Open MPI 1.2.4 or any of the 1.2 branch versions. Versions 1.0 and 1.1 compiled well on my system. I already opened a case for this, but I received a message that the person it was assigned to is on paternity leave. So I think I need to wait a bit for help on that :). So I am stuck with version 1.1.5.

Are you referring to this thread: http://www.open-mpi.org/community/lists/users/2007/10/4218.php

There's currently only one person on paternity leave, and although he is the powerpc guy :-), he's not really the build-system guy (I'm kinda *guessing* that either OMPI or libltdl is choosing to build or link the wrong object -- but that's a SWAG without seeing any additional information). I sent you a reply on 24 Oct asking for a bit more information: http://www.open-mpi.org/community/lists/users/2007/10/4310.php

I am running Open MPI as root because my system has some special conditions. This is an attempt to make an embedded Massively Parallel Processor (MPP), so the nodes are running embedded versions of Linux, where normally there is just one user (root). Since this is an isolated system, I did not think this could be a problem (I don't care about security issues either).

Again, thank you for all your help,
Jorge

On Tue, 30 Oct 2007, Rainer Keller wrote:

Hello Jorge,

On Monday 29 October 2007 18:27, Jorge Parra wrote:

When running Open MPI my system freezes when initializing MPI (function MPI_Init). This happens only when I try to run the process on multiple nodes in my cluster. Running multiple instances of the testing code locally (i.e. ./mpirun -np 2 greetings) is successful.

Would it be possible to repeat the tests with the latest Open MPI 1.2.4 version? Even though nothing in Open MPI should make your system freeze, could you check the logs on the nodes and possibly have a dmesg created just before the MPI_Init...

- rsh runs well, and is configured for full access (i.e. rsh "192.168.1.103 date" is successful, and so are "rsh AFRLMPPBM2 date" and "rsh AFRLMPPBM2.MPPdomain.com"). Security is not an issue in this system.
- uname -n and hostname return a valid hostname.
- The testing code (attached to this email) is run (and fails) as: ./mpirun --hostfile /root/hostfile -np 2 greetings . The hostfile has the names of the local node (first entry: AFRLMPPBM1) and the remote node (second entry: AFRLMPPBM2). This file is also attached to this email.
- The environment variables seem to be properly set (see the attached env.log file). Local MPI programs (i.e. ./mpirun -np 2 greetings) run well.
- .profile has the path information for both the executables and the libraries.
- orted runs on the remote node, however it does not print anything to the console. The only output on the remote node is:

pam_rhosts_auth[235]: user root has a `+' user entry
pam_rhosts_auth[235]: allowed to r...@afrlmppbm1.mppdomain.com as root
PAM_unix[235]: (rsh) session opened for user root by (uid=0)
in.rshd[236]: r...@afrlmppbm1.mppdomain.com as root: cmd='( ! [ -e ./.profile ] || . ./.profile; orted --bootproxy 1 --name 0.0.1 --num_procs 3

You're running as root? Why is that?

Then the remote process returns to the command prompt. However orted is in the background. The local process is frozen, and just prints "Calling init", which is just before MPI_Init (see greetings.c). I believe the COMM WORLD cannot be correctly initialized. However I can't see which part of my configuration is wrong. Any help is greatly appreciated.

With best regards,
Rainer

--
Jeff Squyres
Cisco Systems