Re: [OMPI users] quadrics support?
So, first run I seem to have run into a bit of an issue. All the Quadrics modules are compiled and loaded, and I can ping between nodes over the Quadrics interfaces. But when I try to run one of the hello MPI examples from Open MPI, I get:

First run: the process hangs. I killed it with Ctrl-C, though it doesn't seem to actually die, and kill -9 doesn't work either.

Second run: the process fails with

  failed elan4_attach Device or resource busy
  elan_allocSleepDesc Failed to allocate IRQ cookie 2a: 22 Invalid argument

All subsequent runs fail the same way, and I have to reboot the box to get the processes to go away.

I'm not sure if this is a Quadrics or an Open MPI issue at this point, but I figured that since there are Quadrics people on the list it's a good place to start.

On Tue, Jul 7, 2009 at 3:30 PM, Michael Di Domenico wrote:
> Does OpenMPI/Quadrics require the Quadrics kernel patches in order to
> operate? Or to operate at full speed? Or are the Quadrics modules
> sufficient?
>
> On Thu, Jul 2, 2009 at 1:52 PM, Ashley Pittman wrote:
>> On Thu, 2009-07-02 at 09:34 -0400, Michael Di Domenico wrote:
>>> Jeff,
>>>
>>> Okay, thanks. I'll give it a shot and report back. I can't
>>> contribute any code, but I can certainly do testing...
>>
>> I'm from the Quadrics stable so could certainly support a port should
>> you require it, but I don't have access to hardware either currently.
>>
>> Ashley,
>>
>> --
>>
>> Ashley Pittman, Bath, UK.
>>
>> Padb - A parallel job inspection tool for cluster computing
>> http://padb.pittman.org.uk
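For reference, the "hello MPI" example referred to above is presumably along the lines of the hello_c.c shipped in Open MPI's examples/ directory; a minimal sketch of such a test program (illustrative, not the exact shipped source):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

The elan4_attach/elan_allocSleepDesc errors above occur before a program like this prints anything, which suggests the failure is in device initialization rather than in the MPI code itself.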
Re: [OMPI users] OpenMPI+SGE tight integration works on E6600 core duo systems but not on Q9550 quads
Hi,

On 07.07.2009, at 22:12, Lengyel, Florian wrote:

Hi, I may have overlooked something in the archives (not to mention Googling) -- if so I apologize; however, I have been unable to find info on this particular problem: OpenMPI+SGE tight integration works on E6600 core duo systems but not on Q9550 quads. Could use some troubleshooting assistance. Thanks.

Is this what you found, or your question? I'm not aware of this. What should be the cause of it?!? Do you have a link - was it on the SGE list?

-- Reuti

I'm running SGE 6.0u10 on a Linux cluster running OpenSuse 11. OpenMPI was compiled with SGE support, and the required components are present:

[flengyel@nept OPENMPI]$ ompi_info | grep gridengine
 MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
 MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)

The parallel execution environment for OpenMPI is as follows:

[flengyel@nept OPENMPI]$ qconf -sp ompi
pe_name           ompi
slots             999
user_lists        Research
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

A trivial OpenMPI job using this PE will run on a queue for Intel E6600 core duo machines:

[flengyel@nept OPENMPI]$ cat sum2.sh
#!/bin/bash
#$ -S /bin/bash
#$ -q x86_64.q
#$ -N sum
#$ -pe ompi 4
#$ -cwd
export PATH=/home/nept/apps64/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nept/apps64/openmpi/lib
. /usr/local/sge/default/common/settings.sh
mpirun --mca pls_gridengine_verbose 2 --prefix /home/nept/apps64/openmpi -v ./sum

Here are the results:

[flengyel@nept OPENMPI]$ qsub sum2.sh
Your job 23194 ("sum") has been submitted
[flengyel@nept OPENMPI]$ qstat -r -u flengyel
job-ID  prior    name  user      state  submit/start at      queue                     slots  ja-task-ID
---------------------------------------------------------------------------------------------
 23194  0.25007  sum   flengyel  r      07/07/2009 14:14:40  x86_6...@m49.gc.cuny.edu  4
        Full jobname:   sum
        Master queue:   x86_6...@m49.gc.cuny.edu
        Requested PE:   ompi 4
        Granted PE:     ompi 4
        Hard Resources:
        Soft Resources:
        Hard requested queues: x86_64.q

[flengyel@nept OPENMPI]$ more sum.o23194
The sum from 1 to 1000 is: 500500

[flengyel@nept OPENMPI]$ more sum.e23194
Starting server daemon at host "m49.gc.cuny.edu"
Starting server daemon at host "m33.gc.cuny.edu"
Server daemon successfully started with task id "1.m49"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m49.gc.cuny.edu ...
Server daemon successfully started with task id "1.m33"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m33.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ...

But the same job with the queue set to quad.q for the Q9550 quad core machines has daemon trouble:

[flengyel@nept OPENMPI]$ !qstat
qstat -r -u flengyel
job-ID  prior    name  user      state  submit/start at      queue                   slots  ja-task-ID
---------------------------------------------------------------------------------------------
 23196  0.25000  sum   flengyel  r      07/07/2009 14:26:21  qua...@m09.gc.cuny.edu  2
        Full jobname:   sum
        Master queue:   qua...@m09.gc.cuny.edu
        Requested PE:   ompi 2
        Granted PE:     ompi 2
        Hard Resources:
        Soft Resources:
        Hard requested queues: quad.q

[flengyel@nept OPENMPI]$ more sum.e23196
Starting server daemon at host "m15.gc.cuny.edu"
Starting server daemon at host "m09.gc.cuny.edu"
Server daemon successfully started with task id "1.m15"
Server daemon successfully started with task id "1.m09"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m15.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
reading exit code from shepherd ...
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m09.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
reading exit code from shepherd ... 129
[m09.gc.cuny.edu:11413] ERROR: A daemon on node m15.gc.cuny.edu failed to start as expected.
[m09.gc.cuny.edu:11413] ERROR: There may be more information available from
[m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
[m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
[m09.gc.cuny.edu:11413] ERROR: The daemon exited unexpectedly with status 129.
129
[m09.gc.cuny.edu:11413] ERROR: A daemon on node m09.gc.cuny.edu failed to start as expected.
[m09.gc.cuny.edu:11413] ERROR: There may be more informat
[OMPI users] OpenMPI+SGE tight integration works on E6600 core duo systems but not on Q9550 quads
Hi, I may have overlooked something in the archives (not to mention Googling) -- if so I apologize; however, I have been unable to find info on this particular problem: OpenMPI+SGE tight integration works on E6600 core duo systems but not on Q9550 quads. Could use some troubleshooting assistance. Thanks.

I'm running SGE 6.0u10 on a Linux cluster running OpenSuse 11. OpenMPI was compiled with SGE support, and the required components are present:

[flengyel@nept OPENMPI]$ ompi_info | grep gridengine
 MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
 MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)

The parallel execution environment for OpenMPI is as follows:

[flengyel@nept OPENMPI]$ qconf -sp ompi
pe_name           ompi
slots             999
user_lists        Research
xuser_lists       NONE
start_proc_args   /bin/true
stop_proc_args    /bin/true
allocation_rule   $fill_up
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min

A trivial OpenMPI job using this PE will run on a queue for Intel E6600 core duo machines:

[flengyel@nept OPENMPI]$ cat sum2.sh
#!/bin/bash
#$ -S /bin/bash
#$ -q x86_64.q
#$ -N sum
#$ -pe ompi 4
#$ -cwd
export PATH=/home/nept/apps64/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nept/apps64/openmpi/lib
. /usr/local/sge/default/common/settings.sh
mpirun --mca pls_gridengine_verbose 2 --prefix /home/nept/apps64/openmpi -v ./sum

Here are the results:

[flengyel@nept OPENMPI]$ qsub sum2.sh
Your job 23194 ("sum") has been submitted
[flengyel@nept OPENMPI]$ qstat -r -u flengyel
job-ID  prior    name  user      state  submit/start at      queue                     slots  ja-task-ID
---------------------------------------------------------------------------------------------
 23194  0.25007  sum   flengyel  r      07/07/2009 14:14:40  x86_6...@m49.gc.cuny.edu  4
        Full jobname:   sum
        Master queue:   x86_6...@m49.gc.cuny.edu
        Requested PE:   ompi 4
        Granted PE:     ompi 4
        Hard Resources:
        Soft Resources:
        Hard requested queues: x86_64.q

[flengyel@nept OPENMPI]$ more sum.o23194
The sum from 1 to 1000 is: 500500

[flengyel@nept OPENMPI]$ more sum.e23194
Starting server daemon at host "m49.gc.cuny.edu"
Starting server daemon at host "m33.gc.cuny.edu"
Server daemon successfully started with task id "1.m49"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m49.gc.cuny.edu ...
Server daemon successfully started with task id "1.m33"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m33.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ...

But the same job with the queue set to quad.q for the Q9550 quad core machines has daemon trouble:

[flengyel@nept OPENMPI]$ !qstat
qstat -r -u flengyel
job-ID  prior    name  user      state  submit/start at      queue                   slots  ja-task-ID
---------------------------------------------------------------------------------------------
 23196  0.25000  sum   flengyel  r      07/07/2009 14:26:21  qua...@m09.gc.cuny.edu  2
        Full jobname:   sum
        Master queue:   qua...@m09.gc.cuny.edu
        Requested PE:   ompi 2
        Granted PE:     ompi 2
        Hard Resources:
        Soft Resources:
        Hard requested queues: quad.q

[flengyel@nept OPENMPI]$ more sum.e23196
Starting server daemon at host "m15.gc.cuny.edu"
Starting server daemon at host "m09.gc.cuny.edu"
Server daemon successfully started with task id "1.m15"
Server daemon successfully started with task id "1.m09"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m15.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
reading exit code from shepherd ...
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m09.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
reading exit code from shepherd ...
129
[m09.gc.cuny.edu:11413] ERROR: A daemon on node m15.gc.cuny.edu failed to start as expected.
[m09.gc.cuny.edu:11413] ERROR: There may be more information available from
[m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
[m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
[m09.gc.cuny.edu:11413] ERROR: The daemon exited unexpectedly with status 129.
129
[m09.gc.cuny.edu:11413] ERROR: A daemon on node m09.gc.cuny.edu failed to start as expected.
[m09.gc.cuny.edu:11413] ERROR: There may be more information available from
[m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
[m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
[m09.gc.cuny.edu:11
Re: [OMPI users] quadrics support?
Does OpenMPI/Quadrics require the Quadrics kernel patches in order to operate? Or to operate at full speed? Or are the Quadrics modules sufficient?

On Thu, Jul 2, 2009 at 1:52 PM, Ashley Pittman wrote:
> On Thu, 2009-07-02 at 09:34 -0400, Michael Di Domenico wrote:
>> Jeff,
>>
>> Okay, thanks. I'll give it a shot and report back. I can't
>> contribute any code, but I can certainly do testing...
>
> I'm from the Quadrics stable so could certainly support a port should
> you require it, but I don't have access to hardware either currently.
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
[OMPI users] Bug: coll_tuned_dynamic_rules_filename and duplicate communicators
I am attempting to use coll_tuned_dynamic_rules_filename to tune Open MPI 1.3.2. Based on my testing, it appears that the dynamic rules file *only* influences the algorithm selection for MPI_COMM_WORLD. Any duplicate communicators will only use fixed or forced rules, which may have much worse performance than the custom-tuned collectives in the dynamic rules file.

The following code demonstrates the difference between MPI_COMM_WORLD and a duplicate communicator.

test.c:

#include <mpi.h>

int main( int argc, char** argv )
{
    float u = 0.0, v = 0.0;
    MPI_Comm world_dup;

    MPI_Init( &argc, &argv );
    MPI_Comm_dup( MPI_COMM_WORLD, &world_dup );
    MPI_Allreduce( &u, &v, 1, MPI_FLOAT, MPI_SUM, world_dup );
    MPI_Barrier( MPI_COMM_WORLD );
    MPI_Allreduce( &u, &v, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD );
    MPI_Finalize();
    return 0;
}

allreduce.ompi:

1
2
1
9
1
0 1 0 0

invocation:

orterun -np 9 \
  -mca btl self,sm,openib,tcp \
  -mca coll_tuned_use_dynamic_rules 1 \
  -mca coll_tuned_dynamic_rules_filename allreduce.ompi \
  -mca coll_base_verbose 1000 \
  -- test

This program is run with tracing, and the barrier is only used to separate the allreduce calls in the trace. The trace for one node is at the end of the message, and the relevant section is the choice of algorithms for the two allreduce calls. The allreduce.ompi file indicates that all size-9 communicators should use the basic linear allreduce algorithm. MPI_COMM_WORLD uses basic_linear, but the world_dup communicator uses the fixed algorithm (for this message size, the fixed algorithm is recursive doubling).

Thank you.
John Jumper

Trace of one process for the above program:

mca: base: components_open: opening coll components
mca: base: components_open: found loaded component basic
mca: base: components_open: component basic register function successful
mca: base: components_open: component basic has no open function
mca: base: components_open: found loaded component hierarch
mca: base: components_open: component hierarch has no register function
mca: base: components_open: component hierarch open function successful
mca: base: components_open: found loaded component inter
mca: base: components_open: component inter has no register function
mca: base: components_open: component inter open function successful
mca: base: components_open: found loaded component self
mca: base: components_open: component self has no register function
mca: base: components_open: component self open function successful
mca: base: components_open: found loaded component sm
mca: base: components_open: component sm has no register function
mca: base: components_open: component sm open function successful
mca: base: components_open: found loaded component sync
mca: base: components_open: component sync register function successful
mca: base: components_open: component sync has no open function
mca: base: components_open: found loaded component tuned
mca: base: components_open: component tuned has no register function
coll:tuned:component_open: done!
mca: base: components_open: component tuned open function successful coll:find_available: querying coll component basic coll:find_available: coll component basic is available coll:find_available: querying coll component hierarch coll:find_available: coll component hierarch is available coll:find_available: querying coll component inter coll:find_available: coll component inter is available coll:find_available: querying coll component self coll:find_available: coll component self is available coll:find_available: querying coll component sm coll:find_available: coll component sm is available coll:find_available: querying coll component sync coll:find_available: coll component sync is available coll:find_available: querying coll component tuned coll:find_available: coll component tuned is available coll:base:comm_select: new communicator: MPI_COMM_WORLD (cid 0) coll:base:comm_select: Checking all available modules coll:base:comm_select: component available: basic, priority: 10 coll:base:comm_select: component not available: hierarch coll:base:comm_select: component not available: inter coll:base:comm_select: component not available: self coll:base:comm_select: component not available: sm coll:base:comm_select: component not available: sync coll:tuned:module_tuned query called coll:tuned:module_query using intra_dynamic coll:base:comm_select: component available: tuned, priority: 30 coll:tuned:module_init called. coll:tuned:module_init MCW & Dynamic coll:tuned:module_init Opening [allreduce.ompi] Reading dynamic rule for collective ID 2 Read communicator count 1 for dynamic rule for collective ID 2 Read message count 1 for dynamic rule for collective ID 2 and comm size 9 Done reading dynamic rule for collective ID 2 Collectives with rules : 1 Communicator sizes with rules : 1 Message sizes with rules: 1 Lines in configuration file read
[OMPI users] Segfault when using valgrind
(Sorry if this is posted twice, I sent the same email yesterday but it never appeared on the list). Hi, I am attempting to debug a memory corruption in an mpi program using valgrind. However, when I run with valgrind I get semi-random segfaults and valgrind messages with the openmpi library. Here is an example of such a seg fault: ==6153== ==6153== Invalid read of size 8 ==6153==at 0x19102EA0: (within /usr/lib/openmpi/lib/openmpi/ mca_btl_sm.so) ==6153==by 0x182ABACB: (within /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so) ==6153==by 0x182A3040: (within /usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so) ==6153==by 0xB425DD3: PMPI_Isend (in /usr/lib/openmpi/lib/libmpi.so.0.0.0) ==6153==by 0x7B83DA8: int Uintah::SFC::MergeExchange(int, std::vector, std::allocator > >&, std::vector, std::allocator > >&, std::vector, std::allocator > >&) (SFC.h:2989) ==6153==by 0x7B84A8F: void Uintah::SFC::Batcherschar>(std::vector, std::allocator > >&, std::vector, std::allocator > >&, std::vector, std::allocator > >&) (SFC.h:3730) ==6153==by 0x7B8857B: void Uintah::SFC::Cleanupchar>(std::vector, std::allocator > >&, std::vector, std::allocator > >&, std::vector, std::allocator > >&) (SFC.h:3695) ==6153==by 0x7B88CC6: void Uintah::SFC::Parallel0<3, unsigned char>() (SFC.h:2928) ==6153==by 0x7C00AAB: void Uintah::SFC::Parallel<3, unsigned char>() (SFC.h:1108) ==6153==by 0x7C0EF39: void Uintah::SFC::GenerateDim<3>(int) (SFC.h:694) ==6153==by 0x7C0F0F2: Uintah::SFC::GenerateCurve(int) (SFC.h:670) ==6153==by 0x7B30CAC: Uintah::DynamicLoadBalancer::useSFC(Uintah::Handle const&, int*) (DynamicLoadBalancer.cc:429) ==6153== Address 0x10 is not stack'd, malloc'd or (recently) free'd ^G^G^GThread "main"(pid 6153) caught signal SIGSEGV at address (nil) (segmentation violation) Looking at the code for our isend at SFC.h:298 does not seem to have any errors: = MergeInfo myinfo,theirinfo; MPI_Request srequest, rrequest; MPI_Status status; myinfo.n=n; if(n!=0) { myinfo.min=sendbuf[0].bits; myinfo.max=sendbuf[n-1].bits; } //cout << rank << " n:" << n << " min:" << (int)myinfo.min << "max:" << (int)myinfo.max << endl; MPI_Isend(&myinfo,sizeof(MergeInfo),MPI_BYTE,to,0,Comm,&srequest); == myinfo is a struct located on the stack, to is the rank of the processor that the message is being sent to, and srequest is also on the stack. In addition this message is waited on prior to exiting this block of code so they still exist on the stack. When I don't run with valgrind my program runs past this point just fine. I am currently using openmpi 1.3 from the debian unstable branch. 
I also see the same type of segfault in a different portion of the code involving an MPI_Allgather which can be seen below: == ==22736== Use of uninitialised value of size 8 ==22736==at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322) ==22736==by 0x1382CE09: opal_progress (opal_progress.c:207) ==22736==by 0xB404264: ompi_request_default_wait_all (condition.h:99) ==22736==by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55) ==22736==by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck (coll_tuned_util.h:60) ==22736==by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121) ==22736==by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728) ==22736==by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537) ==22736==by 0x6465457: Uintah::Grid::problemSetup(Uintah::Handle const&, Uintah::ProcessorGroup const*, bool) (Grid.cc:866) ==22736==by 0x8345759: Uintah::SimulationController::gridSetup() (SimulationController.cc:243) ==22736==by 0x834F418: Uintah::AMRSimulationController::run() (AMRSimulationController.cc:117) ==22736==by 0x4089AE: main (sus.cc:629) ==22736== ==22736== Invalid read of size 8 ==22736==at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322) ==22736==by 0x1382CE09: opal_progress (opal_progress.c:207) ==22736==by 0xB404264: ompi_request_default_wait_all (condition.h:99) ==22736==by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual (coll_tuned_util.c:55) ==22736==by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck (coll_tuned_util.h:60) ==22736==by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121) ==22736==by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728) ==22736==by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537) ==22736==by 0x6465457: Uintah::Grid::problemSetup(Uintah::Handle const&, Uintah::ProcessorGroup const*, bool) (Grid.cc:866) ==22736==by 0x8345759: Uintah::SimulationController::gridSetup() (SimulationController.cc:243) ==22736==by 0x834F418: Uintah::AMRSimulationController::run() (AMRSimulationController.cc:117) ==22736==by 0x4089AE: main (sus.cc:629)
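For context, a self-contained sketch of the send pattern described above -- a small struct on the stack, a non-blocking send, and a wait before the struct goes out of scope (the names and the MergeInfo layout here are illustrative, not the actual Uintah code):

#include <mpi.h>

/* illustrative stand-in for the MergeInfo struct described above */
struct MergeInfo {
    int n;
    unsigned char min;
    unsigned char max;
};

void exchange(int to, int n, const unsigned char *bits, MPI_Comm comm)
{
    struct MergeInfo myinfo, theirinfo;
    MPI_Request srequest, rrequest;

    myinfo.min = 0;                       /* initialized here to keep the sketch clean */
    myinfo.max = 0;
    myinfo.n = n;
    if (n != 0) {
        myinfo.min = bits[0];
        myinfo.max = bits[n - 1];
    }

    /* send/receive the headers as raw bytes, as in the report */
    MPI_Isend(&myinfo, sizeof(myinfo), MPI_BYTE, to, 0, comm, &srequest);
    MPI_Irecv(&theirinfo, sizeof(theirinfo), MPI_BYTE, to, 0, comm, &rrequest);

    /* both requests are completed before myinfo/theirinfo leave scope */
    MPI_Wait(&rrequest, MPI_STATUS_IGNORE);
    MPI_Wait(&srequest, MPI_STATUS_IGNORE);
}

As the poster notes, nothing in this pattern is obviously wrong; the valgrind reports point inside the sm BTL rather than at the user buffers.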
Re: [OMPI users] any way to get serial time on head node?
You probably want to use an MPI tracing tool that can break down the times spent inside and outside of the MPI library. User vs. system time, as you noted, can get quite blurred. On Jul 6, 2009, at 12:48 PM, Ross Boylan wrote: Let total time on my slot 0 process be S+C+B+I = serial computations + communication + busy wait + idle Is there a way to find out S? S+C would probably also be useful, since I assume C is low. The problem is that I = 0, roughly, and B is big. Since B is big, the usual process timing methods don't work. If B all went to "system" as opposed to "user" time I could use the latter, but I don't think that's the case. Can anyone confirm that? If S is big, I might be able to gain by parallelizing in a different way. By S I mean to refer to serial computation that is part of my algorithm, rather than the technical fact that all the computation is serial on a given slot. I'm running R/RMPI. Thanks. Ross ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems
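If a full tracing tool is overkill, a rough way to approximate the breakdown Ross describes is to bracket the MPI calls with MPI_Wtime and accumulate the two buckets separately -- a minimal sketch (the compute/communicate placeholders are illustrative, not from the original post):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    double t_serial = 0.0, t_mpi = 0.0;   /* roughly S vs. C+B+I */
    int i;

    MPI_Init(&argc, &argv);
    for (i = 0; i < 100; i++) {
        double t0 = MPI_Wtime();
        /* ... serial computation for this iteration ... */
        double t1 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);      /* stand-in for the real communication */
        double t2 = MPI_Wtime();
        t_serial += t1 - t0;
        t_mpi    += t2 - t1;
    }
    printf("serial %.3fs, in MPI %.3fs\n", t_serial, t_mpi);
    MPI_Finalize();
    return 0;
}

Time spent blocked inside MPI (busy-wait or idle) lands in t_mpi, so t_serial gives an estimate of S regardless of whether the waiting shows up as user or system time.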
Re: [OMPI users] MPI and C++ (Boost)
OK, after all these considerations I'll try Boost today, run some experiments, and see whether I can use it or will still avoid it. But, as Raimond said (I think), the main drawbacks of Boost are being dependent on a rich, incredible, amazing toolset that still implements only MPI-1, and not implementing all of the MPI functions -- though the set of functions that is implemented does not compromise the functionality. I don't know how the MPI-1, MPI-2, and future MPI-3 specifications, and their implementations, will affect Boost and a developer using Boost (with Open MPI, of course). And if something changes in Boost, how can I guarantee it won't affect my code in the future? It is impossible. Anyway, I'll test today both with and without it and choose my direction. Thanks for all the replies, suggestions, and solutions that you all pointed me to; I really appreciate all your help and comments about using Boost or not in my code.

Thanks and regards,
Vitorio.

On 2009-07-07, at 08:26, Jeff Squyres wrote:

I think you face a common trade-off:

- use a well-established, debugged, abstraction-rich library
- write all of that stuff yourself

FWIW, I think the first one is a no-brainer. There's a reason they wrote Boost.MPI: it's complex, difficult stuff, and is perfect as middleware for others to use.

If having users perform a 2nd step is undesirable (i.e., install Boost before installing your software), how about embedding Boost in your software? Your configure/build process can certainly be tailored to include Boost[.MPI]. Hence, users will only perform 1 step, but it actually performs "2" steps under the covers (configures+installs Boost.MPI and then configures+installs your software, which uses Boost).

FWIW: Open MPI does exactly this. Open MPI embeds at least 5 software packages: PLPA, VampirTrace, ROMIO, libltdl, and libevent. But 99.9% of our users don't know/care because it's all hidden in our configure / make process. If you watch carefully, you can see the output go by from each of those configure sections when running OMPI's configure. But no one does. ;-)

Sidenote: I would echo that the Forum is not considering including Boost.MPI at all. Indeed, as mentioned in different threads, the Forum has already voted once to deprecate the MPI C++ bindings, partly *because* of Boost. Boost.MPI has shown that the C++ community is better at making C++ APIs for MPI than the Forum is. Hence, our role should be to make the base building blocks and let the language experts make their own preferred tools.

On Jul 7, 2009, at 5:03 AM, Matthieu Brucher wrote:

> IF boost is attached to MPI 3 (or whatever), AND it becomes part of the
> mainstream MPI implementations, THEN you can have the discussion again.

Hi,

At the moment, I think that Boost.MPI only supports MPI 1.1, and even then, some additional work may be done, at least regarding the complex datatypes.

Matthieu
--
Information System Engineer, Ph.D.
Website: http://matthieu-brucher.developpez.com/
Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn: http://www.linkedin.com/in/matthieubrucher

--
Jeff Squyres
Cisco Systems
Re: [OMPI users] Segmentation fault - Address not mapped
On Jul 7, 2009, at 8:08 AM, Catalin David wrote: Thank you very much for the help and assistance :) Using -isystem /users/cluster/cdavid/local/include the program now runs fine (loads the correct mpi.h). This is very fishy. If mpic++ is in /users/cluster/cdavid/local/bin, and that directory is in the front of your $PATH, then using that to compile your application should pull in the right mpi.h file. To be clear: if you use the right mpicc / mpic++ / mpif77 / mpif90, the Right header files should get pulled in because the wrappers will do the proper -I for you. You can verify this by checking the output of "mpic++ my_program.cc -o my_program --showme" and see what compiler flags are getting passed down to the underlying compiler. You might want to double check your setup to ensure that your PATH is absolutely correct, you have run "rehash" if you needed to (csh / tcsh), your LD_LIBRARY_PATH points to the right library (on all nodes, even for non-interactive logins), etc. -- Jeff Squyres Cisco Systems
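For example, the checks Jeff describes would look roughly like this (paths are site-specific; the expected locations here come from earlier in this thread):

  which mpic++                           # should print /users/cluster/cdavid/local/bin/mpic++
  mpic++ my_program.cc -o my_program --showme
  echo $LD_LIBRARY_PATH                  # should start with /users/cluster/cdavid/local/lib

The --showme output should contain a -I flag pointing into the local Open MPI install's include directory, not into a system-wide MPICH or older Open MPI tree.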
Re: [OMPI users] Question on running the openmpi test modules
On Jul 7, 2009, at 12:43 AM, Prasadcse Perera wrote: I'm new to openmpi and currently I have setup openmpi-1.3.3a1r21566 on my Linux machines. I have run some of available examples and also noticed there are some test modules under /openmpi-1.3.3a1r21566/ test. Are these tests run on batchwise? then how ? or are these tests suppose to run individually by compiling and executing seperately ? They are run via "make check" -- it's a standard GNU mechanism that is built into our make system automatically by Automake. These tests are loosely maintained at best -- they were put in a long time ago, but the bulk of our regression testing codes are in different, not- publicly-accessible repositories (mainly because many of them were not originally written by us and we were too lazy to look into public redistribution rights). I'm hoping to contribute openmpi as a developer, so I would like to know can users contribute by adding more example codes ? Great! More tests, examples, documentation, and code are always appreciated! Note that we have a separate "de...@open-mpi.org" list for developer- level discussions. -- Jeff Squyres Cisco Systems
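In other words, from a build tree the bundled tests are driven by the standard Automake target, roughly:

  ./configure --prefix=$HOME/local/openmpi    # prefix is just an example
  make all
  make check

whereas the larger regression suites Jeff mentions live outside the public tree.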
Re: [OMPI users] Configuration problem or network problem?
You might want to use a tracing library to see where exactly your synchronization issues are occurring. It may depend on the communication pattern between your nodes and the timing between them. Additionally, your network switches' performance characteristics may come into effect here: are there retransmissions, timeouts, etc.?

It can sometimes be helpful to insert an MPI_BARRIER every few iterations just to keep all processes well-synchronized. It seems counter-intuitive, but sometimes waiting a short time in a barrier can increase overall throughput (rather than waiting progressively longer times in poorly-synchronized blocking communications), as sketched after this message.

On Jul 6, 2009, at 11:33 PM, Zou, Lin (GE, Research, Consultant) wrote:

Thank you for your suggestion; I tried this solution, but it doesn't work. In fact, the head node doesn't participate in the computing and communication: it only mallocs a large block of memory, and when the loop in every PS3 is over, the head node gathers the data from every PS3. The strange thing is that sometimes the program works well, but after rebooting the system, without any change to the program, it doesn't, so I think there should be some mechanism in OpenMPI that can be configured to let the program work well.

Lin

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Doug Reeder
Sent: July 7, 2009 10:49
To: Open MPI Users
Subject: Re: [OMPI users] Configuration problem or network problem?

Lin,

Try -np 16 and not running on the head node.

Doug Reeder

On Jul 6, 2009, at 7:08 PM, Zou, Lin (GE, Research, Consultant) wrote:

Hi all,

The system I use is a PS3 cluster, with 16 PS3s and a PowerPC as a head node; they are connected by a high-speed switch. There are point-to-point communication functions (MPI_Send and MPI_Recv), the data size is about 40 KB, and a lot of computation that takes a long time (about 1 sec) inside a loop. The co-processor in the PS3 can take care of the computation and the main processor takes care of the point-to-point communication, so computing and communication can overlap. The communication functions should return much faster than the computing function.

My question is that after some iterations, the time consumed by the communication functions on a PS3 increases heavily, and the whole cluster's synchronization breaks down. When I decrease the computing time, this situation just disappears. I am very confused about this. I think there is a mechanism in OpenMPI that causes this; has anyone run into this situation before?

I use "mpirun --mca btl tcp,self -np 17 --hostfile ..."; is there something I should add?

Lin

--
Jeff Squyres
Cisco Systems
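A minimal sketch of the periodic-barrier idea mentioned above (a loop fragment only; the every-10-iterations interval and the two helper names are placeholders to tune, not from the original post):

    /* inside the main compute/communicate loop */
    for (iter = 0; iter < niters; iter++) {
        do_compute();                      /* ~1 s of co-processor work (placeholder) */
        exchange_with_neighbours();        /* the MPI_Send/MPI_Recv phase (placeholder) */
        if (iter % 10 == 0) {
            MPI_Barrier(MPI_COMM_WORLD);   /* periodically re-synchronize all ranks */
        }
    }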
Re: [OMPI users] MPI and C++ (Boost)
I think you face a common trade-off: - use a well-established, debugged, abstraction-rich library - write all of that stuff yourself FWIW, I think the first one is a no-brainer. There's a reason they wrote Boost.MPI: it's complex, difficult stuff, and is perfect as middleware for others to use. If having users perform a 2nd step is undesirable (i.e., install Boost before installing your software), how about embedding Boost in your software? Your configure/build process can certainly be tailored to include Boost[.MPI]. Hence, users will only perform 1 step, but it actually performs "2" steps under the covers (configures+installs Boost.MPI and then configures+installs your software, which uses Boost). FWIW: Open MPI does exactly this. Open MPI embeds at least 5 software packages: PLPA, VampirTrace, ROMIO, libltdl, and libevent. But 99.9% of our users don't know/care because it's all hidden in our configure / make process. If you watch carefully, you can see the output go by from each of those configure sections when running OMPI's configure. But no one does. ;-) Sidenote: I would echo that the Forum is not considering including Boost.MPI at all. Indeed, as mentioned in different threads, the Forum has already voted once to deprecate the MPI C++ bindings, partly *because* of Boost. Boost.MPI has shown that the C++ community is better at making C++ APIs for MPI than the Forum is. Hence, our role should be to make the base building blocks and let the language experts make their own preferred tools. On Jul 7, 2009, at 5:03 AM, Matthieu Brucher wrote: > IF boost is attached to MPI 3 (or whatever), AND it becomes part of the > mainstream MPI implementations, THEN you can have the discussion again. Hi, At the moment, I think that Boost.MPI only supports MPI1.1, and even then, some additional work may be done, at least regarding the complex datatypes. Matthieu -- Information System Engineer, Ph.D. Website: http://matthieu-brucher.developpez.com/ Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92 LinkedIn: http://www.linkedin.com/in/matthieubrucher ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres Cisco Systems
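To make the trade-off concrete, here is a minimal Boost.MPI sketch of the kind of abstraction being discussed -- sending a std::string (or any serializable type) without building MPI datatypes by hand (illustrative only, not code from this thread):

#include <iostream>
#include <string>
#include <boost/mpi.hpp>

namespace mpi = boost::mpi;

int main(int argc, char* argv[])
{
    mpi::environment env(argc, argv);   // wraps MPI_Init / MPI_Finalize
    mpi::communicator world;

    if (world.rank() == 0) {
        std::string msg = "hello from rank 0";
        world.send(1, 0, msg);           // serialization handled by Boost
    } else if (world.rank() == 1) {
        std::string msg;
        world.recv(0, 0, msg);
        std::cout << "rank 1 got: " << msg << std::endl;
    }
    return 0;
}

The equivalent with the plain C API would involve a length exchange or probing plus explicit buffers, which is exactly the boilerplate Boost.MPI hides.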
Re: [OMPI users] Segmentation fault - Address not mapped
Thank you very much for the help and assistance :) Using -isystem /users/cluster/cdavid/local/include the program now runs fine (loads the correct mpi.h). Thank you again, Catalin On Tue, Jul 7, 2009 at 12:29 PM, Catalin David wrote: > #include > #include > int main(int argc, char *argv[]) > { > printf("%d %d %d\n", OMPI_MAJOR_VERSION, > OMPI_MINOR_VERSION,OMPI_RELEASE_VERSION); > return 0; > } > > returns: > > test.cpp: In function ‘int main(int, char**)’: > test.cpp:11: error: ‘OMPI_MAJOR_VERSION’ was not declared in this scope > test.cpp:11: error: ‘OMPI_MINOR_VERSION’ was not declared in this scope > test.cpp:11: error: ‘OMPI_RELEASE_VERSION’ was not declared in this scope > > So, I am definitely using another library (mpich). > > Thanks one more time!!! I will try to fix it and come back with results. > > Catalin > > On Tue, Jul 7, 2009 at 12:23 PM, Dorian Krause wrote: >> Catalin David wrote: >>> >>> Hello, all! >>> >>> Just installed Valgrind (since this seems like a memory issue) and got >>> this interesting output (when running the test program): >>> >>> ==4616== Syscall param sched_setaffinity(mask) points to unaddressable >>> byte(s) >>> ==4616== at 0x43656BD: syscall (in /lib/tls/libc-2.3.2.so) >>> ==4616== by 0x4236A75: opal_paffinity_linux_plpa_init >>> (plpa_runtime.c:37) >>> ==4616== by 0x423779B: >>> opal_paffinity_linux_plpa_have_topology_information (plpa_map.c:501) >>> ==4616== by 0x4235FEE: linux_module_init (paffinity_linux_module.c:119) >>> ==4616== by 0x447F114: opal_paffinity_base_select >>> (paffinity_base_select.c:64) >>> ==4616== by 0x444CD71: opal_init (opal_init.c:292) >>> ==4616== by 0x43CE7E6: orte_init (orte_init.c:76) >>> ==4616== by 0x4067A50: ompi_mpi_init (ompi_mpi_init.c:342) >>> ==4616== by 0x40A3444: PMPI_Init (pinit.c:80) >>> ==4616== by 0x804875C: main (test.cpp:17) >>> ==4616== Address 0x0 is not stack'd, malloc'd or (recently) free'd >>> ==4616== >>> ==4616== Invalid read of size 4 >>> ==4616== at 0x4095772: ompi_comm_invalid (communicator.h:261) >>> ==4616== by 0x409581E: PMPI_Comm_size (pcomm_size.c:46) >>> ==4616== by 0x8048770: main (test.cpp:18) >>> ==4616== Address 0x44a0 is not stack'd, malloc'd or (recently) free'd >>> [denali:04616] *** Process received signal *** >>> [denali:04616] Signal: Segmentation fault (11) >>> [denali:04616] Signal code: Address not mapped (1) >>> [denali:04616] Failing at address: 0x44a0 >>> [denali:04616] [ 0] /lib/tls/libc.so.6 [0x42b4de0] >>> [denali:04616] [ 1] >>> /users/cluster/cdavid/local/lib/libmpi.so.0(MPI_Comm_size+0x6f) >>> [0x409581f] >>> [denali:04616] [ 2] ./test(__gxx_personality_v0+0x12d) [0x8048771] >>> [denali:04616] [ 3] /lib/tls/libc.so.6(__libc_start_main+0xf8) [0x42a2768] >>> [denali:04616] [ 4] ./test(__gxx_personality_v0+0x3d) [0x8048681] >>> [denali:04616] *** End of error message *** >>> ==4616== >>> ==4616== Invalid read of size 4 >>> ==4616== at 0x4095782: ompi_comm_invalid (communicator.h:261) >>> ==4616== by 0x409581E: PMPI_Comm_size (pcomm_size.c:46) >>> ==4616== by 0x8048770: main (test.cpp:18) >>> ==4616== Address 0x44a0 is not stack'd, malloc'd or (recently) free'd >>> >>> >>> The problem is that, now, I don't know where the issue comes from (is >>> it libc that is too old and incompatible with g++ 4.4/OpenMPI? is libc >>> broken?). 
>>> >> >> Looking at the code for ompi_comm_invalid: >> >> static inline int ompi_comm_invalid(ompi_communicator_t* comm) >> { >> if ((NULL == comm) || (MPI_COMM_NULL == comm) || >> (OMPI_COMM_IS_FREED(comm)) || (OMPI_COMM_IS_INVALID(comm)) ) >> return true; >> else >> return false; >> } >> >> >> the interesting point is that (MPI_COMM_NULL == comm) evaluates to false, >> otherwise the following macros (where the invalid read occurs) would not be >> evaluated. >> >> The only idea that comes to my mind is that you are mixing MPI versions, but >> as you said your PATH is fine ?! >> >> Regards, >> Dorian >> >> >> >>> Any help would be highly appreciated. >>> >>> Thanks, >>> Catalin >>> >>> >>> On Mon, Jul 6, 2009 at 3:36 PM, Catalin David >>> wrote: >>> On Mon, Jul 6, 2009 at 3:26 PM, jody wrote: > > Hi > Are you also sure that you have the same version of Open-MPI > on every machine of your cluster, and that it is the mpicxx of this > version that is called when you run your program? > I ask because you mentioned that there was an old version of Open-MPI > present... die you remove this? > > Jody > Hi I have just logged in a few other boxes and they all mount my home folder. When running `echo $LD_LIBRARY_PATH` and other commands, I get what I expect to get, but this might be because I have set these variables in the .bashrc file. So, I tried compiling/running like this ~/local/bin/mpicxx [stuff] and ~/local/bin/mpirun -np 4 ray-trace, but I get the same errors. >>>
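For reference, the version-check program quoted above with its include lines restored (the archive stripped the angle-bracketed header names; they were presumably stdio.h and mpi.h):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    /* these macros only exist in Open MPI's mpi.h */
    printf("%d %d %d\n", OMPI_MAJOR_VERSION,
           OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION);
    return 0;
}

The compile errors above ('OMPI_MAJOR_VERSION' was not declared) are therefore consistent with an MPICH mpi.h being picked up instead of Open MPI's.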
Re: [OMPI users] Segmentation fault - Address not mapped
#include #include int main(int argc, char *argv[]) { printf("%d %d %d\n", OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION,OMPI_RELEASE_VERSION); return 0; } returns: test.cpp: In function ‘int main(int, char**)’: test.cpp:11: error: ‘OMPI_MAJOR_VERSION’ was not declared in this scope test.cpp:11: error: ‘OMPI_MINOR_VERSION’ was not declared in this scope test.cpp:11: error: ‘OMPI_RELEASE_VERSION’ was not declared in this scope So, I am definitely using another library (mpich). Thanks one more time!!! I will try to fix it and come back with results. Catalin On Tue, Jul 7, 2009 at 12:23 PM, Dorian Krause wrote: > Catalin David wrote: >> >> Hello, all! >> >> Just installed Valgrind (since this seems like a memory issue) and got >> this interesting output (when running the test program): >> >> ==4616== Syscall param sched_setaffinity(mask) points to unaddressable >> byte(s) >> ==4616== at 0x43656BD: syscall (in /lib/tls/libc-2.3.2.so) >> ==4616== by 0x4236A75: opal_paffinity_linux_plpa_init >> (plpa_runtime.c:37) >> ==4616== by 0x423779B: >> opal_paffinity_linux_plpa_have_topology_information (plpa_map.c:501) >> ==4616== by 0x4235FEE: linux_module_init (paffinity_linux_module.c:119) >> ==4616== by 0x447F114: opal_paffinity_base_select >> (paffinity_base_select.c:64) >> ==4616== by 0x444CD71: opal_init (opal_init.c:292) >> ==4616== by 0x43CE7E6: orte_init (orte_init.c:76) >> ==4616== by 0x4067A50: ompi_mpi_init (ompi_mpi_init.c:342) >> ==4616== by 0x40A3444: PMPI_Init (pinit.c:80) >> ==4616== by 0x804875C: main (test.cpp:17) >> ==4616== Address 0x0 is not stack'd, malloc'd or (recently) free'd >> ==4616== >> ==4616== Invalid read of size 4 >> ==4616== at 0x4095772: ompi_comm_invalid (communicator.h:261) >> ==4616== by 0x409581E: PMPI_Comm_size (pcomm_size.c:46) >> ==4616== by 0x8048770: main (test.cpp:18) >> ==4616== Address 0x44a0 is not stack'd, malloc'd or (recently) free'd >> [denali:04616] *** Process received signal *** >> [denali:04616] Signal: Segmentation fault (11) >> [denali:04616] Signal code: Address not mapped (1) >> [denali:04616] Failing at address: 0x44a0 >> [denali:04616] [ 0] /lib/tls/libc.so.6 [0x42b4de0] >> [denali:04616] [ 1] >> /users/cluster/cdavid/local/lib/libmpi.so.0(MPI_Comm_size+0x6f) >> [0x409581f] >> [denali:04616] [ 2] ./test(__gxx_personality_v0+0x12d) [0x8048771] >> [denali:04616] [ 3] /lib/tls/libc.so.6(__libc_start_main+0xf8) [0x42a2768] >> [denali:04616] [ 4] ./test(__gxx_personality_v0+0x3d) [0x8048681] >> [denali:04616] *** End of error message *** >> ==4616== >> ==4616== Invalid read of size 4 >> ==4616== at 0x4095782: ompi_comm_invalid (communicator.h:261) >> ==4616== by 0x409581E: PMPI_Comm_size (pcomm_size.c:46) >> ==4616== by 0x8048770: main (test.cpp:18) >> ==4616== Address 0x44a0 is not stack'd, malloc'd or (recently) free'd >> >> >> The problem is that, now, I don't know where the issue comes from (is >> it libc that is too old and incompatible with g++ 4.4/OpenMPI? is libc >> broken?). >> > > Looking at the code for ompi_comm_invalid: > > static inline int ompi_comm_invalid(ompi_communicator_t* comm) > { > if ((NULL == comm) || (MPI_COMM_NULL == comm) || > (OMPI_COMM_IS_FREED(comm)) || (OMPI_COMM_IS_INVALID(comm)) ) > return true; > else > return false; > } > > > the interesting point is that (MPI_COMM_NULL == comm) evaluates to false, > otherwise the following macros (where the invalid read occurs) would not be > evaluated. > > The only idea that comes to my mind is that you are mixing MPI versions, but > as you said your PATH is fine ?! 
> > Regards, > Dorian > > > >> Any help would be highly appreciated. >> >> Thanks, >> Catalin >> >> >> On Mon, Jul 6, 2009 at 3:36 PM, Catalin David >> wrote: >> >>> >>> On Mon, Jul 6, 2009 at 3:26 PM, jody wrote: >>> Hi Are you also sure that you have the same version of Open-MPI on every machine of your cluster, and that it is the mpicxx of this version that is called when you run your program? I ask because you mentioned that there was an old version of Open-MPI present... die you remove this? Jody >>> >>> Hi >>> >>> I have just logged in a few other boxes and they all mount my home >>> folder. When running `echo $LD_LIBRARY_PATH` and other commands, I get >>> what I expect to get, but this might be because I have set these >>> variables in the .bashrc file. So, I tried compiling/running like this >>> ~/local/bin/mpicxx [stuff] and ~/local/bin/mpirun -np 4 ray-trace, >>> but I get the same errors. >>> >>> As for the previous version, I don't have root access, therefore I was >>> not able to remove it. I was just trying to outrun it by setting the >>> $PATH variable to point first at my local installation. >>> >>> >>> Catalin >>> >>> >>> -- >>> >>> ** >>> Catalin David >>> B.Sc. Computer Science 2010 >>> Jacobs University Bremen >>> >>> Phone: +49-(0)1577-4
Re: [OMPI users] Segmentation fault - Address not mapped
Catalin David wrote: Hello, all! Just installed Valgrind (since this seems like a memory issue) and got this interesting output (when running the test program): ==4616== Syscall param sched_setaffinity(mask) points to unaddressable byte(s) ==4616==at 0x43656BD: syscall (in /lib/tls/libc-2.3.2.so) ==4616==by 0x4236A75: opal_paffinity_linux_plpa_init (plpa_runtime.c:37) ==4616==by 0x423779B: opal_paffinity_linux_plpa_have_topology_information (plpa_map.c:501) ==4616==by 0x4235FEE: linux_module_init (paffinity_linux_module.c:119) ==4616==by 0x447F114: opal_paffinity_base_select (paffinity_base_select.c:64) ==4616==by 0x444CD71: opal_init (opal_init.c:292) ==4616==by 0x43CE7E6: orte_init (orte_init.c:76) ==4616==by 0x4067A50: ompi_mpi_init (ompi_mpi_init.c:342) ==4616==by 0x40A3444: PMPI_Init (pinit.c:80) ==4616==by 0x804875C: main (test.cpp:17) ==4616== Address 0x0 is not stack'd, malloc'd or (recently) free'd ==4616== ==4616== Invalid read of size 4 ==4616==at 0x4095772: ompi_comm_invalid (communicator.h:261) ==4616==by 0x409581E: PMPI_Comm_size (pcomm_size.c:46) ==4616==by 0x8048770: main (test.cpp:18) ==4616== Address 0x44a0 is not stack'd, malloc'd or (recently) free'd [denali:04616] *** Process received signal *** [denali:04616] Signal: Segmentation fault (11) [denali:04616] Signal code: Address not mapped (1) [denali:04616] Failing at address: 0x44a0 [denali:04616] [ 0] /lib/tls/libc.so.6 [0x42b4de0] [denali:04616] [ 1] /users/cluster/cdavid/local/lib/libmpi.so.0(MPI_Comm_size+0x6f) [0x409581f] [denali:04616] [ 2] ./test(__gxx_personality_v0+0x12d) [0x8048771] [denali:04616] [ 3] /lib/tls/libc.so.6(__libc_start_main+0xf8) [0x42a2768] [denali:04616] [ 4] ./test(__gxx_personality_v0+0x3d) [0x8048681] [denali:04616] *** End of error message *** ==4616== ==4616== Invalid read of size 4 ==4616==at 0x4095782: ompi_comm_invalid (communicator.h:261) ==4616==by 0x409581E: PMPI_Comm_size (pcomm_size.c:46) ==4616==by 0x8048770: main (test.cpp:18) ==4616== Address 0x44a0 is not stack'd, malloc'd or (recently) free'd The problem is that, now, I don't know where the issue comes from (is it libc that is too old and incompatible with g++ 4.4/OpenMPI? is libc broken?). Looking at the code for ompi_comm_invalid: static inline int ompi_comm_invalid(ompi_communicator_t* comm) { if ((NULL == comm) || (MPI_COMM_NULL == comm) || (OMPI_COMM_IS_FREED(comm)) || (OMPI_COMM_IS_INVALID(comm)) ) return true; else return false; } the interesting point is that (MPI_COMM_NULL == comm) evaluates to false, otherwise the following macros (where the invalid read occurs) would not be evaluated. The only idea that comes to my mind is that you are mixing MPI versions, but as you said your PATH is fine ?! Regards, Dorian Any help would be highly appreciated. Thanks, Catalin On Mon, Jul 6, 2009 at 3:36 PM, Catalin David wrote: On Mon, Jul 6, 2009 at 3:26 PM, jody wrote: Hi Are you also sure that you have the same version of Open-MPI on every machine of your cluster, and that it is the mpicxx of this version that is called when you run your program? I ask because you mentioned that there was an old version of Open-MPI present... die you remove this? Jody Hi I have just logged in a few other boxes and they all mount my home folder. When running `echo $LD_LIBRARY_PATH` and other commands, I get what I expect to get, but this might be because I have set these variables in the .bashrc file. So, I tried compiling/running like this ~/local/bin/mpicxx [stuff] and ~/local/bin/mpirun -np 4 ray-trace, but I get the same errors. 
As for the previous version, I don't have root access, therefore I was not able to remove it. I was just trying to outrun it by setting the $PATH variable to point first at my local installation. Catalin -- ** Catalin David B.Sc. Computer Science 2010 Jacobs University Bremen Phone: +49-(0)1577-49-38-667 College Ring 4, #343 Bremen, 28759 Germany **
Re: [OMPI users] Segmentation fault - Address not mapped
This is the error you get when an invalid communicator handle is passed to a MPI function, the handle is deferenced so you may or may not get a SEGV from it depending on the value you pass. The 0x44a0 address is an offset from 0x4400, the value of MPI_COMM_WORLD in mpich2, my guess would be you are either picking up a mpich2 mpi.h or the mpich2 mpicc. Ashley, On Tue, 2009-07-07 at 11:05 +0100, Catalin David wrote: > Hello, all! > > Just installed Valgrind (since this seems like a memory issue) and got > this interesting output (when running the test program): > > ==4616== Syscall param sched_setaffinity(mask) points to unaddressable byte(s) > ==4616==at 0x43656BD: syscall (in /lib/tls/libc-2.3.2.so) > ==4616==by 0x4236A75: opal_paffinity_linux_plpa_init (plpa_runtime.c:37) > ==4616==by 0x423779B: > opal_paffinity_linux_plpa_have_topology_information (plpa_map.c:501) > ==4616==by 0x4235FEE: linux_module_init (paffinity_linux_module.c:119) > ==4616==by 0x447F114: opal_paffinity_base_select > (paffinity_base_select.c:64) > ==4616==by 0x444CD71: opal_init (opal_init.c:292) > ==4616==by 0x43CE7E6: orte_init (orte_init.c:76) > ==4616==by 0x4067A50: ompi_mpi_init (ompi_mpi_init.c:342) > ==4616==by 0x40A3444: PMPI_Init (pinit.c:80) > ==4616==by 0x804875C: main (test.cpp:17) > ==4616== Address 0x0 is not stack'd, malloc'd or (recently) free'd > ==4616== > ==4616== Invalid read of size 4 > ==4616==at 0x4095772: ompi_comm_invalid (communicator.h:261) > ==4616==by 0x409581E: PMPI_Comm_size (pcomm_size.c:46) > ==4616==by 0x8048770: main (test.cpp:18) > ==4616== Address 0x44a0 is not stack'd, malloc'd or (recently) free'd > [denali:04616] *** Process received signal *** > [denali:04616] Signal: Segmentation fault (11) > [denali:04616] Signal code: Address not mapped (1) > [denali:04616] Failing at address: 0x44a0 > [denali:04616] [ 0] /lib/tls/libc.so.6 [0x42b4de0] > [denali:04616] [ 1] > /users/cluster/cdavid/local/lib/libmpi.so.0(MPI_Comm_size+0x6f) > [0x409581f] > [denali:04616] [ 2] ./test(__gxx_personality_v0+0x12d) [0x8048771] > [denali:04616] [ 3] /lib/tls/libc.so.6(__libc_start_main+0xf8) [0x42a2768] > [denali:04616] [ 4] ./test(__gxx_personality_v0+0x3d) [0x8048681] > [denali:04616] *** End of error message *** > ==4616== > ==4616== Invalid read of size 4 > ==4616==at 0x4095782: ompi_comm_invalid (communicator.h:261) > ==4616==by 0x409581E: PMPI_Comm_size (pcomm_size.c:46) > ==4616==by 0x8048770: main (test.cpp:18) > ==4616== Address 0x44a0 is not stack'd, malloc'd or (recently) free'd > > > The problem is that, now, I don't know where the issue comes from (is > it libc that is too old and incompatible with g++ 4.4/OpenMPI? is libc > broken?). > > Any help would be highly appreciated. > > Thanks, > Catalin > > > On Mon, Jul 6, 2009 at 3:36 PM, Catalin David > wrote: > > On Mon, Jul 6, 2009 at 3:26 PM, jody wrote: > >> Hi > >> Are you also sure that you have the same version of Open-MPI > >> on every machine of your cluster, and that it is the mpicxx of this > >> version that is called when you run your program? > >> I ask because you mentioned that there was an old version of Open-MPI > >> present... die you remove this? > >> > >> Jody > > > > Hi > > > > I have just logged in a few other boxes and they all mount my home > > folder. When running `echo $LD_LIBRARY_PATH` and other commands, I get > > what I expect to get, but this might be because I have set these > > variables in the .bashrc file. 
So, I tried compiling/running like this > > ~/local/bin/mpicxx [stuff] and ~/local/bin/mpirun -np 4 ray-trace, > > but I get the same errors. > > > > As for the previous version, I don't have root access, therefore I was > > not able to remove it. I was just trying to outrun it by setting the > > $PATH variable to point first at my local installation. > > > > > > Catalin > > > > > > -- > > > > ** > > Catalin David > > B.Sc. Computer Science 2010 > > Jacobs University Bremen > > > > Phone: +49-(0)1577-49-38-667 > > > > College Ring 4, #343 > > Bremen, 28759 > > Germany > > ** > > > > > -- Ashley Pittman, Bath, UK. Padb - A parallel job inspection tool for cluster computing http://padb.pittman.org.uk
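To illustrate Ashley's point about the handle clash (paraphrased from memory of the two headers, not copied verbatim), the two implementations define MPI_Comm in incompatible ways:

  MPICH2 mpi.h (roughly):    typedef int MPI_Comm;
                             #define MPI_COMM_WORLD ((MPI_Comm)0x44000000)

  Open MPI mpi.h (roughly):  typedef struct ompi_communicator_t *MPI_Comm;
                             #define MPI_COMM_WORLD (&ompi_mpi_comm_world)

Code compiled against the MPICH2 header therefore hands Open MPI's libmpi an integer constant where it expects a pointer; dereferencing it inside ompi_comm_invalid() would produce exactly the "Address not mapped" reads at the 0x44a0 address quoted above.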
Re: [OMPI users] MPI and C++ - now Send and Receive of Classes and STL containers
Hi, On Mon, Jul 06, 2009 at 03:24:07PM -0400, Luis Vitorio Cargnini wrote: > Thanks, but I really do not want to use Boost. > Is easier ? certainly is, but I want to make it using only MPI > itself > and not been dependent of a Library, or templates like the majority > of > boost a huge set of templates and wrappers for different libraries, > implemented in C, supplying a wrapper for C++. > I admit Boost is a valuable tool, but in my case, as much > independent I > could be from additional libs, better. > If you do not want to use boost, then I suggest not using nested vectors but just ones that contain PODs as value_type (or even C-arrays). If you insist on using complicated containers you will end up writing your own MPI-C++ abstraction (resulting in a library). This will be a lot of (unnecessary and hard) work. Just my 2 cents. Cheers, Markus
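To illustrate Markus's point, a flat container of PODs maps directly onto a plain MPI call, whereas a nested vector does not (a minimal sketch; the function name and tag are illustrative):

#include <vector>
#include <mpi.h>

void send_flat(const std::vector<double>& data, int dest, MPI_Comm comm)
{
    // std::vector's storage is contiguous, so one send covers it all --
    // no packing or custom datatype needed.
    if (!data.empty()) {
        MPI_Send(const_cast<double*>(&data[0]),
                 static_cast<int>(data.size()), MPI_DOUBLE, dest, 0, comm);
    }
}

// A std::vector< std::vector<double> >, by contrast, has no contiguous
// layout: it would have to be flattened (plus a size header) before
// sending, which is exactly the hand-rolled MPI-C++ abstraction layer
// the post above warns about.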
Re: [OMPI users] Segmentation fault - Address not mapped
Hello, all! Just installed Valgrind (since this seems like a memory issue) and got this interesting output (when running the test program): ==4616== Syscall param sched_setaffinity(mask) points to unaddressable byte(s) ==4616==at 0x43656BD: syscall (in /lib/tls/libc-2.3.2.so) ==4616==by 0x4236A75: opal_paffinity_linux_plpa_init (plpa_runtime.c:37) ==4616==by 0x423779B: opal_paffinity_linux_plpa_have_topology_information (plpa_map.c:501) ==4616==by 0x4235FEE: linux_module_init (paffinity_linux_module.c:119) ==4616==by 0x447F114: opal_paffinity_base_select (paffinity_base_select.c:64) ==4616==by 0x444CD71: opal_init (opal_init.c:292) ==4616==by 0x43CE7E6: orte_init (orte_init.c:76) ==4616==by 0x4067A50: ompi_mpi_init (ompi_mpi_init.c:342) ==4616==by 0x40A3444: PMPI_Init (pinit.c:80) ==4616==by 0x804875C: main (test.cpp:17) ==4616== Address 0x0 is not stack'd, malloc'd or (recently) free'd ==4616== ==4616== Invalid read of size 4 ==4616==at 0x4095772: ompi_comm_invalid (communicator.h:261) ==4616==by 0x409581E: PMPI_Comm_size (pcomm_size.c:46) ==4616==by 0x8048770: main (test.cpp:18) ==4616== Address 0x44a0 is not stack'd, malloc'd or (recently) free'd [denali:04616] *** Process received signal *** [denali:04616] Signal: Segmentation fault (11) [denali:04616] Signal code: Address not mapped (1) [denali:04616] Failing at address: 0x44a0 [denali:04616] [ 0] /lib/tls/libc.so.6 [0x42b4de0] [denali:04616] [ 1] /users/cluster/cdavid/local/lib/libmpi.so.0(MPI_Comm_size+0x6f) [0x409581f] [denali:04616] [ 2] ./test(__gxx_personality_v0+0x12d) [0x8048771] [denali:04616] [ 3] /lib/tls/libc.so.6(__libc_start_main+0xf8) [0x42a2768] [denali:04616] [ 4] ./test(__gxx_personality_v0+0x3d) [0x8048681] [denali:04616] *** End of error message *** ==4616== ==4616== Invalid read of size 4 ==4616==at 0x4095782: ompi_comm_invalid (communicator.h:261) ==4616==by 0x409581E: PMPI_Comm_size (pcomm_size.c:46) ==4616==by 0x8048770: main (test.cpp:18) ==4616== Address 0x44a0 is not stack'd, malloc'd or (recently) free'd The problem is that, now, I don't know where the issue comes from (is it libc that is too old and incompatible with g++ 4.4/OpenMPI? is libc broken?). Any help would be highly appreciated. Thanks, Catalin On Mon, Jul 6, 2009 at 3:36 PM, Catalin David wrote: > On Mon, Jul 6, 2009 at 3:26 PM, jody wrote: >> Hi >> Are you also sure that you have the same version of Open-MPI >> on every machine of your cluster, and that it is the mpicxx of this >> version that is called when you run your program? >> I ask because you mentioned that there was an old version of Open-MPI >> present... die you remove this? >> >> Jody > > Hi > > I have just logged in a few other boxes and they all mount my home > folder. When running `echo $LD_LIBRARY_PATH` and other commands, I get > what I expect to get, but this might be because I have set these > variables in the .bashrc file. So, I tried compiling/running like this > ~/local/bin/mpicxx [stuff] and ~/local/bin/mpirun -np 4 ray-trace, > but I get the same errors. > > As for the previous version, I don't have root access, therefore I was > not able to remove it. I was just trying to outrun it by setting the > $PATH variable to point first at my local installation. > > > Catalin > > > -- > > ** > Catalin David > B.Sc. Computer Science 2010 > Jacobs University Bremen > > Phone: +49-(0)1577-49-38-667 > > College Ring 4, #343 > Bremen, 28759 > Germany > ** > -- ** Catalin David B.Sc. 
Computer Science 2010 Jacobs University Bremen Phone: +49-(0)1577-49-38-667 College Ring 4, #343 Bremen, 28759 Germany **
Re: [OMPI users] MPI and C++ (Boost)
> IF boost is attached to MPI 3 (or whatever), AND it becomes part of the > mainstream MPI implementations, THEN you can have the discussion again. Hi, At the moment, I think that Boost.MPI only supports MPI1.1, and even then, some additional work may be done, at least regarding the complex datatypes. Matthieu -- Information System Engineer, Ph.D. Website: http://matthieu-brucher.developpez.com/ Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92 LinkedIn: http://www.linkedin.com/in/matthieubrucher
Re: [OMPI users] MPI and C++ (Boost)
Hi Luis, Luis Vitorio Cargnini wrote: Your suggestion is a great and interesting idea. I only have the fear to get used to the Boost and could not get rid of Boost anymore, because one thing is sure the abstraction added by Boost is impressive, it turn I should add that I fully understand what it is you are saying and despite all the good things there were being said about Boost, I was avoiding it for a very long time because of the dependency issue. For two reasons -- the dependency issue for myself (exactly like what you said) and distributing it means users will have to do an extra step (regardless of how easy/hard the step is, it's an extra step). I finally switched over :-) and the "prototype" idea was just a way to ease you into it. MPI programs are hard to get right, and Boost aside, it is a good idea to have something working that is easy to do and then you can remove the parts that you don't like later. By the way, it seems that less-used parts of MPI do not have equivalents in Boost.MPI, so just using Boost won't solve all of your problems. There is a list here (the table with the entries that say "unsupported"): http://www.boost.org/doc/libs/1_39_0/doc/html/mpi/tutorial.html#mpi.c_mapping Good luck! Ray
Re: [OMPI users] MPI and C++ (Boost)
Terry Frankcombe wrote: I understand Luis' position completely. He wants an MPI program, not a program that's written in some other environment, no matter how attractive that may be. It's like the difference between writing a numerical program in standard-conforming Fortran and writing it in the latest flavour of the month interpreted language calling highly optimised libraries behind the scenes. IF boost is attached to MPI 3 (or whatever), AND it becomes part of the mainstream MPI implementations, THEN you can have the discussion again. Ciao Terry I guess we view it differently. Boost.MPI isn't a language at all. It is a library written in fully ISO compliant C++, that exists to make doing an otherwise complex and error prone job simpler and more readable. As such, I would compare it to using a well tested BLAS library to do matrix manipulations in your Fortran code or writing it yourself. Both can be standard conforming Fortran (though many BLAS implementations include lower level optimized code), and neither is a flavor of the month interpreted language. The advantage of the library is that it allows you to work at a level of abstraction that may be better suited to your work. For you, as for everyone else, make your choices based on what you believe best serves the needs of your program, whether that includes Boost.MPI or not. However, making the choices with an understanding of the options strengths and weaknesses gives the best chance of writing a good program. John PS - I am not part of the MPI Forum, but I would be surprised if they chose to add boost to any MPI version. Possibly an analog of Boost.MPI, but not all of boost. There are over 100 different libraries, covering many different areas of use in boost, and most of them have no direct connection to MPI. PPS - If anyone would like to know more about Boost, I would suggest the website (http://www.boost.org) or the user mailing list. Folks who don't write in C++ will probably not be very interested.
[OMPI users] Question on running the openmpi test modules
Hi, I'm new to openmpi and currently I have setup openmpi-1.3.3a1r21566 on my Linux machines. I have run some of available examples and also noticed there are some test modules under /openmpi-1.3.3a1r21566/test. Are these tests run on batchwise? then how ? or are these tests suppose to run individually by compiling and executing seperately ? I'm hoping to contribute openmpi as a developer, so I would like to know can users contribute by adding more example codes ? Thanks, Prasad. -- http://www.codeproject.com/script/Articles/MemberArticles.aspx?amid=3489381
[OMPI users] bulding rpm
Hi everyone,

I built an rpm file for openmpi-1.3.2 with the openmpi.spec and buildrpm.sh from http://www.open-mpi.org/software/ompi/v1.3/srpm.php

I changed buildrpm.sh as follows:

prefix="/usr/local/openmpi/intel/1.3.2"
specfile="openmpi.spec"
#rpmbuild_options=${rpmbuild_options:-"--define 'mflags -j4'"}
# -j4 is an option to make, specifies the number of jobs (4) to run simultaneously.
rpmbuild_options="--define 'mflags -j4'"
#configure_options=${configure_options:-""}
configure_options="FC=ifort F77=ifort CC=icc CXX=icpc --with-sge --with-threads=posix --enable-mpi-threads"

# install ${prefix}/bin/mpivars.* scripts
rpmbuild_options=${rpmbuild_options}" --define 'install_in_opt 0' --define 'install_shell_scripts 1' --define 'install_modulefile 0'"
# prefix variable has to be passed to rpmbuild
rpmbuild_options=${rpmbuild_options}" --define '_prefix ${prefix}'"

# Note that this script can build one or all of the following RPMs:
# SRPM, all-in-one, multiple.

# If you want to build the SRPM, put "yes" here
build_srpm=${build_srpm:-"no"}
# If you want to build the "all in one RPM", put "yes" here
build_single=${build_single:-"yes"}
# If you want to build the "multiple" RPMs, put "yes" here
build_multiple=${build_multiple:-"no"}

It creates openmpi-1.3.2-1.x86_64.rpm with no error, but when I install it with rpm -ivh I see:

error: Failed dependencies:
        libifcoremt.so.5()(64bit) is needed by openmpi-1.3.2-1.x86_64
        libifport.so.5()(64bit) is needed by openmpi-1.3.2-1.x86_64
        libimf.so()(64bit) is needed by openmpi-1.3.2-1.x86_64
        libintlc.so.5()(64bit) is needed by openmpi-1.3.2-1.x86_64
        libiomp5.so()(64bit) is needed by openmpi-1.3.2-1.x86_64
        libsvml.so()(64bit) is needed by openmpi-1.3.2-1.x86_64
        libtorque.so.2()(64bit) is needed by openmpi-1.3.2-1.x86_64

but all of the above libraries are present on my computer. If I use rpm -ivh --nodeps it installs completely, but when I use mpif90 and mpirun I see:

$ /usr/local/openmpi/intel/1.3.2/bin/mpif90
gfortran: no input files

(I compiled with ifort.)

$ /usr/local/openmpi/intel/1.3.2/bin/mpirun
/usr/local/openmpi/intel/1.3.2/bin/mpirun: symbol lookup error: /usr/local/openmpi/intel/1.3.2/bin/mpirun: undefined symbol: orted_cmd_line

What is wrong? How can I build an rpm of openmpi with the Intel compiler?

Thanks