Re: [OMPI users] quadrics support?

2009-07-07 Thread Michael Di Domenico
So, on the first run I seem to have run into a bit of an issue.  All the
Quadrics modules are compiled and loaded.  I can ping between nodes
over the Quadrics interfaces.  But when I try to run one of the hello
MPI examples from Open MPI, I get:

first run: the process hung - killed with Ctrl-C,
though it doesn't seem to actually die and kill -9 doesn't work

second run: the process fails with
  failed elan4_attach  Device or resource busy
  elan_allocSleepDesc  Failed to allocate IRQ cookie 2a: 22 Invalid argument

All subsequent runs fail the same way and I have to reboot the box to
get the processes to go away.

I'm not sure if this is a Quadrics or Open MPI issue at this point, but
I figured that since there are Quadrics people on the list, it's a good
place to start.

On Tue, Jul 7, 2009 at 3:30 PM, Michael Di
Domenico wrote:
> Does OpenMPI/Quadrics require the Quadrics kernel patches in order to
> operate, or to operate at full speed, or are the Quadrics modules
> sufficient?
>
> On Thu, Jul 2, 2009 at 1:52 PM, Ashley Pittman wrote:
>> On Thu, 2009-07-02 at 09:34 -0400, Michael Di Domenico wrote:
>>> Jeff,
>>>
>>> Okay, thanks.  I'll give it a shot and report back.  I can't
>>> contribute any code, but I can certainly do testing...
>>
>> I'm from the Quadrics stable so I could certainly support a port should
>> you require it, but I don't have access to hardware either currently.
>>
>> Ashley,
>>
>> --
>>
>> Ashley Pittman, Bath, UK.
>>
>> Padb - A parallel job inspection tool for cluster computing
>> http://padb.pittman.org.uk
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>


Re: [OMPI users] OpenMPI+SGE tight integration works on E6600 core duo systems but not on Q9550 quads

2009-07-07 Thread Reuti

Hi,

On 07.07.2009, at 22:12, Lengyel, Florian wrote:


Hi,
I may have overlooked something in the archives (not to mention
Googling)--if so I apologize; however, I have been unable to find
info on this particular problem.

OpenMPI+SGE tight integration works on E6600 core duo systems but
not on Q9550 quads.

Could use some troubleshooting assistance. Thanks.


Is this something you found out, or is it your question?

I'm not aware of this. What should be the cause of it?!? Do you have
a link - was it on the SGE list?


-- Reuti

[OMPI users] OpenMPI+SGE tight integration works on E6600 core duo systems but not on Q9550 quads

2009-07-07 Thread Lengyel, Florian
Hi,
I may have overlooked something in the archives (not to mention Googling)--if
so I apologize; however, I have been unable to find info on this particular problem.

OpenMPI+SGE tight integration works on E6600 core duo systems but not on Q9550 
quads.
Could use some troubleshooting assistance. Thanks. 

I'm running SGE 6.0u10 on a Linux cluster running openSUSE 11.

OpenMPI was compiled with SGE support, and the required components are present:

[flengyel@nept OPENMPI]$ ompi_info | grep gridengine
 MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
 MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)


The parallel execution environment for OpenMPI is as follows:

[flengyel@nept OPENMPI]$ qconf -sp ompi
pe_name   ompi
slots 999
user_listsResearch
xuser_lists   NONE
start_proc_args   /bin/true
stop_proc_args/bin/true
allocation_rule   $fill_up
control_slavesTRUE
job_is_first_task FALSE
urgency_slots min

A trivial OpenMPI job using this PE will run on a queue for the Intel E6600 core
duo machines:

[flengyel@nept OPENMPI]$ cat sum2.sh

#!/bin/bash
#$ -S /bin/bash
#$ -q x86_64.q
#$ -N sum
#$ -pe ompi 4

#$ -cwd

export PATH=/home/nept/apps64/openmpi/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/nept/apps64/openmpi/lib
. /usr/local/sge/default/common/settings.sh
mpirun --mca pls_gridengine_verbose 2  --prefix /home/nept/apps64/openmpi -v  
./sum

Here are the results:

[flengyel@nept OPENMPI]$ qsub sum2.sh
Your job 23194 ("sum") has been submitted

[flengyel@nept OPENMPI]$ qstat -r -u flengyel

job-ID  prior   name   user      state  submit/start at      queue                     slots  ja-task-ID
---------------------------------------------------------------------------------------------------------
  23194 0.25007 sum    flengyel  r      07/07/2009 14:14:40  x86_6...@m49.gc.cuny.edu  4
   Full jobname: sum
   Master queue: x86_6...@m49.gc.cuny.edu
   Requested PE: ompi 4
   Granted PE:   ompi 4
   Hard Resources:  
   Soft Resources:  
   Hard requested queues: x86_64.q


[flengyel@nept OPENMPI]$ more sum.o23194

The sum from 1 to 1000 is: 500500
[flengyel@nept OPENMPI]$ more sum.e23194
Starting server daemon at host "m49.gc.cuny.edu"
Starting server daemon at host "m33.gc.cuny.edu"
Server daemon successfully started with task id "1.m49"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host 
m49.gc.cuny.edu ...
Server daemon successfully started with task id "1.m33"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host 
m33.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited with exit code 0
reading exit code from shepherd ...

But the same job with the queue set to quad.q for the Q9550 quad core machines
has daemon trouble:


[flengyel@nept OPENMPI]$ !qstat
qstat -r -u flengyel
job-ID  prior   name   user      state  submit/start at      queue                   slots  ja-task-ID
---------------------------------------------------------------------------------------------------------
  23196 0.25000 sum    flengyel  r      07/07/2009 14:26:21  qua...@m09.gc.cuny.edu  2
   Full jobname: sum
   Master queue: qua...@m09.gc.cuny.edu
   Requested PE: ompi 2
   Granted PE:   ompi 2
   Hard Resources:   
   Soft Resources:   
   Hard requested queues: quad.q
[flengyel@nept OPENMPI]$ more sum.e23196 
Starting server daemon at host "m15.gc.cuny.edu"
Starting server daemon at host "m09.gc.cuny.edu"
Server daemon successfully started with task id "1.m15"
Server daemon successfully started with task id "1.m09"
Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m15.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
reading exit code from shepherd ... Establishing /usr/local/sge/utilbin/lx24-amd64/rsh session to host m09.gc.cuny.edu ...
/usr/local/sge/utilbin/lx24-amd64/rsh exited on signal 13 (PIPE)
reading exit code from shepherd ... 129
[m09.gc.cuny.edu:11413] ERROR: A daemon on node m15.gc.cuny.edu failed to start 
as expected.
[m09.gc.cuny.edu:11413] ERROR: There may be more information available from
[m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
[m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
[m09.gc.cuny.edu:11413] ERROR: The daemon exited unexpectedly with status 129.
129
[m09.gc.cuny.edu:11413] ERROR: A daemon on node m09.gc.cuny.edu failed to start 
as expected.
[m09.gc.cuny.edu:11413] ERROR: There may be more information available from
[m09.gc.cuny.edu:11413] ERROR: the 'qstat -t' command on the Grid Engine tasks.
[m09.gc.cuny.edu:11413] ERROR: If the problem persists, please restart the
[m09.gc.cuny.edu:11413] ERROR: Grid Engine PE job
[m09.gc.cuny.edu:11

Re: [OMPI users] quadrics support?

2009-07-07 Thread Michael Di Domenico
Does OpenMPI/Quadrics require the Quadrics kernel patches in order to
operate, or to operate at full speed, or are the Quadrics modules
sufficient?

On Thu, Jul 2, 2009 at 1:52 PM, Ashley Pittman wrote:
> On Thu, 2009-07-02 at 09:34 -0400, Michael Di Domenico wrote:
>> Jeff,
>>
>> Okay, thanks.  I'll give it a shot and report back.  I can't
>> contribute any code, but I can certainly do testing...
>
> I'm from the Quadrics stable so I could certainly support a port should
> you require it, but I don't have access to hardware either currently.
>
> Ashley,
>
> --
>
> Ashley Pittman, Bath, UK.
>
> Padb - A parallel job inspection tool for cluster computing
> http://padb.pittman.org.uk
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[OMPI users] Bug: coll_tuned_dynamic_rules_filename and duplicate communicators

2009-07-07 Thread Jumper, John
I am attempting to use coll_tuned_dynamic_rules_filename to tune Open
MPI 1.3.2.  Based on my testing, it appears that the dynamic rules file
*only* influences the algorithm selection for MPI_COMM_WORLD.  Any
duplicate communicators will only use fixed or forced rules, which may
have much worse performance than the custom-tuned collectives in the
dynamic rules file.  The following code demonstrates the difference
between MPI_COMM_WORLD and a duplicate communicator.  

test.c:
#include <mpi.h>

int main( int argc, char** argv ) {
  float u = 0.0, v = 0.0; 
  MPI_Comm world_dup;

  MPI_Init( &argc, &argv );
  MPI_Comm_dup( MPI_COMM_WORLD, &world_dup );

  MPI_Allreduce( &u, &v, 1, MPI_FLOAT, MPI_SUM, world_dup );
  MPI_Barrier( MPI_COMM_WORLD );
  MPI_Allreduce( &u, &v, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD );

  MPI_Finalize();
  return 0;
}

allreduce.ompi:
1
2
1
9
1
0 1 0 0
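
For reference, here is my reading of the file above (the '#' annotations are
explanatory only and are not part of the file; the field meanings are inferred
from the parser trace at the end of this message, so treat this as an
interpretation rather than a definitive description of the format):

1        # one collective has a rule
2        # collective ID 2 = allreduce
1        # one communicator-size entry follows
9        # the rule applies to communicators of size 9
1        # one message-size entry follows for this communicator size
0 1 0 0  # from message size 0: algorithm 1 (basic linear), topo/fan-out 0, segment size 0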

invocation:
orterun -np 9 \
-mca btl self,sm,openib,tcp \
-mca coll_tuned_use_dynamic_rules 1 \
-mca coll_tuned_dynamic_rules_filename allreduce.ompi \
-mca coll_base_verbose 1000 \
-- test

This program is run with tracing, and the barrier is only used to
separate the allreduce calls in the trace.  The trace for one node is at
the end of the message, and the relevant section is the choice of
algorithms for the two allreduce calls.  The allreduce.ompi file
indicates that all size 9 communicators should use the basic linear
allreduce algorithm.  MPI_COMM_WORLD uses basic_linear, but the
world_dup communicator uses the fixed algorithm (for this message size,
the fixed algorithm is recursive doubling).

Thank you.

John Jumper



Trace of one process for the above program:
mca: base: components_open: opening coll components
mca: base: components_open: found loaded component basic
mca: base: components_open: component basic register function successful
mca: base: components_open: component basic has no open function
mca: base: components_open: found loaded component hierarch
mca: base: components_open: component hierarch has no register function
mca: base: components_open: component hierarch open function successful
mca: base: components_open: found loaded component inter
mca: base: components_open: component inter has no register function
mca: base: components_open: component inter open function successful
mca: base: components_open: found loaded component self
mca: base: components_open: component self has no register function
mca: base: components_open: component self open function successful
mca: base: components_open: found loaded component sm
mca: base: components_open: component sm has no register function
mca: base: components_open: component sm open function successful
mca: base: components_open: found loaded component sync
mca: base: components_open: component sync register function successful
mca: base: components_open: component sync has no open function
mca: base: components_open: found loaded component tuned
mca: base: components_open: component tuned has no register function
coll:tuned:component_open: done!
mca: base: components_open: component tuned open function successful
coll:find_available: querying coll component basic
coll:find_available: coll component basic is available
coll:find_available: querying coll component hierarch
coll:find_available: coll component hierarch is available
coll:find_available: querying coll component inter
coll:find_available: coll component inter is available
coll:find_available: querying coll component self
coll:find_available: coll component self is available
coll:find_available: querying coll component sm
coll:find_available: coll component sm is available
coll:find_available: querying coll component sync
coll:find_available: coll component sync is available
coll:find_available: querying coll component tuned
coll:find_available: coll component tuned is available
coll:base:comm_select: new communicator: MPI_COMM_WORLD (cid 0)
coll:base:comm_select: Checking all available modules
coll:base:comm_select: component available: basic, priority: 10
coll:base:comm_select: component not available: hierarch
coll:base:comm_select: component not available: inter
coll:base:comm_select: component not available: self
coll:base:comm_select: component not available: sm
coll:base:comm_select: component not available: sync
coll:tuned:module_tuned query called
coll:tuned:module_query using intra_dynamic
coll:base:comm_select: component available: tuned, priority: 30
coll:tuned:module_init called.
coll:tuned:module_init MCW & Dynamic
coll:tuned:module_init Opening [allreduce.ompi]
Reading dynamic rule for collective ID 2
Read communicator count 1 for dynamic rule for collective ID 2
Read message count 1 for dynamic rule for collective ID 2 and comm size
9
Done reading dynamic rule for collective ID 2

Collectives with rules  : 1
Communicator sizes with rules   : 1
Message sizes with rules: 1
Lines in configuration file read  

[OMPI users] Segfault when using valgrind

2009-07-07 Thread Justin
(Sorry if this is posted twice, I sent the same email yesterday but it 
never appeared on the list).



Hi, I am attempting to debug a memory corruption in an MPI program
using Valgrind.  However, when I run with Valgrind I get semi-random
segfaults and Valgrind messages within the Open MPI library.  Here is an
example of such a segfault:


==6153==
==6153== Invalid read of size 8
==6153==at 0x19102EA0: (within /usr/lib/openmpi/lib/openmpi/
mca_btl_sm.so)
==6153==by 0x182ABACB: (within 
/usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==6153==by 0x182A3040: (within 
/usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==6153==by 0xB425DD3: PMPI_Isend (in 
/usr/lib/openmpi/lib/libmpi.so.0.0.0)
==6153==by 0x7B83DA8: int 
Uintah::SFC::MergeExchange(int, 
std::vector, 
std::allocator > >&, 
std::vector, 
std::allocator > >&, 
std::vector, 
std::allocator > >&) (SFC.h:2989)
==6153==by 0x7B84A8F: void Uintah::SFC::Batcherschar>(std::vector, 
std::allocator > >&, 
std::vector, 
std::allocator > >&, 
std::vector, 
std::allocator > >&) (SFC.h:3730)
==6153==by 0x7B8857B: void Uintah::SFC::Cleanupchar>(std::vector, 
std::allocator > >&, 
std::vector, 
std::allocator > >&, 
std::vector, 
std::allocator > >&) (SFC.h:3695)
==6153==by 0x7B88CC6: void Uintah::SFC::Parallel0<3, 
unsigned char>() (SFC.h:2928)
==6153==by 0x7C00AAB: void Uintah::SFC::Parallel<3, unsigned 
char>() (SFC.h:1108)
==6153==by 0x7C0EF39: void Uintah::SFC::GenerateDim<3>(int) 
(SFC.h:694)
==6153==by 0x7C0F0F2: Uintah::SFC::GenerateCurve(int) 
(SFC.h:670)
==6153==by 0x7B30CAC: 
Uintah::DynamicLoadBalancer::useSFC(Uintah::Handle 
const&, int*) (DynamicLoadBalancer.cc:429)

==6153==  Address 0x10 is not stack'd, malloc'd or (recently) free'd
^G^G^GThread "main"(pid 6153) caught signal SIGSEGV at address (nil) 
(segmentation violation)


Looking at the code for our Isend at SFC.h:2989, it does not seem to have any
errors:


=
 MergeInfo myinfo,theirinfo;

 MPI_Request srequest, rrequest;
 MPI_Status status;

 myinfo.n=n;
 if(n!=0)
 {
   myinfo.min=sendbuf[0].bits;
   myinfo.max=sendbuf[n-1].bits;
 }
 //cout << rank << " n:" << n << " min:" << (int)myinfo.min << "max:" << (int)myinfo.max << endl;


 MPI_Isend(&myinfo,sizeof(MergeInfo),MPI_BYTE,to,0,Comm,&srequest);
==

myinfo is a struct located on the stack, to is the rank of the processor
that the message is being sent to, and srequest is also on the stack.
In addition, this message is waited on prior to exiting this block of
code, so they still exist on the stack.  When I don't run with Valgrind,
my program runs past this point just fine.

I am currently using Open MPI 1.3 from the Debian unstable branch.  I
also see the same type of segfault in a different portion of the code,
involving an MPI_Allgather, which can be seen below:


==
==22736== Use of uninitialised value of size 8
==22736==at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736==by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736==by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736==by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual 
(coll_tuned_util.c:55)
==22736==by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck 
(coll_tuned_util.h:60)

==22736==by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736==by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736==by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736==by 0x6465457: 
Uintah::Grid::problemSetup(Uintah::Handle const&, 
Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736==by 0x8345759: Uintah::SimulationController::gridSetup() 
(SimulationController.cc:243)
==22736==by 0x834F418: Uintah::AMRSimulationController::run() 
(AMRSimulationController.cc:117)

==22736==by 0x4089AE: main (sus.cc:629)
==22736==
==22736== Invalid read of size 8
==22736==at 0x19104775: mca_btl_sm_component_progress (opal_list.h:322)
==22736==by 0x1382CE09: opal_progress (opal_progress.c:207)
==22736==by 0xB404264: ompi_request_default_wait_all (condition.h:99)
==22736==by 0x1A1ADC16: ompi_coll_tuned_sendrecv_actual 
(coll_tuned_util.c:55)
==22736==by 0x1A1B61E1: ompi_coll_tuned_allgatherv_intra_bruck 
(coll_tuned_util.h:60)

==22736==by 0xB418B2E: PMPI_Allgatherv (pallgatherv.c:121)
==22736==by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736==by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736==by 0x6465457: 
Uintah::Grid::problemSetup(Uintah::Handle const&, 
Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736==by 0x8345759: Uintah::SimulationController::gridSetup() 
(SimulationController.cc:243)
==22736==by 0x834F418: Uintah::AMRSimulationController::run() 
(AMRSimulationController.cc:117)

==22736==by 0x4089AE: main (sus.cc:629)

Re: [OMPI users] any way to get serial time on head node?

2009-07-07 Thread Jeff Squyres
You probably want to use an MPI tracing tool that can break down the  
times spent inside and outside of the MPI library.  User vs. system  
time, as you noted, can get quite blurred.
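
If a full tracing tool is overkill, a rough first cut is to hand-instrument
the loop with MPI_Wtime and accumulate "inside MPI" vs. "outside MPI"
buckets.  A minimal sketch in C++ (your code is R/Rmpi, so this only
illustrates the principle; the loop body and counts are placeholders):

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t_serial = 0.0, t_mpi = 0.0;

    for (int iter = 0; iter < 100; ++iter) {
        double t0 = MPI_Wtime();
        /* ... serial computation (the "S" term) goes here ... */
        double t1 = MPI_Wtime();

        MPI_Barrier(MPI_COMM_WORLD);   /* stand-in for the real communication */
        double t2 = MPI_Wtime();

        t_serial += t1 - t0;   /* time spent outside MPI */
        t_mpi    += t2 - t1;   /* time spent inside MPI: C + B + I lumped together */
    }

    std::printf("rank %d: serial %.3f s, inside MPI %.3f s\n",
                rank, t_serial, t_mpi);
    MPI_Finalize();
    return 0;
}

This separates S from C+B+I by wall-clock time, which sidesteps the
user-vs-system blurring entirely.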



On Jul 6, 2009, at 12:48 PM, Ross Boylan wrote:


Let total time on my slot 0 process be S+C+B+I
= serial computations + communication + busy wait + idle
Is there a way to find out S?
S+C would probably also be useful, since I assume C is low.

The problem is that I = 0, roughly, and B is big.  Since B is big, the
usual process timing methods don't work.

If B all went to "system" as opposed to "user" time I could use the
latter, but I don't think that's the case.  Can anyone confirm that?

If S is big, I might be able to gain by parallelizing in a different
way.  By S I mean to refer to serial computation that is part of my
algorithm, rather than the technical fact that all the computation is
serial on a given slot.

I'm running R/RMPI.

Thanks.
Ross

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




--
Jeff Squyres
Cisco Systems



Re: [OMPI users] MPI and C++ (Boost)

2009-07-07 Thread Luis Vitorio Cargnini
Ok, after all the considerations, I'll try Boost, today, make some  
experiments and see if I can use it or if I'll avoid it yet.


But as Raymond said, I think, the problem is being dependent on a
rich-incredible-amazing toolset that still implements only MPI-1 and
does not implement all the MPI functions, which are the main drawbacks of
Boost, though the set of functions implemented does not compromise the
functionality. I don't know how the MPI-1, MPI-2 and future MPI-3
specifications, and their implementations, will affect Boost and the
developer using Boost, with OpenMPI of course.


Continuing: if something changes in Boost, how can I guarantee it
won't affect my code in the future? It is impossible.


Anyway, I'll test it today with and without it and choose my direction.
Thanks for all the replies, suggestions and solutions that you all
pointed out to me; I really appreciate all your help and comments about
whether or not to use Boost in my code.


Thanks and Regards.
Vitorio.


On 09-07-07, at 08:26, Jeff Squyres wrote:


I think you face a common trade-off:

- use a well-established, debugged, abstraction-rich library
- write all of that stuff yourself

FWIW, I think the first one is a no-brainer.  There's a reason they  
wrote Boost.MPI: it's complex, difficult stuff, and is perfect as  
middleware for others to use.


If having users perform a 2nd step is undesirable (i.e., install  
Boost before installing your software), how about embedding Boost in  
your software?  Your configure/build process can certainly be  
tailored to include Boost[.MPI].  Hence, users will only perform 1  
step, but it actually performs "2" steps under the covers (configures 
+installs Boost.MPI and then configures+installs your software,  
which uses Boost).


FWIW: Open MPI does exactly this.  Open MPI embeds at least 5  
software packages: PLPA, VampirTrace, ROMIO, libltdl, and libevent.   
But 99.9% of our users don't know/care because it's all hidden in  
our configure / make process.  If you watch carefully, you can see  
the output go by from each of those configure sections when running  
OMPI's configure.  But no one does.  ;-)


Sidenote: I would echo that the Forum is not considering including  
Boost.MPI at all.  Indeed, as mentioned in different threads, the  
Forum has already voted once to deprecate the MPI C++ bindings,  
partly *because* of Boost.  Boost.MPI has shown that the C++  
community is better at making C++ APIs for MPI than the Forum is.   
Hence, our role should be to make the base building blocks and let  
the language experts make their own preferred tools.





On Jul 7, 2009, at 5:03 AM, Matthieu Brucher wrote:

> IF boost is attached to MPI 3 (or whatever), AND it becomes part of the
> mainstream MPI implementations, THEN you can have the discussion again.


Hi,

At the moment, I think that Boost.MPI only supports MPI1.1, and even
then, some additional work may be done, at least regarding the complex
datatypes.

Matthieu
--
Information System Engineer, Ph.D.
Website: http://matthieu-brucher.developpez.com/
Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92

LinkedIn: http://www.linkedin.com/in/matthieubrucher
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




--
Jeff Squyres
Cisco Systems

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users






Re: [OMPI users] Segmentation fault - Address not mapped

2009-07-07 Thread Jeff Squyres

On Jul 7, 2009, at 8:08 AM, Catalin David wrote:


Thank you very much for the help and assistance :)

Using -isystem /users/cluster/cdavid/local/include the program now
runs fine (loads the correct mpi.h).



This is very fishy.

If mpic++ is in /users/cluster/cdavid/local/bin, and that directory is  
in the front of your $PATH, then using that to compile your  
application should pull in the right mpi.h file.


To be clear: if you use the right mpicc / mpic++ / mpif77 / mpif90,  
the Right header files should get pulled in because the wrappers will  
do the proper -I for you.  You can verify this by checking the output  
of "mpic++ my_program.cc -o my_program --showme" and see what compiler  
flags are getting passed down to the underlying compiler.


You might want to double check your setup to ensure that your PATH is  
absolutely correct, you have run "rehash" if you needed to (csh /  
tcsh), your LD_LIBRARY_PATH points to the right library (on all nodes,  
even for non-interactive logins), etc.


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Question on running the openmpi test modules

2009-07-07 Thread Jeff Squyres

On Jul 7, 2009, at 12:43 AM, Prasadcse Perera wrote:

I'm new to openmpi and currently I have set up openmpi-1.3.3a1r21566
on my Linux machines. I have run some of the available examples and also
noticed there are some test modules under /openmpi-1.3.3a1r21566/test.
Are these tests run batchwise? If so, how? Or are these tests supposed
to be run individually, by compiling and executing them separately?


They are run via "make check" -- it's a standard GNU mechanism that is  
built into our make system automatically by Automake.  These tests are  
loosely maintained at best -- they were put in a long time ago, but  
the bulk of our regression testing codes are in different, not- 
publicly-accessible repositories (mainly because many of them were not  
originally written by us and we were too lazy to look into public  
redistribution rights).


I'm hoping to contribute to openmpi as a developer, so I would like to
know whether users can contribute by adding more example code?



Great!  More tests, examples, documentation, and code are always  
appreciated!


Note that we have a separate "de...@open-mpi.org" list for developer- 
level discussions.


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Configuration problem or network problem?

2009-07-07 Thread Jeff Squyres
You might want to use a tracing library to see where exactly your  
synchronization issues are occurring.  It may depend on the  
communication pattern between your nodes and the timing between them.   
Additionally, your network switch(es) performance characteristics may  
come into effect here: are there retransmissions, timeouts, etc.?


It can sometimes be helpful to insert an MPI_BARRIER every few  
iterations just to keep all processes well-synchronized.  It seems  
counter-intuitive, but sometimes waiting a short time in a barrier can  
increase overall throughput (rather than waiting progressively longer  
times in poorly-synchronized blocking communications).
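
As a sketch of what I mean (the loop body and the every-10-iterations
choice are placeholders, not a recommendation for your specific code):

#include <mpi.h>

/* A generic compute/exchange loop with a re-synchronizing barrier
   every few iterations to keep the ranks loosely in step. */
void run_loop(MPI_Comm comm, int iterations)
{
    for (int i = 0; i < iterations; ++i) {
        /* ... local computation and point-to-point exchanges go here ... */

        if (i % 10 == 0) {      /* every 10 iterations; tune to taste */
            MPI_Barrier(comm);  /* pull all ranks back into step */
        }
    }
}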




On Jul 6, 2009, at 11:33 PM, Zou, Lin (GE, Research, Consultant) wrote:

 Thank you for your suggestion; I tried this solution, but it
doesn't work. In fact, the headnode doesn't participate in the
computing and communication; it only mallocs a large block of memory,
and when the loop in every PS3 is over, the headnode gathers the data
from every PS3.
The strange thing is that sometimes the program works well, but
after rebooting the system, without any change to the program, it
doesn't work, so I think there should be some mechanism in OpenMPI that
can be configured to let the program work well.


Lin

From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]  
On Behalf Of Doug Reeder

Sent: July 7, 2009 10:49
To: Open MPI Users
Subject: Re: [OMPI users] Configuration problem or network problem?

Lin,

Try -np 16 and not running on the head node.

Doug Reeder
On Jul 6, 2009, at 7:08 PM, Zou, Lin (GE, Research, Consultant) wrote:


Hi all,
The system I use is a PS3 cluster, with 16 PS3s and a PowerPC
as a headnode; they are connected by a high-speed switch.
There are point-to-point communication functions (MPI_Send and
MPI_Recv); the data size is about 40 KB, and there is a lot of
computation which consumes a long time (about 1 sec) in each loop
iteration. The co-processor in the PS3 can take care of the computation
and the main processor takes care of the point-to-point communication,
so the computing and communication can overlap. The communication
functions should return much faster than the computing function.
My question is that after some cycles, the time consumed by the
communication functions in a PS3 increases heavily, and the whole
cluster's sync state breaks down. When I decrease the computing time,
this situation just disappears. I am very confused about this.
I think there is a mechanism in OpenMPI that causes this case; has
anyone run into this situation before?
I use "mpirun --mca btl tcp, self -np 17 --hostfile ...", is there
something I should add?

Lin
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Jeff Squyres
Cisco Systems




Re: [OMPI users] MPI and C++ (Boost)

2009-07-07 Thread Jeff Squyres

I think you face a common trade-off:

- use a well-established, debugged, abstraction-rich library
- write all of that stuff yourself

FWIW, I think the first one is a no-brainer.  There's a reason they  
wrote Boost.MPI: it's complex, difficult stuff, and is perfect as  
middleware for others to use.


If having users perform a 2nd step is undesirable (i.e., install Boost  
before installing your software), how about embedding Boost in your  
software?  Your configure/build process can certainly be tailored to  
include Boost[.MPI].  Hence, users will only perform 1 step, but it  
actually performs "2" steps under the covers (configures+installs  
Boost.MPI and then configures+installs your software, which uses Boost).


FWIW: Open MPI does exactly this.  Open MPI embeds at least 5 software  
packages: PLPA, VampirTrace, ROMIO, libltdl, and libevent.  But 99.9%  
of our users don't know/care because it's all hidden in our  
configure / make process.  If you watch carefully, you can see the  
output go by from each of those configure sections when running OMPI's  
configure.  But no one does.  ;-)


Sidenote: I would echo that the Forum is not considering including  
Boost.MPI at all.  Indeed, as mentioned in different threads, the  
Forum has already voted once to deprecate the MPI C++ bindings, partly  
*because* of Boost.  Boost.MPI has shown that the C++ community is  
better at making C++ APIs for MPI than the Forum is.  Hence, our role  
should be to make the base building blocks and let the language  
experts make their own preferred tools.





On Jul 7, 2009, at 5:03 AM, Matthieu Brucher wrote:

> IF boost is attached to MPI 3 (or whatever), AND it becomes part of the
> mainstream MPI implementations, THEN you can have the discussion again.


Hi,

At the moment, I think that Boost.MPI only supports MPI1.1, and even
then, some additional work may be done, at least regarding the complex
datatypes.

Matthieu
--
Information System Engineer, Ph.D.
Website: http://matthieu-brucher.developpez.com/
Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn: http://www.linkedin.com/in/matthieubrucher
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Segmentation fault - Address not mapped

2009-07-07 Thread Catalin David
Thank you very much for the help and assistance :)

Using -isystem /users/cluster/cdavid/local/include the program now
runs fine (loads the correct mpi.h).

Thank you again,

Catalin

On Tue, Jul 7, 2009 at 12:29 PM, Catalin
David wrote:
>  #include <stdio.h>
>  #include <mpi.h>
>  int main(int argc, char *argv[])
>  {
>   printf("%d %d %d\n", OMPI_MAJOR_VERSION,
> OMPI_MINOR_VERSION,OMPI_RELEASE_VERSION);
>   return 0;
>  }
>
> returns:
>
> test.cpp: In function ‘int main(int, char**)’:
> test.cpp:11: error: ‘OMPI_MAJOR_VERSION’ was not declared in this scope
> test.cpp:11: error: ‘OMPI_MINOR_VERSION’ was not declared in this scope
> test.cpp:11: error: ‘OMPI_RELEASE_VERSION’ was not declared in this scope
>
> So, I am definitely using another library (mpich).
>
> Thanks one more time!!! I will try to fix it and come back with results.
>
> Catalin
>
> On Tue, Jul 7, 2009 at 12:23 PM, Dorian Krause wrote:
>> Catalin David wrote:
>>>
>>> Hello, all!
>>>
>>> Just installed Valgrind (since this seems like a memory issue) and got
>>> this interesting output (when running the test program):
>>>
>>> ==4616== Syscall param sched_setaffinity(mask) points to unaddressable
>>> byte(s)
>>> ==4616==    at 0x43656BD: syscall (in /lib/tls/libc-2.3.2.so)
>>> ==4616==    by 0x4236A75: opal_paffinity_linux_plpa_init
>>> (plpa_runtime.c:37)
>>> ==4616==    by 0x423779B:
>>> opal_paffinity_linux_plpa_have_topology_information (plpa_map.c:501)
>>> ==4616==    by 0x4235FEE: linux_module_init (paffinity_linux_module.c:119)
>>> ==4616==    by 0x447F114: opal_paffinity_base_select
>>> (paffinity_base_select.c:64)
>>> ==4616==    by 0x444CD71: opal_init (opal_init.c:292)
>>> ==4616==    by 0x43CE7E6: orte_init (orte_init.c:76)
>>> ==4616==    by 0x4067A50: ompi_mpi_init (ompi_mpi_init.c:342)
>>> ==4616==    by 0x40A3444: PMPI_Init (pinit.c:80)
>>> ==4616==    by 0x804875C: main (test.cpp:17)
>>> ==4616==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
>>> ==4616==
>>> ==4616== Invalid read of size 4
>>> ==4616==    at 0x4095772: ompi_comm_invalid (communicator.h:261)
>>> ==4616==    by 0x409581E: PMPI_Comm_size (pcomm_size.c:46)
>>> ==4616==    by 0x8048770: main (test.cpp:18)
>>> ==4616==  Address 0x44a0 is not stack'd, malloc'd or (recently) free'd
>>> [denali:04616] *** Process received signal ***
>>> [denali:04616] Signal: Segmentation fault (11)
>>> [denali:04616] Signal code: Address not mapped (1)
>>> [denali:04616] Failing at address: 0x44a0
>>> [denali:04616] [ 0] /lib/tls/libc.so.6 [0x42b4de0]
>>> [denali:04616] [ 1]
>>> /users/cluster/cdavid/local/lib/libmpi.so.0(MPI_Comm_size+0x6f)
>>> [0x409581f]
>>> [denali:04616] [ 2] ./test(__gxx_personality_v0+0x12d) [0x8048771]
>>> [denali:04616] [ 3] /lib/tls/libc.so.6(__libc_start_main+0xf8) [0x42a2768]
>>> [denali:04616] [ 4] ./test(__gxx_personality_v0+0x3d) [0x8048681]
>>> [denali:04616] *** End of error message ***
>>> ==4616==
>>> ==4616== Invalid read of size 4
>>> ==4616==    at 0x4095782: ompi_comm_invalid (communicator.h:261)
>>> ==4616==    by 0x409581E: PMPI_Comm_size (pcomm_size.c:46)
>>> ==4616==    by 0x8048770: main (test.cpp:18)
>>> ==4616==  Address 0x44a0 is not stack'd, malloc'd or (recently) free'd
>>>
>>>
>>> The problem is that, now, I don't know where the issue comes from (is
>>> it libc that is too old and incompatible with g++ 4.4/OpenMPI? is libc
>>> broken?).
>>>
>>
>> Looking at the code for ompi_comm_invalid:
>>
>> static inline int ompi_comm_invalid(ompi_communicator_t* comm)
>> {
>>   if ((NULL == comm) || (MPI_COMM_NULL == comm) ||
>>       (OMPI_COMM_IS_FREED(comm)) || (OMPI_COMM_IS_INVALID(comm)) )
>>       return true;
>>   else
>>       return false;
>> }
>>
>>
>> the interesting point is that (MPI_COMM_NULL == comm) evaluates to false,
>> otherwise the following macros (where the invalid read occurs) would not be
>> evaluated.
>>
>> The only idea that comes to my mind is that you are mixing MPI versions, but
>> as you said your PATH is fine ?!
>>
>> Regards,
>> Dorian
>>
>>
>>
>>> Any help would be highly appreciated.
>>>
>>> Thanks,
>>> Catalin
>>>
>>>
>>> On Mon, Jul 6, 2009 at 3:36 PM, Catalin David
>>> wrote:
>>>

 On Mon, Jul 6, 2009 at 3:26 PM, jody wrote:

>
> Hi
> Are you also sure that you have the same version of Open-MPI
> on every machine of your cluster, and that it is the mpicxx of this
> version that is called when you run your program?
> I ask because you mentioned that there was an old version of Open-MPI
> present... die you remove this?
>
> Jody
>

 Hi

 I have just logged in a few other boxes and they all mount my home
 folder. When running `echo $LD_LIBRARY_PATH` and other commands, I get
 what I expect to get, but this might be because I have set these
 variables in the .bashrc file. So, I tried compiling/running like this
  ~/local/bin/mpicxx [stuff] and ~/local/bin/mpirun -np 4 ray-trace,
 but I get the same errors.

>>>

Re: [OMPI users] Segmentation fault - Address not mapped

2009-07-07 Thread Catalin David
 #include <stdio.h>
 #include <mpi.h>
 int main(int argc, char *argv[])
 {
   printf("%d %d %d\n", OMPI_MAJOR_VERSION,
OMPI_MINOR_VERSION,OMPI_RELEASE_VERSION);
   return 0;
 }

returns:

test.cpp: In function ‘int main(int, char**)’:
test.cpp:11: error: ‘OMPI_MAJOR_VERSION’ was not declared in this scope
test.cpp:11: error: ‘OMPI_MINOR_VERSION’ was not declared in this scope
test.cpp:11: error: ‘OMPI_RELEASE_VERSION’ was not declared in this scope

So, I am definitely using another library (mpich).

Thanks one more time!!! I will try to fix it and come back with results.

Catalin

On Tue, Jul 7, 2009 at 12:23 PM, Dorian Krause wrote:
> Catalin David wrote:
>>
>> Hello, all!
>>
>> Just installed Valgrind (since this seems like a memory issue) and got
>> this interesting output (when running the test program):
>>
>> ==4616== Syscall param sched_setaffinity(mask) points to unaddressable
>> byte(s)
>> ==4616==    at 0x43656BD: syscall (in /lib/tls/libc-2.3.2.so)
>> ==4616==    by 0x4236A75: opal_paffinity_linux_plpa_init
>> (plpa_runtime.c:37)
>> ==4616==    by 0x423779B:
>> opal_paffinity_linux_plpa_have_topology_information (plpa_map.c:501)
>> ==4616==    by 0x4235FEE: linux_module_init (paffinity_linux_module.c:119)
>> ==4616==    by 0x447F114: opal_paffinity_base_select
>> (paffinity_base_select.c:64)
>> ==4616==    by 0x444CD71: opal_init (opal_init.c:292)
>> ==4616==    by 0x43CE7E6: orte_init (orte_init.c:76)
>> ==4616==    by 0x4067A50: ompi_mpi_init (ompi_mpi_init.c:342)
>> ==4616==    by 0x40A3444: PMPI_Init (pinit.c:80)
>> ==4616==    by 0x804875C: main (test.cpp:17)
>> ==4616==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
>> ==4616==
>> ==4616== Invalid read of size 4
>> ==4616==    at 0x4095772: ompi_comm_invalid (communicator.h:261)
>> ==4616==    by 0x409581E: PMPI_Comm_size (pcomm_size.c:46)
>> ==4616==    by 0x8048770: main (test.cpp:18)
>> ==4616==  Address 0x44a0 is not stack'd, malloc'd or (recently) free'd
>> [denali:04616] *** Process received signal ***
>> [denali:04616] Signal: Segmentation fault (11)
>> [denali:04616] Signal code: Address not mapped (1)
>> [denali:04616] Failing at address: 0x44a0
>> [denali:04616] [ 0] /lib/tls/libc.so.6 [0x42b4de0]
>> [denali:04616] [ 1]
>> /users/cluster/cdavid/local/lib/libmpi.so.0(MPI_Comm_size+0x6f)
>> [0x409581f]
>> [denali:04616] [ 2] ./test(__gxx_personality_v0+0x12d) [0x8048771]
>> [denali:04616] [ 3] /lib/tls/libc.so.6(__libc_start_main+0xf8) [0x42a2768]
>> [denali:04616] [ 4] ./test(__gxx_personality_v0+0x3d) [0x8048681]
>> [denali:04616] *** End of error message ***
>> ==4616==
>> ==4616== Invalid read of size 4
>> ==4616==    at 0x4095782: ompi_comm_invalid (communicator.h:261)
>> ==4616==    by 0x409581E: PMPI_Comm_size (pcomm_size.c:46)
>> ==4616==    by 0x8048770: main (test.cpp:18)
>> ==4616==  Address 0x44a0 is not stack'd, malloc'd or (recently) free'd
>>
>>
>> The problem is that, now, I don't know where the issue comes from (is
>> it libc that is too old and incompatible with g++ 4.4/OpenMPI? is libc
>> broken?).
>>
>
> Looking at the code for ompi_comm_invalid:
>
> static inline int ompi_comm_invalid(ompi_communicator_t* comm)
> {
>   if ((NULL == comm) || (MPI_COMM_NULL == comm) ||
>       (OMPI_COMM_IS_FREED(comm)) || (OMPI_COMM_IS_INVALID(comm)) )
>       return true;
>   else
>       return false;
> }
>
>
> the interesting point is that (MPI_COMM_NULL == comm) evaluates to false,
> otherwise the following macros (where the invalid read occurs) would not be
> evaluated.
>
> The only idea that comes to my mind is that you are mixing MPI versions, but
> as you said your PATH is fine ?!
>
> Regards,
> Dorian
>
>
>
>> Any help would be highly appreciated.
>>
>> Thanks,
>> Catalin
>>
>>
>> On Mon, Jul 6, 2009 at 3:36 PM, Catalin David
>> wrote:
>>
>>>
>>> On Mon, Jul 6, 2009 at 3:26 PM, jody wrote:
>>>

 Hi
 Are you also sure that you have the same version of Open-MPI
 on every machine of your cluster, and that it is the mpicxx of this
 version that is called when you run your program?
 I ask because you mentioned that there was an old version of Open-MPI
 present... die you remove this?

 Jody

>>>
>>> Hi
>>>
>>> I have just logged in a few other boxes and they all mount my home
>>> folder. When running `echo $LD_LIBRARY_PATH` and other commands, I get
>>> what I expect to get, but this might be because I have set these
>>> variables in the .bashrc file. So, I tried compiling/running like this
>>>  ~/local/bin/mpicxx [stuff] and ~/local/bin/mpirun -np 4 ray-trace,
>>> but I get the same errors.
>>>
>>> As for the previous version, I don't have root access, therefore I was
>>> not able to remove it. I was just trying to outrun it by setting the
>>> $PATH variable to point first at my local installation.
>>>
>>>
>>> Catalin
>>>
>>>
>>> --
>>>
>>> **
>>> Catalin David
>>> B.Sc. Computer Science 2010
>>> Jacobs University Bremen
>>>
>>> Phone: +49-(0)1577-4

Re: [OMPI users] Segmentation fault - Address not mapped

2009-07-07 Thread Dorian Krause

Catalin David wrote:

Hello, all!

Just installed Valgrind (since this seems like a memory issue) and got
this interesting output (when running the test program):

==4616== Syscall param sched_setaffinity(mask) points to unaddressable byte(s)
==4616==at 0x43656BD: syscall (in /lib/tls/libc-2.3.2.so)
==4616==by 0x4236A75: opal_paffinity_linux_plpa_init (plpa_runtime.c:37)
==4616==by 0x423779B:
opal_paffinity_linux_plpa_have_topology_information (plpa_map.c:501)
==4616==by 0x4235FEE: linux_module_init (paffinity_linux_module.c:119)
==4616==by 0x447F114: opal_paffinity_base_select
(paffinity_base_select.c:64)
==4616==by 0x444CD71: opal_init (opal_init.c:292)
==4616==by 0x43CE7E6: orte_init (orte_init.c:76)
==4616==by 0x4067A50: ompi_mpi_init (ompi_mpi_init.c:342)
==4616==by 0x40A3444: PMPI_Init (pinit.c:80)
==4616==by 0x804875C: main (test.cpp:17)
==4616==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==4616==
==4616== Invalid read of size 4
==4616==at 0x4095772: ompi_comm_invalid (communicator.h:261)
==4616==by 0x409581E: PMPI_Comm_size (pcomm_size.c:46)
==4616==by 0x8048770: main (test.cpp:18)
==4616==  Address 0x44a0 is not stack'd, malloc'd or (recently) free'd
[denali:04616] *** Process received signal ***
[denali:04616] Signal: Segmentation fault (11)
[denali:04616] Signal code: Address not mapped (1)
[denali:04616] Failing at address: 0x44a0
[denali:04616] [ 0] /lib/tls/libc.so.6 [0x42b4de0]
[denali:04616] [ 1]
/users/cluster/cdavid/local/lib/libmpi.so.0(MPI_Comm_size+0x6f)
[0x409581f]
[denali:04616] [ 2] ./test(__gxx_personality_v0+0x12d) [0x8048771]
[denali:04616] [ 3] /lib/tls/libc.so.6(__libc_start_main+0xf8) [0x42a2768]
[denali:04616] [ 4] ./test(__gxx_personality_v0+0x3d) [0x8048681]
[denali:04616] *** End of error message ***
==4616==
==4616== Invalid read of size 4
==4616==at 0x4095782: ompi_comm_invalid (communicator.h:261)
==4616==by 0x409581E: PMPI_Comm_size (pcomm_size.c:46)
==4616==by 0x8048770: main (test.cpp:18)
==4616==  Address 0x44a0 is not stack'd, malloc'd or (recently) free'd


The problem is that, now, I don't know where the issue comes from (is
it libc that is too old and incompatible with g++ 4.4/OpenMPI? is libc
broken?).
  

Looking at the code for ompi_comm_invalid:

static inline int ompi_comm_invalid(ompi_communicator_t* comm)
{
    if ((NULL == comm) || (MPI_COMM_NULL == comm) ||
        (OMPI_COMM_IS_FREED(comm)) || (OMPI_COMM_IS_INVALID(comm)) )
        return true;
    else
        return false;
}


the interesting point is that (MPI_COMM_NULL == comm) evaluates to 
false, otherwise the following macros (where the invalid read occurs) 
would not be evaluated.


The only idea that comes to my mind is that you are mixing MPI versions, 
but as you said your PATH is fine ?!


Regards,
Dorian




Any help would be highly appreciated.

Thanks,
Catalin


On Mon, Jul 6, 2009 at 3:36 PM, Catalin David wrote:
  

On Mon, Jul 6, 2009 at 3:26 PM, jody wrote:


Hi
Are you also sure that you have the same version of Open-MPI
on every machine of your cluster, and that it is the mpicxx of this
version that is called when you run your program?
I ask because you mentioned that there was an old version of Open-MPI
present... die you remove this?

Jody
  

Hi

I have just logged in a few other boxes and they all mount my home
folder. When running `echo $LD_LIBRARY_PATH` and other commands, I get
what I expect to get, but this might be because I have set these
variables in the .bashrc file. So, I tried compiling/running like this
 ~/local/bin/mpicxx [stuff] and ~/local/bin/mpirun -np 4 ray-trace,
but I get the same errors.

As for the previous version, I don't have root access, therefore I was
not able to remove it. I was just trying to outrun it by setting the
$PATH variable to point first at my local installation.


Catalin


--

**
Catalin David
B.Sc. Computer Science 2010
Jacobs University Bremen

Phone: +49-(0)1577-49-38-667

College Ring 4, #343
Bremen, 28759
Germany
**






  




Re: [OMPI users] Segmentation fault - Address not mapped

2009-07-07 Thread Ashley Pittman

This is the error you get when an invalid communicator handle is passed
to an MPI function; the handle is dereferenced, so you may or may not get a
SEGV from it depending on the value you pass.

The 0x44a0 address is an offset from 0x4400, the value of
MPI_COMM_WORLD in mpich2, so my guess would be that you are either picking
up a mpich2 mpi.h or the mpich2 mpicc.

Ashley,

On Tue, 2009-07-07 at 11:05 +0100, Catalin David wrote:
> Hello, all!
> 
> Just installed Valgrind (since this seems like a memory issue) and got
> this interesting output (when running the test program):
> 
> ==4616== Syscall param sched_setaffinity(mask) points to unaddressable byte(s)
> ==4616==at 0x43656BD: syscall (in /lib/tls/libc-2.3.2.so)
> ==4616==by 0x4236A75: opal_paffinity_linux_plpa_init (plpa_runtime.c:37)
> ==4616==by 0x423779B:
> opal_paffinity_linux_plpa_have_topology_information (plpa_map.c:501)
> ==4616==by 0x4235FEE: linux_module_init (paffinity_linux_module.c:119)
> ==4616==by 0x447F114: opal_paffinity_base_select
> (paffinity_base_select.c:64)
> ==4616==by 0x444CD71: opal_init (opal_init.c:292)
> ==4616==by 0x43CE7E6: orte_init (orte_init.c:76)
> ==4616==by 0x4067A50: ompi_mpi_init (ompi_mpi_init.c:342)
> ==4616==by 0x40A3444: PMPI_Init (pinit.c:80)
> ==4616==by 0x804875C: main (test.cpp:17)
> ==4616==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
> ==4616==
> ==4616== Invalid read of size 4
> ==4616==at 0x4095772: ompi_comm_invalid (communicator.h:261)
> ==4616==by 0x409581E: PMPI_Comm_size (pcomm_size.c:46)
> ==4616==by 0x8048770: main (test.cpp:18)
> ==4616==  Address 0x44a0 is not stack'd, malloc'd or (recently) free'd
> [denali:04616] *** Process received signal ***
> [denali:04616] Signal: Segmentation fault (11)
> [denali:04616] Signal code: Address not mapped (1)
> [denali:04616] Failing at address: 0x44a0
> [denali:04616] [ 0] /lib/tls/libc.so.6 [0x42b4de0]
> [denali:04616] [ 1]
> /users/cluster/cdavid/local/lib/libmpi.so.0(MPI_Comm_size+0x6f)
> [0x409581f]
> [denali:04616] [ 2] ./test(__gxx_personality_v0+0x12d) [0x8048771]
> [denali:04616] [ 3] /lib/tls/libc.so.6(__libc_start_main+0xf8) [0x42a2768]
> [denali:04616] [ 4] ./test(__gxx_personality_v0+0x3d) [0x8048681]
> [denali:04616] *** End of error message ***
> ==4616==
> ==4616== Invalid read of size 4
> ==4616==at 0x4095782: ompi_comm_invalid (communicator.h:261)
> ==4616==by 0x409581E: PMPI_Comm_size (pcomm_size.c:46)
> ==4616==by 0x8048770: main (test.cpp:18)
> ==4616==  Address 0x44a0 is not stack'd, malloc'd or (recently) free'd
> 
> 
> The problem is that, now, I don't know where the issue comes from (is
> it libc that is too old and incompatible with g++ 4.4/OpenMPI? is libc
> broken?).
> 
> Any help would be highly appreciated.
> 
> Thanks,
> Catalin
> 
> 
> On Mon, Jul 6, 2009 at 3:36 PM, Catalin David 
> wrote:
> > On Mon, Jul 6, 2009 at 3:26 PM, jody wrote:
> >> Hi
> >> Are you also sure that you have the same version of Open-MPI
> >> on every machine of your cluster, and that it is the mpicxx of this
> >> version that is called when you run your program?
> >> I ask because you mentioned that there was an old version of Open-MPI
> >> present... die you remove this?
> >>
> >> Jody
> >
> > Hi
> >
> > I have just logged in a few other boxes and they all mount my home
> > folder. When running `echo $LD_LIBRARY_PATH` and other commands, I get
> > what I expect to get, but this might be because I have set these
> > variables in the .bashrc file. So, I tried compiling/running like this
> >  ~/local/bin/mpicxx [stuff] and ~/local/bin/mpirun -np 4 ray-trace,
> > but I get the same errors.
> >
> > As for the previous version, I don't have root access, therefore I was
> > not able to remove it. I was just trying to outrun it by setting the
> > $PATH variable to point first at my local installation.
> >
> >
> > Catalin
> >
> >
> > --
> >
> > **
> > Catalin David
> > B.Sc. Computer Science 2010
> > Jacobs University Bremen
> >
> > Phone: +49-(0)1577-49-38-667
> >
> > College Ring 4, #343
> > Bremen, 28759
> > Germany
> > **
> >
> 
> 
> 
-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] MPI and C++ - now Send and Receive of Classes and STL containers

2009-07-07 Thread Markus Blatt
Hi,

On Mon, Jul 06, 2009 at 03:24:07PM -0400, Luis Vitorio Cargnini wrote:
> Thanks, but I really do not want to use Boost.
> Is it easier? Certainly it is, but I want to make it using only MPI itself
> and not be dependent on a library, or templates like the majority of
> Boost, a huge set of templates and wrappers for different libraries,
> implemented in C, supplying a wrapper for C++.
> I admit Boost is a valuable tool, but in my case, the more independent I
> can be from additional libs, the better.
>

If you do not want to use boost, then I suggest not using nested
vectors but just ones that contain PODs as value_type (or even
C-arrays).
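
For example, something like this is all that is needed for a flat vector of
a POD struct (just a sketch; the names are made up, and MPI_BYTE keeps it
short, while a committed datatype built with MPI_Type_create_struct would be
the more portable route):

#include <mpi.h>
#include <vector>

// A POD value type: no pointers, contiguous, safe to ship as raw bytes.
struct Particle {
    double x, y, z;
    int id;
};

void exchange(MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    std::vector<Particle> buf(100);
    const int nbytes = static_cast<int>(buf.size() * sizeof(Particle));

    if (rank == 0) {
        // std::vector guarantees contiguous storage, so &buf[0] is a plain C array.
        MPI_Send(&buf[0], nbytes, MPI_BYTE, 1, 0, comm);
    } else if (rank == 1) {
        MPI_Recv(&buf[0], nbytes, MPI_BYTE, 0, 0, comm, MPI_STATUS_IGNORE);
    }
}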


If you insist on using complicated containers you will end up
writing your own MPI-C++ abstraction (resulting in a library). This
will be a lot of (unnecessary and hard) work.

Just my 2 cents.

Cheers,

Markus



Re: [OMPI users] Segmentation fault - Address not mapped

2009-07-07 Thread Catalin David
Hello, all!

Just installed Valgrind (since this seems like a memory issue) and got
this interesting output (when running the test program):

==4616== Syscall param sched_setaffinity(mask) points to unaddressable byte(s)
==4616==at 0x43656BD: syscall (in /lib/tls/libc-2.3.2.so)
==4616==by 0x4236A75: opal_paffinity_linux_plpa_init (plpa_runtime.c:37)
==4616==by 0x423779B:
opal_paffinity_linux_plpa_have_topology_information (plpa_map.c:501)
==4616==by 0x4235FEE: linux_module_init (paffinity_linux_module.c:119)
==4616==by 0x447F114: opal_paffinity_base_select
(paffinity_base_select.c:64)
==4616==by 0x444CD71: opal_init (opal_init.c:292)
==4616==by 0x43CE7E6: orte_init (orte_init.c:76)
==4616==by 0x4067A50: ompi_mpi_init (ompi_mpi_init.c:342)
==4616==by 0x40A3444: PMPI_Init (pinit.c:80)
==4616==by 0x804875C: main (test.cpp:17)
==4616==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==4616==
==4616== Invalid read of size 4
==4616==at 0x4095772: ompi_comm_invalid (communicator.h:261)
==4616==by 0x409581E: PMPI_Comm_size (pcomm_size.c:46)
==4616==by 0x8048770: main (test.cpp:18)
==4616==  Address 0x44a0 is not stack'd, malloc'd or (recently) free'd
[denali:04616] *** Process received signal ***
[denali:04616] Signal: Segmentation fault (11)
[denali:04616] Signal code: Address not mapped (1)
[denali:04616] Failing at address: 0x44a0
[denali:04616] [ 0] /lib/tls/libc.so.6 [0x42b4de0]
[denali:04616] [ 1]
/users/cluster/cdavid/local/lib/libmpi.so.0(MPI_Comm_size+0x6f)
[0x409581f]
[denali:04616] [ 2] ./test(__gxx_personality_v0+0x12d) [0x8048771]
[denali:04616] [ 3] /lib/tls/libc.so.6(__libc_start_main+0xf8) [0x42a2768]
[denali:04616] [ 4] ./test(__gxx_personality_v0+0x3d) [0x8048681]
[denali:04616] *** End of error message ***
==4616==
==4616== Invalid read of size 4
==4616==at 0x4095782: ompi_comm_invalid (communicator.h:261)
==4616==by 0x409581E: PMPI_Comm_size (pcomm_size.c:46)
==4616==by 0x8048770: main (test.cpp:18)
==4616==  Address 0x44a0 is not stack'd, malloc'd or (recently) free'd


The problem is that, now, I don't know where the issue comes from (is
it libc that is too old and incompatible with g++ 4.4/OpenMPI? is libc
broken?).

Any help would be highly appreciated.

Thanks,
Catalin


On Mon, Jul 6, 2009 at 3:36 PM, Catalin David wrote:
> On Mon, Jul 6, 2009 at 3:26 PM, jody wrote:
>> Hi
>> Are you also sure that you have the same version of Open-MPI
>> on every machine of your cluster, and that it is the mpicxx of this
>> version that is called when you run your program?
>> I ask because you mentioned that there was an old version of Open-MPI
>> present... die you remove this?
>>
>> Jody
>
> Hi
>
> I have just logged in a few other boxes and they all mount my home
> folder. When running `echo $LD_LIBRARY_PATH` and other commands, I get
> what I expect to get, but this might be because I have set these
> variables in the .bashrc file. So, I tried compiling/running like this
>  ~/local/bin/mpicxx [stuff] and ~/local/bin/mpirun -np 4 ray-trace,
> but I get the same errors.
>
> As for the previous version, I don't have root access, therefore I was
> not able to remove it. I was just trying to outrun it by setting the
> $PATH variable to point first at my local installation.
>
>
> Catalin
>
>
> --
>
> **
> Catalin David
> B.Sc. Computer Science 2010
> Jacobs University Bremen
>
> Phone: +49-(0)1577-49-38-667
>
> College Ring 4, #343
> Bremen, 28759
> Germany
> **
>



-- 

**
Catalin David
B.Sc. Computer Science 2010
Jacobs University Bremen

Phone: +49-(0)1577-49-38-667

College Ring 4, #343
Bremen, 28759
Germany
**



Re: [OMPI users] MPI and C++ (Boost)

2009-07-07 Thread Matthieu Brucher
> IF boost is attached to MPI 3 (or whatever), AND it becomes part of the
> mainstream MPI implementations, THEN you can have the discussion again.

Hi,

At the moment, I think that Boost.MPI only supports MPI 1.1, and even
then some additional work may be needed, at least regarding the more
complex datatypes.

Matthieu
-- 
Information System Engineer, Ph.D.
Website: http://matthieu-brucher.developpez.com/
Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn: http://www.linkedin.com/in/matthieubrucher


Re: [OMPI users] MPI and C++ (Boost)

2009-07-07 Thread Raymond Wan


Hi Luis,


Luis Vitorio Cargnini wrote:
Your suggestion is a great and interesting idea. My only fear is that I would get 
so used to Boost that I could not get rid of it anymore, because one thing is 
sure: the abstraction added by Boost is impressive, it turns [...]



I should add that I fully understand what you are saying; despite all 
the good things that were being said about Boost, I avoided it for a very 
long time because of the dependency issue.  For two reasons -- the dependency 
issue for myself (exactly like what you said), and the fact that distributing it 
means users have to do an extra step (regardless of how easy or hard that step 
is, it's an extra step).


I finally switched over :-) and the "prototype" idea was just a way to ease you 
into it.  MPI programs are hard to get right, and, Boost aside, it is a good idea 
to get something working in the easiest way first; you can then remove the parts 
you don't like later.


By the way, it seems that less-used parts of MPI do not have equivalents in 
Boost.MPI, so just using Boost won't solve all of your problems.  There is a 
list here (the table with the entries that say "unsupported"):


http://www.boost.org/doc/libs/1_39_0/doc/html/mpi/tutorial.html#mpi.c_mapping
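
For the parts that are supported, though, the gain is real.  Just as a small
sketch of mine (not code from this thread), a two-rank exchange of a
std::vector<double> with Boost.MPI looks like:

  // Boost.MPI version of a vector exchange: the length prefix and the
  // pack/unpack logic live in the serialization layer, not in your code.
  #include <boost/mpi.hpp>
  #include <boost/serialization/vector.hpp>   // makes std::vector serializable
  #include <iostream>
  #include <vector>

  namespace mpi = boost::mpi;

  int main(int argc, char *argv[])
  {
      mpi::environment env(argc, argv);
      mpi::communicator world;

      if (world.rank() == 0) {
          std::vector<double> data(100, 3.14);
          world.send(1, 0, data);               // size and contents travel together
      } else if (world.rank() == 1) {
          std::vector<double> data;
          world.recv(0, 0, data);
          std::cout << "received " << data.size() << " values" << std::endl;
      }
      return 0;
  }

(Link against boost_mpi and boost_serialization in addition to using the
usual mpicxx wrapper.)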

Good luck!

Ray




Re: [OMPI users] MPI and C++ (Boost)

2009-07-07 Thread John Phillips

Terry Frankcombe wrote:



I understand Luis' position completely.  He wants an MPI program, not a
program that's written in some other environment, no matter how
attractive that may be.  It's like the difference between writing a
numerical program in standard-conforming Fortran and writing it in the
latest flavour of the month interpreted language calling highly
optimised libraries behind the scenes.

IF boost is attached to MPI 3 (or whatever), AND it becomes part of the
mainstream MPI implementations, THEN you can have the discussion again.

Ciao
Terry




  I guess we view it differently. Boost.MPI isn't a language at all. It 
is a library written in fully ISO-compliant C++ that exists to make an 
otherwise complex and error-prone job simpler and more readable. As such, 
I would compare the choice to using a well-tested BLAS library for the 
matrix manipulations in your Fortran code versus writing them yourself. 
Both can be standard-conforming Fortran (though many BLAS implementations 
include lower-level optimized code), and neither is a flavor-of-the-month 
interpreted language. The advantage of the library is that it allows you 
to work at a level of abstraction that may be better suited to your work.


  For you, as for everyone else, make your choices based on what you 
believe best serves the needs of your program, whether that includes 
Boost.MPI or not. However, making those choices with an understanding of 
each option's strengths and weaknesses gives you the best chance of writing 
a good program.


John

PS - I am not part of the MPI Forum, but I would be surprised if they 
chose to add Boost to any MPI version. Possibly an analog of Boost.MPI, 
but not all of Boost. There are over 100 different libraries in Boost, 
covering many different areas of use, and most of them have no direct 
connection to MPI.


PPS - If anyone would like to know more about Boost, I would suggest the 
website (http://www.boost.org) or the user mailing list. Folks who don't 
write in C++ will probably not be very interested.




[OMPI users] Question on running the openmpi test modules

2009-07-07 Thread Prasadcse Perera
Hi,
I'm new to Open MPI and I currently have openmpi-1.3.3a1r21566 set up on my
Linux machines. I have run some of the available examples and also noticed there
are some test modules under /openmpi-1.3.3a1r21566/test. Are these tests meant to be
run as a batch, and if so, how? Or are they supposed to be run individually by
compiling and executing them separately?
I'm hoping to contribute to Open MPI as a developer, so I would also like to know
whether users can contribute by adding more example code.

Thanks,
Prasad.

-- 
http://www.codeproject.com/script/Articles/MemberArticles.aspx?amid=3489381


[OMPI users] bulding rpm

2009-07-07 Thread rahmani
Hi everyone,
I built an rpm file for openmpi-1.3.2 with the openmpi.spec and buildrpm.sh from
http://www.open-mpi.org/software/ompi/v1.3/srpm.php

I changed buildrpm.sh as follows:
prefix="/usr/local/openmpi/intel/1.3.2"
specfile="openmpi.spec"
#rpmbuild_options=${rpmbuild_options:-"--define 'mflags -j4'"}
# -j4 is an option to make, specifies the number of jobs (4) to run simultaneously.
rpmbuild_options="--define 'mflags -j4'"
#configure_options=${configure_options:-""}
configure_options="FC=ifort F77=ifort CC=icc CXX=icpc --with-sge --with-threads=posix --enable-mpi-threads"

# install ${prefix}/bin/mpivars.* scripts
rpmbuild_options=${rpmbuild_options}" --define 'install_in_opt 0' --define 'install_shell_scripts 1' --define 'install_modulefile 0'"
# prefix variable has to be passed to rpmbuild
rpmbuild_options=${rpmbuild_options}" --define '_prefix ${prefix}'"


# Note that this script can build one or all of the following RPMs:
# SRPM, all-in-one, multiple.

# If you want to build the SRPM, put "yes" here
build_srpm=${build_srpm:-"no"}
# If you want to build the "all in one RPM", put "yes" here
build_single=${build_single:-"yes"}
# If you want to build the "multiple" RPMs, put "yes" here
build_multiple=${build_multiple:-"no"}

It creates openmpi-1.3.2-1.x86_64.rpm with no error, but when I install it with 
rpm -ivh I see:
error: Failed dependencies:
libifcoremt.so.5()(64bit) is needed by openmpi-1.3.2-1.x86_64
libifport.so.5()(64bit) is needed by openmpi-1.3.2-1.x86_64
libimf.so()(64bit) is needed by openmpi-1.3.2-1.x86_64
libintlc.so.5()(64bit) is needed by openmpi-1.3.2-1.x86_64
libiomp5.so()(64bit) is needed by openmpi-1.3.2-1.x86_64
libsvml.so()(64bit) is needed by openmpi-1.3.2-1.x86_64
libtorque.so.2()(64bit) is needed by openmpi-1.3.2-1.x86_64
but all of the above libraries are present on my computer.

I used rpm -ivh --nodeps and it installed completely, but when I use mpif90 and 
mpirun I see:
  $ /usr/local/openmpi/intel/1.3.2/bin/mpif90
gfortran: no input files   (I compile with ifort)

  $ /usr/local/openmpi/intel/1.3.2/bin/mpirun
usr/local/openmpi/intel/1.3.2/bin/mpirun: symbol lookup error: 
/usr/local/openmpi/intel/1.3.2/bin/mpirun: undefined symbol: orted_cmd_line

What is wrong?
How can I build an rpm of Open MPI with the Intel compiler?
Thanks