[OMPI users] TCP btl misbehaves if btl_tcp_port_min_v4 is not set.

2009-07-23 Thread Eric Thibodeau
Hello all,

   (this _might_ be related to https://svn.open-mpi.org/trac/ompi/ticket/1505)

   I just compiled and installed 1.3.3 in a CentOS 5 environment and we noticed
the processes would deadlock as soon as they started using TCP communications.
The test program is one that has been running on other clusters for years with
no problems. Furthermore, using local cores doesn't deadlock the process,
whereas forcing inter-node communications (-bynode scheduling) immediately
causes the problem.

Symptoms:
- processes don't crash or die; they use 100% CPU in system space (as opposed
to user space)
- stracing one of the processes shows it is freewheeling in a polling loop.
- executing with --mca btl_base_verbose 30 shows weird port assignments; either
they are wrong or should be interpreted as an offset from the default
btl_tcp_port_min_v4 (1024).
- The error "mca_btl_tcp_endpoint_complete_connect] connect() to  failed: No
route to host (113)" _may_ be seen. We noticed it only showed up if we had vmnet
interfaces up and running on certain nodes. Note that setting

 oob_tcp_listen_mode=listen_thread
 oob_tcp_if_include=eth0
 btl_tcp_if_include=eth0

was one of our first reactions to this, to no avail.

Workaround we found:

While keeping the above mentioned MCA parameters, we added
btl_tcp_port_min_v4=2000 due to some firewall rules (which we had obviously
disabled as part of the troubleshooting process) and noticed everything seemed
to start working correctly from then on.

This seems to work, but I can find no logical explanation, as the code seems to
be clean in that respect.

Some pasting for people searching frantically for a solution:

[cluster-srv2:20379] btl: tcp: attempting to connect() to address 10.194.32.113 on port 2052
[cluster-srv2:20381] btl: tcp: attempting to connect() to address 10.194.32.113 on port 3076
[cluster-srv2:20377] btl: tcp: attempting to connect() to address 10.194.32.113 on port 260
[cluster-srv2:20383] btl: tcp: attempting to connect() to address 10.194.32.113 on port 3588
[cluster-srv1:19900] btl: tcp: attempting to connect() to address 10.194.32.117 on port 1540
[cluster-srv2:20377] btl: tcp: attempting to connect() to address 10.194.32.117 on port 2052
[cluster-srv2:20383] btl: tcp: attempting to connect() to address 10.194.32.117 on port 3076
[cluster-srv1:19894] btl: tcp: attempting to connect() to address 10.194.32.117 on port 516
[cluster-srv2:20379] btl: tcp: attempting to connect() to address 10.194.32.117 on port 3588
[cluster-srv1:19898] btl: tcp: attempting to connect() to address 10.194.32.117 on port 1028
[cluster-srv2:20381] btl: tcp: attempting to connect() to address 10.194.32.117 on port 2564
[cluster-srv1:19896] btl: tcp: attempting to connect() to address 10.194.32.117 on port 4
[cluster-srv3:13665] btl: tcp: attempting to connect() to address 10.194.32.115 on port 1028
[cluster-srv3:13663] btl: tcp: attempting to connect() to address 10.194.32.115 on port 4
[cluster-srv2][[44096,1],9][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
[cluster-srv2][[44096,1],13][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.194.32.117 failed: No route to host (113)
connect() to 10.194.32.117 failed: No route to host (113)
[cluster-srv3][[44096,1],20][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
connect() to 10.194.32.115 failed: No route to host (113)

Cheers!

Eric Thibodeau



Re: [OMPI users] Can 2 IB HCAs give twice the bandwidth?

2008-10-19 Thread Eric Thibodeau

Jeff Squyres wrote:

On Oct 18, 2008, at 9:19 PM, Mostyn Lewis wrote:


Can OpenMPI do like Scali and MVAPICH2 and utilize 2 IB HCAs per machine
to approach double the bandwidth on simple tests such as IMB PingPong?



Yes.  OMPI will automatically (and aggressively) use as many active 
ports as you have.  So you shouldn't need to list devices+ports -- 
OMPI will simply use all ports that it finds in the active state.  If 
your ports are on physically separate IB networks, then each IB 
network will require a different subnet ID so that OMPI can compute 
reachability properly.


Does this apply to all fabrics, or at which level is this implemented 
in OMPI? (ie: multiple GigE NICs... but I doubt it applies given the 
restricted intricacies of the IP implementation)
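(For the TCP case, the only knob I know of is handing the tcp BTL a list of
interfaces, along the lines of the sketch below -- the interface names and the
benchmark binary are placeholders; whether OMPI then stripes across them the
way it does across IB ports is exactly what I'm asking.)

mpirun -np 2 -bynode \
    --mca btl tcp,self \
    --mca btl_tcp_if_include eth0,eth1 \
    ./IMB-MPI1 PingPong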


Eric


[OMPI users] Tuned Collective MCA params

2008-10-03 Thread Eric Thibodeau

Hello all,

   I am currently profiling a simple case where I replace multiple S/R 
calls with Allgather calls, and it would _seem_ the simple S/R calls are 
faster. Now, *before* I come to any conclusion on this, one of the 
pieces I am missing is more detail on how/if/when the tuned coll MCA 
is selected. In other words, can I assume the tuned versions are used by 
default? I skimmed through the well-documented source code, but before I 
can even start to analyze the replacement's impact (in a small cluster), 
I need to know how and when the tuned coll MCA is used/selected.
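(For reference, one way to poke at this, assuming the usual MCA
component-selection syntax -- the test binary and process count are
placeholders:)

# list the tuned component's parameters, including its priority
ompi_info --param coll tuned | grep -i priority

# run once without and once with the tuned component in the coll list
mpirun -np 8 --mca coll self,basic       ./allgather_test
mpirun -np 8 --mca coll self,basic,tuned ./allgather_test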


Thanks,

Eric


Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2

2008-09-15 Thread Eric Thibodeau

Simply to keep track of what's going on:

I checked the build environment for openmpi and the system's settings; 
they were built using gcc 3.4.4 with -Os, which is reputed to be unstable and 
problematic with this compiler version. I've asked Prasanna to rebuild 
using -O2, but this could be a bit lengthy since the entire system (or at 
least all libs openmpi links to) needs to be rebuilt.


Eric

Eric Thibodeau wrote:

Prasanna,

Please send me your /etc/make.conf and the contents of 
/var/db/pkg/sys-cluster/openmpi-1.2.7/


You can package this with the following command line:

tar -cjf data.tbz /etc/make.conf /var/db/pkg/sys-cluster/openmpi-1.2.7/

And simply send me the data.tbz file.

Thanks,

Eric

Prasanna Ranganathan wrote:

Hi,

 I did make sure at the beginning that only eth0 was activated on all the
nodes. Nevertheless, I am currently verifying the NIC configuration on all
the nodes and making sure things are as expected.

While trying different things, I did come across this peculiar error which I
had detailed in one of my previous mails in this thread.

I am testing the helloWorld program in the following trivial case:

mpirun -np 1 -host localhost /main/mpiHelloWorld

Which works fine.

But,

mpirun -np 1 -host localhost --debug-daemons /main/mpiHelloWorld

always fails as follows:

Daemon [0,0,1] checking in as pid 2059 on host localhost
[idx1:02059] [0,0,1] orted: received launch callback
idx1 is node 0 of 1
ranks sum to 0
[idx1:02059] [0,0,1] orted_recv_pls: received message from [0,0,0]
[idx1:02059] [0,0,1] orted_recv_pls: received exit
[idx1:02059] *** Process received signal ***
[idx1:02059] Signal: Segmentation fault (11)
[idx1:02059] Signal code:  (128)
[idx1:02059] Failing at address: (nil)
[idx1:02059] [ 0] /lib/libpthread.so.0 [0x2afa8c597f30]
[idx1:02059] [ 1] /usr/lib64/libopen-rte.so.0(orte_pls_base_close+0x18)
[0x2afa8be8e2a2]
[idx1:02059] [ 2] /usr/lib64/libopen-rte.so.0(orte_system_finalize+0x70)
[0x2afa8be795ac]
[idx1:02059] [ 3] /usr/lib64/libopen-rte.so.0(orte_finalize+0x20)
[0x2afa8be7675c]
[idx1:02059] [ 4] orted(main+0x8a6) [0x4024ae]
[idx1:02059] *** End of error message ***

The failure happens with more verbose output when using the -d flag.

Does this point to some bug in OpenMPI or am I missing something here?

I have attached ompi_info output on this node.

Regards,

Prasanna.

  







Re: [OMPI users] MPI_sendrecv = MPI_Send+ MPI_RECV ?

2008-09-15 Thread Eric Thibodeau
Sorry about that, I had misinterpreted your original post as being the 
pair of send-receive. The example you give below does seem correct 
indeed, which means you might have to show us the code that doesn't 
work. Note that I am in no way a Fortran expert, I'm more versed in C. 
The only hint I'd give a C programmer in this case is "make sure your 
receiving structures are indeed large enough" (ie: you send 3d but 
eventually receive 4d... did you allocate for 3d or 4d for receiving the 
converted array?).
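To make those two points concrete, here is a minimal C sketch (C being my
comfort zone; this is illustrative only and not necessarily what is going wrong
in your code): the receive count is only an upper bound on what may arrive, and
MPI_Sendrecv posts both directions at once, so it cannot deadlock the way a
Send-then-Recv pair on both sides can once large messages fall back to a
rendezvous protocol.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    double out[3] = {1.0, 2.0, 3.0};   /* 3 elements go out                 */
    double in[4];                      /* room for up to 4 coming back      */
    int rank, size, peer;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {                   /* sketch assumes exactly 2 ranks    */
        MPI_Finalize();
        return 0;
    }
    peer = 1 - rank;

    /* Receiving 3 doubles into a buffer posted with count 4 is legal (the
     * count is a maximum); the reverse would be a truncation error.        */
    MPI_Sendrecv(out, 3, MPI_DOUBLE, peer, 0,
                 in,  4, MPI_DOUBLE, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d exchanged data with rank %d\n", rank, peer);
    MPI_Finalize();
    return 0;
}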


Eric

Enrico Barausse wrote:

sorry, I hadn't changed the subject. I'm reposting:

Hi

I think it's correct. What I want to do is to send a 3d array from
process 1 to process 0 (=root):
call MPI_Send(toroot,3,MPI_DOUBLE_PRECISION,root,n,MPI_COMM_WORLD

in some other part of the code process 0 acts on the 3d array and
turns it into a 4d one and sends it back to process 1, which receives
it with

call MPI_RECV(tonode,4,MPI_DOUBLE_PRECISION,root,n,MPI_COMM_WORLD,status,ierr)

in practice, what I do is basically given by this simple code (which
doesn't give the segmentation fault, unfortunately):



   a=(/1,2,3,4,5/)

   call MPI_INIT(ierr)
   call MPI_COMM_RANK(MPI_COMM_WORLD, id, ierr)
   call MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)

   if(numprocs/=2) stop

   if(id==0) then
      do k=1,5
         a=a+1
         call MPI_SEND(a,5,MPI_INTEGER,1,k,MPI_COMM_WORLD,ierr)
         call MPI_RECV(b,4,MPI_INTEGER,1,k,MPI_COMM_WORLD,status,ierr)
      end do
   else
      do k=1,5
         call MPI_RECV(a,5,MPI_INTEGER,0,k,MPI_COMM_WORLD,status,ierr)
         b=a(1:4)
         call MPI_SEND(b,4,MPI_INTEGER,0,k,MPI_COMM_WORLD,ierr)
      end do
   end if
  




Re: [OMPI users] MPI_sendrecv = MPI_Send+ MPI_RECV ?

2008-09-13 Thread Eric Thibodeau

Enrico Barausse wrote:

Hello,

I apologize in advance if my question is naive, but I started to use
open-mpi only one week ago.
I have a complicated fortran 90 code which is giving me a segmentation
fault (address not mapped). I tracked down the problem to the
following lines:

 call MPI_Send(toroot,3,MPI_DOUBLE_PRECISION,root,n,MPI_COMM_WORLD
 call MPI_RECV(tonode,4,MPI_DOUBLE_PRECISION,root,n,MPI_COMM_WORLD,status,ierr)
  
Well, for starters, your receive count doesn't match the send count. (4 
Vs 3). Is this a typo?

the MPI_send is executed by a process (say 1) which sends the array
toroot to another process (say 0). Process 0 successfully receives the
array toroot (I print out its components and they are correct), does
some calculations on it and sends back an array tonode to process 1.
Nevertheless, the MPI_Send routine above never returns controls to
process 1 (although the array toroot seems to have been transmitted
alright) and gives a segmentation fault (Signal code: Address not
mapped (1))

Now, if replace the two lines above with

call
MPI_sendrecv(toroot,3,MPI_DOUBLE_PRECISION,root,n,tonode,4,MPI_DOUBLE_PRECISION,root,n,MPI_COMM_WORLD,status,ierr)

I get no errors and the code works perfectly (I tested it vs the
serial version from which I started). But, and here is my question,
shouldn't MPI_sendrecv be equivalent to MPI_Send followed by MPI_RECV?

thank you in advance for helping with this

cheers

enrico
  




Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2

2008-09-12 Thread Eric Thibodeau

Prasanna,

   Please send me your /etc/make.conf and the contents of 
/var/db/pkg/sys-cluster/openmpi-1.2.7/


You can package this with the following command line:

tar -cjf data.tbz /etc/make.conf /var/db/pkg/sys-cluster/openmpi-1.2.7/

And simply send me the data.tbz file.

Thanks,

Eric

Prasanna Ranganathan wrote:

Hi,

 I did make sure at the beginning that only eth0 was activated on all the
nodes. Nevertheless, I am currently verifying the NIC configuration on all
the nodes and making sure things are as expected.

While trying different things, I did come across this peculiar error which I
had detailed in one of my previous mails in this thread.

I am testing the helloWorld program in the following trivial case:

mpirun -np 1 -host localhost /main/mpiHelloWorld

Which works fine.

But,

mpirun -np 1 -host localhost --debug-daemons /main/mpiHelloWorld

always fails as follows:

Daemon [0,0,1] checking in as pid 2059 on host localhost
[idx1:02059] [0,0,1] orted: received launch callback
idx1 is node 0 of 1
ranks sum to 0
[idx1:02059] [0,0,1] orted_recv_pls: received message from [0,0,0]
[idx1:02059] [0,0,1] orted_recv_pls: received exit
[idx1:02059] *** Process received signal ***
[idx1:02059] Signal: Segmentation fault (11)
[idx1:02059] Signal code:  (128)
[idx1:02059] Failing at address: (nil)
[idx1:02059] [ 0] /lib/libpthread.so.0 [0x2afa8c597f30]
[idx1:02059] [ 1] /usr/lib64/libopen-rte.so.0(orte_pls_base_close+0x18)
[0x2afa8be8e2a2]
[idx1:02059] [ 2] /usr/lib64/libopen-rte.so.0(orte_system_finalize+0x70)
[0x2afa8be795ac]
[idx1:02059] [ 3] /usr/lib64/libopen-rte.so.0(orte_finalize+0x20)
[0x2afa8be7675c]
[idx1:02059] [ 4] orted(main+0x8a6) [0x4024ae]
[idx1:02059] *** End of error message ***

The failure happens with more verbose output when using the -d flag.

Does this point to some bug in OpenMPI or am I missing something here?

I have attached ompi_info output on this node.

Regards,

Prasanna.

  







Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2

2008-09-11 Thread Eric Thibodeau

Prasanna,

   I opened up a bug report to enable better control over the 
threading options (http://bugs.gentoo.org/show_bug.cgi?id=237435). In 
the meanwhile, if your helloWorld isn't too fluffy, could you send it 
over (off list if you prefer) so I can take a look at it? The 
segmentation fault is probably hinting at another problem. Also, could 
you send the output of ompi_info now that you've recompiled openmpi with 
USE=-threads; I want to make sure the option went through as I hope it 
should. Simply attach the file named out.txt after running the following 
command:


ompi_info > out.txt

...RTF files tend to make my eyes cross over ;)

Thanks,

Eric

Prasanna Ranganathan wrote:

Hi,

I have tried the following to no avail.

On 499 machines running openMPI 1.2.7:

mpirun -np 499 -bynode -hostfile nodelist /main/mpiHelloWorld ...

With different combinations of the following parameters

-mca btl_base_verbose 1 -mca btl_base_debug 2 -mca oob_base_verbose 1 -mca
oob_tcp_debug 1 -mca oob_tcp_listen_mode listen_thread -mca
btl_tcp_endpoint_cache 65536 -mca oob_tcp_peer_retries 120

I still get the No route to Host error messages.

Also, I tried with -mca pls_rsh_num_concurrent 499 --debug-daemons and did
not get any additional useful debug output other than the error messages.

I did notice one strange thing though. The following is always successful
(at least in all my attempts):

mpirun -np 100 -bynode -hostfile nodelist /main/mpiHelloWorld

but

mpirun -np 100 -bynode -hostfile nodelist /main/mpiHelloWorld
--debug-daemons

prints these error messages at the end from each of the nodes :

[idx2:04064] [0,0,1] orted_recv_pls: received message from [0,0,0]
[idx2:04064] [0,0,1] orted_recv_pls: received exit
[idx2:04064] *** Process received signal ***
[idx2:04064] Signal: Segmentation fault (11)
[idx2:04064] Signal code:  (128)
[idx2:04064] Failing at address: (nil)
[idx2:04064] [ 0] /lib/libpthread.so.0 [0x2b92cc729f30]
[idx2:04064] [ 1] /usr/lib64/libopen-rte.so.0(orte_pls_base_close+0x18)
[0x2b92cc0202a2]
[idx2:04064] [ 2] /usr/lib64/libopen-rte.so.0(orte_system_finalize+0x70)
[0x2b92cc00b5ac]
[idx2:04064] [ 3] /usr/lib64/libopen-rte.so.0(orte_finalize+0x20)
[0x2b92cc00875c]
[idx2:04064] [ 4] /usr/bin/orted(main+0x8a6) [0x4024ae]
[idx2:04064] *** End of error message ***


I am not sure if this points to the actual cause of these issues. Is it to
do with openMPI 1.2.7 having posix enabled in the current configuration
on these nodes? 


Thanks again for your continued help.

Regards,

Prasanna.  

  

Message: 2
Date: Thu, 11 Sep 2008 12:16:50 -0400
From: Jeff Squyres 
Subject: Re: [OMPI users] Need help resolving No route to host error
with OpenMPI 1.1.2
To: Open MPI Users 
Message-ID: <7110e2d0-eb89-4293-a241-8487174b4...@cisco.com>
Content-Type: text/plain; charset=US-ASCII; format=flowed; delsp=yes

On Sep 10, 2008, at 9:29 PM, Prasanna Ranganathan wrote:



I have upgraded to 1.2.7 and am still noticing the issue.
  

FWIW, we didn't change anything with regards to OOB and TCP from 1.2.6
-> 1.2.7, but it's still good to be at the latest version.

Try running with this MCA parameter:

 mpirun --mca oob_tcp_listen_mode listen_thread ...

Sorry; I forgot that we did not enable that option by default in the
v1.2 series.



  




Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2

2008-09-11 Thread Eric Thibodeau

Jeff Squyres wrote:

On Sep 11, 2008, at 3:27 PM, Eric Thibodeau wrote:

Ok, added to the information from the README, I'm thinking none of 
the 3 configure options have an impact on the said 'threaded TCP 
listener' and the MCA option you suggested should still work, is this 
correct?


It should default to --with-threads=posix, which you'll need for the 
threaded listener (it just means that the system supports posix 
threads).  You can either specify that explicitly or trust configure 
to get it right (you can examine the output of configure to check that 
it got it right -- but I'm sure it did).
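A quick way to double-check what the build ended up with (assuming the usual
ompi_info output format) is:

ompi_info | grep -i thread
# expect something along the lines of "Thread support: posix (mpi: no, progress: no)"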


On that matter, since we're modifying the package to correct this, 
how would I go about enabling `oob_tcp_listen_mode listen_thread` by 
default at compile time?


You can't at compile time, sorry.  There are just too many MCA 
parameters for us to offer a configure parameter for each one of them.


But you can set the global config file to set this MCA param value by 
default:


http://www.open-mpi.org/faq/?category=tuning#setting-mca-params
Thanks, we're adding this as a default parameter to the openmpi package 
when the threads option is selected.
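For the archives, that boils down to a one-line entry in the system-wide
parameter file (the usual location is <prefix>/etc/openmpi-mca-params.conf, per
the FAQ above):

# openmpi-mca-params.conf
oob_tcp_listen_mode = listen_thread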


Eric



Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2

2008-09-11 Thread Eric Thibodeau

Jeff Squyres wrote:

On Sep 11, 2008, at 2:38 PM, Eric Thibodeau wrote:


In short:

Which of the 3 options is the one known to be unstable in the following:

--enable-mpi-threads        Enable threads for MPI applications (default:
                            disabled)
--enable-progress-threads
                            Enable threads asynchronous communication progress
                            (default: disabled)
--with-threads              Set thread type (solaris / posix)


You shouldn't need to specify any of these.


In long (rationale):

   Just to make sure we don't contradict each other: you're suggesting 
the use of 'listen_thread' but, at the same time, I'm telling Prasanna 
to _disable_ threads via the threads USE flag, which translates into the 
following logic (in the package):


Heh; yes, it's a bit confusing -- I apologize.
Don't, I forgot about the README which is more explicit about the 
options and the fact that --with-threads=x was directly linked to the 2 
other options; my bad.


The "threads" that I'm saying don't work is the MPI multi-threaded 
support (i.e., MPI_THREAD_MULTIPLE) and support for progress threads 
within MPI's progression engine.


What *does* work is a tiny threaded TCP listener for incoming 
connections.  Since the processing for each TCP connection takes a 
little time, we found that for scalability reasons, it was good to 
have a tiny thread that does nothing but block on TCP accept(), get 
the connection, and then hand it off to the main back-end thread for 
processing.  This allows our accept() rate to be quite high, even if 
the actual processing is slower.  *This* is the "listen_thread" mode, 
and turns out to be quite necessary for running at scale because our 
initial wireup coordination occurs over TCP -- there's a flood of 
incoming TCP connections back to the starter.  With the threaded TCP 
listener, the accept rate is high enough to not cause timeouts for the 
incoming TCP flood.
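For anyone skimming the archives, the pattern Jeff describes boils down to
roughly the following -- an illustrative sketch only, NOT Open MPI's actual
oob/tcp code: one thread does nothing but accept() and hand the socket off
(here through a pipe), while the slower processing happens elsewhere.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

static int handoff[2];              /* pipe used as a trivial hand-off queue */

static void *listener(void *arg)
{
    int lsock = *(int *)arg;
    for (;;) {
        int csock = accept(lsock, NULL, NULL);  /* block here, and only here */
        if (csock < 0)
            continue;
        /* hand the new descriptor to the back-end and return to accept()
         * immediately -- keeping the accept rate high is the whole point   */
        if (write(handoff[1], &csock, sizeof(csock)) != (ssize_t)sizeof(csock))
            close(csock);
    }
    return NULL;
}

int main(void)
{
    struct sockaddr_in addr;
    pthread_t tid;
    int csock;
    int lsock = socket(AF_INET, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(0);            /* any free port */
    if (lsock < 0 || bind(lsock, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lsock, 128) < 0 || pipe(handoff) < 0) {
        perror("setup");
        return 1;
    }

    pthread_create(&tid, NULL, listener, &lsock);

    /* the "slow" back-end: pull sockets off the queue and process them */
    while (read(handoff[0], &csock, sizeof(csock)) == (ssize_t)sizeof(csock)) {
        /* ... real wire-up/processing work would go here ... */
        close(csock);
    }
    return 0;
}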
Ok, added to the information from the README, I'm thinking none of the 3 
configure options have an impact on the said 'threaded TCP listener' and 
the MCA option you suggested should still work, is this correct?


Hope that made sense...

It did, I just want to make sure we're not disabling the listener thread.

On that matter, since we're modifying the package to correct this, how 
would I go about enabling `oob_tcp_listen_mode listen_thread` by default 
at compile time?


Many thanks,

Eric



Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2

2008-09-11 Thread Eric Thibodeau

Jeff,

In short:

Which of the 3 options is the one known to be unstable in the following:

 --enable-mpi-threads        Enable threads for MPI applications (default:
                             disabled)
 --enable-progress-threads
                             Enable threads asynchronous communication progress
                             (default: disabled)
 --with-threads              Set thread type (solaris / posix)

?

In long (rationale):

   Just to make sure we don't contradict each other: you're suggesting 
the use of 'listen_thread' but, at the same time, I'm telling Prasanna to 
_disable_ threads via the threads USE flag, which translates into the 
following logic (in the package):


   if use threads; then
      myconf="${myconf}
         --enable-mpi-threads
         --with-progress-threads
         --with-threads=posix"
   fi

The decision was made based on the configure --help information (most 
probably from the 1.1 series), which led to arbitrarily 
enabling/disabling all that has to do with threads using a single 
keyword. Now, based on:


https://svn.open-mpi.org/trac/ompi/wiki/ThreadSafetySupport

So, is it only --enable-mpi-threads that is unstable in the "*thread*" 
options?


Thanks,

Eric

Jeff Squyres wrote:

On Sep 10, 2008, at 9:29 PM, Prasanna Ranganathan wrote:


I have upgraded to 1.2.7 and am still noticing the issue.


FWIW, we didn't change anything with regards to OOB and TCP from 1.2.6 
-> 1.2.7, but it's still good to be at the latest version.


Try running with this MCA parameter:

mpirun --mca oob_tcp_listen_mode listen_thread ...

Sorry; I forgot that we did not enable that option by default in the 
v1.2 series.






Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2

2008-09-11 Thread Eric Thibodeau

Jeff Squyres wrote:
I'm not sure what USE=-threads means, but I would discourage the use 
of threads in the v1.2 series; our thread support is pretty much 
broken in the 1.2 series.
That's exactly what it means, hence the following BFW I had originally 
inserted in the package to this effect:


   ewarn
   ewarn "WARNING: use of threads is still disabled by default in"
   ewarn "upstream builds."
   ewarn "You may stop now and set USE=-threads"
   ewarn
   epause 5

...ok, so it's maybe not that B and F but it's still there to be noticed 
and logged ;)



On Sep 10, 2008, at 7:52 PM, Eric Thibodeau wrote:

Prasanna, also make sure you try with USE=-threads ...as the ebuild 
states, it's _experimental_  ;)


Keep your eye on: 
https://svn.open-mpi.org/trac/ompi/wiki/ThreadSafetySupport


Eric

Prasanna Ranganathan wrote:


Hi,

I have upgraded my openMPI to 1.2.6 (We have gentoo and emerge showed
1.2.6-r1 to be the latest stable version of openMPI).

I do still get the following error message when running my test 
helloWorld

program:

[10.12.77.21][0,1,95][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.16.13][0,1,408][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.77.15][0,1,89][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113
[10.12.77.22][0,1,96][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect] connect() failed with errno=113

Again, this error does not happen with every run of the test program 
and

occurs only certain times.

How do I take care of this?

Regards,

Prasanna.


On 9/9/08 9:00 AM, "users-requ...@open-mpi.org" 
<users-requ...@open-mpi.org>

wrote:



Message: 1
Date: Mon, 8 Sep 2008 16:43:33 -0400
From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] Need help resolving No route to host error
with OpenMPI 1.1.2
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <af302d68-0d30-469e-afd3-566ff9628...@cisco.com>
Content-Type: text/plain; charset=WINDOWS-1252; format=flowed;
delsp=yes

Are you able to upgrade to Open MPI v1.2.7?

There were *many* bug fixes and changes in the 1.2 series compared to
the 1.1 series, some, in particular, were dealing with TCP socket
timeouts (which are important when dealing with large numbers of MPI
processes).



On Sep 8, 2008, at 4:36 PM, Prasanna Ranganathan wrote:



Hi,

I am trying to run a test mpiHelloWorld program that simply
initializes the MPI environment on all the nodes, prints the
hostname and rank of each node in the MPI process group and exits.

I am using MPI 1.1.2 and am running 997 processes on 499 nodes
(Nodes have 2 dual core CPUs).

I get the following error messages when I run my program as follows:
mpirun -np 997 -bynode -hostfile nodelist /main/mpiHelloWorld
.
.
.
[0,1,380][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect] [0,1,142]
[btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
[0,1,140][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect] [0,1,390]
[btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
connect() failed with errno=113connect() failed with
errno=113connect() failed with errno=113[0,1,138][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113[0,1,384][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect] [0,1,144]
[btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
[0,1,388][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect] connect() failed with
errno=113[0,1,386][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect] connect() failed with
errno=113
[0,1,139][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect] connect() failed with
errno=113
connect() failed with errno=113
.
.

The main thing is that I get these error messages around 3-4 times
out of 10 attempts with the rest all completing successfully. I have
looked into the FAQs in detail and also checked the tcp btl settings
but am not able to figure it out.

All the 499 nodes have only eth0 active and I get the error even
when I run the following: mpirun -np 997 -bynode --hostfile nodelist
--mca btl_tcp_if_include eth0 /main/mpiHelloWorld

I have attached the output of ompi_info --all.

The following is the output of /sbin/ifconfig on the node where I
start the mpi process (it is one of the 499 nodes)

eth0  Link encap:Ethernet  HWaddr 00:03:25:44:8F:D6
  inet addr:10.12.1.11  Bcast:10.12.255.255  Mask:255.255.0.0
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:1978724556 errors:17 dropped:0 overruns:0 frame:
17
  TX packets:1767028063 errors:0 dropped:0 overruns:0
carrier:0
  collisions:0 txqueue

Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2

2008-09-10 Thread Eric Thibodeau

Prasanna Ranganathan wrote:

Hi Eric,

Thanks a lot for the reply.

I am currently working on upgrading to 1.2.7

I do not quite follow your directions; what do you refer to when you say
"try with USE=-threads..."
  
I am referring to the USE variable, which is used to set global and per-package 
build options. If you want to disable threads only for openmpi, edit 
/etc/portage/package.use and add the following line to it:


sys-cluster/openmpi -threads

And re-emerge openmpi, this will disable threads.
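In other words, something along these lines (--oneshot just keeps openmpi out
of your world file):

echo "sys-cluster/openmpi -threads" >> /etc/portage/package.use
emerge --oneshot sys-cluster/openmpi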

Kindly excuse if it is a silly question and pardon my ignorance :D
  
It is related to using Gentoo; if you're new to it, I suggest you give 
the documentation a shot:


http://www.gentoo.org/doc/en/index.xml?catid=gentoo

Regards,

Prasanna.
  


Eric


Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2

2008-09-10 Thread Eric Thibodeau
Prasanna, also make sure you try with USE=-threads ...as the ebuild 
states, it's _experimental_  ;)


Keep your eye on: 
https://svn.open-mpi.org/trac/ompi/wiki/ThreadSafetySupport


Eric

Prasanna Ranganathan wrote:

Hi,

I have upgraded my openMPI to 1.2.6 (We have gentoo and emerge showed
1.2.6-r1 to be the latest stable version of openMPI).

I do still get the following error message when running my test helloWorld
program:

[10.12.77.21][0,1,95][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_c
onnect] connect() failed with
errno=113[10.12.16.13][0,1,408][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_
complete_connect] connect() failed with errno=113
[10.12.77.15][0,1,89][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_c
onnect] connect() failed with errno=113
[10.12.77.22][0,1,96][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_c
onnect] connect() failed with errno=113

Again, this error does not happen with every run of the test program and
occurs only certain times.

How do I take care of this?

Regards,

Prasanna.


On 9/9/08 9:00 AM, "users-requ...@open-mpi.org" 
wrote:

  

Message: 1
Date: Mon, 8 Sep 2008 16:43:33 -0400
From: Jeff Squyres 
Subject: Re: [OMPI users] Need help resolving No route to host error
with OpenMPI 1.1.2
To: Open MPI Users 
Message-ID: 
Content-Type: text/plain; charset=WINDOWS-1252; format=flowed;
delsp=yes

Are you able to upgrade to Open MPI v1.2.7?

There were *many* bug fixes and changes in the 1.2 series compared to
the 1.1 series, some, in particular, were dealing with TCP socket
timeouts (which are important when dealing with large numbers of MPI
processes).



On Sep 8, 2008, at 4:36 PM, Prasanna Ranganathan wrote:



Hi,

I am trying to run a test mpiHelloWorld program that simply
initializes the MPI environment on all the nodes, prints the
hostname and rank of each node in the MPI process group and exits.

I am using MPI 1.1.2 and am running 997 processes on 499 nodes
(Nodes have 2 dual core CPUs).

I get the following error messages when I run my program as follows:
mpirun -np 997 -bynode -hostfile nodelist /main/mpiHelloWorld
.
.
.
[0,1,380][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect] [0,1,142]
[btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
[0,1,140][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect] [0,1,390]
[btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
connect() failed with errno=113connect() failed with
errno=113connect() failed with errno=113[0,1,138][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113[0,1,384][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect] [0,1,144]
[btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
[0,1,388][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect] connect() failed with
errno=113[0,1,386][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect] connect() failed with
errno=113
[0,1,139][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect] connect() failed with
errno=113
connect() failed with errno=113
.
.

The main thing is that I get these error messages around 3-4 times
out of 10 attempts with the rest all completing successfully. I have
looked into the FAQs in detail and also checked the tcp btl settings
but am not able to figure it out.

All the 499 nodes have only eth0 active and I get the error even
when I run the following: mpirun -np 997 -bynode --hostfile nodelist
--mca btl_tcp_if_include eth0 /main/mpiHelloWorld

I have attached the output of ompi_info --all.

The following is the output of /sbin/ifconfig on the node where I
start the mpi process (it is one of the 499 nodes)

eth0  Link encap:Ethernet  HWaddr 00:03:25:44:8F:D6
  inet addr:10.12.1.11  Bcast:10.12.255.255  Mask:255.255.0.0
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:1978724556 errors:17 dropped:0 overruns:0 frame:
17
  TX packets:1767028063 errors:0 dropped:0 overruns:0
carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:580938897359 (554026.5 Mb)  TX bytes:689318600552
(657385.4 Mb)
  Interrupt:22 Base address:0xc000

loLink encap:Local Loopback
  inet addr:127.0.0.1  Mask:255.0.0.0
  UP LOOPBACK RUNNING  MTU:16436  Metric:1
  RX packets:70560 errors:0 dropped:0 overruns:0 frame:0
  TX packets:70560 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:339687635 (323.9 Mb)  TX bytes:339687635 (323.9 Mb)


Kindly help.

Regards,

Prasanna.

  



Re: [OMPI users] Need help resolving No route to host error with OpenMPI 1.1.2

2008-09-10 Thread Eric Thibodeau

Prasanna Ranganathan wrote:

Hi,

I have upgraded my openMPI to 1.2.6 (We have gentoo and emerge showed
1.2.6-r1 to be the latest stable version of openMPI).
  

Prasanna, do a sync (1.2.7 is in portage) and report back.

Eric

I do still get the following error message when running my test helloWorld
program:

[10.12.77.21][0,1,95][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_c
onnect] connect() failed with
errno=113[10.12.16.13][0,1,408][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_
complete_connect] connect() failed with errno=113
[10.12.77.15][0,1,89][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_c
onnect] connect() failed with errno=113
[10.12.77.22][0,1,96][btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_c
onnect] connect() failed with errno=113

Again, this error does not happen with every run of the test program and
occurs only certain times.

How do I take care of this?

Regards,

Prasanna.


On 9/9/08 9:00 AM, "users-requ...@open-mpi.org" 
wrote:

  

Message: 1
Date: Mon, 8 Sep 2008 16:43:33 -0400
From: Jeff Squyres 
Subject: Re: [OMPI users] Need help resolving No route to host error
with OpenMPI 1.1.2
To: Open MPI Users 
Message-ID: 
Content-Type: text/plain; charset=WINDOWS-1252; format=flowed;
delsp=yes

Are you able to upgrade to Open MPI v1.2.7?

There were *many* bug fixes and changes in the 1.2 series compared to
the 1.1 series, some, in particular, were dealing with TCP socket
timeouts (which are important when dealing with large numbers of MPI
processes).



On Sep 8, 2008, at 4:36 PM, Prasanna Ranganathan wrote:



Hi,

I am trying to run a test mpiHelloWorld program that simply
initializes the MPI environment on all the nodes, prints the
hostname and rank of each node in the MPI process group and exits.

I am using MPI 1.1.2 and am running 997 processes on 499 nodes
(Nodes have 2 dual core CPUs).

I get the following error messages when I run my program as follows:
mpirun -np 997 -bynode -hostfile nodelist /main/mpiHelloWorld
.
.
.
[0,1,380][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect] [0,1,142]
[btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
[0,1,140][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect] [0,1,390]
[btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
connect() failed with errno=113connect() failed with
errno=113connect() failed with errno=113[0,1,138][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113[0,1,384][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect] [0,1,144]
[btl_tcp_endpoint.c:572:mca_btl_tcp_endpoint_complete_connect]
connect() failed with errno=113
[0,1,388][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect] connect() failed with
errno=113[0,1,386][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect] connect() failed with
errno=113
[0,1,139][btl_tcp_endpoint.c:
572:mca_btl_tcp_endpoint_complete_connect] connect() failed with
errno=113
connect() failed with errno=113
.
.

The main thing is that I get these error messages around 3-4 times
out of 10 attempts with the rest all completing successfully. I have
looked into the FAQs in detail and also checked the tcp btl settings
but am not able to figure it out.

All the 499 nodes have only eth0 active and I get the error even
when I run the following: mpirun -np 997 -bynode --hostfile nodelist
--mca btl_tcp_if_include eth0 /main/mpiHelloWorld

I have attached the output of ompi_info --all.

The following is the output of /sbin/ifconfig on the node where I
start the mpi process (it is one of the 499 nodes)

eth0  Link encap:Ethernet  HWaddr 00:03:25:44:8F:D6
  inet addr:10.12.1.11  Bcast:10.12.255.255  Mask:255.255.0.0
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:1978724556 errors:17 dropped:0 overruns:0 frame:
17
  TX packets:1767028063 errors:0 dropped:0 overruns:0
carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:580938897359 (554026.5 Mb)  TX bytes:689318600552
(657385.4 Mb)
  Interrupt:22 Base address:0xc000

loLink encap:Local Loopback
  inet addr:127.0.0.1  Mask:255.0.0.0
  UP LOOPBACK RUNNING  MTU:16436  Metric:1
  RX packets:70560 errors:0 dropped:0 overruns:0 frame:0
  TX packets:70560 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:339687635 (323.9 Mb)  TX bytes:339687635 (323.9 Mb)


Kindly help.

Regards,

Prasanna.

  




Re: [OMPI users] Configure fails with icc 10.1.008

2007-12-07 Thread Eric Thibodeau

Jeff,

   Thanks... at 23h30 coffee is far off... I saw the proper section of 
config.log showing exactly that (hello world not working). For 
everyone else's benefit, ICC (up to 10.1.008) is _not_ compatible with 
GCC 4.2... (guess I'll have to go back to the 4.1 series...)


Eric

Jeff Squyres wrote:
This is not an Open MPI problem; Open MPI is simply reporting that 
your C++ compiler is not working.  OMPI tests a trivial C++ program 
that uses the STL to ensure that your C++ program is working.  It's 
essentially:


#include <string>
int
main ()
{
  std::string foo = "Hello, world";
  return 0;
}

You should probably check with Intel support for more details.



On Dec 6, 2007, at 11:25 PM, Eric Thibodeau wrote:


Hello all,

   I am unable to get past ./configure as ICC fails on C++ tests (see 
attached ompi-output.tar.gz). Configure was called both without and with 
sourcing `/opt/intel/cc/10.1.xxx/bin/iccvars.sh`, as per one of the 
invocation options in icc's doc. I was unable to find the relevant 
(well.. intelligible for me, that is ;P ) cause of the failure 
in config.log. Any help would be appreciated.


Thanks,

Eric Thibodeau





[OMPI users] Configure fails with icc 10.1.008

2007-12-06 Thread Eric Thibodeau

Hello all,

   I am unable to get past ./configure as ICC fails on C++ tests (see 
attached ompi-output.tar.gz). Configure was called both without and with 
sourcing `/opt/intel/cc/10.1.xxx/bin/iccvars.sh`, as per one of the 
invocation options in icc's doc. I was unable to find the relevant 
(well.. intelligible for me, that is ;P ) cause of the failure in 
config.log. Any help would be appreciated.


Thanks,

Eric Thibodeau


ompi-output.tar.gz
Description: application/gzip


Re: [OMPI users] Performance of MPI_Isend() worse than MPI_Send() and even MPI_Ssend()

2007-10-15 Thread Eric Thibodeau
George,

	For completeness's sake, from what I understand here, the only way to 
get "true" communication and computation overlap is to have an "MPI broker" 
thread which would take care of all communications in the form of sync MPI 
calls. It is that thread which you call asynchronously and then let it manage 
the communications in the background... correct?
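While we're at it, the overlap being discussed usually gets written along the
lines of the sketch below -- a hedged illustration with made-up buffer sizes,
not code from this thread: post the non-blocking calls up front, then keep
dipping back into the library with MPI_Test* so it can make progress while the
"real" computation runs.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(int argc, char *argv[])
{
    int rank, size, peer, done = 0, i;
    double *sendbuf, *recvbuf, local = 0.0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) { MPI_Finalize(); return 0; }   /* sketch assumes 2 ranks */
    peer = 1 - rank;

    sendbuf = (double *)malloc(N * sizeof(double));
    recvbuf = (double *)malloc(N * sizeof(double));
    for (i = 0; i < N; i++) sendbuf[i] = rank;

    /* post both directions up front */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    while (!done) {
        /* stand-in for the useful computation being overlapped */
        for (i = 0; i < 1000; i++)
            local += i * 1e-9;
        /* each Testall re-enters the library so it can progress the
         * rendezvous described below */
        MPI_Testall(2, reqs, &done, MPI_STATUSES_IGNORE);
    }

    printf("rank %d: transfer complete (local=%g)\n", rank, local);
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}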

Eric

On October 15, 2007, George Bosilca wrote:
> Eric,
> 
> No there is no documentation about this on Open MPI. However, what I  
> described here, is not related to Open MPI, it's a general problem  
> with most/all MPI libraries. There are multiple scenarios where non  
> blocking communications can improve the overall performance of a  
> parallel application. But, in general, the reason is related to  
> overlapping communications with computations, or communications with  
> communications.
> 
> The problem is that using non blocking will increase the critical  
> path compared with blocking, which usually never helps at improving  
> performance. Now I'll explain the real reason behind that. The REAL  
> problem is that usually an MPI library cannot make progress while the  
> application is not in an MPI call. Therefore, as soon as the MPI  
> library returns after posting the non-blocking send, no progress is  
> possible on that send until the user goes back into the MPI library. If  
> you compare this with the case of a blocking send, there the library  
> does not return until the data is pushed onto the network buffers, i.e.  
> the library is the one in control until the send is completed.
> 
>Thanks,
>  george.
> 
> On Oct 15, 2007, at 2:23 PM, Eric Thibodeau wrote:
> 
> > Hello George,
> >
> > What you're saying here is very interesting. I am presently  
> > profiling communication patterns for Parallel Genetic Algorithms  
> > and could not figure out why the async versions tended to be worse  
> > than the sync counterpart (imho, that was counter-intuitive). What  
> > you're basically saying here is that the async communications  
> > actually add some synchronization overhead that can only be  
> > compensated if the application overlaps computation with the async  
> > communications? Is there some "official" reference/documentation to  
> > this behaviour from OpenMPI (I know the MPI standard doesn't define  
> > the actual implementation of the communications and therefore lets  
> > the implementer do as he pleases).
> >
> > Thanks,
> >
> > Eric
> >
> > On October 15, 2007, George Bosilca wrote:
> >> Your conclusion is not necessarily/always true. The MPI_Isend is just
> >> the non blocking version of the send operation. As one can imagine, an
> >> MPI_Isend + MPI_Wait increases the execution path [inside the MPI
> >> library] compared with any blocking point-to-point communication,
> >> leading to worse performance. The main interest of the MPI_Isend
> >> operation is the possible overlap of computation with communications,
> >> or the possible overlap between multiple communications.
> >>
> >> However, depending on the size of the message this might not be true.
> >> For large messages, in order to keep the memory usage on the receiver
> >> at a reasonable level, a rendezvous protocol is used. The sender
> >> [after sending a small packet] waits until the receiver confirms the
> >> message exchange (i.e. the corresponding receive operation has been
> >> posted) to send the large data. Using MPI_Isend can lead to longer
> >> execution times, as the real transfer will be delayed until the
> >> program enter in the next MPI call.
> >>
> >> In general, using non-blocking operations can improve the performance
> >> of the application, if and only if the application is carefully  
> >> crafted.
> >>
> >>george.
> >>
> >> On Oct 14, 2007, at 2:38 PM, Jeremias Spiegel wrote:
> >>
> >>> Hi,
> >>> I'm working with Open-MPI on an InfiniBand cluster and see a strange
> >>> effect when using MPI_Isend(). To my understanding this should always be
> >>> quicker than MPI_Send() and MPI_Ssend(), yet in my program both MPI_Send()
> >>> and MPI_Ssend() reproducibly perform quicker than MPI_Isend(). Is there
> >>> something obvious I'm missing?
> >>>
> >>> Regards,
> >>> Jeremias
> >>
> >>
> >
> >
> >
> > -- 
> > Eric Thibodeau
> > Neural Bucket Solutions Inc.
> > T. (514) 736-1436
> > C. (514) 710-0517
> 
> 



-- 
Eric Thibodeau
Neural Bucket Solutions Inc.
T. (514) 736-1436
C. (514) 710-0517



Re: [OMPI users] "Address not mapped" error on user defined MPI_OP function

2007-04-04 Thread Eric Thibodeau
hehe... don't we all love it when a problem "fixes" itself. I was missing a line 
in my Type creation to realign the elements correctly:

// Displacement is RELATIVE to its first structure element!
for(i=2; i >= 0; i--) Displacement[i] -= Displacement[0]; 

I'm attaching the functional code so that others can maybe see this one as an 
example ;)

On Wednesday, April 4, 2007 at 11:47, Eric Thibodeau wrote:
> Hello all,
> 
>   First off, please excuse the attached code as I may be naïve in my 
> attempts to implement my own MPI_OP.
> 
>   I am attempting to create my own MPI_OP to use with MPI_Allreduce. I 
> have been able to find very little examples off the net of creating MPI_OPs. 
> My present references are "MPI The complete reference Volume 1 2nd edition" 
> and some rather good slides I found at 
> http://www.mpi-hd.mpg.de/personalhomes/stiff/MPI/ . I am attaching my "proof 
> of concept" code which fails with:
> 
> [kyron:14074] *** Process received signal ***
> [kyron:14074] Signal: Segmentation fault (11)
> [kyron:14074] Signal code: Address not mapped (1)
> [kyron:14074] Failing at address: 0x801da600
> [kyron:14074] [ 0] [0x6ffa6440]
> [kyron:14074] [ 1] 
> /home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_recursivedoubling+0x700)
>  [0x6fbb0dd0]
> [kyron:14074] [ 2] 
> /home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0xb2)
>  [0x6fbae9a2]
> [kyron:14074] [ 3] 
> /home/kyron/openmpi_i686/lib/libmpi.so.0(PMPI_Allreduce+0x1a6) [0x6ff61e86]
> [kyron:14074] [ 4] AllReduceTest(main+0x180) [0x8048ee8]
> [kyron:14074] [ 5] /lib/libc.so.6(__libc_start_main+0xe3) [0x6fcbd823]
> [kyron:14074] *** End of error message ***
> 
> 
> Eric Thibodeau
> 

-- 
Eric Thibodeau
Neural Bucket Solutions Inc.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define V_LEN 10 //Vector Length
#define E_CNT 10 //Element count

MPI_Op   MPI_MySum; //Custom Sum function
MPI_Datatype MPI_MyType;//We need this MPI Datatype to make MPI aware of our custom structure

int i,j,true=1;
int totalnodes,mynode;

typedef struct CustomType_t {
   float feat[V_LEN];	//Some vector of float
   float distc;		//An independant float value 
   int   number;	//A counter of a different type
} CustomType;

CustomType *SharedStruct;

void construct_MyType(void){
	int i;
	CustomType p;
	int BlockLengths[3] = {V_LEN,1,1};
	MPI_Aint Displacement[3];
	MPI_Datatype types[3] = {MPI_FLOAT, MPI_FLOAT, MPI_INT};

	/* Compute relative displacements w/r to the Type's begining address 
	 * using portable technique
	 * */
	MPI_Get_address(&p.feat[0], &Displacement[0]);
	MPI_Get_address(&p.distc,   &Displacement[1]);
	MPI_Get_address(&p.number,  &Displacement[2]);

	// Displacement is RELATIVE to its first structure element!
	for(i=2; i >= 0; i--) Displacement[i] -= Displacement[0]; 

	// It is good practice to include this in case 
	// the compiler pads your data structure
/*	BlockLengths[3] = 1; types[3] = MPI_UB;
	Displacement[3] = sizeof(CustomType); */

	MPI_Type_create_struct(3, BlockLengths, Displacement, types, &MPI_MyType);
	MPI_Type_commit(&MPI_MyType); // important!!
	return;
}

void MySum(CustomType *cin, CustomType *cinout, int *len, MPI_Datatype *dptr)
{
	int i,j;
	// Some sanity check
	printf("\nIn MySum, Node %d with len=%d\n",mynode,*len);

	if(*dptr != MPI_MyType)
	{
	   printf("Invalid datatype\n");
	   MPI_Abort(MPI_COMM_WORLD, 3);
	}

	for(i=0; i < *len; i++)
	{
		cinout[i].distc +=cin[i].distc;
		cinout[i].number+=cin[i].number;
		for(j=0; j<V_LEN; j++)
			cinout[i].feat[j]+=cin[i].feat[j];
	}
}

void PrintStruct(void)
{
	//We print the result from all nodes:
	printf("Node %d has the following in SharedStruct:\n",mynode);
	for(i=0; i<E_CNT; i++)
	{
		printf("D:%2.1f #:%d Vect:",SharedStruct[i].distc,SharedStruct[i].number);
		for(j=0; j<V_LEN; j++)
			printf("%f,",SharedStruct[i].feat[j]);
		printf("\n");
	}
	printf("= Node %d =\n",mynode);
}

int main(int argc, char *argv[])
{	
	MPI_Init(&argc, &argv);
	MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
	MPI_Comm_rank(MPI_COMM_WORLD, &mynode);

	// Create the MPI_MyType Type
	construct_MyType();
	// Create the MPI_MySum Operator
	MPI_Op_create((MPI_User_function*)MySum, true, &MPI_MySum);

	SharedStruct= (CustomType *)malloc(E_CNT * sizeof(CustomType)); //The dist and number part of the structure never get used at the moment...

	SharedStruct[0].distc=mynode+1.0;
	SharedStruct[0].number=mynode;
	for(i=0; i<V_LEN; i++) SharedStruct[0].feat[i]=mynode+i;

	// To speed up the process we replicate the process using memcpy:
	for(i=1; i<E_CNT; i++)
		memcpy((void*)&SharedStruct[i],(void*)SharedStruct,sizeof(CustomType));

	//Print Before:
	PrintStruct();
	// We add the content of all nodes _on_ all nodes: 
	MPI_Allreduce(MPI_IN_PLACE, SharedStruct, E_CNT, MPI_MyType, MPI_MySum, MPI_COMM_WORLD);
	//Print After:
	PrintStruct();
	return 0;
}


Re: [OMPI users] "Address not mapped" error on user defined MPI_OP function

2007-04-04 Thread Eric Thibodeau
I completely forgot to mention which version of OpenMPI I am using, I'll gladly 
post additional info if required :

kyron@kyron ~/openmpi-1.2 $ ompi_info |head
Open MPI: 1.2
   Open MPI SVN revision: r14027
Open RTE: 1.2
   Open RTE SVN revision: r14027
OPAL: 1.2
   OPAL SVN revision: r14027
  Prefix: /home/kyron/openmpi_i686
 Configured architecture: i686-pc-linux-gnu
   Configured by: kyron
   Configured on: Wed Apr  4 10:21:34 EDT 2007

On Wednesday, April 4, 2007 at 11:47, Eric Thibodeau wrote:
> Hello all,
> 
>   First off, please excuse the attached code as I may be naïve in my 
> attempts to implement my own MPI_OP.
> 
>   I am attempting to create my own MPI_OP to use with MPI_Allreduce. I 
> have been able to find very little examples off the net of creating MPI_OPs. 
> My present references are "MPI The complete reference Volume 1 2nd edition" 
> and some rather good slides I found at 
> http://www.mpi-hd.mpg.de/personalhomes/stiff/MPI/ . I am attaching my "proof 
> of concept" code which fails with:
> 
> [kyron:14074] *** Process received signal ***
> [kyron:14074] Signal: Segmentation fault (11)
> [kyron:14074] Signal code: Address not mapped (1)
> [kyron:14074] Failing at address: 0x801da600
> [kyron:14074] [ 0] [0x6ffa6440]
> [kyron:14074] [ 1] 
> /home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_recursivedoubling+0x700)
>  [0x6fbb0dd0]
> [kyron:14074] [ 2] 
> /home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0xb2)
>  [0x6fbae9a2]
> [kyron:14074] [ 3] 
> /home/kyron/openmpi_i686/lib/libmpi.so.0(PMPI_Allreduce+0x1a6) [0x6ff61e86]
> [kyron:14074] [ 4] AllReduceTest(main+0x180) [0x8048ee8]
> [kyron:14074] [ 5] /lib/libc.so.6(__libc_start_main+0xe3) [0x6fcbd823]
> [kyron:14074] *** End of error message ***
> 
> 
> Eric Thibodeau


[OMPI users] "Address not mapped" error on user defined MPI_OP function

2007-04-04 Thread Eric Thibodeau
Hello all,

First off, please excuse the attached code as I may be naïve in my 
attempts to implement my own MPI_OP.

I am attempting to create my own MPI_OP to use with MPI_Allreduce. I 
have been able to find very few examples off the net of creating MPI_OPs. My 
present references are "MPI The complete reference Volume 1 2nd edition" and 
some rather good slides I found at 
http://www.mpi-hd.mpg.de/personalhomes/stiff/MPI/ . I am attaching my "proof of 
concept" code which fails with:

[kyron:14074] *** Process received signal ***
[kyron:14074] Signal: Segmentation fault (11)
[kyron:14074] Signal code: Address not mapped (1)
[kyron:14074] Failing at address: 0x801da600
[kyron:14074] [ 0] [0x6ffa6440]
[kyron:14074] [ 1] 
/home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_recursivedoubling+0x700)
 [0x6fbb0dd0]
[kyron:14074] [ 2] 
/home/kyron/openmpi_i686/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_allreduce_intra_dec_fixed+0xb2)
 [0x6fbae9a2]
[kyron:14074] [ 3] 
/home/kyron/openmpi_i686/lib/libmpi.so.0(PMPI_Allreduce+0x1a6) [0x6ff61e86]
[kyron:14074] [ 4] AllReduceTest(main+0x180) [0x8048ee8]
[kyron:14074] [ 5] /lib/libc.so.6(__libc_start_main+0xe3) [0x6fcbd823]
[kyron:14074] *** End of error message ***


Eric Thibodeau
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define V_LEN 10 //Vector Length
#define E_CNT 10 //Element count

MPI_Op   MPI_MySum; //Custom Sum function
MPI_Datatype MPI_MyType;//We need this MPI Datatype to make MPI aware of our custom structure

int i,j,true=1;
int totalnodes,mynode;

typedef struct CustomType_t {
   float feat[V_LEN];	//Some vector of float
   float distc;		//An independant float value 
   int   number;	//A counter of a different type
} CustomType;

CustomType *SharedStruct;

void construct_MyType(void){
	CustomType p;
	int BlockLengths[3] = {V_LEN,1,1};
	MPI_Aint Displacement[3];
	MPI_Datatype types[3] = {MPI_FLOAT, MPI_FLOAT, MPI_INT};

	/* Compute relative displacements w/r to the Type's begining address 
	 * using portable technique
	 * */
	MPI_Get_address(&p.feat[0], &Displacement[0]);
	MPI_Get_address(&p.distc,   &Displacement[1]);
	MPI_Get_address(&p.number,  &Displacement[2]);

	// It is good practice to include this in case 
	// the compiler pads your data structure
/*	BlockLengths[3] = 1; types[3] = MPI_UB;
	Displacement[3] = sizeof(CustomType); */

	MPI_Type_create_struct(3, BlockLengths, Displacement, types, &MPI_MyType);
	MPI_Type_commit(&MPI_MyType); // important!!
	return;
}

void MySum(CustomType *cin, CustomType *cinout, int *len, MPI_Datatype *dptr)
{
	int i,j;
	// Some sanity check
	printf("\nIn MySum, Node %d with len=\n",mynode,*len);

	if(*dptr != MPI_MyType)
	{
	   printf("Invalid datatype\n");
	   MPI_Abort(MPI_COMM_WORLD, 3);
	}

	for(i=0; i < *len; i++)
	{
		cinout[i].distc +=cin[i].distc;
		cinout[i].number+=cin[i].number;
		for(j=0; j<V_LEN; j++)
			cinout[i].feat[j]+=cin[i].feat[j];
	}
}

void PrintStruct(void)
{
	//We print the result from all nodes:
	printf("Node %d has the following in SharedStruct:\n",mynode);
	for(i=0; i<E_CNT; i++)
	{
		printf("D:%2.1f #:%d Vect:",SharedStruct[i].distc,SharedStruct[i].number);
		for(j=0; j<V_LEN; j++)
			printf("%f,",SharedStruct[i].feat[j]);
		printf("\n");
	}
	printf("= Node %d =\n",mynode);
}

main(int argc, char *argv[])
{	
	MPI_Init(&argc, &argv);
	MPI_Comm_size(MPI_COMM_WORLD, &totalnodes);
	MPI_Comm_rank(MPI_COMM_WORLD, &mynode);

	// Create the MPI_MyType Type
	construct_MyType();
	// Create the MPI_MySum Operator
	MPI_Op_create((MPI_User_function*)MySum, true, &MPI_MySum);

	SharedStruct= (CustomType *)malloc(E_CNT * sizeof(CustomType)); //The dist and number part of the structure never get used at the moment...

	SharedStruct[0].distc=mynode+1.0;
	SharedStruct[0].number=mynode;
	for(i=0; i<V_LEN; i++) SharedStruct[0].feat[i]=mynode+i;

	// To speed up the process we replicate the process using memcpy:
	for(i=1; i<E_CNT; i++)
		memcpy((void*)&SharedStruct[i],(void*)SharedStruct,sizeof(CustomType));

	//Print Before:
	PrintStruct();
	// We add the content of all nodes _on_ all nodes: 
	MPI_Allreduce(MPI_IN_PLACE, SharedStruct, E_CNT, MPI_MyType, MPI_MySum, MPI_COMM_WORLD);
	//Print After:
	PrintStruct();
}


Re: [OMPI users] Compiling HPCC with OpenMPI

2007-02-27 Thread Eric Thibodeau
Hi Jeff,

	I had noticed that the library names switched, but thanks for pointing it 
out still ;) As for the compilation route, I chose to use mpicc as the preferred 
approach and indeed let the wrapper do the work.

FWIW, I got HPCC running, now to find a nice way to sort through all 
the 
data ;)
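For the archives, the relevant chunk of the HPCC Make.<arch> then looks roughly
like this (the first variant is what I used; the second, untested here, pulls
the raw flags out of the wrapper via its --showme options):

# variant 1: let the wrapper do everything
MPdir  =
MPinc  =
MPlib  =
CC     = mpicc
LINKER = mpicc

# variant 2 (untested): extract the raw flags from the wrapper
MPinc  = $(shell mpicc --showme:compile)
MPlib  = $(shell mpicc --showme:link)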

Eric

On Monday, February 26, 2007 at 06:53, Jeff Squyres wrote:
> Note that George listed the v1.2 OMPI libraries (-lopen-rte and 
> -lopen-pal) -- the v1.1.x names are slightly different (-lorte and 
> -lopal).  We had to change the back-end library names between v1.1 and  
> v1.2 because someone else out in the Linux community uses "libopal".
> 
> I typically prefer using "mpicc" as CC and LINKER and therefore  
> letting the OMPI wrapper handle everything for exactly this reason.
> 
> 
> On Feb 21, 2007, at 12:39 PM, Eric Thibodeau wrote:
> 
> > Hi George,
> >
> > Would you say this is preferred to changing the default CC + LINKER?
> > Eric
> > On Wednesday, February 21, 2007 at 12:04, George Bosilca wrote:
> >> You should use something like this
> >> MPdir = /usr/local/mpi
> >> MPinc = -I$(MPdir)/include
> >> MPlib = -L$(MPdir)/lib -lmpi -lopen-rte -lopen-pal
> >>
> >>george.
> >
> 
> 

-- 
Eric Thibodeau
Neural Bucket Solutions Inc.
T. (514) 736-1436
C. (514) 710-0517



Re: [OMPI users] Compiling HPCC with OpenMPI

2007-02-21 Thread Eric Thibodeau
Thanks Laurent, I will try your proposed settings.

Note that I didn't want to use CC= and LINKER= since I don't know the 
probable impacts on the rest of the benchmarks...hmm...though this IS a 
clustering benchmark. Also note that I wasn't trying to compile for MPICH; I 
merely copied the lines from a "clean" config as a reference ;)

Eric

On Wednesday 21 February 2007 11:48, Laurent Nguyen wrote:
> Hello,
> 
> I believe that you are trying to use mpich, not openmpi (libmpich.a).
> Personally, I've compiled HPCC on IBM AIX with OpenMPI with these lines:
> 
>   # --
>   # - Message Passing library (MPI) --
>   # --
>   # MPinc tells the  C  compiler where to find the Message Passing library
>   # header files,  MPlib  is defined  to be the name of  the library to be
>   # used. The variable MPdir is only used for defining MPinc and MPlib.
>   #
>   MPdir=
>   MPinc=
>   MPlib=
> ...
> CC = mpicc
> 
> LINKER = mpicc
> 
> But, in my environment variable $PATH, I have the directory where the OpenMPI 
> executables are: //openmpi/bin
> 
> I hope I could help you...
> 
> Regards
> 
> 
> **
> NGUYEN Anh-Khai Laurent - Ingénieur de Recherche
> Equipe Support Utilisateur
> 
> Email:laurent.ngu...@idris.fr
> Tél  :01.69.35.85.66
> Adresse  :IDRIS - Institut du Développement et des Ressources en
>Informatique Scientifique
>CNRS
>Batiment 506
>BP 167
>F - 91403 ORSAY Cedex
> Site Web :http://www.idris.fr
> **
> 
> Eric Thibodeau wrote:
> > Hello all,
> > 
> > As we all know, compiling OpenMPI is not a matter of adding -lmpi 
> > (http://www.open-mpi.org/faq/?category=mpi-apps). I have tried many 
> > different approaches on configuring the 3 crucial MPI lines in the HPCC 
> > Makefiles with no success. There seems to be no correct way to get mpicc 
> > --showme:* to return the correct info and forcing the correct paths/info 
> > seems to be incorrect (ie, what OpenMPI lib do I point to here:  MPlib = 
> > $(MPdir)/lib/libmpich.a)
> > 
> > Any help would be greatly appreciated!
> > 
> > Exerp from the Makefile:
> > 
> > # --
> > # - Message Passing library (MPI) --
> > # --
> > # MPinc tells the  C  compiler where to find the Message Passing library
> > # header files,  MPlib  is defined  to be the name of  the library to be
> > # used. The variable MPdir is only used for defining MPinc and MPlib.
> > #
> > MPdir= /usr/local/mpi
> > MPinc= -I$(MPdir)/include
> > MPlib= $(MPdir)/lib/libmpich.a
> > 
> > 
> 

-- 
Eric Thibodeau
Neural Bucket Solutions Inc.
T. (514) 736-1436
C. (514) 710-0517



Re: [OMPI users] compiling mpptest using OpenMPI

2007-02-19 Thread Eric Thibodeau
Hi Jeff,

I just tried with 1.2b4r13690 and the problem is still present. The only 
notable difference is that CTRL-C gave me "orterun: killing job..." but it stayed stuck 
there until I hit CTRL-\ ...if it has any bearing on the issue. Again, the 
command line was:

orterun -np 11 ./perftest-1.3c/mpptest -max_run_time 1800 -bisect -size 0 4096 
1 -gnuplot -fname HyperTransport/Global_bisect_0_4096_1.gpl

(only difference is that I had 11 procs instead of 9 available)

On Friday 16 February 2007 06:50, Jeff Squyres wrote:
> Could you try one of the later nightly 1.2 tarballs?  We just fixed a  
> shared memory race condition, for example:
> 
>   http://www.open-mpi.org/nightly/v1.2/
> 
> 
> On Feb 16, 2007, at 12:12 AM, Eric Thibodeau wrote:
> 
> > Hello devs,
> >
> > Thought I would let you know there seems to be a problem with  
> > 1.2b3r13112 when running the "bisection" test on a Tyan VX50  
> > machine (the 8 DualCore model with 32Gigs of RAM).
> >
> > OpenMPI was compiled with (as seen from config.log):
> > configure:116866: running /bin/sh './configure'  CFLAGS="-O3 - 
> > DNDEBUG -finline-functions -fno-strict-aliasing -pthread"  
> > CPPFLAGS=" " FFLAGS="" LDFLAGS=" " --enable-shared --disable- 
> > static  --prefix=/export/livia/home/parallel/eric/openmpi_x86_64 -- 
> > with-mpi=open_mpi --cache-file=/dev/null --srcdir=.
> >
> > MPPTEST (1.3c) was compiled with:
> > ./configure --with-mpi=$HOME/openmpi_`uname -m`
> >
> > ...which, for some reason, works fine on that system that doesn't  
> > have any other MPI implementation (ie: doesn't have LAM-MPI as per  
> > this thread).
> >
> > Then I ran a few tests but this one ran for over it's allowed time  
> > (1800 seconds and was going over 50minutes...) and was up to 16Gigs  
> > of RAM:
> >
> > orterun -np 9 ./perftest-1.3c/mpptest -max_run_time 1800 -bisect - 
> > size 0 4096 1 -gnuplot -fname HyperTransport/ 
> > Global_bisect_0_4096_1.gpl
> >
> > I had to CTRL-\ the process as CTRL-C wasn't sufficient. 2 mpptest  
> > processes and 1 orterun process were using 100% CPU ou of of the 16  
> > cores.
> >
> > If any of this can be indicative of an OpenMPI bug and if I can  
> > help in tracking it down, don't hesitate to ask for details.
> >
> > And, finally, Anthony, thanks for the MPICC and --with-mpich  
> > pointers, I will try those to simplify the build process!
> >
> > Eric
> >
> > On Thursday 15 February 2007 19:51, Anthony Chan wrote:
> >>
> >> As long as mpicc is working, try configuring mpptest as
> >>
> >> mpptest/configure MPICC=/bin/mpicc
> >>
> >> or
> >>
> >> mpptest/configure  --with-mpich=
> >>
> >> A.Chan
> >>
> >> On Thu, 15 Feb 2007, Eric Thibodeau wrote:
> >>
> >>> Hi Jeff,
> >>>
> >>>   Thanks for your response, I eventually figured it out, here is the
> >>> only way I got mpptest to compile:
> >>>
> >>> export LD_LIBRARY_PATH="$HOME/openmpi_`uname -m`/lib"
> >>> CC="$HOME/openmpi_`uname -m`/bin/mpicc" ./configure --with- 
> >>> mpi="$HOME/openmpi_`uname -m`"
> >>>
> >>> And, yes I know I should use the mpicc wrapper and all (I do  
> >>> RTFM :P ) but
> >>> mpptest is less than cooperative and hasn't been updated lately  
> >>> AFAIK.
> >>>
> >>> I'll keep you posted on some results as I get some results out  
> >>> (testing
> >>> TCP/IP as well as the HyperTransport on a Tyan Beast). Up to now,  
> >>> LAM-MPI
> >>> seems less efficient at async communications and shows no  
> >>> improovments
> >>> with persistant communications under TCP/IP. OpenMPI, on the  
> >>> other hand,
> >>> seems more efficient using persistant communications when in a
> >>> HyperTransport (shmem) environment... I know I am crossing many test
> >>> boudaries but I will post some PNGs of my results (as well as how  
> >>> I got to
> >>> them ;)
> >>>
> >>> Eric
> >>>
> >>> On Thu, 15 Feb 2007, Jeff Squyres wrote:
> >>>
> >>>> I think you want to add $HOME/openmpi_`uname -m`/lib to your
> >>>> LD_LIBRARY_PATH.  This should allow executables created by mpicc  
> >>>> (or
> >>>> any derivation thereof, such as extracting flags vi

Re: [OMPI users] compiling mpptest using OpenMPI

2007-02-16 Thread Eric Thibodeau
Hello devs,

Thought I would let you know there seems to be a problem with 1.2b3r13112 when 
running the "bisection" test on a Tyan VX50 machine (the 8 DualCore model with 
32Gigs of RAM).

OpenMPI was compiled with (as seen from config.log):
configure:116866: running /bin/sh './configure'  CFLAGS="-O3 -DNDEBUG 
-finline-functions -fno-strict-aliasing -pthread" CPPFLAGS=" " FFLAGS="" 
LDFLAGS=" " --enable-shared --disable-static  
--prefix=/export/livia/home/parallel/eric/openmpi_x86_64 --with-mpi=open_mpi 
--cache-file=/dev/null --srcdir=.

MPPTEST (1.3c) was compiled with:
./configure --with-mpi=$HOME/openmpi_`uname -m`

...which, for some reason, works fine on that system that doesn't have any 
other MPI implementation (ie: doesn't have LAM-MPI as per this thread).

Then I ran a few tests, but this one ran for over its allowed time (1800 
seconds and was going over 50 minutes...) and was up to 16 Gigs of RAM:

orterun -np 9 ./perftest-1.3c/mpptest -max_run_time 1800 -bisect -size 0 4096 1 
-gnuplot -fname HyperTransport/Global_bisect_0_4096_1.gpl

I had to CTRL-\ the process as CTRL-C wasn't sufficient. 2 mpptest processes 
and 1 orterun process were using 100% CPU out of the 16 cores. 

If any of this can be indicative of an OpenMPI bug and if I can help in 
tracking it down, don't hesitate to ask for details.

And, finally, Anthony, thanks for the MPICC and --with-mpich pointers, I will 
try those to simplify the build process!

Eric

On Thursday 15 February 2007 19:51, Anthony Chan wrote:
> 
> As long as mpicc is working, try configuring mpptest as
> 
> mpptest/configure MPICC=/bin/mpicc
> 
> or
> 
> mpptest/configure  --with-mpich=
> 
> A.Chan
> 
> On Thu, 15 Feb 2007, Eric Thibodeau wrote:
> 
> > Hi Jeff,
> >
> > Thanks for your response, I eventually figured it out, here is the
> > only way I got mpptest to compile:
> >
> > export LD_LIBRARY_PATH="$HOME/openmpi_`uname -m`/lib"
> > CC="$HOME/openmpi_`uname -m`/bin/mpicc" ./configure 
> > --with-mpi="$HOME/openmpi_`uname -m`"
> >
> > And, yes I know I should use the mpicc wrapper and all (I do RTFM :P ) but
> > mpptest is less than cooperative and hasn't been updated lately AFAIK.
> >
> > I'll keep you posted on some results as I get some results out (testing
> > TCP/IP as well as the HyperTransport on a Tyan Beast). Up to now, LAM-MPI
> > seems less efficient at async communications and shows no improovments
> > with persistant communications under TCP/IP. OpenMPI, on the other hand,
> > seems more efficient using persistant communications when in a
> > HyperTransport (shmem) environment... I know I am crossing many test
> > boudaries but I will post some PNGs of my results (as well as how I got to
> > them ;)
> >
> > Eric
> >
> > On Thu, 15 Feb 2007, Jeff Squyres wrote:
> >
> > > I think you want to add $HOME/openmpi_`uname -m`/lib to your
> > > LD_LIBRARY_PATH.  This should allow executables created by mpicc (or
> > > any derivation thereof, such as extracting flags via showme) to find
> > > the Right shared libraries.
> > >
> > > Let us know if that works for you.
> > >
> > > FWIW, we do recommend using the wrapper compilers over extracting the
> > > flags via --showme whenever possible (it's just simpler and should do
> > > what you need).
> > >
> > >
> > > On Feb 15, 2007, at 3:38 PM, Eric Thibodeau wrote:
> > >
> > > > Hello all,
> > > >
> > > >
> > > > I have been attempting to compile mpptest on my nodes in vain. Here
> > > > is my current setup:
> > > >
> > > >
> > > > Openmpi is in "$HOME/openmpi_`uname -m`" which translates to "/
> > > > export/home/eric/openmpi_i686/". I tried the following approaches
> > > > (you can see some of these were out of desperation):
> > > >
> > > >
> > > > CFLAGS=`mpicc --showme:compile` LDFLAGS=`mpicc --showme:link` ./
> > > > configure
> > > >
> > > >
> > > > Configure fails on:
> > > >
> > > > checking whether the C compiler works... configure: error: cannot
> > > > run C compiled programs.
> > > >
> > > >
> > > > The log shows that:
> > > >
> > > > ./a.out: error while loading shared libraries: liborte.so.0: cannot
> > > > open shared object file: No such file or directory
> > > >
> > > >
> > > >
> > > > CC="/export/home/eric/openmpi_

Re: [OMPI users] x86_64 head with x86 diskless nodes, Node execution fails with SEGV_MAPERR

2006-07-16 Thread Eric Thibodeau
Thanks, now it all makes more sense to me. I'll try the hard way: multiple builds 
for multiple environments ;)
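
(For the record, once the per-architecture builds exist, one way to drive them from a
single launch is an mpirun application file, where each line gives one set of hosts and
the matching binary; the host names and binary names below are only placeholders:

$ cat appfile
-np 2 --host headless ./mandelbrot-mpi.x86_64
-np 2 --host node0,node1 ./mandelbrot-mpi.i686
$ mpirun --app appfile
)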

Eric

On Sunday 16 July 2006 18:21, Brian Barrett wrote:
> On Jul 16, 2006, at 4:13 PM, Eric Thibodeau wrote:
> > Now that I have that out of the way, I'd like to know how I am  
> > supposed to compile my apps so that they can run on an homogenous  
> > network with mpi. Here is an example:
> >
> > kyron@headless ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ mpicc -L/ 
> > usr/X/lib -lm -lX11 -O3 mandelbrot-mpi.c -o mandelbrot-mpi
> >
> > kyron@headless ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ mpirun -- 
> > hostfile hostlist -np 3 ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2/ 
> > mandelbrot-mpi
> >
> > -- 
> > 
> >
> > Could not execute the executable "/home/kyron/1_Files/1_ETS/ 
> > 1_Maitrise/MGL810/Devoir2/mandelbrot-mpi": Exec format error
> >
> >
> > This could mean that your PATH or executable name is wrong, or that  
> > you do not
> >
> > have the necessary permissions. Please ensure that the executable  
> > is able to be
> >
> > found and executed.
> >
> > -- 
> > 
> >
> > As can be seen with the uname -a that was run previously, I have 2  
> > "local nodes" on the x86_64 and two i686 nodes. I tried to find  
> > examples in the Doc on howto compile applications correctly for  
> > such a setup without compromising performance but I came short of  
> > an example.
> 
>  From the sound of it, you have a heterogeneous configuration -- some  
> nodes are x86_64 and some are x86.  Because of this, you either have  
> to compile your application twice, once for each platform or compile  
> your application for the lowest common denominator.  My guess would  
> be that it easier and more foolproof if you compiled everything in 32  
> bit mode.  If you run in a mixed mode, using application schemas (see  
> the mpirun man page) will be the easiest way to make things work.
> 
> 
> Brian
> 

-- 
Eric Thibodeau
Neural Bucket Solutions Inc.
T. (514) 736-1436
C. (514) 710-0517



Re: [OMPI users] x86_64 head with x86 diskless nodes, Node execution fails with SEGV_MAPERR

2006-07-16 Thread Eric Thibodeau
/me blushes in shame, it would seem that all I needed to do since the beginning 
was to run a make distclean. I apparently had some old compiled files lying 
around. Now I get:

kyron@headless ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ mpirun  --hostfile 
hostlist -np 4 uname -a
Linux headless 2.6.17-ck1-r1 #1 SMP Tue Jul 11 16:39:18 EDT 2006 x86_64 AMD 
Opteron(tm) Processor 244 GNU/Linux
Linux headless 2.6.17-ck1-r1 #1 SMP Tue Jul 11 16:39:18 EDT 2006 x86_64 AMD 
Opteron(tm) Processor 244 GNU/Linux
Linux node0 2.6.16-gentoo-r7 #5 Tue Jul 11 12:30:41 EDT 2006 i686 AMD 
Athlon(TM) XP 2500+ GNU/Linux
Linux node1 2.6.16-gentoo-r7 #5 Tue Jul 11 12:30:41 EDT 2006 i686 AMD 
Athlon(TM) XP 2500+ GNU/Linux

Which is correct. Sorry for the misfire, I hadn't thought of cleaning up the 
compilation dir...

Now that I have that out of the way, I'd like to know how I am supposed to 
compile my apps so that they can run on a heterogeneous network with MPI. Here is 
an example:
kyron@headless ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ mpicc -L/usr/X/lib 
-lm -lX11 -O3 mandelbrot-mpi.c -o mandelbrot-mpi
kyron@headless ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ mpirun --hostfile 
hostlist -np 3 ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2/mandelbrot-mpi
--
Could not execute the executable 
"/home/kyron/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2/mandelbrot-mpi": Exec 
format error

This could mean that your PATH or executable name is wrong, or that you do not
have the necessary permissions.  Please ensure that the executable is able to be
found and executed.
--
As can be seen from the uname -a that was run previously, I have 2 "local 
nodes" on the x86_64 and two i686 nodes. I tried to find examples in the docs on 
how to compile applications correctly for such a setup without compromising 
performance, but I came up short of an example.

Thanks,

Eric
PS: I know..maybe I should start another thread ;)

On Sunday 16 July 2006 14:31, Brian Barrett wrote:
> On Jul 15, 2006, at 2:58 PM, Eric Thibodeau wrote:
> > But, for some reason, on the Athlon node (in their image on the  
> > server I should say) OpenMPI still doesn't seem to be built  
> > correctly since it crashes as follows:
> >
> >
> > kyron@node0 ~ $ mpirun -np 1 uptime
> >
> > Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
> >
> > Failing at addr:(nil)
> >
> > [0] func:/home/kyron/openmpi_i686/lib/libopal.so.0 [0xb7f6258f]
> >
> > [1] func:[0xe440]
> >
> > [2] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_init_stage1 
> > +0x1d7) [0xb7fa0227]
> >
> > [3] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_system_init 
> > +0x23) [0xb7fa3683]
> >
> > [4] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_init+0x5f)  
> > [0xb7f9ff7f]
> >
> > [5] func:mpirun(orterun+0x255) [0x804a015]
> >
> > [6] func:mpirun(main+0x22) [0x8049db6]
> >
> > [7] func:/lib/tls/libc.so.6(__libc_start_main+0xdb) [0xb7de8f0b]
> >
> > [8] func:mpirun [0x8049d11]
> >
> > *** End of error message ***
> >
> > Segmentation fault
> >
> >
> > The crash happens both in the chrooted env and on the nodes. I  
> > configured both systems to have Linux and POSIX threads, though I  
> > see openmpi is calling the POSIX version (a message on the mailling  
> > list had hinted on keeping the Linux threads around...I have to  
> > anyways since sone apps like Matlab extensions still depend on  
> > this...). The following is the output for the libc info.
> 
> That's interesting...  We regularly build Open MPI on 32 bit Linux  
> machines (and in 32 bit mode on Opteron machines) without too much  
> issue.  It looks like we're jumping into a NULL pointer, which  
> generally means that a ORTE framework failed to initialize itself  
> properly.  It would be useful if you could rebuild with debugging  
> symbols (just add -g to CFLAGS when configuring) and run mpirun in  
> gdb.  If we can determine where the error is occurring, that would  
> definitely help in debugging your problem.
> 
> Brian
> 
> 

-- 
Eric Thibodeau
Neural Bucket Solutions Inc.
T. (514) 736-1436
C. (514) 710-0517

[OMPI users] x86_64 head with x86 diskless nodes, Node execution fails with SEGV_MAPERR

2006-07-15 Thread Eric Thibodeau
Hello all,

I've been trying to set up a small test cluster with a dual Opteron 
head and Athlon nodes. My environment in both cases is Gentoo and the nodes 
boot off PXE using an image built and stored on the master node. I chroot into 
the node's environment using:

linux32 chroot ${ROOT} /bin/bash

To cross over the 64/32-bit barrier. My user's home directory is loop-mounted 
into that environment and NFS-exported to the nodes. I build OpenMPI in the 
following way:

In the build folder of OpenMPI-1.1:
./configure --cache-file=config_`uname -m`.cache 
--enable-pretty-print-stacktrace --prefix=$HOME/openmpi_`uname -m`
make -j4 && make install

I perform this exact same command in the Opteron and chrooted environment for 
the Athlon machines. This then gives me the following folders in my $HOME:
/home/kyron/openmpi_i686
/home/kyron/openmpi_x86_64
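
(A quick sanity check that each tree really holds binaries of the intended architecture
is to run file(1) against something installed in each prefix, e.g. the orted daemon; the
exact output wording depends on the local file version:

file ~/openmpi_x86_64/bin/orted
file ~/openmpi_i686/bin/orted
)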

But, for some reason, on the Athlon node (in their image on the server I should 
say) OpenMPI still doesn't seem to be built correctly since it crashes as 
follows:

kyron@node0 ~ $ mpirun -np 1 uptime
Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:(nil)
[0] func:/home/kyron/openmpi_i686/lib/libopal.so.0 [0xb7f6258f]
[1] func:[0xe440]
[2] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_init_stage1+0x1d7) 
[0xb7fa0227]
[3] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_system_init+0x23) 
[0xb7fa3683]
[4] func:/home/kyron/openmpi_i686/lib/liborte.so.0(orte_init+0x5f) [0xb7f9ff7f]
[5] func:mpirun(orterun+0x255) [0x804a015]
[6] func:mpirun(main+0x22) [0x8049db6]
[7] func:/lib/tls/libc.so.6(__libc_start_main+0xdb) [0xb7de8f0b]
[8] func:mpirun [0x8049d11]
*** End of error message ***
Segmentation fault

The crash happens both in the chrooted env and on the nodes. I configured both 
systems to have Linux and POSIX threads, though I see openmpi is calling the 
POSIX version (a message on the mailing list had hinted at keeping the Linux 
threads around... I have to anyways since some apps like Matlab extensions still 
depend on this...). The following is the output for the libc info.

kyron@headless ~ $ /lib/tls/libc.so.6
GNU C Library stable release version 2.3.6, by Roland McGrath et al.
Copyright (C) 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.1.1 (Gentoo 4.1.1).
Compiled on a Linux 2.6.11 system on 2006-07-14.
Available extensions:
GNU libio by Per Bothner
crypt add-on version 2.1 by Michael Glad and others
Native POSIX Threads Library by Ulrich Drepper et al
The C stubs add-on version 2.1.2.
GNU Libidn by Simon Josefsson
BIND-8.2.3-T5B
NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
Thread-local storage support included.
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.

I am attaching the config.log and ompi_info for both platforms. Before sending 
this e-mail I tried compiling OpenMPI on one of the nodes (booted off the 
image) and I am getting the exact same problem (so chroot vs local build 
doesn't seem to be a factor). The attached file contains:

config.log.x86_64   <--config log for the Opteron build (works locally)
config.log_node0<--config log for the Athlon build (on the node)
ompi_info.i686  <--ompi_info on the Athlon node
ompi_info.x86_64<--ompi_info on the Opteron Master

Thanks,

-- 
Eric Thibodeau
Neural Bucket Solutions Inc.
T. (514) 736-1436
C. (514) 710-0517

ENV_info.tbz
Description: application/tbz


Re: [OMPI users] Tutorial

2006-07-11 Thread Eric Thibodeau
www.clustermonkey.net is a very good place to start, click on the "Columns" 
section in the "Main Menu" in the left pane.

On Tuesday 11 July 2006 07:25, Tony Power wrote:
> Hi!
> Where can I find a introductory tutorial on open-mpi?
> Thank you ;)
> 

-- 
Eric Thibodeau
Neural Bucket Solutions Inc.
T. (514) 736-1436
C. (514) 710-0517



Re: [OMPI users] MPI_Recv, is it possible to switch on/off aggresive mode during runtime?

2006-07-07 Thread Eric Thibodeau
Although it will add some overhead, have you tried using MPI_Probe before 
calling MPI_Recv? I am curious to know if the Probe is less CPU-intensive than 
a direct call to MPI_Recv. An example of how I use it:

MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);

MPI_Recv(DispBuff, height, MPI_UNSIGNED_LONG, status.MPI_SOURCE, status.MPI_TAG, MPI_COMM_WORLD, &status);

(This is used to receive known data from an unknown source)

Eric

On Wednesday 5 July 2006 10:54, Marcin Skoczylas wrote:
> Dear open-mpi users,
> 
> I saw some posts ago almost the same question as I have, but it didn't 
> give me satisfactional answer.
> I have setup like this:
> 
> GUI program on some machine (f.e. laptop)
> Head listening on tcpip socket for commands from GUI.
> Workers waiting for commands from Head / processing the data.
> 
> And now it's problematic. For passing the commands from Head I'm using:
> while(true)
> {
> MPI_Recv...
>
> do whatever head said (process small portion of the data, return 
> result to head, wait for another commands)
> }
> 
> So in the idle time workers are stuck in MPI_Recv and have 100% CPU 
> usage, even if they are just waiting for the commands from Head. 
> Normally, I would not prefer to have this situation as I sometimes have 
> to share the cluster with others. I would prefer not to stop whole mpi 
> program, but just go into 'idle' mode, and thus make it run again soon. 
> Also I would like to have this aggresive MPI_Recv approach switched on 
> when I'm alone on the cluster. So is it possible somehow to switch this 
> mode on/off during runtime? Thank you in advance!
> 
> greetings, Marcin
> 
> 
> 

-- 
Eric Thibodeau
Neural Bucket Solutions Inc.
T. (514) 736-1436
C. (514) 710-0517



Re: [OMPI users] Can I install OpenMPI on a machine where I have mpich2

2006-07-04 Thread Eric Thibodeau
The only reference I have at the moment (a technical article, in French):

http://www.manitou.uqam.ca/manitou.dll?lire+recherche+_DEFAUT+format+html+expression+%2340786002:2

I strongly recommend scanning IEEE on the subject though, and checking out the 
Beowulf mailing list.

Eric

On Monday 3 July 2006 23:40, Manal Helal wrote:
> Hi Eric
> 
> Thank you very much for your reply.
> 
> I am a PhD student, and I do need this comparison for academic purposes;
> a fairly generic one will do, and I guess after running on both, I might
> have my own application/hardware specific points to add, 
> 
> Thanks again, I appreciate it, 
> 
> Manal
> 
> On Mon, 2006-07-03 at 23:17 -0400, Eric Thibodeau wrote:
> > See comments below:
> > 
> > > On Monday 3 July 2006 23:01, Manal Helal wrote:
> > > Hi
> > > 
> > > I am having problems running a multi-threaded  applications using MPICH
> > > 2, and considering moving to OpenMPI. I already have mpich2 installed,
> > > and don't want to uninstall as yet. Can I have both installed and works
> > > fine on the same machine?
> > Yes, simply run the configure script with something like:
> > 
> > ./configure --prefix=$HOME/openmpi-`uname -m`
> > 
> > You will then be able to compile applications with:
> > 
> > ~/openmpi-i686/bin/mpicc app.c -o app
> > 
> > And run them with:
> > 
> > ~/openmpi-i686/bin/mpirun -np 3 app
> > 
> > > Also, I searched for a comparison of features of mpich vs lammpi vs
> > > openmpi and didn't find any so far. Will you please help me find one?
> > 
> > Comparison is only relevant on your hardware with you application. Any 
> > other comparison are mostly for academic purposes and grand assignments ;)
> > 
> > > Thank you for your help in advance, 
> > > 
> > > Regards, 
> > > 
> > > Manal
> > 
> 

-- 
Eric Thibodeau
Neural Bucket Solutions Inc.
T. (514) 736-1436
C. (514) 710-0517



Re: [OMPI users] Re : OpenMPI 1.1: Signal:10, info.si_errno:0(Unknown, error: 0), si_code:1(BUS_ADRALN)

2006-06-28 Thread Eric Thibodeau
I am actually running the released 1.1. I can send you my code, if you want, 
and you could try running it off a single node with -np 4 or 5 
(oversubscribing) and see if you get a BUS_ADRALN error off one node. The only 
restriction to compiling the code is that X libs be available (display is not 
required for the execution though it's more fun :P)

Eric

On Wednesday 28 June 2006 13:02, Terry D. Dontje wrote:
> Well, I've been using the trunk and not 1.1.  I also just built 
> 1.1.1a1r10538 and ran
> it with no bus error.  Though you are running 1.1b5r10421 so we're not 
> running the
> same thing, as of yet.
> 
> I have a cluster of two v440 that have 4 cpus each running Solaris 10.  
> The tests I
> am running are np=2 one process on each node.
> 
> --td
> 
> Eric Thibodeau wrote:
> 
> >Terry,
> >
> > I was about to comment on this. Could you tell me the specs of your 
> > machine? As you will notice in "my thread", I am running into problems on 
> > Sparc SMP systems where the CPU boards' RTCs are in a doubtful state. 
> > Are you running 1.1 on SMP machines? If so, on how many procs and what 
> > hardware/OS version is this running off?
> >
> >ET
> >
> >On Wednesday 28 June 2006 10:35, Terry D. Dontje wrote:
> >  
> >
> >>Frank,
> >>
> >>Can you set your limit coredumpsize to non-zero rerun the program
> >>and then get the stack via dbx?
> >>
> >>So, I have a similar case of BUS_ADRALN on SPARC systems with an
> >>older version (June 21st) of the trunk.  I've since run using the latest 
> >>trunk and the
> >>bus went away.  I am now going to try this out with v1.1 to see if I get 
> >>similar
> >>results.  Your stack would help me try and determine if this is an 
> >>OpenMPI issue
> >>or possibly some type of platform problem.
> >>
> >>There is another thread with Eric Thibodeau that I am unsure if it is 
> >>the same issue
> >>as either of our situation. 
> >>
> >>--td
> >>
> >>
> >[...snip...]
> >  
> >
> 

-- 
Eric Thibodeau
Neural Bucket Solutions Inc.
T. (514) 736-1436
C. (514) 710-0517



Re: [OMPI users] users Digest, Vol 317, Issue 4

2006-06-28 Thread Eric Thibodeau
The problem was resolved in the 1.1 series... so I didn't push any further. 
Thanks!

On Wednesday 28 June 2006 09:21, openmpi-user wrote:
> Hi Eric (and all),
> 
> don't know if this really messes things up, but you have set up lam-mpi 
> in your path-variables, too:
> 
> [enterprise:24786] pls:rsh: reset LD_LIBRARY_PATH: 
> /export/lca/home/lca0/etudiants/ac38820/openmpi_sun4u/lib:/export/lca/appl/Forte/SUNWspro/WS6U2/lib:/usr/local/lib:*/usr/local/lam-mpi/7.1.1/lib*:/opt/sfw/lib
> 
> 
> Yours,
> Frank
> 
> users-requ...@open-mpi.org wrote:
> > Today's Topics:
> >
> >1. Re: Installing OpenMPI on a solaris (Jeff Squyres (jsquyres))
> >
> >
> > --
> >
> > Message: 1
> > Date: Wed, 28 Jun 2006 08:56:36 -0400
> > From: "Jeff Squyres \(jsquyres\)" <jsquy...@cisco.com>
> > Subject: Re: [OMPI users] Installing OpenMPI on a solaris
> > To: "Open MPI Users" <us...@open-mpi.org>
> >
> > Bummer!  :-(
> >  
> > Just to be sure -- you had a clean config.cache file before you ran 
> > configure, right?  (e.g., the file didn't exist -- just to be sure it 
> > didn't get potentially erroneous values from a previous run of configure)  
> > Also, FWIW, it's not necessary to specify --enable-ltdl-convenience; that 
> > should be automatic.
> >  
> > If you had a clean configure, we *suspect* that this might be due to 
> > alignment issues on Solaris 64 bit platforms, but thought that we might 
> > have had a pretty good handle on it in 1.1.  Obviously we didn't solve 
> > everything.  Bonk.
> >  
> > Did you get a corefile, perchance?  If you could send a stack trace, that 
> > would be most helpful.
> >
> >
> > 
> >
> > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> > Behalf Of Eric Thibodeau
> > Sent: Tuesday, June 20, 2006 8:36 PM
> > To: us...@open-mpi.org
> > Subject: Re: [OMPI users] Installing OpenMPI on a solaris
> > 
> > 
> >
> > Hello Brian (and all),
> >
> > 
> >
> > Well, the joy was short lived. On a 12 CPU Enterprise machine and on a 
> > 4 CPU one, I seem to be able to start up to 4 processes. Above 4, I seem to 
> > inevitably get BUS_ADRALN (Bus collisions?). Below are some traces of the 
> > failling runs as well as a detailed (mpirun -d) of one of these situations 
> > and ompi_info output. Obviously, don't hesitate to ask if more information 
> > is requred.
> >
> > 
> >
> > Buid version: openmpi-1.1b5r10421
> >
> > Config parameters:
> >
> > Open MPI config.status 1.1b5
> >
> > configured by ./configure, generated by GNU Autoconf 2.59,
> >
> > with options \"'--cache-file=config.cache' 'CFLAGS=-mcpu=v9' 
> > 'CXXFLAGS=-mcpu=v9' 'FFLAGS=-mcpu=v9' 
> > '--prefix=/export/lca/home/lca0/etudiants/ac38820/openmp
> >
> > i_sun4u' --enable-ltdl-convenience\"
> >
> > 
> >
> > The traces:
> >
> > sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ 
> > ~/openmpi_sun4u/bin/mpirun -np 10 mandelbrot-mpi 100 400 400
> >
> > Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)
> >
> > Failing at addr:2f4f04
> >
> > *** End of error message ***
> >
> > sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ 
> > ~/openmpi_sun4u/bin/mpirun -np 8 mandelbrot-mpi 100 400 400
> >
> > Signal:10 info.si_errno:0(Error 0) si_code:1(BUS_ADRALN)
> >
> > Failing at addr:2b354c
> >
> > *** End of error message ***
> >
> > sshd@enterprise ~/1_Files/1_ETS/1_Maitrise/MGL810/Devoir2 $ 
> > ~/openmpi

Re: [OMPI users] Installing OpenMPI on a solaris

2006-06-28 Thread Eric Thibodeau
Yeah bummers, but something tells me it might not be OpenMPI's fault. Here's 
why:

1- The tech who takes care of these machines told me that he gets RTC errors 
on bootup (the CPU boards are apparently "out of sync" since the clocks aren't 
set correctly).
2- There is also a possibility that the prior admin did not put in a "stable" 
firmware version.

So if any Sun guru can help out by telling me which command to use, or point me to a quick 
HOWTO for resolving these clock issues, it would be greatly appreciated (our 
analyst is overloaded and he would not be able to justify the 3 days of reading 
up docs just to satisfy my parallel-code problems ;P)

3- I realised that the OS is not booted in 64-bit mode O_o!! (not that this has to do 
with OpenMPI bombing):

Jun 21 07:45:15 unknown genunix: [ID 540533 kern.notice] ^MSunOS Release 5.8 
Version Generic_108528-29 32-bit
Jun 21 07:45:15 unknown NOTICE: 64-bit OS installed, but the 32-bit OS is the 
default
Jun 21 07:45:15 unknown Booting the 32-bit OS ...

4- LAM-MPI 7.1.1 also bombs, but it does so at a much higher process count 
(OpenMPI bombs at 5, LAM-MPI bombs around 10, but it varies).

As for the questions regarding OpenMPI build, I just recently built 1.1 with 
the same basic configure options with the exact same results (clean cache).

So, I guess this one is on pause until I have confirmation that the clocks 
on the processor boards are set correctly. There is one thing that bothers me 
though: one of the machines has only 1 processor board (4 procs) and I still 
get the error on that machine if I go over 4 processes... how can a board be 
out of sync with itself??

Eric
PS: I am at liberty of providing the source code if anyone wants it.

On Wednesday 28 June 2006 08:56, Jeff Squyres (jsquyres) wrote:
> Bummer!  :-(
>  
> Just to be sure -- you had a clean config.cache file before you ran 
> configure, right?  (e.g., the file didn't exist -- just to be sure it didn't 
> get potentially erroneous values from a previous run of configure)  Also, 
> FWIW, it's not necessary to specify --enable-ltdl-convenience; that should be 
> automatic.
>  
> If you had a clean configure, we *suspect* that this might be due to 
> alignment issues on Solaris 64 bit platforms, but thought that we might have 
> had a pretty good handle on it in 1.1.  Obviously we didn't solve everything. 
>  Bonk.
>  
> Did you get a corefile, perchance?  If you could send a stack trace, that 
> would be most helpful.
> 
> 
[...snip...]


Re: [OMPI users] Installing OpenMPI on a solaris

2006-06-20 Thread Eric Thibodeau
onent v1.1)
   MCA timer: solaris (MCA v1.0, API v1.0, Component v1.1)
   MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
   MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
MCA coll: basic (MCA v1.0, API v1.0, Component v1.1)
MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1)
MCA coll: self (MCA v1.0, API v1.0, Component v1.1)
MCA coll: sm (MCA v1.0, API v1.0, Component v1.1)
MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1)
  MCA io: romio (MCA v1.0, API v1.0, Component v1.1)
   MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1)
 MCA pml: dr (MCA v1.0, API v1.0, Component v1.1)
 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1)
 MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1)
  MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1)
 MCA btl: self (MCA v1.0, API v1.0, Component v1.1)
 MCA btl: sm (MCA v1.0, API v1.0, Component v1.1)
 MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
MCA topo: unity (MCA v1.0, API v1.0, Component v1.1)
 MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
 MCA gpr: null (MCA v1.0, API v1.0, Component v1.1)
 MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1)
 MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1)
 MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1)
 MCA iof: svc (MCA v1.0, API v1.0, Component v1.1)
  MCA ns: proxy (MCA v1.0, API v1.0, Component v1.1)
  MCA ns: replica (MCA v1.0, API v1.0, Component v1.1)
 MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
 MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.1)
 MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.1)
 MCA ras: localhost (MCA v1.0, API v1.0, Component v1.1)
 MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.1)
 MCA rds: resfile (MCA v1.0, API v1.0, Component v1.1)
   MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.1)
MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.1)
MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.1)
 MCA rml: oob (MCA v1.0, API v1.0, Component v1.1)
 MCA pls: fork (MCA v1.0, API v1.0, Component v1.1)
 MCA pls: rsh (MCA v1.0, API v1.0, Component v1.1)
 MCA sds: env (MCA v1.0, API v1.0, Component v1.1)
 MCA sds: pipe (MCA v1.0, API v1.0, Component v1.1)
 MCA sds: seed (MCA v1.0, API v1.0, Component v1.1)
 MCA sds: singleton (MCA v1.0, API v1.0, Component v1.1)

On Tuesday 20 June 2006 17:06, Eric Thibodeau wrote:
> Thanks for the pointer, it WORKS!! (yay)
> 
> On Tuesday 20 June 2006 12:21, Brian Barrett wrote:
> > On Jun 19, 2006, at 12:15 PM, Eric Thibodeau wrote:
> > 
> > > I checked the thread with the same title as this e-mail and tried  
> > > compiling openmpi-1.1b4r10418 with:
> > >
> > > ./configure CFLAGS="-mv8plus" CXXFLAGS="-mv8plus" FFLAGS="-mv8plus"  
> > > FCFLAGS="-mv8plus" --prefix=$HOME/openmpi-SUN-`uname -r` --enable- 
> > > pretty-print-stacktrace
> > I put the incorrect flags in the error message - can you try again with:
> > 
> > 
> >./configure CFLAGS=-mcpu=v9 CXXFLAGS=-mcpu=v9 FFLAGS=-mcpu=v9  
> > FCFLAGS=-mcpu=v9 --prefix=$HOME/openmpi-SUN-`uname -r` --enable- 
> > pretty-print-stacktrace
> > 
> > 
> > and see if that helps?  By the way, I'm not sure if Solaris has the  
> > required support for the pretty-print stack trace feature.  It likely  
> > will print what signal caused the error, but will not actually print  
> > the stack trace.  It's enabled by default on Solaris, with this  
> > limited functionality (the option exists for platforms that have  
> > broken half-support for GNU libc's stack trace feature, and for users  
> > that don't like us registering a signal handler to do the work).
> > 
> > Brian
> > 
> > 
> 

-- 
Eric Thibodeau
Neural Bucket Solutions Inc.
T. (514) 736-1436
C. (514) 710-0517

Re: [OMPI users] pls:rsh: execv failed with errno=2

2006-06-17 Thread Eric Thibodeau
Hello Jeff,

	Firstly, don't worry about jumping in late, I'll send you a skid rope 
;) Secondly, thanks for your nice little articles on clustermonkey.net (a good 
refresher on MPI). And finally, down to my issues: thanks for clearing up the 
--prefix / LD_LIBRARY_PATH behaviour and all. The ebuild I made/mangled for OpenMPI under 
Gentoo was modified by some of the devs to follow some of the lib vs lib64 
requirements. I might change them to be identical (only $PREFIX/lib) across platforms 
since multi-arch MPI will be hell to get working with a changing 
LD_LIBRARY_PATH.

After some recommendations, I tried openmpi-1.1b3r10389 on the AMD64 arch and 
got my MPI app running on that single dual-Opteron node. I still have to figure 
out the --prefix/PATH/LD_LIBRARY_PATH mess to get the app to spawn across that 
dual Opteron node and 2 single Athlon nodes (cross-arch with the varying 
LD_LIBRARY_PATH). But that's another issue for the moment (a bit of fiddling on 
my side to get orte to be recognized on the nodes).
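
(For the record, per-architecture prefixes like the $HOME/openmpi_`uname -m` layout used
elsewhere in these threads lend themselves to a small shell-startup snippet on each node,
which avoids most of the --prefix juggling; a minimal sketch under that assumption:

MPI_PREFIX="$HOME/openmpi_`uname -m`"
export PATH="${MPI_PREFIX}/bin:$PATH"
export LD_LIBRARY_PATH="${MPI_PREFIX}/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
)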

As for the sparc-sun-solaris2.8 target, I tried compiling openmpi-1.1b3r10389 but it 
bombs with both gcc and the Sun cc:

Making all in asm
source='asm.c' object='asm.lo' libtool=yes \
DEPDIR=.deps depmode=none /bin/bash ../.././config/depcomp \
/bin/bash ../../libtool --tag=CC --mode=compile 
/export/lca/appl/Forte/SUNWspro/WS6U2/bin/cc -DHAVE_CONFIG_H  -I. -I. 
-I../../opal/include -I../../orte/include -I../../ompi/include 
-I../../ompi/include   -I../..   -O -DNDEBUG  -mt -c -o asm.lo asm.c
 /export/lca/appl/Forte/SUNWspro/WS6U2/bin/cc -DHAVE_CONFIG_H -I. -I. 
-I../../opal/include -I../../orte/include -I../../ompi/include 
-I../../ompi/include -I../.. -O -DNDEBUG -mt -c asm.c  -KPIC -DPIC -o 
.libs/asm.o
"../../opal/include/opal/sys/atomic.h", line 486: #error: Atomic arithmetic on 
pointers not supported
cc: acomp failed for asm.c
*** Error code 1

I was told by one of the system admins that the Sun Enterprise machine (12 
procs) has "special" considerations when using semaphores (it's implemented in 
hardware O_o!); I'm only mentioning this because of the error message (Atomic 
arithmetic ...)

So, I got half my problem resolved with the upgrade; any suggestions for 
compiling OpenMPI on this _old_ but very educational SMP machine?

Eric

On Friday 16 June 2006 17:32, Jeff Squyres (jsquyres) wrote:
> Sorry for jumping in late...
> 
> The /lib vs. /lib64 thing as part of --prefix was definitely broken until 
> recently.  This behavior has been fixed in the 1.1 series.  Specifically, 
> OMPI will take the prefix that you provided and append the basename of the 
> local $libdir.  So if you configured OMPI with something like:
> 
>  shell$ ./configure --libdir=/some/path/lib64 ...
> 
> And then you run:
> 
>  shell$ mpirun --prefix /some/path ...
> 
> Then OMPI will add /some/path/lib64 to the remote LD_LIBRARY_PATH.  The 
> previous behavior would always add "/lib" to the remote LD_LIBRARY_PATH, 
> regardless of what the local $libdir was (i.e., it ignored the basename of 
> your $libdir).  
> 
> If you have a situation more complicated than this (e.g., your $libdir is 
> different than your prefix by more than just the basename), then --prefix is 
> not the solution for you.  Instead, you'll need to set your $PATH and 
> $LD_LIBRARY_PATH properly on all nodes (e.g., in your shell startup files). 
> Specifically, --prefix is meant to be an easy workaround for common 
> configurations where $libdir is a subdirectory under $prefix.
> 
> Another random note: invoking mpirun with an absolute path (e.g., 
> /path/to/bin/mpirun) is exactly the same as specifying --prefix /path/to -- 
> so you don't have to do both.
> 
> 
[..SNIP..]

Re: [OMPI users] pls:rsh: execv failed with errno=2

2006-06-16 Thread Eric Thibodeau
Hello,
	I don't want to get too much off topic in this reply, but you're 
bringing out a point here. I am unable to run MPI apps on the AMD64 platform 
with the regular exporting of $LD_LIBRARY_PATH and $PATH; this is why I have no 
choice but to revert to using the --prefix approach. Here are a few execution 
examples to demonstrate my point:

kyron@headless ~ $ /usr/lib64/openmpi/1.0.2-gcc-4.1/bin/mpirun --prefix 
/usr/lib64/openmpi/1.0.2-gcc-4.1/ -np 2 ./a.out
./a.out: error while loading shared libraries: libmpi.so.0: cannot open shared 
object file: No such file or directory
kyron@headless ~ $ /usr/lib64/openmpi/1.0.2-gcc-4.1/bin/mpirun --prefix 
/usr/lib64/openmpi/1.0.2-gcc-4.1/lib64/ -np 2 ./a.out
[headless:10827] pls:rsh: execv failed with errno=2
[headless:10827] ERROR: A daemon on node localhost failed to start as expected.
[headless:10827] ERROR: There may be more information available from
[headless:10827] ERROR: the remote shell (see above).
[headless:10827] ERROR: The daemon exited unexpectedly with status 255.
kyron@headless ~ $ cat opmpi64.sh
#!/bin/bash
MPI_BASE='/usr/lib64/openmpi/1.0.2-gcc-4.1'
export PATH=$PATH:${MPI_BASE}/bin
LD_LIBRARY_PATH=${MPI_BASE}/lib64
kyron@headless ~ $ . opmpi64.sh
kyron@headless ~ $ mpirun -np 2 ./a.out
./a.out: error while loading shared libraries: libmpi.so.0: cannot open shared 
object file: No such file or directory
kyron@headless ~ $
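
(Worth noting for readers of the archive: in the opmpi64.sh shown above, LD_LIBRARY_PATH
is assigned but never exported, which on its own would reproduce the libmpi.so.0 error in
the child processes. A variant that exports both variables, with the same paths, would be:

#!/bin/bash
MPI_BASE='/usr/lib64/openmpi/1.0.2-gcc-4.1'
export PATH=$PATH:${MPI_BASE}/bin
export LD_LIBRARY_PATH=${MPI_BASE}/lib64
)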

Eric

On Friday 16 June 2006 10:31, Pak Lui wrote:
> Hi, I noticed your prefix set to the lib dir, can you try without the 
> lib64 part and rerun?
> 
> Eric Thibodeau wrote:
> > Hello everyone,
> > 
> > Well, first off, I hope this problem I am reporting is of some validity, 
> > I tried finding simmilar situations off Google and the mailing list but 
> > came up with only one reference [1] which seems invalid in my case since 
> > all executions are local (naïve assumptions that it makes a difference 
> > on the calling stack). I am trying to run asimple HelloWorld using 
> > OpenMPI 1.0.2 on an AMD64 machine and a Sun Enterprise (12 procs) 
> > machine. In both cases I get the following error:
> > 
> > pls:rsh: execv failed with errno=2
> > 
> > Here is the mpirun -d trace when running my HelloWorld (on AMD64):
> > 
> > kyron@headless ~ $ mpirun -d --prefix 
> > /usr/lib64/openmpi/1.0.2-gcc-4.1/lib64/ -np 4 ./hello
> > 
> > [headless:10461] procdir: (null)
> > 
> > [headless:10461] jobdir: (null)
> > 
> > [headless:10461] unidir: 
> > /tmp/openmpi-sessions-kyron@headless_0/default-universe
> > 
> > [headless:10461] top: openmpi-sessions-kyron@headless_0
> > 
> > [headless:10461] tmp: /tmp
> > 
> > [headless:10461] [0,0,0] setting up session dir with
> > 
> > [headless:10461] tmpdir /tmp
> > 
> > [headless:10461] universe default-universe-10461
> > 
> > [headless:10461] user kyron
> > 
> > [headless:10461] host headless
> > 
> > [headless:10461] jobid 0
> > 
> > [headless:10461] procid 0
> > 
> > [headless:10461] procdir: 
> > /tmp/openmpi-sessions-kyron@headless_0/default-universe-10461/0/0
> > 
> > [headless:10461] jobdir: 
> > /tmp/openmpi-sessions-kyron@headless_0/default-universe-10461/0
> > 
> > [headless:10461] unidir: 
> > /tmp/openmpi-sessions-kyron@headless_0/default-universe-10461
> > 
> > [headless:10461] top: openmpi-sessions-kyron@headless_0
> > 
> > [headless:10461] tmp: /tmp
> > 
> > [headless:10461] [0,0,0] contact_file 
> > /tmp/openmpi-sessions-kyron@headless_0/default-universe-10461/universe-setup.txt
> > 
> > [headless:10461] [0,0,0] wrote setup file
> > 
> > [headless:10461] spawn: in job_state_callback(jobid = 1, state = 0x1)
> > 
> > [headless:10461] pls:rsh: local csh: 0, local bash: 1
> > 
> > [headless:10461] pls:rsh: assuming same remote shell as local shell
> > 
> > [headless:10461] pls:rsh: remote csh: 0, remote bash: 1
> > 
> > [headless:10461] pls:rsh: final template argv:
> > 
> > [headless:10461] pls:rsh: /usr/bin/ssh  orted --debug 
> > --bootproxy 1 --name  --num_procs 2 --vpid_start 0 --nodename 
> >  --universe kyron@headless:default-universe-10461 --nsreplica 
> > "0.0.0;tcp://142.137.135.124:37657;tcp://192.168.1.1:37657" --gprreplica 
> > "0.0.0;tcp://142.137.135.124:37657;tcp://192.168.1.1:37657" 
> > --mpi-call-yield 0
> > 
> > [headless:10461] pls:rsh: launching on node localhost
> > 
> > [headless:10461] pls:rsh: oversubscribed -- setting mpi_yield_when_idle 
> > to 1 (1 4)
> > 
> > [headless:10461] pls:rsh: localhost is a LOCAL node
> >