Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-24 Thread Randolph Pullen
Std TCP/IP stack.
It hung with an unknown but large(ish) quantity of data. When I ran just one 
Bcast it was fine, but Bcasts in lots in separate MPI_WORLDs hung. All the 
details are in some recent posts.

I could not figure it out and moved back to my PVM solution.


--- On Wed, 25/8/10, Rahul Nabar  wrote:

From: Rahul Nabar 
Subject: Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: 
debug ideas?
To: "Open MPI Users" 
Received: Wednesday, 25 August, 2010, 3:38 AM

On Mon, Aug 23, 2010 at 8:39 PM, Randolph Pullen
 wrote:
>
> I have had a similar load related problem with Bcast.

Thanks Randolph! That's interesting to know! What was the hardware you
were using? Does your bcast fail at the exact same point too?

>
> I don't know what caused it though.  With this one, what about the 
> possibility of a buffer overrun or network saturation?

How can I test for a buffer overrun?

For network saturation I guess I could use something like mrtg to
monitor the bandwidth used. On the other hand, all 32 servers are
connected to a single dedicated Nexus5000. The back-plane carries no
other traffic. Hence I am skeptical that just 41,943,040 bytes saturated what
Cisco rates as a 10GigE fabric. But I might be wrong.

-- 
Rahul


Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-24 Thread Jeff Squyres
On Aug 24, 2010, at 1:58 PM, Rahul Nabar wrote:

> There are a few unusual things about the cluster. We are using a
> 10GigE ethernet fabric. Each node has dual eth adapters. One 1GigE and
> the other 10GigE. These are on separate subnets although the order of
> the eth interfaces is variable. i.e. 10GigE might be eth0 on one and
> eth2 on the next. In case this matters. I was told this shouldn't be
> an issue.

Are all the eth0's on one subnet and all the eth2's on a different subnet?

Or are all eth0's and eth2's all on the same subnet?
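
(Side note: if the interface naming really is inconsistent across nodes, you can
pin Open MPI's TCP traffic to one network regardless of the eth numbering. A
rough sketch - the interface name and subnet below are placeholders, not your
actual values:

  mpirun --mca btl_tcp_if_include eth2 ...
  mpirun --mca btl_tcp_if_include 10.0.0.0/24 ...   # newer releases also accept CIDR subnets here

The oob_tcp_if_include parameter works the same way for the out-of-band channel.)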

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-24 Thread Rahul Nabar
On Mon, Aug 23, 2010 at 9:43 PM, Richard Treumann  wrote:
> Bugs are always a possibility but unless there is something very unusual
> about the cluster and interconnect or this is an unstable version of MPI, it
> seems very unlikely this use of MPI_Bcast with so few tasks and only a 1/2
> MB message would trip on one.  80 tasks is a very small number in modern
> parallel computing.  Thousands of tasks involved in an MPI collective has
> become pretty standard.

Here's something absolutely strange that I accidentally stumbled upon:

I ran the test again, but accidentally forgot to kill the
user jobs already running on the test servers (via Torque and our
usual queues).
I was about to kick myself, but I couldn't believe that the test
actually completed! I mean, the timings are horribly bad, but the test
(for the first time) runs to completion. How could this be happening?
It doesn't make sense to me that the test completes when the
cards+servers+network are loaded but not otherwise! But I repeated the
experiment many times and still got the same result.

# /opt/src/mpitests/imb/src/IMB-MPI1 -npmin 256 bcast
[snip]
# Bcast
   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
        0         1000         0.02         0.02         0.02
        1           34    546807.94    626743.09    565196.07
        2           34     37159.11     52942.09     44910.73
        4           34     19777.97     40382.53     29656.53
        8           34     36060.21     53265.27     43909.68
       16           34     11765.59     31912.50     19611.75
       32           34     23530.79     41176.94     32532.89
       64           34     11735.91     23529.02     16552.16
      128           34     47998.44     59323.76     55164.14
      256           34     18121.96     30500.15     25528.95
      512           34     20072.76     33787.32     26786.55
     1024           34     39737.29     55589.97     45704.99
     2048            9     77787.56    150555.66    118741.83
     4096            9         4.67    118331.78     77201.40
     8192            9     80835.66        16.56    133781.08
    16384            9     77032.88    149890.66    119558.73
    32768            9    111819.45        18.99    149048.91
    65536            9    159304.67        98.99    195071.34
   131072            9    172941.13    262216.57    218351.14
   262144            9    161371.65    266703.79    223514.31
   524288            2       497.46   4402568.94   2183980.20
  1048576            2      5401.49   3519284.01   1947754.45
  2097152            2     75251.10   4137861.49   2220910.50
  4194304            2     33270.48   4601072.91   2173905.32
# All processes entering MPI_Finalize

Another observation is that if I replace the openib BTL with the tcp
BTL the tests run OK.
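
For completeness, the two runs differ only in the BTL selection on the mpirun
command line; roughly like this (the host file and IMB path are placeholders,
not my exact invocation):

  mpirun --mca btl openib,sm,self -np 256 -hostfile hosts IMB-MPI1 -npmin 256 bcast   # stalls
  mpirun --mca btl tcp,sm,self    -np 256 -hostfile hosts IMB-MPI1 -npmin 256 bcast   # completes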


-- 
Rahul



Re: [OMPI users] Open-MPI 1.4.2 : mpirun core-dumps when "-npernode N" is used at command line

2010-08-24 Thread Michael E. Thomadakis

 Hi Jeff

On 08/24/10 15:24, Jeff Squyres wrote:

I'm a little confused by your configure line:

./configure --prefix=/g/software/openmpi-1.4.3a1r23542/gcc-4.1.2 2 
--enable-cxx-exceptions CFLAGS=-O2 CXXFLAGS=-O2 FFLAGS=-O2 FCFLAGS=-O2



"oppss" that '2' was some leftover character after I edited the command 
line to configure wrt to GCC (from an original command line configuring 
with Intel compilers) *thanks for noticing this.*


I reran configure with

./configure --prefix="/g/software/openmpi-1.4.3a1r23542/gcc-4.1.2" 
--enable-cxx-exceptions  CFLAGS="-O2" CXXFLAGS="-O2"  FFLAGS="-O2" 
FCFLAGS="-O2"


and then ran make, and this time I did NOT notice any error messages.

*Thanks* for the help with this. I will now run mpirun with various 
options in a PBS/Torque environment and see if hybrid MPI+OMP jobs are 
placed on the nodes in a sane fashion.
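
For reference, the first combination I plan to try looks roughly like this
(the process counts and the binary name are placeholders):

  export OMP_NUM_THREADS=2
  mpirun -np 6 -npernode 2 -cpus-per-proc 2 --display-map ./hybrid_app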


Thanks

Michael




What's the lone "2" in the middle (after the prefix)?

With that extra "2", I'm not able to get configure to complete successfully (because it interprets 
that "2" as a platform name that does not exist).  If I remove that "2", configure 
completes properly and the build completes properly.

I'm afraid I no longer have any RH hosts to test on.  Can you do the following:

cd top_of_build_dir
cd ompi/debuggers
rm ompi_debuggers.lo
make

Then copy-n-paste the gcc command used to compile the ompi_debuggers.o file, remove "-o 
.libs/libdebuggers_la-ompi_debuggers.o", and add "-E", and redirect the output to a 
file.  Then send me that file -- it should give more of a clue as to exactly what the problem is 
that you're seeing.




On Aug 24, 2010, at 3:25 PM, Michael E. Thomadakis wrote:


On 08/24/10 14:22, Michael E. Thomadakis wrote:

Hi,

I used a 'tee' command to capture the output but I forgot to also redirect
stderr to the file.

This is what a fresh make gave (gcc 4.1.2 again) :

--
ompi_debuggers.c:81: error: missing terminating " character
ompi_debuggers.c:81: error: expected expression before \u2018;\u2019 token
ompi_debuggers.c: In function \u2018ompi_wait_for_debugger\u2019:
ompi_debuggers.c:212: error: \u2018mpidbg_dll_locations\u2019 undeclared
(first use in this function)
ompi_debuggers.c:212: error: (Each undeclared identifier is reported only once
ompi_debuggers.c:212: error: for each function it appears in.)
ompi_debuggers.c:212: warning: passing argument 3 of \u2018check\u2019 from
incompatible pointer type
make[2]: *** [libdebuggers_la-ompi_debuggers.lo] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

--

Is this critical to run OMPI code?

Thanks for the quick reply Ralph,

Michael

On Tue, 24 Aug 2010, Ralph Castain wrote:

| Date: Tue, 24 Aug 2010 13:16:10 -0600
| From: Ralph Castain
| To: Michael E.Thomadakis
| Cc: Open MPI Users, mi...@sc.tamu.edu
| Subject: Re: [OMPI users] Open-MPI 1.4.2 : mpirun core-dumps when
| "-npernode N" is used at command line
|
| Ummm... the configure log terminates normally, indicating it configured fine. 
The make log ends, but with no error shown - everything was building just fine.
|
| Did you maybe stop it before it was complete? Run out of disk quota? Or...?
|
|
| On Aug 24, 2010, at 1:06 PM, Michael E. Thomadakis wrote:
|
|>   Hi Ralph,
|>
|>   I tried to build 1.4.3.a1r23542 (08/02/2010) with
|>
|>   ./configure --prefix="/g/software/openmpi-1.4.3a1r23542/gcc-4.1.2 2" --enable-cxx-exceptions  
CFLAGS="-O2" CXXFLAGS="-O2"  FFLAGS="-O2" FCFLAGS="-O2"
|>   with the GCC 4.1.2
|>
|>   miket@login002[pts/26]openmpi-1.4.3a1r23542 $ gcc -v
|>   Using built-in specs.
|>   Target: x86_64-redhat-linux
|>   Configured with: ../configure --prefix=/usr --mandir=/usr/share/man 
--infodir=/usr/share/info --enable-shared --enable-threads=posix 
--enable-checking=release --with-system-zlib --enable-__cxa_atexit 
--disable-libunwind-exceptions --enable-libgcj-multifile 
--enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk 
--disable-dssi --enable-plugin 
--with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic 
--host=x86_64-redhat-linux
|>   Thread model: posix
|>   gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)
|>
|>
|>   but it failed. I am attaching the configure and make logs.
|>
|>   regards
|>
|>   Michael
|>
|>
|>   On 08/23/10 20:53, Ralph Castain wrote:
|>>
|>>   Nope - none of them will work with 1.4.2. Sorry - bug not discovered 
until after release
|>>
|>>   On Aug 23, 2010, at 7:45 PM, Michael E. Thomadakis wrote:
|>>
|>>>   Hi Jeff,
|>>>   thanks for the quick reply.
|>>>
|>>>   Would using '--cpus-per-proc N' in place of '-npernode N' or just 
'-bynode' do the trick?
|>>>
|>>>   It seems that using '--loadbalance' also crashes mpirun.
|>>>
|>>>   best ...
|>>>
|>>>   Michael
|>>>
|>>>
|>>>   On 08/23/10 19:30, Jeff 

Re: [OMPI users] Open-MPI 1.4.2 : mpirun core-dumps when "-npernode N" is used at command line

2010-08-24 Thread Jeff Squyres
I'm a little confused by your configure line:

./configure --prefix=/g/software/openmpi-1.4.3a1r23542/gcc-4.1.2 2 
--enable-cxx-exceptions CFLAGS=-O2 CXXFLAGS=-O2 FFLAGS=-O2 FCFLAGS=-O2

What's the lone "2" in the middle (after the prefix)?

With that extra "2", I'm not able to get configure to complete successfully 
(because it interprets that "2" as a platform name that does not exist).  If I 
remove that "2", configure completes properly and the build completes properly.

I'm afraid I no longer have any RH hosts to test on.  Can you do the following:

cd top_of_build_dir
cd ompi/debuggers
rm ompi_debuggers.lo
make

Then copy-n-paste the gcc command used to compile the ompi_debuggers.o file, 
remove "-o .libs/libdebuggers_la-ompi_debuggers.o", and add "-E", and redirect 
the output to a file.  Then send me that file -- it should give more of a clue 
as to exactly what the problem is that you're seeing.
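
In other words, something along these lines (the gcc options shown are only
placeholders for whatever make actually prints, and the output file name is
arbitrary):

  cd top_of_build_dir/ompi/debuggers
  rm ompi_debuggers.lo
  make                                   # copy the "gcc ... -c ompi_debuggers.c" line it prints
  gcc -DHAVE_CONFIG_H -I. -I../.. -O2 -E ompi_debuggers.c > ompi_debuggers.i

and then send along the resulting .i file.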




On Aug 24, 2010, at 3:25 PM, Michael E. Thomadakis wrote:

> 
> On 08/24/10 14:22, Michael E. Thomadakis wrote:
>> Hi,
>> 
>> I used a 'tee' command to capture the output but I forgot to also redirect
>> stderr to the file.
>> 
>> This is what a fresh make gave (gcc 4.1.2 again) :
>> 
>> --
>> ompi_debuggers.c:81: error: missing terminating " character
>> ompi_debuggers.c:81: error: expected expression before \u2018;\u2019 token
>> ompi_debuggers.c: In function \u2018ompi_wait_for_debugger\u2019:
>> ompi_debuggers.c:212: error: \u2018mpidbg_dll_locations\u2019 undeclared
>> (first use in this function)
>> ompi_debuggers.c:212: error: (Each undeclared identifier is reported only 
>> once
>> ompi_debuggers.c:212: error: for each function it appears in.)
>> ompi_debuggers.c:212: warning: passing argument 3 of \u2018check\u2019 from
>> incompatible pointer type
>> make[2]: *** [libdebuggers_la-ompi_debuggers.lo] Error 1
>> make[1]: *** [all-recursive] Error 1
>> make: *** [all-recursive] Error 1
>> 
>> --
>> 
>> Is this critical to run OMPI code?
>> 
>> Thanks for the quick reply Ralph,
>> 
>> Michael
>> 
>> On Tue, 24 Aug 2010, Ralph Castain wrote:
>> 
>> | Date: Tue, 24 Aug 2010 13:16:10 -0600
>> | From: Ralph Castain
>> | To: Michael E.Thomadakis
>> | Cc: Open MPI Users, mi...@sc.tamu.edu
>> | Subject: Re: [OMPI users] Open-MPI 1.4.2 : mpirun core-dumps when
>> | "-npernode N" is used at command line
>> |
>> | Ummm... the configure log terminates normally, indicating it configured 
>> fine. The make log ends, but with no error shown - everything was building 
>> just fine.
>> |
>> | Did you maybe stop it before it was complete? Run out of disk quota? Or...?
>> |
>> |
>> | On Aug 24, 2010, at 1:06 PM, Michael E. Thomadakis wrote:
>> |
>> |>  Hi Ralph,
>> |>
>> |>  I tried to build 1.4.3.a1r23542 (08/02/2010) with
>> |>
>> |>  ./configure --prefix="/g/software/openmpi-1.4.3a1r23542/gcc-4.1.2 2" 
>> --enable-cxx-exceptions  CFLAGS="-O2" CXXFLAGS="-O2"  FFLAGS="-O2" 
>> FCFLAGS="-O2"
>> |>  with the GCC 4.1.2
>> |>
>> |>  miket@login002[pts/26]openmpi-1.4.3a1r23542 $ gcc -v
>> |>  Using built-in specs.
>> |>  Target: x86_64-redhat-linux
>> |>  Configured with: ../configure --prefix=/usr --mandir=/usr/share/man 
>> --infodir=/usr/share/info --enable-shared --enable-threads=posix 
>> --enable-checking=release --with-system-zlib --enable-__cxa_atexit 
>> --disable-libunwind-exceptions --enable-libgcj-multifile 
>> --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk 
>> --disable-dssi --enable-plugin 
>> --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic 
>> --host=x86_64-redhat-linux
>> |>  Thread model: posix
>> |>  gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)
>> |>
>> |>
>> |>  but it failed. I am attaching the configure and make logs.
>> |>
>> |>  regards
>> |>
>> |>  Michael
>> |>
>> |>
>> |>  On 08/23/10 20:53, Ralph Castain wrote:
>> |>>
>> |>>  Nope - none of them will work with 1.4.2. Sorry - bug not discovered 
>> until after release
>> |>>
>> |>>  On Aug 23, 2010, at 7:45 PM, Michael E. Thomadakis wrote:
>> |>>
>> |>>>  Hi Jeff,
>> |>>>  thanks for the quick reply.
>> |>>>
>> |>>>  Would using '--cpus-per-proc N' in place of '-npernode N' or just 
>> '-bynode' do the trick?
>> |>>>
>> |>>>  It seems that using '--loadbalance' also crashes mpirun.
>> |>>>
>> |>>>  best ...
>> |>>>
>> |>>>  Michael
>> |>>>
>> |>>>
>> |>>>  On 08/23/10 19:30, Jeff Squyres wrote:
>> |
>> |  Yes, the -npernode segv is a known issue.
>> |
>> |  We have it fixed in the 1.4.x nightly tarballs; can you give it a 
>> whirl and see if that fixes your problem?
>> |
>> |  http://www.open-mpi.org/nightly/v1.4/
>> |
>> |
>> |
>> |  On Aug 23, 2010, at 8:20 PM, Michael E. Thomadakis wrote:
>> |
>> |>  Hello OMPI:
>> |>
>> |>  We have 

Re: [OMPI users] Open-MPI 1.4.2 : mpirun core-dumps when "-npernode N" is used at command line

2010-08-24 Thread Michael E. Thomadakis

 On 08/24/10 14:22, Michael E. Thomadakis wrote:

Hi,

I used a 'tee' command to capture the output but I forgot to also redirect
stderr to the file.

This is what a fresh make gave (gcc 4.1.2 again) :

--
ompi_debuggers.c:81: error: missing terminating " character
ompi_debuggers.c:81: error: expected expression before \u2018;\u2019 token
ompi_debuggers.c: In function \u2018ompi_wait_for_debugger\u2019:
ompi_debuggers.c:212: error: \u2018mpidbg_dll_locations\u2019 undeclared
(first use in this function)
ompi_debuggers.c:212: error: (Each undeclared identifier is reported only once
ompi_debuggers.c:212: error: for each function it appears in.)
ompi_debuggers.c:212: warning: passing argument 3 of \u2018check\u2019 from
incompatible pointer type
make[2]: *** [libdebuggers_la-ompi_debuggers.lo] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

--

Is this critical to run OMPI code?

Thanks for the quick reply Ralph,

Michael

On Tue, 24 Aug 2010, Ralph Castain wrote:

| Date: Tue, 24 Aug 2010 13:16:10 -0600
| From: Ralph Castain
| To: Michael E.Thomadakis
| Cc: Open MPI Users, mi...@sc.tamu.edu
| Subject: Re: [OMPI users] Open-MPI 1.4.2 : mpirun core-dumps when
| "-npernode N" is used at command line
|
| Ummm... the configure log terminates normally, indicating it configured fine. 
The make log ends, but with no error shown - everything was building just fine.
|
| Did you maybe stop it before it was complete? Run out of disk quota? Or...?
|
|
| On Aug 24, 2010, at 1:06 PM, Michael E. Thomadakis wrote:
|
|>  Hi Ralph,
|>
|>  I tried to build 1.4.3.a1r23542 (08/02/2010) with
|>
|>  ./configure --prefix="/g/software/openmpi-1.4.3a1r23542/gcc-4.1.2 2" --enable-cxx-exceptions  
CFLAGS="-O2" CXXFLAGS="-O2"  FFLAGS="-O2" FCFLAGS="-O2"
|>  with the GCC 4.1.2
|>
|>  miket@login002[pts/26]openmpi-1.4.3a1r23542 $ gcc -v
|>  Using built-in specs.
|>  Target: x86_64-redhat-linux
|>  Configured with: ../configure --prefix=/usr --mandir=/usr/share/man 
--infodir=/usr/share/info --enable-shared --enable-threads=posix 
--enable-checking=release --with-system-zlib --enable-__cxa_atexit 
--disable-libunwind-exceptions --enable-libgcj-multifile 
--enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk 
--disable-dssi --enable-plugin 
--with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic 
--host=x86_64-redhat-linux
|>  Thread model: posix
|>  gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)
|>
|>
|>  but it failed. I am attaching the configure and make logs.
|>
|>  regards
|>
|>  Michael
|>
|>
|>  On 08/23/10 20:53, Ralph Castain wrote:
|>>
|>>  Nope - none of them will work with 1.4.2. Sorry - bug not discovered until 
after release
|>>
|>>  On Aug 23, 2010, at 7:45 PM, Michael E. Thomadakis wrote:
|>>
|>>>  Hi Jeff,
|>>>  thanks for the quick reply.
|>>>
|>>>  Would using '--cpus-per-proc N' in place of '-npernode N' or just 
'-bynode' do the trick?
|>>>
|>>>  It seems that using '--loadbalance' also crashes mpirun.
|>>>
|>>>  best ...
|>>>
|>>>  Michael
|>>>
|>>>
|>>>  On 08/23/10 19:30, Jeff Squyres wrote:
|
|  Yes, the -npernode segv is a known issue.
|
|  We have it fixed in the 1.4.x nightly tarballs; can you give it a whirl 
and see if that fixes your problem?
|
|  http://www.open-mpi.org/nightly/v1.4/
|
|
|
|  On Aug 23, 2010, at 8:20 PM, Michael E. Thomadakis wrote:
|
|>  Hello OMPI:
|>
|>  We have installed OMPI V1.4.2 on a Nehalem cluster running CentOS5.4. 
OMPI was built using Intel compilers 11.1.072. I am attaching the configuration log and output 
from ompi_info -a.
|>
|>  The problem we are encountering is that whenever we use option 
'-npernode N' in the mpirun command line we get a segmentation fault as in below:
|>
|>
|>  miket@login002[pts/7]PS $ mpirun -npernode 1  --display-devel-map  
--tag-output -np 6 -cpus-per-proc 2 -H 'login001,login002,login003' hostname
|>
|>   Map generated by mapping policy: 0402
|>  Npernode: 1 Oversubscribe allowed: TRUE CPU Lists: FALSE
|>  Num new daemons: 2  New daemon starting vpid 1
|>  Num nodes: 3
|>
|>   Data for node: Name: login001  Launch id: -1   Arch: 0 State: 2
|>  Num boards: 1   Num sockets/board: 2Num cores/socket: 4
|>  Daemon: [[44812,0],1]   Daemon launched: False
|>  Num slots: 1Slots in use: 2
|>  Num slots allocated: 1  Max slots: 0
|>  Username on node: NULL
|>  Num procs: 1Next node_rank: 1
|>  Data for proc: [[44812,1],0]
|>  Pid: 0  Local rank: 0   Node rank: 0
|>  State: 0App_context: 0  Slot list: NULL

Re: [OMPI users] Open-MPI 1.4.2 : mpirun core-dumps when "-npernode N" is used at command line

2010-08-24 Thread Ralph Castain
Ummm... the configure log terminates normally, indicating it configured fine. 
The make log ends, but with no error shown - everything was building just fine.

Did you maybe stop it before it was complete? Run out of disk quota? Or...?


On Aug 24, 2010, at 1:06 PM, Michael E. Thomadakis wrote:

> Hi Ralph, 
> 
> I tried to build 1.4.3.a1r23542 (08/02/2010) with
> 
> ./configure --prefix="/g/software/openmpi-1.4.3a1r23542/gcc-4.1.2 2" 
> --enable-cxx-exceptions  CFLAGS="-O2" CXXFLAGS="-O2"  FFLAGS="-O2" 
> FCFLAGS="-O2"
> with the GCC 4.1.2
> 
> miket@login002[pts/26]openmpi-1.4.3a1r23542 $ gcc -v
> Using built-in specs.
> Target: x86_64-redhat-linux
> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man 
> --infodir=/usr/share/info --enable-shared --enable-threads=posix 
> --enable-checking=release --with-system-zlib --enable-__cxa_atexit 
> --disable-libunwind-exceptions --enable-libgcj-multifile 
> --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk 
> --disable-dssi --enable-plugin 
> --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic 
> --host=x86_64-redhat-linux
> Thread model: posix
> gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)
> 
> 
> but it failed. I am attaching the configure and make logs.
> 
> regards
> 
> Michael
> 
> 
> On 08/23/10 20:53, Ralph Castain wrote:
>> 
>> Nope - none of them will work with 1.4.2. Sorry - bug not discovered until 
>> after release
>> 
>> On Aug 23, 2010, at 7:45 PM, Michael E. Thomadakis wrote:
>> 
>>> Hi Jeff, 
>>> thanks for the quick reply. 
>>> 
>>> Would using '--cpus-per-proc N' in place of '-npernode N' or just '-bynode' 
>>> do the trick?
>>> 
>>> It seems that using '--loadbalance' also crashes mpirun.
>>> 
>>> best ...
>>> 
>>> Michael
>>> 
>>> 
>>> On 08/23/10 19:30, Jeff Squyres wrote:
 
 Yes, the -npernode segv is a known issue.
 
 We have it fixed in the 1.4.x nightly tarballs; can you give it a whirl 
 and see if that fixes your problem?
 
 http://www.open-mpi.org/nightly/v1.4/
 
 
 
 On Aug 23, 2010, at 8:20 PM, Michael E. Thomadakis wrote:
 
> Hello OMPI:
> 
> We have installed OMPI V1.4.2 on a Nehalem cluster running CentOS5.4. 
> OMPI was built using Intel compilers 11.1.072. I am attaching the 
> configuration log and output from ompi_info -a.
> 
> The problem we are encountering is that whenever we use option '-npernode 
> N' in the mpirun command line we get a segmentation fault as in below:
> 
> 
> miket@login002[pts/7]PS $ mpirun -npernode 1  --display-devel-map  
> --tag-output -np 6 -cpus-per-proc 2 -H 'login001,login002,login003' 
> hostname
> 
>  Map generated by mapping policy: 0402
> Npernode: 1 Oversubscribe allowed: TRUE CPU Lists: FALSE
> Num new daemons: 2  New daemon starting vpid 1
> Num nodes: 3
> 
>  Data for node: Name: login001  Launch id: -1   Arch: 0 State: 2
> Num boards: 1   Num sockets/board: 2Num cores/socket: 4
> Daemon: [[44812,0],1]   Daemon launched: False
> Num slots: 1Slots in use: 2
> Num slots allocated: 1  Max slots: 0
> Username on node: NULL
> Num procs: 1Next node_rank: 1
> Data for proc: [[44812,1],0]
> Pid: 0  Local rank: 0   Node rank: 0
> State: 0App_context: 0  Slot list: NULL
> 
>  Data for node: Name: login002  Launch id: -1   Arch: ffc91200  
> State: 2
> Num boards: 1   Num sockets/board: 2Num cores/socket: 4
> Daemon: [[44812,0],0]   Daemon launched: True
> Num slots: 1Slots in use: 2
> Num slots allocated: 1  Max slots: 0
> Username on node: NULL
> Num procs: 1Next node_rank: 1
> Data for proc: [[44812,1],0]
> Pid: 0  Local rank: 0   Node rank: 0
> State: 0App_context: 0  Slot list: NULL
> 
>  Data for node: Name: login003  Launch id: -1   Arch: 0 State: 2
> Num boards: 1   Num sockets/board: 2Num cores/socket: 4
> Daemon: [[44812,0],2]   Daemon launched: False
> Num slots: 1Slots in use: 2
> Num slots allocated: 1  Max slots: 0
> Username on node: NULL
> Num procs: 1Next node_rank: 1
> Data for proc: [[44812,1],0]
> Pid: 0  Local rank: 0   Node rank: 0
> State: 0App_context: 0  Slot list: NULL
> [login002:02079] *** Process received signal ***
> [login002:02079] Signal: Segmentation fault (11)
> [login002:02079] Signal code: Address not mapped (1)
> [login002:02079] Failing at address: 0x50
> [login002:02079] [ 0] /lib64/libpthread.so.0 [0x3569a0e7c0]
> [login002:02079] [ 1] 

Re: [OMPI users] Open-MPI 1.4.2 : mpirun core-dumps when "-npernode N" is used at command line

2010-08-24 Thread Michael E. Thomadakis

 Hi Ralph,

I tried to build 1.4.3.a1r23542 (08/02/2010) with

./configure --prefix="/g/software/openmpi-1.4.3a1r23542/gcc-4.1.2 2" 
--enable-cxx-exceptions  CFLAGS="-O2" CXXFLAGS="-O2"  FFLAGS="-O2" 
FCFLAGS="-O2"

with the GCC 4.1.2

miket@login002[pts/26]openmpi-1.4.3a1r23542 $ gcc -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man 
--infodir=/usr/share/info --enable-shared --enable-threads=posix 
--enable-checking=release --with-system-zlib --enable-__cxa_atexit 
--disable-libunwind-exceptions --enable-libgcj-multifile 
--enable-languages=c,c++,objc,obj-c++,java,fortran,ada 
--enable-java-awt=gtk --disable-dssi --enable-plugin 
--with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre 
--with-cpu=generic --host=x86_64-redhat-linux

Thread model: posix
gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)


but it failed. I am attaching the configure and make logs.

regards

Michael


On 08/23/10 20:53, Ralph Castain wrote:
Nope - none of them will work with 1.4.2. Sorry - bug not discovered 
until after release


On Aug 23, 2010, at 7:45 PM, Michael E. Thomadakis wrote:


Hi Jeff,
thanks for the quick reply.

Would using '--cpus-per-proc /N/' in place of '-npernode /N/' or just 
'-bynode' do the trick?


It seems that using '--loadbalance' also crashes mpirun.

best ...

Michael


On 08/23/10 19:30, Jeff Squyres wrote:

Yes, the -npernode segv is a known issue.

We have it fixed in the 1.4.x nightly tarballs; can you give it a whirl and see 
if that fixes your problem?

 http://www.open-mpi.org/nightly/v1.4/



On Aug 23, 2010, at 8:20 PM, Michael E. Thomadakis wrote:


Hello OMPI:

We have installed OMPI V1.4.2 on a Nehalem cluster running CentOS5.4. OMPI was 
built using Intel compilers 11.1.072. I am attaching the configuration log and 
output from ompi_info -a.

The problem we are encountering is that whenever we use option '-npernode N' in 
the mpirun command line we get a segmentation fault as in below:


miket@login002[pts/7]PS $ mpirun -npernode 1  --display-devel-map  --tag-output 
-np 6 -cpus-per-proc 2 -H 'login001,login002,login003' hostname

  Map generated by mapping policy: 0402
 Npernode: 1 Oversubscribe allowed: TRUE CPU Lists: FALSE
 Num new daemons: 2  New daemon starting vpid 1
 Num nodes: 3

  Data for node: Name: login001  Launch id: -1   Arch: 0 State: 2
 Num boards: 1   Num sockets/board: 2Num cores/socket: 4
 Daemon: [[44812,0],1]   Daemon launched: False
 Num slots: 1Slots in use: 2
 Num slots allocated: 1  Max slots: 0
 Username on node: NULL
 Num procs: 1Next node_rank: 1
 Data for proc: [[44812,1],0]
 Pid: 0  Local rank: 0   Node rank: 0
 State: 0App_context: 0  Slot list: NULL

  Data for node: Name: login002  Launch id: -1   Arch: ffc91200  State: 
2
 Num boards: 1   Num sockets/board: 2Num cores/socket: 4
 Daemon: [[44812,0],0]   Daemon launched: True
 Num slots: 1Slots in use: 2
 Num slots allocated: 1  Max slots: 0
 Username on node: NULL
 Num procs: 1Next node_rank: 1
 Data for proc: [[44812,1],0]
 Pid: 0  Local rank: 0   Node rank: 0
 State: 0App_context: 0  Slot list: NULL

  Data for node: Name: login003  Launch id: -1   Arch: 0 State: 2
 Num boards: 1   Num sockets/board: 2Num cores/socket: 4
 Daemon: [[44812,0],2]   Daemon launched: False
 Num slots: 1Slots in use: 2
 Num slots allocated: 1  Max slots: 0
 Username on node: NULL
 Num procs: 1Next node_rank: 1
 Data for proc: [[44812,1],0]
 Pid: 0  Local rank: 0   Node rank: 0
 State: 0App_context: 0  Slot list: NULL
[login002:02079] *** Process received signal ***
[login002:02079] Signal: Segmentation fault (11)
[login002:02079] Signal code: Address not mapped (1)
[login002:02079] Failing at address: 0x50
[login002:02079] [ 0] /lib64/libpthread.so.0 [0x3569a0e7c0]
[login002:02079] [ 1] 
/g/software/openmpi-1.4.2/intel/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xa7)
 [0x2afa70d25de7]
[login002:02079] [ 2] 
/g/software/openmpi-1.4.2/intel/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x3b8)
 [0x2afa70d36088]
[login002:02079] [ 3] 
/g/software/openmpi-1.4.2/intel/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0xd7)
 [0x2afa70d37fc7]
[login002:02079] [ 4] 
/g/software/openmpi-1.4.2/intel/lib/openmpi/mca_plm_rsh.so [0x2afa721085a1]
[login002:02079] [ 5] mpirun [0x404c27]
[login002:02079] [ 6] mpirun [0x403e38]
[login002:02079] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3568e1d994]
[login002:02079] [ 8] mpirun [0x403d69]
[login002:02079] *** End of error message ***
Segmentation fault

We tried version 1.4.1 and this problem did not emerge.

This 

Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-24 Thread Rahul Nabar
On Mon, Aug 23, 2010 at 8:39 PM, Randolph Pullen
 wrote:
>
> I have had a similar load related problem with Bcast.

Thanks Randolph! That's interesting to know! What was the hardware you
were using? Does your bcast fail at the exact same point too?

>
> I don't know what caused it though.  With this one, what about the 
> possibility of a buffer overrun or network saturation?

How can I test for a buffer overrun?

For network saturation I guess I could use something like mrtg to
monitor the bandwidth used. On the other hand, all 32 servers are
connected to a single dedicated Nexus5000. The back-plane carries no
other traffic. Hence I am skeptical that just 41,943,040 bytes saturated what
Cisco rates as a 10GigE fabric. But I might be wrong.
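
One thing I could watch in the meantime is the NIC counters on a few nodes
while the test runs; something along these lines (the interface name is a
placeholder, since the ordering varies across our nodes):

  watch -n 1 'ifconfig eth2 | grep -E "errors|dropped|overruns"'
  ethtool -S eth2 | grep -i drop      # per-driver drop counters, where the driver supports them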

-- 
Rahul



Re: [OMPI users] IMB-MPI broadcast test stalls for large core counts: debug ideas?

2010-08-24 Thread Rahul Nabar
On Mon, Aug 23, 2010 at 6:39 PM, Richard Treumann  wrote:
> It is hard to imagine how a total data load of 41,943,040 bytes could be a
> problem. That is really not much data. By the time the BCAST is done, each
> task (except root) will have received a single half meg message from one
> sender. That is not much.

Thanks very much for your comments Dick! I'm somewhat new to MPI so I
appreciate all the advice I can get. My main roadblock is that I'm not sure
how to attack this problem further. How can I obtain more diagnostic
output to help me trace what the origin of this "broadcast stall" is?
So far I've obtained a stack trace via padb (
http://dl.dropbox.com/u/118481/padb.log.new.new.txt ) but that is
about all.

Any suggestions as to what else I could try? Would a full dump by
something like tcpdump or wireshark on the packets passing the network
be of any relevance? Or is there something useful to be known from the
switch side? The technology is fairly new for HPC (Chelsio 10GigE
adapters + Cisco Nexus5000 switches). So I wouldn't rule out some
strange hardware or firmware bug that's tickled by this particular
suite of tests.   I'm grasping at straws here.
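
If a capture would help, I imagine it would be something along these lines on
one of the stalling nodes (interface and file names are placeholders):

  tcpdump -i eth2 -s 128 -w bcast_stall.pcap tcp   # capture during the stall, inspect later in wireshark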

 [ On the other hand I'm fairly new so I wouldn't rule out some silly
setting by me as well. ]

-- 
Rahul


Re: [OMPI users] MPI process dies with a route error when using dynamic process calls to connect more than 2 clients to a server with InfiniBand

2010-08-24 Thread Ralph Castain
Yes, that's fine. Thx!

On Aug 24, 2010, at 9:02 AM, Philippe wrote:

> awesome, I'll give it a spin! with the parameters as below?
> 
> p.
> 
> On Tue, Aug 24, 2010 at 10:47 AM, Ralph Castain  wrote:
>> I think I have this working now - try anything on or after r23647
>> 
>> 
>> On Aug 23, 2010, at 1:36 PM, Philippe wrote:
>> 
>>> sure. I took a guess at ppn and nodes for the case where 2 processes
>>> are on the same node... I dont claim these are the right values ;-)
>>> 
>>> 
>>> 
>>> c0301b10e1 ~/mpi> env|grep OMPI
>>> OMPI_MCA_orte_nodes=c0301b10e1
>>> OMPI_MCA_orte_rank=0
>>> OMPI_MCA_orte_ppn=2
>>> OMPI_MCA_orte_num_procs=2
>>> OMPI_MCA_oob_tcp_static_ports_v6=1-11000
>>> OMPI_MCA_ess=generic
>>> OMPI_MCA_orte_jobid=
>>> OMPI_MCA_oob_tcp_static_ports=1-11000
>>> c0301b10e1 ~/hpa/benchmark/mpi> ./ben1 1 1 1
>>> [c0301b10e1:22827] [[0,],0] assigned port 10001
>>> [c0301b10e1:22827] [[0,],0] accepting connections via event library
>>> minsize=1 maxsize=1 delay=1.00
>>> 
>>> 
>>> 
>>> 
>>> c0301b10e1 ~/mpi> env|grep OMPI
>>> OMPI_MCA_orte_nodes=c0301b10e1
>>> OMPI_MCA_orte_rank=1
>>> OMPI_MCA_orte_ppn=2
>>> OMPI_MCA_orte_num_procs=2
>>> OMPI_MCA_oob_tcp_static_ports_v6=1-11000
>>> OMPI_MCA_ess=generic
>>> OMPI_MCA_orte_jobid=
>>> OMPI_MCA_oob_tcp_static_ports=1-11000
>>> c0301b10e1 ~/hpa/benchmark/mpi> ./ben1 1 1 1
>>> [c0301b10e1:22830] [[0,],1] assigned port 10002
>>> [c0301b10e1:22830] [[0,],1] accepting connections via event library
>>> [c0301b10e1:22830] [[0,],1]-[[0,0],0] mca_oob_tcp_send_nb: tag 15 size 
>>> 189
>>> [c0301b10e1:22830] [[0,],1]-[[0,0],0]
>>> mca_oob_tcp_peer_try_connect: connecting port 10002 to:
>>> 10.4.72.110:1
>>> [c0301b10e1:22830] [[0,],1]-[[0,0],0]
>>> mca_oob_tcp_peer_complete_connect: connection failed: Connection
>>> refused (111) - retrying
>>> [c0301b10e1:22830] [[0,],1]-[[0,0],0]
>>> mca_oob_tcp_peer_try_connect: connecting port 10002 to:
>>> 10.4.72.110:1
>>> [c0301b10e1:22830] [[0,],1]-[[0,0],0]
>>> mca_oob_tcp_peer_complete_connect: connection failed: Connection
>>> refused (111) - retrying
>>> [c0301b10e1:22830] [[0,],1]-[[0,0],0]
>>> mca_oob_tcp_peer_try_connect: connecting port 10002 to:
>>> 10.4.72.110:1
>>> [c0301b10e1:22830] [[0,],1]-[[0,0],0]
>>> mca_oob_tcp_peer_complete_connect: connection failed: Connection
>>> refused (111) - retrying
>>> 
>>> 
>>> 
>>> 
>>> Thanks!
>>> p.
>>> 
>>> 
>>> On Mon, Aug 23, 2010 at 3:24 PM, Ralph Castain  wrote:
 Can you send me the values you are using for the relevant envars? That way 
 I can try to replicate here
 
 
 On Aug 23, 2010, at 1:15 PM, Philippe wrote:
 
> I took a look at the code but I'm afraid I dont see anything wrong.
> 
> p.
> 
> On Thu, Aug 19, 2010 at 2:32 PM, Ralph Castain  wrote:
>> Yes, that is correct - we reserve the first port in the range for a 
>> daemon,
>> should one exist.
>> The problem is clearly that get_node_rank is returning the wrong value 
>> for
>> the second process (your rank=1). If you want to dig deeper, look at the
>> orte/mca/ess/generic code where it generates the nidmap and pidmap. 
>> There is
>> a bug down there somewhere that gives the wrong answer when ppn > 1.
>> 
>> 
>> On Thu, Aug 19, 2010 at 12:12 PM, Philippe  wrote:
>>> 
>>> Ralph,
>>> 
>>> somewhere in ./orte/mca/oob/tcp/oob_tcp.c, there is this comment:
>>> 
>>>orte_node_rank_t nrank;
>>>/* do I know my node_local_rank yet? */
>>>if (ORTE_NODE_RANK_INVALID != (nrank =
>>> orte_ess.get_node_rank(ORTE_PROC_MY_NAME)) &&
>>>(nrank+1) <
>>> opal_argv_count(mca_oob_tcp_component.tcp4_static_ports)) {
>>>/* any daemon takes the first entry, so we start
>>> with the second */
>>> 
>>> which seems consistent with process #0 listening on 10001. The question
>>> would be why process #1 attempt to connect to port 1 then? or
>>> maybe totally unrelated :-)
>>> 
>>> btw, if I trick process #1 to open the connection to 10001 by shifting
>>> the range, I now get this error and the process terminate immediately:
>>> 
>>> [c0301b10e1:03919] [[0,],1]-[[0,0],0]
>>> mca_oob_tcp_peer_recv_connect_ack: received unexpected process
>>> identifier [[0,],0]
>>> 
>>> good luck with the surgery and wishing you a prompt recovery!
>>> 
>>> p.
>>> 
>>> On Thu, Aug 19, 2010 at 2:02 PM, Ralph Castain  
>>> wrote:
 Something doesn't look right - here is what the algo attempts to do:
 given a port range of 1-12000, the lowest rank'd process on the 
 node
 should open port 1. The next lowest rank on the node will open

Re: [OMPI users] OpenMPI with BLCR runtime problem

2010-08-24 Thread Joshua Hursey

On Aug 24, 2010, at 10:27 AM, 陈文浩 wrote:

> Dear OMPI users,
>  
> I configured and installed OpenMPI-1.4.2 and BLCR-0.8.2. (blade01 - blade10, 
> nfs)
> BLCR configure script: ./configure --prefix=/opt/blcr --enable-static
> After the installation, I can see the ‘blcr’ module loaded correctly (lsmod | 
> grep blcr). And I can also run ‘cr_run’, ‘cr_checkpoint’, ‘cr_restart’ to C/R 
> the examples correctly under /blcr/examples/.
> Then, OMPI configure script is: ./configure --prefix=/opt/ompi --with-ft=cr 
> --with-blcr=/opt/blcr --enable-ft-thread --enable-mpi-threads --enable-static
> The installation is okay too.
>  
> Then here comes the problem.
> On one node:
>  mpirun -np 2 ./hello_c.c
>  mpirun -np 2 -am ft-enable-cr ./hello_c.c
>  are both okay.
> On two nodes(blade01, blade02):
>  mpirun -np 2 -machinefile mf ./hello_c.c  OK.
> mpirun -np 2 -machinefile mf -am ft-enable-cr ./hello_c.c ERROR. Listed 
> below:
>  
> *** An error occurred in MPI_Init 
> *** before MPI was initialized 
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort) 
> [blade02:28896] Abort before MPI_INIT completed successfully; not able to 
> guarantee that all other processes were killed! 
> -- 
> It looks like opal_init failed for some reason; your parallel process is 
> likely to abort. There are many reasons that a parallel process can 
> fail during opal_init; some of which are due to configuration or 
> environment problems. This failure appears to be an internal failure; 
> here's some additional information (which may only be relevant to an 
> Open MPI developer):
>   opal_cr_init() failed failed 
>   --> Returned value -1 instead of OPAL_SUCCESS 
> -- 
> [blade02:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file 
> runtime/orte_init.c at line 77 
> -- 
> It looks like MPI_INIT failed for some reason; your parallel process is 
> likely to abort. There are many reasons that a parallel process can 
> fail during MPI_INIT; some of which are due to configuration or environment 
> problems. This failure appears to be an internal failure; here's some 
> additional information (which may only be relevant to an Open MPI 
> developer):
>   ompi_mpi_init: orte_init failed 
>   --> Returned "Error" (-1) instead of "Success" (0) 
> --
>  
> I have no idea about the error. Our blades use nfs, does it matter? Can 
> anyone help me solve the problem? I really appreciate it. Thank you.
>  
> btw, similar error like:
> “Oops, cr_init() failed (the initialization call to the BLCR checkpointing 
> system). Abort in despair.
> The crmpi SSI subsystem failed to initialized modules successfully during 
> MPI_INIT. This is a fatal error; I must abort.” occurs when I use LAM/MPI + 
> BLCR.

This seems to indicate that BLCR is not working correctly on one of the compute 
nodes. Did you try some of the BLCR example programs on both of the compute 
nodes? If BLCR's cr_init() fails, then there is not much the MPI library can do 
for you.

I would check the installation of BLCR on all of the compute nodes (blade01 and 
blade02). Make sure the modules are loaded and that the BLCR single process 
examples work on all nodes. I suspect that one of the nodes is having trouble 
initializing the BLCR library.

You may also want to check to make sure prelinking is turned off on all nodes 
as well:
  https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
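
For example, a quick sanity check on each blade might look like this (the
example program name is a placeholder for whichever BLCR example you used
when testing):

  lsmod | grep blcr                          # BLCR kernel modules loaded?
  cr_run ./<some-blcr-example> &             # then cr_checkpoint / cr_restart it by hand
  grep PRELINKING /etc/sysconfig/prelink     # should say PRELINKING=no on RHEL/CentOS-style systems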

If that doesn't work then I would suggest trying the current Open MPI trunk. 
There should not be any problem with using NFS: since this is occurring in 
MPI_Init, it is well before we ever try to use the file system. I also test 
with NFS and local staging on a fairly regular basis, so it shouldn't be a 
problem even when checkpointing/restarting.

-- Josh

>  
> Regards
>  
> whchen
>  
> 


Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey







Re: [OMPI users] MPI process dies with a route error when using dynamic process calls to connect more than 2 clients to a server with InfiniBand

2010-08-24 Thread Ralph Castain
I think I have this working now - try anything on or after r23647


On Aug 23, 2010, at 1:36 PM, Philippe wrote:

> sure. I took a guess at ppn and nodes for the case where 2 processes
> are on the same node... I dont claim these are the right values ;-)
> 
> 
> 
> c0301b10e1 ~/mpi> env|grep OMPI
> OMPI_MCA_orte_nodes=c0301b10e1
> OMPI_MCA_orte_rank=0
> OMPI_MCA_orte_ppn=2
> OMPI_MCA_orte_num_procs=2
> OMPI_MCA_oob_tcp_static_ports_v6=1-11000
> OMPI_MCA_ess=generic
> OMPI_MCA_orte_jobid=
> OMPI_MCA_oob_tcp_static_ports=1-11000
> c0301b10e1 ~/hpa/benchmark/mpi> ./ben1 1 1 1
> [c0301b10e1:22827] [[0,],0] assigned port 10001
> [c0301b10e1:22827] [[0,],0] accepting connections via event library
> minsize=1 maxsize=1 delay=1.00
> 
> 
> 
> 
> c0301b10e1 ~/mpi> env|grep OMPI
> OMPI_MCA_orte_nodes=c0301b10e1
> OMPI_MCA_orte_rank=1
> OMPI_MCA_orte_ppn=2
> OMPI_MCA_orte_num_procs=2
> OMPI_MCA_oob_tcp_static_ports_v6=1-11000
> OMPI_MCA_ess=generic
> OMPI_MCA_orte_jobid=
> OMPI_MCA_oob_tcp_static_ports=1-11000
> c0301b10e1 ~/hpa/benchmark/mpi> ./ben1 1 1 1
> [c0301b10e1:22830] [[0,],1] assigned port 10002
> [c0301b10e1:22830] [[0,],1] accepting connections via event library
> [c0301b10e1:22830] [[0,],1]-[[0,0],0] mca_oob_tcp_send_nb: tag 15 size 189
> [c0301b10e1:22830] [[0,],1]-[[0,0],0]
> mca_oob_tcp_peer_try_connect: connecting port 10002 to:
> 10.4.72.110:1
> [c0301b10e1:22830] [[0,],1]-[[0,0],0]
> mca_oob_tcp_peer_complete_connect: connection failed: Connection
> refused (111) - retrying
> [c0301b10e1:22830] [[0,],1]-[[0,0],0]
> mca_oob_tcp_peer_try_connect: connecting port 10002 to:
> 10.4.72.110:1
> [c0301b10e1:22830] [[0,],1]-[[0,0],0]
> mca_oob_tcp_peer_complete_connect: connection failed: Connection
> refused (111) - retrying
> [c0301b10e1:22830] [[0,],1]-[[0,0],0]
> mca_oob_tcp_peer_try_connect: connecting port 10002 to:
> 10.4.72.110:1
> [c0301b10e1:22830] [[0,],1]-[[0,0],0]
> mca_oob_tcp_peer_complete_connect: connection failed: Connection
> refused (111) - retrying
> 
> 
> 
> 
> Thanks!
> p.
> 
> 
> On Mon, Aug 23, 2010 at 3:24 PM, Ralph Castain  wrote:
>> Can you send me the values you are using for the relevant envars? That way I 
>> can try to replicate here
>> 
>> 
>> On Aug 23, 2010, at 1:15 PM, Philippe wrote:
>> 
>>> I took a look at the code but I'm afraid I dont see anything wrong.
>>> 
>>> p.
>>> 
>>> On Thu, Aug 19, 2010 at 2:32 PM, Ralph Castain  wrote:
 Yes, that is correct - we reserve the first port in the range for a daemon,
 should one exist.
 The problem is clearly that get_node_rank is returning the wrong value for
 the second process (your rank=1). If you want to dig deeper, look at the
 orte/mca/ess/generic code where it generates the nidmap and pidmap. There 
 is
 a bug down there somewhere that gives the wrong answer when ppn > 1.
 
 
 On Thu, Aug 19, 2010 at 12:12 PM, Philippe  wrote:
> 
> Ralph,
> 
> somewhere in ./orte/mca/oob/tcp/oob_tcp.c, there is this comment:
> 
>orte_node_rank_t nrank;
>/* do I know my node_local_rank yet? */
>if (ORTE_NODE_RANK_INVALID != (nrank =
> orte_ess.get_node_rank(ORTE_PROC_MY_NAME)) &&
>(nrank+1) <
> opal_argv_count(mca_oob_tcp_component.tcp4_static_ports)) {
>/* any daemon takes the first entry, so we start
> with the second */
> 
> which seems consistent with process #0 listening on 10001. The question
> would be why process #1 attempt to connect to port 1 then? or
> maybe totally unrelated :-)
> 
> btw, if I trick process #1 to open the connection to 10001 by shifting
> the range, I now get this error and the process terminate immediately:
> 
> [c0301b10e1:03919] [[0,],1]-[[0,0],0]
> mca_oob_tcp_peer_recv_connect_ack: received unexpected process
> identifier [[0,],0]
> 
> good luck with the surgery and wishing you a prompt recovery!
> 
> p.
> 
> On Thu, Aug 19, 2010 at 2:02 PM, Ralph Castain  wrote:
>> Something doesn't look right - here is what the algo attempts to do:
>> given a port range of 1-12000, the lowest rank'd process on the node
>> should open port 1. The next lowest rank on the node will open
>> 10001,
>> etc.
>> So it looks to me like there is some confusion in the local rank algo.
>> I'll
>> have to look at the generic module - must be a bug in it somewhere.
>> This might take a couple of days as I have surgery tomorrow morning, so
>> please forgive the delay.
>> 
>> On Thu, Aug 19, 2010 at 11:13 AM, Philippe 
>> wrote:
>>> 
>>> Ralph,
>>> 
>>> I'm able to use the generic module when the processes are on different

[OMPI users] OpenMPI with BLCR runtime problem

2010-08-24 Thread 陈文浩
Dear OMPI users,

 

I configured and installed OpenMPI-1.4.2 and BLCR-0.8.2. (blade01 -
blade10, nfs)

BLCR configure script: ./configure --prefix=/opt/blcr --enable-static

After the installation, I can see the ‘blcr’ module loaded correctly
(lsmod | grep blcr). And I can also run ‘cr_run’, ‘cr_checkpoint’,
‘cr_restart’ to C/R the examples correctly under /blcr/examples/.

Then, OMPI configure script is: ./configure --prefix=/opt/ompi --with-ft=cr
--with-blcr=/opt/blcr --enable-ft-thread --enable-mpi-threads
--enable-static

The installation is okay too.

 

Then here comes the problem.

On one node:

 mpirun -np 2 ./hello_c.c

 mpirun -np 2 -am ft-enable-cr ./hello_c.c

 are both okay.

On two nodes(blade01, blade02):

 mpirun -np 2 -machinefile mf ./hello_c.c  OK.

mpirun -np 2 -machinefile mf -am ft-enable-cr ./hello_c.c ERROR. Listed
below:

 

*** An error occurred in MPI_Init 
*** before MPI was initialized 
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort) 
[blade02:28896] Abort before MPI_INIT completed successfully; not able to
guarantee that all other processes were killed! 
-- 
It looks like opal_init failed for some reason; your parallel process is 
likely to abort. There are many reasons that a parallel process can 
fail during opal_init; some of which are due to configuration or 
environment problems. This failure appears to be an internal failure; 
here's some additional information (which may only be relevant to an 
Open MPI developer): 

  opal_cr_init() failed failed 
  --> Returned value -1 instead of OPAL_SUCCESS 
-- 
[blade02:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 77 
-- 
It looks like MPI_INIT failed for some reason; your parallel process is 
likely to abort. There are many reasons that a parallel process can 
fail during MPI_INIT; some of which are due to configuration or environment 
problems. This failure appears to be an internal failure; here's some 
additional information (which may only be relevant to an Open MPI 
developer): 

  ompi_mpi_init: orte_init failed 
  --> Returned "Error" (-1) instead of "Success" (0) 
-- 

 

I have no idea about the error. Our blades use nfs, does it matter? Can
anyone help me solve the problem? I really appreciate it. Thank you.

 

btw, similar error like: 

“Oops, cr_init() failed (the initialization call to the BLCR checkpointing
system). Abort in despair.

The crmpi SSI subsystem failed to initialized modules successfully during
MPI_INIT. This is a fatal error; I must abort.” occurs when I use LAM/MPI +
BLCR.

 

Regards

 

whchen

 



Re: [OMPI users] is there a way to bring to light _all_ configure options in a ready installation?

2010-08-24 Thread Eugene Loh




Terry Dontje wrote:
> Jeff Squyres wrote:
>> You should be able to run "./configure --help" and see a lengthy help message that includes all the command line options to configure.
>>
>> Is that what you're looking for?
>
> No, he wants to know what configure options were used with some
> binaries.

Apparently even what configure options could have been used even if
they weren't actually used.

>> On Aug 24, 2010, at 7:40 AM, Paul Kapinos wrote:
>>
>>> Hello OpenMPI developers,
>>>
>>> I am searching for a way to discover _all_ configure options of an OpenMPI installation.
>>>
>>> Background: in a existing installation, the ompi_info program helps to find out a lot of informations about the installation. So, "ompi_info -c" shows *some* configuration options like CFLAGS, FFLAGS et cetera. Compilation directories often does not survive for long time (or are not shipped at all, e.g. with SunMPI)
>>>
>>> But what about --enable-mpi-threads or --enable-contrib-no-build=vt for example (and all other possible) flags of "configure", how can I see would these flags set or would not?
>>>
>>> In other words: is it possible to get _all_ flags of configure from an "ready" installation in without having the compilation dirs (with configure logs) any more?
>>>
>>> Many thanks
>>>
>>> Paul





Re: [OMPI users] is there a way to bring to light _all_ configure options in a ready installation?

2010-08-24 Thread Paul Kapinos





>> You should be able to run "./configure --help" and see a lengthy help message 
>> that includes all the command line options to configure.
>> Is that what you're looking for?
>
> No, he wants to know what configure options were used with some binaries.

Yes Terry - I want to know what configure options were used for a given 
installation! "./configure --help" helps, but guessing which of all those 
options were actually used in a release is a hard job.
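
For what it's worth, these are the kinds of things I can already pull out of a 
ready installation (illustrative only; the exact labels differ between versions):

  ompi_info -c                        # compilers, CFLAGS/FFLAGS and some build settings
  ompi_info | grep -i configure       # the "Configured by" / "Configured on" lines
  ompi_info --all | grep -i thread    # e.g. whether thread support was compiled in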





--td

On Aug 24, 2010, at 7:40 AM, Paul Kapinos wrote:

  

Hello OpenMPI developers,

I am searching for a way to discover _all_ configure options of an OpenMPI 
installation.

Background: in a existing installation, the ompi_info program helps to find out a lot of 
informations about the installation. So, "ompi_info -c" shows *some* 
configuration options like CFLAGS, FFLAGS et cetera. Compilation directories often does 
not survive for long time (or are not shipped at all, e.g. with SunMPI)

But what about --enable-mpi-threads or --enable-contrib-no-build=vt for example (and all 
other possible) flags of "configure", how can I see would these flags set or 
would not?

In other words: is it possible to get _all_ flags of configure from an "ready" 
installation in without having the compilation dirs (with configure logs) any more?

Many thanks

Paul


--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915




  



--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle * - Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 







--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915




Re: [OMPI users] is there a way to bring to light _all_ configure options in a ready installation?

2010-08-24 Thread Terry Dontje

Jeff Squyres wrote:

You should be able to run "./configure --help" and see a lengthy help message 
that includes all the command line options to configure.

Is that what you're looking for?

  

No, he wants to know what configure options were used with some binaries.

--td

On Aug 24, 2010, at 7:40 AM, Paul Kapinos wrote:

  

Hello OpenMPI developers,

I am searching for a way to discover _all_ configure options of an OpenMPI 
installation.

Background: in a existing installation, the ompi_info program helps to find out a lot of 
informations about the installation. So, "ompi_info -c" shows *some* 
configuration options like CFLAGS, FFLAGS et cetera. Compilation directories often does 
not survive for long time (or are not shipped at all, e.g. with SunMPI)

But what about --enable-mpi-threads or --enable-contrib-no-build=vt for example (and all 
other possible) flags of "configure", how can I see would these flags set or 
would not?

In other words: is it possible to get _all_ flags of configure from an "ready" 
installation in without having the compilation dirs (with configure logs) any more?

Many thanks

Paul


--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915




  



--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle * - Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com 



Re: [OMPI users] is there a way to bring to light _all_ configure options in a ready installation?

2010-08-24 Thread Jeff Squyres
You should be able to run "./configure --help" and see a lengthy help message 
that includes all the command line options to configure.

Is that what you're looking for?


On Aug 24, 2010, at 7:40 AM, Paul Kapinos wrote:

> Hello OpenMPI developers,
> 
> I am searching for a way to discover _all_ configure options of an OpenMPI 
> installation.
> 
> Background: in a existing installation, the ompi_info program helps to find 
> out a lot of informations about the installation. So, "ompi_info -c" shows 
> *some* configuration options like CFLAGS, FFLAGS et cetera. Compilation 
> directories often does not survive for long time (or are not shipped at all, 
> e.g. with SunMPI)
> 
> But what about --enable-mpi-threads or --enable-contrib-no-build=vt for 
> example (and all other possible) flags of "configure", how can I see would 
> these flags set or would not?
> 
> In other words: is it possible to get _all_ flags of configure from an 
> "ready" installation in without having the compilation dirs (with configure 
> logs) any more?
> 
> Many thanks
> 
> Paul
> 
> 
> -- 
> Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
> RWTH Aachen University, Center for Computing and Communication
> Seffenter Weg 23,  D 52074  Aachen (Germany)
> Tel: +49 241/80-24915


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




[OMPI users] is there a way to bring to light _all_ configure options in a ready installation?

2010-08-24 Thread Paul Kapinos

Hello OpenMPI developers,

I am searching for a way to discover _all_ configure options of an 
OpenMPI installation.


Background: in an existing installation, the ompi_info program helps to 
find out a lot of information about the installation. So, "ompi_info 
-c" shows *some* configuration options like CFLAGS, FFLAGS et cetera. 
Compilation directories often do not survive for a long time (or are not 
shipped at all, e.g. with SunMPI).


But what about --enable-mpi-threads or --enable-contrib-no-build=vt, for 
example (and all the other possible flags of "configure")? How can I see 
whether these flags were set or not?


In other words: is it possible to get _all_ the configure flags from a 
"ready" installation without having the compilation dirs (with the 
configure logs) any more?


Many thanks

Paul


--
Dipl.-Inform. Paul Kapinos   -   High Performance Computing,
RWTH Aachen University, Center for Computing and Communication
Seffenter Weg 23,  D 52074  Aachen (Germany)
Tel: +49 241/80-24915

