Re: [OMPI users] Program deadlocks, on simple send/recv loop

2009-12-22 Thread Eugene Loh

vasilis gkanis wrote:

I had a similar problem with the Portland Fortran compiler. I knew that this
was not caused by a network problem (I run the code on a single node with 4
CPUs). After testing pretty much everything, I decided to change the compiler.
I used the Intel Fortran compiler and everything is running fine.
It could be PGI compiler voodoo :)
 

There were some thoughts on this e-mail thread that the problem could be 
related to trac ticket 2043.  Note that there has been progress on this 
ticket.  See https://svn.open-mpi.org/trac/ompi/ticket/2043#comment:18 
.  The shared-memory (on-node) communications were subject to race 
conditions that could be exposed by optimizing compilers.  Some signals 
could have gotten lost in inter-process communications, quite possibly 
leading to hangs.


If you think you got bitten by this bug, please try the revisions 
mentioned in the trac ticket and report your success (or, alas, failure) 
via the trac ticket or as appropriate.
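
For anyone wanting to test, a rough sketch of checking out the development trunk (the specific revision to try is the one named in the ticket, so a placeholder is used here):

  svn co -r <revision-from-ticket> https://svn.open-mpi.org/svn/ompi/trunk ompi-trunk

Build and install the checkout as usual, then re-run your application against it.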


[OMPI users] VampirTrace: time not increasing

2009-12-22 Thread Roman Cheplyaka
I got the following problem while trying to run vt-enabled HPL benchmark on a
single 8-core Linux node.

OTF ERROR in function OTF_WBuffer_setTimeAndProcess, file: OTF_WBuffer.c, line: 
308:
 time not increasing. (t= 2995392288, p= 2)
vtunify: Error: Could not read events of OTF stream [namestub ./a__ufy.tmp id 1]
vtunify: An error occurred during unifying events - Terminating ...

Sometimes, instead of the above message, I get this:

vtunify: vt_unify_events_hdlr.cc:37: int Handle_Enter(OTF_WStream*, uint64_t, 
uint32_t, uint32_t, uint32_t): Assertion `global_statetoken != 0' failed.

The program is automatically instrumented (all I did was change mpicc to
mpicc-vt), compiled, and run with the latest SVN version of Open MPI, using the command:

mpirun -np 8 --mca btl self,sm hpl

The same problem occurs with the latest release version.

When I run it with fewer processes, it works fine.

Any ideas?

-- 
Roman I. Cheplyaka


Re: [OMPI users] MTT -trivial :All tests are not getting passed

2009-12-22 Thread Ethan Mallove
Hi Vishal,

This is an MTT question for mtt-us...@open-mpi.org (see comments
below).

On Tue, Dec/22/2009 03:54:08PM, vishal shorghar wrote:
>Hi All,
> 
>I have an issue with the MTT trivial tests: not all tests are passing.
>Please read below for a detailed description.
> 
>Today I ran the MTT trivial tests with the latest OFED package
>OFED-1.5-20091217-0600 (ompi-1.4) between two machines. I was able to run
>the MTT trivial tests manually, but not through the MTT framework. I think
>we are missing some configuration step, since MTT is unable to find the
>test executables in the test run phase.
> 
>-> When we ran it through MTT, it gave us an error and exited.
>I ran the test as  "cat developer.ini trivial.ini | ../client/mtt
>--verbose - "
> 
>-> When we analyzed the error from the
>/root/mtt-svn/samples/Test_Run-trivial-my_installation-1.4.txt file, we
>found that MTT cannot find the executable files of the different tests to
>execute.
> 
>-> Then we found that those executables were being generated on only one
>of the two machines, so we manually copied the tests from
>/root/mtt-svn/samples/installs/nRpF/tests/trivial/test_get__trivial/c_ring
>to the other machine.
> 
>-> We then ran it manually as shown below and it worked fine:
>mpirun --host 102.77.77.64,102.77.77.68 -np 2 --mca btl openib,sm,self
>--prefix
>
> /usr/mpi/gcc/openmpi-1.4/root/mtt-svn/samples/installs/nRpF/tests/trivial/test_get__trivial/c_ring
> 
>-> I am attaching the files trivial.ini, developer.ini, and
>/root/mtt-svn/samples/Test_Run-trivial-my_installation-1.4.txt.
> 
>Let us know if I am missing any configuration steps.
> 

You need to set your scratch directory (via the --scratch option) to
an NFS share that is accessible to all nodes in your hostlist.  MTT
won't copy local files onto each node for you.
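
For example, a minimal sketch of such an invocation (the NFS path below is a
placeholder; substitute a directory that is mounted on both of your nodes):

  cat developer.ini trivial.ini | ../client/mtt --verbose --scratch /nfs/mtt-scratch -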

Regards,
Ethan


>NOTE:
>
>It gave me the following output at the end of the test run; the same output
>is saved in /root/mtt-svn/samples/All_phase-summary.txt
> 
>hostname: nizam
>uname: Linux nizam 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009
>x86_64 x86_64 x86_64 GNU/Linux
>who am i:
> 
>
> +-------------+-----------------+-------------+----------+------+------+----------+------+-------------------------------------------------------+
> | Phase       | Section         | MPI Version | Duration | Pass | Fail | Time out | Skip | Detailed report                                       |
> +-------------+-----------------+-------------+----------+------+------+----------+------+-------------------------------------------------------+
> | MPI Install | my installation | 1.4         | 00:00    | 1    |      |          |      | MPI_Install-my_installation-my_installation-1.4.html  |
> | Test Build  | trivial         | 1.4         | 00:01    | 1    |      |          |      | Test_Build-trivial-my_installation-1.4.html           |
> | Test Run    | trivial         | 1.4         | 00:10    |      | 8    |          |      | Test_Run-trivial-my_installation-1.4.html             |
> +-------------+-----------------+-------------+----------+------+------+----------+------+-------------------------------------------------------+
> 
>Total Tests:10
>Total Failures: 8
>Total Passed:   2
>Total Duration: 11 secs. (00:11)
> 
>Thanks  & Regards,
> 
>Vishal shorghar
>MTS
>Chelsio Communication

> #
> # Copyright (c) 2007 Sun Microsystems, Inc.  All rights reserved.
> #
> 
> # Template MTT configuration file for Open MPI developers.  The intent
> # for this template file is to establish at least some loose
> # guidelines for what Open MPI core developers should be running
> # before committing changes to the OMPI repository. This file is not
> # intended to be an exhaustive sample of all possible fields and
> # values that MTT offers. Each developer will undoubtedly have to
> # edit this template for their own needs (e.g., pick compilers to use,
> # etc.), but this file provides a baseline set of configurations that
> # we intend for you to run.
> #
> # Sample usage:
> #   cat developer.ini intel.ini   | client/mtt - 
> alreadyinstalled_dir=/your/install
> #   cat developer.ini trivial.ini | client/mtt - 
> alreadyinstalled_dir=/your/install
> #
> 
> [MTT]
> # No overrides to defaults
> 
> # Fill this field in
> 
> #hostlist = 102.77.77.63 102.77.77.54 102.77.77.64 102.77.77.68 
> #hostlist = 102.77.77.66 102.77.77.68 102.77.77.63 102.77.77.64 102.77.77.53 
> 102.77.77.54 102.77.77.243 102.77.77.65
> hostlist = 102.77.77.64 102.77.77.68 
> hostlist_max_np = 2 
> max_np = 2
> force = 1
> #prefix = /usr/mpi/gcc/openmpi-1.3.4/bin
> 
> #--
> 
> [MPI Details: Open MPI]
> 
> exec = mpirun @hosts@ -np &test_np() @mca@ --prefix &test_prefix() 
> &test_executable() &test_argv()

Re: [OMPI users] Torque 2.4.3 fails with OpenMPI 1.3.4; no startup at all

2009-12-22 Thread Johann Knechtel
Hi Ralph,

Somehow I did not receive your last answer as mail, so I am replying to myself...
Thanks for the explanation. I thought that the prefix issue would be
handled by the OMPI configure parameter
"--enable-mpirun-prefix-by-default". But now I see your point. Anyway, I
did not find any further information regarding that issue in the Torque
FAQ, and since the rsh launcher works I will stick to that and not
spend more time experimenting with Torque... Thanks again for your help!

Greetings
Johann
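
(For reference, forcing the rsh launcher as discussed above only requires the
MCA parameter on the mpirun command line; the node names and executable below
are placeholders:

  mpirun -mca plm rsh -np 4 --host node1,node2 ./my_app

This bypasses the tm launcher that would otherwise be auto-selected under Torque.)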


Johann Knechtel schrieb:
> Ralph, thank you very much for your input! The parameter "mca plm rsh"
> did it. I am just curious about the reasons for that behavior?
> You can find the complete output of the different commands embedded in
> your mail below. The first line states the successful load of the OMPI
> environment, we use the modules package on our cluster.
> 
> Greetings
> Johann
> 
> 
> Ralph Castain schrieb:
>> Sorry - hit "send" and then saw the version sitting right there in the 
>> subject! Doh...
>>
>> First, let's try verifying what components are actually getting used. Run 
>> this:
>>
>> mpirun -n 1 -mca ras_base_verbose 10 -mca plm_base_verbose 10 which orted
>>   
>  OpenMPI with PPU-GCC was loaded
> [node1:00706] mca: base: components_open: Looking for plm components
> [node1:00706] mca: base: components_open: opening plm components
> [node1:00706] mca: base: components_open: found loaded component rsh
> [node1:00706] mca: base: components_open: component rsh has no register
> function
> [node1:00706] mca: base: components_open: component rsh open function
> successful
> [node1:00706] mca: base: components_open: found loaded component slurm
> [node1:00706] mca: base: components_open: component slurm has no
> register function
> [node1:00706] mca: base: components_open: component slurm open function
> successful
> [node1:00706] mca: base: components_open: found loaded component tm
> [node1:00706] mca: base: components_open: component tm has no register
> function
> [node1:00706] mca: base: components_open: component tm open function
> successful
> [node1:00706] mca:base:select: Auto-selecting plm components
> [node1:00706] mca:base:select:(  plm) Querying component [rsh]
> [node1:00706] mca:base:select:(  plm) Query of component [rsh] set
> priority to 10
> [node1:00706] mca:base:select:(  plm) Querying component [slurm]
> [node1:00706] mca:base:select:(  plm) Skipping component [slurm]. Query
> failed to return a module
> [node1:00706] mca:base:select:(  plm) Querying component [tm]
> [node1:00706] mca:base:select:(  plm) Query of component [tm] set
> priority to 75
> [node1:00706] mca:base:select:(  plm) Selected component [tm]
> [node1:00706] mca: base: close: component rsh closed
> [node1:00706] mca: base: close: unloading component rsh
> [node1:00706] mca: base: close: component slurm closed
> [node1:00706] mca: base: close: unloading component slurm
> [node1:00706] mca: base: components_open: Looking for ras components
> [node1:00706] mca: base: components_open: opening ras components
> [node1:00706] mca: base: components_open: found loaded component slurm
> [node1:00706] mca: base: components_open: component slurm has no
> register function
> [node1:00706] mca: base: components_open: component slurm open function
> successful
> [node1:00706] mca: base: components_open: found loaded component tm
> [node1:00706] mca: base: components_open: component tm has no register
> function
> [node1:00706] mca: base: components_open: component tm open function
> successful
> [node1:00706] mca:base:select: Auto-selecting ras components
> [node1:00706] mca:base:select:(  ras) Querying component [slurm]
> [node1:00706] mca:base:select:(  ras) Skipping component [slurm]. Query
> failed to return a module
> [node1:00706] mca:base:select:(  ras) Querying component [tm]
> [node1:00706] mca:base:select:(  ras) Query of component [tm] set
> priority to 100
> [node1:00706] mca:base:select:(  ras) Selected component [tm]
> [node1:00706] mca: base: close: unloading component slurm
> /opt/openmpi_1.3.4_gcc_ppc/bin/orted
> [node1:00706] mca: base: close: unloading component tm
> [node1:00706] mca: base: close: component tm closed
> [node1:00706] mca: base: close: unloading component tm
> 
>> Then get an allocation and run
>>
>> mpirun -pernode which orted
>>   
>  OpenMPI with PPU-GCC was loaded
> --
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> ---

Re: [OMPI users] Is OpenMPI's orted = MPICH2's smpd?

2009-12-22 Thread Ashley Pittman
On Tue, 2009-12-22 at 09:59 +0530, Sangamesh B wrote:
> Hi,
> 
> MPICh2 has different process managers: MPD, SMPD, GFORKER etc.

It also has Hydra.

>  Is the Open MPI's startup daemon orted similar to MPICH2's smpd? Or
> something else?

My understanding is that SMPD is for launching on Windows, which isn't
something I know much about.

orte is similar to MPD, although without the requirement that you start
the ring beforehand.

A quick summary of orte: it takes a list of nodes and a process count;
given these, it will start a job of the given size on the given nodes.
No prior configuration or starting of daemons is required.  No effort is
made to prevent multiple jobs from starting on the same nodes, and no
effort is made to maintain a "queue" of jobs waiting for nodes to become
free.  Each job is independent and runs where you tell it to,
immediately.
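
For example (the node names and binary are placeholders), a single command such as

  mpirun --host nodeA,nodeB -np 8 ./my_mpi_app

starts orted daemons on the listed nodes, runs the job, and tears the daemons
down again when it completes; no separate daemon or ring setup step is needed.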

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] Is OpenMPI's orted = MPICH2's smpd?

2009-12-22 Thread Ralph Castain
Afraid I don't know enough about MPICH2's different process managers (and why 
they need more than one) to answer that question.

On Dec 21, 2009, at 9:29 PM, Sangamesh B wrote:

> Hi,
> 
> MPICh2 has different process managers: MPD, SMPD, GFORKER etc. Is the 
> Open MPI's startup daemon orted similar to MPICH2's smpd? Or something else?
> 
> Thanks,
> Sangamesh
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] Memory corruption?

2009-12-22 Thread Roy Dragseth
Hi.

We have started to scale up one of our codes and sometimes we get messages 
like this:

[c9-13.local:31125] Memory 0x2aaab7b64000:217088 cannot be freed from 
the registration cache. Possible memory corruption.

It seems like the application runs normally and does not crash because of
this.  Should we be worried?  We have tested the code with up to 1700 cores,
and the message becomes more frequent as we scale up.

System details:

Rocks 5.2 (aka CentOS 5.3) x86_64
INTEL Compiler 11.1
OFED 1.4.1
OpenMPI 1.3.3

Best regards and Merry Christmas to all,
r.

-- 

  The Computer Center, University of Tromsø, N-9037 TROMSØ Norway.
  phone:+47 77 64 41 07, fax:+47 77 64 41 00
Roy Dragseth, Team Leader, High Performance Computing
 Direct call: +47 77 64 62 56. email: roy.drags...@uit.no



[OMPI users] MTT -trivial :All tests are not getting passed

2009-12-22 Thread vishal shorghar

Hi All,

I have an issue with the MTT trivial tests: not all tests are passing.
Please read below for a detailed description.



Today I ran the MTT trivial tests with the latest OFED package
OFED-1.5-20091217-0600 (ompi-1.4) between two machines. I was able to
run the MTT trivial tests manually, but not through the MTT framework. I
think we are missing some configuration step, since MTT is unable to find
the test executables in the test run phase.


-> When we ran it through MTT, it gave us an error and exited.
I ran the test as  "cat developer.ini trivial.ini | ../client/mtt 
--verbose - "


-> When we analyzed the error from the
/root/mtt-svn/samples/Test_Run-trivial-my_installation-1.4.txt file, we
found that MTT cannot find the executable files of the different tests to
execute.


-> Then we found that those executables were being generated on only one
of the two machines, so we manually copied the tests from
/root/mtt-svn/samples/installs/nRpF/tests/trivial/test_get__trivial/c_ring
to the other machine.


-> We then ran it manually as shown below and it worked fine:
mpirun --host 102.77.77.64,102.77.77.68 -np 2 --mca btl openib,sm,self 
--prefix 
/usr/mpi/gcc/openmpi-1.4/root/mtt-svn/samples/installs/nRpF/tests/trivial/test_get__trivial/c_ring


-> I am attaching the files trivial.ini, developer.ini, and
/root/mtt-svn/samples/Test_Run-trivial-my_installation-1.4.txt.


Let us know if I am missing any configuration steps.

NOTE:

It gave me the following output at the end of the test run; the same output
is saved in /root/mtt-svn/samples/All_phase-summary.txt


hostname: nizam
uname: Linux nizam 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 
x86_64 x86_64 x86_64 GNU/Linux

who am i:

+-------------+-----------------+-------------+----------+------+------+----------+------+-------------------------------------------------------+
| Phase       | Section         | MPI Version | Duration | Pass | Fail | Time out | Skip | Detailed report                                       |
+-------------+-----------------+-------------+----------+------+------+----------+------+-------------------------------------------------------+
| MPI Install | my installation | 1.4         | 00:00    | 1    |      |          |      | MPI_Install-my_installation-my_installation-1.4.html  |
| Test Build  | trivial         | 1.4         | 00:01    | 1    |      |          |      | Test_Build-trivial-my_installation-1.4.html           |
| Test Run    | trivial         | 1.4         | 00:10    |      | 8    |          |      | Test_Run-trivial-my_installation-1.4.html             |
+-------------+-----------------+-------------+----------+------+------+----------+------+-------------------------------------------------------+


   Total Tests:10
   Total Failures: 8
   Total Passed:   2
   Total Duration: 11 secs. (00:11)

Thanks  & Regards,

Vishal shorghar
MTS
Chelsio Communication

#
# Copyright (c) 2007 Sun Microsystems, Inc.  All rights reserved.
#

# Template MTT configuration file for Open MPI developers.  The intent
# for this template file is to establish at least some loose
# guidelines for what Open MPI core developers should be running
# before committing changes to the OMPI repository. This file is not
# intended to be an exhaustive sample of all possible fields and
# values that MTT offers. Each developer will undoubtedly have to
# edit this template for their own needs (e.g., pick compilers to use,
# etc.), but this file provides a baseline set of configurations that
# we intend for you to run.
#
# Sample usage:
#   cat developer.ini intel.ini   | client/mtt - 
alreadyinstalled_dir=/your/install
#   cat developer.ini trivial.ini | client/mtt - 
alreadyinstalled_dir=/your/install
#

[MTT]
# No overrides to defaults

# Fill this field in

#hostlist = 102.77.77.63 102.77.77.54 102.77.77.64 102.77.77.68 
#hostlist = 102.77.77.66 102.77.77.68 102.77.77.63 102.77.77.64 102.77.77.53 
102.77.77.54 102.77.77.243 102.77.77.65
hostlist = 102.77.77.64 102.77.77.68 
hostlist_max_np = 2 
max_np = 2
force = 1
#prefix = /usr/mpi/gcc/openmpi-1.3.4/bin

#--

[MPI Details: Open MPI]

exec = mpirun @hosts@ -np &test_np() @mca@ --prefix &test_prefix() 
&test_executable() &test_argv()

mca = --mca btl openib,sm,self

hosts = <

+--------------+-------+
| Field        | Value |
+--------------+-------+
| description  |       |
| environment  |       |
| exit_signal  | -1