Re: [OMPI users] shared memory (sm) module not working properly?

2010-01-15 Thread Eugene Loh




Dunno.  Do lower np values succeed?  If so, at what value of np does
the job no longer start?

Perhaps it's having a hard time creating the shared-memory backing file
in /tmp.  I think this is a 64-Mbyte file.  If this is the case, try
reducing the size of the shared area per this FAQ item: 
http://www.open-mpi.org/faq/?category=sm#decrease-sm  Most notably,
reduce mpool_sm_min_size below 67108864.
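
For example, a minimal sketch of what that could look like (assuming the
64-MB backing file really is the problem; 16777216 is just an illustrative
value, not a recommendation):

mpirun -np 16 -mca btl self,sm -mca mpool_sm_min_size 16777216 job

or, equivalently, in $HOME/.openmpi/mca-params.conf:

mpool_sm_min_size = 16777216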

Also note trac ticket 2043, which describes problems with the sm BTL
exposed by GCC 4.4.x compilers.  You need to get a sufficiently recent
build to solve this.  But, those problems don't occur until you start
passing messages, and here you're not even starting up.
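
As a hedged aside, a quick way to see which compiler (and path) the installed
Open MPI was built with, which is relevant to the GCC 4.4.x issue above, is
something like:

ompi_info | grep -i compiler    # then run e.g. "gcc --version" on the reported compiler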

Nicolas Bock wrote:
Sorry, I forgot to give more details on what versions I am
using:
  
OpenMPI 1.4
Ubuntu 9.10, kernel 2.6.31-16-generic #53-Ubuntu
gcc (Ubuntu 4.4.1-4ubuntu8) 4.4.1
  
  On Fri, Jan 15, 2010 at 15:47, Nicolas Bock 
wrote:
  Hello
list,

I am running a job on a 4 quadcore AMD Opteron. This machine has 16
cores, which I can verify by looking at /proc/cpuinfo. However, when I
run a job with

mpirun -np 16 -mca btl self,sm job

I get this error:

--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[56972,2],0]) is on host: rust
  Process 2 ([[56972,1],0]) is on host: rust
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--

By adding the tcp btl I can run the job. I don't understand why openmpi
claims that a pair of processes can not reach each other, all processor
cores should have access to all memory after all. Do I need to set some
other btl limit?
  
  





Re: [OMPI users] shared memory (sm) module not working properly?

2010-01-15 Thread Nicolas Bock
Sorry, I forgot to give more details on what versions I am using:

OpenMPI 1.4
Ubuntu 9.10, kernel 2.6.31-16-generic #53-Ubuntu
gcc (Ubuntu 4.4.1-4ubuntu8) 4.4.1



On Fri, Jan 15, 2010 at 15:47, Nicolas Bock  wrote:

> Hello list,
>
> I am running a job on a 4 quadcore AMD Opteron. This machine has 16 cores,
> which I can verify by looking at /proc/cpuinfo. However, when I run a job
> with
>
> mpirun -np 16 -mca btl self,sm job
>
> I get this error:
>
> --
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[56972,2],0]) is on host: rust
>   Process 2 ([[56972,1],0]) is on host: rust
>   BTLs attempted: self sm
>
> Your MPI job is now going to abort; sorry.
> --
>
> By adding the tcp btl I can run the job. I don't understand why openmpi
> claims that a pair of processes can not reach each other, all processor
> cores should have access to all memory after all. Do I need to set some
> other btl limit?
>
> nick
>
>


[OMPI users] shared memory (sm) module not working properly?

2010-01-15 Thread Nicolas Bock
Hello list,

I am running a job on a machine with four quad-core AMD Opterons. This machine
has 16 cores, which I can verify by looking at /proc/cpuinfo. However, when I
run a job with

mpirun -np 16 -mca btl self,sm job

I get this error:

--
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[56972,2],0]) is on host: rust
  Process 2 ([[56972,1],0]) is on host: rust
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
--

By adding the tcp btl I can run the job. I don't understand why Open MPI
claims that a pair of processes cannot reach each other; all processor
cores should have access to all memory, after all. Do I need to set some
other btl limit?

nick


Re: [OMPI users] dynamic rules

2010-01-15 Thread Daniel Spångberg

I tried this and it still crashes with openmpi-1.4. Is it supposed to
work with openmpi-1.4
or do I need to compile openmpi-1.4.1 ?



Terribly sorry, I should have checked my own notes thoroughly before giving
others advice. One needs to give the dynamic rules file location on the
command line:


mpirun -mca coll_tuned_use_dynamic_rules 1 -mca  
coll_tuned_dynamic_rules_filename /home/.openmpi/dynamic_rules_file


That works for me with openmpi 1.4. I have not tried 1.4.1 yet.

Daniel


Re: [OMPI users] Checkpoint/Restart error

2010-01-15 Thread Andreea Costea
It's almost midnight here, so I'm heading home, but I will try it tomorrow.
There were some directories left after "make uninstall". I will give more
details tomorrow.

Thanks Jeff,
Andreea

On Fri, Jan 15, 2010 at 11:30 PM, Jeff Squyres  wrote:

> On Jan 15, 2010, at 8:07 AM, Andreea Costea wrote:
>
> > - I wanted to update to version 1.4.1 and I uninstalled previous version
> like this: make uninstall, and than manually deleted all the left over
> files. the directory where I installed was /usr/local
>
> I'll let Josh answer your CR questions, but I did want to ask about this
> point.  AFAIK, "make uninstall" removes *all* Open MPI files.  For example:
>
> -
> [7:25] $ cd /path/to/my/OMPI/tree
> [7:25] $ make install > /dev/null
> [7:26] $ find /tmp/bogus/ -type f | wc
>646 646   28082
> [7:26] $ make uninstall > /dev/null
> [7:27] $ find /tmp/bogus/ -type f | wc
>  0   0   0
> [7:27] $
> -
>
> I realize that some *directories* are left in $prefix, but there should be
> no *files* left.  Are you seeing something different?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Checkpoint/Restart error

2010-01-15 Thread Jeff Squyres
On Jan 15, 2010, at 8:07 AM, Andreea Costea wrote:

> - I wanted to update to version 1.4.1 and I uninstalled previous version like 
> this: make uninstall, and than manually deleted all the left over files. the 
> directory where I installed was /usr/local

I'll let Josh answer your CR questions, but I did want to ask about this point. 
 AFAIK, "make uninstall" removes *all* Open MPI files.  For example:

-
[7:25] $ cd /path/to/my/OMPI/tree
[7:25] $ make install > /dev/null
[7:26] $ find /tmp/bogus/ -type f | wc
646 646   28082
[7:26] $ make uninstall > /dev/null
[7:27] $ find /tmp/bogus/ -type f | wc
  0   0   0
[7:27] $ 
-

I realize that some *directories* are left in $prefix, but there should be no 
*files* left.  Are you seeing something different?
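
As a quick sanity check (a sketch, assuming the old install went to
/usr/local), something like the following should show whether stale 1.3.3
files or binaries are still being picked up:

find /usr/local/lib /usr/local/bin -name 'libopen-*' -o -name 'libmpi*' -o -name 'ompi*'
which mpirun ompi-checkpoint
ompi_info | head -3    # the reported Open MPI version should be 1.4.1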

-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI users] dynamic rules

2010-01-15 Thread Roman Martonak
>I have done this according to suggestion on this list, until a fix comes
>that makes it possible to change via command line:
>
>To choose bruck for all message sizes / mpi sizes with openmpi-1.4
>
>File $HOME/.openmpi/mca-params.conf (replace /homeX) so it points to
>the correct file:
>coll_tuned_use_dynamic_rules=1
>coll_tuned_dynamic_rules_filename="/home/.openmpi/dynamic_rules_file"
> ...

I tried this and it still crashes with openmpi-1.4. Is it supposed to work
with openmpi-1.4, or do I need to compile openmpi-1.4.1?

Best regards

Roman


Re: [OMPI users] Checkpoint/Restart error

2010-01-15 Thread Andreea Costea
I don't know what else I should try... because it worked on 1.3.3 with
exactly the same steps. I tried installing it both with an active eth
interface and with an inactive one. I am running on a virtual machine with
CentOS as the OS.

Any suggestions?

Thanks,
Andreea

On Fri, Jan 15, 2010 at 9:07 PM, Andreea Costea wrote:

> I tried the new version, that was uploaded today. I still have that error,
> just that now is at line 405 instead of 399.
>
> Maybe if I give more details:
> - I first had OpenMPI version 1.3.3 with BLCR installed: mpirun,
> ompi-checkpoint and ompi-restart worked with that version.
> - I wanted to update to version 1.4.1 and I uninstalled previous version
> like this: make uninstall, and than manually deleted all the left over
> files. the directory where I installed was /usr/local
> - I installed 1.4.1 in the same directory: /usr/locale. paths set
> correctly  to usr/local/bin and /usr/local/lib
> - mpirun works, ompi-checkpoint gives the following error:
> [[35906,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line
> 405
> HNP with PID 7899 Not found!
>
> I would appreciate any help,
> Andreea
>
>
>
> On Fri, Jan 15, 2010 at 1:15 PM, Andreea Costea wrote:
>
>> Hi...
>> still not working. Though I uninstalled OpenMPI with make uninstall and I
>> manually deleted all other files, I still have the same error when
>> checkpointing.
>>
>> Any idea?
>>
>> Thanks,
>> Andreea
>>
>>
>>
>> On Thu, Jan 14, 2010 at 10:38 PM, Joshua Hursey wrote:
>>
>>> On Jan 14, 2010, at 8:20 AM, Andreea Costea wrote:
>>>
>>> > Hi,
>>> >
>>> > I wanted to try the C/R feature in OpenMPI version 1.4.1 that I have
>>> downloaded today. When I want to checkpoint I am having the following error
>>> message:
>>> > [[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at
>>> line 399
>>> > HNP with PID 2337 Not found!
>>>
>>> This looks like an error coming from the 1.3.3 install. In 1.4.1 there is
>>> no error at line 399, in 1.3.3 there is. Check your installation of Open
>>> MPI, I bet you are mixing 1.4.1 and 1.3.3, which can cause unexpected
>>> problems.
>>>
>>> Try a clean installation of 1.4.1 and double check that 1.3.3 is not in
>>> your path/lib_path any longer.
>>>
>>> -- Josh
>>>
>>> >
>>> > I tried the same thing with version 1.3.3 and it works perfectly.
>>> >
>>> > Any idea why?
>>> >
>>> > thanks,
>>> > Andreea
>>> > ___
>>> > users mailing list
>>> > us...@open-mpi.org
>>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>
>>
>>
>


Re: [OMPI users] dynamic rules

2010-01-15 Thread Daniel Spångberg
I have done this according to a suggestion on this list, until a fix comes
that makes it possible to change this via the command line:


To choose bruck for all message sizes / mpi sizes with openmpi-1.4

File $HOME/.openmpi/mca-params.conf (replace /homeX) so it points to  
the correct file:

coll_tuned_use_dynamic_rules=1
coll_tuned_dynamic_rules_filename="/home/.openmpi/dynamic_rules_file"

file $HOME/.openmpi/dynamic_rules_file:
1 # num of collectives
3 # ID = 3 Alltoall collective (ID in coll_tuned.h)
1 # number of com sizes
0 # comm size
1 # number of msg sizes
0 3 0 0 # for message size 0, bruck, topo 0, 0 segmentation
# end of collective rule

Change the number 3 to something else for other algorithms (these can be
found with ompi_info -a, for example):


MCA coll: information "coll_tuned_alltoall_algorithm_count" (value: "4")
          Number of alltoall algorithms available
MCA coll: parameter "coll_tuned_alltoall_algorithm" (current value: "0")
          Which alltoall algorithm is used. Can be locked down to choice of:
          0 ignore, 1 basic linear, 2 pairwise, 3: modified bruck, 4: two proc only.
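
For completeness, the static lock-down that this parameter describes would
look roughly like the sketch below ("./my_app" is just a placeholder; note
that Roman reports this static path crashing under openmpi-1.4 elsewhere in
this thread, so the dynamic rules file above is the safer route):

mpirun -mca coll_tuned_use_dynamic_rules 1 \
       -mca coll_tuned_alltoall_algorithm 3 \
       -np 16 ./my_app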


HTH
Daniel Spångberg



On 2010-01-15 13:54:33, Roman Martonak wrote:

On my machine I need to use dynamic rules to enforce the bruck or  
pairwise

algorithm for alltoall, since unfortunately the default basic linear
algorithm performs quite poorly on my
Infiniband network. Few months ago I noticed that in case of VASP,
however, the use of dynamic
rules via --mca coll_tuned_use_dynamic_rules 1 -mca
coll_tuned_dynamic_rules_filename dyn_rules
has no effect at all. Later it was identified that there was a bug
causing the dynamic rules to
apply only to the MPI_COMM_WORLD but not to other communicators. As
far as I understand, the bug
was fixed in openmpi-1.3.4. I tried now the openmpi-1.4 version and
expected that tuning of alltoall via dynamic
rules would work, but there is still no effect at all. Even worse, now
it is not even possible to use static rules
(which worked previously) such as -mca coll_tuned_alltoall_algorithm
3, because the code would crash (as discussed in the list recently).
When running with --mca coll_base_verbose 1000, I get messages like

[compute-0-0.local:08011] coll:sm:comm_query (0/MPI_COMM_WORLD):
intercomm, comm is too small, or not all peers local; disqualifying
myself
[compute-0-0.local:08011] coll:base:comm_select: component not  
available: sm

[compute-0-0.local:08011] coll:base:comm_select: component available:
sync, priority: 50
[compute-0-3.local:26116] coll:base:comm_select: component available:
self, priority: 75
[compute-0-3.local:26116] coll:sm:comm_query (1/MPI_COMM_SELF):
intercomm, comm is too small, or not all peers local; disqualifying
myself
[compute-0-3.local:26116] coll:base:comm_select: component not  
available: sm

[compute-0-3.local:26116] coll:base:comm_select: component available:
sync, priority: 50
[compute-0-3.local:26116] coll:base:comm_select: component not  
available: tuned

[compute-0-0.local:08011] coll:base:comm_select: component available:
tuned, priority: 30

Is there now a way to use other alltoall algorithms instead of the
basic linear algorithm in openmpi-1.4.x ?

Thanks in advance for any suggestion.

Best regards

Roman Martonak
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Daniel Spångberg
Materialkemi
Uppsala Universitet


Re: [OMPI users] Rapid I/O support

2010-01-15 Thread Scott Atchley

On Jan 14, 2010, at 3:08 PM, Jeff Squyres wrote:


On Jan 14, 2010, at 1:59 PM, TONY BASIL wrote:

I am doing a project with an HPC setup on multicore PowerPC. Nodes will be
connected using Rapid I/O instead of Gigabit Ethernet. I would like to know
if OpenMPI supports Rapid I/O...


I'm afraid not.  Before your post, I had never heard of Rapid IO.


Likewise. Does it support Ethernet encapsulation over it? If so, try  
Open-MX.


Scott


Re: [OMPI users] More NetBSD fixes

2010-01-15 Thread Jed Brown
On Thu, 14 Jan 2010 21:55:06 -0500, Jeff Squyres  wrote:
> That being said, you could sign up on it and then set your membership to 
> receive no mail...?

This is especially dangerous because the Open MPI lists munge the
Reply-To header, which is a bad thing

  http://www.unicom.com/pw/reply-to-harmful.html

But lots of mailers have poor default handling of mailing lists, so it's
complicated.

With munging, a mailer's "reply-to-sender" function will send mail
*only* to the list and "reply-to-all" will send it to the list and any
other recipients, but *not* the sender (unless the mailer does special
detection of munged reply-to headers).  This makes it rather difficult
to participate in a discussion without receiving mail from the list, or
even to reliably filter list traffic (you have to write filter rules
that walk the References tree to find if it is something that would be
interesting to you, and then you get false positives from people who
reply to an existing thread when they wanted to make a new thread).

Jed


Re: [OMPI users] Checkpoint/Restart error

2010-01-15 Thread Andreea Costea
I tried the new version that was uploaded today. I still have that error,
except that now it is at line 405 instead of 399.

Maybe if I give more details:
- I first had OpenMPI version 1.3.3 with BLCR installed: mpirun,
ompi-checkpoint and ompi-restart worked with that version.
- I wanted to update to version 1.4.1, so I uninstalled the previous version
like this: make uninstall, and then manually deleted all the leftover files.
The directory where I installed was /usr/local.
- I installed 1.4.1 in the same directory: /usr/local. Paths are set
correctly to /usr/local/bin and /usr/local/lib.
- mpirun works; ompi-checkpoint gives the following error:
[[35906,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line 405
HNP with PID 7899 Not found!
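
For reference, the workflow being attempted is roughly the following (a
sketch with a placeholder application name and PIDs; the actual global
snapshot name is printed by ompi-checkpoint):

mpirun -np 4 -am ft-enable-cr ./my_app &      # start the job with C/R enabled
ompi-checkpoint <PID of mpirun>               # e.g. 7899 above
ompi-restart ompi_global_snapshot_<PID>.ckpt  # restart from the saved snapshot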

I would appreciate any help,
Andreea


On Fri, Jan 15, 2010 at 1:15 PM, Andreea Costea wrote:

> Hi...
> still not working. Though I uninstalled OpenMPI with make uninstall and I
> manually deleted all other files, I still have the same error when
> checkpointing.
>
> Any idea?
>
> Thanks,
> Andreea
>
>
>
> On Thu, Jan 14, 2010 at 10:38 PM, Joshua Hursey wrote:
>
>> On Jan 14, 2010, at 8:20 AM, Andreea Costea wrote:
>>
>> > Hi,
>> >
>> > I wanted to try the C/R feature in OpenMPI version 1.4.1 that I have
>> downloaded today. When I want to checkpoint I am having the following error
>> message:
>> > [[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at
>> line 399
>> > HNP with PID 2337 Not found!
>>
>> This looks like an error coming from the 1.3.3 install. In 1.4.1 there is
>> no error at line 399, in 1.3.3 there is. Check your installation of Open
>> MPI, I bet you are mixing 1.4.1 and 1.3.3, which can cause unexpected
>> problems.
>>
>> Try a clean installation of 1.4.1 and double check that 1.3.3 is not in
>> your path/lib_path any longer.
>>
>> -- Josh
>>
>> >
>> > I tried the same thing with version 1.3.3 and it works perfectly.
>> >
>> > Any idea why?
>> >
>> > thanks,
>> > Andreea
>> > ___
>> > users mailing list
>> > us...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>


[OMPI users] dynamic rules

2010-01-15 Thread Roman Martonak
On my machine I need to use dynamic rules to enforce the bruck or pairwise
algorithm for alltoall, since unfortunately the default basic linear
algorithm performs quite poorly on my
Infiniband network. A few months ago I noticed, however, that in the case of
VASP the use of dynamic
rules via --mca coll_tuned_use_dynamic_rules 1 -mca
coll_tuned_dynamic_rules_filename dyn_rules
has no effect at all. Later it was identified that there was a bug
causing the dynamic rules to
apply only to the MPI_COMM_WORLD but not to other communicators. As
far as I understand, the bug
was fixed in openmpi-1.3.4. I have now tried the openmpi-1.4 version and
expected that tuning of alltoall via dynamic
rules would work, but there is still no effect at all. Even worse, now
it is not even possible to use static rules
(which worked previously) such as -mca coll_tuned_alltoall_algorithm
3, because the code would crash (as discussed in the list recently).
When running with --mca coll_base_verbose 1000, I get messages like

[compute-0-0.local:08011] coll:sm:comm_query (0/MPI_COMM_WORLD):
intercomm, comm is too small, or not all peers local; disqualifying
myself
[compute-0-0.local:08011] coll:base:comm_select: component not available: sm
[compute-0-0.local:08011] coll:base:comm_select: component available:
sync, priority: 50
[compute-0-3.local:26116] coll:base:comm_select: component available:
self, priority: 75
[compute-0-3.local:26116] coll:sm:comm_query (1/MPI_COMM_SELF):
intercomm, comm is too small, or not all peers local; disqualifying
myself
[compute-0-3.local:26116] coll:base:comm_select: component not available: sm
[compute-0-3.local:26116] coll:base:comm_select: component available:
sync, priority: 50
[compute-0-3.local:26116] coll:base:comm_select: component not available: tuned
[compute-0-0.local:08011] coll:base:comm_select: component available:
tuned, priority: 30

Is there now a way to use other alltoall algorithms instead of the
basic linear algorithm in openmpi-1.4.x?

Thanks in advance for any suggestion.

Best regards

Roman Martonak


Re: [OMPI users] Windows CMake build problems ... (cont.)

2010-01-15 Thread Shiqing Fan


Hi Charlie,

Glad to hear that you compiled it successfully.

The error you got with 1.3.4 is a bug where the CMake script didn't set
the SVN information correctly; it has been fixed in 1.4 and later.



Thanks,
Shiqing


cjohn...@valverdecomputing.com wrote:

Yes that was it.

A much improved result now from CMake 2.6.4, no errors from compiling 
openmpi-1.4:


1>libopen-pal - 0 error(s), 9 warning(s)
2>libopen-rte - 0 error(s), 7 warning(s)
3>opal-restart - 0 error(s), 0 warning(s)
4>opal-wrapper - 0 error(s), 0 warning(s)
5>libmpi - 0 error(s), 42 warning(s)
6>orte-checkpoint - 0 error(s), 0 warning(s)
7>orte-ps - 0 error(s), 0 warning(s)
8>orted - 0 error(s), 0 warning(s)
9>orte-clean - 0 error(s), 0 warning(s)
10>orterun - 0 error(s), 3 warning(s)
11>ompi_info - 0 error(s), 0 warning(s)
12>ompi-server - 0 error(s), 0 warning(s)
13>libmpi_cxx - 0 error(s), 61 warning(s)
== Build: 13 succeeded, 0 failed, 1 up-to-date, 0 skipped 
==


And only one failure from compiling openmpi-1.3.4 (the ompi_info project):

> 1>libopen-pal - 0 error(s), 9 warning(s)
> 2>libopen-rte - 0 error(s), 7 warning(s)
> 3>opal-restart - 0 error(s), 0 warning(s)
> 4>opal-wrapper - 0 error(s), 0 warning(s)
> 5>orte-checkpoint - 0 error(s), 0 warning(s)
> 6>libmpi - 0 error(s), 42 warning(s)
> 7>orte-ps - 0 error(s), 0 warning(s)
> 8>orted - 0 error(s), 0 warning(s)
> 9>orte-clean - 0 error(s), 0 warning(s)
> 10>orterun - 0 error(s), 3 warning(s)
> 11>ompi_info - 3 error(s), 0 warning(s)
> 12>ompi-server - 0 error(s), 0 warning(s)
> 13>libmpi_cxx - 0 error(s), 61 warning(s)
> == Rebuild All: 13 succeeded, 1 failed, 0 skipped ==

Here's the listing from the non-linking project:

11>-- Rebuild All started: Project: ompi_info, Configuration: 
Debug Win32 --
11>Deleting intermediate and output files for project 'ompi_info', 
configuration 'Debug|Win32'

11>Compiling...
11>version.cc
11>..\..\..\..\openmpi-1.3.4\ompi\tools\ompi_info\version.cc(136) : 
error C2059: syntax error : ','
11>..\..\..\..\openmpi-1.3.4\ompi\tools\ompi_info\version.cc(147) : 
error C2059: syntax error : ','
11>..\..\..\..\openmpi-1.3.4\ompi\tools\ompi_info\version.cc(158) : 
error C2059: syntax error : ','

11>param.cc
11>output.cc
11>ompi_info.cc
11>components.cc
11>Generating Code...
11>Build log was saved at 
"file://c:\prog\mon\ompi\tools\ompi_info\ompi_info.dir\Debug\BuildLog.htm"

11>ompi_info - 3 error(s), 0 warning(s)

Thank you Shiqing !

Charlie ...

 Original Message 
Subject: Re: [OMPI users] Windows CMake build problems ... (cont.)
From: Shiqing Fan 
Date: Thu, January 14, 2010 11:20 am
To: Open MPI Users ,
cjohn...@valverdecomputing.com


Hi Charlie,

The problem turns out to be the different behavior of one CMake
macro in
different version of CMake. And it's fixed in Open MPI trunk with
r22405. I also created a ticket to move the fix over to 1.4
branch, see
#2169: https://svn.open-mpi.org/trac/ompi/ticket/2169 .

So you could either switch to use OMPI trunk or use CMake 2.6 to
solve
the problem. Thanks a lot.


Best Regards,
Shiqing


cjohn...@valverdecomputing.com wrote:
> The OpenMPI build problem I'm having occurs in both OpenMPI 1.4
and 1.3.4.
>
> I am on a Windows 7 (US) Enterprise (x86) OS on an HP system with
> Intel core 2 extreme x9000 (4GB RAM), using the 2005 Visual
Studio for
> S/W Architects (release 8.0.50727.867).
>
> [That release has everything the platform SDK would have.]
>
> I'm using CMake 2.8 to generate code, I used it correctly,
pointing at
> the root directory where the makelists are located for the source
side
> and to an empty directory for the build side: did configure, _*I did
> not click debug this time as suggested by Shiqing*_, configure
again,
> generate and opened the OpenMPI.sln file created by CMake. Then I
> right-clicked on the "ALL_BUILD" project and selected "build". Then
> did one "rebuild", just in case build order might get one more
success
> (which it seemed to, but I could not find).
>
> 2 projects built, 12 did not. I have the build listing. [I'm
afraid of
> what the mailing list server would do if I attached it to this
email.]
>
> All the compiles were successful (warnings at most.) All the errors
> were were from linking the VC projects:
>
> *1>libopen-pal - 0 error(s), 9 warning(s)*
> 3>opal-restart - 32 error(s), 0 warning(s)
> 4>opal-wrapper - 21 error(s), 0 warning(s)
> 2>libopen-rte - 749 error(s), 7 warning(s)
> 5>orte-checkpoint - 32 error(s), 0 warning(s)
> 7>orte-ps - 28 error(s), 0 warning(s)
> 8>orted - 2 error(s), 0 warning(s)
> 9>orte-clean - 13 error(s), 0 warning(s)
> 10>orterun - 100 error(s), 3 warning(s)
> 6>libmpi - 2133 error(s), 42 warning(s)
> 12>ompi-server - 27 error(s), 0 war

Re: [OMPI users] MPI debugger

2010-01-15 Thread Ashley Pittman

On 11 Jan 2010, at 06:20, Jed Brown wrote:

> On Sun, 10 Jan 2010 19:29:18 +, Ashley Pittman  
> wrote:
>> It'll show you parallel stack traces but won't let you single step for
>> example.
> 
> Two lightweight options if you want stepping, breakpoints, watchpoints,
> etc.
> 
> * Use serial debuggers on some interesting processes, for example with
> 
>mpiexec -n 1 xterm -e gdb --args ./trouble args : -n 2 ./trouble args : -n 
> 1 xterm -e gdb --args ./trouble args
> 
>  to put an xterm on rank 0 and 3 of a four process job (there are lots
>  of other ways to get here).

You can also achieve something similar with padb by starting the job normally 
and then using padb to launch xterms in a similar manner, although it's been 
pointed out to me that this only works with one process per node right now.

> * MPICH2 has a poor-man's parallel debugger, mpiexec.mpd -gdb allows you
>  to send the same gdb commands to each process and collate the output.

True, I'd forgotten about that. The MPICH2 people are moving away from mpd, 
though, so I don't know how much longer that will be an option.

Ashley,


[OMPI users] Open MPI v1.4.1 released

2010-01-15 Thread Ralph Castain
The Open MPI Team, representing a consortium of research, academic,
and industry partners, is pleased to announce the release of Open MPI
version 1.4.1. This release is strictly a bug fix release over the v1.4
release.

Version 1.4.1 can be downloaded from the main Open MPI web site or
any of its mirrors (mirrors will be updating shortly).

Here is a list of changes in v1.4.1 as compared to v1.4

- Update to PLPA v1.3.2, addressing a licensing issue identified by
  the Fedora project.  See
  https://svn.open-mpi.org/trac/plpa/changeset/262 for details.
- Add check for malformed checkpoint metadata files (Ticket #2141).
- Fix error path in ompi-checkpoint when not able to checkpoint
  (Ticket #2138).
- Cleanup component release logic when selecting checkpoint/restart
  enabled components (Ticket #2135).
- Fixed VT node name detection for Cray XT platforms, and fixed some
  broken VT documentation files.
- Fix a possible race condition in tearing down RDMA CM-based
  connections.
- Relax error checking on MPI_GRAPH_CREATE.  Thanks to David Singleton
  for pointing out the issue.
- Fix a shared memory "hang" problem that occurred on x86/x86_64
  platforms when used with the GNU >=4.4.x compiler series.
- Add fix for Libtool 2.2.6b's problems with the PGI 10.x compiler
  suite.  Inspired directly from the upstream Libtool patches that fix
  the issue (but we need something working before the next Libtool
  release).



Re: [OMPI users] Windows CMake build problems ... (cont.)

2010-01-15 Thread cjohnson
Yes that was it.

A much improved result now from CMake 2.6.4, no errors from compiling openmpi-1.4:

1>libopen-pal - 0 error(s), 9 warning(s)
2>libopen-rte - 0 error(s), 7 warning(s)
3>opal-restart - 0 error(s), 0 warning(s)
4>opal-wrapper - 0 error(s), 0 warning(s)
5>libmpi - 0 error(s), 42 warning(s)
6>orte-checkpoint - 0 error(s), 0 warning(s)
7>orte-ps - 0 error(s), 0 warning(s)
8>orted - 0 error(s), 0 warning(s)
9>orte-clean - 0 error(s), 0 warning(s)
10>orterun - 0 error(s), 3 warning(s)
11>ompi_info - 0 error(s), 0 warning(s)
12>ompi-server - 0 error(s), 0 warning(s)
13>libmpi_cxx - 0 error(s), 61 warning(s)
== Build: 13 succeeded, 0 failed, 1 up-to-date, 0 skipped ==

And only one failure from compiling openmpi-1.3.4 (the ompi_info project):

> 1>libopen-pal - 0 error(s), 9 warning(s)
> 2>libopen-rte - 0 error(s), 7 warning(s)
> 3>opal-restart - 0 error(s), 0 warning(s)
> 4>opal-wrapper - 0 error(s), 0 warning(s)
> 5>orte-checkpoint - 0 error(s), 0 warning(s)
> 6>libmpi - 0 error(s), 42 warning(s)
> 7>orte-ps - 0 error(s), 0 warning(s)
> 8>orted - 0 error(s), 0 warning(s)
> 9>orte-clean - 0 error(s), 0 warning(s)
> 10>orterun - 0 error(s), 3 warning(s)
> 11>ompi_info - 3 error(s), 0 warning(s)
> 12>ompi-server - 0 error(s), 0 warning(s)
> 13>libmpi_cxx - 0 error(s), 61 warning(s)
> == Rebuild All: 13 succeeded, 1 failed, 0 skipped ==

Here's the listing from the non-linking project:

11>-- Rebuild All started: Project: ompi_info, Configuration: Debug Win32 --
11>Deleting intermediate and output files for project 'ompi_info', configuration 'Debug|Win32'
11>Compiling...
11>version.cc
11>..\..\..\..\openmpi-1.3.4\ompi\tools\ompi_info\version.cc(136) : error C2059: syntax error : ','
11>..\..\..\..\openmpi-1.3.4\ompi\tools\ompi_info\version.cc(147) : error C2059: syntax error : ','
11>..\..\..\..\openmpi-1.3.4\ompi\tools\ompi_info\version.cc(158) : error C2059: syntax error : ','
11>param.cc
11>output.cc
11>ompi_info.cc
11>components.cc
11>Generating Code...
11>Build log was saved at "file://c:\prog\mon\ompi\tools\ompi_info\ompi_info.dir\Debug\BuildLog.htm"
11>ompi_info - 3 error(s), 0 warning(s)

Thank you Shiqing !

Charlie ...


 Original Message 
Subject: Re: [OMPI users] Windows CMake build problems ... (cont.)
From: Shiqing Fan 
Date: Thu, January 14, 2010 11:20 am
To: Open MPI Users , cjohn...@valverdecomputing.com


Hi Charlie,

The problem turns out to be the different behavior of one CMake macro in 
different version of CMake. And it's fixed in Open MPI trunk with 
r22405. I also created a ticket to move the fix over to 1.4 branch, see 
#2169: https://svn.open-mpi.org/trac/ompi/ticket/2169 .

So you could either switch to use OMPI trunk or use CMake 2.6 to solve 
the problem. Thanks a lot.


Best Regards,
Shiqing


cjohn...@valverdecomputing.com wrote:
> The OpenMPI build problem I'm having occurs in both OpenMPI 1.4 and 1.3.4.
>
> I am on a Windows 7 (US) Enterprise (x86) OS on an HP system with 
> Intel core 2 extreme x9000 (4GB RAM), using the 2005 Visual Studio for 
> S/W Architects (release 8.0.50727.867).
>
> [That release has everything the platform SDK would have.]
>
> I'm using CMake 2.8 to generate code, I used it correctly, pointing at 
> the root directory where the makelists are located for the source side 
> and to an empty directory for the build side: did configure, _*I did 
> not click debug this time as suggested by Shiqing*_, configure again, 
> generate and opened the OpenMPI.sln file created by CMake. Then I 
> right-clicked on the "ALL_BUILD" project and selected "build". Then 
> did one "rebuild", just in case build order might get one more success 
> (which it seemed to, but I could not find).
>
> 2 projects built, 12 did not. I have the build listing. [I'm afraid of 
> what the mailing list server would do if I attached it to this email.]
>
> All the compiles were successful (warnings at most.) All the errors 
> were were from linking the VC projects:
>
> *1>libopen-pal - 0 error(s), 9 warning(s)*
> 3>opal-restart - 32 error(s), 0 warning(s)
> 4>opal-wrapper - 21 error(s), 0 warning(s)
> 2>libopen-rte - 749 error(s), 7 warning(s)
> 5>orte-checkpoint - 32 error(s), 0 warning(s)
> 7>orte-ps - 28 error(s), 0 warning(s)
> 8>orted - 2 error(s), 0 warning(s)
> 9>orte-clean - 13 error(s), 0 warning(s)
> 10>orterun - 100 error(s), 3 warning(s)
> 6>libmpi - 2133 error(s), 42 warning(s)
> 12>ompi-server - 27 error(s), 0 warning(s)
> 11>ompi_info - 146 error(s), 0 warning(s)
> 13>libmpi_cxx - 456 error(s), 61 warning(s)
> == Rebuild All: 2 succeeded, 12 failed, 0 skipped ==
>
> It said that 2 succeeded, I could not find the second build success in 
> the listing.
>
> *However, everything did compile, and thank you Shiqing !*
>
> Here is the listing for the first failed link, on "opal-restart":
>
> 3>-- Rebuild All started: Project: opal-restart, Configuration: 
> Debug Win32 --
> 3>Deleting intermediate and output fil

Re: [OMPI users] Checkpoint/Restart error

2010-01-15 Thread Andreea Costea
Hi...
still not working. Though I uninstalled OpenMPI with make uninstall and I
manually deleted all other files, I still have the same error when
checkpointing.

Any idea?

Thanks,
Andreea


On Thu, Jan 14, 2010 at 10:38 PM, Joshua Hursey wrote:

> On Jan 14, 2010, at 8:20 AM, Andreea Costea wrote:
>
> > Hi,
> >
> > I wanted to try the C/R feature in OpenMPI version 1.4.1 that I have
> downloaded today. When I want to checkpoint I am having the following error
> message:
> > [[65192,0],0] ORTE_ERROR_LOG: Not found in file orte-checkpoint.c at line
> 399
> > HNP with PID 2337 Not found!
>
> This looks like an error coming from the 1.3.3 install. In 1.4.1 there is
> no error at line 399, in 1.3.3 there is. Check your installation of Open
> MPI, I bet you are mixing 1.4.1 and 1.3.3, which can cause unexpected
> problems.
>
> Try a clean installation of 1.4.1 and double check that 1.3.3 is not in
> your path/lib_path any longer.
>
> -- Josh
>
> >
> > I tried the same thing with version 1.3.3 and it works perfectly.
> >
> > Any idea why?
> >
> > thanks,
> > Andreea
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>