Re: [OMPI users] How do I compile OpenMPI in Xcode 3.1

2009-05-05 Thread Jeff Squyres

I agree; that is a bummer.  :-(

Warner -- do you have any advice here, perchance?


On May 4, 2009, at 7:26 PM, Vicente Puig wrote:


But it doesn't work well.

For example, I am trying to debug a program, "floyd" in this case,  
and when I set a breakpoint I get:


No line 26 in file "../../../gcc-4.2-20060805/libgfortran/fmain.c".

I am getting disappointed and frustrated that I cannot work well  
with openmpi on my Mac. There should be a way to make it run in  
Xcode, uff...


2009/5/4 Jeff Squyres 
I get those as well.  I believe that they are (annoying but)  
harmless -- an artifact of how the freeware gcc/gfortran that I use  
was built.




On May 4, 2009, at 1:47 PM, Vicente Puig wrote:

Maybe I should have opened a new thread, but do you have any idea why I  
receive this when I use gdb to debug an openmpi program:


warning: Could not find object file "/Users/admin/build/i386-apple-darwin9.0.0/libgcc/_umoddi3_s.o" - no debug information available for "../../../gcc-4.3-20071026/libgcc/../gcc/libgcc2.c".

warning: Could not find object file "/Users/admin/build/i386-apple-darwin9.0.0/libgcc/_udiv_w_sdiv_s.o" - no debug information available for "../../../gcc-4.3-20071026/libgcc/../gcc/libgcc2.c".

warning: Could not find object file "/Users/admin/build/i386-apple-darwin9.0.0/libgcc/_udivmoddi4_s.o" - no debug information available for "../../../gcc-4.3-20071026/libgcc/../gcc/libgcc2.c".

warning: Could not find object file "/Users/admin/build/i386-apple-darwin9.0.0/libgcc/unwind-dw2_s.o" - no debug information available for "../../../gcc-4.3-20071026/libgcc/../gcc/unwind-dw2.c".

warning: Could not find object file "/Users/admin/build/i386-apple-darwin9.0.0/libgcc/unwind-dw2-fde-darwin_s.o" - no debug information available for "../../../gcc-4.3-20071026/libgcc/../gcc/unwind-dw2-fde-darwin.c".

warning: Could not find object file "/Users/admin/build/i386-apple-darwin9.0.0/libgcc/unwind-c_s.o" - no debug information available for "../../../gcc-4.3-20071026/libgcc/../gcc/unwind-c.c".

...



There is no 'admin' user, so I don't know why it happens. It works well  
with a C program.


Any idea?

Thanks.


Vincent





2009/5/4 Vicente Puig 
I can run openmpi perfectly from the command line, but I wanted a  
graphical interface for debugging because I was having problems.


Thanks anyway.

Vincent

2009/5/4 Warner Yuen 

Admittedly, I don't use Xcode to build Open MPI either.

You can just compile Open MPI from the command line and install  
everything in /usr/local/. Make sure that gfortran is in your  
path and you should just be able to do a './configure --prefix=/usr/local'.


After the installation, just make sure that your path is set  
correctly when you go to use the newly installed Open MPI. If you  
don't set your path, it will always default to using the version of  
OpenMPI that ships with Leopard.
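
A minimal sketch of those steps (assuming gfortran is already installed
and you have admin rights; the exact flags are illustrative):

  ./configure --prefix=/usr/local
  make all
  sudo make install

  # Put the new installation ahead of the Leopard-supplied one:
  export PATH=/usr/local/bin:$PATH
  which mpicc     # should now report /usr/local/bin/mpicc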



Warner Yuen
Scientific Computing
Consulting Engineer
Apple, Inc.
email: wy...@apple.com
Tel: 408.718.2859




On May 4, 2009, at 9:13 AM, users-requ...@open-mpi.org wrote:


Message: 1
Date: Mon, 4 May 2009 18:13:45 +0200
From: Vicente Puig 
Subject: Re: [OMPI users] How do I compile OpenMPI in Xcode 3.1
To: Open MPI Users 
Message-ID:
  <3e9a21680905040913u3f36d3c9rdcd3413bfdcd...@mail.gmail.com>
Content-Type: text/plain; charset="iso-8859-1"

If I cannot make it work with Xcode, which one could I use? Which one do
you use to compile and debug OpenMPI?
Thanks

Vincent


2009/5/4 Jeff Squyres 

Open MPI comes pre-installed in Leopard; as Warner noted, since Leopard
doesn't ship with a Fortran compiler, the Open MPI that Apple ships has
non-functional mpif77 and mpif90 wrapper compilers.

So the Open MPI that you installed manually will use your Fortran
compilers, and therefore will have functional mpif77 and mpif90 wrapper
compilers.  Hence, you probably need to be sure to use the "right" wrapper
compilers.  It looks like you specified the full path in ExecPath, so I'm
not sure why Xcode wouldn't work with that (like I mentioned, I
unfortunately don't use Xcode myself, so I don't know why that wouldn't
work).
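
One quick way to verify which wrapper compilers you are actually picking
up (a sketch using Open MPI's --showme option; the expected path is
illustrative):

  which mpif77 mpif90    # should report your manual install, e.g. /usr/local/bin/mpif90
  mpif90 --showme        # Open MPI wrappers print the underlying compile command they run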




On May 4, 2009, at 11:53 AM, Vicente wrote:

Yes, I already have gfortran compiler on /usr/local/bin, the 

Re: [OMPI users] Slightly off topic: Ethernet and InfiniBand speed evolution

2009-05-05 Thread Scott Atchley

On May 5, 2009, at 2:47 PM, Jeff Squyres wrote:


On May 5, 2009, at 1:59 PM, Robert Kubrick wrote:


I am preparing a presentation where I will discuss commodity
interconnects and the evolution of Ethernet and Infiniband NICs. The
idea is to show the advance in network interface speed over time on
a chart. So far I have collected the following *approximate* data
for Ethernet:

1990 --> 100Mbits/s
2000 --> 1Gbits/s
2010 --> 10Gbits/s
2020 --> 100Gbits/s




FWIW, your ethernet timeline might be a little too long.  I could  
swear I read an internet trade rag recently that was anticipating  
pre-standard 40Gbps ethernet equipment by the end of 2010 (similar  
to how you can buy pre-standard 802.11n equipment from a variety of  
vendors now).  My *guess* is that 100Gbps will follow much less than  
10 years later because there are already carriers today who
need/want/would buy it.


Initially, 100 Gb/s Ethernet will be for switch-to-switch links, and  
it will come long before 2020.


The CPU roadmap to handle 100 Gb/s (via PCI-Express, QPI, HT, etc.)  
may dictate when NICs will support that rate.


Scott


Re: [OMPI users] Slightly off topic: Ethernet and InfiniBand speed evolution

2009-05-05 Thread Robert Kubrick
It is indeed surprisingly hard to draw a simple timeline for ethernet  
speed evolution. Network throughput depends on the medium (fiber  
optic, coaxial, twisted pair...), distance (LAN, switch, WAN), number  
of channels (half duplex, full duplex), and level of commercialization  
(research, production, commodity). All these factors have contributed  
to different ethernet NIC speeds during the same decade or even year.


My goal was to compare the evolution of network performance vs. CPU  
frequency, both in the past and over the next 10 years. It is a lot  
easier to show x86 CPU frequency changes on a time chart.


On May 5, 2009, at 2:47 PM, Jeff Squyres wrote:


On May 5, 2009, at 1:59 PM, Robert Kubrick wrote:


I am preparing a presentation where I will discuss commodity
interconnects and the evolution of Ethernet and Infiniband NICs. The
idea is to show the advance in network interface speed over time on
a chart. So far I have collected the following *approximate* data
for Ethernet:

1990 --> 100Mbits/s
2000 --> 1Gbits/s
2010 --> 10Gbits/s
2020 --> 100Gbits/s




FWIW, your ethernet timeline might be a little too long.  I could  
swear I read an internet trade rag recently that was anticipating  
pre-standard 40Gbps ethernet equipment by the end of 2010 (similar  
to how you can buy pre-standard 802.11n equipment from a variety of  
vendors now).  My *guess* is that 100Gbps will follow much less  
than 10 years later because there are already carriers today who  
need/want/would buy it.


(note: that's totally a guess -- don't read anything into it  
based on my email address; I'm in the unified computing/server  
group at Cisco -- nowhere near related to the 40/100Gbps groups)


--
Jeff Squyres
Cisco Systems





Re: [OMPI users] Slightly off topic: Ethernet and InfiniBand speed evolution

2009-05-05 Thread Jeff Squyres

On May 5, 2009, at 1:59 PM, Robert Kubrick wrote:


I am preparing a presentation where I will discuss commodity
interconnects and the evolution of Ethernet and Infiniband NICs. The
idea is to show the advance in network interface speed over time on
a chart. So far I have collected the following *approximate* data
for Ethernet:

1990 --> 100Mbits/s
2000 --> 1Gbits/s
2010 --> 10Gbits/s
2020 --> 100Gbits/s




FWIW, your ethernet timeline might be a little too long.  I could  
swear I read an internet trade rag recently that was anticipating  
pre-standard 40Gbps ethernet equipment by the end of 2010 (similar to how  
you can buy pre-standard 802.11n equipment from a variety of vendors  
now).  My *guess* is that 100Gbps will follow much less than 10 years  
later because there are already carriers today who need/want/would buy  
it.


(note: that's totally a guess -- don't read anything into it based  
on my email address; I'm in the unified computing/server group at  
Cisco -- nowhere near related to the 40/100Gbps groups)


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Slightly off topic: Ethernet and InfiniBand speed evolution

2009-05-05 Thread Pavel Shamis (Pasha)


I can't find a similar data set for InfiniBand. I would appreciate any 
comments/links.

Here is the IB roadmap: http://www.infinibandta.org/itinfo/IB_roadmap
...But I do not see SDR there.

Pasha



[OMPI users] Slightly off topic: Ethernet and InfiniBand speed evolution

2009-05-05 Thread Robert Kubrick

Greetings,

I am preparing a presentation where I will discuss commodity  
interconnects and the evolution of Ethernet and Infiniband NICs. The  
idea is to show the advance in network interface speed over time on  
a chart. So far I have collected the following *approximate* data  
for Ethernet:


1990 --> 100Mbits/s
2000 --> 1Gbits/s
2010 --> 10Gbits/s
2020 --> 100Gbits/s

I can't find a similar data set for InfiniBand. I would appreciate  
any comments/links.



Thank you! --Rob.


Re: [OMPI users] LSF launch with OpenMPI

2009-05-05 Thread Matthieu Brucher
2009/5/5 Jeff Squyres :
> On May 5, 2009, at 6:10 AM, Matthieu Brucher wrote:
>
>> The first is what the support of LSF by OpenMPI means. When mpirun is
>> executed, is it an LSF job that is actually run? Or what does it
>> imply? I've tried to search on the openmpi website as well as on the
>> internet, but I couldn't find a clear answer/use case.
>>
>
> What Terry said is correct.  It means that "mpirun" will use, under the
> covers, the "native" launching mechanism of LSF to launch jobs (vs., say,
> rsh or ssh).  It'll also discover the hosts to use for this job without the
> use of a hostfile -- it'll query LSF directly to see what hosts it should
> use.

OK, so I have to do something like:
bsub -n ${CPUS} mpirun myapplication

Is that right?

>> My second question is about the LSF detection. lsf.h is detected, but
>> when lsb_launch is searched for in libbat.so, it fails because
>> parse_time and parse_time_ex are not found. Is there a way to add
>> additional lsf libraries so that the search can be done?
>>
>
>
> Can you send all the data shown here:
>
>    http://www.open-mpi.org/community/help/

I've enclosed the configure output as well as the config.log. The
problem is that my LSF 7.0.3 (I didn't install it) needs libbat to be
linked against liblsbstream (I modified the configure script to add
-llsbstream, and it compiled).

I can't use the official way of launching a batch job; LSF doesn't
pick up the correct LSF script wrapper (due to a bogus installation).

Thank you for all the answers! (I will have other questions, as I'm trying to
use the InfiniPath support as well.)

Matthieu
-- 
Information System Engineer, Ph.D.
Website: http://matthieu-brucher.developpez.com/
Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn: http://www.linkedin.com/in/matthieubrucher


output.tar.bz
Description: Binary data


Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-05-05 Thread Eugene Loh

Eugene Loh wrote:

Put more strongly, the "correct" (subjective term) way for an MPI 
implementation to bind processes is upon process creation and waiting 
until MPI_Init is "wrong".  This point of view has nothing to do with 
asking the MPI implementation to support binding of non-MPI processes.


I wanted to clarify my comment.  That notion of correct/wrong here is, 
as I indicated, quite subjective.  It reflects a particular point of 
view (the challenge of getting local memory on a NUMA node).  There is 
no "standard" to tell us what is right or wrong here.  An equally valid 
point of view is that users should not expect any MPI support (including 
for something as nonstandard as process binding) until MPI_Init has been 
called.  I was just trying to help Geoffroy make a case here:  why we 
might want to bind processes even if they don't call -- er, haven't yet 
called :^) -- MPI_Init.


I've used another hack when using an MPI implementation whose binding 
support either doesn't exist or I don't know how to use it or I don't 
trust it.  Instead of launching the executables, I launch a process that 
looks like this (I'm typing this from memory, probably full of typos and 
not guaranteed to work):


#!/bin/csh
# Pick a CPU for this rank: csh arrays are 1-based, MPI ranks are 0-based.
set CPULIST = ( 47 23 19 8 43 12 )
@ me = $OMPI_COMM_WORLD_RANK + 1
# Bind this shell (and its children) to that CPU, then run the real program.
pbind -b $CPULIST[$me] $$
./a.out

I hope you get the idea.  Anyhow, it's a wrapper script that binds the 
process before launching the MPI executable.


Mainly, just wanted to clarify that I wasn't saying unequivocally what 
was right/wrong here.  Only expressing one point of view to help 
represent Geoffroy's case.


Re: [OMPI users] LSF launch with OpenMPI

2009-05-05 Thread Jeff Squyres

On May 5, 2009, at 9:25 AM, Jeroen Kleijer wrote:

If you wish to submit to lsf using its native commands (bsub) you  
can do the following:


bsub -q ${QUEUE} -a openmpi -n ${CPUS} "mpirun.lsf  -x PATH -x  
LD_LIBRARY_PATH -x MPI_BUFFER_SIZE ${COMMAND} ${OPTIONS}"


It should be noted that in this case you don't call OpenMPI's mpirun  
directly but use the mpirun.lsf, a wrapper script provided by LSF.  
This wrapper script takes care of setting the necessary environment  
variables and eventually calls the correct mpirun. (The option "-a  
openmpi" tells LSF that we're using OpenMPI so it doesn't try to  
autodetect.)


I had forgotten about this.

I should ask my LSF contacts if this method still works with Open MPI  
v1.3 (which natively supports LSF), or whether strange / interesting  
failures occur because the integration that mpirun.lsf does ends up  
effectively conflicting with what OMPI's mpirun does internally...


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-05-05 Thread Eugene Loh

Ralph Castain wrote:


On May 5, 2009, at 3:37 AM, Geoffroy Pignot wrote:


 The result is : everything works fine with MPI executables : logical !!!


What I was trying to do , was to run non MPI exes thanks to mpirun. 
There , openmpi is not able to bind these processes to a particular CPU.

My conclusion is that the process affinity is set in MPI_Init, right ?


Yes - sorry, I should have caught that in your cmd line. Not enough 
sleep lately... :-)


Could it be possible to have the paffinity features working without 
any MPI_Init call, using taskset for example. I agree , it's not your 
job to support the execution of any kind of exes but it would be nice !!


Actually, it is worth the question. As things stand, processes don't 
bind until they call MPI_Init. This has caused some problems for 
people that rely on the procs to be restricted to specific processor 
sets, but who don't (for various reasons) call MPI_Init at the 
beginning of their program.


I'll raise the question inside the devel community and see what people 
think.


Another case is where you have a NUMA node and the MPI process allocates 
and touches a bunch of memory before MPI_Init is called.  In this case, 
we wouldn't want to wait until MPI_Init for process binding.  Rather, we 
would want the process bound as soon as it is started.
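
A minimal C sketch of that scenario (illustrative only; the buffer size is
arbitrary):

  /* Memory first-touched before MPI_Init is placed according to wherever
     the process happens to be running at that moment; binding done later,
     inside MPI_Init, cannot move these pages. */
  #include <stdlib.h>
  #include <string.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      size_t n = (size_t)1 << 28;    /* ~256 MB, arbitrary */
      char *buf = malloc(n);
      memset(buf, 0, n);             /* first touch: pages land on the current NUMA node */
      MPI_Init(&argc, &argv);        /* binding at this point comes too late for buf */
      /* ... compute ... */
      MPI_Finalize();
      free(buf);
      return 0;
  }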


Put more strongly, the "correct" (subjective term) way for an MPI 
implementation to bind processes is upon process creation and waiting 
until MPI_Init is "wrong".  This point of view has nothing to do with 
asking the MPI implementation to support binding of non-MPI processes.


Re: [OMPI users] LSF launch with OpenMPI

2009-05-05 Thread Jeroen Kleijer
If you wish to submit to lsf using its native commands (bsub) you can do the
following:

bsub -q ${QUEUE} -a openmpi -n ${CPUS} "mpirun.lsf  -x PATH -x
LD_LIBRARY_PATH -x MPI_BUFFER_SIZE ${COMMAND} ${OPTIONS}"

It should be noted that in this case you don't call OpenMPI's mpirun
directly but use the mpirun.lsf, a wrapper script provided by LSF. This
wrapper script takes care of setting the necessary environment variables and
eventually calls the correct mpirun. (The option "-a openmpi" tells LSF that
we're using OpenMPI so it doesn't try to autodetect.)

Regards,

Jeroen Kleijer

On Tue, May 5, 2009 at 2:23 PM, Jeff Squyres  wrote:

> On May 5, 2009, at 6:10 AM, Matthieu Brucher wrote:
>
> The first is what the support of LSF by OpenMPI means. When mpirun is
>> executed, is it an LSF job that is actually run? Or what does it
>> imply? I've tried to search on the openmpi website as well as on the
>> internet, but I couldn't find a clear answer/use case.
>>
>>
> What Terry said is correct.  It means that "mpirun" will use, under the
> covers, the "native" launching mechanism of LSF to launch jobs (vs., say,
> rsh or ssh).  It'll also discover the hosts to use for this job without the
> use of a hostfile -- it'll query LSF directly to see what hosts it should
> use.
>
> My second question is about the LSF detection. lsf.h is detected, but
>> when lsb_launch is searched for in libbat.so, it fails because
>> parse_time and parse_time_ex are not found. Is there a way to add
>> additional lsf libraries so that the search can be done?
>>
>>
>
> Can you send all the data shown here:
>
>http://www.open-mpi.org/community/help/
>
> --
> Jeff Squyres
> Cisco Systems
>
>
>


Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-05-05 Thread Ralph Castain
Actually, my memory was correct!

I believe you are looking at the old code in the 1.3 branch, and not the new
code in the trunk (and soon to come to the 1.3 branch). The new code does
not have this check any more as it is not required.

Sorry for the confusion...



On Tue, May 5, 2009 at 7:08 AM, Ralph Castain  wrote:

> Ah - thx for catching that, I'll remove that check. It no longer is
> required.
>
> Thx!
>
>
> On Tue, May 5, 2009 at 7:04 AM, Lenny Verkhovsky <
> lenny.verkhov...@gmail.com> wrote:
>
>> According to the code, it does care.
>>
>> $vi orte/mca/rmaps/rank_file/rmaps_rank_file.c +572
>>
>> ival = orte_rmaps_rank_file_value.ival;
>>   if ( ival > (np-1) ) {
>>   orte_show_help("help-rmaps_rank_file.txt", "bad-rankfile", true, ival,
>> rankfile);
>>   rc = ORTE_ERR_BAD_PARAM;
>>   goto unlock;
>>   }
>>
>> If I remember correctly, I used an array to map ranks, and since the
>> length of the array is NP, the maximum index must be less than NP, so if you
>> have a rank number > NP, there is no place to put it inside the array.
>>
>> "Likewise, if you have more procs than the rankfile specifies, we map the
>> additional procs either byslot (default) or bynode (if you specify that
>> option). So the rankfile doesn't need to contain an entry for every proc."
>>  - Correct point.
>>
>> Lenny.
>>
>> On 5/5/09, Ralph Castain  wrote:
>>>
>>> Sorry Lenny, but that isn't correct. The rankfile mapper doesn't care if
>>> the rankfile contains additional info - it only maps up to the number of
>>> processes, and ignores anything beyond that number. So there is no need to
>>> remove the additional info.
>>>
>>> Likewise, if you have more procs than the rankfile specifies, we map the
>>> additional procs either byslot (default) or bynode (if you specify that
>>> option). So the rankfile doesn't need to contain an entry for every proc.
>>>
>>> Just don't want to confuse folks.
>>> Ralph
>>>
>>>
>>>
>>> On Tue, May 5, 2009 at 5:59 AM, Lenny Verkhovsky <
>>> lenny.verkhov...@gmail.com> wrote:
>>>
 Hi,
 maximum rank number must be less than np.
 if np=1 then there is only rank 0 in the system, so rank 1 is invalid.
 please remove "rank 1=node2 slot=*" from the rankfile
 Best regards,
 Lenny.

 On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot 
 wrote:

> Hi ,
>
> I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my command doesn't work
>
> cat rankf:
> rank 0=node1 slot=*
> rank 1=node2 slot=*
>
> cat hostf:
> node1 slots=2
> node2 slots=2
>
> mpirun  --rankfile rankf --hostfile hostf  --host node1 -n 1 hostname :
> --host node2 -n 1 hostname
>
> Error, invalid rank (1) in the rankfile (rankf)
>
>
> --
> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
> rmaps_rank_file.c at line 403
> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
> base/rmaps_base_map_job.c at line 86
> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
> base/plm_base_launch_support.c at line 86
> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
> plm_rsh_module.c at line 1016
>
>
> Ralph, could you tell me if my command syntax is correct or not ? if
> not, give me the expected one ?
>
> Regards
>
> Geoffroy
>
>
>
>
> 2009/4/30 Geoffroy Pignot 
>
>> Immediately Sir !!! :)
>>
>> Thanks again Ralph
>>
>> Geoffroy
>>
>>
>>
>>>
>>>
>>> --
>>>
>>> Message: 2
>>> Date: Thu, 30 Apr 2009 06:45:39 -0600
>>> From: Ralph Castain 
>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>> To: Open MPI Users 
>>> Message-ID:
>>><71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com>
>>> Content-Type: text/plain; charset="iso-8859-1"
>>>
>>> I believe this is fixed now in our development trunk - you can
>>> download any
>>> tarball starting from last night and give it a try, if you like. Any
>>> feedback would be appreciated.
>>>
>>> Ralph
>>>
>>>
>>> On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
>>>
>>> Ah now, I didn't say it -worked-, did I? :-)
>>>
>>> Clearly a bug exists in the program. I'll try to take a look at it
>>> (if Lenny
>>> doesn't get to it first), but it won't be until later in the week.
>>>
>>> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>>>
>>> I agree with you Ralph , and that 's what I expect from openmpi but
>>> my
>>> second example shows that it's not working

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-05-05 Thread Ralph Castain
Ah - thx for catching that, I'll remove that check. It no longer is
required.

Thx!

On Tue, May 5, 2009 at 7:04 AM, Lenny Verkhovsky  wrote:

> According to the code, it does care.
>
> $vi orte/mca/rmaps/rank_file/rmaps_rank_file.c +572
>
> ival = orte_rmaps_rank_file_value.ival;
>   if ( ival > (np-1) ) {
>   orte_show_help("help-rmaps_rank_file.txt", "bad-rankfile", true, ival,
> rankfile);
>   rc = ORTE_ERR_BAD_PARAM;
>   goto unlock;
>   }
>
> If I remember correctly, I used an array to map ranks, and since the length
> of the array is NP, the maximum index must be less than NP, so if you have a
> rank number > NP, there is no place to put it inside the array.
>
> "Likewise, if you have more procs than the rankfile specifies, we map the
> additional procs either byslot (default) or bynode (if you specify that
> option). So the rankfile doesn't need to contain an entry for every proc."
>  - Correct point.
>
> Lenny.
>
> On 5/5/09, Ralph Castain  wrote:
>>
>> Sorry Lenny, but that isn't correct. The rankfile mapper doesn't care if
>> the rankfile contains additional info - it only maps up to the number of
>> processes, and ignores anything beyond that number. So there is no need to
>> remove the additional info.
>>
>> Likewise, if you have more procs than the rankfile specifies, we map the
>> additional procs either byslot (default) or bynode (if you specify that
>> option). So the rankfile doesn't need to contain an entry for every proc.
>>
>> Just don't want to confuse folks.
>> Ralph
>>
>>
>>
>> On Tue, May 5, 2009 at 5:59 AM, Lenny Verkhovsky <
>> lenny.verkhov...@gmail.com> wrote:
>>
>>> Hi,
>>> maximum rank number must be less than np.
>>> if np=1 then there is only rank 0 in the system, so rank 1 is invalid.
>>> please remove "rank 1=node2 slot=*" from the rankfile
>>> Best regards,
>>> Lenny.
>>>
>>> On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot wrote:
>>>
 Hi ,

 I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my command doesn't work

 cat rankf:
 rank 0=node1 slot=*
 rank 1=node2 slot=*

 cat hostf:
 node1 slots=2
 node2 slots=2

 mpirun  --rankfile rankf --hostfile hostf  --host node1 -n 1 hostname :
 --host node2 -n 1 hostname

 Error, invalid rank (1) in the rankfile (rankf)


 --
 [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
 rmaps_rank_file.c at line 403
 [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
 base/rmaps_base_map_job.c at line 86
 [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
 base/plm_base_launch_support.c at line 86
 [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
 plm_rsh_module.c at line 1016


 Ralph, could you tell me if my command syntax is correct or not ? if
 not, give me the expected one ?

 Regards

 Geoffroy




 2009/4/30 Geoffroy Pignot 

> Immediately Sir !!! :)
>
> Thanks again Ralph
>
> Geoffroy
>
>
>
>>
>>
>> --
>>
>> Message: 2
>> Date: Thu, 30 Apr 2009 06:45:39 -0600
>> From: Ralph Castain 
>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> To: Open MPI Users 
>> Message-ID:
>><71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> I believe this is fixed now in our development trunk - you can
>> download any
>> tarball starting from last night and give it a try, if you like. Any
>> feedback would be appreciated.
>>
>> Ralph
>>
>>
>> On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
>>
>> Ah now, I didn't say it -worked-, did I? :-)
>>
>> Clearly a bug exists in the program. I'll try to take a look at it (if
>> Lenny
>> doesn't get to it first), but it won't be until later in the week.
>>
>> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>>
>> I agree with you Ralph , and that 's what I expect from openmpi but my
>> second example shows that it's not working
>>
>> cat hostfile.0
>>   r011n002 slots=4
>>   r011n003 slots=4
>>
>>  cat rankfile.0
>>rank 0=r011n002 slot=0
>>rank 1=r011n003 slot=1
>>
>> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
>> hostname
>> ### CRASHED
>>
>> > > Error, invalid rank (1) in the rankfile (rankfile.0)
>> > >
>> >
>> --
>> > > [r011n002:25129] 

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-05-05 Thread Lenny Verkhovsky
According to the code, it does care.

$vi orte/mca/rmaps/rank_file/rmaps_rank_file.c +572

ival = orte_rmaps_rank_file_value.ival;
if ( ival > (np-1) ) {
    orte_show_help("help-rmaps_rank_file.txt", "bad-rankfile", true, ival,
                   rankfile);
    rc = ORTE_ERR_BAD_PARAM;
    goto unlock;
}

If I remember correctly, I used an array to map ranks, and since the length
of the array is NP, the maximum index must be less than NP, so if you have a
rank number > NP, there is no place to put it inside the array.

"Likewise, if you have more procs than the rankfile specifies, we map the
additional procs either byslot (default) or bynode (if you specify that
option). So the rankfile doesn't need to contain an entry for every proc."
 - Correct point.

Lenny.

On 5/5/09, Ralph Castain  wrote:
>
> Sorry Lenny, but that isn't correct. The rankfile mapper doesn't care if
> the rankfile contains additional info - it only maps up to the number of
> processes, and ignores anything beyond that number. So there is no need to
> remove the additional info.
>
> Likewise, if you have more procs than the rankfile specifies, we map the
> additional procs either byslot (default) or bynode (if you specify that
> option). So the rankfile doesn't need to contain an entry for every proc.
>
> Just don't want to confuse folks.
> Ralph
>
>
>
> On Tue, May 5, 2009 at 5:59 AM, Lenny Verkhovsky <
> lenny.verkhov...@gmail.com> wrote:
>
>> Hi,
>> maximum rank number must be less than np.
>> if np=1 then there is only rank 0 in the system, so rank 1 is invalid.
>> please remove "rank 1=node2 slot=*" from the rankfile
>> Best regards,
>> Lenny.
>>
>> On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot wrote:
>>
>>> Hi ,
>>>
>>> I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my command doesn't work
>>>
>>> cat rankf:
>>> rank 0=node1 slot=*
>>> rank 1=node2 slot=*
>>>
>>> cat hostf:
>>> node1 slots=2
>>> node2 slots=2
>>>
>>> mpirun  --rankfile rankf --hostfile hostf  --host node1 -n 1 hostname :
>>> --host node2 -n 1 hostname
>>>
>>> Error, invalid rank (1) in the rankfile (rankf)
>>>
>>>
>>> --
>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> rmaps_rank_file.c at line 403
>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> base/rmaps_base_map_job.c at line 86
>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> base/plm_base_launch_support.c at line 86
>>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> plm_rsh_module.c at line 1016
>>>
>>>
>>> Ralph, could you tell me if my command syntax is correct or not ? if not,
>>> give me the expected one ?
>>>
>>> Regards
>>>
>>> Geoffroy
>>>
>>>
>>>
>>>
>>> 2009/4/30 Geoffroy Pignot 
>>>
 Immediately Sir !!! :)

 Thanks again Ralph

 Geoffroy



>
>
> --
>
> Message: 2
> Date: Thu, 30 Apr 2009 06:45:39 -0600
> From: Ralph Castain 
> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
> To: Open MPI Users 
> Message-ID:
><71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com>
> Content-Type: text/plain; charset="iso-8859-1"
>
> I believe this is fixed now in our development trunk - you can download
> any
> tarball starting from last night and give it a try, if you like. Any
> feedback would be appreciated.
>
> Ralph
>
>
> On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
>
> Ah now, I didn't say it -worked-, did I? :-)
>
> Clearly a bug exists in the program. I'll try to take a look at it (if
> Lenny
> doesn't get to it first), but it won't be until later in the week.
>
> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>
> I agree with you Ralph , and that 's what I expect from openmpi but my
> second example shows that it's not working
>
> cat hostfile.0
>   r011n002 slots=4
>   r011n003 slots=4
>
>  cat rankfile.0
>rank 0=r011n002 slot=0
>rank 1=r011n003 slot=1
>
> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
> hostname
> ### CRASHED
>
> > > Error, invalid rank (1) in the rankfile (rankfile.0)
> > >
> >
> --
> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> file
> > > rmaps_rank_file.c at line 404
> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> file
> > > base/rmaps_base_map_job.c at line 87
> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in
> file
> > > 

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-05-05 Thread Ralph Castain
Sorry Lenny, but that isn't correct. The rankfile mapper doesn't care if the
rankfile contains additional info - it only maps up to the number of
processes, and ignores anything beyond that number. So there is no need to
remove the additional info.

Likewise, if you have more procs than the rankfile specifies, we map the
additional procs either byslot (default) or bynode (if you specify that
option). So the rankfile doesn't need to contain an entry for every proc.
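
For example (a sketch; the node names and executable are illustrative), a
two-entry rankfile used with -n 4 pins ranks 0 and 1, and ranks 2 and 3
are then mapped byslot:

  cat rankf
  rank 0=node1 slot=0
  rank 1=node2 slot=1

  mpirun -rf rankf --hostfile hostf -n 4 ./a.out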

Just don't want to confuse folks.
Ralph



On Tue, May 5, 2009 at 5:59 AM, Lenny Verkhovsky  wrote:

> Hi,
> maximum rank number must be less than np.
> if np=1 then there is only rank 0 in the system, so rank 1 is invalid.
> please remove "rank 1=node2 slot=*" from the rankfile
> Best regards,
> Lenny.
>
> On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot wrote:
>
>> Hi ,
>>
>> I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my command doesn't work
>>
>> cat rankf:
>> rank 0=node1 slot=*
>> rank 1=node2 slot=*
>>
>> cat hostf:
>> node1 slots=2
>> node2 slots=2
>>
>> mpirun  --rankfile rankf --hostfile hostf  --host node1 -n 1 hostname :
>> --host node2 -n 1 hostname
>>
>> Error, invalid rank (1) in the rankfile (rankf)
>>
>> --
>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> rmaps_rank_file.c at line 403
>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> base/rmaps_base_map_job.c at line 86
>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> base/plm_base_launch_support.c at line 86
>> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
>> plm_rsh_module.c at line 1016
>>
>>
>> Ralph, could you tell me if my command syntax is correct or not ? if not,
>> give me the expected one ?
>>
>> Regards
>>
>> Geoffroy
>>
>>
>>
>>
>> 2009/4/30 Geoffroy Pignot 
>>
>>> Immediately Sir !!! :)
>>>
>>> Thanks again Ralph
>>>
>>> Geoffroy
>>>
>>>
>>>


 --

 Message: 2
 Date: Thu, 30 Apr 2009 06:45:39 -0600
 From: Ralph Castain 
 Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
 To: Open MPI Users 
 Message-ID:
<71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com>
 Content-Type: text/plain; charset="iso-8859-1"

 I believe this is fixed now in our development trunk - you can download
 any
 tarball starting from last night and give it a try, if you like. Any
 feedback would be appreciated.

 Ralph


 On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:

 Ah now, I didn't say it -worked-, did I? :-)

 Clearly a bug exists in the program. I'll try to take a look at it (if
 Lenny
 doesn't get to it first), but it won't be until later in the week.

 On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:

 I agree with you Ralph , and that 's what I expect from openmpi but my
 second example shows that it's not working

 cat hostfile.0
   r011n002 slots=4
   r011n003 slots=4

  cat rankfile.0
rank 0=r011n002 slot=0
rank 1=r011n003 slot=1

 mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1
 hostname
 ### CRASHED

 > > Error, invalid rank (1) in the rankfile (rankfile.0)
 > >
 >
 --
 > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
 > > rmaps_rank_file.c at line 404
 > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
 > > base/rmaps_base_map_job.c at line 87
 > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
 > > base/plm_base_launch_support.c at line 77
 > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
 > > plm_rsh_module.c at line 985
 > >
 >
 --
 > > A daemon (pid unknown) died unexpectedly on signal 1  while
 > attempting to
 > > launch so we are aborting.
 > >
 > > There may be more information reported by the environment (see
 > above).
 > >
 > > This may be because the daemon was unable to find all the needed
 > shared
 > > libraries on the remote node. You may set your LD_LIBRARY_PATH to
 > have the
 > > location of the shared libraries on the remote nodes and this will
 > > automatically be forwarded to the remote nodes.
 > >
 >
 --
 > >
 >
 --
 

Re: [OMPI users] LSF launch with OpenMPI

2009-05-05 Thread Jeff Squyres

On May 5, 2009, at 6:10 AM, Matthieu Brucher wrote:


The first is what the support of LSF by OpenMPI means. When mpirun is
executed, is it an LSF job that is actually run? Or what does it
imply? I've tried to search on the openmpi website as well as on the
internet, but I couldn't find a clear answer/use case.



What Terry said is correct.  It means that "mpirun" will use, under  
the covers, the "native" launching mechanism of LSF to launch jobs  
(vs., say, rsh or ssh).  It'll also discover the hosts to use for this  
job without the use of a hostfile -- it'll query LSF directly to see  
what hosts it should use.



My second question is about the LSF detection. lsf.h is detected, but
when lsb_launch is searched for in libbat.so, it fails because
parse_time and parse_time_ex are not found. Is there a way to add
additional lsf libraries so that the search can be done?




Can you send all the data shown here:

http://www.open-mpi.org/community/help/

--
Jeff Squyres
Cisco Systems



Re: [OMPI users] LSF launch with OpenMPI

2009-05-05 Thread Terry Frankcombe
On Tue, 2009-05-05 at 12:10 +0200, Matthieu Brucher wrote:
> Hello,
> 
> I have two questions, in fact.
> 
> The first is what the support of LSF by OpenMPI means. When mpirun is
> executed, is it an LSF job that is actually run? Or what does it
> imply? I've tried to search on the openmpi website as well as on the
> internet, but I couldn't find a clear answer/use case.

Hi Matthieu

I think it's fair to say that if "batch system XYZ" is supported, then
in a job script submitted to that batch system you can issue an mpirun
command without manually specifying numbers of processes, hostnames,
launch protocols, etc.  They're all picked up using the mechanisms of
the batch system.
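
A minimal sketch of such a job script (assuming LSF's #BSUB directive
syntax; the queue name and slot count are illustrative):

  #!/bin/sh
  #BSUB -n 8          # ask LSF for 8 slots
  #BSUB -q normal     # hypothetical queue name
  # mpirun picks up the hosts and process count from LSF itself:
  mpirun ./my_mpi_app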

If LSF has any peculiarities, someone will point them out, I'm sure.

Configuring for LSF I can't help you with.

Ciao




Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-05-05 Thread Lenny Verkhovsky
Hi,
maximum rank number must be less than np.
if np=1 then there is only rank 0 in the system, so rank 1 is invalid.
please remove "rank 1=node2 slot=*" from the rankfile
Best regards,
Lenny.

On Mon, May 4, 2009 at 11:14 AM, Geoffroy Pignot wrote:

> Hi ,
>
> I got the openmpi-1.4a1r21095.tar.gz tarball, but unfortunately my command doesn't work
>
> cat rankf:
> rank 0=node1 slot=*
> rank 1=node2 slot=*
>
> cat hostf:
> node1 slots=2
> node2 slots=2
>
> mpirun  --rankfile rankf --hostfile hostf  --host node1 -n 1 hostname :
> --host node2 -n 1 hostname
>
> Error, invalid rank (1) in the rankfile (rankf)
>
> --
> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
> rmaps_rank_file.c at line 403
> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
> base/rmaps_base_map_job.c at line 86
> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
> base/plm_base_launch_support.c at line 86
> [r011n006:28986] [[45541,0],0] ORTE_ERROR_LOG: Bad parameter in file
> plm_rsh_module.c at line 1016
>
>
> Ralph, could you tell me if my command syntax is correct or not ? if not,
> give me the expected one ?
>
> Regards
>
> Geoffroy
>
>
>
>
> 2009/4/30 Geoffroy Pignot 
>
>> Immediately Sir !!! :)
>>
>> Thanks again Ralph
>>
>> Geoffroy
>>
>>
>>
>>>
>>>
>>> --
>>>
>>> Message: 2
>>> Date: Thu, 30 Apr 2009 06:45:39 -0600
>>> From: Ralph Castain 
>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>> To: Open MPI Users 
>>> Message-ID:
>>><71d2d8cc0904300545v61a42fe1k50086d2704d0f...@mail.gmail.com>
>>> Content-Type: text/plain; charset="iso-8859-1"
>>>
>>> I believe this is fixed now in our development trunk - you can download
>>> any
>>> tarball starting from last night and give it a try, if you like. Any
>>> feedback would be appreciated.
>>>
>>> Ralph
>>>
>>>
>>> On Apr 14, 2009, at 7:57 AM, Ralph Castain wrote:
>>>
>>> Ah now, I didn't say it -worked-, did I? :-)
>>>
>>> Clearly a bug exists in the program. I'll try to take a look at it (if
>>> Lenny
>>> doesn't get to it first), but it won't be until later in the week.
>>>
>>> On Apr 14, 2009, at 7:18 AM, Geoffroy Pignot wrote:
>>>
>>> I agree with you Ralph , and that 's what I expect from openmpi but my
>>> second example shows that it's not working
>>>
>>> cat hostfile.0
>>>   r011n002 slots=4
>>>   r011n003 slots=4
>>>
>>>  cat rankfile.0
>>>rank 0=r011n002 slot=0
>>>rank 1=r011n003 slot=1
>>>
>>> mpirun --hostfile hostfile.0 -rf rankfile.0 -n 1 hostname : -n 1 hostname
>>> ### CRASHED
>>>
>>> > > Error, invalid rank (1) in the rankfile (rankfile.0)
>>> > >
>>> >
>>> --
>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > rmaps_rank_file.c at line 404
>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > base/rmaps_base_map_job.c at line 87
>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > base/plm_base_launch_support.c at line 77
>>> > > [r011n002:25129] [[63976,0],0] ORTE_ERROR_LOG: Bad parameter in file
>>> > > plm_rsh_module.c at line 985
>>> > >
>>> >
>>> --
>>> > > A daemon (pid unknown) died unexpectedly on signal 1  while
>>> > attempting to
>>> > > launch so we are aborting.
>>> > >
>>> > > There may be more information reported by the environment (see
>>> > above).
>>> > >
>>> > > This may be because the daemon was unable to find all the needed
>>> > shared
>>> > > libraries on the remote node. You may set your LD_LIBRARY_PATH to
>>> > have the
>>> > > location of the shared libraries on the remote nodes and this will
>>> > > automatically be forwarded to the remote nodes.
>>> > >
>>> >
>>> --
>>> > >
>>> >
>>> --
>>> > > orterun noticed that the job aborted, but has no info as to the
>>> > process
>>> > > that caused that situation.
>>> > >
>>> >
>>> --
>>> > > orterun: clean termination accomplished
>>>
>>>
>>>
>>> Message: 4
>>> Date: Tue, 14 Apr 2009 06:55:58 -0600
>>> From: Ralph Castain 
>>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>>> To: Open MPI Users 
>>> Message-ID: 
>>> Content-Type: text/plain; charset="us-ascii"; Format="flowed";
>>>   DelSp="yes"
>>>
>>> The rankfile cuts across the entire job - it isn't applied on an
>>> app_context basis. So the ranks in your 

Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-05-05 Thread Ralph Castain


On May 5, 2009, at 3:37 AM, Geoffroy Pignot wrote:


Hi

 The result is : everything works fine with MPI executables :  
logical !!!



What I was trying to do , was to run non MPI exes thanks to mpirun.  
There , openmpi is not able to bind these processes to a particular  
CPU.

My conclusion is that the process affinity is set in MPI_Init, right ?


Yes - sorry, I should have caught that in your cmd line. Not enough  
sleep lately... :-)





Could it be possible to have the paffinity features working without  
any MPI_Init call, using taskset for example. I agree , it's not  
your job to support the execution of any kind of exes but it would  
be nice !!


Actually, it is worth the question. As things stand, processes don't  
bind until they call MPI_Init. This has caused some problems for  
people that rely on the procs to be restricted to specific processor  
sets, but who don't (for various reasons) call MPI_Init at the  
beginning of their program.


I'll raise the question inside the devel community and see what people  
think.




Thanks again for all your efforts, I really appreciate


No problem! Thanks for your patience while debugging this...





I am looking forward to downloading, trying and deploying the next  
official release


Regards

Geoffroy



2009/5/4 Geoffroy Pignot 
Hi Ralph

Thanks for your extra tests.  Before leaving, I just pointed out a  
problem coming from running plpa across different RH distribs (<=>  
different Linux kernels). Indeed, I configure and compile openmpi on  
rhel4, then I run on rhel5. I think my problem comes from this  
approximation. I'll do a few more tests tomorrow morning (France) and  
keep you informed.


Regards

Geoffroy



Message: 2
Date: Mon, 4 May 2009 13:34:40 -0600

From: Ralph Castain 
Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
To: Open MPI Users 
Message-ID:
   <71d2d8cc0905041234m76eb5a9dx57a773997779d...@mail.gmail.com>

Content-Type: text/plain; charset="iso-8859-1"

Hmmm...I'm afraid I can't replicate the problem. All seems to be  
working
just fine on the RHEL systems available to me. The procs indeed bind  
to the

specified processors in every case.

rhc@odin ~/trunk]$ cat rankfile
rank 0=odin001 slot=0
rank 1=odin002 slot=1

[rhc@odin mpi]$ mpirun -rf ../../../rankfile -n 2 --leave-session- 
attached

-mca paffinity_base_verbose 5 ./mpi_spin
[odin001.cs.indiana.edu:09297 ]
paffinity slot assignment: slot_list == 0
[odin001.cs.indiana.edu:09297 ]
paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[odin002.cs.indiana.edu:13566] paffinity slot assignment: slot_list  
== 1
[odin002.cs.indiana.edu:13566] paffinity slot assignment: rank 1  
runs on cpu

#1 (#1)

Suspended
[rhc@odin mpi]$ ssh odin001
[rhc@odin001 ~]$ ps axo stat,user,psr,pid,pcpu,comm | grep rhc
Srhc0  9296  0.0 orted
RLl  rhc0  9297  100 mpi_spin

[rhc@odin mpi]$ ssh odin002
[rhc@odin002 ~]$ ps axo stat,user,psr,pid,pcpu,comm | grep rhc
Srhc0 13562  0.0 orted
RLl  rhc1 13566  102 mpi_spin


Not sure where to go from here...perhaps someone else can spot the  
problem?

Ralph


On Mon, May 4, 2009 at 8:28 AM, Ralph Castain   
wrote:


> Unfortunately, I didn't write any of that code - I was just fixing  
the
> mapper so it would properly map the procs. From what I can tell,  
the proper

> things are happening there.
>
> I'll have to dig into the code that specifically deals with  
parsing the
> results to bind the processes. Afraid that will take awhile longer  
- pretty

> dark in that hole.
>
>
>
> On Mon, May 4, 2009 at 8:04 AM, Geoffroy Pignot  
wrote:


>
>> Hi,
>>
>> So, there are no more crashes with my "crazy" mpirun command. But  
the
>> paffinity feature seems to be broken. Indeed I am not able to pin  
my

>> processes.
>>
>> Simple test with a program using your plpa library :
>>
>> r011n006% cat hostf
>> r011n006 slots=4
>>
>> r011n006% cat rankf
>> rank 0=r011n006 slot=0   > bind to CPU 0 , exact ?
>>
>> r011n006% /tmp/HALMPI/openmpi-1.4a/bin/mpirun --hostfile hostf -- 
rankfile

>> rankf --wdir /tmp -n 1 a.out
>>  >>> PLPA Number of processors online: 4
>>  >>> PLPA Number of processor sockets: 2
>>  >>> PLPA Socket 0 (ID 0): 2 cores
>>  >>> PLPA Socket 1 (ID 3): 2 cores
>>
>> Ctrl+Z
>> r011n006%bg
>>
>> r011n006% ps axo stat,user,psr,pid,pcpu,comm | grep gpignot
>> R+   gpignot3  9271 97.8 a.out
>>
>> In fact whatever the slot number I put in my rankfile , a.out  
always runs
>> on the CPU 3. I was looking for it on CPU 0 according to my  
cpuinfo file

>> (see below)
>> The result is the same if I try another syntax (rank 0=r011n006  
slot=0:0

>> bind to socket 0 - core 0  , exact ? )
>>
>> Thanks in advance
>>
>> Geoffroy
>>
>> PS: I run on rhel5
>>
>> r011n006% uname -a
>> Linux r011n006 

[OMPI users] LSF launch with OpenMPI

2009-05-05 Thread Matthieu Brucher
Hello,

I have two questions, in fact.

The first is what the support of LSF by OpenMPI means. When mpirun is
executed, is it an LSF job that is actually run? Or what does it
imply? I've tried to search on the openmpi website as well as on the
internet, but I couldn't find a clear answer/use case.

My second question is about the LSF detection. lsf.h is detected, but
when lsb_launch is searched for in libbat.so, it fails because
parse_time and parse_time_ex are not found. Is there a way to add
additional lsf libraries so that the search can be done?

Matthieu Brucher
-- 
Information System Engineer, Ph.D.
Website: http://matthieu-brucher.developpez.com/
Blogs: http://matt.eifelle.com and http://blog.developpez.com/?blog=92
LinkedIn: http://www.linkedin.com/in/matthieubrucher


Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-05-05 Thread Geoffroy Pignot
Hi

 The result is: everything works fine with MPI executables: logical !!!

What I was trying to do was to run non-MPI exes via mpirun. There,
openmpi is not able to bind these processes to a particular CPU.
My conclusion is that the process affinity is set in MPI_Init, right?

Could it be possible to have the paffinity features working without any
MPI_Init call, using taskset for example (see the sketch below)? I agree,
it's not your job to support the execution of any kind of exes, but it
would be nice !!
Thanks again for all your efforts, I really appreciate it.
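
A minimal sketch of such a wrapper (assuming Linux taskset and Open MPI's
OMPI_COMM_WORLD_RANK environment variable; a.out stands in for any non-MPI
executable):

  #!/bin/sh
  # Pin each launched process to the CPU matching its rank, then run it:
  exec taskset -c $OMPI_COMM_WORLD_RANK ./a.out

launched as, e.g., "mpirun -n 4 ./wrap.sh".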

I am looking forward to downloading, trying and deploying the next official
release

Regards

Geoffroy



2009/5/4 Geoffroy Pignot 

> Hi Ralph
>
> Thanks for your extra tests.  Before leaving, I just pointed out a problem
> coming from running plpa across different RH distribs (<=> different Linux
> kernels). Indeed, I configure and compile openmpi on rhel4, then I run on
> rhel5. I think my problem comes from this approximation. I'll do a few more
> tests tomorrow morning (France) and keep you informed.
>
> Regards
>
> Geoffroy
>
>
>>
>> Message: 2
>> Date: Mon, 4 May 2009 13:34:40 -0600
>> From: Ralph Castain 
>> Subject: Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??
>> To: Open MPI Users 
>> Message-ID:
>><71d2d8cc0905041234m76eb5a9dx57a773997779d...@mail.gmail.com>
>> Content-Type: text/plain; charset="iso-8859-1"
>>
>> Hmmm...I'm afraid I can't replicate the problem. All seems to be working
>> just fine on the RHEL systems available to me. The procs indeed bind to
>> the
>> specified processors in every case.
>>
>> rhc@odin ~/trunk]$ cat rankfile
>> rank 0=odin001 slot=0
>> rank 1=odin002 slot=1
>>
>> [rhc@odin mpi]$ mpirun -rf ../../../rankfile -n 2
>> --leave-session-attached
>> -mca paffinity_base_verbose 5 ./mpi_spin
>> [odin001.cs.indiana.edu:09297 ]
>> paffinity slot assignment: slot_list == 0
>> [odin001.cs.indiana.edu:09297 ]
>> paffinity slot assignment: rank 0 runs on cpu #0 (#0)
>> [odin002.cs.indiana.edu:13566] paffinity slot assignment: slot_list == 1
>> [odin002.cs.indiana.edu:13566] paffinity slot assignment: rank 1 runs on
>> cpu
>> #1 (#1)
>>
>> Suspended
>> [rhc@odin mpi]$ ssh odin001
>> [rhc@odin001 ~]$ ps axo stat,user,psr,pid,pcpu,comm | grep rhc
>> Srhc0  9296  0.0 orted
>> RLl  rhc0  9297  100 mpi_spin
>>
>> [rhc@odin mpi]$ ssh odin002
>> [rhc@odin002 ~]$ ps axo stat,user,psr,pid,pcpu,comm | grep rhc
>> Srhc0 13562  0.0 orted
>> RLl  rhc1 13566  102 mpi_spin
>>
>>
>> Not sure where to go from here...perhaps someone else can spot the
>> problem?
>> Ralph
>>
>>
>> On Mon, May 4, 2009 at 8:28 AM, Ralph Castain  wrote:
>>
>> > Unfortunately, I didn't write any of that code - I was just fixing the
>> > mapper so it would properly map the procs. From what I can tell, the
>> proper
>> > things are happening there.
>> >
>> > I'll have to dig into the code that specifically deals with parsing the
>> > results to bind the processes. Afraid that will take awhile longer -
>> pretty
>> > dark in that hole.
>> >
>> >
>> >
> On Mon, May 4, 2009 at 8:04 AM, Geoffroy Pignot wrote:
>>
>> >
>> >> Hi,
>> >>
>> >> So, there are no more crashes with my "crazy" mpirun command. But the
>> >> paffinity feature seems to be broken. Indeed I am not able to pin my
>> >> processes.
>> >>
>> >> Simple test with a program using your plpa library :
>> >>
>> >> r011n006% cat hostf
>> >> r011n006 slots=4
>> >>
>> >> r011n006% cat rankf
>> >> rank 0=r011n006 slot=0   > bind to CPU 0 , exact ?
>> >>
>> >> r011n006% /tmp/HALMPI/openmpi-1.4a/bin/mpirun --hostfile hostf
>> --rankfile
>> >> rankf --wdir /tmp -n 1 a.out
>> >>  >>> PLPA Number of processors online: 4
>> >>  >>> PLPA Number of processor sockets: 2
>> >>  >>> PLPA Socket 0 (ID 0): 2 cores
>> >>  >>> PLPA Socket 1 (ID 3): 2 cores
>> >>
>> >> Ctrl+Z
>> >> r011n006%bg
>> >>
>> >> r011n006% ps axo stat,user,psr,pid,pcpu,comm | grep gpignot
>> >> R+   gpignot3  9271 97.8 a.out
>> >>
>> >> In fact whatever the slot number I put in my rankfile , a.out always
>> runs
>> >> on the CPU 3. I was looking for it on CPU 0 according to my cpuinfo
>> file
>> >> (see below)
>> >> The result is the same if I try another syntax (rank 0=r011n006
>> slot=0:0
>> >> bind to socket 0 - core 0  , exact ? )
>> >>
>> >> Thanks in advance
>> >>
>> >> Geoffroy
>> >>
>> >> PS: I run on rhel5
>> >>
>> >> r011n006% uname -a
>> >> Linux r011n006 2.6.18-92.1.1NOMAP32.el5 #1 SMP Sat Mar 15 01:46:39 CDT
>> >> 2008 x86_64 x86_64 x86_64 GNU/Linux
>> >>
>> >> My configure is :
>> >>  ./configure --prefix=/tmp/openmpi-1.4a --libdir='${exec_prefix}/lib64'
>> >> --disable-dlopen --disable-mpi-cxx --enable-heterogeneous
>> >>
>> >>
>> >> r011n006% cat /proc/cpuinfo
>> >> processor   : 0
>> >> vendor_id   : 

Re: [OMPI users] users Digest, Vol 1217, Issue 2, Message 3

2009-05-05 Thread Pavel Shamis (Pasha)

Jan,
I guess that you have the OFED driver installed on your machines. You can do 
basic network verification with the ibdiagnet utility 
(http://linux.die.net/man/1/ibdiagnet), which is part of the OFED installation. 
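
For example (tools from the OFED diagnostics packages; output details vary
by version):

  ibdiagnet        # scan the fabric and report bad links/ports and error counters
  ibcheckerrors    # summarize error counters across all fabric ports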


Regards,
Pasha


Jeff Squyres wrote:

On May 4, 2009, at 9:50 AM, jan wrote:


Thank you Jeff. I have passed the mail to the IB vendor Dell (the
blade was ordered from Dell Taiwan), but he told me that he didn't
understand "layer 0 diagnostics". Could you help us to get more
information on "layer 0 diagnostics"? Thanks again.



Layer 0 = your physical network layer.  Specifically: ensure that your 
IB network is actually functioning properly at both the physical and 
driver layer.  Cisco was an IB vendor for several years; I can tell 
you from experience that it is *not* enough to just plug everything in 
and run a few trivial tests to ensure that network traffic seems to be 
passed properly.  You need to have your vendor run a full set of layer 
0 diagnostics to ensure that all the cables are good, all the HCAs are 
good, all the drivers are functioning properly, etc.  This involves 
running diagnostic network testing patterns, checking various error 
counters on the HCAs and IB switches, etc.


This is something that Dell should know how to do.

I say all this because the problem that you are seeing *seems* to be a 
network-related problem, not an OMPI-related problem.  One can never 
know for sure, but it is fairly clear that the very first step in your 
case is to verify that the network is functioning 100% properly.  
FWIW: this was standard operating procedure when Cisco was selling IB 
hardware.