Re: [OMPI users] MPI_File_Write

2011-11-29 Thread Rob Latham
On Wed, Nov 16, 2011 at 03:52:05PM +, Kharche, Sanjay wrote:
> 
> Dear All
> 
> I am sure this issue has been discussed before on the forum, but I will 
> appreciate your comments.
> 
> I have a package that tries to do parallel file output using MPI_File_write:
> 
> /*  Write to file. */
> mpi_errno = MPI_File_write(file, New, 1, sourceType, MPI_STATUS_IGNORE);
> 
> With an increasing number of processors, I see that this causes the file 
> output to take longer. Can someone suggest a solution?

Think a bit about what adding more processors will do.  Each MPI
process will write 1 sourceType to the file.   More processors will
write more data.

I don't know how your program creates sourceType, nor do I know the
file view (if any) it has placed on the output, so maybe you will need
to show more code.

I hope you are setting a file view here, or each processor will end up
writing the same data to the same location in the file.  If you
duplicate the work identically to N processors then yeah, you will
take N times longer.
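
To illustrate the file-view point, here is a minimal sketch of the offset arithmetic, written in Python rather than as MPI code.  It assumes every rank writes one contiguous block of the same size; the names rank_displacement, regions and overlaps are made up for this example.  In a real program the computed displacement would be what you pass as the disp argument of MPI_File_set_view.

```python
def rank_displacement(rank: int, block_bytes: int) -> int:
    """Byte offset where `rank` starts writing, assuming equal-sized contiguous blocks."""
    if rank < 0 or block_bytes < 0:
        raise ValueError("rank and block_bytes must be non-negative")
    return rank * block_bytes


def regions(nprocs: int, block_bytes: int) -> list:
    """(start, end) byte ranges written by ranks 0..nprocs-1 when each has its own view."""
    return [
        (rank_displacement(r, block_bytes), rank_displacement(r, block_bytes) + block_bytes)
        for r in range(nprocs)
    ]


def overlaps(byte_ranges: list) -> bool:
    """True if any two half-open (start, end) ranges share at least one byte."""
    ordered = sorted(byte_ranges)
    return any(prev_end > start for (_, prev_end), (start, _) in zip(ordered, ordered[1:]))


# With per-rank displacements the ranges are disjoint.
print(overlaps(regions(4, 1024)))
# Without a file view, every rank targets the same range.
print(overlaps([(0, 1024)] * 4))
```

If the blocks differ in size between ranks, the displacement would instead be the sum of the sizes of all lower-ranked blocks; that case is not covered by this sketch.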

==rob

-- 
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


Re: [OMPI users] problem with fortran, MPI_REDUCE and MPI_IN_PLACE

2011-11-29 Thread Jeff Squyres
Ask and you shall receive!

I got a tip from the MPICH2 guys about how they handle this stuff; it seems 
that the magic gfortran compiler flag is -Wl,-commons,use_dylibs.  Thanks Dave 
Goodell!

I will commit this to the OMPI SVN trunk tonight (because it's an 
autotools-level change, which we try not to do during the workday), and will 
file tickets to get this change over to v1.4 and v1.5.

While you're waiting for a release with this fix, you can either manually add 
-Wl,-commons,use_dylibs to your mpif77/mpif90 command lines, or edit your 
$prefix/share/ompi/mpif77-wrapper-data.txt file (and mpif90-wrapper-data.txt 
file) to set the "compiler_flags" line to include -Wl,-commons,use_dylibs.  For 
example:

compiler_flags=-Wl,-commons,use_dylibs
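
If you would rather script the wrapper-data edit than do it by hand, something along these lines might work.  This is only a sketch: it assumes the file consists of plain key=value lines like the one shown above, and the helper name add_compiler_flag is invented for this example.  It works on the file's text, so reading and writing the actual file is left to you.

```python
FLAG = "-Wl,-commons,use_dylibs"
KEY = "compiler_flags="


def add_compiler_flag(wrapper_text: str, flag: str = FLAG) -> str:
    """Return wrapper_text with `flag` present on the compiler_flags line.

    Assumes simple key=value lines; a compiler_flags line is appended if
    none exists.  Existing flags on that line are kept.
    """
    out = []
    found = False
    for line in wrapper_text.splitlines():
        if line.startswith(KEY):
            found = True
            current = line[len(KEY):].split()
            if flag not in current:
                current.append(flag)
            line = KEY + " ".join(current)
        out.append(line)
    if not found:
        out.append(KEY + flag)
    return "\n".join(out) + "\n"


# Hypothetical file contents, for illustration only.
print(add_compiler_flag("compiler=gfortran\ncompiler_flags=\n"), end="")
```

Running it twice on the same text leaves the line unchanged the second time, so it should be safe to re-run after an upgrade.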

Woo hoo!



On Nov 28, 2011, at 8:11 PM, Jeff Squyres wrote:

> Unfortunately, this is a known issue.  :-\
> 
> I have not found a reliable way to deduce that MPI_IN_PLACE has been passed 
> as the parameter to MPI_REDUCE (and friends) on OS X.  There's something very 
> strange going on with regards to the Fortran compiler and common block 
> variables (which is where we have MPI_IN_PLACE and other sentinel-value MPI 
> constants defined).
> 
> We have a very old ticket open on this issue:
> 
>     https://svn.open-mpi.org/trac/ompi/ticket/1982
> 
> Any suggestions would be welcome.  :-\
> 
> 
> On Nov 23, 2011, at 1:20 PM, Arjen van Elteren wrote:
> 
>> Dear All,
>> 
>> I'm running a complex program with a number of MPI_REDUCE calls, every call 
>> uses MPI_IN_PLACE as the first parameter (the send buffer).
>> 
>> I'm currently testing this program on Mac 10.6 with macports installed.
>> 
>> Unfortunately, all MPI_REDUCE calls with MPI_IN_PLACE seem to fail.
>> 
>> I've pinpointed the problem to the position of the MPI_IN_PLACE parameter: it
>> seems to matter whether it is the first or the second parameter to the
>> MPI_REDUCE call.
>> 
>> This is specific to Fortran; in C the order does not matter!
>> 
>> A simple program to test this:
>> 
>> PROGRAM MAIN
>>  implicit none
>>  include 'mpif.h'
>>  integer :: x(10)
>>  integer :: provided,ioerror
>>  call MPI_INIT(ioerror)
>>  x = 1
>> 
>>  print *, x
>>  call MPI_REDUCE(x, MPI_IN_PLACE, 10, MPI_INTEGER, MPI_SUM, 0, &
>>                  MPI_COMM_WORLD, ioerror)
>>  print *, x
>>  call MPI_REDUCE(MPI_IN_PLACE, x, 10, MPI_INTEGER, MPI_SUM, 0, &
>>                  MPI_COMM_WORLD, ioerror)
>>  print *, x
>> 
>>  call MPI_FINALIZE(ioerror)
>> END PROGRAM
>> 
>> I run this on one process (mpiexec ./a.out)
>> 
>> I'm running with openmpi version 1.5.4 (macports)
>> 
>> The openmpi is compiled with gfortran 4.4.6
>> 
>> Is this a bug in openmpi or is my understanding of MPI_REDUCE wrong?
>> 
>> Best regards,
>> 
>> Arjen
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] GPU/NIC/CPU locality

2011-11-29 Thread Brice Goglin

> Hwloc optional build support status (more details can be found above):
>
> Probe / display PCI devices: yes
> Graphical output (Cairo):    yes
> XML output:                  full

"XML output" should be "XML input/output" or "XML support".

> Memory support:              binding, set policy, migrate pages

Looks ok otherwise.

Brice



Re: [hwloc-users] GPU/NIC/CPU locality

2011-11-29 Thread Jeff Squyres
On Nov 29, 2011, at 12:01 PM, Brice Goglin wrote:

> Yes, always installed. There are some configure checks for verbs, but
> it's only used for enabling verbs-related helper testing.

Ok, how's this for output at the end of configure? 

Linux:

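-----------------------------------------------------------------------------
Hwloc optional build support status (more details can be found above):

Probe / display PCI devices: yes
Graphical output (Cairo):    yes
XML output:                  full
Memory support:              binding, set policy, migrate pages
-----------------------------------------------------------------------------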

OS X:

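-----------------------------------------------------------------------------
Hwloc optional build support status (more details can be found above):

Probe / display PCI devices: no
Graphical output (Cairo):    yes
XML output:                  full
Memory support:              none
-----------------------------------------------------------------------------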

XML support will show "basic" if libxml2 is not found.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [hwloc-users] GPU/NIC/CPU locality

2011-11-29 Thread Jeff Squyres
On Nov 29, 2011, at 11:53 AM, Brice Goglin wrote:

>> What about MX, verbs, Cuda, ...?
> 
> MX and verbs are not used internally, we just have public helpers to
> interoperate with them (and tests).

I forget -- are the helpers installed/available even if the MX 
headers/libraries are not found at configure time?  (ditto for verbs, cuda, 
etc.)

> Same for cuda in trunk (until Samuel's cuda branch gets merged).
> 
> Brice
> 
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Program hangs in mpi_bcast

2011-11-29 Thread Jeff Squyres
That's quite weird/surprising that you would need to set it down to *5* -- 
that's really low.

Can you share a simple reproducer code, perchance?
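
To put the two values in perspective, here is a back-of-the-envelope sketch in Python.  It only models the help text quoted below ("Do a synchronization before each Nth collective") and assumes that means one barrier ahead of collectives number N, 2N, 3N and so on; the actual coll sync component may count differently.  The function name barriers_inserted is made up for this example.

```python
def barriers_inserted(num_collectives: int, barrier_before: int) -> int:
    """Number of extra barriers for a run of `num_collectives` collectives.

    Assumes one barrier before every `barrier_before`-th collective and
    that a value of 0 (or less) disables the feature.
    """
    if num_collectives < 0:
        raise ValueError("num_collectives must be non-negative")
    if barrier_before <= 0:
        return 0
    return num_collectives // barrier_before


# Compare the default (1000) with the value reported to work (5).
for n in (1000, 5):
    print(n, barriers_inserted(10_000, n))
```

Under that assumption, going from 1000 to 5 means 200 times as many barriers for the same number of bcast calls.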


On Nov 15, 2011, at 11:49 AM, Tom Rosmond wrote:

> Ralph,
> 
> Thanks for the advice.  I have to set 'coll_sync_barrier_before=5' to do
> the job.  This is a big change from the default value (1000), so our
> application seems to be a pretty extreme case.
> 
> T. Rosmond
> 
> 
> On Mon, 2011-11-14 at 16:17 -0700, Ralph Castain wrote:
>> Yes, this is well documented - may be on the FAQ, but certainly has been in 
>> the user list multiple times.
>> 
>> The problem is that one process falls behind, which causes it to begin 
>> accumulating "unexpected messages" in its queue. This causes the matching 
>> logic to run a little slower, thus making the process fall further and 
>> further behind. Eventually, things hang because everyone is sitting in bcast 
>> waiting for the slow proc to catch up, but its queue is saturated and it
>> can't.
>> 
>> The solution is to do exactly what you describe - add some barriers to force 
>> the slow process to catch up. This happened enough that we even added 
>> support for it in OMPI itself so you don't have to modify your code. Look at 
>> the following from "ompi_info --param coll sync"
>> 
>>   MCA coll: parameter "coll_base_verbose" (current value: <0>, data source: default value)
>>             Verbosity level for the coll framework (0 = no verbosity)
>>   MCA coll: parameter "coll_sync_priority" (current value: <50>, data source: default value)
>>             Priority of the sync coll component; only relevant if barrier_before or barrier_after is > 0
>>   MCA coll: parameter "coll_sync_barrier_before" (current value: <1000>, data source: default value)
>>             Do a synchronization before each Nth collective
>>   MCA coll: parameter "coll_sync_barrier_after" (current value: <0>, data source: default value)
>>             Do a synchronization after each Nth collective
>> 
>> Take your pick - inserting a barrier before or after doesn't seem to make a 
>> lot of difference, but most people use "before". Try different values until 
>> you get something that works for you.
>> 
>> 
>> On Nov 14, 2011, at 3:10 PM, Tom Rosmond wrote:
>> 
>>> Hello:
>>> 
>>> A colleague and I have been running a large F90 application that does an
>>> enormous number of mpi_bcast calls during execution.  I deny any
>>> responsibility for the design of the code and why it needs these calls,
>>> but it is what we have inherited and have to work with.
>>> 
>>> Recently we ported the code to an 8 node, 6 processor/node NUMA system
>>> (lstopo output attached) running Debian Linux 6.0.3 with Open MPI 1.5.3,
>>> and began having trouble with mysterious 'hangs' in the program inside
>>> the mpi_bcast calls.  The hangs were always in the same calls, but not
>>> necessarily at the same time during integration.  We originally didn't
>>> have NUMA support, so reinstalled with libnuma support added, but the
>>> problem persisted.  Finally, just as a wild guess, we inserted
>>> 'mpi_barrier' calls just before the 'mpi_bcast' calls, and the program
>>> now runs without problems.
>>> 
>>> I believe conventional wisdom is that properly formulated MPI programs
>>> should run correctly without barriers, so do you have any thoughts on
>>> why we found it necessary to add them?  The code has run correctly on
>>> other architectures, e.g. Cray XE6, so I don't think there is a bug
>>> anywhere.  My only explanation is that some internal resource gets
>>> exhausted because of the large number of 'mpi_bcast' calls in rapid
>>> succession, and the barrier calls force synchronization which allows the
>>> resource to be restored.  Does this make sense?  I'd appreciate any
>>> comments and advice you can provide.
>>> 
>>> 
>>> I have attached compressed copies of config.log and ompi_info for the
>>> system.  The program is built with ifort 12.0 and typically runs with 
>>> 
>>> mpirun -np 36 -bycore -bind-to-core program.exe
>>> 
>>> We have run both interactively and with PBS, but that doesn't seem to
>>> make any difference in program behavior.
>>> 
>>> T. Rosmond
>>> 
>>> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] Qlogic & openmpi

2011-11-29 Thread Jeff Squyres
On Nov 28, 2011, at 11:53 PM, arnaud Heritier wrote:

> I do have a contract and I tried to open a case, but their support is ..

What happens if you put a delay between the two jobs?  E.g., if you just delay 
a few seconds before the 2nd job starts?  Perhaps the ipath device just needs a 
little time before it will be available...?  (that's a total guess)

I suggest this because the PSM device will definitely give you better overall 
performance than the QLogic verbs support.  Their verbs support basically 
barely works -- PSM is their primary device and the one that we always 
recommend.
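
If you want to try the delay idea without hand-editing job scripts, a small retry loop in front of the second job might do.  This is a sketch only: it treats "the device can be opened read-only" as a stand-in for "PSM will be able to use it", which may well not hold, and the function name wait_for_device and its defaults are made up for this example.  The path to check would be the /dev/ipath from your error message.

```python
import os
import time


def wait_for_device(path: str, attempts: int = 5, delay: float = 2.0) -> bool:
    """Return True as soon as `path` can be opened, trying up to `attempts` times.

    Sleeps `delay` seconds between failed attempts.  Opening read-only is
    only a proxy for the device being usable by the MPI job.
    """
    for attempt in range(attempts):
        try:
            fd = os.open(path, os.O_RDONLY)
        except OSError:
            # Not available yet: wait, unless this was the last attempt.
            if attempt + 1 < attempts:
                time.sleep(delay)
            continue
        os.close(fd)
        return True
    return False


# Demonstration against a path that always exists on the local system.
print(wait_for_device(os.devnull, attempts=1, delay=0.0))
```

A job script could call this before mpirun and bail out (or fall back to another transport) if it returns False.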

> Anyway, I'm still working on the strange error message from mpirun saying it 
> can't allocate memory when at the same time it also reports that the memory 
> is unlimited ...
> 
> 
> Arnaud
> 
> On Tue, Nov 29, 2011 at 4:23 AM, Jeff Squyres  wrote:
> I'm afraid we don't have any contacts left at QLogic to ask them any more... 
> do you have a support contract, perchance?
> 
> On Nov 27, 2011, at 3:11 PM, Arnaud Heritier wrote:
> 
> > Hello,
> >
> > I ran into a strange problem with QLogic OFED and openmpi. When I submit 
> > (through SGE) 2 jobs on the same node, the second job ends up with:
> >
> > (ipath/PSM)[10292]: can't open /dev/ipath, network down (err=26)
> >
> > I'm pretty sure the infiniband is working well as the other job runs fine.
> >
> > Here are the details of the configuration:
> >
> > Qlogic HCA: InfiniPath_QMH7342 (2 ports but only one connected to a switch)
> > qlogic_ofed-1.5.3-7.0.0.0.35 (rocks cluster roll)
> > openmpi 1.5.4 (./configure --with-psm --with-openib --with-sge)
> >
> > -
> >
> > In order to fix this problem I recompiled openmpi without psm support, but 
> > I faced another problem:
> >
> > The OpenFabrics (openib) BTL failed to initialize while trying to
> > allocate some locked memory.  This typically can indicate that the
> > memlock limits are set too low.  For most HPC installations, the
> > memlock limits should be set to "unlimited".  The failure occured
> > here:
> >
> >   Local host:    compute-0-6.local
> >   OMPI source:   btl_openib.c:329
> >   Function:      ibv_create_srq()
> >   Device:        qib0
> >   Memlock limit: unlimited
> >
> >


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] GPU/NIC/CPU locality

2011-11-29 Thread Stefan Eilemann
Hi Jeff,

On 29. Nov 2011, at 15:28, Jeff Squyres wrote:

>> I think messages of found/not found optional modules could be more prominent 
>> at the end of the configure process.
> 
> FWIW, I've traditionally been against such things for two reasons:

Your call, really. The information is there and not too hard to find, but I 
missed it on the first run. Most software I know provides this in a very 
concise list at the end (Supported: A B C\n Unsupported: D E F).


Cheers,

Stefan.
-- 
http://www.eyescale.ch
http://www.equalizergraphics.com
http://www.linkedin.com/in/eilemann






Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed

2011-11-29 Thread MM
fantastic, thank you very much,

-Original Message-
From: Shiqing Fan [mailto:f...@hlrs.de] 
Sent: 29 November 2011 14:10
To: MM
Cc: 'Open MPI Users'
Subject: Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed

Hi MM,

That doesn't really help.

Do you need a debug version on your 32 bit Windows XP? Maybe I can build 
one for you. Will send it to you off the mailing list.

Regards,
Shiqing

On 2011-11-29 2:59 PM, MM wrote:
> debugging mpirund arrives to
> openmpi-1.5.4\opal\mca\base\mca_base_components_select.c:
> function mca_base_select
>
> components_available list appears to be empty, ie
> orte_debugger_base_components_available appears to be empty (opal list
> length=0)
>
> Is this an indication of something meaningful?
>
> Note, I built openmpi static libs (with dll c/c++ runtime)
> OMPI_IMPORTS is __not__ defined, that's how I got it to compile
>
> MM
> -Original Message-
> From: Shiqing Fan [mailto:f...@hlrs.de]
> Sent: 25 November 2011 22:19
> To: MM
> Subject: Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed
>
> Hi MM,
>
> Do you really want to build Open MPI by yourself? If you only need the
> libraries, probably you may stick to 1.5.4 binaries, which you said
> works for you.
>
> Anyway, if you want to debug mpirun, you can step into orterun project,
> which generates mpirun executable.
>
> Which version of Open MPI are you building? I'm not sure whether I'll
> have time these days to look closely at this problem, but if you can
> reproduce this problem with a small test program, and send it to me, I
> would also like to help debug it.
>
>
> Best Regards,
> Shiqing
>
>
> On 2011-11-25 11:06 PM, MM wrote:
>> Shiqing,
>>
>> As I built the mpi libs in debug as well, can I break point somehow when I
>> run
>>
>> mpirun -np 1   : -np 1
>>
>> and I get those 2 errors.
>>
>> Can I breakpoint somehow inside vs2010? maybe to investigate what's going
>> on?
>>
>> How do I launch "mpirun" in debug from the openmpi solution. Which project
>> generates the mpirun binary?
>>
>> I am a bit stuck and would appreciate help to progress,
>>
>> rds,
>>
>> MM
>>
>> -Original Message-
>> From: Shiqing Fan [mailto:f...@hlrs.de]
>> Sent: 24 November 2011 16:44
>> To: MM
>> Cc: 'Open MPI Users'
>> Subject: Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed
>>
>> Hi MM,
>>
>> Sorry for the delayed reply, I was busy in a meeting these days.
>>
>> The log files don't seem very helpful for solving the problem. Maybe your
>> CMakeCache.txt file would help.
>>
>> Currently we don't provide binaries built from trunk. Have you also
>> tried the 1.5.x binaries?
>>
>> Best Regards,
>> Shiqing
>>
>> On 2011-11-23 10:08 PM, MM wrote:
>>> Hi Shiqing,
>>>
>>> Is the info provided useful to understand what's going on?
>>> Alternatively, is there a way to get the provided binaries for win but off
>>> trunk rather than off 1.5.4 as on the website, because I don't have this
>>> problem when I link against those libs,
>>>
>>> Thanks
>>>
>>> MM
>>>
>>> -Original Message-
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>>> Behalf Of MM
>>> Sent: 21 November 2011 21:08
>>> To: f...@hlrs.de
>>> Cc: 'Open MPI Users'
>>> Subject: Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed
>>>
>>> Hi,
>>>
>>> I have placed the source in \Program Files\openmpi-1.5.4 the build dir in
>>> \Program Files\openmpi.build and the install dir in \Program Files\openmpi
>>> I could not find config.log in any of the 3 directories nor in the directory
>>> from which I run mpirun.
>>>
>>> The build log attached is a zip of all the .log under \Program
>>> Files\openmpi.build
>>>
>>> First, I installed the provided binaries on xp32bit, and successfully ran
>>> the program in Release mode.
>>> In debug mode, there was that error of some function missing in kernel, that
>>> you fixed in svn.
>>>
>>> Second, I then downloaded the source and built the static libraries with cmake
>>> according to README.windows, and against these home-built libs, the same
>>> program runs neither in debug nor in release, because of the error below.
>>>
>>> How can I generate the config.log?
>>>
>>> About Debug/Release, thinking about it at this time, I don't really need the
>>> debug libs of openmpi.
>>> but to be able to link against vs2010 Release libs of openmpi, I need them
>>> to be linked against the Release c runtime, so I might as well link against
>>> the debug version of the openmpi libs.
>>>
>>> Your help is very appreciated,
>>> MM
>>>
>>> -Original Message-
>>> From: Shiqing Fan [mailto:f...@hlrs.de]
>>> Sent: 21 November 2011 12:48
>>> To: Open MPI Users
>>> Cc: MM
>>> Subject: Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed
>>>
>>> Hi,
>>>
>>> Could you please send your config and build log to me? Have you tried with a
>>> simpler program? Does this error always happen?
>>>
>>> Regards,
>>> Shiqing
>>>
>>>
>>> 

Re: [hwloc-users] GPU/NIC/CPU locality

2011-11-29 Thread Jeff Squyres
On Nov 29, 2011, at 7:25 AM, Stefan Eilemann wrote:

>> You are probably missing the libpci-devel package.
> 
> Thanks, that either doesn't exist or wasn't installed on Redhat. It works now.
> 
> I think messages of found/not found optional modules could be more prominent 
> at the end of the configure process.

FWIW, I've traditionally been against such things for two reasons:

1. The information *was* displayed above (i.e., that pci-devel wasn't 
found/wasn't usable/whatever).  I realize that most people don't read the 
stdout of configure at all, but all the information you need is already there.

2. A list of what will/will not be built at the end tends to grow lengthy such 
that it dilutes the value of repeating the information at the end.

That being said, I can *somewhat* see the value of displaying a user-friendly 
"PCI device support will not be built" vs. the output of a configure test, 
which might be somewhat obscure.  However, in hwloc's case, the configure test 
output is pretty self-evident.  Examples:

checking for PCI... no
checking pci/pci.h usability... no
checking pci/pci.h presence... no
checking for pci/pci.h... no
checking for LIBXML2... yes
checking for xmlNewDoc... yes
checking for final LIBXML2 support... yes

A simple string search for "pci" and "xml" will find these lines in the 
configure output.  Presumably, if you're building from source, you've got 
at least *some* experience and it shouldn't be unreasonable to ask you to go 
look in the output of configure.
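
For anyone who wants that string search automated, here is a rough Python sketch that pulls the relevant "checking ... yes/no" lines out of saved configure output and sorts them into found and not found.  It assumes the lines look exactly like the examples above; the function name summarize, the SAMPLE constant and the default keywords are choices made for this example, not anything hwloc ships.

```python
import re

# Matches lines such as "checking for PCI... no" or "checking pci/pci.h usability... no".
CHECK_LINE = re.compile(r"^checking (?:for )?(.+?)\.\.\. (yes|no)$")

SAMPLE = """\
checking for PCI... no
checking pci/pci.h usability... no
checking pci/pci.h presence... no
checking for pci/pci.h... no
checking for LIBXML2... yes
checking for xmlNewDoc... yes
checking for final LIBXML2 support... yes
"""


def summarize(configure_output: str, keywords=("pci", "xml")):
    """Return (found, not_found) lists of checked items whose name mentions a keyword."""
    found, not_found = [], []
    for line in configure_output.splitlines():
        match = CHECK_LINE.match(line.strip())
        if match is None:
            continue
        what, result = match.groups()
        if not any(k in what.lower() for k in keywords):
            continue
        (found if result == "yes" else not_found).append(what)
    return found, not_found


found, not_found = summarize(SAMPLE)
print("Found:    ", ", ".join(found))
print("Not found:", ", ".join(not_found))
```

Lines whose result is something other than a bare yes or no (a path or a version, say) are simply skipped by this sketch.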

Don't get me wrong -- I'm not dead-set against a listing at the bottom.  I just 
find it redundant and somewhat of a maintenance hassle.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed

2011-11-29 Thread Shiqing Fan

Hi MM,

That doesn't really help.

Do you need a debug version on your 32 bit Windows XP? Maybe I can build 
one for you. Will send it to you off the mailing list.


Regards,
Shiqing

On 2011-11-19 4:24 PM, MM wrote:

Trying to run my program linked against debug 1.5.4 on vs2010 fails:


mpirun -np 1 .\nhui\Debug\nhui.exe : -np 1 .\nhcomp\Debug\nhcomp.exe

[PCNAME:04960] [[1282,0],0] ORTE_ERROR_LOG: Not found in file
C:\Program Files\openmpi-1.5.4\orte\mca\ess\hnp\ess_hnp_module.c at
line 536
--
 It looks like orte_init failed for some reason; your parallel
process is likely to abort.  There are many reasons that a parallel
process can fail during orte_init; some of which are due to
configuration or environment problems.  This failure appears to be an

Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed

2011-11-29 Thread MM
debugging mpirund arrives to
openmpi-1.5.4\opal\mca\base\mca_base_components_select.c:
function mca_base_select

components_available list appears to be empty, ie 
orte_debugger_base_components_available appears to be empty (opal list
length=0)

Is this an indication of something meaningful?

Note, I built openmpi static libs (with dll c/c++ runtime)
OMPI_IMPORTS is __not__ defined, that's how I got it to compile

MM

Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed

2011-11-29 Thread MM
I have to admit this is driving me a bit crazy, 

Trying to debug orterun from vs2010 says "Cannot attach to process", even if
I do "Start debugging" from the UI.

I'll keep digging,

PS: if anyone has time and can join on a openmpi IRC channel :-) that would
be great,

-----Original Message-----
From: Shiqing Fan [mailto:f...@hlrs.de] 
Sent: 25 November 2011 22:19
To: MM
Subject: Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed

Hi MM,

Do you really want to build Open MPI by yourself? If you only need the 
libraries, you can probably stick to the 1.5.4 binaries, which you said 
work for you.

Anyway, if you want to debug mpirun, you can step into the orterun project, 
which generates the mpirun executable.

Which version of Open MPI are you building? I'm not sure whether I'll 
have time these days to look closely at this problem, but if you can 
reproduce this problem with a small test program and send it to me, I 
would also like to help debug it.


Best Regards,
Shiqing


On 2011-11-25 11:06 PM, MM wrote:
> Shiqing,
>
> As I built the mpi libs in debug as well, can I break point somehow when I
> run
>
> mpirun -np 1  : -np 1
>
> and I get those 2 errors.
>
> Can I breakpoint somehow inside vs2010? maybe to investigate what's going
> on?
>
> How do I launch "mpirun" in debug from the openmpi solution. Which project
> generates the mpirun binary?
>
> I am a bit stuck and would appreciate help to progress,
>
> rds,
>
> MM
>
> -----Original Message-----
> From: Shiqing Fan [mailto:f...@hlrs.de]
> Sent: 24 November 2011 16:44
> To: MM
> Cc: 'Open MPI Users'
> Subject: Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed
>
> Hi MM,
>
> Sorry for the delayed reply, I was busy in a meeting these days.
>
> The log files do not seem very helpful for solving the problem. Maybe your
> CMakeCache.txt file would help.
>
> Currently we don't provide binaries built from trunk. Have you also
> tried the 1.5.x binaries?
>
> Best Regards,
> Shiqing
>
> On 2011-11-23 10:08 PM, MM wrote:
>> Hi Shiqing,
>>
>> Is the info provided useful to understand what's going on?
>> Alternatively, is there a way to get the provided binaries for win but off
>> trunk rather than off 1.5.4 as on the website, because I don't have this
>> problem when I link against those libs,
>>
>> Thanks
>>
>> MM
>>
>> -----Original Message-----
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
>> Behalf Of MM
>> Sent: 21 November 2011 21:08
>> To: f...@hlrs.de
>> Cc: 'Open MPI Users'
>> Subject: Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed
>>
>> Hi,
>>
>> I have placed the source in \Program Files\openmpi-1.5.4, the build dir in
>> \Program Files\openmpi.build, and the install dir in \Program Files\openmpi.
>>
>> I could not find config.log in any of the 3 directories nor in the directory
>> from which I run mpirun.
>>
>> The build log attached is a zip of all the .log under \Program
>> Files\openmpi.build
>>
>> First, I installed the provided binaries on xp32bit, and successfully ran
>> the program in Release mode.
>> in debug mode, there was that error of some function missing in kernel,
>> that you fixed in svn.
>>
>> Second, I then downloaded the source and built the static libraries with
>> cmake according to README.windows, and against these home built libs, the
>> same program runs neither in debug nor in release, because of the error
>> below.
>>
>> How can I generate the config.log?
>>
>> About Debug/Release, thinking about it at this time, I don't really need
>> the debug libs of openmpi.
>> but to be able to link against vs2010 Release libs of openmpi, I need them
>> to be linked against the Release c runtime, so I might as well link against
>> the debug version of the openmpi libs.
>>
>> Your help is very appreciated,
>> MM
>>
>> -----Original Message-----
>> From: Shiqing Fan [mailto:f...@hlrs.de]
>> Sent: 21 November 2011 12:48
>> To: Open MPI Users
>> Cc: MM
>> Subject: Re: [OMPI users] orte_debugger_select and orte_ess_set_name failed
>>
>> Hi,
>>
>> Could you please send your config and build log to me? Have you tried with a
>> simpler program? Does this error always happen?
>>
>> Regards,
>> Shiqing
>>
>>
>> On 2011-11-19 4:24 PM, MM wrote:
>>> Trying to run my program linked against debug 1.5.4 on vs2010 fails:
>>>
>>> mpirun -np 1 .\nhui\Debug\nhui.exe : -np 1 .\nhcomp\Debug\nhcomp.exe
>>> [PCNAME:04960] [[1282,0],0] ORTE_ERROR_LOG: Not found in file
>>> C:\Program Files\openmpi-1.5.4\orte\mca\ess\hnp\ess_hnp_module.c at
>>> line 536
>>> --
>>>  It looks like orte_init failed for some reason; your parallel
>>> process is likely to abort.  There are many reasons that a parallel
>>> process can fail during orte_init; some of which are due to
>>> configuration or environment problems.  This failure appears to be an
>>> internal failure; here's some additional information (which may only
>>> be 

Re: [hwloc-users] GPU/NIC/CPU locality

2011-11-29 Thread Stefan Eilemann

On 29. Nov 2011, at 11:41, Samuel Thibault wrote:

> You are probably missing the libpci-devel package.

Thanks, that either doesn't exist or wasn't installed on Redhat. It works now.

I think messages of found/not found optional modules could be more prominent at 
the end of the configure process.


Cheers,

Stefan.
-- 
http://www.eyescale.ch
http://www.equalizergraphics.com
http://www.linkedin.com/in/eilemann






Re: [hwloc-users] GPU/NIC/CPU locality

2011-11-29 Thread Samuel Thibault
Stefan Eilemann, on Tue 29 Nov 2011 11:40:18 +0100, wrote:
> Maybe I'm missing something, but I don't see any PCI-related output with 
> lstopo.

You are probably missing the libpci-devel package.

Samuel


Re: [hwloc-users] GPU/NIC/CPU locality

2011-11-29 Thread Stefan Eilemann
Hi Brice,

On 29. Nov 2011, at 9:45, Brice Goglin wrote:

> hwloc 1.3 already has support for PCI device detection. These new
> objects contain a "class" field that can help you know if it's a NIC/GPU/...
> 
> Just run lstopo
> on your machine to see what I am talking about.

Maybe I'm missing something, but I don't see any PCI-related output with lstopo.

I just compiled 1.3 from scratch, and ran lstopo as user and hwloc-info as root:

$ sudo ./local/bin/hwloc-info -v
[sudo] password for eilemann: 
Machine (24GB)
  NUMANode L#0 (P#0 12GB) + Socket L#0 + L3 L#0 (12MB)
    L2 L#0 (256KB) + L1 L#0 (32KB) + Core L#0 + PU L#0 (P#0)
    L2 L#1 (256KB) + L1 L#1 (32KB) + Core L#1 + PU L#1 (P#1)
    L2 L#2 (256KB) + L1 L#2 (32KB) + Core L#2 + PU L#2 (P#2)
    L2 L#3 (256KB) + L1 L#3 (32KB) + Core L#3 + PU L#3 (P#3)
    L2 L#4 (256KB) + L1 L#4 (32KB) + Core L#4 + PU L#4 (P#4)
    L2 L#5 (256KB) + L1 L#5 (32KB) + Core L#5 + PU L#5 (P#5)
  NUMANode L#1 (P#1 12GB) + Socket L#1 + L3 L#1 (12MB)
    L2 L#6 (256KB) + L1 L#6 (32KB) + Core L#6 + PU L#6 (P#6)
    L2 L#7 (256KB) + L1 L#7 (32KB) + Core L#7 + PU L#7 (P#7)
    L2 L#8 (256KB) + L1 L#8 (32KB) + Core L#8 + PU L#8 (P#8)
    L2 L#9 (256KB) + L1 L#9 (32KB) + Core L#9 + PU L#9 (P#9)
    L2 L#10 (256KB) + L1 L#10 (32KB) + Core L#10 + PU L#10 (P#10)
    L2 L#11 (256KB) + L1 L#11 (32KB) + Core L#11 + PU L#11 (P#11)
[eilemann@node01 ~]$ 

The lstopo graphical output contains the same information.


Cheers,

Stefan.
-- 
http://www.eyescale.ch
http://www.equalizergraphics.com
http://www.linkedin.com/in/eilemann






[OMPI users] FW: Problem launching application on windows

2011-11-29 Thread Martin Santa María

Hello all,


I have the same problem on Vista x64 using 1.5.4. But my configuration is 
different:
I have a Jenkins server that launches the executable in a remote Windows 
machine, so I suppose that something is missing in my environment.
If I manually launch the application in the machine locally, everything works 
well.
I checked using depends and it seems it found the required libraries without 
problems so I don't understand what is causing the problems.
Perhaps some windows permissions?


Regards


Martin



 

Subject: Re: [OMPI users] Problem launching application on windows
From: Alex van 't Veer (avantveer_at_[hidden])
List-Post: users@lists.open-mpi.org
Date: 2011-10-28 07:33:06

Hi Shiqing,

Unfortunately that did not solve the problem.

Can you tell me something more about how the sockets work and how they could
get corrupted? Maybe I can figure out what is going wrong.

Thanks

From: Shiqing Fan [mailto:fan_at_[hidden]]
Sent: Friday, October 28, 2011 12:16 PM
To: Open MPI Users
Cc: Alex van 't Veer
Subject: Re: [OMPI users] Problem launching application on windows

Hi,

This does not look normal, because this error might happen mainly by improper
sockets. I don't have any clue at the moment, as I can't reproduce it.

Could you try to reinstall Open MPI? And make sure there is no other
installation on your system. If this is still not working, try using Open MPI
1.5.3. Please let me know whether these will work for you or not.

Regards,
Shiqing

On 2011-10-27 11:35 AM, Alex van 't Veer wrote:

Hi

I've installed the OpenMPI 1.5.4-1 64-bit binaries on Windows 7. When I run
mpirun.exe without any options I get the help text and everything seems to
work fine, but when I try to actually run an application, I get the following
error:

..\..\..\openmpi-1.5.4\opal\event\event.c: ompi_evesel->dispatch() failed.

I get the error when running any application; to exclude my own application I
tried the hello world example and it returns the same error. (The command I
used is mpirun.exe helloworld.exe)

Searching for the error in the list or looking at event.c didn't get me much
further. Can anyone point me in the right direction for solving this problem?

Thanks

Re: [hwloc-users] GPU/NIC/CPU locality

2011-11-29 Thread Stefan Eilemann
Bonjour Brice,

On 29. Nov 2011, at 9:45, Brice Goglin wrote:

> hwloc 1.3 already has support for PCI device detection. These new
> objects contain a "class" field that can help you know if it's a NIC/GPU/...

Ok, time to upgrade my installation. The cluster has RHEL6.1 which ships with 
an older version.

> How are you using GPUs and NICs in your software? Which libraries or
> ways do you use to access them?

I use them mostly with OpenGL ('XOpenDisplay(":0.")') and RDMA in 
Equalizer/Collage (see links in signature). Is there a straightforward way to 
associate the GPUs with the corresponding X screen? I guess at least the path 
through the Xorg PCI ID should work, but it would be nice to have that in hwloc.

We also use CUDA/Open MPI here, but I guess this will be easier to support. I'll 
look into the latest source of lstopo to see how it's done.


BTW, I recently created a library for ZeroConf GPU discovery[1], this might be 
of interest to you.


Cheers,

Stefan.

[1] http://www.equalizergraphics.com/gpu-sd
-- 
http://www.eyescale.ch
http://www.equalizergraphics.com
http://www.linkedin.com/in/eilemann






Re: [hwloc-users] GPU/NIC/CPU locality

2011-11-29 Thread Brice Goglin
Hello Stefan,

hwloc 1.3 already has support for PCI device detection. These new
objects contain a "class" field that can help you know if it's a NIC/GPU/...
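
To illustrate how a class field can tell a NIC from a GPU, here is a minimal Python sketch. It is not hwloc's API: it assumes a 24-bit PCI class code in the form Linux exposes under /sys/bus/pci/devices/<address>/class (base class, subclass, programming interface), and the helper name `pci_kind` and its category labels are made up for this example.

```python
# Base class codes from the PCI specification (top byte of the class code).
PCI_BASE_CLASSES = {
    0x01: "storage",
    0x02: "network",
    0x03: "display",
    0x06: "bridge",
    0x0C: "serial bus",
}


def pci_kind(class_code):
    """Return a coarse device kind for a 24-bit PCI class code.

    Accepts an int or a sysfs-style hex string such as "0x030000".
    `pci_kind` is a hypothetical helper, not part of hwloc.
    """
    if isinstance(class_code, str):
        class_code = int(class_code.strip(), 16)
    base_class = (class_code >> 16) & 0xFF
    return PCI_BASE_CLASSES.get(base_class, "other")


if __name__ == "__main__":
    for code in ("0x030000", "0x020000", "0x010601"):
        print(code, "->", pci_kind(code))
```

Only the base class is inspected here; telling Ethernet from InfiniBand, for example, would also need the subclass byte.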

However, it's hard to know which PCI device is eth0 or eth1, so we also
try to add some OS devices inside PCI devices. If you're using Linux, you
will see which network device (eth0, ...), IB device (mlx4_0, ...), or
disk (sda, ...) corresponds to each PCI device (if any). Just run lstopo
on your machine to see what I am talking about. Then you should read the
I/O devices section in the doc.

There's also some work to insert CUDA device information inside those
PCI devices.

Additionally, we have some helpers to retrieve the locality of some custom
library objects (OFED, CUDA, ...). See the interoperability section in
the doc.
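
For a rough idea of what this locality information looks like without going through hwloc, the Python sketch below reads it straight from Linux sysfs. This is a swapped-in technique, not hwloc's implementation or API. It assumes a reasonably recent Linux kernel where each entry under /sys/bus/pci/devices has `class` and `local_cpulist` attributes, and where a NIC's interface names appear in a `net/` subdirectory. The names `read_attr` and `pci_locality` are hypothetical.

```python
import os


def read_attr(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def pci_locality(sysfs_pci="/sys/bus/pci/devices"):
    """Map each PCI address to its class code, nearby CPUs and network interfaces.

    Hypothetical helper: relies on the `class`, `local_cpulist` and `net/`
    entries that Linux is assumed to expose for each PCI device.
    """
    devices = {}
    for address in sorted(os.listdir(sysfs_pci)):
        dev_dir = os.path.join(sysfs_pci, address)
        net_dir = os.path.join(dev_dir, "net")
        devices[address] = {
            "class": read_attr(os.path.join(dev_dir, "class")),
            "local_cpulist": read_attr(os.path.join(dev_dir, "local_cpulist")),
            # e.g. ["eth0"] for a NIC; empty for devices without a network interface
            "net": sorted(os.listdir(net_dir)) if os.path.isdir(net_dir) else [],
        }
    return devices


if __name__ == "__main__":
    if os.path.isdir("/sys/bus/pci/devices"):
        for address, info in pci_locality().items():
            print(address, info["class"], info["local_cpulist"], info["net"])
```

On a two-socket machine like the one discussed in this thread, a device attached to the first socket would be expected to report the CPUs of that socket in `local_cpulist`, but that depends on what the firmware tells the kernel.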

How are you using GPUs and NICs in your software? Which libraries or
ways do you use to access them?

Hope this helps.
Brice




On 29/11/2011 09:32, Stefan Eilemann wrote:
> All,
>
> We need to discover which GPUs and NICs are close to which CPUs[1], 
> independent from CUDA. From the overview page there are hints that there is 
> some kind of support planned, but it's unclear to me how much of this is 
> implemented.
>
> Is there support in hwloc, and in which version, for this? If yes, can you 
> give me a hint/code snippet on how to do this? If no, what does it take to 
> get this support in hwloc?
>
>
> Cheers,
>
> Stefan.
>
> [1] https://github.com/Eyescale/Equalizer/issues/57
>



[hwloc-users] GPU/NIC/CPU locality

2011-11-29 Thread Stefan Eilemann
All,

We need to discover which GPUs and NICs are close to which CPUs[1], 
independent from CUDA. From the overview page there are hints that there is 
some kind of support planned, but it's unclear to me how much of this is 
implemented.

Is there support in hwloc, and in which version, for this? If yes, can you give 
me a hint/code snippet on how to do this? If no, what does it take to get this 
support in hwloc?


Cheers,

Stefan.

[1] https://github.com/Eyescale/Equalizer/issues/57

-- 
http://www.eyescale.ch
http://www.equalizergraphics.com
http://www.linkedin.com/in/eilemann