Re: [OMPI users] [External] Re: cpu binding of mpirun to follow slurm setting

2021-10-11 Thread Åke Sandgren via users



On 10/10/21 5:38 PM, Chang Liu via users wrote:
> OMPI v4.1.1-85-ga39a051fd8
> 
> % srun bash -c "cat /proc/self/status|grep Cpus_allowed_list"
> Cpus_allowed_list:  58-59
> Cpus_allowed_list:  106-107
> Cpus_allowed_list:  110-111
> Cpus_allowed_list:  114-115
> Cpus_allowed_list:  16-17
> Cpus_allowed_list:  36-37
> Cpus_allowed_list:  54-55
> ...
> 
> % mpirun bash -c "cat /proc/self/status|grep Cpus_allowed_list"
> Cpus_allowed_list:  0-127
> Cpus_allowed_list:  0-127
> Cpus_allowed_list:  0-127
> Cpus_allowed_list:  0-127
> Cpus_allowed_list:  0-127
> Cpus_allowed_list:  0-127
> Cpus_allowed_list:  0-127
> ...

Was that run in the same batch job? If not, the data is useless.
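As a side note, a minimal sketch of how to compare the two launchers inside a single allocation (the job script below is illustrative; node and task counts are assumptions, the only point is that both commands run in the same job):

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --cpus-per-task=2

# Both launchers run inside the same allocation, so their binding
# masks can be compared directly.
echo "=== srun ==="
srun bash -c "grep Cpus_allowed_list /proc/self/status"

echo "=== mpirun ==="
mpirun bash -c "grep Cpus_allowed_list /proc/self/status"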

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] OpenMPI 4.0.2 with PGI 19.10, will not build with hcoll

2020-01-26 Thread Åke Sandgren via users
Note that when built against Slurm it will pick up -pthread from
libslurm.la too.

On 1/26/20 4:37 AM, Gilles Gouaillardet via users wrote:
> Thanks Jeff for the information and sharing the pointer.
> 
> FWIW, this issue typically occurs when libtool pulls the -pthread flag
> from libhcoll.la that was compiled with a GNU compiler.
> The simplest workaround is to remove libhcoll.la (so libtool simply
> links with libhcoll.so and does not pull any compiler flags),
> and the right fix is imho to either have the libtool maintainers
> handle this case or the PGI/NVIDIA folks add support for the -pthread
> flag.
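(As an illustration of that first workaround, a sketch only: the hcoll prefix below is the one used elsewhere in this thread, and the lib/ layout of the MLNX_OFED install is an assumption.)

# Move the libtool archive aside so libtool links against libhcoll.so
# directly and stops importing the GNU-specific -pthread flag from the .la file.
mv /opt/mellanox/hcoll/lib/libhcoll.la /opt/mellanox/hcoll/lib/libhcoll.la.disabled

# then re-run make in the OpenMPI build tree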
> 
> Cheers,
> 
> Gilles
> 
> On Sun, Jan 26, 2020 at 12:09 PM Jeff Hammond via users
>  wrote:
>>
>> To be more strictly equivalent, you will want to add -D_REENTRANT to
>> the substitution, but this may not affect hcoll.
>>
>> https://stackoverflow.com/questions/2127797/significance-of-pthread-flag-when-compiling/2127819#2127819
>>
>> The proper fix here is a change in the OMPI build system, of course, to not set
>> -pthread when PGI is used.
>>
>> Jeff
>>
>> On Fri, Jan 24, 2020 at 11:31 AM Åke Sandgren via users 
>>  wrote:
>>>
>>> PGI needs this in its, for instance, siterc or localrc:
>>> # replace unknown switch -pthread with -lpthread
>>> switch -pthread is replace(-lpthread) positional(linker);
>>>
>>>
>>> On 1/24/20 8:12 PM, Raymond Muno via users wrote:
>>>> I am having issues building OpenMPI 4.0.2 using the PGI 19.10
>>>> compilers.  OS is CentOS 7.7, MLNX_OFED 4.7.3
>>>>
>>>> It dies at:
>>>>
>>>> PGC/x86-64 Linux 19.10-0: compilation completed with warnings
>>>>   CCLD mca_coll_hcoll.la
>>>> pgcc-Error-Unknown switch: -pthread
>>>> make[2]: *** [mca_coll_hcoll.la] Error 1
>>>> make[2]: Leaving directory
>>>> `/project/muno/OpenMPI/PGI/openmpi-4.0.2/ompi/mca/coll/hcoll'
>>>> make[1]: *** [all-recursive] Error 1
>>>> make[1]: Leaving directory `/project/muno/OpenMPI/PGI/openmpi-4.0.2/ompi'
>>>> make: *** [all-recursive] Error 1
>>>>
>>>> I tried with PGI 19.9 and had the same issue.
>>>>
>>>> If I do not include hcoll, it builds.  I have successfully built OpenMPI
>>>> 4.0.2 with GCC, Intel and AOCC compilers, all using the same options.
>>>>
>>>> hcoll is provided by MLNX_OFED 4.7.3 and configure is run with
>>>>
>>>> --with-hcoll=/opt/mellanox/hcoll
>>>>
>>>>
>>>
>>> --
>>> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
>>> Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
>>> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
>>
>> --
>> Jeff Hammond
>> jeff.scie...@gmail.com
>> http://jeffhammond.github.io/

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] OpenMPI 4.0.2 with PGI 19.10, will not build with hcoll

2020-01-24 Thread Åke Sandgren via users
PGI needs this in its, for instance, siterc or localrc:
# replace unknown switch -pthread with -lpthread
switch -pthread is replace(-lpthread) positional(linker);
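As a sketch of where that line goes (the PGI installation prefix and version directory below are assumptions; adjust to your install, and use siterc instead of localrc for a system-wide setting):

# Append the rewrite rule to the compiler's localrc.
PGI_BINDIR=/opt/pgi/linux86-64/19.10/bin    # assumed install location
cat >> "$PGI_BINDIR/localrc" <<'EOF'
# replace unknown switch -pthread with -lpthread
switch -pthread is replace(-lpthread) positional(linker);
EOF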


On 1/24/20 8:12 PM, Raymond Muno via users wrote:
> I am having issues building OpenMPI 4.0.2 using the PGI 19.10
> compilers.  OS is CentOS 7.7, MLNX_OFED 4.7.3
> 
> It dies at:
> 
> PGC/x86-64 Linux 19.10-0: compilation completed with warnings
>   CCLD mca_coll_hcoll.la
> pgcc-Error-Unknown switch: -pthread
> make[2]: *** [mca_coll_hcoll.la] Error 1
> make[2]: Leaving directory
> `/project/muno/OpenMPI/PGI/openmpi-4.0.2/ompi/mca/coll/hcoll'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/project/muno/OpenMPI/PGI/openmpi-4.0.2/ompi'
> make: *** [all-recursive] Error 1
> 
> I tried with PGI 19.9 and had the same issue.
> 
> If I do not include hcoll, it builds.  I have successfully built OpenMPI
> 4.0.2 with GCC, Intel and AOCC compilers, all using the same options.
> 
> hcoll is provided by MLNX_OFED 4.7.3 and configure is run with
> 
> --with-hcoll=/opt/mellanox/hcoll
> 
> 

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] Seg fault in opal_progress

2018-07-12 Thread Åke Sandgren
Are you running with ulimit -s unlimited?
If not, that looks like an out-of-stack crash, which VASP frequently causes.

If you are running with unlimited stack, I could perhaps run that input
case on our VASP build (which has a bunch of fixes for bad stack usage,
among other things).
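A minimal sketch of what that looks like in a job script (the VASP binary name and rank count are placeholders):

# Print the current soft stack limit, then remove it for this shell
# and for everything mpirun launches from it.
ulimit -s
ulimit -s unlimited
mpirun -n 128 ./vasp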

On 07/11/2018 11:13 PM, Noam Bernstein wrote:
>> On Jul 11, 2018, at 11:29 AM, Jeff Squyres (jsquyres) via users
>> <users@lists.open-mpi.org> wrote:
>>
>> Ok, that would be great -- thanks.
>>
>> Recompiling Open MPI with --enable-debug will turn on several
>> debugging/sanity checks inside Open MPI, and it will also enable
>> debugging symbols.  Hence, if you can get a failure with a debug Open
>> MPI build, it might give you a core file that can be used to get a
>> more detailed stack trace, poke around and see if there's a NULL
>> pointer somewhere, …etc.
> 
> I haven’t tried to get a core file yet, but it’s not producing any more
> info in the runtime stack trace, despite configuring with --enable-debug:
> 
> Image              PC                Routine            Line     Source
> vasp.gamma_para.i  02DCE8C1          Unknown            Unknown  Unknown
> vasp.gamma_para.i  02DCC9FB          Unknown            Unknown  Unknown
> vasp.gamma_para.i  02D409E4          Unknown            Unknown  Unknown
> vasp.gamma_para.i  02D407F6          Unknown            Unknown  Unknown
> vasp.gamma_para.i  02CDCED9          Unknown            Unknown  Unknown
> vasp.gamma_para.i  02CE3DB6          Unknown            Unknown  Unknown
> libpthread-2.12.s  003F8E60F7E0      Unknown            Unknown  Unknown
> mca_btl_vader.so   2B1AFA5FAC30      Unknown            Unknown  Unknown
> mca_btl_vader.so   2B1AFA5FD00D      Unknown            Unknown  Unknown
> libopen-pal.so.40  2B1AE884327C      opal_progress      Unknown  Unknown
> mca_pml_ob1.so     2B1AFB855DCE      Unknown            Unknown  Unknown
> mca_pml_ob1.so     2B1AFB858305      mca_pml_ob1_send   Unknown  Unknown
> libmpi.so.40.10.1  2B1AE823A5DA      ompi_coll_base_al  Unknown  Unknown
> mca_coll_tuned.so  2B1AFC6F0842      ompi_coll_tuned_a  Unknown  Unknown
> libmpi.so.40.10.1  2B1AE81B66F5      PMPI_Allreduce     Unknown  Unknown
> libmpi_mpifh.so.4  2B1AE7F2259B      mpi_allreduce_     Unknown  Unknown
> vasp.gamma_para.i  0042D1ED          m_sum_d_              1300  mpi.F
> vasp.gamma_para.i  0089947D          nonl_mp_vnlacc_.R     1754  nonl.F
> vasp.gamma_para.i  00972C51          hamil_mp_hamiltmu      825  hamil.F
> vasp.gamma_para.i  01BD2608          david_mp_eddav_.R      419  davidson.F
> vasp.gamma_para.i  01D2179E          elmin_.R               424  electron.F
> vasp.gamma_para.i  02B92452          vamp_IP_electroni     4783  main.F
> vasp.gamma_para.i  02B6E173          MAIN__                2800  main.F
> vasp.gamma_para.i  0041325E          Unknown            Unknown  Unknown
> libc-2.12.so       003F8E21ED1D      __libc_start_main  Unknown  Unknown
> vasp.gamma_para.i  00413169          Unknown            Unknown  Unknown
> 
> 
> This is the configure line that was supposedly used to create the library:
>  ./configure
> --prefix=/usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080
> --with-tm=/usr/local/torque --enable-mpirun-prefix-by-default
> --with-verbs=/usr --with-verbs-libdir=/usr/lib64 --enable-debug
> 
> Is there any way I can confirm that the version of the openmpi library I
> think I’m using really was compiled with debugging?
> 
> Noam
> 
> 
> U.S. NAVAL RESEARCH LABORATORY
> 
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628  F +1 202 404 7546
> https://www.nrl.navy.mil 
> 
> 
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
> 
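Regarding the question above about confirming a debug build: one quick check, as a sketch, is to ask the matching ompi_info (assuming the standard bin/ layout under the configure prefix) whether debug support was compiled in:

/usr/local/openmpi/3.1.1_debug/x86_64/ib/intel/11.1.080/bin/ompi_info | grep -i debug

# A build configured with --enable-debug should report internal debug support as "yes".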

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [hwloc-users] call for testing on KNL

2018-02-09 Thread Åke Sandgren
Any specific configure flags you'd want me to use?

And does node config matter, i.e., hemi/snc2 etc?

On 02/09/2018 05:56 PM, Brice Goglin wrote:
> Hello
> 
> As you may know, hwloc only discovers KNL MCDRAM Cache details if
> hwloc-dump-hwdata ran as root earlier. There's an issue with that tool
> in 2.0, which was supposed to be a feature: we fixed the matching of
> SMBIOS strings, and now it appears some vendors don't match anymore
> because they didn't use the standardized strings :(
> 
> If you have root access on a KNL, could you build this tarball and run
> utils/hwloc/hwloc-dump-hwdata -o foo (or $prefix/sbin/hwloc-dump-hwdata
> -o foo).
> 
> https://ci.inria.fr/hwloc/job/zbgoglin-0-tarball/lastSuccessfulBuild/artifact/hwloc-hdh-20180209.1705.git6cd1efb.tar.gz
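(A sketch of the steps, for anyone unsure: the extracted directory name is assumed to match the tarball basename, and the dump tool has to run as root.)

tar xzf hwloc-hdh-20180209.1705.git6cd1efb.tar.gz
cd hwloc-hdh-20180209.1705.git6cd1efb
./configure && make
sudo ./utils/hwloc/hwloc-dump-hwdata -o foo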
> 
> 
> If things work, you should see something like:
> 
> Dumping KNL SMBIOS Memory-Side Cache information:
>   File = /sys/firmware/dmi/entries/14-0/raw, size = 4096
> Looking for "Group: Knights Landing Information" in group string "Group: 
> Knights Landing Information"
>Found phi group
>   Found KNL type = 160
>   [... lots of other lines ...]
> 
> 
> If it doesn't work:
> 
> Dumping KNL SMBIOS Memory-Side Cache information:
>   File = /sys/firmware/dmi/entries/14-0/raw, size = 4096
> Looking for "Group: Knights Landing Information" in group string "foobar"
> Looking for "Group: Knights Mill Information" in group string "foobar"
> Looking for "Knights Landing Association" in group string "foobar"
> Failed to find phi group
> SMBIOS table does not contain KNL entries
> 
> 
> In both cases, please send the model of the machine [1] as well as the
> full output of the tool.
> 
> This issue might lead to a 2.0.1 release soon. In the meantime, you can
> use hwloc-dump-hwdata from 1.11.9, its output is compatible with hwloc 2.0.
> 
> Thanks
> 
> Brice
> 
> 
> [1] if you don't know, look at strings in "cat
> /sys/devices/virtual/dmi/id/*"
> 
> 
> 
> ___
> hwloc-users mailing list
> hwloc-users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/hwloc-users
> 

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
___
hwloc-users mailing list
hwloc-users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/hwloc-users


Re: [OMPI users] Problems building OpenMPI 2.1.1 on Intel KNL

2017-11-20 Thread Åke Sandgren
Done, issue 4519

On 11/20/2017 07:02 PM, Howard Pritchard wrote:
> Hello Ake,
> 
> Would you mind opening an issue on Github so we can track this?
> 
> https://github.com/open-mpi/ompi/issues
> 
> There's a template to show what info we need to fix this.
> 
> Thanks very much for reporting this,
> 
> Howard
> 
> 
> 2017-11-20 3:26 GMT-07:00 Åke Sandgren <ake.sandg...@hpc2n.umu.se>:
> 
> Hi!
> 
> When the xppsl-libmemkind-dev package version 1.5.3 is installed
> building OpenMPI fails.
> 
> opal/mca/mpool/memkind uses the macro MEMKIND_NUM_BASE_KIND which has
> been moved to memkind/internal/memkind_private.h


-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

[OMPI users] Problems building OpenMPI 2.1.1 on Intel KNL

2017-11-20 Thread Åke Sandgren
Hi!

When the xppsl-libmemkind-dev package version 1.5.3 is installed
building OpenMPI fails.

opal/mca/mpool/memkind uses the macro MEMKIND_NUM_BASE_KIND which has
been moved to memkind/internal/memkind_private.h

Current master is also using that so I think that will also fail.
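A quick way to check an installed memkind for this (the header locations below are assumptions based on a typical xppsl install):

# The OpenMPI memkind mpool component expects the macro in the public header:
grep -n MEMKIND_NUM_BASE_KIND /usr/include/memkind.h
# With xppsl-libmemkind-dev 1.5.3 it only shows up in the internal header:
grep -n MEMKIND_NUM_BASE_KIND /usr/include/memkind/internal/memkind_private.h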

Is anyone working on this?

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] [Open MPI Announce] Open MPI v2.1.2 released

2017-09-20 Thread Åke Sandgren
Hi!

The OB1 PML problem: how long has it been around and, apart from the
hang, how can I check whether I am likely to get hit by it?
And are there any specific situations in which it appears?

Will try 2.1.2 (and 3.0.0) out on our problem case soon, but it takes a
couple of days for the hang we're seeing to appear, so knowing for sure
whether it's gone or not is hard.

On 09/21/2017 12:04 AM, Pritchard Jr., Howard wrote:
> 
> 2.1.2 -- September, 2017
> 
> 
> Bug fixes/minor improvements:
> - Fix a problem in the OB1 PML that led to hangs with OSU collective tests.



-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users


Re: [OMPI users] Compiler error with PGI: pgcc-Error-Unknown switch: -pthread

2017-04-03 Thread Åke Sandgren
We build slurm with GCC, drop the -pthread arg in the .la files, and
have never seen any problems related to that. And we do build quite a
lot of code. And lots of versions of OpenMPI with multiple different
compilers (and versions).

On 04/03/2017 04:51 PM, Prentice Bisbal wrote:
> This is the second suggestion to rebuild Slurm
> 
The other was from Åke Sandgren, who recommended this:
> 
>> This usually comes from slurm, so we always do
>>
>> perl -pi -e 's/-pthread//' /lap/slurm/${version}/lib/libpmi.la
>> /lap/slurm/${version}/lib/libslurm.la
>>
>> when installing a new slurm version. Thus no need for a fakepg wrapper.
> 
> I don't really have the luxury to rebuild Slurm at the moment. How would
> I rebuild Slurm to change this behavior? Is rebuilding Slurm with PGI
> the only option to fix this in slurm, or use Åke's suggestion above?
> 
> If I did use Åke's suggestion above, how would that affect the operation
> of Slurm, or future builds of OpenMPI and any other software that might
> rely on Slurm, particulary with regards to building those apps with
> non-PGI compilers?
> 
> Prentice
> 
> On 04/03/2017 10:31 AM, Gilles Gouaillardet wrote:
>> Hi,
>>
>> The -pthread flag is likely pulled by libtool from the slurm libpmi.la
>> and/or libslurm.la
>> Workarounds are
>> - rebuild slurm with PGI
>> - remove the .la files (*.so and/or *.a are enough)
>> - wrap the PGI compiler to ignore the -pthread option
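(A minimal sketch of the third option, assuming a wrapper named pgcc-nopthread placed on PATH and that dropping -pthread is the only rewrite needed:)

#!/bin/sh
# pgcc-nopthread: forward everything to pgcc except -pthread,
# which the PGI driver rejects as an unknown switch.
for arg in "$@"; do
  shift
  [ "$arg" = "-pthread" ] || set -- "$@" "$arg"
done
exec pgcc "$@"

Configure would then be pointed at the wrapper with CC=pgcc-nopthread.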
>>
>> Hope this helps
>>
>> Gilles
>>
>> On Monday, April 3, 2017, Prentice Bisbal <pbis...@pppl.gov> wrote:
>>
>> Greeting Open MPI users! After being off this list for several
>> years, I'm back! And I need help:
>>
>> I'm trying to compile OpenMPI 1.10.3 with the PGI compilers,
>> version 17.3. I'm using the following configure options:
>>
>> ./configure \
>>   --prefix=/usr/pppl/pgi/17.3-pkgs/openmpi-1.10.3 \
>>   --disable-silent-rules \
>>   --enable-shared \
>>   --enable-static \
>>   --enable-mpi-thread-multiple \
>>   --with-pmi=/usr/pppl/slurm/15.08.8 \
>>   --with-hwloc \
>>   --with-verbs \
>>   --with-slurm \
>>   --with-psm \
>>   CC=pgcc \
>>   CFLAGS="-tp x64 -fast" \
>>   CXX=pgc++ \
>>   CXXFLAGS="-tp x64 -fast" \
>>   FC=pgfortran \
>>   FCFLAGS="-tp x64 -fast" \
>>   2>&1 | tee configure.log
>>
>> Which leads to this error  from libtool during make:
>>
>> pgcc-Error-Unknown switch: -pthread
>>
>> I've searched the archives, which ultimately lead to this work
>> around from 2009:
>>
>> https://www.open-mpi.org/community/lists/users/2009/04/8724.php
>>
>> Interestingly, I participated in the discussion that lead to that
>> workaround, stating that I had no problem compiling Open MPI with
>> PGI v9. I'm assuming the problem now is that I'm specifying
>> --enable-mpi-thread-multiple, which I'm doing because a user
>> requested that feature.
>>
>> It's been exactly 8 years and 2 days since that workaround was
>> posted to the list. Please tell me a better way of dealing with
>> this issue than writing a 'fakepgf90' script. Any suggestions?
>>
>>
>> -- 
>> Prentice
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>
>>
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> 
> 
> 
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> 

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Compiler error with PGI: pgcc-Error-Unknown switch: -pthread

2017-04-03 Thread Åke Sandgren
This usually comes from slurm, so we always do

perl -pi -e 's/-pthread//' /lap/slurm/${version}/lib/libpmi.la
/lap/slurm/${version}/lib/libslurm.la

when installing a new slurm version. Thus no need for a fakepg wrapper.
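A quick way to verify the result (the Slurm prefix is the same as in the command above):

# List the .la files under the Slurm install that still contain -pthread;
# after the perl rewrite this should print nothing.
grep -l -e '-pthread' /lap/slurm/${version}/lib/*.la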

On 04/03/2017 04:20 PM, Prentice Bisbal wrote:
> Greeting Open MPI users! After being off this list for several years,
> I'm back! And I need help:
> 
> I'm trying to compile OpenMPI 1.10.3 with the PGI compilers, version
> 17.3. I'm using the following configure options:
> 
> ./configure \
>   --prefix=/usr/pppl/pgi/17.3-pkgs/openmpi-1.10.3 \
>   --disable-silent-rules \
>   --enable-shared \
>   --enable-static \
>   --enable-mpi-thread-multiple \
>   --with-pmi=/usr/pppl/slurm/15.08.8 \
>   --with-hwloc \
>   --with-verbs \
>   --with-slurm \
>   --with-psm \
>   CC=pgcc \
>   CFLAGS="-tp x64 -fast" \
>   CXX=pgc++ \
>   CXXFLAGS="-tp x64 -fast" \
>   FC=pgfortran \
>   FCFLAGS="-tp x64 -fast" \
>   2>&1 | tee configure.log
> 
> Which leads to this error  from libtool during make:
> 
> pgcc-Error-Unknown switch: -pthread
> 
> I've searched the archives, which ultimately lead to this work around
> from 2009:
> 
> https://www.open-mpi.org/community/lists/users/2009/04/8724.php
> 
> Interestingly, I participated in the discussion that lead to that
> workaround, stating that I had no problem compiling Open MPI with PGI
> v9. I'm assuming the problem now is that I'm specifying
> --enable-mpi-thread-multiple, which I'm doing because a user requested
> that feature.
> 
> It's been exactly 8 years and 2 days since that workaround was posted to
> the list. Please tell me a better way of dealing with this issue than
> writing a 'fakepgf90' script. Any suggestions?
> 
> 

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-23 Thread Åke Sandgren
Ok, we have E5-2690v4's and Connect-IB.

On 03/23/2017 10:11 AM, Götz Waschk wrote:
> On Thu, Mar 23, 2017 at 9:59 AM, Åke Sandgren <ake.sandg...@hpc2n.umu.se> 
> wrote:
>> E5-2697A which version? v4?
> Hi, yes, that one:
> Intel(R) Xeon(R) CPU E5-2697A v4 @ 2.60GHz
> 
> Regards, Götz
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
> 

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-23 Thread Åke Sandgren
E5-2697A which version? v4?

On 03/23/2017 09:53 AM, Götz Waschk wrote:
> Hi Åke,
> 
> I have E5-2697A CPUs and Mellanox ConnectX-3 FDR Infiniband. I'm using
> EL7.3 as the operating system.


-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-23 Thread Åke Sandgren
Since I'm seeing similar bus errors from both OpenMPI and other places
on our system, I'm wondering: what hardware do you have?

CPUs, interconnect, etc.

On 03/23/2017 08:45 AM, Götz Waschk wrote:
> Hi Howard,
> 
> I have attached my config.log file for version 2.1.0. I have based it
> on the OpenHPC package. Unfortunately, it still crashes with disabling
> the vader btl with this command line:
> mpirun --mca btl "^vader" IMB-MPI1
> 
> 
> [pax11-10:44753] *** Process received signal ***
> [pax11-10:44753] Signal: Bus error (7)
> [pax11-10:44753] Signal code: Non-existant physical address (2)
> [pax11-10:44753] Failing at address: 0x2b3989e27a00

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


[OMPI users] Lustre support uses deprecated include.

2017-03-13 Thread Åke Sandgren
Hi!

The lustre support in ompi/mca/fs/lustre/fs_lustre.h is using a
deprecated include.

#include 

is deprecated in newer lustre versions (at least from 2.8) and

#include 

should be used instead.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


Re: [OMPI users] openmpi 1.10.2 and PGI 15.9

2016-07-14 Thread Åke Sandgren
No, you have to edit those two .la files by hand after
installation. It's basically a libtool problem. It generates the .la
file with an option that PGI doesn't understand.

On 07/14/2016 04:06 PM, Michael Di Domenico wrote:
> On Mon, Jul 11, 2016 at 9:52 AM, Åke Sandgren <ake.sandg...@hpc2n.umu.se> 
> wrote:
>> Looks like you are compiling with slurm support.
>>
>> If so, you need to remove the "-pthread" from libslurm.la and libpmi.la
> 
> I don't see a configure option in slurm to disable pthreads, so I'm
> not sure this is possible.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] openmpi 1.10.2 and PGI 15.9

2016-07-11 Thread Åke Sandgren
Looks like you are compiling with slurm support.

If so, you need to remove the "-pthread" from libslurm.la and libpmi.la

On 07/11/2016 02:54 PM, Michael Di Domenico wrote:
> I'm trying to get openmpi compiled using the PGI compiler.
> 
> the configure goes through and the code starts to compile, but then
> gets hung up with
> 
> entering: openmpi-1.10.2/opal/mca/common/pmi
> CC common_pmi.lo
> CCLD libmca_common_pmi.la
> pgcc-Error-Unknown switch: -pthread
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/07/29635.php
> 

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] my_sense in ompi_osc_sm_module_t not always protected by OPAL_HAVE_POSIX_THREADS

2015-12-07 Thread Åke Sandgren
The #if OPAL_HAVE_POSIX_THREADS is still there around my_sense in 
osc_sm.h in 1.10.1


On 06/29/2015 05:42 PM, Åke Sandgren wrote:

Yeah, I thought so. Well, code reductions are good when correct :-)

On 06/29/2015 05:39 PM, Nathan Hjelm wrote:


Open MPI has required posix threads for some time. The check for
OPAL_HAVE_POSIX_THREADS in ompi/mca/osc/sm/osc_sm.h is stale and should
be removed. I will clean that out in master, 1.8, and 1.10.



--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] openmpi 1.8.8: Problems with MPI_Send and mmap:ed buffer

2015-11-18 Thread Åke Sandgren

Did anyone take notice of this?

I haven't seen any response.

On 10/08/2015 11:14 AM, Åke Sandgren wrote:

Hi!

The attached code shows a problem when using an mmap:ed buffer with
MPI_Send and the vader btl.

With OMPI_MCA_btl='^vader' it works in all cases I have tested.


Intel MPI also has problems with this, failing to receive the complete
data, getting a NULL at position 6116 when the receiver is on another node.

(Haven't had time to build 1.10 yet...)



___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/10/27842.php



--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


[OMPI users] openmpi 1.8.8: Problems with MPI_Send and mmap:ed buffer

2015-10-08 Thread Åke Sandgren

Hi!

The attached code shows a problem when using an mmap:ed buffer with
MPI_Send and the vader btl.


With OMPI_MCA_btl='^vader' it works in all cases I have tested.


Intel MPI also has problems with this, failing to receive the complete
data, getting a NULL at position 6116 when the receiver is on another node.


(Haven't had time to build 1.10 yet...)
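For completeness, a sketch of how the attached reproducer below can be driven: the binary and source names are placeholders, and the input file only needs to be reasonably large and free of NUL bytes.

# Build the attached test program.
mpicc -o mmap_send_test mmap_send_test.c

# Create a ~1 MiB input file without NUL bytes.
yes 'xxxxxxxxxxxxxxxx' | head -c 1048576 > somelargefile

# Reproduce with vader enabled, then verify the workaround:
mpirun -n 2 ./mmap_send_test mmap
mpirun -n 2 --mca btl '^vader' ./mmap_send_test mmap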

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <mpi.h>

#define SIZE 1
#define DATA 2
#define ACK 3

#define MMAP 1
#define MALLOC 2

void master(int nodes, int me, int argc, char **argv);
void slave(int nodes, int me, int argc, char **argv);

void terminate(void)
{
MPI_Abort(MPI_COMM_WORLD,1);
}

int main(int argc, char **argv)
{

int me, nodes;
char hn[1024];

MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &me);
MPI_Comm_size(MPI_COMM_WORLD, &nodes);

gethostname(hn, 1023);
fprintf(stderr, "Me %d: @ %s\n", me, hn);

if (me == 0) {
	master(nodes, me, argc, argv);
} else {
	slave(nodes, me, argc, argv);
}

MPI_Finalize();

return 0;
}

void master(int nodes, int me, int argc, char **argv)
{
int fd, ret, n, ack;
struct stat st;
char *buff;
size_t l, sz, mmaplength = 1024*1024, malloclength = 1024*1024;
MPI_Status status;
int mode;

mode = MMAP;
if (argc > 1) {
	if (strcmp(argv[1], "mmap") == 0) {
	mode = MMAP;
	}
	if (strcmp(argv[1], "malloc") == 0) {
	mode = MALLOC;
	}
}
if (mode == MMAP) {
	fd = open("somelargefile", O_RDONLY);
	if (fd < 0) {
	perror("Could nod open input file");
	terminate();
	}
	ret = fstat(fd, &st);
	if (ret != 0) {
	perror("Could not stat input file");
	close(fd);
	terminate();
	}
	sz = st.st_size;
	buff = mmap(NULL, mmaplength, PROT_READ, MAP_PRIVATE, fd, 0);
	if (buff == MAP_FAILED) {
	perror("Failed to mmap input file");
	terminate();
	}
}
if (mode == MALLOC) {
	buff = malloc(malloclength);
	memset(buff, 'N', malloclength);
}

for (n = 1; n < nodes; n++) {
	l = 34986;
	ack = 0;
	while (!ack) {
	fprintf(stderr, "Start MPI_send of size (%ld) to %d\n", l, n);
	ret = MPI_Send(&l, 1, MPI_LONG, n, SIZE, MPI_COMM_WORLD);
	if (ret != MPI_SUCCESS) {
		fprintf(stderr, "MPI_Send of size (%ld) to %d failed with ret (%d)\n",
		l, n, ret);
	}
	fprintf(stderr, "Start MPI_send of data (sz %ld) to %d\n", l, n);
	ret = MPI_Send(buff, (int)l, MPI_BYTE, n, DATA, MPI_COMM_WORLD);
	if (ret != MPI_SUCCESS) {
		fprintf(stderr, "MPI_Send of buff (sz %ld) to %d failed with ret (%d)\n",
		l, n, ret);
	}
	fprintf(stderr, "Start MPI_recv of ack (sz %ld) to %d\n", l, n);
	ret = MPI_Recv(&ack, 1, MPI_INT, n, ACK, MPI_COMM_WORLD, &status);
	if (ret != MPI_SUCCESS) {
		fprintf(stderr, "MPI_Recv of ack from %d failed with ret (%d), status.error (%d)\n",
		n, ret, status.MPI_ERROR);
	}
	if (!ack) {
		fprintf(stderr, "Master: slave %d got NULL in data, reducing size\n", n);
		l -= 32;
	}
	}
}

if (mode == MMAP) {
	munmap(buff, mmaplength);
	close(fd);
}
if (mode == MALLOC) {
	free(buff);
}

fprintf(stderr, "Master returning\n");
return;
}

void slave(int nodes, int me, int argc, char **argv)
{
int ret, ack, i;
size_t newdatasz, datasz = 0;
char *data = NULL;
MPI_Status status;

ack = 0;
while (!ack) {
	newdatasz = 0;
	fprintf(stderr, "Me %d: Start MPI_recv of size\n", me);
	ret = MPI_Recv(&newdatasz, 1, MPI_LONG, 0, SIZE, MPI_COMM_WORLD, &status);
	if (ret != MPI_SUCCESS) {
	fprintf(stderr, "Me %d: Recv of newdatasz failed with ret (%d), status.error (%d)\n",
		me, ret, status.MPI_ERROR);
	}
	fprintf(stderr, "Me %d: got newdatasz (%ld)\n", me, newdatasz);
	if (newdatasz > datasz) {
	data = realloc(data, newdatasz);
	}
	if (newdatasz == 0) {
	fprintf(stderr, "Me %d: ERROR got newdatasz == 0!!\n", me);
	terminate();
	}
	fprintf(stderr, "Me %d: Start MPI_recv of data (sz %ld)\n", me, newdatasz);
	ret = MPI_Recv(data, (int)newdatasz, MPI_CHAR, 0, DATA, MPI_COMM_WORLD, &status);
	if (ret != MPI_SUCCESS) {
	fprintf(stderr, "Me %d: Recv of data (sz %ld) failed with ret (%d), status.error (%d)\n",
		me, newdatasz, ret, status.MPI_ERROR);
	}
	ack = 1;
	for (i = 0; i < newdatasz; i++) {
	if (data[i] == '\0') {
		fprintf(stderr, "Me %d: Got NULL in data at pos %d\n", me, i);
		ack = 0;
		break;
	}
	}
	fprintf(stderr, "Me %d: Start MPI_send of ack (%d) (sz %ld)\n", me, ack, newdatasz);
	ret = MPI_Send(&ack, 1, MPI_INT, 0, ACK, MPI_COMM_WORLD);
	if (ret != MPI_SUCCESS) {
	fprintf(stderr, "Me %d: mpi_send ack returned error, %d\n",
		me, ret);
	}
	datasz = newdatasz;
}

fprintf(stderr, "Me %d: returning\n", me);
return;
}


Re: [OMPI users] Bug in ompi/errhandler/errcode.h (1.8.6)?

2015-08-14 Thread Åke Sandgren

This problem still exists in 1.8.8

On 06/29/2015 05:37 PM, Jeff Squyres (jsquyres) wrote:

Good catch; fixed.

Thanks!



On Jun 29, 2015, at 7:28 AM, Åke Sandgren <ake.sandg...@hpc2n.umu.se> wrote:

Hi!

static inline int ompi_mpi_errnum_is_class ( int errnum )
{
ompi_mpi_errcode_t *err;

if (errno < 0) {
return false;
}

I assume it should be errnum < 0.


--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-11 Thread Åke Sandgren



On 08/11/2015 10:22 AM, Gilles Gouaillardet wrote:

I do not know the context, so I should not jump to any conclusion ...
if xxx.h is in $HCOLL_HOME/include/hcoll in hcoll version Y, but in
$HCOLL_HOME/include/hcoll/api in hcoll version Z, then the relative path
to $HCOLL_HOME/include cannot be hard coded.


It can be done, by using version detection of hcoll and #if/#else
around the includes. But the risk of files moving in or out of an "api"
include dir (relative to another include dir in the same package) should
be fairly small, I think, regardless of whether it is hcoll or some other package.


--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] Open MPI 1.8.8 and hcoll in system space

2015-08-11 Thread Åke Sandgren

Please fix the hcoll test (and code) to be correct.

Any configure test that adds /usr/lib and/or /usr/include to any compile 
flags is broken.


And if hcoll include files are under $HCOLL_HOME/include/hcoll (and
hcoll/api) then the include directives in the source should be

#include <hcoll/...>
and
#include <hcoll/api/...>
respectively.

I.e. one should never add -I$HCOLL_HOME/include/hcoll to CPPFLAGS, only 
-I$HCOLL_HOME/include.


Doing otherwise is bad design and a big cause for problems with include 
files from different packages having the same name...


My opinion at least...
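A tiny illustration of the difference (paths are the ones from this thread; the header name is a placeholder):

# Preferred: only the package prefix on the include path, with the
# sub-directories spelled out in the source:
#     #include <hcoll/api/some_header.h>
cc -I$HCOLL_HOME/include -c foo.c

# Fragile: pushing the sub-directories into CPPFLAGS means any header
# name clash between packages is resolved purely by -I ordering.
cc -I$HCOLL_HOME/include/hcoll -I$HCOLL_HOME/include/hcoll/api -c foo.c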

On 08/11/2015 01:57 AM, Gilles Gouaillardet wrote:

David,

the configure help is misleading about hcoll ...

  --with-hcoll(=DIR)  Build hcoll (Mellanox Hierarchical Collectives)
   support, searching for libraries in DIR

the =DIR is not really optional ...
configure will not complain if you configure with --with-hcoll or
--with-hcoll=yes
but hcoll will be disabled in this case

FWIW, here is a snippet of my config.status generated with
--with-hcoll=$HCOLL_HOME
/* I manually 'unexpanded' $HCOLL_HOME */
S["coll_hcoll_LIBS"]="-lhcoll "
S["coll_hcoll_LDFLAGS"]=" -L$HCOLL_HOME/lib"
S["coll_hcoll_CPPFLAGS"]=" -I$HCOLL_HOME/include"
S["coll_hcoll_CFLAGS"]=""
S["coll_hcoll_HOME"]="$HCOLL_HOME"
S["coll_hcoll_extra_CPPFLAGS"]="-I$HCOLL_HOME/include/hcoll
-I$HCOLL_HOME/include/hcoll/api"

Bottom line: if you configure with --with-hcoll=/usr it will add some
useless flags such as -L/usr/lib (or -L/usr/lib64, I am not sure about
that) and -I/usr/include,
but it will also add the required -I/usr/include/hcoll
-I/usr/include/hcoll/api flags.

If you believe this is an issue, I can revamp the hcoll detection (e.g.
configure --with-hcoll) but you might
need to manually set CPPFLAGS='-I/usr/include/hcoll
-I/usr/include/hcoll/api'.
If not, I guess I will simply update the configure help message ...

Cheers,

Gilles

On 8/11/2015 7:39 AM, David Shrader wrote:

Hello All,

I'm having some trouble getting Open MPI 1.8.8 to configure correctly
when hcoll is installed in system space. That is, hcoll is installed
to /usr/lib64 and /usr/include/hcoll. I get an error during configure:

$> ./configure --with-hcoll
...output snipped...
configure:219976: checking for MCA component coll:hcoll compile mode
configure:219982: result: static
configure:220039: checking --with-hcoll value
configure:220042: result: simple ok (unspecified)
configure:220840: error: HCOLL support requested but not found. Aborting

I have also tried using "--with-hcoll=yes" and gotten the same
behavior. Has anyone else gotten the hcoll component to build when
hcoll itself is in system space? I am using hcoll-3.2.748.

I did take a look at configure, and it looks like there is a test on
"with_hcoll" to see if it is not empty and not yes on line 220072. In
my case, this test fails, so the else clause gets invoked. The else
clause is several hundred lines below on line 220822 and simply sets
ompi_check_hcoll_happy="no". Configure doesn't try to
do anything to figure out if hcoll is usable, but it does quit soon
after with the above error because ompi_check_hcoll_happy isn't "yes."

In case it helps, here is the output from config.log for that area:

...output snipped...
configure:219976: checking for MCA component coll:hcoll compile mode
configure:219982: result: dso
configure:220039: checking --with-hcoll value
configure:220042: result: simple ok (unspecified)
configure:220840: error: HCOLL support requested but not found. Aborting

##  ##
## Cache variables. ##
##  ##
...output snipped...

Have I missed something in specifying --with-hcoll? I would prefer not
to use "--with-hcoll=/usr" as I am pretty sure that spurious linker
flags to that area will work their way in when they shouldn't.

Thanks,
David
--
David Shrader
HPC-3 High Performance Computer Systems
Los Alamos National Lab
Email: dshrader  lanl.gov


___
users mailing list
us...@open-mpi.org
Subscription:http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this 
post:http://www.open-mpi.org/community/lists/users/2015/08/27418.php




___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/08/27419.php



--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [hwloc-users] ***UNCHECKED*** Re: [WARNING: A/V UNSCANNABLE] hwloc 1.11.0 seems to have problem with 3.13 kernel on AMD bulldozer

2015-07-24 Thread Åke Sandgren

No, I haven't yet. I went on summer vacation before I had time.

On 07/24/2015 12:38 AM, Bill Broadley wrote:

I have the same problem with ubuntu 14.04.2 (fully patched) using the 3.13.0-58
and hwloc-1.11.0:

* hwloc 1.11.0 has encountered what looks like an error from the operating 
system.
*
* L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset 0x003f)
without inclusion!
* Error occurred in topology.c line 983

Here's the output of ./hwloc-gather-topology attached.

Anyone know if Ake Sandgren submitted a bug report with ubuntu?  URL?  If not I
can try.  I don't have a 3.2 system to collect the output from, but I can use
the one he attached.


--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


[hwloc-users] hwloc 1.11.0 seems to have problem with 3.13 kernel on AMD bulldozer

2015-07-09 Thread Åke Sandgren

Hi!

On a 48-core AMD bulldozer node with Ubuntu kernel 3.13.0-57-generic I
get this with hwloc 1.11.0


* hwloc 1.11.0 has encountered what looks like an error from the 
operating system.

*
* L3 (cpuset 0x03f0) intersects with NUMANode (P#0 cpuset 
0x003f) without inclusion!

* Error occurred in topology.c line 983
...

An identical node with kernel 3.2.0-87-generic and hwloc 1.11.0 shows no 
problem.


(The hwloc version in openmpi 1.8.6 also shows the same type of problem 
but with a slightly shorter message)


Attached tar file from hwloc-gather-topology

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


hwloc-gat-top.tar.bz2
Description: application/bzip


[OMPI users] Missing init of rc in modex (orte/mca/grpcomm/pmi/grpcomm_pmi_module.c), 1.8.6

2015-07-08 Thread Åke Sandgren

Hi!

rc in modex in orte/mca/grpcomm/pmi/grpcomm_pmi_module.c is not properly 
initialized and is causing problems at least with the intel compiler.


diff -ru site/orte/mca/grpcomm/pmi/grpcomm_pmi_module.c 
intel/orte/mca/grpcomm/pmi/grpcomm_pmi_module.c
--- site/orte/mca/grpcomm/pmi/grpcomm_pmi_module.c  2015-06-13 
22:34:43.2 +0200
+++ intel/orte/mca/grpcomm/pmi/grpcomm_pmi_module.c 2015-07-08 
22:23:57.2 +0200

@@ -149,7 +149,7 @@
 orte_process_name_t name;
 orte_vpid_t v;
 bool local;
-int rc, i;
+int rc = ORTE_SUCCESS, i;

 OPAL_OUTPUT_VERBOSE((1, orte_grpcomm_base_framework.framework_output,
  "%s grpcomm:pmi: modex entered",


--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] my_sense in ompi_osc_sm_module_t not always protected by OPAL_HAVE_POSIX_THREADS

2015-06-29 Thread Åke Sandgren

Yeah, I thought so. Well, code reductions are good when correct :-)

On 06/29/2015 05:39 PM, Nathan Hjelm wrote:


Open MPI has required posix threads for some time. The check for
OPAL_HAVE_POSIX_THREADS in ompi/mca/osc/sm/osc_sm.h is stale and should
be removed. I will clean that out in master, 1.8, and 1.10.

-Nathan

On Mon, Jun 29, 2015 at 05:26:30PM +0200, Åke Sandgren wrote:

Hi!

The my_sense entity in struct ompi_osc_sm_module_t is protected by
OPAL_HAVE_POSIX_THREADS in the definition (ompi/mca/osc/sm/osc_sm.h)

But in ./ompi/mca/osc/sm/osc_sm_active_target.c it is not.

(Tripped on this due to a compiler problem which caused it to only partially
detect threads support, found for C++, missing for C/Fortran)

Not sure if it is something that need to be dealt with but reporting anyway.


--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] Bug in ompi/errhandler/errcode.h (1.8.6)?

2015-06-29 Thread Åke Sandgren
The interesting thing is that gcc/intel/portland all failed to detect 
this. Pathscale found it, and clang probably would.


On 06/29/2015 05:37 PM, Jeff Squyres (jsquyres) wrote:

Good catch; fixed.

Thanks!



On Jun 29, 2015, at 7:28 AM, Åke Sandgren <ake.sandg...@hpc2n.umu.se> wrote:

Hi!

static inline int ompi_mpi_errnum_is_class ( int errnum )
{
ompi_mpi_errcode_t *err;

if (errno < 0) {
return false;
}

I assume it should be errnum < 0.



--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] Bug in ompi/errhandler/errcode.h (1.8.6)?

2015-06-29 Thread Åke Sandgren

That's what I said. The code in openmpi checks errno and not errnum.

On 06/29/2015 05:27 PM, Nathan Hjelm wrote:


I see a typo. You are checking errno instead of errnum.

-Nathan

On Mon, Jun 29, 2015 at 01:28:11PM +0200, Åke Sandgren wrote:

Hi!

static inline int ompi_mpi_errnum_is_class ( int errnum )
{
 ompi_mpi_errcode_t *err;

 if (errno < 0) {
 return false;
 }

I assume it should be errnum < 0.

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/06/27208.php


___
users mailing list
us...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
Link to this post: 
http://www.open-mpi.org/community/lists/users/2015/06/27212.php


--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


[OMPI users] my_sense in ompi_osc_sm_module_t not always protected by OPAL_HAVE_POSIX_THREADS

2015-06-29 Thread Åke Sandgren

Hi!

The my_sense entity in struct ompi_osc_sm_module_t is protected by 
OPAL_HAVE_POSIX_THREADS in the definition (ompi/mca/osc/sm/osc_sm.h)


But in ./ompi/mca/osc/sm/osc_sm_active_target.c it is not.

(Tripped on this due to a compiler problem which caused it to only 
partially detect threads support, found for C++, missing for C/Fortran)


Not sure if it is something that need to be dealt with but reporting anyway.

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


[OMPI users] Bug in ompi/errhandler/errcode.h (1.8.6)?

2015-06-29 Thread Åke Sandgren

Hi!

static inline int ompi_mpi_errnum_is_class ( int errnum )
{
ompi_mpi_errcode_t *err;

if (errno < 0) {
return false;
}

I assume it should be errnum < 0.

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] OpenMPI 1.8 and PGI compilers

2014-04-30 Thread Åke Sandgren

On 04/29/2014 09:33 AM, Åke Sandgren wrote:

On 04/29/2014 07:55 AM, Åke Sandgren wrote:

On 04/29/2014 12:15 AM, Jeff Squyres (jsquyres) wrote:

Brian: Can you report this bug to PGI and see what they say?


PGC-S-0094-Illegal type conversion required (btl_scif_component.c: 215)
PGC/x86-64 Linux 14.3-0: compilation completed with severe errors
make[2]: *** [btl_scif_component.lo] Error 1


Has anyone successfully built OpenMPI 1.8 with PGI?


I have no problem building openmpi 1.8.1 with pgi 14.3

Did you specify anything special during configure?


Sorry didn't notice until after my reply that you had scif enabled.
And yes it fails for me too with that.

Reducing and reporting to PGI...


And in case you're interested it got assigned as "TPR 20405".

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] OpenMPI 1.8 and PGI compilers

2014-04-29 Thread Åke Sandgren

On 04/29/2014 07:55 AM, Åke Sandgren wrote:

On 04/29/2014 12:15 AM, Jeff Squyres (jsquyres) wrote:

Brian: Can you report this bug to PGI and see what they say?


PGC-S-0094-Illegal type conversion required (btl_scif_component.c: 215)
PGC/x86-64 Linux 14.3-0: compilation completed with severe errors
make[2]: *** [btl_scif_component.lo] Error 1


Has anyone successfully built OpenMPI 1.8 with PGI?


I have no problem building openmpi 1.8.1 with pgi 14.3

Did you specify anything special during configure?


Sorry didn't notice until after my reply that you had scif enabled.
And yes it fails for me too with that.

Reducing and reporting to PGI...

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] OpenMPI 1.8 and PGI compilers

2014-04-29 Thread Åke Sandgren

On 04/29/2014 12:15 AM, Jeff Squyres (jsquyres) wrote:

Brian: Can you report this bug to PGI and see what they say?


PGC-S-0094-Illegal type conversion required (btl_scif_component.c: 215)
PGC/x86-64 Linux 14.3-0: compilation completed with severe errors
make[2]: *** [btl_scif_component.lo] Error 1


Has anyone successfully built OpenMPI 1.8 with PGI?


I have no problem building openmpi 1.8.1 with pgi 14.3

Did you specify anything special during configure?

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] probable bug in 1.9a1r31409

2014-04-16 Thread Åke Sandgren

On 04/16/2014 02:25 PM, Åke Sandgren wrote:

Hi!

Found this problem when building r31409 with Pathscale 5.0

pshmem_barrier.c:81:6: error: redeclaration of 'pshmem_barrier_all' must
have the 'overloadable' attribute
void shmem_barrier_all(void)
  ^
../../../../oshmem/shmem/c/profile/defines.h:193:37: note: expanded from
macro 'shmem_barrier_all'
#define shmem_barrier_all   pshmem_barrier_all
 ^
pshmem_barrier.c:78:14: note: previous overload of function is here
#pragma weak shmem_barrier_all = pshmem_barrier_all
  ^
../../../../oshmem/shmem/c/profile/defines.h:193:37: note: expanded from
macro 'shmem_barrier_all'
#define shmem_barrier_all   pshmem_barrier_all
 ^
pragma weak and define clashing...



Suggested patch attached (actually there were two similar cases...)


--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
diff -ru site/oshmem/shmem/c/shmem_barrier.c 
amd64/oshmem/shmem/c/shmem_barrier.c
--- site/oshmem/shmem/c/shmem_barrier.c 2014-04-16 03:05:05.2 +0200
+++ amd64/oshmem/shmem/c/shmem_barrier.c2014-04-16 15:33:35.2 
+0200
@@ -24,6 +24,7 @@
 #if OSHMEM_PROFILING
 #include "oshmem/include/pshmem.h"
 #pragma weak shmem_barrier = pshmem_barrier
+#pragma weak shmem_barrier_all = pshmem_barrier_all
 #include "oshmem/shmem/c/profile/defines.h"
 #endif

@@ -74,10 +75,6 @@
 RUNTIME_CHECK_RC(rc);
 }

-#if OSHMEM_PROFILING
-#pragma weak shmem_barrier_all = pshmem_barrier_all
-#endif
-
 void shmem_barrier_all(void)
 {
 int rc = OSHMEM_SUCCESS;
diff -ru site/oshmem/shmem/c/shmem_get.c amd64/oshmem/shmem/c/shmem_get.c
--- site/oshmem/shmem/c/shmem_get.c 2014-04-16 03:05:05.2 +0200
+++ amd64/oshmem/shmem/c/shmem_get.c2014-04-16 15:09:48.2 +0200
@@ -52,7 +52,6 @@
 #pragma weak shmem_float_get = pshmem_float_get
 #pragma weak shmem_double_get = pshmem_double_get
 #pragma weak shmem_longdouble_get = pshmem_longdouble_get
-#include "oshmem/shmem/c/profile/defines.h"
 #endif

 SHMEM_TYPE_GET(_char, char)
@@ -90,6 +89,7 @@
 #pragma weak shmem_get32 = pshmem_get32
 #pragma weak shmem_get64 = pshmem_get64
 #pragma weak shmem_get128 = pshmem_get128
+#include "oshmem/shmem/c/profile/defines.h"
 #endif

 SHMEM_TYPE_GETMEM(_getmem, 1)


[OMPI users] probable bug in 1.9a1r31409

2014-04-16 Thread Åke Sandgren

Hi!

Found this problem when building r31409 with Pathscale 5.0

pshmem_barrier.c:81:6: error: redeclaration of 'pshmem_barrier_all' must 
have the 'overloadable' attribute

void shmem_barrier_all(void)
 ^
../../../../oshmem/shmem/c/profile/defines.h:193:37: note: expanded from 
macro 'shmem_barrier_all'

#define shmem_barrier_all   pshmem_barrier_all
^
pshmem_barrier.c:78:14: note: previous overload of function is here
#pragma weak shmem_barrier_all = pshmem_barrier_all
 ^
../../../../oshmem/shmem/c/profile/defines.h:193:37: note: expanded from 
macro 'shmem_barrier_all'

#define shmem_barrier_all   pshmem_barrier_all
^
pragma weak and define clashing...

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] openmpi 1.7.4rc1 and f08 interface

2014-02-03 Thread Åke Sandgren

On 02/01/2014 03:12 PM, Jeff Squyres (jsquyres) wrote:

I think that ompi_funloc_variant1 needs to do IMPORT to have access to the 
callback_variant1 definition before using it to define "FN"
I.e.
 !
  function ompi_funloc_variant1(fn)
use, intrinsic :: iso_c_binding, only: c_funptr
import
procedure(callback_variant1) :: fn


Now that I'm at work and can read the specs, it is clear that it needs the IMPORT clause.
You could probably do IMPORT :: callback_variant1 if you want to import as
little as possible.


--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] openmpi 1.7.4rc1 and f08 interface

2014-01-31 Thread Åke Sandgren

On 01/28/2014 08:26 PM, Jeff Squyres (jsquyres) wrote:

Ok, will do.

Yesterday, I put in a temporary behavioral test in configure that will exclude 
ekopath 5.0 in 1.7.4.  We'll remove this behavioral test once OMPI fixes the 
bug correctly (for 1.7.5).


I'm not 100% sure yet (my F2k3 spec is at work and I'm not) but the
ompi_funloc.tar.gz code in
https://svn.open-mpi.org/trac/ompi/ticket/4157 seems to be non-conformant.


   abstract interface
 !! This is the prototype for ONE of the MPI callback routines
 !
 function callback_variant1(val)
   integer :: val, callback_variant1
 end function
   end interface

   interface
 !! This is the OMPI conversion routine for ONE of the MPI callback 
routines

 !
  function ompi_funloc_variant1(fn)
use, intrinsic :: iso_c_binding, only: c_funptr
procedure(callback_variant1) :: fn
type(c_funptr) :: ompi_funloc_variant1
  end function ompi_funloc_variant1
   end interface

I think that ompi_funloc_variant1 needs to do IMPORT to have access to 
the callback_variant1 definition before using it to define "FN"

I.e.
 !
  function ompi_funloc_variant1(fn)
use, intrinsic :: iso_c_binding, only: c_funptr
import
procedure(callback_variant1) :: fn

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] openmpi 1.7.4rc1 and f08 interface

2014-01-27 Thread Åke Sandgren

On 01/27/2014 04:44 PM, Åke Sandgren wrote:

On 01/27/2014 04:31 PM, Jeff Squyres (jsquyres) wrote:

We *do* still have a problem in the mpi_f08 module that we probably
won't fix before 1.7.4 is released.  Here's the ticket:

 https://svn.open-mpi.org/trac/ompi/ticket/4157

Craig has a suggested patch, but a) I haven't had time to investigate
it yet, and b) we believe that, at least so far, this issue only
affects the as-yet unreleased gfortran 4.9 compiler.

All that being said, my limited Fortran knowledge is probably showing
here: is this the same issue that you're reporting with the ekopath
compiler?

If so, what version of the ekopath compiler are you using?


Yeah, that's exactly the problem I'm seeing.

I'm using Ekopath 5 (5.0.0) which was released a little while ago.
"PathScale EKOPath(tm) Compiler Suite: Version 5.0.0"

(which also gets rid of the 32-char limit in bind(c, name="...") that you
talked about with Christopher B., in case he hasn't had time to get back
to you on that)


I.e., if you need testing of patches for this, let me know.

--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] openmpi 1.7.4rc1 and f08 interface

2014-01-27 Thread Åke Sandgren

On 01/27/2014 04:31 PM, Jeff Squyres (jsquyres) wrote:

We *do* still have a problem in the mpi_f08 module that we probably won't fix 
before 1.7.4 is released.  Here's the ticket:

 https://svn.open-mpi.org/trac/ompi/ticket/4157

Craig has a suggested patch, but a) I haven't had time to investigate it yet, 
and b) we believe that, at least so far, this issue only affects the as-yet 
unreleased gfortran 4.9 compiler.

All that being said, my limited Fortran knowledge is probably showing here: is 
this the same issue that you're reporting with the ekopath compiler?

If so, what version of the ekopath compiler are you using?


Yeah, that's exactly the problem I'm seeing.

I'm using Ekopath 5 (5.0.0) which was released a little while ago.
"PathScale EKOPath(tm) Compiler Suite: Version 5.0.0"

(which also gets rid of the 32-char limit in bind(c, name="...") that you
talked about with Christopher B., in case he hasn't had time to get back
to you on that)


--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] openmpi 1.7.4rc1 and f08 interface

2014-01-27 Thread Åke Sandgren

On 01/27/2014 03:28 PM, Åke Sandgren wrote:

Hi!

I just started trying to build 1.7.4rc1 with the new Pathscale EkoPath5
compiler and stumbled onto this.

When building without --enable-mpi-f08-subarray-prototype i get into
problems with ompi/mpi/fortran/use-mpi-f08/mpi-f-interfaces-bind.h

It defines
subroutine ompi_comm_create_keyval_f(comm_copy_attr_fn,comm_delete_attr_fn, &
                                     comm_keyval,extra_state,ierror) &
   BIND(C, name="ompi_comm_create_keyval_f")
   use :: mpi_f08_types, only : MPI_ADDRESS_KIND
   use :: mpi_f08_interfaces_callbacks, only : MPI_Comm_copy_attr_function
   use :: mpi_f08_interfaces_callbacks, only : MPI_Comm_delete_attr_function
   implicit none
   OMPI_PROCEDURE(MPI_Comm_copy_attr_function) :: comm_copy_attr_fn
   OMPI_PROCEDURE(MPI_Comm_delete_attr_function) :: comm_delete_attr_fn
   INTEGER, INTENT(OUT) :: comm_keyval
   INTEGER(MPI_ADDRESS_KIND), INTENT(IN) :: extra_state
   INTEGER, INTENT(OUT) :: ierror
end subroutine ompi_comm_create_keyval_f

But at least the F2k3 spec says that
"Each dummy argument of an interoperable procedure or interface must be
an interoperable variable or an interoperable procedure."

The code above violates that since comm_copy_attr_fn is not
interoperable as far as i can see.
If I'm reading this wrong then please let me know...

The only definition of OMPI_PROCEDURE i can find is this one in
ompi/mpi/fortran/configure-fortran-output-bottom.h

#if OMPI_FORTRAN_HAVE_PROCEDURE
#define OMPI_PROCEDURE(name) PROCEDURE(name)
#else
#define OMPI_PROCEDURE(name) EXTERNAL
#endif

I currently don't have any F2k8 specs to check so if this is changed
there I'll try to get this sorted in the compiler.



According to people with the specs around, F2k8 has the same restriction:
"C1255 (R1229) If proc-language-binding-spec is specified for a 
procedure, each of the procedure’s dummy arguments shall be a 
nonoptional interoperable variable (15.3.5, 15.3.6) or a nonoptional 
interoperable procedure (15.3.7)."


--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


[OMPI users] openmpi 1.7.4rc1 and f08 interface

2014-01-27 Thread Åke Sandgren

Hi!

I just started trying to build 1.7.4rc1 with the new Pathscale EkoPath5 
compiler and stumbled onto this.


When building without --enable-mpi-f08-subarray-prototype i get into 
problems with ompi/mpi/fortran/use-mpi-f08/mpi-f-interfaces-bind.h


It defines
subroutine ompi_comm_create_keyval_f(comm_copy_attr_fn,comm_delete_attr_fn, &
                                     comm_keyval,extra_state,ierror) &
   BIND(C, name="ompi_comm_create_keyval_f")
   use :: mpi_f08_types, only : MPI_ADDRESS_KIND
   use :: mpi_f08_interfaces_callbacks, only : MPI_Comm_copy_attr_function
   use :: mpi_f08_interfaces_callbacks, only : MPI_Comm_delete_attr_function
   implicit none
   OMPI_PROCEDURE(MPI_Comm_copy_attr_function) :: comm_copy_attr_fn
   OMPI_PROCEDURE(MPI_Comm_delete_attr_function) :: comm_delete_attr_fn
   INTEGER, INTENT(OUT) :: comm_keyval
   INTEGER(MPI_ADDRESS_KIND), INTENT(IN) :: extra_state
   INTEGER, INTENT(OUT) :: ierror
end subroutine ompi_comm_create_keyval_f

But at least the F2k3 spec says that
"Each dummy argument of an interoperable procedure or interface must be 
an interoperable variable or an interoperable procedure."


The code above violates that since comm_copy_attr_fn is not 
interoperable as far as i can see.

If I'm reading this wrong then please let me know...

The only definition of OMPI_PROCEDURE i can find is this one in 
ompi/mpi/fortran/configure-fortran-output-bottom.h


#if OMPI_FORTRAN_HAVE_PROCEDURE
#define OMPI_PROCEDURE(name) PROCEDURE(name)
#else
#define OMPI_PROCEDURE(name) EXTERNAL
#endif

I currently don't have any F2k8 specs to check so if this is changed 
there I'll try to get this sorted in the compiler.


--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] forrtl: severe (174): SIGSEGV, segmentation fault occurred

2014-01-02 Thread Åke Sandgren

On 01/02/2014 11:08 AM, Hongyi Zhao wrote:

Hi all,

I compiled openmpi-1.6.5 with ifort-14.0.0, then I use the mpif90
wrapper of  openmpi to compile the siesta package - a DFT package,
> obtained from here: http://departments.icmab.es/leem/siesta/ .

After I successfully compile the   siesta package, then I use it to do
> some computations like this:

$ mpirun -np 2 transiesta < INPUT.fdf > OUTPUT.fdf

In this phase, I meet the followig error:


forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image  PCRoutineLine
Source
transiesta 019A8A59  Unknown   Unknown  Unknown
transiesta 019A73D0  Unknown   Unknown  Unknown

> I can't figure out the reason for this issue, any hints will be
highly appreciated.


Can you give me the INPUT.fdf and i can try running it with our build 
(which contains a bunch of bugfixes for siesta/transiesta)


--
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se


Re: [OMPI users] MPI_Allreduce on local machine

2010-07-28 Thread Åke Sandgren
On Wed, 2010-07-28 at 11:48 -0400, Gus Correa wrote:
> Hi Hugo, Jeff, list
> 
> Hugo: I think David Zhang's suggestion was to use
> MPI_REAL8 not MPI_REAL, instead of MPI_DOUBLE_PRECISION in your
> MPI_Allreduce call.
> 
> Still, to me it looks like OpenMPI is making double precision 4-byte 
> long, which is shorter than I expected it to be (8 bytes),
> at least when looking at the output of ompi_info --all.
> 
> I always get a size 4 for dbl prec in my x86_64 machine,
> from ompi_info --all.
> I confirmed this in six builds of OpenMPI 1.4.2:  gcc+gfortran,
> gcc+pgf90, gcc+ifort, icc+ifort, pgcc+pgf90, and opencc+openf95.
> Although the output of ompi_info never says this is actually the size
> of MPI_DOUBLE_PRECISION, just of "dbl prec", which is a bit ambiguous.
> 
> FWIW, I include the output below.  Note that alignment for gcc+ifort
> is 1, all others are 4.
> 
> Jeff:  Is this correct?

This is wrong, it should be 8 and alignment should be 8 even for Intel.
And i also see exactly the same thing.
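
A quick way to see what the library itself thinks, independent of the
"dbl prec" line in ompi_info, is a tiny C test (my own sketch, not part
of any OMPI tool; it assumes a build with Fortran support so that
MPI_DOUBLE_PRECISION is a real datatype handle):

/* Prints the size MPI reports for MPI_DOUBLE_PRECISION.
 * A correct x86_64 build should print 8. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int size;
    MPI_Init(&argc, &argv);
    MPI_Type_size(MPI_DOUBLE_PRECISION, &size);
    printf("MPI_DOUBLE_PRECISION size = %d bytes\n", size);
    MPI_Finalize();
    return 0;
}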

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] Problems building Open MPI 1.4.1 with Pathscale

2010-02-09 Thread Åke Sandgren
On Tue, 2010-02-09 at 13:42 -0500, Jeff Squyres wrote:
> Perhaps someone with a pathscale compiler support contract can investigate 
> this with them.
> 
> Have them contact us if they want/need help understanding our atomics; we're 
> happy to explain, etc. (the atomics are fairly localized to a small part of 
> OMPI).

I will surely do that.
It will take a few days though due to lots of other work.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] Problems building Open MPI 1.4.1 with Pathscale

2010-01-25 Thread Åke Sandgren
1 - Do you have problems with openmpi 1.4 too? (I don't, haven't built
1.4.1 yet)
2 - There is a bug in the pathscale compiler with -fPIC and -g that
generates incorrect dwarf2 data so debuggers get really confused and
will have BIG problems debugging the code. I'm chasing them to get a
fix...
3 - Do you have an example code that have problems?

On Mon, 2010-01-25 at 15:01 -0500, Jeff Squyres wrote:
> I'm afraid I don't have any clues offhand.  We *have* had problems with the 
> Pathscale compiler in the past that were never resolved by their support 
> crew.  However, they were of the "variables weren't initialized and the 
> process generally aborts" kind of failure, not a "persistent hang" kind of 
> failure.
> 
> Can you tell where in MPI_Init the process is hanging?  E.g., can you build 
> Open MPI with debugging enabled (such as by passing CFLAGS=-g to OMPI's 
> configure line) and then attach a debugger to a hung process and see what 
> it's stuck on?
> 
> 
> On Jan 25, 2010, at 7:52 AM, Rafael Arco Arredondo wrote:
> 
> > Hello:
> > 
> > I'm having some issues with Open MPI 1.4.1 and Pathscale compiler
> > (version 3.2). Open MPI builds successfully with the following configure
> > arguments:
> > 
> > ./configure --with-openib=/usr --with-openib-libdir=/usr/lib64
> > --with-sge --enable-static CC=pathcc CXX=pathCC F77=pathf90 F90=pathf90
> > FC=pathf90
> > 
> > (we have OpenFabrics 1.2 Infiniband drivers, by the way)
> > 
> > However, applications hang on MPI_Init (or maybe MPI_Comm_rank or
> > MPI_Comm_size, a basic hello-world anyway doesn't print 'Hello World
> > from node...'). I tried running them with and without SGE. Same result.
> > 
> > This hello-world works flawlessly when I build Open MPI with gcc:
> > 
> > ./configure --with-openib=/usr --with-openib-libdir=/usr/lib64
> > --with-sge --enable-static
> > 
> > This successful execution runs in one machine only, so it shouldn't use
> > Infiniband, and it also works when several nodes are used.
> > 
> > I was able to build previous versions of Open MPI with Pathscale (1.2.6
> > and 1.3.2, particularly). I tried building version 1.4.1 both with
> > Pathscale 3.2 and Pathscale 3.1. No difference.
> > 
> > Any ideas?
> > 
> > Thank you in advance,
> > 
> > Rafa
> > 
> > --
> > Rafael Arco Arredondo
> > Centro de Servicios de Informática y Redes de Comunicaciones
> > Universidad de Granada
> > 
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users
> > 
> 
> 



Re: [OMPI users] ScaLAPACK and OpenMPI > 1.3.1

2010-01-21 Thread Åke Sandgren
On Thu, 2010-01-21 at 15:40 -0600, Champagne, Nathan J. (JSC-EV)[Jacobs
Technology] wrote:
> >What is a correct result then?
> 
> The correct results are output by v1.3.1. The filename in the archive is 
> "sol_1.3.1_96.txt".
> 
> >How often do you get junk or NaNs compared to correct result.
> We haven't been able to quantify it. It almost seems random; similar to using 
> a variable that's uninitialized, expecting its initial value to be zero when 
> it may not be.

In that case i wonder what version of scalapack/blacs you are using?

I have run a bunch of tests with openmpi 1.3.3 and 1.4 all yield the
correct result.
Using Intel 10.1 with lapack 3.1.1 built my me + gotoblas, and also
tried mkl.

Also tried with Pathscale 3.2 with lapack 3.1.1/gotoblas still ok

I tried running with 128 cores too but still the same result (except one
small round-off difference)

I know that scalapack versions prior to 1.8.0 had a couple of bugs with
uninitialized vars.



Re: [OMPI users] ScaLAPACK and OpenMPI > 1.3.1

2010-01-21 Thread Åke Sandgren
On Thu, 2010-01-21 at 14:48 -0600, Champagne, Nathan J. (JSC-EV)[Jacobs
Technology] wrote:
> We started having a problem with OpenMPI beginning with version 1.3.2
> where the program output can be correct, junk, or NaNs (result is not
> predictable). The output is the solution of a matrix equation solved
> by ScaLAPACK. We are using the Intel Fortran compiler (version 11.1)
> and the GNU compiler (version 4.1.2) on Gentoo Linux. So far, the
> problem manifests itself for a matrix (N X N) with N ~ 10,000 or more
> with a processor count ~ 64 or more. Note that the problem still
> occurs using OpenMPI 1.4.1.
> 
>  
> 
> We build the ScaLAPACK and BLACS libraries locally and use the LAPACK
> and BLAS libraries supplied by Intel.
> 
>  
> 
> We wrote a test program to demonstrate the problem. The matrix is
> built on each processor (no communication). Then, the matrix is
> factored and solved. The solution vector is collected from the
> processors and printed to a file by the master processor. The program
> and associated OpenMPI information (ompi_info --all) are available at:
> 
>  
> 
> http://www.em-stuff.com/files/files.tar.gz
> 
>  
> 
> The file "compile" in the "test" directory is used to create the
> executable. Edit it to reflect libraries on your local machine. Data
> created using OpenMPI 1.3.1 and 1.4.1 are in the "output" directory
> for reference.

What is a correct result then?
Hard to test without knowing.

How often do you get junk or NaNs compared to correct result.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] memalign usage in OpenMPI and it's consequencesforTotalVIew

2009-10-01 Thread Åke Sandgren
On Thu, 2009-10-01 at 15:19 -0400, Jeff Squyres wrote:
> On Oct 1, 2009, at 2:19 PM, Åke Sandgren wrote:
> 
> > No it didn't. And memalign is obsolete according to the manpage.
> > posix_memalign is the one to use.
> >
> 
> 
> This particular call is testing the memalign intercept in the ptmalloc  
> component during startup; we can't replace it with posix_memalign.
> 
> Hence, the values that are passed are fairly meaningless.  It's just  
> testing that the intercept works.

Yes, but perhaps you need to verify that posix_memalign is also
intercepted?

I commented on memalign being obsolete since there are a couple of uses
of it in the rest of the openmpi code apart from that particular case.
They should probably be changed.
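
For illustration only (this is just a sketch, not the actual change to
the OMPI sources), the posix_memalign pattern those remaining memalign
calls could be moved to looks like this:

/* Sketch only: replacing an obsolete memalign(alignment, size) call
 * with posix_memalign(). */
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <stdio.h>

static void *aligned_alloc_compat(size_t alignment, size_t size)
{
    void *ptr = NULL;
    /* alignment must be a power of two and a multiple of sizeof(void *) */
    int rc = posix_memalign(&ptr, alignment, size);
    if (rc != 0) {
        fprintf(stderr, "posix_memalign failed: %d\n", rc);
        return NULL;
    }
    return ptr;   /* release with free() as usual */
}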



Re: [OMPI users] memalign usage in OpenMPI and it's consequencesfor TotalVIew

2009-10-01 Thread Åke Sandgren
On Thu, 2009-10-01 at 19:56 +0100, Ashley Pittman wrote:
> Simple malloc() returns pointers that are at least eight byte aligned
> anyway, I'm not sure what the reason for calling memalign() with a value
> of four would be be anyway.

That is not necessarily true on all systems.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] memalign usage in OpenMPI and it's consequencesfor TotalVIew

2009-10-01 Thread Åke Sandgren
On Thu, 2009-10-01 at 13:58 -0400, Jeff Squyres wrote:
> Did that make it over to the v1.3 branch?

No it didn't. And memalign is obsolete according to the manpage.
posix_memalign is the one to use.

> >
> > I think Jeff has already addressed this problem.
> >
> > https://svn.open-mpi.org/trac/ompi/changeset/21744




Re: [OMPI users] Valgrind writev() errors with 1.3.2.

2009-06-09 Thread Åke Sandgren
On Tue, 2009-06-09 at 12:01 -0600, Ralph Castain wrote:
> I can't speak to all of the OMPI code, but I can certainly create a
> new configure option --valgrind-friendly that would initialize the OOB
> comm buffers and other RTE-related memory to eliminate such warnings.
> 
> I would prefer to configure it out rather than adding a bunch of
> "if-then" checks for envars to avoid having the performance hit when
> not needed.
> Would that help?

Yes please!!!
I gave up the last attempt at finding memory problems in a user code
simply because of the amount of complaints from the openmpi parts.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



[OMPI users] oob-tcp problem, unreachable in orted_comm

2009-06-06 Thread Åke Sandgren
Just got this in a user job.
Any idea why it complains like this?
The original error was the infamous "RETRY EXCEEDED ERROR" but instead
of killing the job it showed this and never died.
I have never seen this happen before.

openmpi 1.3.2, built with intel 10.1
This binary is used A LOT (+50% of the system walltime) and has never
shown this specific problem, and rarely the "Retry exceeded error"
either.

[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0]-[[34820,0],1] oob-tcp:
Communication
 retries exceeded.  Can not communicate with peer
[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0] ORTE_ERROR_LOG: Unreachable
in file 
orted/orted_comm.c at line 130
[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0] ORTE_ERROR_LOG: Unreachable
in file 
orted/orted_comm.c at line 130
[p-bc2503.hpc2n.umu.se:11892] [[34820,0],0]-[[34820,0],1] oob-tcp:
Communication
 retries exceeded.  Can not communicate with peer


-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] Receiving MPI messages of unknown size

2009-06-04 Thread Åke Sandgren
On Thu, 2009-06-04 at 14:54 +1000, Lars Andersson wrote:
> Hi Gus,
> 
> Thanks for the suggestion. I've been thinking along those lines, but
> it seems to have drawbacks. Consider the following MPI conversation:
> 
> TimeNODE 1  NODE 2
> 0local work   local work
> 1post n-b recv  local work
> 2local work   post n-b send
> 3complete recv in 1 local work

Its been awhile since i did mpi programming but...
why not just post a n-b recv for the header too?
just tag it correctly.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] OpenMPI 1.3.2 with PathScale 3.2

2009-05-14 Thread Åke Sandgren
On Thu, 2009-05-14 at 13:35 -0700, Joshua Bernstein wrote:
> Greetings All,
> 
>   I'm trying to build OpenMPI 1.3.2 with the Pathscale compiler, version 
> 3.2. A 
> bit of the way through the build the compiler dies with what it things is a 
> bad 
> optimization. Has anybody else seen this, or know a work around for it? I'm 
> going to take it up with Pathscale of course, but I thought I'd throw it out 
> here:
> 
> ---SNIP---
> /opt/pathscale/bin/pathCC -DHAVE_CONFIG_H -I. -I../.. 
> -I../../extlib/otf/otflib 
> -I../../extlib/otf/otflib -I../../vtlib/ -I../../vtlib  -D_GNU_SOURCE -mp 
> -DVT_OMP -O3 -DNDEBUG -finline-functions -pthread -MT 
> vtfilter-vt_tracefilter.o 
> -MD -MP -MF .deps/vtfilter-vt_tracefilter.Tpo -c -o vtfilter-vt_tracefilter.o 
> `test -f 'vt_tracefilter.cc' || echo './'`vt_tracefilter.cc
> Signal: Segmentation fault in Global Optimization -- Dead Store Elimination 
> phase.
> Error: Signal Segmentation fault in phase Global Optimization -- Dead Store 
> Elimination -- processing aborted
> *** Internal stack backtrace:
> pathCC INTERNAL ERROR: /opt/pathscale/lib/3.2/be died due to signal 4

Haven't seen it. But I'm only using -O2 when building openmpi.
Report it quickly, if we're lucky they might get a fix into the 3.3
release that is due out very soon. (I just got the beta yesterday)

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



[OMPI users] Problems with "error polling LP CQ with status RNR"

2009-05-13 Thread Åke Sandgren
Hi!

I'm having problems getting the "error polling LP CQ with status
RNR..." on an otherwise completely empty system.
There are no errors visible in the error counters in any of the HCAs or
switches or anywhere else.

I'm running OMPI 1.3.2 built with pathscale 3.2

If i add -mca btl 'ofud,self,sm' the same code works ok.

It usually only shows up on runs with nodes=16:ppn=8 or higher, i.e. 8x8
works ok.

This might very well be a pathscale problem since when running with the
debug version of ompi 1.3.2 the problem goes away.

Complete error is:
error polling LP CQ with status RECEIVER NOT READY RETRY EXCEEDED ERROR
status number 13 for wr_id 465284992 opcode -1  vendor error 135 qp_idx
0

Any ideas as to where in the ompi code i should start reducing optimization
levels to pinpoint this?

I'll try some more tests tomorrow with a hopefully fresh mind...

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] PGI Fortran pthread support

2009-04-14 Thread Åke Sandgren
On Mon, 2009-04-13 at 16:48 -0600, Orion Poplawski wrote:
> Seeing the following building openmpi 1.3.1 on CentOS 5.3 with PGI pgf90 
> 8.0-5 fortran compiler:
> checking for PTHREAD_MUTEX_ERRORCHECK_NP... yes
> checking for PTHREAD_MUTEX_ERRORCHECK... yes
> checking for working POSIX threads package... no

> Is there any way to get the PGI Fortran compiler to support threads for 
> openmpi?

I recommend adding the attached pthread.h into pgi's internal include
dir.
The pthread.h in newer distros is VERY VERY GCC-centric and when using
any other compiler it very often fails to do the "right" thing.

This pthread.h sets needed GCC-isms before parsing the real pthread.h.

At least we haven't had any problems with getting openmpi and pgi to
work correctly together since.
(I found this problem when building openmpi 1.2.something)

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se

/* Wrapper pthread.h for PGI: fake the GCC-isms that the glibc pthread.h
 * expects before pulling in the real header, then undo them again. */

/* PGI has no __builtin_expect; make it a no-op */
#if ! defined(__builtin_expect)
# define __builtin_expect(expr, expected) (expr)
#endif

/* pretend to be a GNU-ish environment, remembering what we faked */
#if ! defined(__USE_GNU)
#define __USE_GNU
#define __PGI_USE_GNU
#endif

#if ! defined(__GNUC__)
#define __GNUC__ 2
#define __PGI_GNUC
#endif

/* parse the real system pthread.h */
#include_next <pthread.h>

/* undo the macros we faked above */
#if defined(__PGI_USE_GNU)
#undef __USE_GNU
#endif

#if defined(__PGI_GNUC)
#undef __GNUC__
#endif


[OMPI users] Possible regression from 1.2 to 1.3 when BLACS is involved

2009-03-24 Thread Åke Sandgren
Hi!

We're having problems with code that uses BLACS and openmpi 1.3.x.
When compiled with memory-manager turned on (base only), code using
BLACS either starts leaking memory or gets into some kind of deadlock.
The first code-case can be circumvented by using
mpi_leave_pinned_pipeline 0, but the second one could only be solved by
compiling openmpi without memory manager.

Building 1.3.1 with ptmalloc2-internal makes the second code break in
different ways.

Anyone else seen similar problems when using BLACS?

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] valgrind complaint in openmpi 1.3 (mca_mpool_sm_alloc)

2009-03-10 Thread Åke Sandgren
On Tue, 2009-03-10 at 09:23 -0800, Eugene Loh wrote:
> Åke Sandgren wrote:
> 
> >Hi!
> >
> >Valgrind seems to think that there is an use of uninitialized value in
> >mca_mpool_sm_alloc, i.e. the if(mpool_sm->mem_node >= 0) {
> >Backtracking that i found that mem_node is not set during initializing
> >in mca_mpool_sm_init.
> >The resources parameter is never used and the mpool_module->mem_node is
> >never initalized.
> >
> >Bug or not?
> >  
> >
> Apparently George fixed this in the trunk in r19257
> https://svn.open-mpi.org/source/history/ompi-trunk/ompi/mca/mpool/sm/mpool_sm_module.c
>  
> .  So, the resources parameter is never used, but you call 
> mca_mpool_sm_module_init(), which has the decency to set mem_node to 
> -1.  Not a helpful value, but a legal one.

So why not set it in the calling function which has access to the
precomputed resources value?

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] openib RETRY EXCEEDED ERROR

2009-02-27 Thread Åke Sandgren
On Fri, 2009-02-27 at 09:54 -0700, Matt Hughes wrote:
> 2009/2/26 Brett Pemberton :
> > [[1176,1],0][btl_openib_component.c:2905:handle_wc] from tango092.vpac.org
> > to: tango090 error polling LP CQ with status RETRY EXCEEDED ERROR status
> > number 12 for wr_id 38996224 opcode 0 qp_idx 0
> 
> What OS are you using?  I've seen this error and many other Infiniband
> related errors on RedHat enterprise linux 4 update 4, with ConnectX
> cards and various versions of OFED, up to version 1.3.  Depending on
> the MCA parameters, I also see hangs often enough to make native
> Infiniband unusable on this OS.
> 
> However, the openib btl works just fine on the same hardware and the
> same OFED/OpenMPI stack when used with Centos 4.6.  I suspect there
> may be something about the kernel that is contributing to these
> problems, but I haven't had a chance to test the kernel from 4.6 on
> 4.4.

We see these errors fairly frequently on our CentOS 5.2 system with
Mellanox InfiniHost III cards. The OFED stack is whatever CentOS 5.2
uses. Has anyone tested that with the 1.4 OFED stack?

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] undefined symbol: tm_init

2009-02-12 Thread Åke Sandgren
On Wed, 2009-02-11 at 17:14 -0700, Ralph Castain wrote:
> Actually, this was also the subject of another email thread on the  
> user list earlier today. The user noted that we had lost an important  
> line in our Makefile.am for the tm plm module, and that this was the  
> root cause of the problems you and others have been seeing. We don't  
> see it here because we always configure as shown below.
> 
> This has been fixed in the upcoming 1.3.1 release.
> 
> In the meantime, if you configure with --enable-static  --enable- 
> shared, the required library will be linked into OMPI and will be  
> available.
> 
> Sorry for the problem.
> Ralph

Or, do as i did.
In orte/mca/plm/tm/Makefile.in there is a line
mca_plm_tm_la_LIBADD = 
change it to
mca_plm_tm_la_LIBADD = $(plm_tm_LIBS)

and rebuild.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



[OMPI users] Bug in openmpi 1.3 orte/mca/plm/tm/Makefile.am

2009-02-11 Thread Åke Sandgren
Hi!

orte/mca/plm/tm/Makefile.am is missing a
mca_plm_tm_la_LIBADD = $(plm_tm_LIBS)
like the corresponding line in orte/mca/ras/tm/Makefile.am
mca_ras_tm_la_LIBADD...

I think this is the cause for the "undefined symbol: tm_init" mail from
2009-02-09 20:41:45 by Brett Pemberton

I have the same problem and when closely checking the resulting
Makefiles and build output i saw that mca_ras_tm.so gets the -ltorque
added but mca_plm_tm.so doesn't.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] Problem with btl_openib_endpoint_post_rr

2008-08-26 Thread Åke Sandgren
On Tue, 2008-08-26 at 15:02 +0300, Pavel Shamis (Pasha) wrote:
> Hi,
> Can you please provide more information about your setup:
> - OpenMPI version
> - Runtime tuning
> - Platform
> - IB vendor and driver version

openmpi: 1.2.6
runtime: mpirun -mca mpi_yield_when_idle 1 (PBS -l nodes=32:ppn=8)
platform: intel dual quadcore, centos5
ib: Mellanox Technologies MT25208 InfiniHost III Ex

libraries from centos5:
libibverbs-1.1.1-6.el5_1.1
libibumad-1.0.5-6.el5_1.1
libibcommon-1.0.3-6.el5_1.1

I haven't had time to run this enough to know if it happens every time or
only intermittently.
It takes a while before it happens...

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



[OMPI users] Problem with btl_openib_endpoint_post_rr

2008-08-26 Thread Åke Sandgren
Hi!

We have a code that (at least sometimes) gets the following error
message:
[p-bc2909][0,1,98][btl_openib_endpoint.h:201:btl_openib_endpoint_post_rr] error
posting receive errno says Numerical result out of range


Any ideas as to where i should start searching for the problem?

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] OpenMPI scaling > 512 cores

2008-06-04 Thread Åke Sandgren
On Wed, 2008-06-04 at 11:43 -0700, Scott Shaw wrote:
> Hi, I was wondering if anyone had any comments with regarding to my
> posting of questions.  Am I off base with my questions or is this the
> wrong forum for these types of questions?   
> 
> > 
> > Hi, I hope this is the right forum for my questions.  I am running
> into a
> > problem when scaling >512 cores on a infiniband cluster which has
> 14,336
> > cores. I am new to openmpi and trying to figure out the right -mca
> options

I don't have any real answer to your question except that i have had no
problems running HPL on our 672-node dual quad-core = 5376 cores with
InfiniBand.
We use verbs.
I wouldn't touch the oob parameters since it uses tcp over ethernet to
set up the environment.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] Problems using Intel MKL with OpenMPI and Pathscale

2008-04-14 Thread Åke Sandgren
On Sun, 2008-04-13 at 08:00 -0400, Jeff Squyres wrote:
> Do you get the same error if you disable the memory handling in Open  
> MPI?  You can configure OMPI with:
> 
>  --disable-memory-manager

Doesn't help, it still compiles ptmalloc2 and trying to turn off
ptmalloc2 at runtime doesn't help either.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] Problems using Intel MKL with OpenMPI and Pathscale

2008-04-13 Thread Åke Sandgren
On Sun, 2008-04-13 at 08:00 -0400, Jeff Squyres wrote:
> Do you get the same error if you disable the memory handling in Open  
> MPI?  You can configure OMPI with:
> 
>  --disable-memory-manager

Ah, I have apparently missed that config flag, will try on Monday.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



[OMPI users] Problems using Intel MKL with OpenMPI and Pathscale

2008-04-09 Thread Åke Sandgren
Hi!

I have an annoying problem that i hope someone here has some info on.

I'm trying to build a code with OpenMPI+Intel MKL+Pathscale.
When using the sequential (non-threaded) MKL everything is ok, but when
using the threaded MKL i get a segfault.

This doesn't happen when using MVAPICH so i suspect the memory handling
inside OpenMPI.

Versions used are:
OpenMPI 1.2.6
Pathscale 3.2beta
MKL 10.0.2.018

Has anyone seen anything like this?

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] Help: Trouble building OpenMPI v1.2.4 with PGI v7.0-6

2008-02-01 Thread Åke Sandgren
On Thu, 2008-01-31 at 16:01 -0800, Adam Moody wrote:
> Here is some more info.  The build works if I do either of:
> 
> (1)  Build with PGI v7.1-3 instead of PGI v7.0-3
> (2)  Or, drop the "-g" option in CXXFLAGS, i.e.,
> change:
> CXXFLAGS="-Msignextend -g -O2"
> to just:
> CXXFLAGS="-Msignextend -O2"
> 
> I'd still like to know if there is a better fix (I need a 7.0-3 build 
> and would prefer to have -g set).  Anyone know a better fix?
> Thanks again,

I haven't seen that problem, but if you want to build with pgi and -g
you need to patch the configure script like this.

It's where configure tests for building f90 modules;
without the patch it builds the module with -g and the main program with -g,
and the link fails due to missing symbols.

diff -ru site/configure amd64_ubuntu606-pgi/configure
--- site/configure  2007-10-20 03:06:03.0 +0200
+++ amd64_ubuntu606-pgi/configure   2007-11-29 13:57:46.0 +0100
@@ -37286,7 +37286,7 @@
 # 2 is actions to do if success
 # 3 is actions to do if fail
 echo "configure:37288: $FC $FCFLAGS $FCFLAGS_f90 conftest.f90 ${flag}subdir $LDFLAGS $LIBS" >&5
-$FC $FCFLAGS $FCFLAGS_f90 conftest.f90 ${flag}subdir $LDFLAGS $LIBS 1>&5 2>&1
+$FC $FCFLAGS $FCFLAGS_f90 conftest.f90 ${flag}subdir subdir/conftest-module.o $LDFLAGS $LIBS 1>&5 2>&1
 ompi_status=$?

 # 1 is the message


-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] SCALAPACK: Segmentation Fault (11) and Signal code:Address not mapped (1)

2008-01-31 Thread Åke Sandgren
On Wed, 2008-01-30 at 10:01 -0600, Backlund, Daniel wrote:
> Jeff, thank you for your suggestion, I am sure that the correct mpif.h is 
> being included. One 
> thing that I did not do in my original message was submit the job to SGE. I 
> did that and the 
> program still failed with the same seg fault messages.

> Hello all, I am using OMPI 1.2.4 on a Linux cluster (Rocks 4.2). OMPI was 
> configured to use the 
> Pathscale Compiler Suite installed in the (NFS mounted on nodes) 
> /home/PROGRAMS/pathscale. I am 
> trying to compile and run the example1.f that comes with the ACML package 
> from AMD, and I am 
> unable to get it to run. All nodes have the same Opteron processors and 2GB 
> ram per core. OMPI 
> was configured as below.

- upgrade to latest 3.1 version of Pathscale (3.1-231.14750 or later)
- upgrade to 1.2.5 of openmpi
- make sure you are using blacs 1.1.patch3
- make sure you are using scalapack 1.8.0

This combo is the first that i have seen pass the blacs and scalapack
selftests. Any earlier version combo of pathscale/openmpi used to fail
the blacstests horribly.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] Segmentation fault: intel 10.1.008 compilers w/openmpi-1.2.4

2007-12-04 Thread Åke Sandgren
On Tue, 2007-12-04 at 15:28 -0500, de Almeida, Valmor F. wrote:
> > -Original Message-
> > From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> > 
> > On Tue, 2007-12-04 at 09:33 +0100, Åke Sandgren wrote:
> > > On Sun, 2007-12-02 at 21:27 -0500, de Almeida, Valmor F. wrote:
> > >
> > > Run an nm on opal/mca/memory/ptmalloc2/.libs/malloc.o and check if
> > > malloc is defined in there.
> > >
> > > This seems to be the problem i have when compiling with pathscale.
> > > It removes the malloc (public_mALLOc) function from the objectfile but
> > > leaves the free (public_fREe) in there, resulting in malloc/free
> > > mismatch.
> > 
> > For pathscale the solution for me was to add -fno-builtin.
> > Now ompi_info doesn't segfault anymore.
> > 
> > Check if the intel 10 has something similar.
> 
> Below is the nm output. The no builtin compiler option you mentioned above 
> seems to belong to gcc. I have compiled openmpi-1.2.4 with the gcc-4.1.2 
> suite without problems.

Ok, it was a long shot anyway.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] Segmentation fault: intel 10.1.008 compilers w/ openmpi-1.2.4

2007-12-04 Thread Åke Sandgren
On Tue, 2007-12-04 at 09:33 +0100, Åke Sandgren wrote:
> On Sun, 2007-12-02 at 21:27 -0500, de Almeida, Valmor F. wrote:
> > Hello,
> > 
> > After compiling ompi-1.2.4 with the intel compiler suite 10.1.008, I get
> > 
> > ->mpicxx --showme
> > Segmentation fault
> > 
> > ->ompi_info
> > Segmentation fault
> > 
> > The 10.1.008 is the only one I know that officially supports the linux
> > kernel 2.6 and glibc-2.6 that I have on my system.
> > 
> > config.log file attached.
> > 
> > Any help appreciated.
> 
> Run an nm on opal/mca/memory/ptmalloc2/.libs/malloc.o and check if
> malloc is defined in there.
> 
> This seems to be the problem i have when compiling with pathscale.
> It removes the malloc (public_mALLOc) function from the objectfile but
> leaves the free (public_fREe) in there, resulting in malloc/free
> mismatch.

For pathscale the solution for me was to add -fno-builtin.
Now ompi_info doesn't segfault anymore.

Check if the intel 10 has something similar.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] Segmentation fault: intel 10.1.008 compilers w/ openmpi-1.2.4

2007-12-04 Thread Åke Sandgren
On Sun, 2007-12-02 at 21:27 -0500, de Almeida, Valmor F. wrote:
> Hello,
> 
> After compiling ompi-1.2.4 with the intel compiler suite 10.1.008, I get
> 
> ->mpicxx --showme
> Segmentation fault
> 
> ->ompi_info
> Segmentation fault
> 
> The 10.1.008 is the only one I know that officially supports the linux
> kernel 2.6 and glibc-2.6 that I have on my system.
> 
> config.log file attached.
> 
> Any help appreciated.

Run an nm on opal/mca/memory/ptmalloc2/.libs/malloc.o and check if
malloc is defined in there.

This seems to be the problem i have when compiling with pathscale.
It removes the malloc (public_mALLOc) function from the object file but
leaves the free (public_fREe) in there, resulting in a malloc/free
mismatch.



Re: [OMPI users] mpicc Segmentation Fault with Intel Compiler

2007-11-07 Thread Åke Sandgren
On Tue, 2007-11-06 at 20:49 -0500, Jeff Squyres wrote:
> On Nov 6, 2007, at 4:42 AM, Åke Sandgren wrote:
> 
> > I had the same problem with pathscale.
> 
> There is a known outstanding problem with the pathscale problem.  I am  
> still waiting for a solution from their engineers (we don't know yet  
> whether it's an OMPI issue or a Pathscale issue, but my [biased] money  
> is on a Pathscale issue :-) -- it doesn't happen with any other  
> compiler).
> 
> > Try this, i think it is the solution i found.
> >
> > diff -ru site/opal/runtime/opal_init.c amd64_ubuntu606-psc/opal/runtime/opal_init.c
> > --- site/opal/runtime/opal_init.c   2007-10-20 03:00:35.0 +0200
> > +++ amd64_ubuntu606-psc/opal/runtime/opal_init.c   2007-10-23 16:12:15.0 +0200
> > @@ -169,7 +169,7 @@
> > }
> >
> > /* register params for opal */
> > -if (OPAL_SUCCESS !=  opal_register_params()) {
> > +if (OPAL_SUCCESS !=  (ret = opal_register_params())) {
> > error = "opal_register_params";
> > goto return_error;
> > }
> 
> I don't see why this change would make any difference in terms of a  
> segv...?
> 
> I see that ret is an uninitialized variable in the error case (which  
> I'll fix -- thanks for pointing it out :-) ) -- but I don't see how  
> that would fix a segv.  Am I missing something?

The problem is that i don't really remember what fixed my problem (or if
it got interrupted before i managed to fix it in the first place).
I have been busy building other software for a couple of weeks.
The above was simply the only patch i had made where i didn't know
exactly what it was doing.

But judging from trying to run that version of ompi_info i still have
problems.

I've been working with this for a while and can hopefully continue
pursuing it next week or so.



Re: [OMPI users] mpicc Segmentation Fault with Intel Compiler

2007-11-06 Thread Åke Sandgren
On Tue, 2007-11-06 at 10:28 +0100, Michael Schulz wrote:
> Hi,
> 
> I've the same problem described by some other users, that I can't
> compile anything if I'm using the open-mpi compiled with the Intel- 
> Compiler.
> 
>  > ompi_info --all
> Segmentation fault
> 
> OpenSUSE 10.3
> Kernel: 2.6.22.9-0.4-default
> Intel P4
> 
> Configure-Flags: CC=icc, CXX=icpc, F77=ifort, F90=ifort
> 
> Intel-Compiler: both, C and Fortran 10.0.025
> 
> Is there any known solution?

I had the same problem with pathscale.
Try this, i think it is the solution i found.

diff -ru site/opal/runtime/opal_init.c amd64_ubuntu606-psc/opal/runtime/opal_init.c
--- site/opal/runtime/opal_init.c   2007-10-20 03:00:35.0 +0200
+++ amd64_ubuntu606-psc/opal/runtime/opal_init.c   2007-10-23 16:12:15.0 +0200
@@ -169,7 +169,7 @@
 }

 /* register params for opal */
-if (OPAL_SUCCESS !=  opal_register_params()) {
+if (OPAL_SUCCESS !=  (ret = opal_register_params())) {
 error = "opal_register_params";
 goto return_error;
 }

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] Bug in common_mx.c (1.2.5a0r16522)

2007-10-24 Thread Åke Sandgren
On Wed, 2007-10-24 at 09:00 +0200, Åke Sandgren wrote:
> Hi!
> 
> In common_mx.c the following looks wrong.
> ompi_common_mx_finalize(void)
> {
> mx_return_t mx_return;
> ompi_common_mx_initialize_ref_cnt--;
> if(ompi_common_mx_initialize == 0) {
> 
> That should be
> if(ompi_common_mx_initialize_ref_cnt == 0)
> right?
> 

And there was a missing return too.
Complete ompi_common_mx_finalize should be
int
ompi_common_mx_finalize(void)
{
    mx_return_t mx_return;
    ompi_common_mx_initialize_ref_cnt--;
    if(ompi_common_mx_initialize_ref_cnt == 0) {
        mx_return = mx_finalize();
        if(mx_return != MX_SUCCESS) {
            opal_output(0, "Error in mx_finalize (error %s)\n",
                        mx_strerror(mx_return));
            return OMPI_ERROR;
        }
    }
    return OMPI_SUCCESS;
}




[OMPI users] Bug in common_mx.c (1.2.5a0r16522)

2007-10-24 Thread Åke Sandgren
Hi!

In common_mx.c the following looks wrong.
ompi_common_mx_finalize(void)
{
mx_return_t mx_return;
ompi_common_mx_initialize_ref_cnt--;
if(ompi_common_mx_initialize == 0) {

That should be
if(ompi_common_mx_initialize_ref_cnt == 0)
right?

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] incorrect configure code (1.2.4 and earlier)

2007-09-27 Thread Åke Sandgren
On Thu, 2007-09-27 at 14:18 -0400, Tim Prins wrote:
> Åke Sandgren wrote:
> > On Thu, 2007-09-27 at 09:09 -0400, Tim Prins wrote:
> >> Hi Ake,
> >>
> >> Looking at the svn logs it looks like you reported the problems with 
> >> these checks quite a while ago and we fixed them (in r13773 
> >> https://svn.open-mpi.org/trac/ompi/changeset/13773), but we never moved 
> >> them to the 1.2 branch.
> > 
> > Yes, it's the same. Since i never saw it in the source i tried once more
> > with some more explanations just in case :-)
> > 
> >> I will ask for this to be moved to the 1.2 branch.
> > 
> > Good.
> > 
> >> However, the changes made for ompi_config_pthreads.m4 are different than 
> >> you are suggesting now. Is this changeset good enough, or are there 
> >> other changes you think should be made?
> > 
> > The ones i sent today are slightly more correct. There were 2 missing
> > LIBS="$orig_LIBS" in the failure cases.
> But do we really need these? It looks like configure aborts in these 
> cases (I am not a autoconf wizard, so I could be completely wrong here).

I don't know. I just put them in since it was the right thing to do. And
there were other variables that were reset in those places.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] incorrect configure code (1.2.4 and earlier)

2007-09-27 Thread Åke Sandgren
On Thu, 2007-09-27 at 09:09 -0400, Tim Prins wrote:
> Hi Ake,
> 
> Looking at the svn logs it looks like you reported the problems with 
> these checks quite a while ago and we fixed them (in r13773 
> https://svn.open-mpi.org/trac/ompi/changeset/13773), but we never moved 
> them to the 1.2 branch.

Yes, it's the same. Since i never saw it in the source i tried once more
with some more explanations just in case :-)

> I will ask for this to be moved to the 1.2 branch.

Good.

> However, the changes made for ompi_config_pthreads.m4 are different than 
> you are suggesting now. Is this changeset good enough, or are there 
> other changes you think should be made?

The ones i sent today are slightly more correct. There were 2 missing
LIBS="$orig_LIBS" in the failure cases.

If you compare the resulting file after patching you will see the
difference. They are in the "Can not find working threads configuration"
portions.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



[OMPI users] incorrect configure code (1.2.4 and earlier)

2007-09-27 Thread Åke Sandgren
Hi!

There are a couple of bugs in the configure scripts regarding threads
checking.

In ompi_check_pthread_pids.m4 the actual code for testing is wrong and
is also missing a CFLAG save/add-THREAD_CFLAGS/restore resulting in the
linking always failing for the -pthread test with gcc.
config.log looks like this.
=
configure:50353: checking if threads have different pids (pthreads on
linux)
configure:50409: gcc -o conftest -DNDEBUG -march=k8 -O3 -msse -msse2
-maccumulate-outgoing-args -finline-functions -fno-strict-aliasing
-fexceptions  conftest.c -lnsl -lutil  -lm  >&5
conftest.c: In function 'checkpid':
conftest.c:327: warning: cast to pointer from integer of different size
/tmp/ccqUaAns.o: In function `main':conftest.c:(.text+0x1f): undefined
reference to `pthread_create'
:conftest.c:(.text+0x2e): undefined reference to `pthread_join'
collect2: ld returned 1 exit status
configure:50412: $? = 1
configure: program exited with status 1
=

Adding the CFLAGS save/add/restore makes the code return the right answer
both on systems with the old pthreads implementation and on NPTL based
systems. BUT, the code as it stands is technically incorrect.
The patch has a corrected version.
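
For reference (this is not the attached patch, just my illustration of
what such a check has to do), a correct different-pids test can look
roughly like this; note the pid is handed back through a variable
instead of being cast into the pthread_exit pointer, and it has to be
compiled and linked with -pthread (the flag the current test forgets to
put into CFLAGS):

/* Illustration only: on old LinuxThreads the pid seen inside a thread
 * differs from the main thread's pid, on NPTL they are identical.
 * Exit status 1 = different pids. */
#include <pthread.h>
#include <unistd.h>

static pid_t thread_pid;

static void *checkpid(void *arg)
{
    thread_pid = getpid();    /* no cast of the pid into a void* */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, checkpid, NULL);
    pthread_join(t, NULL);
    return (thread_pid == getpid()) ? 0 : 1;
}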

There are also two bugs in ompi_config_pthreads.m4.
In OMPI_INTL_POSIX_THREADS_LIBS_CXX, the then-part of the second
if-statement incorrectly sets PTHREAD_LIBS to $pl, which at that point
isn't set yet, and the bottom-most if-else case in the for-pl loop
forgets to reset LIBS on failure.

In OMPI_INTL_POSIX_THREADS_LIBS_FC it resets LIBS whether successful or
not, resulting in -lpthread missing when checking for
PTHREAD_MUTEX_ERRORCHECK_NP, at least for some versions of pgi (6.1 and
older fail, 7.0 seems to always add -lpthread with pgf77 as the linker).

The output from configure in such a case looks like this:
checking if C compiler and POSIX threads work with -lpthread... yes
checking if C++ compiler and POSIX threads work with -lpthread... yes
checking if F77 compiler and POSIX threads work with -lpthread... yes
checking for PTHREAD_MUTEX_ERRORCHECK_NP... no
checking for PTHREAD_MUTEX_ERRORCHECK... no
(OS: Ubuntu Dapper, Compiler: pgi 6.1)

There is also a problem in the F90 modules include flag search.
The test currently does:
$FC -c conftest-module.f90
$FC conftest.f90

This doesn't work if one has set FCFLAGS=-g in the environment.
At least not with pgf90 since it needs the debug symbols from
conftest-module.o to be able to link.
You have to either add conftest-module.o to the compile line of conftest
or make it a three-stager, $FC -c conftest-module.f90; $FC -c
conftest.f90; $FC conftest.o conftest-module.o

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
diff -rU 10 site/config/ompi_config_pthreads.m4 p3/config/ompi_config_pthreads.m4
--- site/config/ompi_config_pthreads.m4	2006-08-15 22:14:05.0 +0200
+++ p3/config/ompi_config_pthreads.m4	2007-09-27 09:10:21.0 +0200
@@ -473,24 +473,24 @@
   CXXCPPFLAGS="$CXXCPPFLAGS $PTHREAD_CXXCPPFLAGS"
 fi
   ;;
 esac
 LIBS="$orig_LIBS $PTHREAD_LIBS"
 AC_LANG_PUSH(C++)
 OMPI_INTL_PTHREAD_TRY_LINK(ompi_pthread_cxx_success=1, 
   ompi_pthread_cxx_success=0)
 AC_LANG_POP(C++)
 if test "$ompi_pthread_cxx_success" = "1"; then
-  PTHREAD_LIBS="$pl"
   AC_MSG_RESULT([yes])
 else
   CXXCPPFLAGS="$orig_CXXCPPFLAGS"
+  LIBS="$orig_LIBS"
   AC_MSG_RESULT([no])
   AC_MSG_ERROR([Can not find working threads configuration.  aborting])
 fi
   else 
 for pl in $plibs; do
   AC_MSG_CHECKING([if C++ compiler and POSIX threads work with $pl])
   case "${host_cpu}-${host-_os}" in
 *-aix* | *-freebsd*)
   if test "`echo $CXXCPPFLAGS | grep 'D_THREAD_SAFE'`" = ""; then
 PTRHEAD_CXXCPPFLAGS="-D_THREAD_SAFE"
@@ -508,61 +508,62 @@
   AC_LANG_PUSH(C++)
   OMPI_INTL_PTHREAD_TRY_LINK(ompi_pthread_cxx_success=1, 
 ompi_pthread_cxx_success=0)
   AC_LANG_POP(C++)
   if test "$ompi_pthread_cxx_success" = "1"; then
 	PTHREAD_LIBS="$pl"
 AC_MSG_RESULT([yes])
   else
 PTHREAD_CXXCPPFLAGS=
 CXXCPPFLAGS="$orig_CXXCPPFLAGS"
+	LIBS="$orig_LIBS"
 AC_MSG_RESULT([no])
   fi
 done
   fi
 fi
 ])dnl


 AC_DEFUN([OMPI_INTL_POSIX_THREADS_LIBS_FC],[
 #
 # Fortran compiler
 #
 if test "$ompi_pthread_f77_success" = "0" -a "$OMPI_WANT_F77_BINDINGS" = "1"; then
   if test ! "$ompi_pthread_c_success" = "0" -a ! "$PTHREAD_LIBS" = "" ; then
 AC_MSG_CHECKING([if F77 compiler and POSIX threads work with $PTHREAD_LIBS])
 LIBS="$orig_LIBS $PTHREAD_LIBS"
 AC_LANG_PUSH(C)
 OMPI_INTL_PTHREAD_TRY_LINK_F77(ompi_pthread_f77_success=1, 
   ompi_pthread_f77_success=0)
 AC_LANG_POP(C)
-  

Re: [OMPI users] [Fwd: MPI question/problem] including code attachments

2007-06-22 Thread Åke Sandgren
On Thu, 2007-06-21 at 14:14 -0500, Anthony Chan wrote:
> What test you are refering to ?
> 
> config.log contains the test results of the features that configure is
> looking for.  Failure of some thread test does not mean OpenMPI can't
> support threads.  In fact, I was able to run a simple/correct
> MPI_THREAD_MULTIPLE program uses pthread with openmpi-1.2.3 and the
> program finishes normally on this multicore Ubuntu box.

Yes, but what i was referring to is the test that checks for NPTL vs Linux
threads. That test is/was broken and i don't remember offhand if they
by accident (using gcc) got it right or not on an NPTL system.
I know for certain that using non-GCC compilers the test fails horribly
and gets it completely wrong. (Unless they have fixed it since i tried
on 1.0 and early 1.1 and 1.2.)
I did send a patch but i have yet to see it being included.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] [Fwd: MPI question/problem] including code attachments

2007-06-21 Thread Åke Sandgren
On Thu, 2007-06-21 at 13:27 -0500, Anthony Chan wrote:
> It seems the hang only occurs when OpenMPI is built with
> --enable-mpi-threads --enable-progress-threads.  [My OpenMPI builds use
> gcc (GCC) 4.1.2 (Ubuntu 4.1.2-0ubuntu4)].  Probably
> --enable-mpi-threads is the relevant option to cause the hang.

Have you looked at the config.log file for the thread test? When i built
ompi on Ubuntu that test was broken.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] Fortran90 interfaces--problem?

2007-03-06 Thread Åke Sandgren
On Tue, 2007-03-06 at 09:51 -0500, Jeff Squyres wrote:
> On Mar 5, 2007, at 9:50 AM, Michael wrote:
> 
> > I have discovered a problem with the Fortran90 interfaces for all
> > types of communication when one uses derived datatypes (I'm currently
> > using openmpi-1.3a1r13918 [for testing] and openmpi-1.1.2 [for
> > compatibility with an HPC system]), for example
> >
> > call MPI_RECV(tsk,1,MPI_TASKSTATE,src,
> > 1,MPI_COMM_WORLD,MPI_STATUS_IGNORE,ier)
> >
> > where tsk is a Fortran 90 structure and MPI_TASKSTATE has been
> > created by MPI_TYPE_CREATE_STRUCT.
> >
> > At the moment I can't imagine a way to modify the OpenMPI interface
> > generation to work around this besides switching to --with-mpi-f90-
> > size=small.
> 
> This is unfortunately a known problem -- not just with Open MPI, but  
> with the F90 bindings specification in MPI.  :-(  Since there's no  
> F90 equivalent of C's (void*), there's no way to pass a variable of  
> arbitrary type through the MPI F90 bindings.  Hence, all we can do is  
> define bindings for all the known types (i.e., various dimension  
> sizes of the MPI types).
> 

What about the "Fortran 2003 ISO_C_BINDING"? Couldn't a C_LOC be used
here?
(I probably don't know what i'm talking about but i just saw a reference
to it.)

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



[OMPI users] Problems running Intel Mpi Benchmark (formerly PMB) with ompi 1.1.2 and 1.2b1

2006-11-16 Thread Åke Sandgren
Hi!

I'm having problems running the Allgather test of the IMB 3.0.

System: Ubuntu Dapper, dual Amd Opteron, Myricom MX 1.1.5
OMPI version: 1.1.2 and 1.2b1
buildflags -O0 -g

started with mpirun -mca  mpi_yield_when_idle 1 -mca
mpi_keep_peer_hostnames 0

(The problem also exists when mpi_yield_when_idle is 0)

When running with 88 nodes (one task per node) the test runs ok, but
when run with 89 nodes or more it never returns any data. It prints the
header, up to

# List of Benchmarks to run

# Allgather

and then nothing.

If i trap into the task 0 process with gdb it shows
#0  0x2b7185f9 in sched_yield () from /lib/libc.so.6
#1  0x2af48d06 in opal_progress () at runtime/opal_progress.c:301
#2  0x2e5948f4 in opal_condition_wait (c=0x2e69a890,
m=0x2e69a840) at condition.h:81
#3  0x2e59471d in __ompi_free_list_wait (fl=0x2e69a790,
item=0x7fae7948) at ompi_free_list.h:180
#4  0x2e594c86 in mca_btl_mx_prepare_src (btl=0x557200,
endpoint=0x7bd690, registration=0x0, convertor=0x5712e0, reserve=32,
size=0x7fae79e8) at btl_mx.c:263
#5  0x2e157507 in mca_bml_base_prepare_src (bml_btl=0x7bded0, reg=0x0,
conv=0x5712e0, reserve=32, size=0x7fae79e8, des=0x7fae7a00)
at bml.h:315
#6  0x2e157b57 in mca_pml_ob1_send_request_start_rndv (
sendreq=0x571200, bml_btl=0x7bded0, size=16292, flags=8)
at pml_ob1_sendreq.c:803
#7  0x2e14ba66 in mca_pml_ob1_send_request_start_btl (
sendreq=0x571200, bml_btl=0x7bded0) at pml_ob1_sendreq.h:332
#8  0x2e14b6f4 in mca_pml_ob1_send_request_start (sendreq=0x571200)
at pml_ob1_sendreq.h:374
#9  0x2e14bf6d in mca_pml_ob1_send (buf=0x2aaab0832010, count=65536,
datatype=0x50c180, dst=1, tag=-17, sendmode=MCA_PML_BASE_SEND_STANDARD,
comm=0x7ed4b0) at pml_ob1_isend.c:103
#10 0x2ecd7a0d in ompi_coll_tuned_bcast_intra_chain (
buff=0x2aaab0832010, count=373293056, datatype=0x50c180, root=0,
comm=0x7ed4b0, segsize=65536, chains=1) at coll_tuned_bcast.c:109
#11 0x2ecd7e90 in ompi_coll_tuned_bcast_intra_pipeline (
buffer=0x2aaab0832010, count=373293056, datatype=0x50c180, root=0,
comm=0x7ed4b0, segsize=65536) at coll_tuned_bcast.c:208
#12 0x2ecd2d79 in ompi_coll_tuned_bcast_intra_dec_fixed (
buff=0x2aaab0832010, count=373293056, datatype=0x50c180, root=0,
comm=0x7ed4b0) at coll_tuned_decision_fixed.c:205
#13 0x2e9bce6f in mca_coll_basic_allgather_intra (sbuf=0x2aaab0431010,
scount=4194304, sdtype=0x50c180, rbuf=0x2aaab0832010, rcount=4194304,
rdtype=0x50c180, comm=0x7ed4b0) at coll_basic_allgather.c:77
#14 0x2ac2efb2 in PMPI_Allgather (sendbuf=0x2aaab0431010,
sendcount=4194304, sendtype=0x50c180, recvbuf=0x2aaab0832010,
recvcount=4194304, recvtype=0x50c180, comm=0x7ed4b0) at pallgather.c:75
#15 0x004088e8 in IMB_allgather ()
#16 0x004065a2 in IMB_warm_up ()
#17 0x0040347a in main ()

The last task shows
#0  0x2b7185f9 in sched_yield () from /lib/libc.so.6
#1  0x2af48d06 in opal_progress () at runtime/opal_progress.c:301
#2  0x2e14aace in opal_condition_wait (c=0x2adb2880,
m=0x2adb2900) at condition.h:81
#3  0x2e14a9ad in mca_pml_ob1_recv (addr=0x2aaab002f010, count=65536,
datatype=0x50c180, src=87, tag=-17, comm=0x7fc170, status=0x0)
at pml_ob1_irecv.c:107
#4  0x2ecd7d07 in ompi_coll_tuned_bcast_intra_chain (
buff=0x2aaab002f010, count=373293056, datatype=0x50c180, root=0,
comm=0x7fc170, segsize=65536, chains=1) at coll_tuned_bcast.c:179
#5  0x2ecd7e90 in ompi_coll_tuned_bcast_intra_pipeline (
buffer=0x2aaab002f010, count=373293056, datatype=0x50c180, root=0,
comm=0x7fc170, segsize=65536) at coll_tuned_bcast.c:208
#6  0x2ecd2d79 in ompi_coll_tuned_bcast_intra_dec_fixed (
buff=0x2aaab002f010, count=373293056, datatype=0x50c180, root=0,
comm=0x7fc170) at coll_tuned_decision_fixed.c:205
#7  0x2e9bce6f in mca_coll_basic_allgather_intra (sbuf=0x2fc2e010,
scount=4194304, sdtype=0x50c180, rbuf=0x2aaab002f010, rcount=4194304,
rdtype=0x50c180, comm=0x7fc170) at coll_basic_allgather.c:77
#8  0x2ac2efb2 in PMPI_Allgather (sendbuf=0x2fc2e010,
sendcount=4194304, sendtype=0x50c180, recvbuf=0x2aaab002f010,
recvcount=4194304, recvtype=0x50c180, comm=0x7fc170) at pallgather.c:75
#9  0x004088e8 in IMB_allgather ()
#10 0x004065a2 in IMB_warm_up ()
#11 0x0040347a in main ()


Any ideas??

I have no problem running the Reduce_scatter or Allreduce test of IMB.



Re: [OMPI users] BLACS vs. OpenMPI 1.1.1 & 1.3

2006-10-16 Thread Åke Sandgren
On Mon, 2006-10-16 at 10:13 +0200, Åke Sandgren wrote:
> On Fri, 2006-10-06 at 00:04 -0400, Jeff Squyres wrote:
> > On 10/5/06 2:42 PM, "Michael Kluskens" <mk...@ieee.org> wrote:
> > 
> > > System: BLACS 1.1p3 on Debian Linux 3.1r3 on dual-opteron, gcc 3.3.5,
> > > Intel ifort 9.0.32 all tests with 4 processors (comments below)
> > > 
> > > OpenMPi 1.1.1 patched and OpenMPI 1.1.2 patched:
> > >C & F tests: no errors with default data set.  F test slowed down
> > > in the middle of the tests.
> > 
> > Good.  Can you expand on what you mean by "slowed down"?
> 
> Let's add some more data to this...
> BLACS 1.1p3
> Ubuntu Dapper 6.06
> dual opteron
> gcc 4.0
> gfortran 4.0 (for both f77 and f90)
> standard tests with 4 tasks on one node (i.e. 2 tasks per cpu)
> 
> OpenMPI 1.1.2rc3
> The tests come to a complete standstill at the integer bsbr tests.
> It consumes CPU all the time but nothing happens.

Actually, if I'm not too impatient it will progress, but VERY slowly.
A complete run of the blacstest takes 30+ minutes of CPU time...
From the bsbr tests onwards everything takes "forever".



Re: [OMPI users] Bugs in config tests for threads (1.1.2rc3 at least)

2006-10-11 Thread Åke Sandgren
On Fri, 2006-10-06 at 10:18 -0400, Brian W. Barrett wrote:
> Is there a platform on which this breaks?  It seems to have worked well 
> for years...  I'll take a closer look early next week.

Have you had a chance to look at this yet?
I could use a new "release" to build from since something currently
breaks down if I run autogen.sh locally.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] Bugs in config tests for threads (1.1.2rc3 at least)

2006-10-06 Thread Åke Sandgren
On Fri, 2006-10-06 at 10:18 -0400, Brian W. Barrett wrote:
> Is there a platform on which this breaks?  It seems to have worked well 
> for years...  I'll take a closer look early next week.

It should be a general problem as far as I know. It might have "worked
well for years", but it has never done the "right thing".

The changes to ompi_config_pthreads.m4 are just there to make sure LIBS has
the correct value when the checks for OMPI_INTL_POSIX_THREADS_LIBS are
finished and the PTHREAD_MUTEX_ERRORCHECK_NP test is started.
Take a look at the code before and after applying the patch and you'll see
what I'm aiming for.

The ompi_check_pthread_pids.m4 patch fixes two problems.
First, it makes sure that CFLAGS (and not only CPPFLAGS) contains -pthread
when building with gcc (it didn't, so the test always failed on all gcc
builds).
Second, it fixes the test code itself, which was incorrect. It worked
somewhat with the old LinuxThreads implementation, but the new,
POSIX-correct thread implementation is stricter about what can be returned
through pthread_exit.

So the patches fix things that were always broken...
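
For reference, the PID check is in essence trying to find out whether
getpid() differs between threads (it did with LinuxThreads, it doesn't with
NPTL). The program below is a reconstruction for illustration only, NOT the
actual configure test code; the file name, message and exit codes are made
up.

/* check_thread_pids.c -- sketch of the idea behind the configure test */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *checkpid(void *arg)
{
    /* Hand the result back through the argument instead of smuggling an
     * integer through pthread_exit(); that sidesteps the pthread_exit()
     * return-value problem mentioned above. */
    *(pid_t *)arg = getpid();
    pthread_exit(NULL);
}

int main(void)
{
    pthread_t t;
    pid_t main_pid = getpid();
    pid_t thread_pid = 0;

    if (pthread_create(&t, NULL, checkpid, &thread_pid) != 0) {
        return 2;
    }
    pthread_join(t, NULL);

    printf("threads %s a PID\n",
           thread_pid == main_pid ? "share" : "do not share");
    return thread_pid == main_pid ? 0 : 1;
}

Note that gcc needs -pthread in CFLAGS (not only CPPFLAGS) to build and link
this correctly, which is exactly the first problem the patch fixes.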



Re: [OMPI users] BLACS & OpenMPI

2006-10-03 Thread Åke Sandgren
On Mon, 2006-10-02 at 18:39 -0400, Michael Kluskens wrote:
> Having trouble getting BLACS to pass tests.
> 
> OpenMPI, BLACS, and blacstester built just fine.  Tester reports
> errors for integer and real cases #1 and #51 and more for the other
> types.
> 
>  is an open ticket  
> related to this.

Finally someone else with the same problem!!!

I tried the suggested fix from ticket 356 but it didn't help.
I still get lots of errors in the blacstest.

I'm running on a dual-CPU Opteron with Ubuntu Dapper and gcc-4.0.
The tests also failed on our i386 Ubuntu Breezy system with gcc-3.4.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se



Re: [OMPI users] BLACS Tester installation errors

2006-09-21 Thread Åke Sandgren
On Thu, 2006-09-21 at 09:26 -0400, Benjamin Gaudio wrote:
> I have installed OpenMPI 1.1.1 for the first time yesterday and am
> now having trouble getting the BLACS Tester to install properly. 
> OpenMPI seemed to build without error, and BLACS also built without
> any apparent errors.  When I tried to install the Blacs tester, one
> of the first lines of output was:

To compile blacstest with g77 you need
-fugly-complex -fno-globals -Wno-globals.
Add this to TESTING/Makefile where it compiles blacstest:

blacstest.o : blacstest.f
	$(F77) -fugly-complex -fno-globals -Wno-globals \
		$(F77NO_OPTFLAGS) -c $*.f

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90 7866126
Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se