Re: [OMPI users] (no subject)

2016-02-22 Thread Mike Dubman
Hi,
it seems that your Open MPI was compiled against OFED version X but is
running on OFED version Y, and X and Y are incompatible.
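
A quick way to sanity-check that theory on the nodes where jobs fail (hedged:
ofed_info ships with OFED installs, and the librdmacm path below is the usual
/usr/lib64 location, which may differ on your nodes):

  # installed OFED release on this node
  ofed_info -s
  # does this node's librdmacm export the symbol the error complains about?
  nm -D /usr/lib64/librdmacm.so.1 | grep rdma_get_src_port
  # which librdmacm does the user's libmpi actually resolve to?
  ldd /home/pbme002/opt/gcc-4.8.2-tpls/openmpi-1.8.4/lib/libmpi.so.1 | grep rdmacm

Since the failures are intermittent and re-runs succeed, running this on each
compute node would show whether a subset of nodes carries an older librdmacm.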


On Mon, Feb 22, 2016 at 8:18 PM, Mark Potter  wrote:

> I am usually able to find the answer to my problems by searching the
> archive but I've run up against one that I can't suss out.
>
> bison-opt: relocation error:
> /home/pbme002/opt/gcc-4.8.2-tpls/openmpi-1.8.4/lib/libmpi.so.1: symbol
> rdma_get_src_port, version RDMACM_1.0 not defined in file librdmacm.so.1
> with link time reference
>
> That is the error I am getting; the problem is that it's not consistent.
> It happens to a random few jobs in a series of the same job on different
> data sets, and the ones that fail with the error run fine when a second
> attempt is made. I am the admin for this cluster, and the user is using
> their own compiled OpenMPI rather than the system OpenMPI, so I can't say
> for certain that it was compiled correctly, but it strikes me as odd that
> jobs would fail with the above error yet run perfectly well on a second
> attempt.
>
> I'm looking for any help sussing out what could be causing this issue.
>
> Regards,
>
> Mark L. Potter
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/02/28565.php
>



-- 

Kind Regards,

M.


Re: [OMPI users] hcoll dependency on mxm configure error

2015-10-21 Thread Mike Dubman
Could you please check whether you have the file /etc/ld.so.conf.d/mxm.conf
on your system? It will help us understand why hcoll did not detect
libmxm.so on the first attempt.

Thanks
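
For reference, a few quick checks on the build host (the hcoll path is the
one from the ldd output quoted below):

  cat /etc/ld.so.conf.d/mxm.conf      # should point at the MXM lib directory
  ldconfig -p | grep libmxm           # is libmxm in the loader cache?
  ldd /opt/mellanox/hcoll/lib/libhcoll.so | grep mxm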

On Wed, Oct 21, 2015 at 7:19 PM, David Shrader <dshra...@lanl.gov> wrote:

> We're using TOSS, which is based on Red Hat. The current version we're
> running is based on Red Hat 6.6. I'm actually not sure what MOFED version
> we're using right now based on what I can find on the system, and the
> admins responsible for that are out. I'll get back to you on that as soon
> as I know.
>
> Using LD_LIBRARY_PATH before configure got it to work, which I didn't
> expect. Thanks for the tip! I didn't realize that resolving a shared-library
> dependency of a library being linked on the active compile line fell under
> the runtime portion of linking and could be affected by LD_LIBRARY_PATH.
>
> Thanks!
> David
>
>
> On 10/21/2015 09:59 AM, Mike Dubman wrote:
>
> Hi David,
> What Linux distro do you use (and which MOFED version)?
> Do you have the /etc/ld.so.conf.d/mxm.conf file?
> Can you please try adding LD_LIBRARY_PATH=/opt/mellanox/mxm/lib before
> ./configure ?
>
>
> Thanks
>
> On Wed, Oct 21, 2015 at 6:40 PM, David Shrader <dshra...@lanl.gov> wrote:
>
>> I should probably point out that libhcoll.so does not know where
>> libmxm.so is:
>>
>> [dshrader@zo-fe1 ~]$ ldd /opt/mellanox/hcoll/lib/libhcoll.so
>> linux-vdso.so.1 =>  (0x7fffb2f1f000)
>> libibnetdisc.so.5 => /usr/lib64/libibnetdisc.so.5
>> (0x7fe31bd0b000)
>> libmxm.so.2 => not found
>> libz.so.1 => /lib64/libz.so.1 (0x7fe31baf4000)
>> libdl.so.2 => /lib64/libdl.so.2 (0x7fe31b8f)
>> libosmcomp.so.3 => /usr/lib64/libosmcomp.so.3 (0x7fe31b6e2000)
>> libocoms.so.0 => /opt/mellanox/hcoll/lib/libocoms.so.0
>> (0x7fe31b499000)
>> libm.so.6 => /lib64/libm.so.6 (0x7fe31b215000)
>> libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x7fe31b009000)
>> libalog.so.0 => /opt/mellanox/hcoll/lib/libalog.so.0
>> (0x7fe31adfe000)
>> librt.so.1 => /lib64/librt.so.1 (0x7fe31abf6000)
>> libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x7fe31a9ee000)
>> librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x7fe31a7d9000)
>> libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x7fe31a5c7000)
>> libpthread.so.0 => /lib64/libpthread.so.0 (0x7fe31a3a9000)
>> libc.so.6 => /lib64/libc.so.6 (0x7fe31a015000)
>> libglib-2.0.so.0 => /lib64/libglib-2.0.so.0 (0x7fe319cfe000)
>> libibmad.so.5 => /usr/lib64/libibmad.so.5 (0x7fe319ae3000)
>> /lib64/ld-linux-x86-64.so.2 (0x7fe31c2d3000)
>> libwrap.so.0 => /lib64/libwrap.so.0 (0x7fe3198d8000)
>> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7fe3196c2000)
>> libnsl.so.1 => /lib64/libnsl.so.1 (0x7fe3194a8000)
>> libutil.so.1 => /lib64/libutil.so.1 (0x7fe3192a5000)
>> libnl.so.1 => /lib64/libnl.so.1 (0x7fe319052000)
>>
>> Both hcoll and mxm were installed using the rpms provided by Mellanox.
>>
>> Thanks again,
>> David
>>
>>
>> On 10/21/2015 09:34 AM, David Shrader wrote:
>>
>>> Hello All,
>>>
>>> I'm currently trying to install 1.10.0 with hcoll and mxm, and am
>>> getting an error during configure:
>>>
>>> --- MCA component coll:hcoll (m4 configuration macro)
>>> checking for MCA component coll:hcoll compile mode... static
>>> checking hcoll/api/hcoll_api.h usability... yes
>>> checking hcoll/api/hcoll_api.h presence... yes
>>> checking for hcoll/api/hcoll_api.h... yes
>>> looking for library in lib
>>> checking for library containing hcoll_get_version... no
>>> looking for library in lib64
>>> checking for library containing hcoll_get_version... no
>>> configure: error: HCOLL support requested but not found.  Aborting
>>>
>>> The configure line I used:
>>>
>>> ./configure --with-mxm=/opt/mellanox/mxm
>>> --with-hcoll=/opt/mellanox/hcoll
>>> --with-platform=contrib/platform/lanl/toss/optimized-panasas
>>>
>>> Here are the corresponding lines from config.log:
>>>
>>> configure:217014: gcc -std=gnu99 -o conftest -O3 -DNDEBUG
>>> -I/opt/panfs/include -finline-functions -fno-strict-aliasing -pthread
>>> -I/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.10.0/openmpi-1.10.0/opal/mca/hwloc/hwloc191/hwloc/inclu

Re: [OMPI users] hcoll dependency on mxm configure error

2015-10-21 Thread Mike Dubman
Hi David,
What Linux distro do you use (and which MOFED version)?
Do you have the /etc/ld.so.conf.d/mxm.conf file?
Can you please try adding LD_LIBRARY_PATH=/opt/mellanox/mxm/lib before
./configure ?


Thanks
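
(For reference, a hedged form of the workaround that ended up working in this
thread: exporting the MXM lib dir lets configure's link test resolve
libmxm.so.2, which libhcoll.so depends on, because GNU ld also searches
LD_LIBRARY_PATH when resolving dependencies of shared libraries named on the
link line.)

  LD_LIBRARY_PATH=/opt/mellanox/mxm/lib:$LD_LIBRARY_PATH \
    ./configure --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll ...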

On Wed, Oct 21, 2015 at 6:40 PM, David Shrader  wrote:

> I should probably point out that libhcoll.so does not know where libmxm.so
> is:
>
> [dshrader@zo-fe1 ~]$ ldd /opt/mellanox/hcoll/lib/libhcoll.so
> linux-vdso.so.1 =>  (0x7fffb2f1f000)
> libibnetdisc.so.5 => /usr/lib64/libibnetdisc.so.5
> (0x7fe31bd0b000)
> libmxm.so.2 => not found
> libz.so.1 => /lib64/libz.so.1 (0x7fe31baf4000)
> libdl.so.2 => /lib64/libdl.so.2 (0x7fe31b8f)
> libosmcomp.so.3 => /usr/lib64/libosmcomp.so.3 (0x7fe31b6e2000)
> libocoms.so.0 => /opt/mellanox/hcoll/lib/libocoms.so.0
> (0x7fe31b499000)
> libm.so.6 => /lib64/libm.so.6 (0x7fe31b215000)
> libnuma.so.1 => /usr/lib64/libnuma.so.1 (0x7fe31b009000)
> libalog.so.0 => /opt/mellanox/hcoll/lib/libalog.so.0
> (0x7fe31adfe000)
> librt.so.1 => /lib64/librt.so.1 (0x7fe31abf6000)
> libibumad.so.3 => /usr/lib64/libibumad.so.3 (0x7fe31a9ee000)
> librdmacm.so.1 => /usr/lib64/librdmacm.so.1 (0x7fe31a7d9000)
> libibverbs.so.1 => /usr/lib64/libibverbs.so.1 (0x7fe31a5c7000)
> libpthread.so.0 => /lib64/libpthread.so.0 (0x7fe31a3a9000)
> libc.so.6 => /lib64/libc.so.6 (0x7fe31a015000)
> libglib-2.0.so.0 => /lib64/libglib-2.0.so.0 (0x7fe319cfe000)
> libibmad.so.5 => /usr/lib64/libibmad.so.5 (0x7fe319ae3000)
> /lib64/ld-linux-x86-64.so.2 (0x7fe31c2d3000)
> libwrap.so.0 => /lib64/libwrap.so.0 (0x7fe3198d8000)
> libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x7fe3196c2000)
> libnsl.so.1 => /lib64/libnsl.so.1 (0x7fe3194a8000)
> libutil.so.1 => /lib64/libutil.so.1 (0x7fe3192a5000)
> libnl.so.1 => /lib64/libnl.so.1 (0x7fe319052000)
>
> Both hcoll and mxm were installed using the rpms provided by Mellanox.
>
> Thanks again,
> David
>
>
> On 10/21/2015 09:34 AM, David Shrader wrote:
>
>> Hello All,
>>
>> I'm currently trying to install 1.10.0 with hcoll and mxm, and am getting
>> an error during configure:
>>
>> --- MCA component coll:hcoll (m4 configuration macro)
>> checking for MCA component coll:hcoll compile mode... static
>> checking hcoll/api/hcoll_api.h usability... yes
>> checking hcoll/api/hcoll_api.h presence... yes
>> checking for hcoll/api/hcoll_api.h... yes
>> looking for library in lib
>> checking for library containing hcoll_get_version... no
>> looking for library in lib64
>> checking for library containing hcoll_get_version... no
>> configure: error: HCOLL support requested but not found.  Aborting
>>
>> The configure line I used:
>>
>> ./configure --with-mxm=/opt/mellanox/mxm --with-hcoll=/opt/mellanox/hcoll
>> --with-platform=contrib/platform/lanl/toss/optimized-panasas
>>
>> Here are the corresponding lines from config.log:
>>
>> configure:217014: gcc -std=gnu99 -o conftest -O3 -DNDEBUG
>> -I/opt/panfs/include -finline-functions -fno-strict-aliasing -pthread
>> -I/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.10.0/openmpi-1.10.0/opal/mca/hwloc/hwloc191/hwloc/include
>> -I/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.10.0/openmpi-1.10.0/opal/mca/event/libevent2021/libevent
>> -I/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.10.0/openmpi-1.10.0/opal/mca/event/libevent2021/libevent/include
>> -I/opt/mellanox/hcoll/include   -L/opt/mellanox/hcoll/lib conftest.c
>> -lhcoll  -lrt -lm -lutil   >&5
>> /usr/bin/ld: warning: libmxm.so.2, needed by
>> /opt/mellanox/hcoll/lib/libhcoll.so, not found (try using -rpath or
>> -rpath-link)
>> /opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_req_recv'
>> /opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to
>> `mxm_ep_create'
>> /opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to
>> `mxm_config_free_context_opts'
>> /opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to
>> `mxm_ep_destroy'
>> /opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to
>> `mxm_config_free_ep_opts'
>> /opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_progress'
>> /opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to
>> `mxm_config_read_opts'
>> /opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to
>> `mxm_ep_disconnect'
>> /opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to
>> `mxm_mq_destroy'
>> /opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to
>> `mxm_mq_create'
>> /opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_cleanup'
>> /opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_req_send'
>> /opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to
>> `mxm_ep_connect'
>> /opt/mellanox/hcoll/lib/libhcoll.so: undefined reference to `mxm_init'
>> 

Re: [OMPI users] Using OpenMPI (1.8, 1.10) with Mellanox MXM, ulimits ?

2015-10-01 Thread Mike Dubman
Right, it is not an attribute of mxm but a general effect.
And you are right again: performance engineering will always be needed for
best performance in some cases.

OMPI and mxm try to deliver good out-of-the-box performance for any workload,
but OS tuning, hardware tuning, and OMPI or mxm tuning may be needed as well
(there is a reason that every MPI has hundreds of knobs).
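
(For the OMPI side of those knobs, the full parameter list can be dumped with
ompi_info; a hedged example using only standard options:)

  ompi_info -a | less           # all MCA parameters with descriptions
  ompi_info --param pml all     # just the pml framework, for example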


On Thu, Oct 1, 2015 at 1:50 PM, Dave Love <d.l...@liverpool.ac.uk> wrote:

> Mike Dubman <mi...@dev.mellanox.co.il> writes:
>
> > we did not get to the bottom for "why".
> > Tried different MPI packages (MVAPICH, Intel MPI) and the observation holds
> > true.
>
> Does that mean it's a general effect, unrelated to mxm, or that it is
> related?
>
> > it could be many factors affected by the huge heap size (CPU cache misses?
> > swappiness?).
>
> I'm sure we're grateful for any information, and I don't mean to be
> rude, but this could be frustrating to people told they should do
> performance engineering and trying to understand what might be going on.
> [Was "heap" a typo?]
>
> > On Wed, Sep 30, 2015 at 1:12 PM, Dave Love <d.l...@liverpool.ac.uk>
> wrote:
> >
> >> Mike Dubman <mi...@dev.mellanox.co.il> writes:
> >>
> >> > Hello Grigory,
> >> >
> >> > We observed ~10% performance degradation with heap size set to
> unlimited
> >> > for CFD applications.
> >>
> >> OK, but why?  It would help to understand what the mechanism is, and why
> >> MXM specifically tells you to set the stack to the default, which may
> >> well be wrong for the application.
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/10/27759.php
>



-- 

Kind Regards,

M.


Re: [OMPI users] Using OpenMPI (1.8, 1.10) with Mellanox MXM, ulimits ?

2015-10-01 Thread Mike Dubman
Thanks Nathan, you are right; we will fix it.

On Wed, Sep 30, 2015 at 7:02 PM, Nathan Hjelm <hje...@lanl.gov> wrote:

>
> Mike, I see a typo in the mxm warning:
>
> mxm.c:185  MXM  WARN  The
> 'ulimit -s' on the system is set to 'unlimited'. This may have negative
> performance implications. Please set the heap size to the default value
> (10240)
>
> Should say stack not heap.
>
> -Nathan
>
> On Wed, Sep 30, 2015 at 06:52:46PM +0300, Mike Dubman wrote:
> >mxm comes with mxm_dump_config utility which provides and explains all
> >tunables.
> >Please check HPCX/README file for details.
> >On Wed, Sep 30, 2015 at 1:21 PM, Dave Love <d.l...@liverpool.ac.uk>
> wrote:
> >
> >  Mike Dubman <mi...@dev.mellanox.co.il> writes:
> >
> >  > unfortunately, there is no one size fits all here.
> >  >
> >  > mxm provides best performance for IB.
> >  >
> >  > different application may require different OMPI, mxm, OS
> tunables and
> >  > requires some performance engineering.
> >
> >  Fair enough, but is there any guidance on the MXM stuff, in
> particular?
> >  There's essentially no useful information in the distribution I got.
> >  ___
> >  users mailing list
> >  us...@open-mpi.org
> >  Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >  Link to this post:
> >  http://www.open-mpi.org/community/lists/users/2015/09/27725.php
> >
> >--
> >Kind Regards,
> >M.
>
> > ___
> > users mailing list
> > us...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> > Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/09/27746.php
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/09/27747.php
>



-- 

Kind Regards,

M.


Re: [OMPI users] Using OpenMPI (1.8, 1.10) with Mellanox MXM, ulimits ?

2015-09-30 Thread Mike Dubman
mxm comes with the mxm_dump_config utility, which lists and explains all
tunables.
Please check HPCX/README file for details.
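
(A hedged example of running it; the install prefix below is the default
Mellanox location and may differ on your system:)

  /opt/mellanox/mxm/bin/mxm_dump_config | less              # full list, with explanations
  /opt/mellanox/mxm/bin/mxm_dump_config | grep -i MXM_TLS   # a single tunable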

On Wed, Sep 30, 2015 at 1:21 PM, Dave Love <d.l...@liverpool.ac.uk> wrote:

> Mike Dubman <mi...@dev.mellanox.co.il> writes:
>
> > unfortunately, there is no one size fits all here.
> >
> > mxm provides best performance for IB.
> >
> > different application may require different OMPI, mxm, OS tunables and
> > requires some performance engineering.
>
> Fair enough, but is there any guidance on the MXM stuff, in particular?
> There's essentially no useful information in the distribution I got.
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/09/27725.php
>



-- 

Kind Regards,

M.


Re: [OMPI users] Using OpenMPI (1.8, 1.10) with Mellanox MXM, ulimits ?

2015-09-30 Thread Mike Dubman
We did not get to the bottom of the "why".
We tried different MPI packages (MVAPICH, Intel MPI) and the observation holds
true.

It could be many factors affected by the huge heap size (CPU cache misses?
swappiness?).
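
(A few quick things to look at when chasing those guesses; these only report
the current settings, they are not recommendations:)

  ulimit -s                     # stack limit - the one the MXM warning checks
  ulimit -a                     # all limits, including max memory size
  cat /proc/sys/vm/swappiness   # VM swappiness, one of the guesses above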

On Wed, Sep 30, 2015 at 1:12 PM, Dave Love <d.l...@liverpool.ac.uk> wrote:

> Mike Dubman <mi...@dev.mellanox.co.il> writes:
>
> > Hello Grigory,
> >
> > We observed ~10% performance degradation with heap size set to unlimited
> > for CFD applications.
>
> OK, but why?  It would help to understand what the mechanism is, and why
> MXM specifically tells you to set the stack to the default, which may
> well be wrong for the application.
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/09/27723.php
>



-- 

Kind Regards,

M.


Re: [OMPI users] worse latency in 1.8 c.f. 1.6

2015-09-29 Thread Mike Dubman
What is your command line and setup (OFED version, distro)?

This is what was just measured with FDR on Haswell, using v1.8.8 with mxm
over UD:

+ mpirun -np 2 -bind-to core -display-map -mca rmaps_base_mapping_policy
dist:span -x MXM_RDMA_PORTS=mlx5_3:1 -mca rmaps_dist_device mlx5_3:1  -x
MXM_TLS=self,shm,ud osu_latency
 Data for JOB [65499,1] offset 0

 ========================   JOB MAP   ========================

 Data for node: clx-orion-001   Num slots: 28   Max slots: 0   Num procs: 1
        Process OMPI jobid: [65499,1] App: 0 Process rank: 0

 Data for node: clx-orion-002   Num slots: 28   Max slots: 0   Num procs: 1
        Process OMPI jobid: [65499,1] App: 0 Process rank: 1

 =============================================================
# OSU MPI Latency Test v4.4.1
# Size  Latency (us)
0   1.18
1   1.16
2   1.19
4   1.20
8   1.19
16  1.19
32  1.21
64  1.27


And with ob1 and the openib BTL:

mpirun -np 2 -bind-to core -display-map -mca rmaps_base_mapping_policy
dist:span  -mca rmaps_dist_device mlx5_3:1  -mca btl_if_include mlx5_3:1
-mca pml ob1 -mca btl openib,self osu_latency

# OSU MPI Latency Test v4.4.1
# Size  Latency (us)
0   1.13
1   1.17
2   1.17
4   1.17
8   1.22
16  1.23
32  1.25
64  1.28
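
(When comparing runs like these across OMPI versions, it can help to confirm
that binding and transport selection really match in both builds; a hedged
sketch using options available in the 1.8 series:)

  mpirun -np 2 -bind-to core --report-bindings \
         -mca pml_base_verbose 10 osu_latency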


On Tue, Sep 29, 2015 at 6:49 PM, Dave Love  wrote:

> I've just compared IB p2p latency between version 1.6.5 and 1.8.8.  I'm
> surprised to find that 1.8 is rather worse, as below.  Assuming that's
> not expected, are there any suggestions for debugging it?
>
> This is with FDR Mellanox, between two Sandybridge nodes on the same
> blade chassis switch.  The results are similar for IMB pingpong and
> osu_latency, and reproducible.  I'm running both cases the same way as
> far as I can tell (e.g. core binding with 1.6 and not turning it off
> with 1.8), just rebuilding the test against each OMPI version.
>
> The initial osu_latency figures for 1.6 are:
>
>   # OSU MPI Latency Test v5.0
>   # Size  Latency (us)
>   0   1.16
>   1   1.24
>   2   1.23
>   4   1.23
>   8   1.26
>   16  1.27
>   32  1.30
>   64  1.36
>
> and for 1.8:
>
>   # OSU MPI Latency Test v5.0
>   # Size  Latency (us)
>   0   1.48
>   1   1.46
>   2   1.42
>   4   1.43
>   8   1.46
>   16  1.47
>   32  1.48
>   64  1.54
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/09/27712.php
>



-- 

Kind Regards,

M.


Re: [OMPI users] Using OpenMPI (1.8, 1.10) with Mellanox MXM, ulimits ?

2015-09-29 Thread Mike Dubman
Unfortunately, there is no one-size-fits-all here.

mxm provides the best performance for IB.

Different applications may require different OMPI, mxm, and OS tunables, and
some performance engineering.

On Mon, Sep 28, 2015 at 9:49 PM, Grigory Shamov <grigory.sha...@umanitoba.ca
> wrote:

> Hi Nathan,
> Hi Mike,
>
> Thanks for the quick replies!
>
> My problem is that I don't know what my applications are. I mean, I know
> them, but we are a general-purpose cluster, running in production for quite
> a while, and there is everybody, from quantum chemists to machine learners
> to bioinformaticians. So a system-wide change might harm some of them, and
> doing per-app benchmarking/tuning looks a bit daunting.
>
> The default behaviour our users are used to was to have unlimited values
> for all memory limits. We have set it so a few years ago, as a response
> for some user complaints that applications won't start (we set the ulimits
> in Torque).
>
> Is it known (I know every application is different) how much it costs,
> performance-wise, to have MXM with good ulimits vs. unlimited ulimits, vs.
> not using MXM at all?
>
> --
> Grigory Shamov
>
> Westgrid/ComputeCanada Site Lead
> University of Manitoba
> E2-588 EITC Building,
> (204) 474-9625
>
>
>
>
>
>
> On 15-09-28 12:58 PM, "users on behalf of Nathan Hjelm"
> <users-boun...@open-mpi.org on behalf of hje...@lanl.gov> wrote:
>
> >
> >I would like to add that you may want to play with the value and see
> >what works for your applications. Most applications should be using
> >malloc or similar functions to allocate large memory regions in the heap
> >and not on the stack.
> >
> >-Nathan
> >
> >On Mon, Sep 28, 2015 at 08:01:09PM +0300, Mike Dubman wrote:
> >>Hello Grigory,
> >>We observed ~10% performance degradation with heap size set to
> >>unlimited
> >>for CFD applications.
> >>You can measure your application performance with default and
> >>unlimited
> >>"limits" and select the best setting.
> >>Kind Regards.
> >>M
> >>On Mon, Sep 28, 2015 at 7:36 PM, Grigory Shamov
> >><grigory.sha...@umanitoba.ca> wrote:
> >>
> >>  Hi All,
> >>
> >>  We have built OpenMPI (1.8.8., 1.10.0) against Mellanox OFED 2.4
> >>and
> >>  corresponding MXM. When it runs now, it gives the following
> >>warning, per
> >>  process:
> >>
> >>  [1443457390.911053] [myhist:5891 :0] mxm.c:185  MXM  WARN
> >>The
> >>  'ulimit -s' on the system is set to 'unlimited'. This may have
> >>negative
> >>  performance implications. Please set the heap size to the default
> >>value
> >>  (10240)
> >>
> >>  We have ulimits for heap (as well as most of the other limits) set
> >>  unlimited because of applications that might possibly need a lot
> >>of RAM.
> >>
> >>  The question is if we should do as MXM wants, or ignore it? Has
> >>anyone
> >>  an
> >>  experience running recent OpenMPI with MXM enabled, and what kind
> >>of
> >>  ulimits do you have? Any suggestions/comments appreciated, thanks!
> >>
> >>  --
> >>  Grigory Shamov
> >>
> >>  Westgrid/ComputeCanada Site Lead
> >>  University of Manitoba
> >>  E2-588 EITC Building,
> >>  (204) 474-9625
> >>
> >>  ___
> >>  users mailing list
> >>  us...@open-mpi.org
> >>  Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >>  Link to this post:
> >>  http://www.open-mpi.org/community/lists/users/2015/09/27697.php
> >>
> >>--
> >>Kind Regards,
> >>M.
> >
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> >> Link to this post:
> >>http://www.open-mpi.org/community/lists/users/2015/09/27698.php
> >
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/09/27701.php
>



-- 

Kind Regards,

M.


Re: [OMPI users] Using OpenMPI (1.8, 1.10) with Mellanox MXM, ulimits ?

2015-09-28 Thread Mike Dubman
Hello Grigory,

We observed ~10% performance degradation with heap size set to unlimited
for CFD applications.

You can measure your application performance with default and unlimited
"limits" and select the best setting.

Kind Regards.
M
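
(A minimal sketch of that A/B measurement from a job script; ./my_app is a
placeholder, and whether the limit propagates to remote ranks depends on your
launcher/RM, so it is worth verifying with "ulimit -s" from inside the
application or a wrapper:)

  ( ulimit -s 10240;     mpirun ./my_app )   # MXM-recommended default
  ( ulimit -s unlimited; mpirun ./my_app )   # current site-wide setting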

On Mon, Sep 28, 2015 at 7:36 PM, Grigory Shamov  wrote:

> Hi All,
>
>
> We have built OpenMPI (1.8.8., 1.10.0) against Mellanox OFED 2.4 and
> corresponding MXM. When it runs now, it gives the following warning, per
> process:
>
> [1443457390.911053] [myhist:5891 :0] mxm.c:185  MXM  WARN  The
> 'ulimit -s' on the system is set to 'unlimited'. This may have negative
> performance implications. Please set the heap size to the default value
> (10240)
>
> We have ulimits for heap (as well as most of the other limits) set
> unlimited because of applications that might possibly need a lot of RAM.
>
> The question is whether we should do as MXM wants or ignore it. Does anyone
> have experience running a recent OpenMPI with MXM enabled, and what kind of
> ulimits do you have? Any suggestions/comments appreciated, thanks!
>
>
> --
> Grigory Shamov
>
> Westgrid/ComputeCanada Site Lead
> University of Manitoba
> E2-588 EITC Building,
> (204) 474-9625
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/09/27697.php
>



-- 

Kind Regards,

M.


Re: [OMPI users] No suitable active ports warning and -mca btl_openib_if_include option

2015-06-17 Thread Mike Dubman
Hi,
the message in question comes from MXM and is a warning (silenced in later
releases of MXM).

To select a specific device in MXM, please pass:

mpirun -x MXM_IB_PORTS=mlx4_0:2 ...

M
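
(For example, combining that with the openib include option already being
used, since only mlx4_0 port 2 is active in the ibstat output below:)

  mpirun -x MXM_IB_PORTS=mlx4_0:2 -mca btl_openib_if_include mlx4_0:2 -np ...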

On Wed, Jun 17, 2015 at 9:38 PM, Na Zhang  wrote:

> Hi all,
>
> I am trying to launch MPI jobs (with version openmpi-1.6.5) on a node with
> multiple InfiniBand HCA cards (pls. see ibstat info below). I just want to
> use the only active port: mlx4_0 port 2. Thus I issued
>
> mpirun -mca btl_openib_if_include "mlx4_0:2" -np...
>
> I thought this command would allow use of only mlx4_0:2, but I still got
> the warning below:
>
> ib_dev.c:241  MXM  WARN  No suitable active ports were found on IB device
> 'mlx4_1'.
>
> I wonder whether this support still works. Or are there other options for
> HCA card selection? And how can I change my default settings for MPI use?
>
> Thanks!
>
> Best,
> Na Zhang
>
> $ ibstat
> CA 'mlx4_0'
> CA type: MT4099
> Number of ports: 2
> Firmware version: 2.31.5050
> Hardware version: 1
> Node GUID: 0x0002c90300fee510
> System image GUID: 0x0002c90300fee513
> Port 1:
> State: Down
> Physical state: Disabled
> Rate: 10
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x0001
> Port GUID: 0x0202c9fffefee510
> Link layer: Ethernet
> Port 2:
> State: Active
> Physical state: LinkUp
> Rate: 56
> Base lid: 3
> LMC: 0
> SM lid: 8
> Capability mask: 0x02514868
> Port GUID: 0x0002c90300fee512
> Link layer: InfiniBand
> CA 'mlx4_1'
> CA type: MT4099
> Number of ports: 2
> Firmware version: 2.30.3200
> Hardware version: 1
> Node GUID: 0x24be05b584f0
> System image GUID: 0x24be05b584f3
> Port 1:
> State: Down
> Physical state: Disabled
> Rate: 10
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x0001
> Port GUID: 0x26be05fffeb584f1
> Link layer: Ethernet
> Port 2:
> State: Down
> Physical state: Disabled
> Rate: 10
> Base lid: 0
> LMC: 0
> SM lid: 0
> Capability mask: 0x0001
> Port GUID: 0x24be050001b584f2
> Link layer: Ethernet
>
>
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/06/27149.php
>



-- 

Kind Regards,

M.


Re: [OMPI users] MXM problem

2015-05-28 Thread Mike Dubman
It is fine to recompile the OMPI from HPCX to apply site defaults (choice of
job scheduler, for example; the OMPI shipped with HPCX is compiled with ssh
support only, etc.).

If the ssh launcher is working on your system, then the OMPI from HPCX should
work as well.

Could you please send Alina (in cc) the command line and its output from the
hpcx/ompi failure?

Thanks
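
(A hedged sketch of such a rebuild against the MXM bundled in HPCX;
HPCX_MXM_DIR is the variable used below in this thread, and the prefix is
just an example:)

  ./configure --prefix=$HOME/ompi-hpcx --with-mxm=$HPCX_MXM_DIR
  make -j && make install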


On Thu, May 28, 2015 at 7:33 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:

> Is it normal to have to rebuild the openmpi from hpcx?
> Why don't the binaries work?
>
>
>
>
> Четверг, 28 мая 2015, 14:01 +03:00 от Alina Sklarevich <
> ali...@dev.mellanox.co.il>:
>
>   Thank you for this info.
>
> If 'yalla' now works for you, is there anything that is still wrong?
>
> Thanks,
> Alina.
>
> On Thu, May 28, 2015 at 10:21 AM, Timur Ismagilov <tismagi...@mail.ru
> > wrote:
>
> I'm sorry for the delay.
>
> Here it is:
> (I used a 5 min time limit)
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-redhat6.2-x86_64/ompi-mellanox-v1.8/bin/mpirun
> -x
> LD_PRELOAD=/gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.330-icc-OFED-1.5.4.1-
> redhat6.2-x86_64/mxm/debug/lib/libmxm.so -x MXM_LOG_LEVEL=data -x
> MXM_IB_PORTS=mlx4_0:1 -x MXM_SHM_KCOPY_MODE=off --mca pml yalla --hostfile
> hostlist ./hello 1> hello_debugMXM_n-2_ppn-2.out
> 2>hello_debugMXM_n-2_ppn-2.err
>
> P.S.
> yalla warks fine with rebuilded ompi: --with-mxm=$HPCX_MXM_DIR
>
>
>
>
>
>
> Вторник, 26 мая 2015, 16:22 +03:00 от Alina Sklarevich <
> ali...@dev.mellanox.co.il
> >:
>
>   Hi Timur,
>
> HPCX has a debug version of MXM. Can you please add the following to your
> command line with pml yalla in order to use it and attach the output?
> "-x LD_PRELOAD=$HPCX_MXM_DIR/debug/lib/libmxm.so -x MXM_LOG_LEVEL=data"
>
> Also, could you please attach the entire output of
> "$HPCX_MPI_DIR/bin/ompi_info -a"
>
> Thank you,
> Alina.
>
> On Tue, May 26, 2015 at 3:39 PM, Mike Dubman <mi...@dev.mellanox.co.il
> <https://e.mail.ru/compose/?mailto=mailto%3ami...@dev.mellanox.co.il>>
> wrote:
>
> Alina - could you please take a look?
> Thx
>
>
> -- Forwarded message --
> From: *Timur Ismagilov* <tismagi...@mail.ru
> <https://e.mail.ru/compose/?mailto=mailto%3atismagi...@mail.ru>>
> Date: Tue, May 26, 2015 at 12:40 PM
> Subject: Re[12]: [OMPI users] MXM problem
> To: Open MPI Users <us...@open-mpi.org
> <https://e.mail.ru/compose/?mailto=mailto%3ausers@open%2dmpi.org>>
> Cc: Mike Dubman <mi...@dev.mellanox.co.il
> <https://e.mail.ru/compose/?mailto=mailto%3ami...@dev.mellanox.co.il>>
>
>
> It does not work for single node:
>
> 1) host: $  $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x
> MXM_SHM_KCOPY_MODE=off -host node5 -mca pml yalla -x MXM_TLS=ud,self,shm
> --prefix $HPCX_MPI_DIR -mca plm_base_verbose 5  -mca oob_base_verbose 10
> -mca rml_base_verbose 10 --debug-daemons  -np 1 ./hello &> yalla.out
>
>
> 2) host: $  $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x
> MXM_SHM_KCOPY_MODE=off -host node5  --mca pml cm --mca mtl mxm --prefix
> $HPCX_MPI_DIR -mca plm_base_verbose 5  -mca oob_base_verbose 10 -mca
> rml_base_verbose 10 --debug-daemons -np 1 ./hello &> cm_mxm.out
>
> I've attached yalla.out and cm_mxm.out to this email.
>
>
>
> Вторник, 26 мая 2015, 11:54 +03:00 от Mike Dubman <
> mi...@dev.mellanox.co.il
> <https://e.mail.ru/compose/?mailto=mailto%3ami...@dev.mellanox.co.il>>:
>
>   does it work from single node?
> could you please run with opts below and attach output?
>
>  -mca plm_base_verbose 5  -mca oob_base_verbose 10 -mca rml_base_verbose
> 10 --debug-daemons
>
> On Tue, May 26, 2015 at 11:38 AM, Timur Ismagilov <tismagi...@mail.ru
> <https://e.mail.ru/compose/?mailto=mailto%3atismagi...@mail.ru>> wrote:
>
> 1. mxm_perf_test - OK.
> 2. no_tree_spawn  - OK.
> 3. ompi yalla and "--mca pml cm --mca mtl mxm" still does not work (I use
> prebuild ompi-1.8.5 from hpcx-v1.3.330)
> 3.a) host:$  $HPCX_MPI_DIR/bin/mpirun -x MXM_IB_PORTS=mlx4_0:1 -x
> MXM_SHM_KCOPY_MODE=off -host node5,node153  --mca pml cm --mca mtl mxm
> --prefix $HPCX_MPI_DIR ./hello
> --
>
> A requested component was not found, or was unable to be opened.
> This
> means that this component is either not installed or is unable to
> be
> used on your system (e.g., sometimes this means that shared
> libraries
> that the component requires are unable to be found/loaded).  Note
> that
> Open MPI stopped checking at the first component th

Re: [OMPI users] 1.8.5, mxm, and a spurious '-L' flag

2015-05-27 Thread Mike Dubman
Thanks, that makes sense.

Submitted PR https://github.com/open-mpi/ompi/pull/606 with a fix; will port
it to the release branches soon.

Thanks a lot.

On Tue, May 26, 2015 at 10:38 PM, Jeff Squyres (jsquyres) <
jsquy...@cisco.com> wrote:

> Unless the compiler can find the MXM headers/libraries without the
> --with-mxm value.  E.g.,:
>
> ./configure CPPFLAGS=-I/path/to/mxm/headers LDFLAGS=-L/path/to/mxm/libs
> --with-mxm ...
>
> (or otherwise sets the compiler/linker default search paths, etc.)
>
> It seems like however it is happening, somehow the variable is empty, and
> you just end up appending "-L" instead of "-L/something".  So why not just
> check to ensure that the variable is not empty?
>
>
>
> > On May 26, 2015, at 3:27 PM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
> >
> > in that case, OPAL_CHECK_PACKAGE will disqualify mxm because it will not
> find mxm_api.h header file in _OPAL_CHECK_PACKAGE_HEADER macro.
> >
> > from
> >
> >
> https://github.com/open-mpi/ompi/blob/master/config/ompi_check_mxm.m4#L43
> >
> >
> > from config.log generated after "./configure --with-mxm"
> >
> > configure:263059: checking --with-mxm value
> > configure:263062: result: simple ok (unspecified)
> > configure:263097: checking --with-mxm-libdir value
> > configure:263100: result: simple ok (unspecified)
> > configure:263197: checking mxm/api/mxm_api.h usability
> > configure:263197: gcc -std=gnu99 -c -g -Wall -Wundef -Wno-long-long
> -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic
> -Werror-implicit-function-declaration -finline-functions
> -fno-strict-aliasing -pthread
>  
> -I/labhome/miked/workspace/git/mellanox-hpc/ompi-release/opal/mca/hwloc/hwloc191/hwloc/include
> -I/labhome/miked/workspace/git/mellanox-hpc/ompi-release/opal/mca/event/libevent2022/libevent
> -I/labhome/miked/workspace/git/mellanox-hpc/ompi-release/opal/mca/event/libevent2022/libevent/include
> conftest.c >&5
> > conftest.c:775:29: error: mxm/api/mxm_api.h: No such file or directory
> >
> >
> >
> > On Tue, May 26, 2015 at 10:11 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> > Mike --
> >
> > I don't think that's right.  If you just pass "--with-mxm", then
> $with_mxm will equal "yes", and therefore neither of those two blocks of
> code are executed.  Hence, ompi_check_mxm_libdir will be empty.
> >
> > Right?
> >
> >
> > > On May 26, 2015, at 1:28 PM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
> > >
> > > Thanks Jeff!
> > >
> > > but in this line:
> > >
> > >
> https://github.com/open-mpi/ompi/blob/master/config/ompi_check_mxm.m4#L36
> > >
> > > ompi_check_mxm_libdir gets value if with_mxm was passed
> > >
> > >
> > >
> > > On Tue, May 26, 2015 at 6:59 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> > > This line:
> > >
> > >
> https://github.com/open-mpi/ompi/blob/master/config/ompi_check_mxm.m4#L41
> > >
> > > doesn't check to see if $ompi_check_mxm_libdir is empty.
> > >
> > >
> > > > On May 26, 2015, at 11:50 AM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
> > > >
> > > > David,
> > > > Could you please send me your config.log file?
> > > >
> > > > Looking into config/ompi_check_mxm.m4 macro I don`t understand how
> it could happen.
> > > >
> > > > Thanks a lot.
> > > >
> > > > On Tue, May 26, 2015 at 6:41 PM, Mike Dubman <
> mi...@dev.mellanox.co.il> wrote:
> > > > Hello David,
> > > > Thanks for info and patch - will fix ompi configure logic with your
> patch.
> > > >
> > > > mxm can be installed in the system and user spaces - both are valid
> and supported logic.
> > > >
> > > > M
> > > >
> > > > On Tue, May 26, 2015 at 5:50 PM, David Shrader <dshra...@lanl.gov>
> wrote:
> > > > Hello Mike,
> > > >
> > > > This particular instance of mxm was installed using rpms that were
> re-rolled by our admins. I'm not 100% sure where they got them (HPCx or
> somewhere else). I myself am not using HPCx. Is there any particular reason
> why mxm shouldn't be in system space? If there is, I'll share it with our
> admins and try to get the install location corrected.
> > > >
> > > > As for what is causing the extra -L, it does look like

Re: [OMPI users] 1.8.5, mxm, and a spurious '-L' flag

2015-05-26 Thread Mike Dubman
in that case, OPAL_CHECK_PACKAGE will disqualify mxm because it will not
find mxm_api.h header file in _OPAL_CHECK_PACKAGE_HEADER macro.

from

https://github.com/open-mpi/ompi/blob/master/config/ompi_check_mxm.m4#L43


from config.log generated after "./configure --with-mxm"

configure:263059: checking --with-mxm value
configure:263062: result: simple ok (unspecified)
configure:263097: checking --with-mxm-libdir value
configure:263100: result: simple ok (unspecified)
configure:263197: checking mxm/api/mxm_api.h usability
configure:263197: gcc -std=gnu99 -c -g -Wall -Wundef -Wno-long-long
-Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic
-Werror-implicit-function-declaration -finline-functions
-fno-strict-aliasing -pthread
-I/labhome/miked/workspace/git/mellanox-hpc/ompi-release/opal/mca/hwloc/hwloc191/hwloc/include
-I/labhome/miked/workspace/git/mellanox-hpc/ompi-release/opal/mca/event/libevent2022/libevent
-I/labhome/miked/workspace/git/mellanox-hpc/ompi-release/opal/mca/event/libevent2022/libevent/include
conftest.c >&5
conftest.c:775:29: error: mxm/api/mxm_api.h: No such file or directory



On Tue, May 26, 2015 at 10:11 PM, Jeff Squyres (jsquyres) <
jsquy...@cisco.com> wrote:

> Mike --
>
> I don't think that's right.  If you just pass "--with-mxm", then $with_mxm
> will equal "yes", and therefore neither of those two blocks of code are
> executed.  Hence, ompi_check_mxm_libdir will be empty.
>
> Right?
>
>
> > On May 26, 2015, at 1:28 PM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
> >
> > Thanks Jeff!
> >
> > but in this line:
> >
> >
> https://github.com/open-mpi/ompi/blob/master/config/ompi_check_mxm.m4#L36
> >
> > ompi_check_mxm_libdir gets value if with_mxm was passed
> >
> >
> >
> > On Tue, May 26, 2015 at 6:59 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> > This line:
> >
> >
> https://github.com/open-mpi/ompi/blob/master/config/ompi_check_mxm.m4#L41
> >
> > doesn't check to see if $ompi_check_mxm_libdir is empty.
> >
> >
> > > On May 26, 2015, at 11:50 AM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
> > >
> > > David,
> > > Could you please send me your config.log file?
> > >
> > > Looking into config/ompi_check_mxm.m4 macro I don`t understand how it
> could happen.
> > >
> > > Thanks a lot.
> > >
> > > On Tue, May 26, 2015 at 6:41 PM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
> > > Hello David,
> > > Thanks for info and patch - will fix ompi configure logic with your
> patch.
> > >
> > > mxm can be installed in the system and user spaces - both are valid
> and supported logic.
> > >
> > > M
> > >
> > > On Tue, May 26, 2015 at 5:50 PM, David Shrader <dshra...@lanl.gov>
> wrote:
> > > Hello Mike,
> > >
> > > This particular instance of mxm was installed using rpms that were
> re-rolled by our admins. I'm not 100% sure where they got them (HPCx or
> somewhere else). I myself am not using HPCx. Is there any particular reason
> why mxm shouldn't be in system space? If there is, I'll share it with our
> admins and try to get the install location corrected.
> > >
> > > As for what is causing the extra -L, it does look like an empty
> variable is used without checking that it is empty in configure. Line
> 246117 in the configure script provided by the openmpi-1.8.5.tar.bz2
> tarball has this:
> > >
> > > ompi_check_mxm_extra_libs="-L$ompi_check_mxm_libdir"
> > >
> > > By invoking configure with '/bin/sh -x ./configure ...' and changing
> PS4 to output line numbers, I saw that line 246117 was setting
> ompi_check_mxm_extra_libs to just "-L". It turns out that configure does
> this in three separate locations. I put a check around all three instances
> like this:
> > >
> > > if test ! -z "$ompi_check_mxm_extra_libs"; then
> > >   ompi_check_mxm_extra_libs="-L$ompi_check_mxm_libdir"
> > > fi
> > >
> > > And the spurious '-L' disappeared from the linking commands and make
> completed fine.
> > >
> > > So, it looks like there are two solutions: move the install location
> of mxm to not be in system-space or modify configure. Which one would be
> the better one for me to pursue?
> > >
> > > Thanks,
> > > David
> > >
> > >
> > > On 05/23/2015 12:05 AM, Mike Dubman wrote:
> > >> Hi,
> > >>
> > >> How mxm was installed? by copyi

Re: [OMPI users] 1.8.5, mxm, and a spurious '-L' flag

2015-05-26 Thread Mike Dubman
if just "./configure" was used - it can detect mxm only if it is installed
in /usr/include/...

by default mxm is installed in /opt/mellanox/mxm/...

I just checked with:

"./configure" and it did not detect mxm which is installed in the system
space

"./configure --with-mxm" and it did not detect mxm

"./configure --with-mxm=/opt/mellanox/mxm" and it did work as expected.


Can you please send me your config.log to understand how it could happen?


On Tue, May 26, 2015 at 8:40 PM, David Shrader <dshra...@lanl.gov> wrote:

>  Hello Mike,
>
> I'm still working on getting you my config.log, but I thought I would
> chime in about that line 36. In my case, that code path is not executed
> because with_mxm is empty (I don't use --with-mxm on the configure line
> since libmxm.so is in system space and configure picks up on it
> automatically). Thus, ompi_check_mxm_libdir never gets assigned which
> results in just "-L" getting used on line 41. The same behavior could be
> found by using '--with-mxm=yes'.
>
> Thanks,
> David
>
>
> On 05/26/2015 11:28 AM, Mike Dubman wrote:
>
> Thanks Jeff!
>
>  but in this line:
>
>  https://github.com/open-mpi/ompi/blob/master/config/ompi_check_mxm.m4#L36
>
>  ompi_check_mxm_libdir gets value if with_mxm was passed
>
>
>
> On Tue, May 26, 2015 at 6:59 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
>
>> This line:
>>
>>
>> https://github.com/open-mpi/ompi/blob/master/config/ompi_check_mxm.m4#L41
>>
>> doesn't check to see if $ompi_check_mxm_libdir is empty.
>>
>>
>> > On May 26, 2015, at 11:50 AM, Mike Dubman <mi...@dev.mellanox.co.il>
>> wrote:
>> >
>> > David,
>> > Could you please send me your config.log file?
>> >
>> > Looking into config/ompi_check_mxm.m4 macro I don`t understand how it
>> could happen.
>> >
>> > Thanks a lot.
>> >
>> > On Tue, May 26, 2015 at 6:41 PM, Mike Dubman <mi...@dev.mellanox.co.il>
>> wrote:
>> > Hello David,
>> > Thanks for info and patch - will fix ompi configure logic with your
>> patch.
>> >
>> > mxm can be installed in the system and user spaces - both are valid and
>> supported logic.
>> >
>> > M
>> >
>> > On Tue, May 26, 2015 at 5:50 PM, David Shrader <dshra...@lanl.gov>
>> wrote:
>> > Hello Mike,
>> >
>> > This particular instance of mxm was installed using rpms that were
>> re-rolled by our admins. I'm not 100% sure where they got them (HPCx or
>> somewhere else). I myself am not using HPCx. Is there any particular reason
>> why mxm shouldn't be in system space? If there is, I'll share it with our
>> admins and try to get the install location corrected.
>> >
>> > As for what is causing the extra -L, it does look like an empty
>> variable is used without checking that it is empty in configure. Line
>> 246117 in the configure script provided by the openmpi-1.8.5.tar.bz2
>> tarball has this:
>> >
>> > ompi_check_mxm_extra_libs="-L$ompi_check_mxm_libdir"
>> >
>> > By invoking configure with '/bin/sh -x ./configure ...' and changing
>> PS4 to output line numbers, I saw that line 246117 was setting
>> ompi_check_mxm_extra_libs to just "-L". It turns out that configure does
>> this in three separate locations. I put a check around all three instances
>> like this:
>> >
>> > if test ! -z "$ompi_check_mxm_extra_libs"; then
>> >   ompi_check_mxm_extra_libs="-L$ompi_check_mxm_libdir"
>> > fi
>> >
>> > And the spurious '-L' disappeared from the linking commands and make
>> completed fine.
>> >
>> > So, it looks like there are two solutions: move the install location of
>> mxm to not be in system-space or modify configure. Which one would be the
>> better one for me to pursue?
>> >
>> > Thanks,
>> > David
>> >
>> >
>> > On 05/23/2015 12:05 AM, Mike Dubman wrote:
>> >> Hi,
>> >>
>> >> How mxm was installed? by copying?
>> >>
>> >> The rpm based installation places mxm into /opt/mellanox/mxm and not
>> into /usr/lib64/libmxm.so.
>> >>
>> >> Do you use HPCx (pack of OMPI and MXM and FCA)?
>> >> You can download HPCX, extract it anywhere and compile OMPI pointing
>> to mxm location under HPCX.
>> >>
>> >> Also, HPCx contains rpms for mxm and fca.
>> >>
>> >>
&g

Re: [OMPI users] 1.8.5, mxm, and a spurious '-L' flag

2015-05-26 Thread Mike Dubman
David,
Could you please send me your config.log file?

Looking into the config/ompi_check_mxm.m4 macro, I don't understand how it
could happen.

Thanks a lot.

On Tue, May 26, 2015 at 6:41 PM, Mike Dubman <mi...@dev.mellanox.co.il>
wrote:

> Hello David,
> Thanks for info and patch - will fix ompi configure logic with your patch.
>
> mxm can be installed in the system and user spaces - both are valid and
> supported logic.
>
> M
>
> On Tue, May 26, 2015 at 5:50 PM, David Shrader <dshra...@lanl.gov> wrote:
>
>>  Hello Mike,
>>
>> This particular instance of mxm was installed using rpms that were
>> re-rolled by our admins. I'm not 100% sure where they got them (HPCx or
>> somewhere else). I myself am not using HPCx. Is there any particular reason
>> why mxm shouldn't be in system space? If there is, I'll share it with our
>> admins and try to get the install location corrected.
>>
>> As for what is causing the extra -L, it does look like an empty variable
>> is used without checking that it is empty in configure. Line 246117 in the
>> configure script provided by the openmpi-1.8.5.tar.bz2 tarball has this:
>>
>> ompi_check_mxm_extra_libs="-L$ompi_check_mxm_libdir"
>>
>> By invoking configure with '/bin/sh -x ./configure ...' and changing PS4
>> to output line numbers, I saw that line 246117 was setting
>> ompi_check_mxm_extra_libs to just "-L". It turns out that configure does
>> this in three separate locations. I put a check around all three instances
>> like this:
>>
>> if test ! -z "$ompi_check_mxm_extra_libs"; then
>>   ompi_check_mxm_extra_libs="-L$ompi_check_mxm_libdir"
>> fi
>>
>> And the spurious '-L' disappeared from the linking commands and make
>> completed fine.
>>
>> So, it looks like there are two solutions: move the install location of
>> mxm to not be in system-space or modify configure. Which one would be the
>> better one for me to pursue?
>>
>> Thanks,
>> David
>>
>>
>> On 05/23/2015 12:05 AM, Mike Dubman wrote:
>>
>> Hi,
>>
>>  How mxm was installed? by copying?
>>
>>  The rpm based installation places mxm into /opt/mellanox/mxm and not
>> into /usr/lib64/libmxm.so.
>>
>>  Do you use HPCx (pack of OMPI and MXM and FCA)?
>> You can download HPCX, extract it anywhere and compile OMPI pointing to
>> mxm location under HPCX.
>>
>>  Also, HPCx contains rpms for mxm and fca.
>>
>>
>>  M
>>
>> On Sat, May 23, 2015 at 1:07 AM, David Shrader <dshra...@lanl.gov> wrote:
>>
>>> Hello,
>>>
>>> I'm getting a spurious '-L' flag when I have mxm installed in
>>> system-space (/usr/lib64/libmxm.so) which is causing an error at link time
>>> during make:
>>>
>>> ...output snipped...
>>> /bin/sh ../../../../libtool  --tag=CC   --mode=link gcc -std=gnu99 -O3
>>> -DNDEBUG -I/opt/panfs/include -finline-functions -fno-strict-aliasing
>>> -pthread -module -avoid-version   -o libmca_mtl_mxm.la  mtl_mxm.lo
>>> mtl_mxm_cancel.lo mtl_mxm_component.lo mtl_mxm_endpoint.lo mtl_mxm_probe.lo
>>> mtl_mxm_recv.lo mtl_mxm_send.lo -lmxm -L -lrt -lm -lutil
>>> libtool: link: require no space between `-L' and `-lrt'
>>> make[2]: *** [libmca_mtl_mxm.la] Error 1
>>> make[2]: Leaving directory
>>> `/turquoise/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.8.5/openmpi-1.8.5/ompi/mca/mtl/mxm'
>>> make[1]: *** [all-recursive] Error 1
>>> make[1]: Leaving directory
>>> `/turquoise/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.8.5/openmpi-1.8.5/ompi'
>>> make: *** [all-recursive] Error 1
>>>
>>> If I I use --with-mxm=no, then this error doesn't occur (as expected as
>>> the mxm component isn't touched). Has anyone run in to this before?
>>>
>>> Here is my configure line:
>>>
>>> ./configure --disable-silent-rules
>>> --with-platform=contrib/platform/lanl/toss/optimized-panasas --prefix=...
>>>
>>> I wonder if there is an empty variable that should contain the directory
>>> libmxm is in somewhere in configure since no directory is passed to
>>> --with-mxm which is then paired with a "-L". I think I'll go through the
>>> configure script while waiting to see if anyone else has run in to this.
>>>
>>> Thank you for any and all help,
>>> David
>>>
>>> --
>>> David Shrader
>>> HPC-3 High Performance Computer Systems
>>> Los Alamos National Lab
>>&

Re: [OMPI users] 1.8.5, mxm, and a spurious '-L' flag

2015-05-26 Thread Mike Dubman
Hello David,
Thanks for info and patch - will fix ompi configure logic with your patch.

mxm can be installed in the system and user spaces - both are valid and
supported logic.

M

On Tue, May 26, 2015 at 5:50 PM, David Shrader <dshra...@lanl.gov> wrote:

>  Hello Mike,
>
> This particular instance of mxm was installed using rpms that were
> re-rolled by our admins. I'm not 100% sure where they got them (HPCx or
> somewhere else). I myself am not using HPCx. Is there any particular reason
> why mxm shouldn't be in system space? If there is, I'll share it with our
> admins and try to get the install location corrected.
>
> As for what is causing the extra -L, it does look like an empty variable
> is used without checking that it is empty in configure. Line 246117 in the
> configure script provided by the openmpi-1.8.5.tar.bz2 tarball has this:
>
> ompi_check_mxm_extra_libs="-L$ompi_check_mxm_libdir"
>
> By invoking configure with '/bin/sh -x ./configure ...' and changing PS4
> to output line numbers, I saw that line 246117 was setting
> ompi_check_mxm_extra_libs to just "-L". It turns out that configure does
> this in three separate locations. I put a check around all three instances
> like this:
>
> if test ! -z "$ompi_check_mxm_extra_libs"; then
>   ompi_check_mxm_extra_libs="-L$ompi_check_mxm_libdir"
> fi
>
> And the spurious '-L' disappeared from the linking commands and make
> completed fine.
>
> So, it looks like there are two solutions: move the install location of
> mxm to not be in system-space or modify configure. Which one would be the
> better one for me to pursue?
>
> Thanks,
> David
>
>
> On 05/23/2015 12:05 AM, Mike Dubman wrote:
>
> Hi,
>
>  How mxm was installed? by copying?
>
>  The rpm based installation places mxm into /opt/mellanox/mxm and not
> into /usr/lib64/libmxm.so.
>
>  Do you use HPCx (pack of OMPI and MXM and FCA)?
> You can download HPCX, extract it anywhere and compile OMPI pointing to
> mxm location under HPCX.
>
>  Also, HPCx contains rpms for mxm and fca.
>
>
>  M
>
> On Sat, May 23, 2015 at 1:07 AM, David Shrader <dshra...@lanl.gov> wrote:
>
>> Hello,
>>
>> I'm getting a spurious '-L' flag when I have mxm installed in
>> system-space (/usr/lib64/libmxm.so) which is causing an error at link time
>> during make:
>>
>> ...output snipped...
>> /bin/sh ../../../../libtool  --tag=CC   --mode=link gcc -std=gnu99 -O3
>> -DNDEBUG -I/opt/panfs/include -finline-functions -fno-strict-aliasing
>> -pthread -module -avoid-version   -o libmca_mtl_mxm.la  mtl_mxm.lo
>> mtl_mxm_cancel.lo mtl_mxm_component.lo mtl_mxm_endpoint.lo mtl_mxm_probe.lo
>> mtl_mxm_recv.lo mtl_mxm_send.lo -lmxm -L -lrt -lm -lutil
>> libtool: link: require no space between `-L' and `-lrt'
>> make[2]: *** [libmca_mtl_mxm.la] Error 1
>> make[2]: Leaving directory
>> `/turquoise/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.8.5/openmpi-1.8.5/ompi/mca/mtl/mxm'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory
>> `/turquoise/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.8.5/openmpi-1.8.5/ompi'
>> make: *** [all-recursive] Error 1
>>
>> If I I use --with-mxm=no, then this error doesn't occur (as expected as
>> the mxm component isn't touched). Has anyone run in to this before?
>>
>> Here is my configure line:
>>
>> ./configure --disable-silent-rules
>> --with-platform=contrib/platform/lanl/toss/optimized-panasas --prefix=...
>>
>> I wonder if there is an empty variable that should contain the directory
>> libmxm is in somewhere in configure since no directory is passed to
>> --with-mxm which is then paired with a "-L". I think I'll go through the
>> configure script while waiting to see if anyone else has run in to this.
>>
>> Thank you for any and all help,
>> David
>>
>> --
>> David Shrader
>> HPC-3 High Performance Computer Systems
>> Los Alamos National Lab
>> Email: dshrader  lanl.gov
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/05/26904.php
>>
>
>
>
>  --
>
> Kind Regards,
>
>  M.
>
>
> ___
> users mailing listus...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2015/05/26905.php
>
>
> --
> David Shrader
> HPC-3 High Performance Computer Systems
> Los Alamos National Lab
> Email: dshrader  lanl.gov
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/05/26936.php
>



-- 

Kind Regards,

M.


Re: [OMPI users] Error: "all nodes which are allocated for this job are already filled"

2015-05-26 Thread Mike Dubman
BTW, what is the rationale for running in a chroot env? Is it a Docker-like
env?

Does "ibv_devinfo -v" work for you from the chroot env?



On Tue, May 26, 2015 at 7:08 AM, Rahul Yadav  wrote:

> Yes Ralph, MXM cards are on the node. Command runs fine if I run it out of
> the chroot environment.
>
> Thanks
> Rahul
>
> On Mon, May 25, 2015 at 9:03 PM, Ralph Castain  wrote:
>
>> Well, it isn’t finding any MXM cards on NAE27 - do you have any there?
>>
>> You can’t use yalla without MXM cards on all nodes
>>
>>
>> On May 25, 2015, at 8:51 PM, Rahul Yadav  wrote:
>>
>> We were able to solve ssh problem.
>>
>> But now MPI is not able to use the yalla component. We are running the
>> following command:
>>
>> mpirun --allow-run-as-root --mca pml yalla -n 1 --hostfile /root/host1
>> /root/app2 : -n 1 --hostfile /root/host2 /root/backend
>>
>> The command is run in a chroot environment on JARVICENAE27; the other node
>> is JARVICENAE125. JARVICENAE125 is able to select yalla since it is a
>> remote node and thus is not trying to run the job in the chroot
>> environment, but JARVICENAE27 is throwing a few MXM-related errors and
>> yalla is not selected.
>>
>> Following are the logs of the command with verbose.
>>
>> Any idea what might be wrong ?
>>
>> [1432283901.548917] sys.c:719  MXM  WARN  Conflicting CPU
>> frequencies detected, using: 2601.00
>> [JARVICENAE125:00909] mca: base: components_register: registering pml
>> components
>> [JARVICENAE125:00909] mca: base: components_register: found loaded
>> component v
>> [JARVICENAE125:00909] mca: base: components_register: component v
>> register function successful
>> [JARVICENAE125:00909] mca: base: components_register: found loaded
>> component bfo
>> [JARVICENAE125:00909] mca: base: components_register: component bfo
>> register function successful
>> [JARVICENAE125:00909] mca: base: components_register: found loaded
>> component cm
>> [JARVICENAE125:00909] mca: base: components_register: component cm
>> register function successful
>> [JARVICENAE125:00909] mca: base: components_register: found loaded
>> component ob1
>> [JARVICENAE125:00909] mca: base: components_register: component ob1
>> register function successful
>> [JARVICENAE125:00909] mca: base: components_register: found loaded
>> component yalla
>> [JARVICENAE125:00909] mca: base: components_register: component yalla
>> register function successful
>> [JARVICENAE125:00909] mca: base: components_open: opening pml components
>> [JARVICENAE125:00909] mca: base: components_open: found loaded component v
>> [JARVICENAE125:00909] mca: base: components_open: component v open
>> function successful
>> [JARVICENAE125:00909] mca: base: components_open: found loaded component
>> bfo
>> [JARVICENAE125:00909] mca: base: components_open: component bfo open
>> function successful
>> [JARVICENAE125:00909] mca: base: components_open: found loaded component
>> cm
>> [JARVICENAE125:00909] mca: base: components_open: component cm open
>> function successful
>> [JARVICENAE125:00909] mca: base: components_open: found loaded component
>> ob1
>> [JARVICENAE125:00909] mca: base: components_open: component ob1 open
>> function successful
>> [JARVICENAE125:00909] mca: base: components_open: found loaded component
>> yalla
>> [JARVICENAE125:00909] mca: base: components_open: component yalla open
>> function successful
>> [JARVICENAE125:00909] select: component v not in the include list
>> [JARVICENAE125:00909] select: component bfo not in the include list
>> [JARVICENAE125:00909] select: initializing pml component cm
>> [JARVICENAE27:06474] mca: base: components_register: registering pml
>> components
>> [JARVICENAE27:06474] mca: base: components_register: found loaded
>> component v
>> [JARVICENAE27:06474] mca: base: components_register: component v register
>> function successful
>> [JARVICENAE27:06474] mca: base: components_register: found loaded
>> component bfo
>> [JARVICENAE27:06474] mca: base: components_register: component bfo
>> register function successful
>> [JARVICENAE27:06474] mca: base: components_register: found loaded
>> component cm
>> [JARVICENAE27:06474] mca: base: components_register: component cm
>> register function successful
>> [JARVICENAE27:06474] mca: base: components_register: found loaded
>> component ob1
>> [JARVICENAE27:06474] mca: base: components_register: component ob1
>> register function successful
>> [JARVICENAE27:06474] mca: base: components_register: found loaded
>> component yalla
>> [JARVICENAE27:06474] mca: base: components_register: component yalla
>> register function successful
>> [JARVICENAE27:06474] mca: base: components_open: opening pml components
>> [JARVICENAE27:06474] mca: base: components_open: found loaded component v
>> [JARVICENAE27:06474] mca: base: components_open: component v open
>> function successful
>> [JARVICENAE27:06474] mca: base: components_open: found loaded component
>> bfo
>> [JARVICENAE27:06474] mca: base: components_open: 

Re: [OMPI users] MXM problem

2015-05-25 Thread Mike Dubman
scif is an OFA device from Intel.
Can you please set MXM_IB_PORTS=mlx4_0:1 explicitly (export MXM_IB_PORTS=mlx4_0:1) and retry?
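For example, a minimal sketch based on the ibv_devinfo output below (mlx4_0, port 1 active); mxm_perftest and the -t send_lat test are the same ones used below, and the node names are placeholders:

  # on node1 (server), then on node2 (client)
  export MXM_IB_PORTS=mlx4_0:1
  ./mxm_perftest                      # server side, on node1
  ./mxm_perftest node1 -t send_lat    # client side, on node2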

On Mon, May 25, 2015 at 8:26 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:

> Hi, Mike,
> that is what i have:
>
> $ echo $LD_LIBRARY_PATH | tr ":" "\n"
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/fca/lib
>
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/hcoll/lib
>
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/mxm/lib
>
>
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8/lib
>  +intel compiler paths
>
> $ echo
> $OPAL_PREFIX
>
>
> /gpfs/NETHOME/oivt1/nicevt/itf/sources/hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2/ompi-mellanox-v1.8
>
> I don't use LD_PRELOAD.
>
> In the attached file(ompi_info.out) you will find the output of ompi_info
> -l 9  command.
>
> *P.S*.
> node1 $ ./mxm_perftest
> node2 $  ./mxm_perftest node1  -t send_lat
> [1432568685.067067] [node151:87372:0] shm.c:65   MXM  WARN  Could
> not open the KNEM device file at /dev/knem : No such file or directory.
> Won't use knem. *( I don't have knem)*
> [1432568685.069699] [node151:87372:0]  ib_dev.c:531  MXM  WARN
> skipping device scif0 (vendor_id/part_id = 0x8086/0x0) - not a Mellanox
> device   *(???)*
> Failed to create endpoint: No such device
>
> $  ibv_devinfo
> hca_id: mlx4_0
> transport:  InfiniBand (0)
> fw_ver: 2.10.600
> node_guid:  0002:c903:00a1:13b0
> sys_image_guid: 0002:c903:00a1:13b3
> vendor_id:  0x02c9
> vendor_part_id: 4099
> hw_ver: 0x0
> board_id:   MT_1090120019
> phys_port_cnt:  2
> port:   1
> state:  PORT_ACTIVE (4)
> max_mtu:4096 (5)
> active_mtu: 4096 (5)
> sm_lid: 1
> port_lid:   83
> port_lmc:   0x00
>
> port:   2
> state:  PORT_DOWN (1)
> max_mtu:4096 (5)
> active_mtu: 4096 (5)
> sm_lid:     0
> port_lid:   0
> port_lmc:   0x00
>
> Best regards,
> Timur.
>
>
> Monday, 25 May 2015, 19:39 +03:00 from Mike Dubman <mi...@dev.mellanox.co.il>:
>
>   Hi Timur,
> seems that yalla component was not found in your OMPI tree.
> can it be that your mpirun is not from hpcx? Can you please check
> LD_LIBRARY_PATH,PATH, LD_PRELOAD and OPAL_PREFIX that it is pointing to the
> right mpirun?
>
> Also, could you please check that yalla is present in the ompi_info -l 9
> output?
>
> Thanks
>
> On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
> I can password-less ssh to all nodes:
> base$ ssh node1
> node1$ssh node2
> Last login: Mon May 25 18:41:23
> node2$ssh node3
> Last login: Mon May 25 16:25:01
> node3$ssh node4
> Last login: Mon May 25 16:27:04
> node4$
>
> Is this correct?
>
> In ompi-1.9 i do not have no-tree-spawn problem.
>
>
> Monday, 25 May 2015, 9:04 -07:00 from Ralph Castain <r...@open-mpi.org>:
>
>   I can’t speak to the mxm problem, but the no-tree-spawn issue indicates
> that you don’t have password-less ssh authorized between the compute nodes
>
>
> On May 25, 2015, at 8:55 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
> Hello!
>
> I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
> OFED-1.5.4.1;
> CentOS release 6.2;
> infiniband 4x FDR
>
>
>
> I have two problems:
> *1. I can not use mxm*:
> *1.a) $mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29
> -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello *
> --
>
> A requested component was not found, or was unable to be opened.
> This
> means that this component is 

Re: [OMPI users] MXM problem

2015-05-25 Thread Mike Dubman
Hi Timur,
seems that yalla component was not found in your OMPI tree.
can it be that your mpirun is not from hpcx? Can you please check
LD_LIBRARY_PATH,PATH, LD_PRELOAD and OPAL_PREFIX that it is pointing to the
right mpirun?

Also, could you please check that yalla is present in the ompi_info -l 9
output?

Thanks

On Mon, May 25, 2015 at 7:11 PM, Timur Ismagilov  wrote:

> I can password-less ssh to all nodes:
> base$ ssh node1
> node1$ssh node2
> Last login: Mon May 25 18:41:23
> node2$ssh node3
> Last login: Mon May 25 16:25:01
> node3$ssh node4
> Last login: Mon May 25 16:27:04
> node4$
>
> Is this correct?
>
> In ompi-1.9 i do not have no-tree-spawn problem.
>
>
> Monday, 25 May 2015, 9:04 -07:00 from Ralph Castain :
>
>   I can’t speak to the mxm problem, but the no-tree-spawn issue indicates
> that you don’t have password-less ssh authorized between the compute nodes
>
>
> On May 25, 2015, at 8:55 AM, Timur Ismagilov wrote:
>
> Hello!
>
> I use ompi-v1.8.4 from hpcx-v1.3.0-327-icc-OFED-1.5.3-redhat6.2;
> OFED-1.5.4.1;
> CentOS release 6.2;
> infiniband 4x FDR
>
>
>
> I have two problems:
> *1. I can not use mxm*:
> *1.a) $mpirun --mca pml cm --mca mtl mxm -host node5,node14,node28,node29
> -mca plm_rsh_no_tree_spawn 1 -np 4 ./hello *
> --
>
> A requested component was not found, or was unable to be opened.
> This
> means that this component is either not installed or is unable to
> be
> used on your system (e.g., sometimes this means that shared
> libraries
> that the component requires are unable to be found/loaded).  Note
> that
> Open MPI stopped checking at the first component that it did not
> find.
>
>
> Host:
> node14
>
> Framework:
> pml
>
> Component:
> yalla
>
> --
>
> *** An error occurred in
> MPI_Init
>
> --
>
> It looks like MPI_INIT failed for some reason; your parallel process
> is
> likely to abort.  There are many reasons that a parallel process
> can
> fail during MPI_INIT; some of which are due to configuration or
> environment
> problems.  This failure appears to be an internal failure; here's
> some
> additional information (which may only be relevant to an Open
> MPI
> developer):
>
>
>
>   mca_pml_base_open()
> failed
>
>   --> Returned "Not found" (-13) instead of "Success"
> (0)
> --
>
> *** on a NULL
> communicator
>
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now
> abort,
> ***and potentially your MPI
> job)
> *** An error occurred in
> MPI_Init
>
> [node28:102377] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages,
>  and not able to guarantee that all other processes were
> killed!
> *** on a NULL
> communicator
>
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now
> abort,
> ***and potentially your MPI
> job)
> [node29:105600] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages,
>  and not able to guarantee that all other processes were
> killed!
> *** An error occurred in
> MPI_Init
>
> *** on a NULL
> communicator
>
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now
> abort,
> ***and potentially your MPI
> job)
> [node5:102409] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages,
> and not able to guarantee that all other processes were
> killed!
> *** An error occurred in
> MPI_Init
>
> *** on a NULL
> communicator
>
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now
> abort,
> ***and potentially your MPI
> job)
> [node14:85284] Local abort before MPI_INIT completed successfully; not
> able to aggregate error messages,
> and not able to guarantee that all other processes were
> killed!
> ---
>
> Primary job  terminated normally, but 1 process
> returned
> a non-zero exit code.. Per user-direction, the job has been
> aborted.
> ---
>
> --
>
> mpirun detected that one or more processes exited with non-zero status,
> thus causing
> the job to be terminated. The first process to do so
> was:
>
>
>   Process name: [[9372,1],2]
>   Exit code:
> 1
>
> --
>
> [login:08295] 3 more processes have sent help message help-mca-base.txt /
> find-available:not-valid
> [login:08295] Set MCA parameter "orte_base_help_aggregate" to 0 to see all
> help / error messages
> [login:08295] 3 more processes have sent help message 

Re: [OMPI users] 1.8.5, mxm, and a spurious '-L' flag

2015-05-23 Thread Mike Dubman
Hi,

How was mxm installed? By copying?

The RPM-based installation places mxm in /opt/mellanox/mxm, not in
/usr/lib64/libmxm.so.

Do you use HPCX (a package of OMPI, MXM, and FCA)?
You can download HPCX, extract it anywhere, and compile OMPI pointing to the
mxm location under HPCX.

Also, HPCX contains RPMs for mxm and fca.
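For reference, a rough sketch of pointing an OMPI build at the mxm shipped inside HPCX (the install prefix is illustrative, and $HPCX_HOME is wherever you extracted the HPCX archive):

  ./configure --with-mxm=$HPCX_HOME/mxm --prefix=$HOME/opt/openmpi   # prefix is a placeholder
  make -j8 install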


M

On Sat, May 23, 2015 at 1:07 AM, David Shrader  wrote:

> Hello,
>
> I'm getting a spurious '-L' flag when I have mxm installed in system-space
> (/usr/lib64/libmxm.so) which is causing an error at link time during make:
>
> ...output snipped...
> /bin/sh ../../../../libtool  --tag=CC   --mode=link gcc -std=gnu99 -O3
> -DNDEBUG -I/opt/panfs/include -finline-functions -fno-strict-aliasing
> -pthread -module -avoid-version   -o libmca_mtl_mxm.la  mtl_mxm.lo
> mtl_mxm_cancel.lo mtl_mxm_component.lo mtl_mxm_endpoint.lo mtl_mxm_probe.lo
> mtl_mxm_recv.lo mtl_mxm_send.lo -lmxm -L -lrt -lm -lutil
> libtool: link: require no space between `-L' and `-lrt'
> make[2]: *** [libmca_mtl_mxm.la] Error 1
> make[2]: Leaving directory
> `/turquoise/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.8.5/openmpi-1.8.5/ompi/mca/mtl/mxm'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory
> `/turquoise/usr/projects/hpctools/dshrader/hpcsoft/openmpi/1.8.5/openmpi-1.8.5/ompi'
> make: *** [all-recursive] Error 1
>
> If I I use --with-mxm=no, then this error doesn't occur (as expected as
> the mxm component isn't touched). Has anyone run in to this before?
>
> Here is my configure line:
>
> ./configure --disable-silent-rules
> --with-platform=contrib/platform/lanl/toss/optimized-panasas --prefix=...
>
> I wonder if there is an empty variable that should contain the directory
> libmxm is in somewhere in configure since no directory is passed to
> --with-mxm which is then paired with a "-L". I think I'll go through the
> configure script while waiting to see if anyone else has run in to this.
>
> Thank you for any and all help,
> David
>
> --
> David Shrader
> HPC-3 High Performance Computer Systems
> Los Alamos National Lab
> Email: dshrader  lanl.gov
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/05/26904.php
>



-- 

Kind Regards,

M.


Re: [OMPI users] MPI_THREAD_MULTIPLE and openib btl

2015-04-28 Thread Mike Dubman
Add -mca pml_base_verbose 10 to your mpirun command line.

You should see:

select: component yalla selected

For mxm debug info, please add:

-x LD_PRELOAD=$MXM_DIR/lib/libmxm-debug.so -x MXM_LOG_LEVEL=debug
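Putting both together, something like the following (a sketch only; the rank count and ./a.out are placeholders, and $MXM_DIR must point at your MXM install):

  mpirun -np 2 -mca pml_base_verbose 10 \
         -x LD_PRELOAD=$MXM_DIR/lib/libmxm-debug.so \
         -x MXM_LOG_LEVEL=debug \
         ./a.out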


On Tue, Apr 28, 2015 at 7:54 AM, Subhra Mazumdar <subhramazumd...@gmail.com>
wrote:

> Is there any way (probe or trace or other) to sanity check that I am
> indeed using #2 ?
>
> Subhra
>
> On Fri, Apr 24, 2015 at 12:55 AM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
>
>> yes
>>
>> #1 - ob1 as pml, openib as btl (default: rc)
>> #2 - yalla as pml, mxm as IB library (default: ud, use "-x
>> MXM_TLS=rc,self,shm" for rc)
>> #3 - cm as pml, mxm as mtl and mxm as a transport (default: ud, use
>> params from #2 for rc)
>>
>> On Fri, Apr 24, 2015 at 10:46 AM, Subhra Mazumdar <
>> subhramazumd...@gmail.com> wrote:
>>
>>> I am a little confused now, I ran 3 different ways and got 3 different
>>> performance from best to worse in following order:
>>>
>>> 1) mpirun --allow-run-as-root --mca pml ob1  -n 1 /root/backend
>>>  localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2
>>>
>>> 2) mpirun --allow-run-as-root  -n 1 /root/backend  localhost : -x
>>> LD_PRELOAD=/root/libci.so -n 1 /root/app2
>>>
>>> 3) mpirun --allow-run-as-root --mca pml cm --mca mtl mxm  -n 1
>>> /root/backend  localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2
>>>
>>> Are all of the above using infiniband but in different ways?
>>>
>>> Thanks,
>>> Subhra.
>>>
>>>
>>>
>>> On Thu, Apr 23, 2015 at 11:57 PM, Mike Dubman <mi...@dev.mellanox.co.il>
>>> wrote:
>>>
>>>> HPCX package uses pml "yalla" by default (part of ompi master branch,
>>>> not in v1.8).
>>>> So, "-mca mtl mxm" has no effect, unless "-mca pml cm" specified to
>>>> disable "pml yalla" and let mtl  layer to play.
>>>>
>>>>
>>>>
>>>> On Fri, Apr 24, 2015 at 6:36 AM, Subhra Mazumdar <
>>>> subhramazumd...@gmail.com> wrote:
>>>>
>>>>> I changed my downloaded MOFED version to match the one installed on
>>>>> the node and now the error goes away and it runs fine. But I still have a
>>>>> question, I get the exact same performance on all the below 3 cases:
>>>>>
>>>>> 1) mpirun --allow-run-as-root  --mca mtl mxm -mca mtl_mxm_np 0 -x
>>>>> MXM_TLS=self,shm,rc,ud -n 1 /root/backend  localhost : -x
>>>>> LD_PRELOAD=/root/libci.so -n 1 /root/app2
>>>>>
>>>>> 2) mpirun --allow-run-as-root  --mca mtl mxm -n 1 /root/backend
>>>>>  localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2
>>>>>
>>>>> 3) mpirun --allow-run-as-root  --mca mtl ^mxm -n 1 /root/backend
>>>>>  localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2
>>>>>
>>>>> Seems like it doesn't matter if I use mxm, not use mxm or use it with
>>>>> reliable connection (RC). How can I be sure I am indeed using mxm over
>>>>> infiniband?
>>>>>
>>>>> Thanks,
>>>>> Subhra.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 23, 2015 at 1:06 AM, Mike Dubman <mi...@dev.mellanox.co.il
>>>>> > wrote:
>>>>>
>>>>>> /usr/bin/ofed_info
>>>>>>
>>>>>> So, the OFED on your system is not MellanoxOFED 2.4.x but smth else.
>>>>>>
>>>>>> try #rpm -qi libibverbs
>>>>>>
>>>>>>
>>>>>> On Thu, Apr 23, 2015 at 7:47 AM, Subhra Mazumdar <
>>>>>> subhramazumd...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> where is the command ofed_info located? I searched from / but didn't
>>>>>>> find it.
>>>>>>>
>>>>>>> Subhra.
>>>>>>>
>>>>>>> On Tue, Apr 21, 2015 at 10:43 PM, Mike Dubman <
>>>>>>> mi...@dev.mellanox.co.il> wrote:
>>>>>>>
>>>>>>>> cool, progress!
>>>>>>>>
>>>>>>>> >>1429676565.124664] sys.c:719  MXM  WARN  Conflicting CPU
>>>>>>>> frequencies detected, using

Re: [OMPI users] MPI_Finalize not behaving correctly, orphaned processes

2015-04-26 Thread Mike Dubman
you are right, Jeff.

For security reasons, the "child" is not allowed to share memory with the
parent.

On Fri, Apr 24, 2015 at 9:20 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com
> wrote:

> Does the child process end up with valid memory in the buffer in that
> sample?  Back when I paid attention to verbs (which was admittedly a long
> time ago), the sample I pasted would segv...
>
>
> > On Apr 24, 2015, at 9:40 AM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
> >
> > ibv_fork_init() will set special flag for madvise()
> (IBV_DONTFORK/DOFORK) to inherit (and not cow) registered/locked pages on
> fork() and will maintain refcount for cleanup.
> >
> > I think some minimal kernel version required (2.6.x) which supports
> these flags.
> >
> > I can check if internally if you think the behave is different.
> >
> >
> > On Fri, Apr 24, 2015 at 1:41 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> > Mike --
> >
> > What happens when you do this?
> >
> > 
> > ibv_fork_init();
> >
> > int *buffer = malloc(...);
> > ibv_reg_mr(buffer, ...);
> >
> > if (fork() != 0) {
> > // in the child
> > *buffer = 3;
> > // ...
> > }
> > 
> >
> >
> >
> > > On Apr 24, 2015, at 2:54 AM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
> > >
> > > btw, ompi master now calls ibv_fork_init() before initializing
> btl/mtl/oob frameworks and all fork fears should be addressed.
> > >
> > >
> > > On Fri, Apr 24, 2015 at 4:37 AM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> > > Disable the memory manager / don't use leave pinned.  Then you can
> fork/exec without fear (because only MPI will have registered memory --
> it'll never leave user buffers registered after MPI communications finish).
> > >
> > >
> > > > On Apr 23, 2015, at 9:25 PM, Howard Pritchard <hpprit...@gmail.com>
> wrote:
> > > >
> > > > Jeff
> > > >
> > > > this is kind of a lanl thing. Jack and I are working offline.  any
> suggestions about openib and fork/exec may be useful however...and don't
> say no to fork/exec not at least if you dream of mpi in the data center.
> > > >
> > > > On Apr 23, 2015 10:49 AM, "Galloway, Jack D" <ja...@lanl.gov> wrote:
> > > > I am using a “homecooked” cluster at LANL, ~500 cores.  There are a
> whole bunch of fortran system calls doing the copying and pasting.  The
> full code is attached here, a bunch of if-then statements for user
> options.  Thanks for the help.
> > > >
> > > >
> > > >
> > > > --Jack Galloway
> > > >
> > > >
> > > >
> > > > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Howard
> Pritchard
> > > > Sent: Thursday, April 23, 2015 8:15 AM
> > > > To: Open MPI Users
> > > > Subject: Re: [OMPI users] MPI_Finalize not behaving correctly,
> orphaned processes
> > > >
> > > >
> > > >
> > > > Hi Jack,
> > > >
> > > > Are you using a system at LANL? Maybe I could try to reproduce the
> problem on the system you are using.  The system call stuff adds a certain
> bit of zest to the problem.  does the app make fortran system calls to do
> the copying and pasting?
> > > >
> > > > Howard
> > > >
> > > > On Apr 22, 2015 4:24 PM, "Galloway, Jack D" <ja...@lanl.gov> wrote:
> > > >
> > > > I have an MPI program that is fairly straight forward, essentially
> "initialize, 2 sends from master to slaves, 2 receives on slaves, do a
> bunch of system calls for copying/pasting then running a serial code on
> each mpi task, tidy up and mpi finalize".
> > > >
> > > > This seems straightforward, but I'm not getting mpi_finalize to work
> correctly. Below is a snapshot of the program, without all the system
> copy/paste/call external code which I've rolled up in "do codish stuff"
> type statements.
> > > >
> > > > program mpi_finalize_break
> > > >
> > > > !
> > > >
> > > > call MPI_INIT(ierr)
> > > >
> > > > icomm = MPI_COMM_WORLD
> > > >
> > > > call MPI_COMM_SIZE(icomm,nproc,ierr)
> > > >
> > > > call MPI_COMM_RANK(icomm,rank,ierr)
> > > >
> > > >
> > > >
> > > >

Re: [OMPI users] MPI_Finalize not behaving correctly, orphaned processes

2015-04-24 Thread Mike Dubman
ibv_fork_init() sets a special flag via madvise() (IBV_DONTFORK/DOFORK) so that
registered/locked pages are inherited (and not copied-on-write) on fork(), and
it maintains a refcount for cleanup.

I think a minimal kernel version (2.6.x) that supports these flags is required.

I can check internally if you think the behavior is different.
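If it helps while debugging, libibverbs can also be asked to call ibv_fork_init() itself at startup via its standard RDMAV_FORK_SAFE environment switch (a sketch; ./a.out and the rank count are placeholders):

  uname -r                                   # check the kernel version
  mpirun -np 2 -x RDMAV_FORK_SAFE=1 ./a.out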


On Fri, Apr 24, 2015 at 1:41 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com
> wrote:

> Mike --
>
> What happens when you do this?
>
> 
> ibv_fork_init();
>
> int *buffer = malloc(...);
> ibv_reg_mr(buffer, ...);
>
> if (fork() != 0) {
> // in the child
> *buffer = 3;
> // ...
> }
> ----
>
>
>
> > On Apr 24, 2015, at 2:54 AM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
> >
> > btw, ompi master now calls ibv_fork_init() before initializing
> btl/mtl/oob frameworks and all fork fears should be addressed.
> >
> >
> > On Fri, Apr 24, 2015 at 4:37 AM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> > Disable the memory manager / don't use leave pinned.  Then you can
> fork/exec without fear (because only MPI will have registered memory --
> it'll never leave user buffers registered after MPI communications finish).
> >
> >
> > > On Apr 23, 2015, at 9:25 PM, Howard Pritchard <hpprit...@gmail.com>
> wrote:
> > >
> > > Jeff
> > >
> > > this is kind of a lanl thing. Jack and I are working offline.  any
> suggestions about openib and fork/exec may be useful however...and don't
> say no to fork/exec not at least if you dream of mpi in the data center.
> > >
> > > On Apr 23, 2015 10:49 AM, "Galloway, Jack D" <ja...@lanl.gov> wrote:
> > > I am using a “homecooked” cluster at LANL, ~500 cores.  There are a
> whole bunch of fortran system calls doing the copying and pasting.  The
> full code is attached here, a bunch of if-then statements for user
> options.  Thanks for the help.
> > >
> > >
> > >
> > > --Jack Galloway
> > >
> > >
> > >
> > > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Howard
> Pritchard
> > > Sent: Thursday, April 23, 2015 8:15 AM
> > > To: Open MPI Users
> > > Subject: Re: [OMPI users] MPI_Finalize not behaving correctly,
> orphaned processes
> > >
> > >
> > >
> > > Hi Jack,
> > >
> > > Are you using a system at LANL? Maybe I could try to reproduce the
> problem on the system you are using.  The system call stuff adds a certain
> bit of zest to the problem.  does the app make fortran system calls to do
> the copying and pasting?
> > >
> > > Howard
> > >
> > > On Apr 22, 2015 4:24 PM, "Galloway, Jack D" <ja...@lanl.gov> wrote:
> > >
> > > I have an MPI program that is fairly straight forward, essentially
> "initialize, 2 sends from master to slaves, 2 receives on slaves, do a
> bunch of system calls for copying/pasting then running a serial code on
> each mpi task, tidy up and mpi finalize".
> > >
> > > This seems straightforward, but I'm not getting mpi_finalize to work
> correctly. Below is a snapshot of the program, without all the system
> copy/paste/call external code which I've rolled up in "do codish stuff"
> type statements.
> > >
> > > program mpi_finalize_break
> > >
> > > !
> > >
> > > call MPI_INIT(ierr)
> > >
> > > icomm = MPI_COMM_WORLD
> > >
> > > call MPI_COMM_SIZE(icomm,nproc,ierr)
> > >
> > > call MPI_COMM_RANK(icomm,rank,ierr)
> > >
> > >
> > >
> > > !
> > >
> > > if (rank == 0) then
> > >
> > > ! slaves>
> > >
> > > call MPI_SEND(numat,1,MPI_INTEGER,n,0,icomm,ierr)
> > >
> > > call MPI_SEND(n_to_add,1,MPI_INTEGER,n,0,icomm,ierr)
> > >
> > > else
> > >
> > > call MPI_Recv(begin_mat,1,MPI_INTEGER,0,0,icomm,status,ierr)
> > >
> > > call MPI_Recv(nrepeat,1,MPI_INTEGER,0,0,icomm,status,ierr)
> > >
> > > !
> > >
> > > endif
> > >
> > >
> > >
> > > print*, "got here4", rank
> > >
> > > call MPI_BARRIER(icomm,ierr)
> > >
> > > print*, "got here5", rank, ierr
> > >
> > > call MPI_FINALIZE(ierr)
> > >
> > >
> > >
> > > print*, "got here6"
> > >
> > > end program mpi_finalize_break
> >

Re: [OMPI users] MPI_THREAD_MULTIPLE and openib btl

2015-04-24 Thread Mike Dubman
yes

#1 - ob1 as pml, openib as btl (default: rc)
#2 - yalla as pml, mxm as IB library (default: ud, use "-x
MXM_TLS=rc,self,shm" for rc)
#3 - cm as pml, mxm as mtl and mxm as a transport (default: ud, use params
from #2 for rc)
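For example, to select each mode explicitly (a sketch; the rank count, ./a.out, and the exact btl list are illustrative):

  # 1) ob1 pml + openib btl
  mpirun -np 2 -mca pml ob1 -mca btl openib,self,sm ./a.out
  # 2) yalla pml + mxm, forcing RC
  mpirun -np 2 -mca pml yalla -x MXM_TLS=rc,self,shm ./a.out
  # 3) cm pml + mxm mtl
  mpirun -np 2 -mca pml cm -mca mtl mxm ./a.out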

On Fri, Apr 24, 2015 at 10:46 AM, Subhra Mazumdar <subhramazumd...@gmail.com
> wrote:

> I am a little confused now, I ran 3 different ways and got 3 different
> performance from best to worse in following order:
>
> 1) mpirun --allow-run-as-root --mca pml ob1  -n 1 /root/backend  localhost
> : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2
>
> 2) mpirun --allow-run-as-root  -n 1 /root/backend  localhost : -x
> LD_PRELOAD=/root/libci.so -n 1 /root/app2
>
> 3) mpirun --allow-run-as-root --mca pml cm --mca mtl mxm  -n 1
> /root/backend  localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2
>
> Are all of the above using infiniband but in different ways?
>
> Thanks,
> Subhra.
>
>
>
> On Thu, Apr 23, 2015 at 11:57 PM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
>
>> HPCX package uses pml "yalla" by default (part of ompi master branch, not
>> in v1.8).
>> So, "-mca mtl mxm" has no effect, unless "-mca pml cm" specified to
>> disable "pml yalla" and let mtl  layer to play.
>>
>>
>>
>> On Fri, Apr 24, 2015 at 6:36 AM, Subhra Mazumdar <
>> subhramazumd...@gmail.com> wrote:
>>
>>> I changed my downloaded MOFED version to match the one installed on the
>>> node and now the error goes away and it runs fine. But I still have a
>>> question, I get the exact same performance on all the below 3 cases:
>>>
>>> 1) mpirun --allow-run-as-root  --mca mtl mxm -mca mtl_mxm_np 0 -x
>>> MXM_TLS=self,shm,rc,ud -n 1 /root/backend  localhost : -x
>>> LD_PRELOAD=/root/libci.so -n 1 /root/app2
>>>
>>> 2) mpirun --allow-run-as-root  --mca mtl mxm -n 1 /root/backend
>>>  localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2
>>>
>>> 3) mpirun --allow-run-as-root  --mca mtl ^mxm -n 1 /root/backend
>>>  localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2
>>>
>>> Seems like it doesn't matter if I use mxm, not use mxm or use it with
>>> reliable connection (RC). How can I be sure I am indeed using mxm over
>>> infiniband?
>>>
>>> Thanks,
>>> Subhra.
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Apr 23, 2015 at 1:06 AM, Mike Dubman <mi...@dev.mellanox.co.il>
>>> wrote:
>>>
>>>> /usr/bin/ofed_info
>>>>
>>>> So, the OFED on your system is not MellanoxOFED 2.4.x but smth else.
>>>>
>>>> try #rpm -qi libibverbs
>>>>
>>>>
>>>> On Thu, Apr 23, 2015 at 7:47 AM, Subhra Mazumdar <
>>>> subhramazumd...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> where is the command ofed_info located? I searched from / but didn't
>>>>> find it.
>>>>>
>>>>> Subhra.
>>>>>
>>>>> On Tue, Apr 21, 2015 at 10:43 PM, Mike Dubman <
>>>>> mi...@dev.mellanox.co.il> wrote:
>>>>>
>>>>>> cool, progress!
>>>>>>
>>>>>> >>1429676565.124664] sys.c:719  MXM  WARN  Conflicting CPU
>>>>>> frequencies detected, using: 2601.00
>>>>>>
>>>>>> means that cpu governor on your machine is not on "performance" mode
>>>>>>
>>>>>> >> MXM  ERROR ibv_query_device() returned 38: Function not
>>>>>> implemented
>>>>>>
>>>>>> indicates that ofed installed on your nodes is not indeed 2.4.-1.0.0
>>>>>> or there is a mismatch between ofed kernel drivers version and ofed
>>>>>> userspace libraries version.
>>>>>> or you have multiple ofed libraries installed on your node and use
>>>>>> incorrect one.
>>>>>> could you please check that ofed_info -s indeed prints mofed
>>>>>> 2.4-1.0.0?
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 22, 2015 at 7:59 AM, Subhra Mazumdar <
>>>>>> subhramazumd...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I compiled the openmpi that comes inside the mellanox hpcx pa

Re: [OMPI users] MPI_THREAD_MULTIPLE and openib btl

2015-04-24 Thread Mike Dubman
HPCX package uses pml "yalla" by default (part of ompi master branch, not
in v1.8).
So, "-mca mtl mxm" has no effect, unless "-mca pml cm" specified to disable
"pml yalla" and let mtl  layer to play.



On Fri, Apr 24, 2015 at 6:36 AM, Subhra Mazumdar <subhramazumd...@gmail.com>
wrote:

> I changed my downloaded MOFED version to match the one installed on the
> node and now the error goes away and it runs fine. But I still have a
> question, I get the exact same performance on all the below 3 cases:
>
> 1) mpirun --allow-run-as-root  --mca mtl mxm -mca mtl_mxm_np 0 -x
> MXM_TLS=self,shm,rc,ud -n 1 /root/backend  localhost : -x
> LD_PRELOAD=/root/libci.so -n 1 /root/app2
>
> 2) mpirun --allow-run-as-root  --mca mtl mxm -n 1 /root/backend  localhost
> : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2
>
> 3) mpirun --allow-run-as-root  --mca mtl ^mxm -n 1 /root/backend
>  localhost : -x LD_PRELOAD=/root/libci.so -n 1 /root/app2
>
> Seems like it doesn't matter if I use mxm, not use mxm or use it with
> reliable connection (RC). How can I be sure I am indeed using mxm over
> infiniband?
>
> Thanks,
> Subhra.
>
>
>
>
>
> On Thu, Apr 23, 2015 at 1:06 AM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
>
>> /usr/bin/ofed_info
>>
>> So, the OFED on your system is not MellanoxOFED 2.4.x but smth else.
>>
>> try #rpm -qi libibverbs
>>
>>
>> On Thu, Apr 23, 2015 at 7:47 AM, Subhra Mazumdar <
>> subhramazumd...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> where is the command ofed_info located? I searched from / but didn't
>>> find it.
>>>
>>> Subhra.
>>>
>>> On Tue, Apr 21, 2015 at 10:43 PM, Mike Dubman <mi...@dev.mellanox.co.il>
>>> wrote:
>>>
>>>> cool, progress!
>>>>
>>>> >>1429676565.124664] sys.c:719  MXM  WARN  Conflicting CPU
>>>> frequencies detected, using: 2601.00
>>>>
>>>> means that cpu governor on your machine is not on "performance" mode
>>>>
>>>> >> MXM  ERROR ibv_query_device() returned 38: Function not implemented
>>>>
>>>> indicates that ofed installed on your nodes is not indeed 2.4.-1.0.0 or
>>>> there is a mismatch between ofed kernel drivers version and ofed userspace
>>>> libraries version.
>>>> or you have multiple ofed libraries installed on your node and use
>>>> incorrect one.
>>>> could you please check that ofed_info -s indeed prints mofed 2.4-1.0.0?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, Apr 22, 2015 at 7:59 AM, Subhra Mazumdar <
>>>> subhramazumd...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I compiled the openmpi that comes inside the mellanox hpcx package
>>>>> with mxm support instead of separately downloaded openmpi. I also used the
>>>>> environment as in the README so that no LD_PRELOAD (except our own library
>>>>> which is unrelated) is needed. Now it runs fine (no segfault) but we get
>>>>> same errors as before (saying initialization of MXM library failed). Is it
>>>>> using MXM successfully?
>>>>>
>>>>> [root@JARVICE
>>>>> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5]# mpirun
>>>>> --allow-run-as-root  --mca mtl mxm -n 1 /root/backend  localhost : -x
>>>>> LD_PRELOAD=/root/libci.so -n 1 /root/app2
>>>>>
>>>>> --
>>>>> WARNING: a request was made to bind a process. While the system
>>>>> supports binding the process itself, at least one node does NOT
>>>>> support binding memory to the process location.
>>>>>
>>>>>   Node:  JARVICE
>>>>>
>>>>> This usually is due to not having the required NUMA support installed
>>>>> on the node. In some Linux distributions, the required support is
>>>>> contained in the libnumactl and libnumactl-devel packages.
>>>>> This is a warning only; your job will continue, though performance may
>>>>> be degraded.
>>>>>
>>>>> --
>>>>>  i am backend
>>>>> [1429676565.121218] sys.c:719  MXM  WARN  Conflicting CPU
>>>>> frequencies detecte

Re: [OMPI users] MPI_Finalize not behaving correctly, orphaned processes

2015-04-24 Thread Mike Dubman
btw, ompi master now calls ibv_fork_init() before initializing the btl/mtl/oob
frameworks, so all fork fears should be addressed.
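For the 1.8 series, one way to follow the "don't use leave pinned" advice quoted below is to turn that behavior off explicitly (a sketch; ./a.out and the rank count are placeholders):

  mpirun -np 2 -mca mpi_leave_pinned 0 ./a.out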


On Fri, Apr 24, 2015 at 4:37 AM, Jeff Squyres (jsquyres)  wrote:

> Disable the memory manager / don't use leave pinned.  Then you can
> fork/exec without fear (because only MPI will have registered memory --
> it'll never leave user buffers registered after MPI communications finish).
>
>
> > On Apr 23, 2015, at 9:25 PM, Howard Pritchard 
> wrote:
> >
> > Jeff
> >
> > this is kind of a lanl thing. Jack and I are working offline.  any
> suggestions about openib and fork/exec may be useful however...and don't
> say no to fork/exec not at least if you dream of mpi in the data center.
> >
> > On Apr 23, 2015 10:49 AM, "Galloway, Jack D"  wrote:
> > I am using a “homecooked” cluster at LANL, ~500 cores.  There are a
> whole bunch of fortran system calls doing the copying and pasting.  The
> full code is attached here, a bunch of if-then statements for user
> options.  Thanks for the help.
> >
> >
> >
> > --Jack Galloway
> >
> >
> >
> > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Howard
> Pritchard
> > Sent: Thursday, April 23, 2015 8:15 AM
> > To: Open MPI Users
> > Subject: Re: [OMPI users] MPI_Finalize not behaving correctly, orphaned
> processes
> >
> >
> >
> > Hi Jack,
> >
> > Are you using a system at LANL? Maybe I could try to reproduce the
> problem on the system you are using.  The system call stuff adds a certain
> bit of zest to the problem.  does the app make fortran system calls to do
> the copying and pasting?
> >
> > Howard
> >
> > On Apr 22, 2015 4:24 PM, "Galloway, Jack D"  wrote:
> >
> > I have an MPI program that is fairly straight forward, essentially
> "initialize, 2 sends from master to slaves, 2 receives on slaves, do a
> bunch of system calls for copying/pasting then running a serial code on
> each mpi task, tidy up and mpi finalize".
> >
> > This seems straightforward, but I'm not getting mpi_finalize to work
> correctly. Below is a snapshot of the program, without all the system
> copy/paste/call external code which I've rolled up in "do codish stuff"
> type statements.
> >
> > program mpi_finalize_break
> >
> > !
> >
> > call MPI_INIT(ierr)
> >
> > icomm = MPI_COMM_WORLD
> >
> > call MPI_COMM_SIZE(icomm,nproc,ierr)
> >
> > call MPI_COMM_RANK(icomm,rank,ierr)
> >
> >
> >
> > !
> >
> > if (rank == 0) then
> >
> > ! slaves>
> >
> > call MPI_SEND(numat,1,MPI_INTEGER,n,0,icomm,ierr)
> >
> > call MPI_SEND(n_to_add,1,MPI_INTEGER,n,0,icomm,ierr)
> >
> > else
> >
> > call MPI_Recv(begin_mat,1,MPI_INTEGER,0,0,icomm,status,ierr)
> >
> > call MPI_Recv(nrepeat,1,MPI_INTEGER,0,0,icomm,status,ierr)
> >
> > !
> >
> > endif
> >
> >
> >
> > print*, "got here4", rank
> >
> > call MPI_BARRIER(icomm,ierr)
> >
> > print*, "got here5", rank, ierr
> >
> > call MPI_FINALIZE(ierr)
> >
> >
> >
> > print*, "got here6"
> >
> > end program mpi_finalize_break
> >
> > Now the problem I am seeing occurs around the "got here4", "got here5"
> and "got here6" statements. I get the appropriate number of print
> statements with corresponding ranks for "got here4", as well as "got
> here5". Meaning, the master and all the slaves (rank 0, and all other
> ranks) got to the barrier call, through the barrier call, and to
> MPI_FINALIZE, reporting 0 for ierr on all of them. However, when it gets to
> "got here6", after the MPI_FINALIZE I'll get all kinds of weird behavior.
> Sometimes I'll get one less "got here6" than I expect, or sometimes I'll
> get eight less (it varies), however the program hangs forever, never
> closing and leaves an orphaned process on one (or more) of the compute
> nodes.
> >
> > I am running this on an infiniband backbone machine, with the NFS server
> shared over infiniband (nfs-rdma). I'm trying to determine how the
> MPI_BARRIER call works fine, yet MPI_FINALIZE ends up with random orphaned
> runs (not the same node, nor the same number of orphans every time). I'm
> guessing it is related to the various system calls to cp, mv,
> ./run_some_code, cp, mv but wasn't sure if it may be related to the speed
> of infiniband too, as all this happens fairly quickly. I could have wrong
> intuition as well. Anybody have thoughts? I could put the whole code if
> helpful, but this condensed version I believe captures it. I'm running
> openmpi1.8.4 compiled against ifort 15.0.2 , with Mellanox adapters running
> firmware 2.9.1000.  This is the mellanox firmware available through yum
> with centos 6.5, 2.6.32-504.8.1.el6.x86_64.
> >
> > ib0   Link encap:InfiniBand  HWaddr
> 80:00:00:48:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
> >
> >   inet addr:192.168.6.254  Bcast:192.168.6.255
> Mask:255.255.255.0
> >
> >   inet6 addr: fe80::202:c903:57:e7fd/64 Scope:Link
> >
> >   UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
> >
> >   RX 

Re: [OMPI users] MPI_THREAD_MULTIPLE and openib btl

2015-04-23 Thread Mike Dubman
/usr/bin/ofed_info

So, the OFED on your system is not Mellanox OFED 2.4.x but something else.

Try: # rpm -qi libibverbs


On Thu, Apr 23, 2015 at 7:47 AM, Subhra Mazumdar <subhramazumd...@gmail.com>
wrote:

> Hi,
>
> where is the command ofed_info located? I searched from / but didn't find
> it.
>
> Subhra.
>
> On Tue, Apr 21, 2015 at 10:43 PM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
>
>> cool, progress!
>>
>> >>1429676565.124664] sys.c:719  MXM  WARN  Conflicting CPU
>> frequencies detected, using: 2601.00
>>
>> means that cpu governor on your machine is not on "performance" mode
>>
>> >> MXM  ERROR ibv_query_device() returned 38: Function not implemented
>>
>> indicates that ofed installed on your nodes is not indeed 2.4.-1.0.0 or
>> there is a mismatch between ofed kernel drivers version and ofed userspace
>> libraries version.
>> or you have multiple ofed libraries installed on your node and use
>> incorrect one.
>> could you please check that ofed_info -s indeed prints mofed 2.4-1.0.0?
>>
>>
>>
>>
>>
>> On Wed, Apr 22, 2015 at 7:59 AM, Subhra Mazumdar <
>> subhramazumd...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I compiled the openmpi that comes inside the mellanox hpcx package with
>>> mxm support instead of separately downloaded openmpi. I also used the
>>> environment as in the README so that no LD_PRELOAD (except our own library
>>> which is unrelated) is needed. Now it runs fine (no segfault) but we get
>>> same errors as before (saying initialization of MXM library failed). Is it
>>> using MXM successfully?
>>>
>>> [root@JARVICE
>>> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5]# mpirun
>>> --allow-run-as-root  --mca mtl mxm -n 1 /root/backend  localhost : -x
>>> LD_PRELOAD=/root/libci.so -n 1 /root/app2
>>>
>>> --
>>> WARNING: a request was made to bind a process. While the system
>>> supports binding the process itself, at least one node does NOT
>>> support binding memory to the process location.
>>>
>>>   Node:  JARVICE
>>>
>>> This usually is due to not having the required NUMA support installed
>>> on the node. In some Linux distributions, the required support is
>>> contained in the libnumactl and libnumactl-devel packages.
>>> This is a warning only; your job will continue, though performance may
>>> be degraded.
>>>
>>> --
>>>  i am backend
>>> [1429676565.121218] sys.c:719  MXM  WARN  Conflicting CPU
>>> frequencies detected, using: 2601.00
>>> [1429676565.122937] [JARVICE:14767:0]  ib_dev.c:445  MXM  WARN
>>>  failed call to ibv_exp_use_priv_env(): Function not implemented
>>> [1429676565.122950] [JARVICE:14767:0]  ib_dev.c:456  MXM  ERROR
>>> ibv_query_device() returned 38: Function not implemented
>>> [1429676565.123535] [JARVICE:14767:0]  ib_dev.c:445  MXM  WARN
>>>  failed call to ibv_exp_use_priv_env(): Function not implemented
>>> [1429676565.123543] [JARVICE:14767:0]  ib_dev.c:456  MXM  ERROR
>>> ibv_query_device() returned 38: Function not implemented
>>> [1429676565.124664] sys.c:719  MXM  WARN  Conflicting CPU
>>> frequencies detected, using: 2601.00
>>> [1429676565.126264] [JARVICE:14768:0]  ib_dev.c:445  MXM  WARN
>>>  failed call to ibv_exp_use_priv_env(): Function not implemented
>>> [1429676565.126276] [JARVICE:14768:0]  ib_dev.c:456  MXM  ERROR
>>> ibv_query_device() returned 38: Function not implemented
>>> [1429676565.126812] [JARVICE:14768:0]  ib_dev.c:445  MXM  WARN
>>>  failed call to ibv_exp_use_priv_env(): Function not implemented
>>> [1429676565.126821] [JARVICE:14768:0]  ib_dev.c:456  MXM  ERROR
>>> ibv_query_device() returned 38: Function not implemented
>>>
>>> --
>>> Initialization of MXM library failed.
>>>
>>>   Error: Input/output error
>>>
>>>
>>> --
>>>
>>> 
>>>
>>>
>>> Thanks,
>>> Subhra.
>>>
>>>
>>> On Sat, Apr 18, 2015 at 12:28 AM, Mike Dubman <mi...@dev.mellanox.co.il>
>>> wrote:
>>

Re: [OMPI users] MPI_THREAD_MULTIPLE and openib btl

2015-04-22 Thread Mike Dubman
cool, progress!

>>1429676565.124664] sys.c:719  MXM  WARN  Conflicting CPU
frequencies detected, using: 2601.00

means that the CPU governor on your machine is not in "performance" mode

>> MXM  ERROR ibv_query_device() returned 38: Function not implemented

indicates that the OFED installed on your nodes is not actually 2.4-1.0.0, or
there is a mismatch between the OFED kernel driver version and the OFED
userspace library version, or you have multiple OFED libraries installed on
the node and are using the wrong one.
Could you please check that ofed_info -s indeed prints MOFED 2.4-1.0.0?
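A quick way to check both points on each node (a sketch; the cpufreq sysfs path can vary by distro and kernel):

  ofed_info -s
  rpm -qi libibverbs
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor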





On Wed, Apr 22, 2015 at 7:59 AM, Subhra Mazumdar <subhramazumd...@gmail.com>
wrote:

> Hi,
>
> I compiled the openmpi that comes inside the mellanox hpcx package with
> mxm support instead of separately downloaded openmpi. I also used the
> environment as in the README so that no LD_PRELOAD (except our own library
> which is unrelated) is needed. Now it runs fine (no segfault) but we get
> same errors as before (saying initialization of MXM library failed). Is it
> using MXM successfully?
>
> [root@JARVICE
> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5]# mpirun
> --allow-run-as-root  --mca mtl mxm -n 1 /root/backend  localhost : -x
> LD_PRELOAD=/root/libci.so -n 1 /root/app2
> --
> WARNING: a request was made to bind a process. While the system
> supports binding the process itself, at least one node does NOT
> support binding memory to the process location.
>
>   Node:  JARVICE
>
> This usually is due to not having the required NUMA support installed
> on the node. In some Linux distributions, the required support is
> contained in the libnumactl and libnumactl-devel packages.
> This is a warning only; your job will continue, though performance may be
> degraded.
> --
>  i am backend
> [1429676565.121218] sys.c:719  MXM  WARN  Conflicting CPU
> frequencies detected, using: 2601.00
> [1429676565.122937] [JARVICE:14767:0]  ib_dev.c:445  MXM  WARN  failed
> call to ibv_exp_use_priv_env(): Function not implemented
> [1429676565.122950] [JARVICE:14767:0]  ib_dev.c:456  MXM  ERROR
> ibv_query_device() returned 38: Function not implemented
> [1429676565.123535] [JARVICE:14767:0]  ib_dev.c:445  MXM  WARN  failed
> call to ibv_exp_use_priv_env(): Function not implemented
> [1429676565.123543] [JARVICE:14767:0]  ib_dev.c:456  MXM  ERROR
> ibv_query_device() returned 38: Function not implemented
> [1429676565.124664] sys.c:719  MXM  WARN  Conflicting CPU
> frequencies detected, using: 2601.00
> [1429676565.126264] [JARVICE:14768:0]  ib_dev.c:445  MXM  WARN  failed
> call to ibv_exp_use_priv_env(): Function not implemented
> [1429676565.126276] [JARVICE:14768:0]  ib_dev.c:456  MXM  ERROR
> ibv_query_device() returned 38: Function not implemented
> [1429676565.126812] [JARVICE:14768:0]  ib_dev.c:445  MXM  WARN  failed
> call to ibv_exp_use_priv_env(): Function not implemented
> [1429676565.126821] [JARVICE:14768:0]  ib_dev.c:456  MXM  ERROR
> ibv_query_device() returned 38: Function not implemented
> --
> Initialization of MXM library failed.
>
>   Error: Input/output error
>
> --
>
> 
>
>
> Thanks,
> Subhra.
>
>
> On Sat, Apr 18, 2015 at 12:28 AM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
>
>> could you please check that ofed_info -s indeed prints mofed 2.4-1.0.0?
>> why LD_PRELOAD needed in your command line? Can you try
>>
>> module load hpcx
>> mpirun -np $np test.exe
>> ?
>>
>> On Sat, Apr 18, 2015 at 8:39 AM, Subhra Mazumdar <
>> subhramazumd...@gmail.com> wrote:
>>
>>> I followed the instructions as in the README, now getting a different
>>> error:
>>>
>>> [root@JARVICE hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5]#
>>> ../openmpi-1.8.4/openmpinstall/bin/mpirun --allow-run-as-root --mca mtl mxm
>>> -x LD_PRELOAD="../openmpi-1.8.4/openmpinstall/lib/libmpi.so.1
>>> ./mxm/lib/libmxm.so.2" -n 1 ../backend localhost : -x
>>> LD_PRELOAD="../openmpi-1.8.4/openmpinstall/lib/libmpi.so.1
>>> ./mxm/lib/libmxm.so.2 ../libci.so" -n 1 ../app2
>>>
>>>
>>> --
>>>
>>> WARNING: a request was made to bind a process. While the system
>>>
>>> s

Re: [OMPI users] MPI_THREAD_MULTIPLE and openib btl

2015-04-18 Thread Mike Dubman
>
> 2 0x0005640c mxm_handle_error()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:641
>
> 3 0x0005657c mxm_error_signal_handler()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:616
>
> 4 0x000329a0 killpg()  ??:0
>
> 5 0x0004812c _IO_vfprintf()  ??:0
>
> 6 0x0006f6da vasprintf()  ??:0
>
> 7 0x00059b3b opal_show_help_vstring()  ??:0
>
> 8 0x00026630 orte_show_help()  ??:0
>
> 9 0x1a3f mca_bml_r2_add_procs()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/ompi-mellanox-v1.8/ompi/mca/bml/r2/bml_r2.c:409
>
> 10 0x4475 mca_pml_ob1_add_procs()
>  
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/ompi-mellanox-v1.8/ompi/mca/pml/ob1/pml_ob1.c:332
>
> 11 0x000442f3 ompi_mpi_init()  ??:0
>
> 12 0x00067cb0 PMPI_Init_thread()  ??:0
>
> 13 0x00404fdf main()  /root/rain_ib/backend/backend.c:1237
>
> 14 0x0001ed1d __libc_start_main()  ??:0
>
> 15 0x00402db9 _start()  ??:0
>
> ===
>
> --
>
> mpirun noticed that process rank 1 with PID 450 on node JARVICE exited on
> signal 11 (Segmentation fault).
>
> --
>
> [JARVICE:00447] 1 more process has sent help message help-mtl-mxm.txt /
> mxm init
>
> [JARVICE:00447] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
>
> [root@JARVICE hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5]#
>
>
> Subhra.
>
>
> On Mon, Apr 13, 2015 at 10:58 PM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
>
>> Have you followed installation steps from README (Also here for reference
>> http://bgate.mellanox.com/products/hpcx/README.txt)
>>
>> ...
>>
>> * Load OpenMPI/OpenSHMEM v1.8 based package:
>>
>> % source $HPCX_HOME/hpcx-init.sh
>> % hpcx_load
>> % env | grep HPCX
>> % mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_usempi
>> % oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem
>> % hpcx_unload
>>
>> 3. Load HPCX environment from modules
>>
>> * Load OpenMPI/OpenSHMEM based package:
>>
>> % module use $HPCX_HOME/modulefiles
>> % module load hpcx
>> % mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c
>> % oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem
>> % module unload hpcx
>>
>> ...
>>
>> On Tue, Apr 14, 2015 at 5:42 AM, Subhra Mazumdar <
>> subhramazumd...@gmail.com> wrote:
>>
>>> I am using 2.4-1.0.0 mellanox ofed.
>>>
>>> I downloaded mofed tarball
>>> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5.tar and extracted
>>> it. It has mxm directory.
>>>
>>> hpcx-v1.2.0-325-[root@JARVICE ~]# ls
>>> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5
>>> archive  fcahpcx-init-ompi-mellanox-v1.8.sh  ibprof
>>> modulefiles  ompi-mellanox-v1.8  sources  VERSION
>>> bupc-master  hcoll  hpcx-init.sh knem
>>> mxm  README.txt  utils
>>>
>>> I tried using LD_PRELOAD for libmxm, but getting a different error stack
>>> now as following
>>>
>>> [root@JARVICE ~]# ./openmpi-1.8.4/openmpinstall/bin/mpirun
>>> --allow-run-as-root --mca mtl mxm -x
>>> LD_PRELOAD="./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1
>>> ./hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm/lib/libmxm.so.2"
>>> -n 1 ./backend  localhost : -x
>>> LD_PRELOAD="./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1
>>> ./hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm/lib/libmxm.so.2
>>> ./libci.so" -n 1 ./app2
>>>  i am backend
>>> [JARVICE:00564] mca: base: components_open: component pml / cm open
>>> function failed
>>> [JARVICE:564  :0] Caught signal 11 (Segmentation fault)
>>> [JARVICE:00565] mca: base: components_open: component pml / cm open
>>>

Re: [OMPI users] Select a card in a multi card system

2015-04-15 Thread Mike Dubman
Hi,
With MXM, you can specify the list of devices to use for communication:

-x MXM_IB_PORTS="mlx5_1:1,mlx4_1:1"

You can also select specific transports, or all of them:

-x MXM_TLS=shm,self,ud
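For example (a sketch; the rank count, ./a.out, and the device names are illustrative, so use the names reported by ibv_devinfo on your system):

  mpirun -np 2 -x MXM_IB_PORTS="mlx5_1:1" -x MXM_TLS=shm,self,ud ./a.out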

To change the port rate, one can use ibportstate; see:

http://www.hpcadvisorycouncil.com/events/2011/switzerland_workshop/pdf/Presentations/Day%202/2_IB_Tools.pdf


*M*


On Wed, Apr 15, 2015 at 10:09 AM, John Hearns 
wrote:

> If you have a system with two IB cards, can you choose using a command
> line switch which card to use with Openmpi?
>
> Also a more general question - can you change (or throttle back) the speed
> at which an Infiniband card works at?
> For example, to use an fDR card at QDR speeds.
>
> Thanks for any insights!
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/04/26734.php
>



-- 

Kind Regards,

M.


Re: [OMPI users] MPI_THREAD_MULTIPLE and openib btl

2015-04-14 Thread Mike Dubman
Have you followed installation steps from README (Also here for reference
http://bgate.mellanox.com/products/hpcx/README.txt)

...

* Load OpenMPI/OpenSHMEM v1.8 based package:

% source $HPCX_HOME/hpcx-init.sh
% hpcx_load
% env | grep HPCX
% mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_usempi
% oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem
% hpcx_unload

3. Load HPCX environment from modules

* Load OpenMPI/OpenSHMEM based package:

% module use $HPCX_HOME/modulefiles
% module load hpcx
% mpirun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_c
% oshrun -np 2 $HPCX_MPI_TESTS_DIR/examples/hello_oshmem
% module unload hpcx

...

On Tue, Apr 14, 2015 at 5:42 AM, Subhra Mazumdar <subhramazumd...@gmail.com>
wrote:

> I am using 2.4-1.0.0 mellanox ofed.
>
> I downloaded mofed tarball
> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5.tar and extracted
> it. It has mxm directory.
>
> hpcx-v1.2.0-325-[root@JARVICE ~]# ls
> hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5
> archive  fcahpcx-init-ompi-mellanox-v1.8.sh  ibprof  modulefiles
> ompi-mellanox-v1.8  sources  VERSION
> bupc-master  hcoll  hpcx-init.sh knemmxm
> README.txt  utils
>
> I tried using LD_PRELOAD for libmxm, but getting a different error stack
> now as following
>
> [root@JARVICE ~]# ./openmpi-1.8.4/openmpinstall/bin/mpirun
> --allow-run-as-root --mca mtl mxm -x
> LD_PRELOAD="./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1
> ./hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm/lib/libmxm.so.2"
> -n 1 ./backend  localhost : -x
> LD_PRELOAD="./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1
> ./hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm/lib/libmxm.so.2
> ./libci.so" -n 1 ./app2
>  i am backend
> [JARVICE:00564] mca: base: components_open: component pml / cm open
> function failed
> [JARVICE:564  :0] Caught signal 11 (Segmentation fault)
> [JARVICE:00565] mca: base: components_open: component pml / cm open
> function failed
> [JARVICE:565  :0] Caught signal 11 (Segmentation fault)
>  backtrace 
>  2 0x0005640c mxm_handle_error()
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:641
>  3 0x0005657c mxm_error_signal_handler()
> /scrap/jenkins/workspace/hpc-power-pack/label/r-vmb-rhel6-u5-x86-64-MOFED-CHECKER/hpcx_root/src/hpcx-v1.2.0-325-gcc-MLNX_OFED_LINUX-2.4-1.0.0-redhat6.5/mxm-v3.2/src/mxm/util/debug/debug.c:616
>  4 0x000329a0 killpg()  ??:0
>  5 0x00045491 mca_base_components_close()  ??:0
>  6 0x0004e99a mca_base_framework_close()  ??:0
>  7 0x00045431 mca_base_component_close()  ??:0
>  8 0x0004515c mca_base_framework_components_open()  ??:0
>  9 0x000a0de9 mca_pml_base_open()  pml_base_frame.c:0
> 10 0x0004eb1c mca_base_framework_open()  ??:0
> 11 0x00043eb3 ompi_mpi_init()  ??:0
> 12 0x00067cb0 PMPI_Init_thread()  ??:0
> 13 0x00404fdf main()  /root/rain_ib/backend/backend.c:1237
> 14 0x0001ed1d __libc_start_main()  ??:0
> 15 0x00402db9 _start()  ??:0
> ===
> --
> A requested component was not found, or was unable to be opened.  This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host:  JARVICE
> Framework: mtl
> Component: mxm
> --
> --
> mpirun noticed that process rank 0 with PID 564 on node JARVICE exited on
> signal 11 (Segmentation fault).
> --
> [JARVICE:00562] 1 more process has sent help message help-mca-base.txt /
> find-available:not-valid
> [JARVICE:00562] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
>
>
> Subhra
>
>
> On Sun, Apr 12, 2015 at 10:48 PM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
>
>> seems like mxm was not found in your ld_library_path.
>>
>> what mofed version do you use?
>> does it have /opt/mellanox/mxm in it?
>> You could just run mpirun from HPCX package which looks for mxm
>> internally and recompile ompi as mentioned in README.
>>

Re: [OMPI users] MPI_THREAD_MULTIPLE and openib btl

2015-04-13 Thread Mike Dubman
Seems like mxm was not found in your LD_LIBRARY_PATH.

What MOFED version do you use?
Does it have /opt/mellanox/mxm in it?
You could just run mpirun from the HPCX package, which looks for mxm
internally, and recompile OMPI as mentioned in the README.
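A few quick checks (a sketch; <ompi-prefix> stands for wherever your Open MPI is installed):

  which mpirun
  echo $LD_LIBRARY_PATH | tr ':' '\n' | grep -i mxm
  ldd <ompi-prefix>/lib/openmpi/mca_mtl_mxm.so | grep mxm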

On Mon, Apr 13, 2015 at 3:24 AM, Subhra Mazumdar <subhramazumd...@gmail.com>
wrote:

> Hi,
>
> I used mxm mtl as follows but getting segfault. It says mxm component not
> found but I have compiled openmpi with mxm. Any idea what I might be
> missing?
>
> [root@JARVICE ~]# ./openmpi-1.8.4/openmpinstall/bin/mpirun
> --allow-run-as-root --mca pml cm --mca mtl mxm -n 1 -x
> LD_PRELOAD=./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1 ./backend
> localhosst : -n 1 -x LD_PRELOAD="./libci.so
> ./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1" ./app2
>  i am backend
> [JARVICE:08398] *** Process received signal ***
> [JARVICE:08398] Signal: Segmentation fault (11)
> [JARVICE:08398] Signal code: Address not mapped (1)
> [JARVICE:08398] Failing at address: 0x10
> [JARVICE:08398] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7ff8d0ddb710]
> [JARVICE:08398] [ 1]
> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_components_close+0x21)[0x7ff8cf9ae491]
> [JARVICE:08398] [ 2]
> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_framework_close+0x6a)[0x7ff8cf9b799a]
> [JARVICE:08398] [ 3]
> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_component_close+0x21)[0x7ff8cf9ae431]
> [JARVICE:08398] [ 4]
> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_framework_components_open+0x11c)[0x7ff8cf9ae15c]
> [JARVICE:08398] [ 5]
> ./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1(+0xa0de9)[0x7ff8d1089de9]
> [JARVICE:08398] [ 6]
> /root/openmpi-1.8.4/openmpinstall/lib/libopen-pal.so.6(mca_base_framework_open+0x7c)[0x7ff8cf9b7b1c]
> [JARVICE:08398] [ 7] [JARVICE:08398] mca: base: components_open: component
> pml / cm open function failed
>
> ./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1(ompi_mpi_init+0x4b3)[0x7ff8d102ceb3]
> [JARVICE:08398] [ 8]
> ./openmpi-1.8.4/openmpinstall/lib/libmpi.so.1(PMPI_Init_thread+0x100)[0x7ff8d1050cb0]
> [JARVICE:08398] [ 9] ./backend[0x404fdf]
> [JARVICE:08398] [10]
> /lib64/libc.so.6(__libc_start_main+0xfd)[0x7ff8cfeded1d]
> [JARVICE:08398] [11] ./backend[0x402db9]
> [JARVICE:08398] *** End of error message ***
> --
> A requested component was not found, or was unable to be opened.  This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
>
> Host:  JARVICE
> Framework: mtl
> Component: mxm
> --
> --
> mpirun noticed that process rank 0 with PID 8398 on node JARVICE exited on
> signal 11 (Segmentation fault).
> --
>
>
> Subhra.
>
>
> On Fri, Apr 10, 2015 at 12:12 AM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
>
>> no need IPoIB, mxm uses native IB.
>>
>> Please see HPCX (pre-compiled ompi, integrated with MXM and FCA) README
>> file for details how to compile/select.
>>
>> The default transport is UD for internode communication and shared-memory
>> for intra-node.
>>
>> http://bgate,mellanox.com/products/hpcx/
>>
>> Also, mxm included in the Mellanox OFED.
>>
>> On Fri, Apr 10, 2015 at 5:26 AM, Subhra Mazumdar <
>> subhramazumd...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Does ipoib need to be configured on the ib cards for mxm (I have a
>>> separate ethernet connection too)? Also are there special flags in mpirun
>>> to select from UD/RC/DC? What is the default?
>>>
>>> Thanks,
>>> Subhra.
>>>
>>>
>>> On Tue, Mar 31, 2015 at 9:46 AM, Mike Dubman <mi...@dev.mellanox.co.il>
>>> wrote:
>>>
>>>> Hi,
>>>> mxm uses IB rdma/roce technologies. One can select UD/RC/DC transports
>>>> to be used in mxm.
>>>>
>>>> By selecting mxm, all MPI p2p routines will be mapped to appropriate
>>>> mxm functions.
>>>>
>>>> M
>>>>
>>>> On Mon, Mar 30, 2015 at 7:32 PM, Subhra Mazumdar <
>>>> subhramazumd...@gmail.com> wrote:
>>>>
>>>

Re: [OMPI users] MPI_THREAD_MULTIPLE and openib btl

2015-04-10 Thread Mike Dubman
no need IPoIB, mxm uses native IB.

Please see HPCX (pre-compiled ompi, integrated with MXM and FCA) README
file for details how to compile/select.

The default transport is UD for internode communication and shared-memory
for intra-node.

http://bgate.mellanox.com/products/hpcx/

Also, mxm is included in the Mellanox OFED.
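
For example (the MXM_TLS values below are illustrative and depend on your MXM
version), one could force the RC transport instead of the UD default with
something like:

  mpirun -np 16 --mca pml cm --mca mtl mxm -x MXM_TLS=rc,shm,self ./app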

On Fri, Apr 10, 2015 at 5:26 AM, Subhra Mazumdar <subhramazumd...@gmail.com>
wrote:

> Hi,
>
> Does ipoib need to be configured on the ib cards for mxm (I have a
> separate ethernet connection too)? Also are there special flags in mpirun
> to select from UD/RC/DC? What is the default?
>
> Thanks,
> Subhra.
>
>
> On Tue, Mar 31, 2015 at 9:46 AM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
>
>> Hi,
>> mxm uses IB rdma/roce technologies. Once can select UD/RC/DC transports
>> to be used in mxm.
>>
>> By selecting mxm, all MPI p2p routines will be mapped to appropriate mxm
>> functions.
>>
>> M
>>
>> On Mon, Mar 30, 2015 at 7:32 PM, Subhra Mazumdar <
>> subhramazumd...@gmail.com> wrote:
>>
>>> Hi MIke,
>>>
>>> Does the mxm mtl use infiniband rdma? Also from programming perspective,
>>> do I need to use anything else other than MPI_Send/MPI_Recv?
>>>
>>> Thanks,
>>> Subhra.
>>>
>>>
>>> On Sun, Mar 29, 2015 at 11:14 PM, Mike Dubman <mi...@dev.mellanox.co.il>
>>> wrote:
>>>
>>>> Hi,
>>>> openib btl does not support this thread model.
>>>> You can use OMPI w/ mxm (-mca mtl mxm) and multiple thread mode lin 1.8
>>>> x series or (-mca pml yalla) in the master branch.
>>>>
>>>> M
>>>>
>>>> On Mon, Mar 30, 2015 at 9:09 AM, Subhra Mazumdar <
>>>> subhramazumd...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Can MPI_THREAD_MULTIPLE and openib btl work together in open mpi
>>>>> 1.8.4? If so are there any command line options needed during run time?
>>>>>
>>>>> Thanks,
>>>>> Subhra.
>>>>>
>>>>> ___
>>>>> users mailing list
>>>>> us...@open-mpi.org
>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>>> Link to this post:
>>>>> http://www.open-mpi.org/community/lists/users/2015/03/26574.php
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Kind Regards,
>>>>
>>>> M.
>>>>
>>>> ___
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>>> Link to this post:
>>>> http://www.open-mpi.org/community/lists/users/2015/03/26575.php
>>>>
>>>
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2015/03/26580.php
>>>
>>
>>
>>
>> --
>>
>> Kind Regards,
>>
>> M.
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/03/26584.php
>>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/04/26663.php
>



-- 

Kind Regards,

M.


Re: [OMPI users] MPI_THREAD_MULTIPLE and openib btl

2015-03-31 Thread Mike Dubman
Hi,
mxm uses IB RDMA/RoCE technologies. One can select the UD/RC/DC transports to
be used in mxm.

By selecting mxm, all MPI p2p routines will be mapped to appropriate mxm
functions.

M

On Mon, Mar 30, 2015 at 7:32 PM, Subhra Mazumdar <subhramazumd...@gmail.com>
wrote:

> Hi MIke,
>
> Does the mxm mtl use infiniband rdma? Also from programming perspective,
> do I need to use anything else other than MPI_Send/MPI_Recv?
>
> Thanks,
> Subhra.
>
>
> On Sun, Mar 29, 2015 at 11:14 PM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
>
>> Hi,
>> openib btl does not support this thread model.
>> You can use OMPI w/ mxm (-mca mtl mxm) and multiple thread mode lin 1.8 x
>> series or (-mca pml yalla) in the master branch.
>>
>> M
>>
>> On Mon, Mar 30, 2015 at 9:09 AM, Subhra Mazumdar <
>> subhramazumd...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Can MPI_THREAD_MULTIPLE and openib btl work together in open mpi 1.8.4?
>>> If so are there any command line options needed during run time?
>>>
>>> Thanks,
>>> Subhra.
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2015/03/26574.php
>>>
>>
>>
>>
>> --
>>
>> Kind Regards,
>>
>> M.
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/03/26575.php
>>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/03/26580.php
>



-- 

Kind Regards,

M.


Re: [OMPI users] MPI_THREAD_MULTIPLE and openib btl

2015-03-30 Thread Mike Dubman
Hi,
openib btl does not support this thread model.
You can use OMPI w/ mxm (-mca mtl mxm) and multiple-thread mode in the 1.8.x
series, or (-mca pml yalla) in the master branch.
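
For example, with a 1.8.x build that has MXM support, the selection would look
something like this (process count and binary name are illustrative):

  mpirun -np 4 --mca pml cm --mca mtl mxm ./your_app

and on the master branch, roughly:

  mpirun -np 4 --mca pml yalla ./your_app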

M

On Mon, Mar 30, 2015 at 9:09 AM, Subhra Mazumdar 
wrote:

> Hi,
>
> Can MPI_THREAD_MULTIPLE and openib btl work together in open mpi 1.8.4?
> If so are there any command line options needed during run time?
>
> Thanks,
> Subhra.
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/03/26574.php
>



-- 

Kind Regards,

M.


Re: [OMPI users] Determine IB transport type of OpenMPI job

2015-01-11 Thread Mike Dubman
Hi,
Also, you can use the mxm library, which supports RC, UD, DC and mixed
transports, and comes as part of Mellanox OFED.
The version for community OFED is also available from
http://mellanox.com/products/hpcx


On Fri, Jan 9, 2015 at 4:03 PM, Sasso, John (GE Power & Water, Non-GE) <
john1.sa...@ge.com> wrote:

>  For a multi-node job using OpenMPI 1.6.5 over InfiniBand where the OFED
> library is used, is there a way to tell what IB transport type is being
> used (RC, UC, UD, etc)?
>
>
>
> *---john*
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/01/26152.php
>



-- 

Kind Regards,

M.


Re: [OMPI users] ERROR: C_FUNLOC function

2014-12-18 Thread Mike Dubman
Hi Siegmar,
Could you please check the /etc/mtab file for the real FS type of the
following mount points:

get_mounts: dirs[16]:/misc fs:autofs nfs:No
get_mounts: dirs[17]:/net fs:autofs nfs:No
get_mounts: dirs[18]:/home fs:autofs nfs:No

could you please check if mntent.h and paths.h were detected by "configure"
in config.log?
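
For example, something along these lines should show both (paths assumed from
your output):

  grep -E ' /(misc|net|home) ' /etc/mtab
  grep -iE 'mntent\.h|paths\.h' config.log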

Thanks


On Thu, Dec 18, 2014 at 12:39 AM, Jeff Squyres (jsquyres) <
jsquy...@cisco.com> wrote:
>
> Siegmar --
>
> I filed https://github.com/open-mpi/ompi/issues/317 and
> https://github.com/open-mpi/ompi/issues/318.
>
>
>
> On Dec 17, 2014, at 3:33 PM, Siegmar Gross <
> siegmar.gr...@informatik.hs-fulda.de> wrote:
>
> > Hi Jeff,
> >
> >> This fix was just pushed to the OMPI master.  A new master tarball
> >> should be available shortly (probably within an hour or so -- look
> >> for a tarball dated Dec 17 at http://www.open-mpi.org/nightly/master/).
> >
> > Yes, I could build it now. Thank you very much to everybody who helped
> > to fix the problem. I get an error for "make check" on Solaris 10 Sparc,
> > Solaris 10 x86_64, and OpenSUSE Linux with both gcc-4.9.2 and Sun C 5.13.
> > Hopefully I have some time tomorrow to to test this version with some
> > simple programs.
> >
> > Linux, Sun C 5.13:
> > ==
> > ...
> > PASS: opal_bit_ops
> > Failure : Mismatch: input "/home", expected:0 got:1
> >
> > Failure : Mismatch: input "/net", expected:0 got:1
> >
> > Failure : Mismatch: input "/misc", expected:0 got:1
> >
> > SUPPORT: OMPI Test failed: opal_path_nfs() (3 of 20 failed)
> > Test usage: ./opal_path_nfs [DIR]
> > On Linux interprets output from mount(8) to check for nfs and verify
> opal_path_nfs()
> > Additionally, you may specify multiple DIR on the cmd-line, of which you
> the output
> > get_mounts: dirs[0]:/dev fs:devtmpfs nfs:No
> > get_mounts: dirs[1]:/dev/shm fs:tmpfs nfs:No
> > get_mounts: dirs[2]:/run fs:tmpfs nfs:No
> > get_mounts: dirs[3]:/dev/pts fs:devpts nfs:No
> > get_mounts: dirs[4]:/ fs:ext4 nfs:No
> > get_mounts: dirs[5]:/proc fs:proc nfs:No
> > get_mounts: dirs[6]:/sys fs:sysfs nfs:No
> > get_mounts: dirs[7]:/sys/kernel/debug fs:debugfs nfs:No
> > get_mounts: dirs[8]:/sys/kernel/security fs:securityfs nfs:No
> > get_mounts: dirs[9]:/local fs:ext4 nfs:No
> > get_mounts: dirs[10]:/var/lock fs:tmpfs nfs:No
> > get_mounts: dirs[11]:/var/run fs:tmpfs nfs:No
> > get_mounts: dirs[12]:/media fs:tmpfs nfs:No
> > get_mounts: dirs[13]:/usr/local fs:nfs nfs:Yes
> > get_mounts: dirs[14]:/opt/global fs:nfs nfs:Yes
> > get_mounts: already know dir[13]:/usr/local
> > get_mounts: dirs[13]:/usr/local fs:nfs nfs:Yes
> > get_mounts: dirs[15]:/export2 fs:nfs nfs:Yes
> > get_mounts: already know dir[14]:/opt/global
> > get_mounts: dirs[14]:/opt/global fs:nfs nfs:Yes
> > get_mounts: dirs[16]:/misc fs:autofs nfs:No
> > get_mounts: dirs[17]:/net fs:autofs nfs:No
> > get_mounts: dirs[18]:/home fs:autofs nfs:No
> > get_mounts: dirs[19]:/home/fd1026 fs:nfs nfs:Yes
> > test(): file:/home/fd1026 bool:1
> > test(): file:/home bool:0
> > test(): file:/net bool:0
> > test(): file:/misc bool:0
> > test(): file:/export2 bool:1
> > test(): file:/opt/global bool:1
> > test(): file:/usr/local bool:1
> > test(): file:/media bool:0
> > test(): file:/var/run bool:0
> > test(): file:/var/lock bool:0
> > test(): file:/local bool:0
> > test(): file:/sys/kernel/security bool:0
> > test(): file:/sys/kernel/debug bool:0
> > test(): file:/sys bool:0
> > test(): file:/proc bool:0
> > test(): file:/ bool:0
> > test(): file:/dev/pts bool:0
> > test(): file:/run bool:0
> > test(): file:/dev/shm bool:0
> > test(): file:/dev bool:0
> > FAIL: opal_path_nfs
> > 
> > 1 of 2 tests failed
> > Please report to http://www.open-mpi.org/community/help/
> > 
> > make[3]: *** [check-TESTS] Error 1
> > make[3]: Leaving directory
> >
> `/export2/src/openmpi-1.9/openmpi-dev-557-g01a24c4-Linux.x86_64.64_cc/test/util'
> > make[2]: *** [check-am] Error 2
> > make[2]: Leaving directory
> >
> `/export2/src/openmpi-1.9/openmpi-dev-557-g01a24c4-Linux.x86_64.64_cc/test/util'
> > make[1]: *** [check-recursive] Error 1
> > make[1]: Leaving directory
> `/export2/src/openmpi-1.9/openmpi-dev-557-g01a24c4-Linux.x86_64.64_cc/test'
> > make: *** [check-recursive] Error 1
> > tyr openmpi-dev-557-g01a24c4-Linux.x86_64.64_cc 133 dtmail_ssh &
> > [1] 17531
> > tyr openmpi-dev-557-g01a24c4-Linux.x86_64.64_cc 134 libSDtMail: Warning:
> Xt Warning: Missing charsets in
> > String to FontSet conversion
> > libSDtMail: Warning: Xt Warning: Cannot convert string "-dt-interface
> > user-medium-r-normal-s*-*-*-*-*-*-*-*-*" to type FontSet
> >
> > tyr openmpi-dev-557-g01a24c4-Linux.x86_64.64_cc 134
> >
> >
> >
> > Linux, gcc-4.9.2:
> > =
> > ...
> >  CC   opal_lifo.o
> >  CCLD opal_lifo
> >  CC   opal_fifo.o
> >  CCLD opal_fifo
> > make[3]: Leaving directory
> >
> 

Re: [OMPI users] shmalloc error with >=512 mb

2014-11-17 Thread Mike Dubman
Hi,
The default memheap size is 256 MB; you can override it with oshrun -x
SHMEM_SYMMETRIC_HEAP_SIZE=512M ...
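
For the test program below, something like this should let the 512 MB shmalloc
succeed (1G is just a value with some headroom; pick whatever fits your node):

  oshrun -np 1 -x SHMEM_SYMMETRIC_HEAP_SIZE=1G ./example_shmem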

On Mon, Nov 17, 2014 at 3:38 PM, Timur Ismagilov  wrote:

> Hello!
> Why does shmalloc return NULL when I try to allocate 512 MB?
> When I try to allocate 256 MB, all is fine.
> I use Open MPI/SHMEM v1.8.4 rc1 (v1.8.3-202-gb568b6e).
>
> programm:
>
> #include <stdio.h>
>
> #include <shmem.h>
>
> int main(int argc, char **argv)
> {
> int *src;
> start_pes(0);
>
> int length = 1024*1024*512;
> src = (int*) shmalloc(length);
>   if (src == NULL) {
> printf("can not allocate src: size = %dMb\n ", length/(1024*1024));
>   }
> return 0;
> }
>
> command:
>
> $oshrun -np 1 ./example_shmem
> can not allocate src: size = 512Mb
>
> How can I increase the shmalloc memory size?
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/11/25821.php
>



-- 

Kind Regards,

M.


Re: [OMPI users] Building on a host with a shoddy OpenFabrics installation

2014-10-11 Thread Mike Dubman
Hi,
yep - you can compile OFED/MOFED in the $HOME/ofed dir and point OMPI
configure to it with "--with-verbs=/path/to/ofed/install".

You can download and compile only the "libibverbs", "libibumad", "libibmad",
"librdmacm", "opensm", and "infiniband-diags" packages, installing them under
a custom prefix.
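
A minimal sketch of the idea (package versions and paths are placeholders):

  # build each userspace package into $HOME/ofed
  cd libibverbs-x.y.z
  ./configure --prefix=$HOME/ofed && make && make install
  # ... repeat for libibumad, librdmacm, etc.

  # then point Open MPI at that prefix
  cd openmpi-1.8.x
  ./configure --with-verbs=$HOME/ofed --prefix=$HOME/openmpi
  make && make install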

M

On Fri, Oct 10, 2014 at 11:24 PM, Gary Jackson  wrote:

>
> I'm trying to build OpenMPI on a cluster that has InfiniBand adapters and
> at least enough of OpenFabrics to have working IPoIB. But some things are
> missing, like infiniband/verbs.h. Is it possible to build OpenMPI with a
> working openib btl on a host like this? For instance, can I do a partial
> installation of the necessary libraries and headers of OFED in to my home
> directory so I can get this working? I do not have root authority on these
> hosts, so doing a correct installation of OFED is sadly not a possibility.
>
> --
> Gary
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: http://www.open-mpi.org/community/lists/users/2014/10/
> 25486.php
>



-- 

Kind Regards,

M.


Re: [OMPI users] long initialization

2014-08-22 Thread Mike Dubman
Hi,
The default delimiter is ";". You can change the delimiter with the
mca_base_env_list_delimiter MCA parameter.
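
For example, with the default ";" delimiter, the command from your mail would
look roughly like this:

  mpirun --mca mca_base_env_list 'MXM_SHM_KCOPY_MODE=off;OMP_NUM_THREADS=8' \
         --map-by slot:pe=8 -np 1 ./hello_c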



On Fri, Aug 22, 2014 at 2:59 PM, Timur Ismagilov <tismagi...@mail.ru> wrote:

> Hello!
> If i use latest night snapshot:
>
> $ ompi_info -V
> Open MPI v1.9a1r32570
>
>1. In programm hello_c initialization takes ~1 min
>In ompi 1.8.2rc4 and ealier it takes ~1 sec(or less)
>2. if i use
>$mpirun  --mca mca_base_env_list
>'MXM_SHM_KCOPY_MODE=off,OMP_NUM_THREADS=8' --map-by slot:pe=8 -np 1
>./hello_c
>i got error
>config_parser.c:657  MXM  ERROR Invalid value for SHM_KCOPY_MODE:
>'off,OMP_NUM_THREADS=8'. Expected: [off|knem|cma|autodetect]
>but with -x all works fine (but with warn)
>$mpirun  -x MXM_SHM_KCOPY_MODE=off -x OMP_NUM_THREADS=8 -np 1 ./hello_c
>
>WARNING: The mechanism by which environment variables are explicitly
>..
>..
>..
>Hello, world, I am 0 of 1, (Open MPI v1.9a1, package: Open MPI
>semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570,
>Aug 21, 2014 (nightly snapshot tarball), 146)
>
>
> Thu, 21 Aug 2014 06:26:13 -0700 от Ralph Castain <r...@open-mpi.org>:
>
>   Not sure I understand. The problem has been fixed in both the trunk and
> the 1.8 branch now, so you should be able to work with either of those
> nightly builds.
>
> On Aug 21, 2014, at 12:02 AM, Timur Ismagilov <tismagi...@mail.ru
> > wrote:
>
> Have i I any opportunity to run mpi jobs?
>
>
> Wed, 20 Aug 2014 10:48:38 -0700 от Ralph Castain <r...@open-mpi.org
> >:
>
> yes, i know - it is cmr'd
>
> On Aug 20, 2014, at 10:26 AM, Mike Dubman <mi...@dev.mellanox.co.il>
> wrote:
>
> btw, we get same error in v1.8 branch as well.
>
>
> On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
> It was not yet fixed - but should be now.
>
> On Aug 20, 2014, at 6:39 AM, Timur Ismagilov <tismagi...@mail.ru> wrote:
>
> Hello!
>
> As i can see, the bug is fixed, but in Open MPI v1.9a1r32516  i still have
> the problem
>
> a)
> $ mpirun  -np 1 ./hello_c
>
> --
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --
>
> b)
> $ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
> --
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --
>
> c)
>
> $ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca
> plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1
> ./hello_c
>
> [compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
> [compiler-2:14673] mca:base:select:( plm) Query of component [isolated]
> set priority to 0
> [compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
> [compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set
> priority to 10
> [compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
> [compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set
> priority to 75
> [compiler-2:14673] mca:base:select:( plm) Selected component [slurm]
> [compiler-2:14673] mca: base: components_register: registering oob
> components
> [compiler-2:14673] mca: base: components_register: found loaded component
> tcp
> [compiler-2:14673] mca: base: components_register: component tcp register
> function successful
> [compiler-2:14673] mca: base: components_open: opening oob components
> [compiler-2:14673] mca: base: components_open: found loaded component tcp
> [compiler-2:14673] mca: base: components_open: component tcp open function
> successful
> [compiler-2:14673] mca:oob:select: checking available component tcp
> [compiler-2:14673] mca:oob:select: Querying component [tcp]
> [compiler-2:14673] oob:tcp: comp

Re: [OMPI users] Clarification about OpenMPI, slurm and PMI interface

2014-08-21 Thread Mike Dubman
Hi Filippo,

I think you can use SLURM_LOCALID var (at least with slurm v14.03.4-2)

$srun -N2 --ntasks-per-node 3  env |grep SLURM_LOCALID
SLURM_LOCALID=1
SLURM_LOCALID=2
SLURM_LOCALID=0
SLURM_LOCALID=0
SLURM_LOCALID=1
SLURM_LOCALID=2
$
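
As a sketch (untested; assumes one GPU per local rank), your wrapper script
could simply fall back to it when the OMPI variable is not set:

  lrank=${OMPI_COMM_WORLD_LOCAL_RANK:-$SLURM_LOCALID}
  export CUDA_VISIBLE_DEVICES=$lrank
  exec "$@"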

Kind Regards,
M


On Thu, Aug 21, 2014 at 9:27 PM, Ralph Castain  wrote:

>
> On Aug 21, 2014, at 10:58 AM, Filippo Spiga 
> wrote:
>
> Dear Ralph
>
> On Aug 21, 2014, at 2:30 PM, Ralph Castain  wrote:
>
> I'm afraid that none of the mapping or binding options would be available
> under srun as those only work via mpirun. You can pass MCA params in the
> environment of course, or in default MCA param files.
>
>
> I understand. I hopefully be able to still pass the LAMA mca options as
> environment variables
>
>
> I'm afraid not - LAMA doesn't exist in Slurm, only in mpirun itself
>
> I fear by default srun completely takes over the process binding.
>
>
> I got another problem. On my cluster I have two GPU and two Ivy Bridge
> processors. To maximize the PCIe bandwidth I want to allocate GPU 0 to
> socket 0 and GPU 1 to socket 1. I use a script like this
>
> #!/bin/bash
> lrank=$OMPI_COMM_WORLD_LOCAL_RANK
> case ${lrank} in
> 0)
>  export CUDA_VISIBLE_DEVICES=0
>  "$@"
> ;;
> 1)
>  export CUDA_VISIBLE_DEVICES=1
>  "$@"
> ;;
> esac
>
>
> But OMP_COMM_WORLD_LOCAL_RANK is not defined is I use srun with PMI2 as
> luncher. Is there any equivalent option/environment variable that will help
> me achieve the same result?
>
>
> I'm afraid not - that's something we added. I'm unaware of any similar
> envar from Slurm, I'm afraid
>
>
>
> Thanks in advance!
> F
>
> --
> Mr. Filippo SPIGA, M.Sc.
> http://filippospiga.info ~ skype: filippo.spiga
>
> «Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
>
> *
> Disclaimer: "Please note this message and any attachments are CONFIDENTIAL
> and may be privileged or otherwise protected from disclosure. The contents
> are not to be disclosed to anyone other than the addressee. Unauthorized
> recipients are requested to preserve this confidentiality and to advise the
> sender immediately of any error in transmission."
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25119.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25120.php
>


Re: [OMPI users] ORTE daemon has unexpectedly failed after launch

2014-08-20 Thread Mike Dubman
btw, we get same error in v1.8 branch as well.


On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain  wrote:

> It was not yet fixed - but should be now.
>
> On Aug 20, 2014, at 6:39 AM, Timur Ismagilov  wrote:
>
> Hello!
>
> As i can see, the bug is fixed, but in Open MPI v1.9a1r32516  i still have
> the problem
>
> a)
> $ mpirun  -np 1 ./hello_c
>
> --
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --
>
> b)
> $ mpirun --mca oob_tcp_if_include ib0 -np 1 ./hello_c
> --
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --
>
> c)
>
> $ mpirun --mca oob_tcp_if_include ib0 -debug-daemons --mca
> plm_base_verbose 5 -mca oob_base_verbose 10 -mca rml_base_verbose 10 -np 1
> ./hello_c
>
> [compiler-2:14673] mca:base:select:( plm) Querying component [isolated]
> [compiler-2:14673] mca:base:select:( plm) Query of component [isolated]
> set priority to 0
> [compiler-2:14673] mca:base:select:( plm) Querying component [rsh]
> [compiler-2:14673] mca:base:select:( plm) Query of component [rsh] set
> priority to 10
> [compiler-2:14673] mca:base:select:( plm) Querying component [slurm]
> [compiler-2:14673] mca:base:select:( plm) Query of component [slurm] set
> priority to 75
> [compiler-2:14673] mca:base:select:( plm) Selected component [slurm]
> [compiler-2:14673] mca: base: components_register: registering oob
> components
> [compiler-2:14673] mca: base: components_register: found loaded component
> tcp
> [compiler-2:14673] mca: base: components_register: component tcp register
> function successful
> [compiler-2:14673] mca: base: components_open: opening oob components
> [compiler-2:14673] mca: base: components_open: found loaded component tcp
> [compiler-2:14673] mca: base: components_open: component tcp open function
> successful
> [compiler-2:14673] mca:oob:select: checking available component tcp
> [compiler-2:14673] mca:oob:select: Querying component [tcp]
> [compiler-2:14673] oob:tcp: component_available called
> [compiler-2:14673] WORKING INTERFACE 1 KERNEL INDEX 1 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 2 KERNEL INDEX 3 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 3 KERNEL INDEX 4 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 4 KERNEL INDEX 5 FAMILY: V4
> [compiler-2:14673] WORKING INTERFACE 5 KERNEL INDEX 6 FAMILY: V4
> [compiler-2:14673] [[49095,0],0] oob:tcp:init adding 10.128.0.4 to our
> list of V4 connections
> [compiler-2:14673] WORKING INTERFACE 6 KERNEL INDEX 7 FAMILY: V4
> [compiler-2:14673] [[49095,0],0] TCP STARTUP
> [compiler-2:14673] [[49095,0],0] attempting to bind to IPv4 port 0
> [compiler-2:14673] [[49095,0],0] assigned IPv4 port 59460
> [compiler-2:14673] mca:oob:select: Adding component to end
> [compiler-2:14673] mca:oob:select: Found 1 active transports
> [compiler-2:14673] mca: base: components_register: registering rml
> components
> [compiler-2:14673] mca: base: components_register: found loaded component
> oob
> [compiler-2:14673] mca: base: components_register: component oob has no
> register or open function
> [compiler-2:14673] mca: base: components_open: opening rml components
> [compiler-2:14673] mca: base: components_open: found loaded component oob
> [compiler-2:14673] mca: base: components_open: component oob open function
> successful
> [compiler-2:14673] orte_rml_base_select: initializing rml component oob
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 30 for
> peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 15 for
> peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 32 for
> peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting persistent recv on tag 33 for
> peer [[WILDCARD],WILDCARD]
> [compiler-2:14673] [[49095,0],0] posting recv
> [compiler-2:14673] [[49095,0],0] posting 

Re: [OMPI users] No log_num_mtt in Ubuntu 14.04

2014-08-19 Thread Mike Dubman
So, it seems you have an old OFED without this parameter.
Can you install the latest Mellanox OFED, or check which community OFED has it?
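
To double-check on your side, something like this should show whether the
driver exposes the parameter at all (module name taken from your output):

  modinfo mlx4_core | grep -i log_num_mtt
  ls /sys/module/mlx4_core/parameters/ | grep -i mtt
  ofed_info -s 2>/dev/null || echo "no separate OFED install found"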


On Tue, Aug 19, 2014 at 9:34 AM, Rio Yokota  wrote:

> Here is what "modinfo mlx4_core" gives
>
> filename:
>   
> /lib/modules/3.13.0-34-generic/kernel/drivers/net/ethernet/mellanox/mlx4/mlx4_core.ko
> version:2.2-1
> license:Dual BSD/GPL
> description:Mellanox ConnectX HCA low-level driver
> author: Roland Dreier
> srcversion: 3AE29A0A6538EBBE9227361
> alias:  pci:v15B3d1010sv*sd*bc*sc*i*
> alias:  pci:v15B3d100Fsv*sd*bc*sc*i*
> alias:  pci:v15B3d100Esv*sd*bc*sc*i*
> alias:  pci:v15B3d100Dsv*sd*bc*sc*i*
> alias:  pci:v15B3d100Csv*sd*bc*sc*i*
> alias:  pci:v15B3d100Bsv*sd*bc*sc*i*
> alias:  pci:v15B3d100Asv*sd*bc*sc*i*
> alias:  pci:v15B3d1009sv*sd*bc*sc*i*
> alias:  pci:v15B3d1008sv*sd*bc*sc*i*
> alias:  pci:v15B3d1007sv*sd*bc*sc*i*
> alias:  pci:v15B3d1006sv*sd*bc*sc*i*
> alias:  pci:v15B3d1005sv*sd*bc*sc*i*
> alias:  pci:v15B3d1004sv*sd*bc*sc*i*
> alias:  pci:v15B3d1003sv*sd*bc*sc*i*
> alias:  pci:v15B3d1002sv*sd*bc*sc*i*
> alias:  pci:v15B3d676Esv*sd*bc*sc*i*
> alias:  pci:v15B3d6746sv*sd*bc*sc*i*
> alias:  pci:v15B3d6764sv*sd*bc*sc*i*
> alias:  pci:v15B3d675Asv*sd*bc*sc*i*
> alias:  pci:v15B3d6372sv*sd*bc*sc*i*
> alias:  pci:v15B3d6750sv*sd*bc*sc*i*
> alias:  pci:v15B3d6368sv*sd*bc*sc*i*
> alias:  pci:v15B3d673Csv*sd*bc*sc*i*
> alias:  pci:v15B3d6732sv*sd*bc*sc*i*
> alias:  pci:v15B3d6354sv*sd*bc*sc*i*
> alias:  pci:v15B3d634Asv*sd*bc*sc*i*
> alias:  pci:v15B3d6340sv*sd*bc*sc*i*
> depends:
> intree: Y
> vermagic:   3.13.0-34-generic SMP mod_unload modversions
> signer: Magrathea: Glacier signing key
> sig_key:50:0B:C5:C8:7D:4B:11:5C:F3:C1:50:4F:7A:92:E2:33:C6:14:3D:58
> sig_hashalgo:   sha512
> parm:   debug_level:Enable debug tracing if > 0 (int)
> parm:   msi_x:attempt to use MSI-X if nonzero (int)
> parm:   num_vfs:enable #num_vfs functions if num_vfs > 0
> num_vfs=port1,port2,port1+2 (array of byte)
> parm:   probe_vf:number of vfs to probe by pf driver (num_vfs > 0)
> probe_vf=port1,port2,port1+2 (array of byte)
> parm:   log_num_mgm_entry_size:log mgm size, that defines the num
> of qp per mcg, for example: 10 gives 248.range: 7 <= log_num_mgm_entry_size
> <= 12. To activate device managed flow steering when available, set to -1
> (int)
> parm:   enable_64b_cqe_eqe:Enable 64 byte CQEs/EQEs when the FW
> supports this (default: True) (bool)
> parm:   log_num_mac:Log2 max number of MACs per ETH port (1-7)
> (int)
> parm:   log_num_vlan:Log2 max number of VLANs per ETH port (0-7)
> (int)
> parm:   use_prio:Enable steering by VLAN priority on ETH ports
> (0/1, default 0) (bool)
> parm:   log_mtts_per_seg:Log2 number of MTT entries per segment
> (1-7) (int)
> parm:   port_type_array:Array of port types: HW_DEFAULT (0) is
> default 1 for IB, 2 for Ethernet (array of int)
> parm:   enable_qos:Enable Quality of Service support in the HCA
> (default: off) (bool)
> parm:   internal_err_reset:Reset device on internal errors if
> non-zero (default 1, in SRIOV mode default is 0) (int)
>
> most likely you installing old ofed which does not have this parameter:
>
> try:
>
> #modinfo mlx4_core
>
> and see if it is there.
> I would suggest install latest OFED or Mellanox OFED.
>
>
> On Mon, Aug 18, 2014 at 9:53 PM, Rio Yokota  wrote:
>
>> I get "ofed_info: command not found". Note that I don't install the
>> entire OFED, but do a component wise installation by doing "apt-get install
>> infiniband-diags ibutils ibverbs-utils libmlx4-dev" for the drivers and
>> utilities.
>>
>> Hi,
>> what ofed version do you use?
>> (ofed_info -s)
>>
>>
>> On Sun, Aug 17, 2014 at 7:16 PM, Rio Yokota  wrote:
>>
>>> I have recently upgraded from Ubuntu 12.04 to 14.04 and OpenMPI gives
>>> the following warning upon execution, which did not appear before the
>>> upgrade.
>>>
>>> WARNING: It appears that your OpenFabrics subsystem is configured to only
>>> allow registering part of your physical memory. This can cause MPI jobs
>>> to
>>> run with erratic performance, hang, and/or crash.
>>>
>>> Everything that I could find on google suggests to change log_num_mtt,
>>> but I cannot do this for the following reasons:
>>> 1. There is no log_num_mtt in /sys/module/mlx4_core/parameters/
>>> 2. Adding "options mlx4_core log_num_mtt=24" to
>>> /etc/modprobe.d/mlx4.conf 

Re: [OMPI users] No log_num_mtt in Ubuntu 14.04

2014-08-18 Thread Mike Dubman
Most likely you are running an old OFED which does not have this parameter.

try:

#modinfo mlx4_core

and see if it is there.
I would suggest installing the latest OFED or Mellanox OFED.


On Mon, Aug 18, 2014 at 9:53 PM, Rio Yokota  wrote:

> I get "ofed_info: command not found". Note that I don't install the entire
> OFED, but do a component wise installation by doing "apt-get install
> infiniband-diags ibutils ibverbs-utils libmlx4-dev" for the drivers and
> utilities.
>
> Hi,
> what ofed version do you use?
> (ofed_info -s)
>
>
> On Sun, Aug 17, 2014 at 7:16 PM, Rio Yokota  wrote:
>
>> I have recently upgraded from Ubuntu 12.04 to 14.04 and OpenMPI gives the
>> following warning upon execution, which did not appear before the upgrade.
>>
>> WARNING: It appears that your OpenFabrics subsystem is configured to only
>> allow registering part of your physical memory. This can cause MPI jobs to
>> run with erratic performance, hang, and/or crash.
>>
>> Everything that I could find on google suggests to change log_num_mtt,
>> but I cannot do this for the following reasons:
>> 1. There is no log_num_mtt in /sys/module/mlx4_core/parameters/
>> 2. Adding "options mlx4_core log_num_mtt=24" to /etc/modprobe.d/mlx4.conf
>> doesn't seem to change anything
>> 3. I am not sure how I can restart the driver because there is no
>> "/etc/init.d/openibd" file (I've rebooted the system but it didn't do
>> anything to create log_num_mtt)
>>
>> [Template information]
>> 1. OpenFabrics is from the Ubuntu distribution using "apt-get install
>> infiniband-diags ibutils ibverbs-utils libmlx4-dev"
>> 2. OS is Ubuntu 14.04 LTS
>> 3. Subnet manager is from the Ubuntu distribution using "apt-get install
>> opensm"
>> 4. Output of ibv_devinfo is:
>> hca_id: mlx4_0
>> transport:  InfiniBand (0)
>> fw_ver: 2.10.600
>> node_guid:  0002:c903:003d:52b0
>> sys_image_guid: 0002:c903:003d:52b3
>> vendor_id:  0x02c9
>> vendor_part_id: 4099
>> hw_ver: 0x0
>> board_id:   MT_1100120019
>> phys_port_cnt:  1
>> port:   1
>> state:  PORT_ACTIVE (4)
>> max_mtu:4096 (5)
>> active_mtu: 4096 (5)
>> sm_lid: 1
>> port_lid:   1
>> port_lmc:   0x00
>> link_layer: InfiniBand
>> 5. Output of ifconfig for IB is
>> ib0   Link encap:UNSPEC  HWaddr
>> 80-00-00-48-FE-80-00-00-00-00-00-00-00-00-00-00
>>   inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
>>   inet6 addr: fe80::202:c903:3d:52b1/64 Scope:Link
>>   UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
>>   RX packets:26 errors:0 dropped:0 overruns:0 frame:0
>>   TX packets:34 errors:0 dropped:16 overruns:0 carrier:0
>>   collisions:0 txqueuelen:256
>>   RX bytes:5843 (5.8 KB)  TX bytes:4324 (4.3 KB)
>> 6. ulimit -l is "unlimited"
>>
>> Thanks,
>> Rio
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2014/08/25048.php
>>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25049.php
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25062.php
>


Re: [OMPI users] mpi+openshmem hybrid

2014-08-14 Thread Mike Dubman
You can use hybrid mode.
The following code works for me with ompi 1.8.2:

#include <stdio.h>
#include <stdlib.h>
#include "shmem.h"
#include "mpi.h"

int main(int argc, char *argv[])
{
MPI_Init(&argc, &argv);
start_pes(0);

{
int version = 0;
int subversion = 0;
int num_proc = 0;
int my_proc = 0;
int comm_size = 0;
int comm_rank = 0;

MPI_Get_version(&version, &subversion);
fprintf(stdout, "MPI version: %d.%d\n", version, subversion);

num_proc = _num_pes();
my_proc = _my_pe();

fprintf(stdout, "PE#%d of %d\n", my_proc, num_proc);

MPI_Comm_size(MPI_COMM_WORLD, &comm_size);
MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);

fprintf(stdout, "Comm rank#%d of %d\n", comm_rank, comm_size);
}

return 0;
}
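
A possible way to build and run it (the binary name is arbitrary):

  oshcc hybrid.c -o hybrid
  oshrun -np 2 ./hybrid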



On Thu, Aug 14, 2014 at 11:05 AM, Timur Ismagilov 
wrote:

> Hello!
> I use Open MPI v1.9a132520.
>
> Can I use hybrid MPI+OpenSHMEM?
> Where can I read about it?
>
> I have some problems with a simple program:
>
> #include <stdio.h>
>
> #include "shmem.h"
> #include "mpi.h"
>
> int main(int argc, char* argv[])
> {
> int proc, nproc;
> int rank, size, len;
> char version[MPI_MAX_LIBRARY_VERSION_STRING];
>
> MPI_Init(&argc, &argv);
> start_pes(0);
> MPI_Finalize();
>
> return 0;
> }
>
> I compile with oshcc; with mpicc I got a compile error.
>
> 1. When i run this programm with mpirun/oshrun i got an output
>
> [1408002416.687274] [node1-130-01:26354:0] proto.c:64 MXM WARN mxm is
> destroyed but still has pending receive requests
> [1408002416.687604] [node1-130-01:26355:0] proto.c:64 MXM WARN mxm is
> destroyed but still has pending receive requests
>
> 2. If in programm, i use this code
> start_pes(0);
> MPI_Init(, );
> MPI_Finalize();
>
> i got an error:
>
> --
> Calling MPI_Init or MPI_Init_thread twice is erroneous.
> --
> [node1-130-01:26469] *** An error occurred in MPI_Init
> [node1-130-01:26469] *** reported by process [2397634561,140733193388033]
> [node1-130-01:26469] *** on communicator MPI_COMM_WORLD
> [node1-130-01:26469] *** MPI_ERR_OTHER: known error not in list
> [node1-130-01:26469] *** MPI_ERRORS_ARE_FATAL (processes in this
> communicator will now abort,
> [node1-130-01:26469] *** and potentially your MPI job)
> [node1-130-01:26468] [[36585,1],0] ORTE_ERROR_LOG: Not found in file
> routed_radix.c at line 395
> [node1-130-01:26469] [[36585,1],1] ORTE_ERROR_LOG: Not found in file
> routed_radix.c at line 395
> [compiler-2:02175] 1 more process has sent help message
> help-mpi-errors.txt / mpi_errors_are_fatal
> [compiler-2:02175] Set MCA parameter "orte_base_help_aggregate" to 0 to
> see all help / error messages
>
>
> --
> Calling MPI_Init or MPI_Init_thread twice is erroneous.
> --
> [node1-130-01:26469] *** An error occurred in MPI_Init
> [node1-130-01:26469] *** reported by process [2397634561,140733193388033]
> [node1-130-01:26469] *** on communicator MPI_COMM_WORLD
> [node1-130-01:26469] *** MPI_ERR_OTHER: known error not in list
> [node1-130-01:26469] *** MPI_ERRORS_ARE_FATAL (processes in this
> communicator will now abort,
> [node1-130-01:26469] ***and potentially your MPI job)
> [node1-130-01:26468] [[36585,1],0] ORTE_ERROR_LOG: Not found in file
> routed_radix.c at line 395
> [node1-130-01:26469] [[36585,1],1] ORTE_ERROR_LOG: Not found in file
> routed_radix.c at line 395
> [compiler-2:02175] 1 more process has sent help message
> help-mpi-errors.txt / mpi_errors_are_fatal
> [compiler-2:02175] Set MCA parameter "orte_base_help_aggregate" to 0 to
> see all help / error messages
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/08/25010.php
>


Re: [OMPI users] openib component not available

2014-07-24 Thread Mike Dubman
Hi,
The openib btl is not compatible with the "thread multiple" paradigm.
You need to use mxm (a library on top of verbs) for OMPI with threads.

mxm is part of MOFED or you can download HPCX package (tarball of ompi +
mxm) from http://mellanox.com/products/hpcx
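
As a minimal sketch of the thread-multiple + mxm combination (the code and the
mpirun line are illustrative, not the only way to do it):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int provided = MPI_THREAD_SINGLE;
      /* ask for full multi-threading support */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
      if (provided < MPI_THREAD_MULTIPLE)
          printf("MPI_THREAD_MULTIPLE not granted (got %d)\n", provided);
      MPI_Finalize();
      return 0;
  }

and run with something like:

  mpirun -np 2 --mca pml cm --mca mtl mxm ./a.out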

M


On Thu, Jul 24, 2014 at 1:06 PM, madhurima madhunapanthula <
erankima...@gmail.com> wrote:

>
> Hi,
>
>
> I am trying to setup openmpi 1.8.1 on a system with infiniband.
>
> The configure option I am using is
>
> ./configure  --enable-mpi-thread-multiple
>
>
> After installation,  it is not showing any openib BTL component when I use
> the command:
> 'ompi_info --param btl openib'
>
>
> On the same machine, I have also installed openmpi 1.6.5. When i issue the
> command 'ompi_info --param btl openib' from this setup, it lists many
> openib components.
>
> Should I use any flag/option to enable openib in openmpi1.8.1?
>
>
>
> --
> Lokah samasta sukhinobhavanthu
>
> Thanks,
> Madhurima
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/07/24861.php
>


Re: [OMPI users] Salloc and mpirun problem

2014-07-16 Thread Mike Dubman
Please add the following flags to mpirun: "--mca plm_base_verbose 10
--debug-daemons", and attach the output.
Thx


On Wed, Jul 16, 2014 at 11:12 AM, Timur Ismagilov 
wrote:

> Hello!
> I have Open MPI v1.9a1r32142 and slurm 2.5.6.
>
> I can not use mpirun after salloc:
>
> $salloc -N2 --exclusive -p test -J ompi
> $LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so
> mpirun -np 1 hello_c
>
> -
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
>
> --
> But if i use mpirun in sbutch script it looks correct:
> $cat ompi_mxm3.0
> #!/bin/sh
> LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so
> mpirun  -x LD_PRELOAD -x MXM_SHM_KCOPY_MODE=off --map-by slot:pe=8 "$@"
>
> $sbatch -N2  --exclusive -p test -J ompi  ompi_mxm3.0 ./hello_c
> Submitted batch job 645039
> $cat slurm-645039.out
> [warn] Epoll ADD(1) on fd 0 failed.  Old events were 0; read change was 1
> (add); write change was 0 (none): Operation not permitted
> [warn] Epoll ADD(4) on fd 1 failed.  Old events were 0; read change was 0
> (none); write change was 1 (add): Operation not permitted
> Hello, world, I am 0 of 2, (Open MPI v1.9a1, package: Open MPI
> semenov@compiler-2 Distribution, ident: 1.9a1r32142, repo rev: r32142,
> Jul 04, 2014 (nightly snapshot tarball), 146)
> Hello, world, I am 1 of 2, (Open MPI v1.9a1, package: Open MPI
> semenov@compiler-2 Distribution, ident: 1.9a1r32142, repo rev: r32142,
> Jul 04, 2014 (nightly snapshot tarball), 146)
>
> Regards,
> Timur
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/07/24777.php
>


Re: [OMPI users] poor performance using the openib btl

2014-06-25 Thread Mike Dubman
Hi
what ofed/mofed are you using? what HCA, distro and command line?
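
If it helps, the usual commands to collect that information are roughly:

  ofed_info -s
  ibv_devinfo
  ibstat
  cat /etc/*release
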
M


On Wed, Jun 25, 2014 at 1:40 AM, Maxime Boissonneault <
maxime.boissonnea...@calculquebec.ca> wrote:

>  What are your threading options for OpenMPI (when it was built) ?
>
> I have seen OpenIB BTL completely lock when some level of threading is
> enabled before.
>
> Maxime Boissonneault
>
>
> Le 2014-06-24 18:18, Fischer, Greg A. a écrit :
>
>  Hello openmpi-users,
>
>
>
> A few weeks ago, I posted to the list about difficulties I was having
> getting openib to work with Torque (see “openib segfaults with Torque”,
> June 6, 2014). The issues were related to Torque imposing restrictive
> limits on locked memory, and have since been resolved.
>
>
>
> However, now that I’ve had some time to test the applications, I’m seeing
> abysmal performance over the openib layer. Applications run with the tcp
> btl execute about 10x faster than with the openib btl. Clearly something
> still isn’t quite right.
>
>
>
> I tried running with “-mca btl_openib_verbose 1”, but didn’t see anything
> resembling a smoking gun. How should I go about determining the source of
> the problem? (This uses the same OpenMPI Version 1.8.1 / SLES11 SP3 / GCC
> 4.8.3 setup discussed previously.)
>
>
>
> Thanks,
>
> Greg
>
>
> ___
> users mailing listus...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2014/06/24697.php
>
>
>
> --
> -
> Maxime Boissonneault
> Analyste de calcul - Calcul Québec, Université Laval
> Ph. D. en physique
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/06/24698.php
>


Re: [OMPI users] [warn] Epoll ADD(1) on fd 0 failed

2014-06-10 Thread Mike Dubman
btw, the output comes from ompi's libevent and not from slurm itself (sorry
about the confusion, and thanks to Yossi for catching this)


opal/mca/event/libevent2021/libevent/epoll.c:
event_warn("Epoll %s(%d) on fd %d failed.  Old events were %d; read change
was %d (%s); write change was %d (%s)",
opal/mca/event/libevent2021/libevent/epoll.c:
event_debug(("Epoll %s(%d) on fd %d okay. [old events were %d; read change
was %d; write change was %d]",



On Fri, Jun 6, 2014 at 3:38 PM, Ralph Castain  wrote:

> Possible - honestly don't know
>
> On Jun 6, 2014, at 12:16 AM, Timur Ismagilov  wrote:
>
> Sometimes,  after termination of the program, launched with the command
> "sbatch ... -o myprogram.out .",  no file "myprogram.out"  is being
> produced. Could this be due to the above mentioned problem?
>
>
> Thu, 5 Jun 2014 07:45:01 -0700 от Ralph Castain :
>
> FWIW: support for the --resv-ports option was deprecated and removed on
> the OMPI side a long time ago.
>
> I'm not familiar enough with "oshrun" to know if it is doing anything
> unusual - I believe it is just a renaming of our usual "mpirun". I suspect
> this is some interaction with sbatch, but I'll take a look. I haven't see
> that warning. Mike indicated he thought it is due to both slurm and OMPI
> trying to control stdin/stdout, in which case it shouldn't be happening but
> you can safely ignore it
>
>
> On Jun 5, 2014, at 3:04 AM, Timur Ismagilov  wrote:
>
> I use cmd line
>
> $sbatch -p test --exclusive -N 2 -o hello_oshmem.out -e hello_oshmem.err
> shrun_mxm3.0 ./hello_oshmem
>
> where script shrun_mxm3.0:
>
> $cat shrun_mxm3.0
>
>   #!/bin/sh
>
>   #srun --resv-ports "$@"
>   #exit $?
>
>   [ x"$TMPDIR" == x"" ] && TMPDIR=/tmp
>   HOSTFILE=${TMPDIR}/hostfile.${SLURM_JOB_ID}
>   srun hostname -s|sort|uniq -c|awk '{print $2" slots="$1}' > $HOSTFILE ||
> { rm -f $HOSTFILE; exit 255; }
>
> LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so
> oshrun -x LD_PRELOAD -x MXM_SHM_KCOPY_MODE=off --hostfile $HOSTFILE "$@"
>
>   rc=$?
>   rm -f $HOSTFILE
>
>   exit $rc
>
> I configured openmpi using
>
> ./configure CC=icc CXX=icpc F77=ifort FC=ifort
> --prefix=/mnt/data/users/dm2/vol3/semenov/_scratch/openmpi-1.8.1_mxm-3.0
> --with-mxm=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/ --with-
>slurm --with-platform=contrib/platform/mellanox/optimized
>
>
> Fri, 30 May 2014 07:09:54 -0700 от Ralph Castain :
>
> Can you pass along the cmd line that generated that output, and how OMPI
> was configured?
>
> On May 30, 2014, at 5:11 AM, Тимур Исмагилов  wrote:
>
> Hello!
>
> I am using Open MPI v1.8.1 and slurm 2.5.6.
>
> I got this messages when i try to run example (hello_oshmem.cpp) program:
>
> [warn] Epoll ADD(1) on fd 0 failed. Old events were 0; read change was 1
> (add); write change was 0 (none): Operation not permitted
> [warn] Epoll ADD(4) on fd 1 failed. Old events were 0; read change was 0
> (none); write change was 1 (add): Operation not permitted
> Hello, world, I am 0 of 2
> Hello, world, I am 1 of 2
>
> What does this warnings mean?
>
> I lunch this job using sbatch and mpirun with hostfile (got it from :
>  $srun hostname -s|sort|uniq -c|awk '{print $2" slots="$1}' > $HOSTFILE)
>
> Regards,
> Timur
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] OPENIB unknown transport errors

2014-06-07 Thread Mike Dubman
could you please attach output of "ibv_devinfo -v" and "ofed_info -s"
Thx


On Sat, Jun 7, 2014 at 12:53 AM, Tim Miller  wrote:

> Hi Josh,
>
> I asked one of our more advanced users to add the "-mca btl_openib_if_include
> mlx4_0:1" argument to his job script. Unfortunately, the same error
> occurred as before.
>
> We'll keep digging on our end; if you have any other suggestions, please
> let us know.
>
> Tim
>
>
> On Thu, Jun 5, 2014 at 7:32 PM, Tim Miller  wrote:
>
>> Hi Josh,
>>
>> Thanks for attempting to sort this out. In answer to your questions:
>>
>> 1. Node allocation is done by TORQUE, however we don't use the TM API to
>> launch jobs (long story). Instead, we just pass a hostfile to mpirun, and
>> mpirun uses the ssh launcher to actually communicate and launch the
>> processes on remote nodes.
>> 2. We have only one port per HCA (the HCA silicon is integrated with the
>> motherboard on most of our nodes, including all that have this issue). They
>> are all configured to use InfiniBand (no IPoIB or other protocols).
>> 3. No, we don't explicitly ask for a device port pair. We will try your
>> suggestion and report back.
>>
>> Thanks again!
>>
>> Tim
>>
>>
>> On Thu, Jun 5, 2014 at 2:22 PM, Joshua Ladd  wrote:
>>
>>> Strange indeed. This info (remote adapter info) is passed around in the
>>> modex and the struct is locally populated during add procs.
>>>
>>> 1. How do you launch jobs? Mpirun, srun, or something else?
>>> 2. How many active ports do you have on each HCA? Are they all
>>> configured to use IB?
>>> 3. Do you explicitly ask for a device:port pair with the "if include"
>>> mca param? If not, can you please add "-mca btl_openib_if_include mlx4_0:1"
>>> (assuming you have a ConnectX-3 HCA and port 1 is configured to run over
>>> IB.)
>>>
>>> Josh
>>>
>>>
>>> On Wed, Jun 4, 2014 at 12:47 PM, Tim Miller  wrote:
>>>
 Hi,

 I'd like to revive this thread, since I am still periodically getting
 errors of this type. I have built 1.8.1 with --enable-debug and run with
 -mca btl_openib_verbose 10. Unfortunately, this doesn't seem to provide any
 additional information that I can find useful. I've gone ahead and attached
 a dump of the output under 1.8.1. The key lines are:


 --
 Open MPI detected two different OpenFabrics transport types in the same
 Infiniband network.
 Such mixed network trasport configuration is not supported by Open MPI.

   Local host:w1
   Local adapter: mlx4_0 (vendor 0x2c9, part ID 26428)
   Local transport type:  MCA_BTL_OPENIB_TRANSPORT_IB

   Remote host:   w16
   Remote Adapter:(vendor 0x2c9, part ID 26428)
   Remote transport type: MCA_BTL_OPENIB_TRANSPORT_UNKNOWN

 -

 Note that the vendor and part IDs are the same. If I immediately run on
 the same two nodes using MVAPICH2, everything is fine.

 I'm really very befuddled by this. OpenMPI sees that the two cards are
 the same and made by the same vendor, yet it thinks the transport types are
 different (and one is unknown). I'm hoping someone with some experience
 with how the OpenIB BTL works can shed some light on this problem...

 Tim


 On Fri, May 9, 2014 at 7:39 PM, Joshua Ladd 
 wrote:

>
> Just wondering if you've tried with the latest stable OMPI, 1.8.1? I'm
> wondering if this is an issue with the OOB. If you have a debug build, you
> can run -mca btl_openib_verbose 10
>
> Josh
>
>
> On Fri, May 9, 2014 at 6:26 PM, Joshua Ladd 
> wrote:
>
>> Hi, Tim
>>
>> Run "ibstat" on each host:
>>
>> 1. Make sure the adapters are alive and active.
>>
>> 2. Look at the Link Layer settings for host w34. Does it match host
>> w4's?
>>
>>
>> Josh
>>
>>
>> On Fri, May 9, 2014 at 1:18 PM, Tim Miller 
>> wrote:
>>
>>> Hi All,
>>>
>>> We're using OpenMPI 1.7.3 with Mellanox ConnectX InfiniBand
>>> adapters, and periodically our jobs abort at start-up with the following
>>> error:
>>>
>>> ===
>>> Open MPI detected two different OpenFabrics transport types in the
>>> same Infiniband network.
>>> Such mixed network trasport configuration is not supported by Open
>>> MPI.
>>>
>>>   Local host:w4
>>>   Local adapter: mlx4_0 (vendor 0x2c9, part ID 26428)
>>>   Local transport type:  MCA_BTL_OPENIB_TRANSPORT_IB
>>>
>>>   Remote host:   w34
>>>   Remote Adapter:(vendor 0x2c9, part ID 26428)
>>>   Remote transport type: 

Re: [OMPI users] spml_ikrit_np random values

2014-06-06 Thread Mike Dubman
fixed here: https://svn.open-mpi.org/trac/ompi/changeset/31962

Thanks for report.


On Thu, Jun 5, 2014 at 7:45 PM, Mike Dubman <mi...@dev.mellanox.co.il>
wrote:

> seems oshmem_info uses uninitialized value.
> we will check it, thanks for report.
>
>
> On Thu, Jun 5, 2014 at 6:56 PM, Timur Ismagilov <tismagi...@mail.ru>
> wrote:
>
>> Hello!
>>
>> I am using Open MPI v1.8.1.
>>
>> $oshmem_info -a --parsable | grep spml_ikrit_np
>>
>> mca:spml:ikrit:param:spml_ikrit_np:value:1620524368  (alwase new value)
>> mca:spml:ikrit:param:spml_ikrit_np:source:default
>> mca:spml:ikrit:param:spml_ikrit_np:status:writeable
>> mca:spml:ikrit:param:spml_ikrit_np:level:9
>> mca:spml:ikrit:param:spml_ikrit_np:help:[integer] Minimal allowed job's
>> NP to activate ikrit
>> mca:spml:ikrit:param:spml_ikrit_np:deprecated:no
>> mca:spml:ikrit:param:spml_ikrit_np:type:int
>> mca:spml:ikrit:param:spml_ikrit_np:disabled:false
>>
>> why spml_ikrit_np gets a new value each time?
>> Regards,
>> Timur
>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>


Re: [OMPI users] Problem with yoda component in oshmem.

2014-06-06 Thread Mike Dubman
Could you please provide the command line?


On Fri, Jun 6, 2014 at 10:56 AM, Timur Ismagilov  wrote:

> Hello!
>
> I am using Open MPI v1.8.1 in
> example program hello_oshmem.cpp.
>
> When I put  spml_ikrit_np = 1000 (more than 4) and run task on 4 (2,1)
> nodes, I get an:
> in out file:
> No available spml components were found!
>
> This means that there are no components of this type installed on your
> system or all the components reported that they could not be used.
>
> This is a fatal error; your SHMEM process is likely to abort. Check the
> output of the "ompi_info" command and ensure that components of this
> type are available on your system. You may also wish to check the
> value of the "component_path" MCA parameter and ensure that it has at
> least one directory that contains valid MCA components
>
> in err file:
> [node1-128-31:05405] SPML ikrit cannot be selected
>
> Regards,
>
> Timur
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] spml_ikrit_np random values

2014-06-05 Thread Mike Dubman
It seems oshmem_info uses an uninitialized value.
We will check it; thanks for the report.


On Thu, Jun 5, 2014 at 6:56 PM, Timur Ismagilov  wrote:

> Hello!
>
> I am using Open MPI v1.8.1.
>
> $oshmem_info -a --parsable | grep spml_ikrit_np
>
> mca:spml:ikrit:param:spml_ikrit_np:value:1620524368  (alwase new value)
> mca:spml:ikrit:param:spml_ikrit_np:source:default
> mca:spml:ikrit:param:spml_ikrit_np:status:writeable
> mca:spml:ikrit:param:spml_ikrit_np:level:9
> mca:spml:ikrit:param:spml_ikrit_np:help:[integer] Minimal allowed job's NP
> to activate ikrit
> mca:spml:ikrit:param:spml_ikrit_np:deprecated:no
> mca:spml:ikrit:param:spml_ikrit_np:type:int
> mca:spml:ikrit:param:spml_ikrit_np:disabled:false
>
> why spml_ikrit_np gets a new value each time?
> Regards,
> Timur
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Deadly warning "Epoll ADD(4) on fd 2 failed." ?

2014-05-28 Thread Mike Dubman
I think it comes from PMI API used by OMPI/SLURM.
SLURM's libpmi is trying to control stdout/stdin, which is already
controlled by OMPI.


On Tue, May 27, 2014 at 8:31 PM, Ralph Castain  wrote:

> I'm unaware of any OMPI error message like that - might be caused by
> something in libevent as that could be using epoll, so it could be caused
> by us. However, I'm a little concerned about the use of the prerelease
> version of Slurm as we know that PMI is having some problems over there.
>
> So out of curiosity - how was this job launched? Via mpirun or directly
> using srun?
>
>
> On May 27, 2014, at 1:22 AM, Filippo Spiga 
> wrote:
>
> Dear all,
>
> I am using Open MPI v1.8.2 night snapshot compiled with SLURM support
> (version 14.03pre5). These two messages below appeared during a job of 2048
> MPI that died after 24 hours!
>
> [warn] Epoll ADD(1) on fd 0 failed.  Old events were 0; read change was 1
> (add); write change was 0 (none): Operation not permitted
>
> [warn] Epoll ADD(4) on fd 2 failed.  Old events were 0; read change was 0
> (none); write change was 1 (add): Operation not permitted
>
>
> The first one, appeared immediately at the beginning had no effect. The
> application started to compute and it successfully called a big parallel
> eigensolver. The second message appeared after 18~19 hours of non-stop
> computation and the application crashed without showing any other error
> message! Regularly I was checking that MPI processes were not stuck, after
> this message the processes were all aborted without dumping anything on
> stdout/stderr. It is quite weird.
>
> I believe these messages come from Open MPI (but correct me if I am
> wrong!). I am going to look at the application and the various libraries to
> find out if something is wrong. In the meanwhile it will be a great help if
> anyone can clarify the exact meaning of these warning messages.
>
> Many thanks in advance.
>
> Regards,
> Filippo
>
> --
> Mr. Filippo SPIGA, M.Sc.
> http://www.linkedin.com/in/filippospiga ~ skype: filippo.spiga
>
> «Nobody will drive us out of Cantor's paradise.» ~ David Hilbert
>
> *
> Disclaimer: "Please note this message and any attachments are CONFIDENTIAL
> and may be privileged or otherwise protected from disclosure. The contents
> are not to be disclosed to anyone other than the addressee. Unauthorized
> recipients are requested to preserve this confidentiality and to advise the
> sender immediately of any error in transmission."
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] no ikrit component of in oshmem

2014-04-23 Thread Mike Dubman
Hi Timur,

What "configure" line you used? ikrit could be compile-it if no
"--with-mxm=/opt/mellanox/mxm" was provided.
Can you please attach your config.log?
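
For reference, a configure line along these lines (the MXM path below is the
default MOFED location; adjust it to your install) should make the ikrit spml
show up:

  ./configure --with-mxm=/opt/mellanox/mxm --prefix=$HOME/ompi-1.8 ...
  make && make install
  oshmem_info | grep -i spml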

Thanks



On Wed, Apr 23, 2014 at 3:10 PM, Тимур Исмагилов  wrote:

> Hi!
> I am trying to build openmpi 1.8 with Open SHMEM and Mellanox MXM.
> But oshmem_info does not display the information about ikrit in spml.
>
> ...
> MCA scoll: mpi (MCA v2.0, API v1.0, Component v1.8)
> MCA spml: yoda (MCA v2.0, API v2.0, Component v1.8)
> MCA sshmem: mmap (MCA v2.0, API v2.0, Component v1.8)
> ...
>
> С уважением,
>
> tismagi...@mail.ru
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] probable bug in 1.9a1r31409

2014-04-16 Thread Mike Dubman
Hi,
I committed your patch to the trunk.
thanks
M


On Wed, Apr 16, 2014 at 6:49 PM, Mike Dubman <mi...@dev.mellanox.co.il>wrote:

> +1
> looks good.
>
>
> On Wed, Apr 16, 2014 at 4:35 PM, Åke Sandgren 
> <ake.sandg...@hpc2n.umu.se>wrote:
>
>> On 04/16/2014 02:25 PM, Åke Sandgren wrote:
>>
>>> Hi!
>>>
>>> Found this problem when building r31409 with Pathscale 5.0
>>>
>>> pshmem_barrier.c:81:6: error: redeclaration of 'pshmem_barrier_all' must
>>> have the 'overloadable' attribute
>>> void shmem_barrier_all(void)
>>>   ^
>>> ../../../../oshmem/shmem/c/profile/defines.h:193:37: note: expanded from
>>> macro 'shmem_barrier_all'
>>> #define shmem_barrier_all   pshmem_barrier_all
>>>  ^
>>> pshmem_barrier.c:78:14: note: previous overload of function is here
>>> #pragma weak shmem_barrier_all = pshmem_barrier_all
>>>   ^
>>> ../../../../oshmem/shmem/c/profile/defines.h:193:37: note: expanded from
>>> macro 'shmem_barrier_all'
>>> #define shmem_barrier_all   pshmem_barrier_all
>>>  ^
>>> pragma weak and define clashing...
>>>
>>
>>
>> Suggested patch attached (actually there were two similar cases...)
>>
>>
>>
>> --
>> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
>> Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
>> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>


Re: [OMPI users] probable bug in 1.9a1r31409

2014-04-16 Thread Mike Dubman
+1
looks good.


On Wed, Apr 16, 2014 at 4:35 PM, Åke Sandgren wrote:

> On 04/16/2014 02:25 PM, Åke Sandgren wrote:
>
>> Hi!
>>
>> Found this problem when building r31409 with Pathscale 5.0
>>
>> pshmem_barrier.c:81:6: error: redeclaration of 'pshmem_barrier_all' must
>> have the 'overloadable' attribute
>> void shmem_barrier_all(void)
>>   ^
>> ../../../../oshmem/shmem/c/profile/defines.h:193:37: note: expanded from
>> macro 'shmem_barrier_all'
>> #define shmem_barrier_all   pshmem_barrier_all
>>  ^
>> pshmem_barrier.c:78:14: note: previous overload of function is here
>> #pragma weak shmem_barrier_all = pshmem_barrier_all
>>   ^
>> ../../../../oshmem/shmem/c/profile/defines.h:193:37: note: expanded from
>> macro 'shmem_barrier_all'
>> #define shmem_barrier_all   pshmem_barrier_all
>>  ^
>> pragma weak and define clashing...
>>
>
>
> Suggested patch attached (actually there were two similar cases...)
>
>
>
> --
> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
> Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] one more finding in openmpi-1.7.5a1

2014-02-14 Thread Mike Dubman
Thanks for the prompt help.
Could you please resend the patch as an attachment that can be applied with the
"patch" command? My mail client messes up long lines.


On Fri, Feb 14, 2014 at 7:40 AM,  wrote:

>
>
> Thanks. I'm not familiar with mindist mapper. But obviously
> checking for ORTE_MAPPING_BYDIST is missing. In addition,
> ORTE_MAPPING_PPR is missing again by my mistake.
>
> Please try this patch.
>
> if OPAL_HAVE_HWLOC
> } else if (ORTE_MAPPING_BYCORE == ORTE_GET_MAPPING_POLICY
> (mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_CORE);
> } else if (ORTE_MAPPING_BYL1CACHE == ORTE_GET_MAPPING_POLICY
> (mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_L1CACHE);
> } else if (ORTE_MAPPING_BYL2CACHE == ORTE_GET_MAPPING_POLICY
> (mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_L2CACHE);
> } else if (ORTE_MAPPING_BYL3CACHE == ORTE_GET_MAPPING_POLICY
> (mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_L3CACHE);
> } else if (ORTE_MAPPING_BYSOCKET == ORTE_GET_MAPPING_POLICY
> (mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_SOCKET);
> } else if (ORTE_MAPPING_BYNUMA == ORTE_GET_MAPPING_POLICY
> (mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_NUMA);
> } else if (ORTE_MAPPING_BYBOARD == ORTE_GET_MAPPING_POLICY
> (mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_BOARD);
> } else if (ORTE_MAPPING_BYHWTHREAD == ORTE_GET_MAPPING_POLICY
> (mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_HWTHREAD);
> } else if (ORTE_MAPPING_PPR == ORTE_GET_MAPPING_POLICY
> (mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_SLOT);
> } else if (ORTE_MAPPING_BYDIST == ORTE_GET_MAPPING_POLICY
> (mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_SLOT);
> #endif
>
> Tetsuya Mishima
>
> > Hi,
> > after this patch we get this in jenkins:
> >
> > 07:03:15 [vegas12.mtr.labs.mlnx:01646] [[26922,0],0] ORTE_ERROR_LOG: Not
> > implemented in file rmaps_mindist_module.c at line 391
> > 07:03:15 [vegas12.mtr.labs.mlnx:01646] [[26922,0],0] ORTE_ERROR_LOG: Not
> > implemented in file base/rmaps_base_map_job.c at line 285
> >
> >
> >
> >
> >
> > On Fri, Feb 14, 2014 at 6:35 AM,  wrote:
> >
> >
> > Sorry, one more shot - byslot was dropped!
> >
> > if (NULL == spec) {
> > /* check for map-by object directives - we set the
> >  * ranking to match if one was given
> >  */
> > if (ORTE_MAPPING_GIVEN & ORTE_GET_MAPPING_DIRECTIVE(mapping)) {
> > if (ORTE_MAPPING_BYSLOT == ORTE_GET_MAPPING_POLICY(mapping))
> {
> > ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_SLOT);
> > } else if (ORTE_MAPPING_BYNODE == ORTE_GET_MAPPING_POLICY
> > (mapping)) {
> > ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_NODE);
> > #if OPAL_HAVE_HWLOC
> > } else if (ORTE_MAPPING_BYCORE == ORTE_GET_MAPPING_POLICY
> > (mapping)) {
> > ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_CORE);
> > } else if (ORTE_MAPPING_BYL1CACHE == ORTE_GET_MAPPING_POLICY
> > (mapping)) {
> > ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_L1CACHE);
> > } else if (ORTE_MAPPING_BYL2CACHE == ORTE_GET_MAPPING_POLICY
> > (mapping)) {
> > ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_L2CACHE);
> > } else if (ORTE_MAPPING_BYL3CACHE == ORTE_GET_MAPPING_POLICY
> > (mapping)) {
> > ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_L3CACHE);
> > } else if (ORTE_MAPPING_BYSOCKET == ORTE_GET_MAPPING_POLICY
> > (mapping)) {
> > ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_SOCKET);
> > } else if (ORTE_MAPPING_BYNUMA == ORTE_GET_MAPPING_POLICY
> > (mapping)) {
> > ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_NUMA);
> > } else if (ORTE_MAPPING_BYBOARD == ORTE_GET_MAPPING_POLICY
> > (mapping)) {
> > ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_BOARD);
> > } else if (ORTE_MAPPING_BYHWTHREAD == ORTE_GET_MAPPING_POLICY
> > (mapping)) {
> > ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_HWTHREAD);
> > #endif
> >
> > Tetusya Mishima
> >
> > > I've found it. Please add 2 lines(770, 771) in rmaps_base_frame.c:
> > >
> > > 747  if (NULL == spec) {
> > > 748  /* check for map-by object directives - we set the
> > > 749   * ranking to match if one was given
> > > 750   */
> > > 751  if (ORTE_MAPPING_GIVEN & ORTE_GET_MAPPING_DIRECTIVE
> > > (mapping)) {
> > > 752  if (ORTE_MAPPING_BYCORE == ORTE_GET_MAPPING_POLICY
> > > (mapping)) {
> > > 753  ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_CORE);
> > > 754  } else if 

Re: [OMPI users] one more finding in openmpi-1.7.5a1

2014-02-14 Thread Mike Dubman
Hi,
after this patch we get this in jenkins:

07:03:15 [vegas12.mtr.labs.mlnx:01646] [[26922,0],0] ORTE_ERROR_LOG: Not
implemented in file rmaps_mindist_module.c at line 391
07:03:15 [vegas12.mtr.labs.mlnx:01646] [[26922,0],0] ORTE_ERROR_LOG: Not
implemented in file base/rmaps_base_map_job.c at line 285





On Fri, Feb 14, 2014 at 6:35 AM,  wrote:

>
>
> Sorry, one more shot - byslot was dropped!
>
> if (NULL == spec) {
> /* check for map-by object directives - we set the
>  * ranking to match if one was given
>  */
> if (ORTE_MAPPING_GIVEN & ORTE_GET_MAPPING_DIRECTIVE(mapping)) {
> if (ORTE_MAPPING_BYSLOT == ORTE_GET_MAPPING_POLICY(mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_SLOT);
> } else if (ORTE_MAPPING_BYNODE == ORTE_GET_MAPPING_POLICY
> (mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_NODE);
> #if OPAL_HAVE_HWLOC
> } else if (ORTE_MAPPING_BYCORE == ORTE_GET_MAPPING_POLICY
> (mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_CORE);
> } else if (ORTE_MAPPING_BYL1CACHE == ORTE_GET_MAPPING_POLICY
> (mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_L1CACHE);
> } else if (ORTE_MAPPING_BYL2CACHE == ORTE_GET_MAPPING_POLICY
> (mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_L2CACHE);
> } else if (ORTE_MAPPING_BYL3CACHE == ORTE_GET_MAPPING_POLICY
> (mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_L3CACHE);
> } else if (ORTE_MAPPING_BYSOCKET == ORTE_GET_MAPPING_POLICY
> (mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_SOCKET);
> } else if (ORTE_MAPPING_BYNUMA == ORTE_GET_MAPPING_POLICY
> (mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_NUMA);
> } else if (ORTE_MAPPING_BYBOARD == ORTE_GET_MAPPING_POLICY
> (mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_BOARD);
> } else if (ORTE_MAPPING_BYHWTHREAD == ORTE_GET_MAPPING_POLICY
> (mapping)) {
> ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_HWTHREAD);
> #endif
>
> Tetusya Mishima
>
> > I've found it. Please add 2 lines(770, 771) in rmaps_base_frame.c:
> >
> > 747  if (NULL == spec) {
> > 748  /* check for map-by object directives - we set the
> > 749   * ranking to match if one was given
> > 750   */
> > 751  if (ORTE_MAPPING_GIVEN & ORTE_GET_MAPPING_DIRECTIVE
> > (mapping)) {
> > 752  if (ORTE_MAPPING_BYCORE == ORTE_GET_MAPPING_POLICY
> > (mapping)) {
> > 753  ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_CORE);
> > 754  } else if (ORTE_MAPPING_BYNODE ==
> > ORTE_GET_MAPPING_POLICY(mapping)) {
> > 755  ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_NODE);
> > 756  } else if (ORTE_MAPPING_BYL1CACHE ==
> > ORTE_GET_MAPPING_POLICY(mapping)) {
> > 757  ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_L1CACHE);
> > 758  } else if (ORTE_MAPPING_BYL2CACHE ==
> > ORTE_GET_MAPPING_POLICY(mapping)) {
> > 759  ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_L2CACHE);
> > 760  } else if (ORTE_MAPPING_BYL3CACHE ==
> > ORTE_GET_MAPPING_POLICY(mapping)) {
> > 761  ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_L3CACHE);
> > 762  } else if (ORTE_MAPPING_BYSOCKET ==
> > ORTE_GET_MAPPING_POLICY(mapping)) {
> > 763  ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_SOCKET);
> > 764  } else if (ORTE_MAPPING_BYNUMA ==
> > ORTE_GET_MAPPING_POLICY(mapping)) {
> > 765  ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_NUMA);
> > 766  } else if (ORTE_MAPPING_BYBOARD ==
> > ORTE_GET_MAPPING_POLICY(mapping)) {
> > 767  ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_BOARD);
> > 768  } else if (ORTE_MAPPING_BYHWTHREAD ==
> > ORTE_GET_MAPPING_POLICY(mapping)) {
> > 769  ORTE_SET_RANKING_POLICY(tmp,
> > ORTE_RANK_BY_HWTHREAD);
> > 770  } else if (ORTE_MAPPING_PPR == ORTE_GET_MAPPING_POLICY
> > (mapping)) {
> > 771  ORTE_SET_RANKING_POLICY(tmp, ORTE_RANK_BY_SLOT);
> > 772  }
> >
> > Tetsuya Mishima
> >
> > > You are welcome, Ralph.
> > >
> > > But, after fixing it, I'm facing another problem whin I use ppr option:
> > > [mishima@manage openmpi-1.7.4]$ mpirun -np 2 -map-by ppr:1:socket
> > -bind-to
> > > socket -report-bindings ~/mis/openmpi/demos/m
> > > yprog
> > > [manage.cluster:28057] [[25570,0],0] ORTE_ERROR_LOG: Not implemented in
> > > file rmaps_ppr.c at line 389
> > > [manage.cluster:28057] [[25570,0],0] ORTE_ERROR_LOG: Not implemented in
> > > file base/rmaps_base_map_job.c at line 285
> > >
> > > I confirmed it worked when it reverted back.
> > > I'm a little bit confused. Could you take a look?
> > >

Re: [OMPI users] Get your Open MPI schwag!

2013-10-25 Thread Mike Dubman
I see, it makes sense.

new proposal for back side:

no pictures, just slogan:

If Chuck says you need OpenMPI, you need OpenMPI.

-- certified by Chuck Norris.

 what do you say?



On Fri, Oct 25, 2013 at 12:40 PM, Ralph Castain <r...@open-mpi.org> wrote:

> I'm afraid that picture is copyrighted, Mike. While I enjoy the
> enthusiasm, I actually suspect we would get into trouble using Chuck
> Norris' name without first obtaining his permission.
>
> On Oct 25, 2013, at 2:28 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote:
>
> ok, so - here is a final proposal:
>
> front:
> small OMPI logo,  under it slogan: OpenMPI - Breaking the Barriers!
>
> back side:
>
> small medum picture of
> http://www.deviantart.com/art/chuck-norris-stamp-of-approval-184646710
>
> with text below:
>
> OpenMPI: Leading Parallel Computing Framework, since 2003...
>
>
> comments?
>
> too modest?
>
>
> On Thu, Oct 24, 2013 at 5:38 AM, Damien Hocking <dam...@khubla.com> wrote:
>
>> Heheheheh.
>>
>> Chuck Norris has zero latency and infinite bandwidth.
>> Chuck Norris is a hardware implementation only.  Software is for sissys.
>> Chuck Norris's version of MPI_IRecv just gives you the answer.
>> Chuck Norris has a 128-bit memory space.
>> Chuck Norris's Law says Chuck Norris gets twice as amazing every 18
>> months.
>> Chuck Norris is infinitely scalable.
>> MPI_COMM_WORLD is only a small part of Chuck Norris's mind.
>> Chuck Norris can power Exascale.  Twice.
>>
>> :-)
>>
>> Damien
>>
>>
>> On 23/10/2013 4:26 PM, Shamis, Pavel wrote:
>>
>>> +1 for Chuck Norris
>>>
>>> Pavel (Pasha) Shamis
>>> ---
>>> Computer Science Research Group
>>> Computer Science and Math Division
>>> Oak Ridge National Laboratory
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Oct 23, 2013, at 1:12 PM, Mike Dubman <mi...@dev.mellanox.co.il<**
>>> mailto:mi...@dev.mellanox.co.**il <mi...@dev.mellanox.co.il>>> wrote:
>>>
>>> maybe to add some nice/funny slogan on the front under the logo, and
>>> cool picture on the back.
>>> some of community members are still in early twenties (and counting) .
>>>  :)
>>>
>>> shall we open a contest for good slogan to put? and mid-size pict to put
>>> on the back side?
>>>
>>> - living the parallel world
>>> - iOMPI
>>> - OpenMPI - breaking the barriers!
>>> ...
>>> for the mid-sized back-side picture, I suggest chuck norris, you can`t
>>> beat it.
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Oct 23, 2013 at 7:48 PM, John Hearns <hear...@googlemail.com<**
>>> mailto:hear...@googlemail.com>**> wrote:
>>>
>>> OpenMPI aprons. Nice! Good to wear when cooking up those Chef recipes.
>>> (Did I really just say that...)
>>>
>>> __**_
>>> users mailing list
>>> us...@open-mpi.org<mailto:user**s...@open-mpi.org <us...@open-mpi.org>>
>>> http://www.open-mpi.org/**mailman/listinfo.cgi/users<http://www.open-mpi.org/mailman/listinfo.cgi/users>
>>>
>>> __**_
>>> users mailing list
>>> us...@open-mpi.org<mailto:user**s...@open-mpi.org <us...@open-mpi.org>>
>>> http://www.open-mpi.org/**mailman/listinfo.cgi/users<http://www.open-mpi.org/mailman/listinfo.cgi/users>
>>>
>>> __**_
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/**mailman/listinfo.cgi/users<http://www.open-mpi.org/mailman/listinfo.cgi/users>
>>>
>>
>> __**_
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/**mailman/listinfo.cgi/users<http://www.open-mpi.org/mailman/listinfo.cgi/users>
>>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Get your Open MPI schwag!

2013-10-25 Thread Mike Dubman
ok, so - here is a final proposal:

front:
small OMPI logo,  under it slogan: OpenMPI - Breaking the Barriers!

back side:

small medium picture of
http://www.deviantart.com/art/chuck-norris-stamp-of-approval-184646710

with text below:

OpenMPI: Leading Parallel Computing Framework, since 2003...


comments?

too modest?


On Thu, Oct 24, 2013 at 5:38 AM, Damien Hocking <dam...@khubla.com> wrote:

> Heheheheh.
>
> Chuck Norris has zero latency and infinite bandwidth.
> Chuck Norris is a hardware implementation only.  Software is for sissys.
> Chuck Norris's version of MPI_IRecv just gives you the answer.
> Chuck Norris has a 128-bit memory space.
> Chuck Norris's Law says Chuck Norris gets twice as amazing every 18 months.
> Chuck Norris is infinitely scalable.
> MPI_COMM_WORLD is only a small part of Chuck Norris's mind.
> Chuck Norris can power Exascale.  Twice.
>
> :-)
>
> Damien
>
>
> On 23/10/2013 4:26 PM, Shamis, Pavel wrote:
>
>> +1 for Chuck Norris
>>
>> Pavel (Pasha) Shamis
>> ---
>> Computer Science Research Group
>> Computer Science and Math Division
>> Oak Ridge National Laboratory
>>
>>
>>
>>
>>
>>
>> On Oct 23, 2013, at 1:12 PM, Mike Dubman <mi...@dev.mellanox.co.il<**
>> mailto:mi...@dev.mellanox.co.**il <mi...@dev.mellanox.co.il>>> wrote:
>>
>> maybe to add some nice/funny slogan on the front under the logo, and cool
>> picture on the back.
>> some of community members are still in early twenties (and counting) .
>>  :)
>>
>> shall we open a contest for good slogan to put? and mid-size pict to put
>> on the back side?
>>
>> - living the parallel world
>> - iOMPI
>> - OpenMPI - breaking the barriers!
>> ...
>> for the mid-sized back-side picture, I suggest chuck norris, you can`t
>> beat it.
>>
>>
>>
>>
>>
>> On Wed, Oct 23, 2013 at 7:48 PM, John Hearns <hear...@googlemail.com<**
>> mailto:hear...@googlemail.com>**> wrote:
>>
>> OpenMPI aprons. Nice! Good to wear when cooking up those Chef recipes.
>> (Did I really just say that...)
>>
>> __**_
>> users mailing list
>> us...@open-mpi.org<mailto:user**s...@open-mpi.org <us...@open-mpi.org>>
>> http://www.open-mpi.org/**mailman/listinfo.cgi/users<http://www.open-mpi.org/mailman/listinfo.cgi/users>
>>
>> __**_
>> users mailing list
>> us...@open-mpi.org<mailto:user**s...@open-mpi.org <us...@open-mpi.org>>
>> http://www.open-mpi.org/**mailman/listinfo.cgi/users<http://www.open-mpi.org/mailman/listinfo.cgi/users>
>>
>> __**_
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/**mailman/listinfo.cgi/users<http://www.open-mpi.org/mailman/listinfo.cgi/users>
>>
>
> __**_
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/**mailman/listinfo.cgi/users<http://www.open-mpi.org/mailman/listinfo.cgi/users>
>


Re: [OMPI users] Get your Open MPI schwag!

2013-10-23 Thread Mike Dubman
maybe to add some nice/funny slogan on the front under the logo, and cool
picture on the back.
some of the community members are still in their early twenties (and counting). :)

shall we open a contest for good slogan to put? and mid-size pict to put on
the back side?

- living the parallel world
- iOMPI
- OpenMPI - breaking the barriers!
...
for the mid-sized back-side picture, I suggest chuck norris; you can't beat it.





On Wed, Oct 23, 2013 at 7:48 PM, John Hearns  wrote:

> OpenMPI aprons. Nice! Good to wear when cooking up those Chef recipes.
> (Did I really just say that...)
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] Big job, InfiniBand, MPI_Alltoallv and ibv_create_qp failed

2013-07-31 Thread Mike Dubman
Hi,
What OFED vendor and version do you use?
Regards
M


On Tue, Jul 30, 2013 at 8:42 PM, Paul Kapinos wrote:

> Dear Open MPI experts,
>
> An user at our cluster has a problem running a kinda of big job:
> (- the job using 3024 processes (12 per node, 252 nodes) runs fine)
> - the job using 4032 processes (12 per node, 336 nodes) produce the error
> attached below.
>
> Well, the http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages is
> a well-known one; both recommended tweakables (user limits and registered
> memory size) are at MAX now, nevertheless some queue pair could not be
> created.
>
> Our blind guess is the number of completion queues is exhausted.
>
> What happen' when raising the value from standard to max?
> What max size of Open MPI jobs have been seen at all?
> What max size of Open MPI jobs *using MPI_Alltoallv* have been seen at all?
> Is there a way to manage the size/the number of queue pairs? (XRC not
> availabe)
> Is there a way to tell MPI_Alltoallv to use less queue pairs, even when
> this could lead to slow-down?
>
> There is a suspicious parameter in the mlx4_core module:
> $ modinfo mlx4_core | grep log_num_cq
> parm:   log_num_cq:log maximum number of CQs per HCA  (int)
>
> Is this the tweakable parameter?
> What is the default, and max value?
>
> Any help would be welcome...
>
> Best,
>
> Paul Kapinos
>
> P.S. There should be no connection problen somewhere between the nodes; a
> test job with 1x process on each node has been ran sucessfully just before
> starting the actual job, which also ran OK for a while - until calling
> MPI_Alltoallv.
>
>
>
>
>
>
> --------------------------------------------------------------------------
> A process failed to create a queue pair. This usually means either
> the device has run out of queue pairs (too many connections) or
> there are insufficient resources available to allocate a queue pair
> (out of memory). The latter can happen if either 1) insufficient
> memory is available, or 2) no more physical memory can be registered
> with the device.
>
> For more information on memory registration see the Open MPI FAQs at:
> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>
> Local host: linuxbmc1156.rz.RWTH-Aachen.DE
> Local device:   mlx4_0
> Queue pair type:Reliable connected (RC)
> --------------------------------------------------------------------------
> [linuxbmc1156.rz.RWTH-Aachen.**DE 
> ][[3703,1],4021][connect/**btl_openib_connect_oob.c:867:**rml_recv_cb]
> error in endpoint reply start connect
> [linuxbmc1156.rz.RWTH-Aachen.**DE:9632]
> *** An error occurred in MPI_Alltoallv
> [linuxbmc1156.rz.RWTH-Aachen.**DE:9632]
> *** on communicator MPI_COMM_WORLD
> [linuxbmc1156.rz.RWTH-Aachen.**DE:9632]
> *** MPI_ERR_OTHER: known error not in list
> [linuxbmc1156.rz.RWTH-Aachen.**DE:9632]
> *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
> [linuxbmc1156.rz.RWTH-Aachen.**DE 
> ][[3703,1],4024][connect/**btl_openib_connect_oob.c:867:**rml_recv_cb]
> error in endpoint reply start connect
> [linuxbmc1156.rz.RWTH-Aachen.**DE 
> ][[3703,1],4027][connect/**btl_openib_connect_oob.c:867:**rml_recv_cb]
> error in endpoint reply start connect
> [linuxbmc0840.rz.RWTH-Aachen.**DE 
> ][[3703,1],10][connect/btl_**openib_connect_oob.c:867:rml_**recv_cb]
> error in endpoint reply start connect
> [linuxbmc0840.rz.RWTH-Aachen.**DE 
> ][[3703,1],1][connect/btl_**openib_connect_oob.c:867:rml_**recv_cb] error
> in endpoint reply start connect
> [linuxbmc0840.rz.RWTH-Aachen.**DE:17696]
> [[3703,0],0]-[[3703,1],10] mca_oob_tcp_msg_recv: readv failed: Connection
> reset by peer (104)
> [linuxbmc0840.rz.RWTH-Aachen.**DE:17696]
> [[3703,0],0]-[[3703,1],8] mca_oob_tcp_msg_recv: readv failed: Connection
> reset by peer (104)
> [linuxbmc0840.rz.RWTH-Aachen.**DE:17696]
> [[3703,0],0]-[[3703,1],9] mca_oob_tcp_msg_recv: readv failed: Connection
> reset by peer (104)
> [linuxbmc0840.rz.RWTH-Aachen.**DE:17696]
> [[3703,0],0]-[[3703,1],1] mca_oob_tcp_msg_recv: readv failed: Connection
> reset by peer (104)
> [linuxbmc0840.rz.RWTH-Aachen.**DE:17696]
> 9 more processes have sent help message help-mpi-btl-openib-cpc-base.**txt
> / 

Re: [OMPI users] max. message size

2013-07-17 Thread Mike Dubman
Do you use IB as a transport? The max message size in IB/RDMA is limited to
2 GB, but OMPI 1.7 splits large buffers into 2 GB chunks during RDMA.



On Wed, Jul 17, 2013 at 11:51 AM, mohammad assadsolimani <
m.assadsolim...@jesus.ch> wrote:

>
> Dear all,
>
> I do my PhD in physics and  use a program, which uses openmpi for
> a sophisticated calculation.
> But there is a Problem with "max. message size ". That is limited to  ~2GB.
> Someone  suggested that I have to use chunks i.e. I have to  disassemble
> the massages
> in smaller massages.  That might be nice, but I do not know how?
> I often was searching the last time in internet however I did not get an
> example.
> Is there any other possibility to increase this volume without
> manipulation of
> the code?
>
> The version of my ompi:mpirun (Open MPI) 1.5.5
>
>
> I am very grateful for all of   your help  and thank you in advanced
> Mohammad
>
> ------------------------------------------
> Webmail: http://mail.livenet.ch
> Glauben entdecken: http://www.jesus.ch
> Christliches Webportal: http://www.livenet.ch
>
>
>
> __**_
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/**mailman/listinfo.cgi/users
>


Re: [OMPI users] using the xrc queues

2013-07-09 Thread Mike Dubman
Hi,
I would suggest using MXM (it is part of MOFED, and can also be downloaded as a
standalone rpm from http://mellanox.com/products/mxm for plain OFED).

It uses UD (constant memory footprint) and should provide good performance.
The next MXM v2.0 will support RC and DC (reliable UD) as well.

Once mxm is installed from the rpm (or extracted elsewhere from rpm->tarball),
you can point the ompi configure at it with "--with-mxm=/path/to/mxm".
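
For example, something along these lines (a sketch only; the /opt/mellanox/mxm
prefix and the bracketed placeholders are assumptions, adjust them to your setup):

    # build Open MPI against the installed MXM
    ./configure --with-mxm=/opt/mellanox/mxm --with-openib=/usr [your other options]
    make all install

    # then select it at run time instead of tuning XRC receive queues
    mpirun --mca pml cm --mca mtl mxm -np <numproc> ./app
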
Regards
M


On Fri, Jul 5, 2013 at 10:33 PM, Ben  wrote:

> I'm part of a team that maintains a global climate model running under
> mpi. Recently we have been trying it out with different mpi stacks
> at high resolution/processor counts.
> At one point in the code there is a large number of mpi_isends/mpi_recv
> (tens to hundreds of thousands) when data distributed across all mpi
> processes must be collective on a particular processor or processors be
> transformed to a new resolution before writing. At first the model was
> crashing with a message:
> "A process failed to create a queue pair. This usually means either the
> device has run out of queue pairs (too many connections) or there are
> insufficient resources available to allocate a queue pair (out of memory).
> The latter can happen if either 1) insufficient memory is available, or 2)
> no more physical memory can be registered with the device."
> when it hit the part of code with the send/receives. Watching the node
> memory in an xterm I could see the memory skyrocket and fill the node.
>
> Somewhere we found a suggestion try using the xrc queues (
> http://www.open-mpi.org/faq/?category=openfabrics#ib-xrc)
> to get around this problem and indeed running with
>
> setenv OMPI_MCA_btl_openib_receive_queues
> "X,128,256,192,128:X,2048,256,128,32:X,12288,256,128,32:X,65536,256,128,32"
> mpirun --bind-to-core -np numproc ./app
>
> allowed the model to successfully run. It still seems to use a large
> amount of memory when it writes (on the order of several Gb). Does anyone
> have any  suggestions on how to perhaps tweak the settings to help with
> memory use.
>
> --
> Ben Auer, PhD   SSAI, Scientific Programmer/Analyst
> NASA GSFC,  Global Modeling and Assimilation Office
> Code 610.1, 8800 Greenbelt Rd, Greenbelt, MD  20771
> Phone: 301-286-9176   Fax: 301-614-6246
>
> __**_
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/**mailman/listinfo.cgi/users
>


Re: [OMPI users] EXTERNAL: Re: Need advice on performance problem

2013-06-12 Thread Mike Dubman
Also, what ofed version (ofed_info -s) and mxm version (rpm -qi mxm) do you
use?


On Wed, Jun 12, 2013 at 3:30 AM, Ralph Castain  wrote:

> Great! Would you mind showing the revised table? I'm curious as to the
> relative performance.
>
>
> On Jun 11, 2013, at 4:53 PM, eblo...@1scom.net wrote:
>
> > Problem solved. I did not configure with --with-mxm=/opt/mellanox/mcm and
> > this location was not auto-detected.  Once I rebuilt with this option,
> > everything worked fine. Scaled better than MVAPICH out to 800. MVAPICH
> > configure log showed that it had found this component of the OFED stack.
> >
> > Ed
> >
> >
> >> If you run at 224 and things look okay, then I would suspect something
> in
> >> the upper level switch that spans cabinets. At that point, I'd have to
> >> leave it to Mellanox to advise.
> >>
> >>
> >> On Jun 11, 2013, at 6:55 AM, "Blosch, Edwin L"  >
> >> wrote:
> >>
> >>> I tried adding "-mca btl openib,sm,self"  but it did not make any
> >>> difference.
> >>>
> >>> Jesus’ e-mail this morning has got me thinking.  In our system, each
> >>> cabinet has 224 cores, and we are reaching a different level of the
> >>> system architecture when we go beyond 224.  I got an additional data
> >>> point at 256 and found that performance is already falling off. Perhaps
> >>> I did not build OpenMPI properly to support the Mellanox adapters that
> >>> are used in the backplane, or I need some configuration setting similar
> >>> to FAQ #19 in the Tuning/Openfabrics section.
> >>>
> >>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
> On
> >>> Behalf Of Ralph Castain
> >>> Sent: Sunday, June 09, 2013 6:48 PM
> >>> To: Open MPI Users
> >>> Subject: Re: [OMPI users] EXTERNAL: Re: Need advice on performance
> >>> problem
> >>>
> >>> Strange - it looks like a classic oversubscription behavior. Another
> >>> possibility is that it isn't using IB for some reason when extended to
> >>> the other nodes. What does your cmd line look like? Have you tried
> >>> adding "-mca btl openib,sm,self" just to ensure it doesn't use TCP for
> >>> some reason?
> >>>
> >>>
> >>> On Jun 9, 2013, at 4:31 PM, "Blosch, Edwin L"  >
> >>> wrote:
> >>>
> >>>
> >>> Correct.  20 nodes, 8 cores per dual-socket on each node = 360.
> >>>
> >>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
> On
> >>> Behalf Of Ralph Castain
> >>> Sent: Sunday, June 09, 2013 6:18 PM
> >>> To: Open MPI Users
> >>> Subject: Re: [OMPI users] EXTERNAL: Re: Need advice on performance
> >>> problem
> >>>
> >>> So, just to be sure - when you run 320 "cores", you are running across
> >>> 20 nodes?
> >>>
> >>> Just want to ensure we are using "core" the same way - some people
> >>> confuse cores with hyperthreads.
> >>>
> >>> On Jun 9, 2013, at 3:50 PM, "Blosch, Edwin L"  >
> >>> wrote:
> >>>
> >>>
> >>>
> >>> 16.  dual-socket Xeon, E5-2670.
> >>>
> >>> I am trying a larger model to see if the performance drop-off happens
> at
> >>> a different number of cores.
> >>> Also I’m running some intermediate core-count sizes to refine the curve
> >>> a bit.
> >>> I also added mpi_show_mca_params all, and at the same time,
> >>> btl_openib_use_eager_rdma 1, just to see if that does anything.
> >>>
> >>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org]
> On
> >>> Behalf Of Ralph Castain
> >>> Sent: Sunday, June 09, 2013 5:04 PM
> >>> To: Open MPI Users
> >>> Subject: EXTERNAL: Re: [OMPI users] Need advice on performance problem
> >>>
> >>> Looks to me like things are okay thru 160, and then things fall apart
> >>> after that point. How many cores are on a node?
> >>>
> >>>
> >>> On Jun 9, 2013, at 1:59 PM, "Blosch, Edwin L"  >
> >>> wrote:
> >>>
> >>>
> >>>
> >>>
> >>> I’m having some trouble getting good scaling with OpenMPI 1.6.4 and I
> >>> don’t know where to start looking. This is an Infiniband FDR network
> >>> with Sandy Bridge nodes.  I am using affinity (--bind-to-core) but no
> >>> other options. As the number of cores goes up, the message sizes are
> >>> typically going down. There seem to be lots of options in the FAQ, and
> I
> >>> would welcome any advice on where to start.  All these timings are on a
> >>> completely empty system except for me.
> >>>
> >>> Thanks
> >>>
> >>>
> >>>MPI  # cores   Ave. Rate   Std. Dev. %  # timings
> >>> SpeedupEfficiency
> >>>
> 
> >>> MVAPICH|   16   |8.6783  |   0.995 % |   2  |
> >>> 16.000  |  1.
> >>> MVAPICH|   48   |8.7665  |   1.937 % |   3  |
> >>> 47.517  |  0.9899
> >>> MVAPICH|   80   |8.8900  |   2.291 % |   3  |
> >>> 78.095  |  0.9762
> >>> MVAPICH|  160   |8.9897  |   2.409 % |   3  |
> >>> 154.457  |  0.9654
> >>> MVAPICH 

Re: [OMPI users] Using Service Levels (SLs) with OpenMPI 1.6.4 + MLNX_OFED 2.0

2013-06-11 Thread Mike Dubman
The --mca btl_openib_ib_path_record_service_level 1 flag controls the openib btl,
so you need to remove --mca mtl mxm from the command line.
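
With the mxm MTL dropped, the launch from your mail would look roughly like the
sketch below (the ob1 pml is forced explicitly just to be sure the openib btl
path is taken; the np/machinefile values are simply copied from your example):

    mpirun -np 8 -machinefile maquinas.aux \
        --mca pml ob1 --mca btl openib,self,sm \
        --mca btl_openib_ib_path_record_service_level 1 \
        --mca btl_openib_cpc_include oob \
        hpcc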

Have you compiled OpenMPI with the rhel6.4 inbox OFED driver? AFAIK, MOFED
2.x does not have XRC, and you mentioned the "--enable-openib-connectx-xrc" flag
in configure.


On Tue, Jun 11, 2013 at 3:02 PM, Jesús Escudero Sahuquillo <
jescud...@dsi.uclm.es> wrote:

> I have a 16-node Mellanox cluster built with Mellanox ConnectX3 cards.
> Recently I have updated the MLNX_OFED to the 2.0.5 version. The reason of
> this e-mail to the OpenMPI users list is that I am not able to run MPI
> applications using the service levels (SLs) feature of the OpenMPI driver.
>
> Currently, the nodes have the Red-Hat 6.4 with the kernel
> 2.6.32-358.el6.x86_64. I have compiled OpenMPI 1.6.4 with:
>
>  ./configure --with-sge --with-openib=/usr --enable-openib-connectx-xrc
> --enable-mpi-thread-multiple --with-threads --with-hwloc
> --enable-heterogeneous --with-fca=/opt/mellanox/fca
> --with-mxm-libdir=/opt/mellanox/mxm/lib --with-mxm=/opt/mellanox/mxm
> --prefix=/home/jescudero/opt/openmpi
>
> I have modified the OpenSM code (which is based on 3.3.15) in order to
> include a special routing algorithm based on "ftree". Apparently all is
> correct with the OpenSM since it returns the SLs when I execute the command
> "saquery --src-to-dst slid:dlid". Anyway, I have also tried to run the
> OpenSM with the DFSSSP algorithm.
>
> However, when I try to run MPI applications (i.e. HPCC, OSU or even
> alltoall.c -included in the OpenMPI sources-) I experience some errors if
> the "btl_openib_path_record_info" is set to "1", otherwise (i.e. if the
> btl_openib_path_record_info is not enabled) the application execution ends
> correctly. I run the MPI application with the next command:
>
> mpirun -display-allocation -display-map -np 8 -machinefile maquinas.aux
> --mca btl openib,self,sm --mca mtl mxm --mca 
> btl_openib_ib_path_record_**service_level
> 1 --mca btl_openib_cpc_include oob hpcc
>
> I obtain the next trace:
>
> [nodo20.X][[31227,1],6][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
> error posting receive on QP [0x16db] errno says: Success [0]
> [nodo15.X][[31227,1],4][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
> error posting receive on QP [0x1749] errno says: Success [0]
> [nodo17.X][[31227,1],5][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
> error posting receive on QP [0x1783] errno says: Success [0]
> [nodo21.X][[31227,1],7][connect/btl_openib_connect_sl.c:239:get_pathrecord_info]
> error posting receive on QP [0x1838] errno says: Success [0]
> [nodo21.X][[31227,1],7][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
> endpoint connect error: -1
> [nodo17.X][[31227,1],5][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
> endpoint connect error: -1
> [nodo15.X][[31227,1],4][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
> endpoint connect error: -1
> [nodo20.X][[31227,1],6][connect/btl_openib_connect_oob.c:885:rml_recv_cb]
> endpoint connect error: -1
>
> Does anyone know what I am doing wrong?
>
> All the best,
>
>
>
>
>
>
> __**_
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/**mailman/listinfo.cgi/users
>


Re: [OMPI users] CentOS 6.3 & OpenMPI 1.6.3

2012-12-03 Thread Mike Dubman
Please download http://mellanox.com/downloads/hpc/mxm/v1.1/mxm-latest.tar,
it contains mxm.rpm for mofed 1.5.4.1

On Mon, Dec 3, 2012 at 8:18 AM, Mike Dubman <mike.o...@gmail.com> wrote:

> ohh.. you have MOFED 1.5.4.1, thought it was 1.5.3-3.1.0
> will provide you a link to mxm package compiled with this MOFED version
> (thanks to no ABI in OFED).
>
> On Sun, Dec 2, 2012 at 10:04 PM, Joseph Farran <jfar...@uci.edu> wrote:
>
>> 1.5.4.1
>
>
>


Re: [OMPI users] CentOS 6.3 & OpenMPI 1.6.3

2012-12-03 Thread Mike Dubman
Ohh... you have MOFED 1.5.4.1; I thought it was 1.5.3-3.1.0.
I will provide you a link to an mxm package compiled with this MOFED version
(thanks to the lack of ABI compatibility in OFED).

On Sun, Dec 2, 2012 at 10:04 PM, Joseph Farran  wrote:

> 1.5.4.1


Re: [OMPI users] CentOS 6.3 & OpenMPI 1.6.3

2012-12-02 Thread Mike Dubman
please redownload from
http://mellanox.com/downloads/hpc/mxm/v1.1/mxm-latest.tar
it contains binaries compiled with mofed 1.5.3-3.1.0
M

On Sun, Dec 2, 2012 at 12:13 PM, Mike Dubman <mike.o...@gmail.com> wrote:

>
> It seems that your active mofed is 1.5.3-3.1.0, while installed mxm was
> compiled with 1.5.3-3.0.0
> MOFED is not binary compatible, let me check and send you the link for mxm
> compiled with mofed that you have.
>
> Also, MOFED contains ompi 1.6.0 which is already compiled with mxm
> (/usr/mpi/...)
> On Sun, Dec 2, 2012 at 12:01 PM, Joseph Farran <jfar...@uci.edu> wrote:
>
>>  Same thing.
>>
>> My new config:
>>
>> CFLAGS="" FCFLAGS="" ./configure\
>> --with-sge  \
>> --with-openib=/usr  \
>>
>> --enable-openib-connectx-xrc\
>> --enable-mpi-thread-multiple\
>> --with-threads  \
>> --with-hwloc\
>> --enable-heterogeneous  \
>> --with-fca=/opt/mellanox/fca\
>> --with-mxm-libdir=/opt/mellanox/mxm/lib \
>> --with-mxm=/opt/mellanox/mxm\
>>
>> Fails at the same spot:
>>
>>
>> make[2]: Entering directory
>> `/data/apps/sources/openmpi-1.6.3/ompi/mca/mtl/mxm'
>>   CC mtl_mxm.lo
>>   CC mtl_mxm_cancel.lo
>>   CC mtl_mxm_endpoint.lo
>>   CC mtl_mxm_probe.lo
>>   CC mtl_mxm_recv.lo
>>
>>   CCLD   mca_mtl_mxm.la
>> /bin/grep: /usr/local/mofed-inst/1.5.3-3.0.0/lib/librdmacm.la: No such
>> file or directory
>> /bin/sed: can't read /usr/local/mofed-inst/1.5.3-3.0.0/lib/librdmacm.la:
>> No such file or directory
>> libtool: link: `/usr/local/mofed-inst/1.5.3-3.0.0/lib/librdmacm.la' is
>> not a valid libtool archive
>> make[2]: *** [mca_mtl_mxm.la] Error 1
>> make[2]: Leaving directory
>> `/data/apps/sources/openmpi-1.6.3/ompi/mca/mtl/mxm'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory `/data/apps/sources/openmpi-1.6.3/ompi'
>> make: *** [all-recursive] Error 1
>>
>>
>>
>> On 12/2/2012 1:37 AM, Mike Dubman wrote:
>>
>>  please change "--with-openib" to "--with-openib=/usr"  and retry
>> configure/make stage.
>> 10x
>>
>>  On Sun, Dec 2, 2012 at 10:36 AM, Joseph Farran <jfar...@uci.edu> wrote:
>>
>>>  Hi Mike.
>>>
>>> Thanks for the help!
>>>
>>> I am installing OFED on an NFS share partition so that all compute nodes
>>> will have access.
>>>
>>> For the "--with-openib" option, I don't specify one.   My config file
>>> looks like this:
>>>
>>> CFLAGS="" FCFLAGS="" ./configure\
>>>
>>> --with-sge  \
>>> --with-openib   \
>>> --enable-openib-connectx-xrc\
>>> --enable-mpi-thread-multiple\
>>> --with-threads  \
>>> --with-hwloc\
>>> --enable-heterogeneous  \
>>> --with-fca=/opt/mellanox/fca\
>>>  --with-mxm-libdir=/opt/mellanox/mxm/lib \
>>> --with-mxm=/opt/mellanox/mxm\
>>> --prefix=/data/openmpi-1-6.3
>>>
>>> Please advise,
>>> Joseph
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 12/1/2012 11:39 PM, Mike Dubman wrote:
>>>
>>>  Hi Joseph,
>>> I guess you install MOFED under /usr, is that right?
>>> Could you please specify "--with-openib=/usr" parameter during ompi
>>> "configure" stage?
>>> 10x
>>> M
>>>
>>>  On Fri, Nov 30, 2012 at 1:11 AM, Joseph Farran <jfar...@uci.edu> wrote:
>>>
>>>> Hi YK:
>>>>
>>>> Yes, I have those installed but they are newer versions:
>>>>
>>>> # rpm -qa | grep rdma
>>>> librdmacm-1.0.15-1.x86_64
>>>> librdmacm-utils-1.0.15-1.x86_64
>>>> librdmacm-devel-1.0.15-1.x86_64
>>>> # locate librdmacm.la
>>>> #
>>>>
>>>> Here are the RPMs the Mellanox build created for kernel:
>>>> 2.6.32-279.14.1.el6.x86_64
>>>>
>>>> # ls *rdma*
>>>> librdmacm-1.0.15-1.i686.rpmlibrdmacm-devel-1.0.15-1.i686.rpm
>>>>  librdmacm-utils-1.0.15-1.i686.rpm
>>>> librdmacm-1.0.15-1.x86_64.rpm  librdmacm-devel-1.0.15-1.x86_64.rpm
>>>>  librdmacm-utils-1.0.15-1.x86_64.rpm
>>>>
>>>>
>>>> On 11/29/2012 02:59 PM, Yevgeny Kliteynik wrote:
>>>>
>>>>> Joseph,
>>>>>
>>>>>
>>>>> You're supposed to have librdmacm installed as part of MLNX_OFED
>>>>> installation.
>>>>> What does "rpm -qa | grep rdma" tell?
>>>>>
>>>>>$ rpm -qa | grep rdma
>>>>>librdmacm-devel-1.0.14.1-1.x86_64
>>>>>librdmacm-utils-1.0.14.1-1.x86_64
>>>>>librdmacm-1.0.14.1-1.x86_64
>>>>>
>>>>>$ locate librdmacm.la
>>>>>/usr/local/mofed/1.5.3-4.0.9/lib/librdmacm.la
>>>>>
>>>>> -- YK
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>>
>


Re: [OMPI users] CentOS 6.3 & OpenMPI 1.6.3

2012-12-02 Thread Mike Dubman
It seems that your active MOFED is 1.5.3-3.1.0, while the installed mxm was
compiled against 1.5.3-3.0.0.
MOFED is not binary compatible across versions; let me check and send you the
link for an mxm compiled with the MOFED that you have.

Also, MOFED contains ompi 1.6.0 which is already compiled with mxm
(/usr/mpi/...)
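
To double-check the mismatch on a node, something like this should do (ofed_info
and rpm -qi are the same commands mentioned elsewhere in this thread; the /usr/mpi
path is just where MOFED usually drops its prebuilt MPIs and may differ):

    ofed_info -s      # version of the active MOFED
    rpm -qi mxm       # version/build info of the installed mxm package
    ls /usr/mpi       # MOFED-shipped Open MPI 1.6.0 builds
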
On Sun, Dec 2, 2012 at 12:01 PM, Joseph Farran <jfar...@uci.edu> wrote:

>  Same thing.
>
> My new config:
>
> CFLAGS="" FCFLAGS="" ./configure\
> --with-sge  \
> --with-openib=/usr  \
>
> --enable-openib-connectx-xrc\
> --enable-mpi-thread-multiple\
> --with-threads  \
> --with-hwloc\
> --enable-heterogeneous  \
> --with-fca=/opt/mellanox/fca\
> --with-mxm-libdir=/opt/mellanox/mxm/lib \
> --with-mxm=/opt/mellanox/mxm\
>
> Fails at the same spot:
>
>
> make[2]: Entering directory
> `/data/apps/sources/openmpi-1.6.3/ompi/mca/mtl/mxm'
>   CC mtl_mxm.lo
>   CC mtl_mxm_cancel.lo
>   CC mtl_mxm_endpoint.lo
>   CC mtl_mxm_probe.lo
>   CC mtl_mxm_recv.lo
>
>   CCLD   mca_mtl_mxm.la
> /bin/grep: /usr/local/mofed-inst/1.5.3-3.0.0/lib/librdmacm.la: No such
> file or directory
> /bin/sed: can't read /usr/local/mofed-inst/1.5.3-3.0.0/lib/librdmacm.la:
> No such file or directory
> libtool: link: `/usr/local/mofed-inst/1.5.3-3.0.0/lib/librdmacm.la' is
> not a valid libtool archive
> make[2]: *** [mca_mtl_mxm.la] Error 1
> make[2]: Leaving directory
> `/data/apps/sources/openmpi-1.6.3/ompi/mca/mtl/mxm'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/data/apps/sources/openmpi-1.6.3/ompi'
> make: *** [all-recursive] Error 1
>
>
>
> On 12/2/2012 1:37 AM, Mike Dubman wrote:
>
>  please change "--with-openib" to "--with-openib=/usr"  and retry
> configure/make stage.
> 10x
>
>  On Sun, Dec 2, 2012 at 10:36 AM, Joseph Farran <jfar...@uci.edu> wrote:
>
>>  Hi Mike.
>>
>> Thanks for the help!
>>
>> I am installing OFED on an NFS share partition so that all compute nodes
>> will have access.
>>
>> For the "--with-openib" option, I don't specify one.   My config file
>> looks like this:
>>
>> CFLAGS="" FCFLAGS="" ./configure\
>>
>> --with-sge  \
>> --with-openib   \
>> --enable-openib-connectx-xrc\
>> --enable-mpi-thread-multiple\
>> --with-threads  \
>> --with-hwloc    \
>> --enable-heterogeneous  \
>> --with-fca=/opt/mellanox/fca\
>>  --with-mxm-libdir=/opt/mellanox/mxm/lib \
>> --with-mxm=/opt/mellanox/mxm\
>> --prefix=/data/openmpi-1-6.3
>>
>> Please advise,
>> Joseph
>>
>>
>>
>>
>>
>>
>> On 12/1/2012 11:39 PM, Mike Dubman wrote:
>>
>>  Hi Joseph,
>> I guess you install MOFED under /usr, is that right?
>> Could you please specify "--with-openib=/usr" parameter during ompi
>> "configure" stage?
>> 10x
>> M
>>
>>  On Fri, Nov 30, 2012 at 1:11 AM, Joseph Farran <jfar...@uci.edu> wrote:
>>
>>> Hi YK:
>>>
>>> Yes, I have those installed but they are newer versions:
>>>
>>> # rpm -qa | grep rdma
>>> librdmacm-1.0.15-1.x86_64
>>> librdmacm-utils-1.0.15-1.x86_64
>>> librdmacm-devel-1.0.15-1.x86_64
>>> # locate librdmacm.la
>>> #
>>>
>>> Here are the RPMs the Mellanox build created for kernel:
>>> 2.6.32-279.14.1.el6.x86_64
>>>
>>> # ls *rdma*
>>> librdmacm-1.0.15-1.i686.rpmlibrdmacm-devel-1.0.15-1.i686.rpm
>>>  librdmacm-utils-1.0.15-1.i686.rpm
>>> librdmacm-1.0.15-1.x86_64.rpm  librdmacm-devel-1.0.15-1.x86_64.rpm
>>>  librdmacm-utils-1.0.15-1.x86_64.rpm
>>>
>>>
>>> On 11/29/2012 02:59 PM, Yevgeny Kliteynik wrote:
>>>
>>>> Joseph,
>>>>
>>>>
>>>> You're supposed to have librdmacm installed as part of MLNX_OFED
>>>> installation.
>>>> What does "rpm -qa | grep rdma" tell?
>>>>
>>>>$ rpm -qa | grep rdma
>>>>librdmacm-devel-1.0.14.1-1.x86_64
>>>>librdmacm-utils-1.0.14.1-1.x86_64
>>>>librdmacm-1.0.14.1-1.x86_64
>>>>
>>>>$ locate librdmacm.la
>>>>/usr/local/mofed/1.5.3-4.0.9/lib/librdmacm.la
>>>>
>>>> -- YK
>>>>
>>>>
>>>
>>
>>
>
>


Re: [OMPI users] CentOS 6.3 & OpenMPI 1.6.3

2012-12-02 Thread Mike Dubman
please change "--with-openib" to "--with-openib=/usr"  and retry
configure/make stage.
10x

On Sun, Dec 2, 2012 at 10:36 AM, Joseph Farran <jfar...@uci.edu> wrote:

>  Hi Mike.
>
> Thanks for the help!
>
> I am installing OFED on an NFS share partition so that all compute nodes
> will have access.
>
> For the "--with-openib" option, I don't specify one.   My config file
> looks like this:
>
> CFLAGS="" FCFLAGS="" ./configure\
>
> --with-sge  \
> --with-openib   \
> --enable-openib-connectx-xrc\
> --enable-mpi-thread-multiple\
> --with-threads  \
> --with-hwloc\
> --enable-heterogeneous  \
> --with-fca=/opt/mellanox/fca\
> --with-mxm-libdir=/opt/mellanox/mxm/lib \
> --with-mxm=/opt/mellanox/mxm    \
> --prefix=/data/openmpi-1-6.3
>
> Please advise,
> Joseph
>
>
>
>
>
>
> On 12/1/2012 11:39 PM, Mike Dubman wrote:
>
>  Hi Joseph,
> I guess you install MOFED under /usr, is that right?
> Could you please specify "--with-openib=/usr" parameter during ompi
> "configure" stage?
> 10x
> M
>
>  On Fri, Nov 30, 2012 at 1:11 AM, Joseph Farran <jfar...@uci.edu> wrote:
>
>> Hi YK:
>>
>> Yes, I have those installed but they are newer versions:
>>
>> # rpm -qa | grep rdma
>> librdmacm-1.0.15-1.x86_64
>> librdmacm-utils-1.0.15-1.x86_64
>> librdmacm-devel-1.0.15-1.x86_64
>> # locate librdmacm.la
>> #
>>
>> Here are the RPMs the Mellanox build created for kernel:
>> 2.6.32-279.14.1.el6.x86_64
>>
>> # ls *rdma*
>> librdmacm-1.0.15-1.i686.rpmlibrdmacm-devel-1.0.15-1.i686.rpm
>>  librdmacm-utils-1.0.15-1.i686.rpm
>> librdmacm-1.0.15-1.x86_64.rpm  librdmacm-devel-1.0.15-1.x86_64.rpm
>>  librdmacm-utils-1.0.15-1.x86_64.rpm
>>
>>
>> On 11/29/2012 02:59 PM, Yevgeny Kliteynik wrote:
>>
>>> Joseph,
>>>
>>>
>>> You're supposed to have librdmacm installed as part of MLNX_OFED
>>> installation.
>>> What does "rpm -qa | grep rdma" tell?
>>>
>>>$ rpm -qa | grep rdma
>>>librdmacm-devel-1.0.14.1-1.x86_64
>>>librdmacm-utils-1.0.14.1-1.x86_64
>>>librdmacm-1.0.14.1-1.x86_64
>>>
>>>$ locate librdmacm.la
>>>/usr/local/mofed/1.5.3-4.0.9/lib/librdmacm.la
>>>
>>> -- YK
>>>
>>>
>>
>
>


Re: [OMPI users] OpenMPI-1.6.3 & MXM

2012-12-02 Thread Mike Dubman
Hi,

The mxm that ships with MOFED 1.5.3 supports OMPI 1.6.0.

An mxm upgrade is needed to work with OMPI 1.6.3+.

Please remove mxm from your cluster nodes (rpm -e mxm).
Install the latest from http://www.mellanox.com/products/mxm/
Compile ompi 1.6.3, adding the following to its configure line: ./configure
--with-openib=/usr --with-mxm=/opt/mellanox/mxm <...>
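
Roughly, per node (a sketch; the rpm file name and <...> stand in for whatever
you downloaded and whatever configure options you already use):

    rpm -e mxm                    # remove the MOFED-1.5.3 mxm
    rpm -ivh mxm-<version>.rpm    # install the newly downloaded mxm
    cd openmpi-1.6.3
    ./configure --with-openib=/usr --with-mxm=/opt/mellanox/mxm <...>
    make all install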

Regards
M

On Sat, Dec 1, 2012 at 2:23 AM, Joseph Farran  wrote:

>  Konz,
>
> For whatever it is worth, I am in the same boat.
>
> I have CentOS 6.3, trying to compile OpenMPI 1.6.3 with the mxm from
> Mellanox and it fails.
>
> Also, the Mellanox OFED ( MLNX_OFED_LINUX-1.5.3-3.1.0-rhel6.3-x86_64 )
> does not work either.
>
> Mellanox really needs to step in here and help out.
>
> I have a cluster full of Mellanox products and I hate to think we chose
> the wrong Infiniband vendor.
>
> Joseph
>
>
>
> On 11/30/2012 12:33 PM, Konz, Jeffrey (SSA Solution Centers) wrote:
>
>  I tried building the latest OpenMPI-1.6.3 with MXM support and got this
> error:
>
> ** **
>
> make[2]: Entering directory `Src/openmpi-1.6.3/ompi/mca/mtl/mxm'
>
>   CC mtl_mxm.lo
>
>   CC mtl_mxm_cancel.lo
>
>   CC mtl_mxm_component.lo
>
>   CC mtl_mxm_endpoint.lo
>
>   CC mtl_mxm_probe.lo
>
>   CC mtl_mxm_recv.lo
>
>   CC mtl_mxm_send.lo
>
> mtl_mxm_send.c: In function 'ompi_mtl_mxm_send':
>
> mtl_mxm_send.c:96: error: 'mxm_wait_t' undeclared (first use in this
> function)
>
> mtl_mxm_send.c:96: error: (Each undeclared identifier is reported only once
> 
>
> mtl_mxm_send.c:96: error: for each function it appears in.)
>
> mtl_mxm_send.c:96: error: expected ';' before 'wait'
>
> mtl_mxm_send.c:104: error: 'MXM_REQ_FLAG_BLOCKING' undeclared (first use
> in this function)
>
> mtl_mxm_send.c:118: error: 'MXM_REQ_FLAG_SEND_SYNC' undeclared (first use
> in this function)
>
> mtl_mxm_send.c:134: error: 'wait' undeclared (first use in this function)*
> ***
>
> mtl_mxm_send.c: In function 'ompi_mtl_mxm_isend':
>
> mtl_mxm_send.c:183: error: 'MXM_REQ_FLAG_SEND_SYNC' undeclared (first use
> in this function)
>
> make[2]: *** [mtl_mxm_send.lo] Error 1
>
> ** **
>
> ** **
>
> Our OFED is 1.5.3 and our MXM version is 1.0.601. 
>
> ** **
>
> Thanks,
>
> ** **
>
> -Jeff
>
> ** **
>
> /**/
>
> /* Jeff Konz  jeffrey.k...@hp.com */
>
> /* Solutions Architect   HPC Benchmarking */
>
> /* Americas Shared Solutions Architecture (SSA)   */
>
> /* Hewlett-Packard Company*/
>
> /* Office: 248-491-7480  Mobile: 248-345-6857 */ 
>
> /**/
>
> 
>
>
> ___
> users mailing 
> listusers@open-mpi.orghttp://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] CentOS 6.3 & OpenMPI 1.6.3

2012-12-02 Thread Mike Dubman
Hi Joseph,
I guess you installed MOFED under /usr, is that right?
Could you please specify the "--with-openib=/usr" parameter during the ompi
"configure" stage?
10x
M

On Fri, Nov 30, 2012 at 1:11 AM, Joseph Farran  wrote:

> Hi YK:
>
> Yes, I have those installed but they are newer versions:
>
> # rpm -qa | grep rdma
> librdmacm-1.0.15-1.x86_64
> librdmacm-utils-1.0.15-1.x86_64
> librdmacm-devel-1.0.15-1.x86_64
> # locate librdmacm.la
> #
>
> Here are the RPMs the Mellanox build created for kernel:
> 2.6.32-279.14.1.el6.x86_64
>
> # ls *rdma*
> librdmacm-1.0.15-1.i686.rpm    librdmacm-devel-1.0.15-1.i686.rpm
>  librdmacm-utils-1.0.15-1.i686.rpm
> librdmacm-1.0.15-1.x86_64.rpm  librdmacm-devel-1.0.15-1.x86_64.rpm
>  librdmacm-utils-1.0.15-1.x86_64.rpm
>
>
> On 11/29/2012 02:59 PM, Yevgeny Kliteynik wrote:
>
>> Joseph,
>>
>>
>> You're supposed to have librdmacm installed as part of MLNX_OFED
>> installation.
>> What does "rpm -qa | grep rdma" tell?
>>
>>$ rpm -qa | grep rdma
>>librdmacm-devel-1.0.14.1-1.**x86_64
>>librdmacm-utils-1.0.14.1-1.**x86_64
>>librdmacm-1.0.14.1-1.x86_64
>>
>>$ locate librdmacm.la
>>/usr/local/mofed/1.5.3-4.0.9/**lib/librdmacm.la
>>
>> -- YK
>>
>>
>


Re: [OMPI users] CentOS 6.3 & OpenMPI 1.6.3

2012-11-28 Thread Mike Dubman
You need mxm-1.1.3a5e745-1.x86_64-rhel6u3.rpm


On Wed, Nov 28, 2012 at 7:44 PM, Joseph Farran  wrote:

> mxm-1.1.3a5e745-1.x86_64-rhel6u3.rpm
>


Re: [OMPI users] application with mxm hangs on startup

2012-08-24 Thread Mike Dubman
Hi,
Could you please download latest mxm from
http://www.mellanox.com/products/mxm/ and retry?
The mxm version which comes with OFED 1.5.3 was tested with OMPI 1.6.0.

Regards
M

On Wed, Aug 22, 2012 at 2:22 PM, Pavel Mezentsev
wrote:

> I've tried to launch the application on nodes with QDR Infiniband. The
> first attempt with 2 processes worked, but the following was printed to the
> output:
> [1345633953.436676] [b01:2523 :0] mpool.c:99   MXM ERROR Invalid
> mempool parameter(s)
> [1345633953.436676] [b01:2522 :0] mpool.c:99   MXM ERROR Invalid
> mempool parameter(s)
> --
> MXM was unable to create an endpoint. Please make sure that the network
> link is
> active on the node and the hardware is functioning.
>
>   Error: Invalid parameter
>
> --
>
> The results from this launch didn't differ from the results of the launch
> without MXM.
>
> Then I've tried to launch it with 256 processes, but got the same message
> from each process and then the application crashed. After that I'm
> observing the same behavior as with FDR: application hangs in
> the beginning.
>
> Best regards, Pavel Mezentsev.
>
>
> 2012/8/22 Pavel Mezentsev 
>
>> Hello!
>>
>> I've built openmpi 1.6.1rc3 with support of MXM. But when I try to launch
>> an application using this mtl it hangs and can't figure out why.
>>
>> If I launch it with np below 128 then everything works fine since mxm
>> isn't used. I've tried setting the threshold to 0 and launching 2 processes
>> with the same result: hangs on startup.
>> What could be causing this problem?
>>
>> Here is the command I execute:
>> /opt/openmpi/1.6.1/mxm-test/bin/mpirun \
>> -np $NP \
>> -hostfile hosts_fdr2 \
>> --mca mtl mxm \
>> --mca btl ^tcp \
>> --mca mtl_mxm_np 0 \
>> -x OMP_NUM_THREADS=$NT \
>> -x LD_LIBRARY_PATH \
>> --bind-to-core \
>> -npernode 16 \
>> --mca coll_fca_np 0 -mca coll_fca_enable 0 \
>> ./IMB-MPI1 -npmin $NP Allreduce Reduce Barrier Bcast
>> Allgather Allgatherv
>>
>> I'm performing the tests on nodes with Intel SB processors and FDR.
>> Openmpi was configured with the following parameters:
>> CC=icc CXX=icpc F77=ifort FC=ifort ./configure
>> --prefix=/opt/openmpi/1.6.1rc3/mxm-test --with-mxm=/opt/mellanox/mxm
>> --with-fca=/opt/mellanox/fca --with-knem=/usr/share/knem
>> I'm using the latest ofed from mellanox: 1.5.3-3.1.0 on centos 6.1 with
>> default kernel: 2.6.32-131.0.15.
>> The compilation with default mxm (1.0.601) failed so I installed the
>> latest version from mellanox: 1.1.1227
>>
>> Best regards, Pavel Mezentsev.
>>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] ompi mca mxm version

2012-05-11 Thread Mike Dubman
ob1/openib is RC-based, which has scalability issues; mxm 1.1 is UD-based
and kicks in at scale.
We observe that mxm outperforms ob1 on 8+ nodes.
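
For a small-scale A/B comparison you can force each path explicitly; note that
by default the mxm MTL only activates above a process-count threshold (the
mtl_mxm_np parameter seen in other threads on this list), so it is lowered to 0
in the sketch below. Host file and np are placeholders:

    # RC path: ob1 pml over the openib btl
    mpirun -np 32 -hostfile hosts --mca pml ob1 --mca btl openib,sm,self ./osu_mbw_mr

    # UD path: cm pml over the mxm MTL, forced on even for a small np
    mpirun -np 32 -hostfile hosts --mca pml cm --mca mtl mxm --mca mtl_mxm_np 0 ./osu_mbw_mr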

We will update docs as you mentioned, thanks

Regards




On Thu, May 10, 2012 at 4:30 PM, Derek Gerstmann <derek.gerstm...@uwa.edu.au
> wrote:

> On May 9, 2012, at 7:41 PM, Mike Dubman wrote:
>
> > you need latest OMPI 1.6.x and latest MXM (
> ftp://bgate.mellanox.com/hpc/mxm/v1.1/mxm_1.1.1067.tar)
>
> Excellent!  Thanks for the quick response!  Using the MXM v1.1.1067
> against OMPI v1.6.x did the trick.  Please (!!!) add a note to the docs for
> OMPI 1.6.x to help out other users -- there's zero mention of this anywhere
> that I could find from scouring the archives and source code.
>
> Sadly, performance isn't what we'd expect.  OB1 is outperforming CM MXM
> (consistently).
>
> Are there any suggested configuration settings?  We tried all the obvious
> ones listed in the OMPI Wiki and mailing list archives, but few have had
> much of an effect.
>
> We seem to do better with the OB1 openib btl, than the lower level CM MXM.
>  Any suggestions?
>
> Here's numbers from the OSU MicroBenchmarks (for the MBW_MR test) running
> on 2x pairs, aka 4 separate hosts, each using Mellanox ConnectX, one card
> per host, single port, single switch):
>
> -- OB1
> > /opt/openmpi/1.6.0/bin/mpiexec -np 4 --mca pml ob1 --mca btl ^tcp --mca
> mpi_use_pinned 1 -hostfile all_hosts ./osu-micro-benchmarks/osu_mbw_mr
> # OSU MPI Multiple Bandwidth / Message Rate Test v3.6
> # [ pairs: 2 ] [ window size: 64 ]
> # Size  MB/sMessages/s
> 1   2.912909711.73
> 2   5.972984274.11
> 4  11.702924292.78
> 8  23.002874502.93
> 16 44.752796639.64
> 32 89.492796639.64
> 64175.982749658.96
> 128   292.412284459.86
> 256   527.842061874.61
> 512   961.651878221.77
> 1024 1669.061629943.87
> 2048 2220.431084193.45
> 4096 2906.57 709611.68
> 8192 3017.65 368365.70
> 163845225.97 318967.95
> 327685418.98 165374.23
> 655365998.07  91523.27
> 131072   6031.69  46018.16
> 262144   6063.38  23129.97
> 524288   5971.77  11390.24
> 1048576  5788.75   5520.59
> 2097152  5791.39   2761.55
> 4194304  5820.60   1387.74
>
> -- MXM
> > /opt/openmpi/1.6.0/bin/mpiexec -np 4 --mca pml cm --mca mtl mxm --mca
> btl ^tcp --mca mpi_use_pinned 1 -hostfile all_hosts
> ./osu-micro-benchmarks/osu_mbw_mr
> # OSU MPI Multiple Bandwidth / Message Rate Test v3.6
> # [ pairs: 2 ] [ window size: 64 ]
> # Size  MB/sMessages/s
> 1   2.072074863.43
> 2   4.142067830.81
> 4  10.572642471.39
> 8  23.162895275.37
> 16 38.732420627.22
> 32 66.772086718.41
> 64147.872310414.05
> 128   284.942226109.85
> 256   537.272098709.64
> 512  1041.912034989.43
> 1024 1930.931885676.34
> 2048 1998.68 975916.00
> 4096 2880.72 703299.77
> 8192 3608.45 440484.17
> 163844027.15 245797.51
> 327684464.85 136256.47
> 655364594.22  70102.23
> 131072   4655.62  35519.55
> 262144   4671.56  17820.58
> 524288   4604.16   8781.74
> 1048576  4635.51   4420.77
> 2097152  3575.17   1704.78
> 4194304  2828.19674.29
>
> Thanks!
>
> -[dg]
>
> Derek Gerstmann, PhD Student
> The University of Western Australia (UWA)
>
> w: http://local.ivec.uwa.edu.au/~derek
> e: derek.gerstmann [at] icrar.org
>
> On May 9, 2012, at 7:41 PM, Mike Dubman wrote:
>
> > you need latest OMPI 1.6.x and latest MXM (
> ftp://bgate.mellanox.com/hpc/mxm/v1.1/mxm_1.1.1067.tar)
> >
> >
> >
> > On Wed, May 9, 2012 at 6:02 AM, Derek Gerstmann <
> derek.gerstm...@uwa.edu.au> wrote:
> > What versions o

Re: [OMPI users] ompi mca mxm version

2012-05-09 Thread Mike Dubman
you need latest OMPI 1.6.x and latest MXM (
ftp://bgate.mellanox.com/hpc/mxm/v1.1/mxm_1.1.1067.tar)



On Wed, May 9, 2012 at 6:02 AM, Derek Gerstmann
wrote:

> What versions of OpenMPI and the Mellanox MXM libraries have been tested
> and verified to work?
>
> We are currently trying to build OpenMPI v1.5.5 against the MXM 1.0.601
> (included in the MLNX_OFED_LINUX-1.5.3-3.0.0 distribution) and are getting
> build errors.
>
> Specifically, there's a single undefined type (mxm_wait_t) being used in
> the OpenMPI tree:
>
>    openmpi-1.5.5/ompi/mca/mtl/mxm/mtl_mxm_send.c:44    mxm_wait_t wait;
>
> There is no mxm_wait_t defined anywhere in the current MXM API
> (/opt/mellanox/mxm/include/mxm/api), which suggests a version mismatch.
>
> The OpenMPI v1.6 branch has a note in the readme saying "Minor Fixes for
> Mellanox MXM" were added, but the same undefined mxm_wait_t is still being
> used.
>
> What versions of OpenMPI and MXM are verified to work?
>
> Thanks!
>
> -[dg]
>
> Derek Gerstmann, PhD Student
> The University of Western Australia (UWA)
>
> w: http://local.ivec.uwa.edu.au/~derek
> e: derek.gerstmann [at] icrar.org
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
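
For anyone hitting the same mxm_wait_t mismatch: a minimal sketch of building
a matching pair from the versions named above, assuming the MXM tarball was
unpacked and installed under /opt/mellanox/mxm (the prefix and paths are
illustrative, not a verified recipe):

./configure --prefix=$HOME/opt/openmpi-1.6 --with-mxm=/opt/mellanox/mxm
make -j8 all install
mpirun -np 16 --mca pml cm --mca mtl mxm ./osu_mbw_mr

The --mca pml cm --mca mtl mxm selection is the same one used in the
benchmark runs earlier in this thread.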


Re: [OMPI users] Possible bug in finalize, OpenMPI v1.5, head revision

2012-01-26 Thread Mike Dubman
So far it has not happened again - will report if it does.

On Tue, Jan 24, 2012 at 5:10 PM, Jeff Squyres <jsquy...@cisco.com> wrote:

> Ralph's fix has now been committed to the v1.5 trunk (yesterday).
>
> Did that fix it?
>
>
> On Jan 22, 2012, at 3:40 PM, Mike Dubman wrote:
>
> > it was compiled with the same ompi.
> > We see it occasionally on different clusters with different ompi
> folders. (all v1.5)
> >
> > On Thu, Jan 19, 2012 at 5:44 PM, Ralph Castain <r...@open-mpi.org> wrote:
> > I didn't commit anything to the v1.5 branch yesterday - just the trunk.
> >
> > As I told Mike off-list, I think it may have been that the binary was
> compiled against a different OMPI version by mistake. It looks very much
> like what I'd expect to have happen in that scenario.
> >
> > On Jan 19, 2012, at 7:52 AM, Jeff Squyres wrote:
> >
> > > Did you "svn up"?  I ask because Ralph committed some stuff yesterday
> that may have fixed this.
> > >
> > >
> > > On Jan 18, 2012, at 5:19 PM, Andrew Senin wrote:
> > >
> > >> No, nothing specific. Only basic settings (--mca btl openib,self
> > >> --npernode 1, etc).
> > >>
> > >> Actually I'm very confused by this error, because today it just
> > >> disappeared. I had 2 separate folders where it was reproduced in 100%
> > >> of test runs. Today I recompiled the source and it is gone in both
> > >> folders. But yesterday I tried recompiling multiple times with no
> > >> effect. So I believe this must be somehow related to some unknown
> > >> settings in the lab which have been changed. Trying to reproduce the
> > >> crash now...
> > >>
> > >> Regards,
> > >> Andrew Senin.
> > >>
> > >> On Thu, Jan 19, 2012 at 12:05 AM, Jeff Squyres <jsquy...@cisco.com>
> wrote:
> > >>> Jumping in pretty late in this thread here...
> > >>>
> > >>> I see that it's failing in opal_hwloc_base_close().  That's a little
> worrisome.
> > >>>
> > >>> I do see an odd path through the hwloc initialization that *could*
> cause an error during finalization -- but it would involve you setting an
> invalid value for an MCA parameter.  Are you setting
> hwloc_base_mem_bind_failure_action or
> > >>> hwloc_base_mem_alloc_policy, perchance?
> > >>>
> > >>>
> > >>> On Jan 16, 2012, at 1:56 PM, Andrew Senin wrote:
> > >>>
> > >>>> Hi,
> > >>>>
> > >>>> I think I've found a bug in the head revision of the OpenMPI 1.5
> > >>>> branch. If it is configured with --disable-debug it crashes in
> > >>>> finalize on the hello_c.c example. Did I miss something out?
> > >>>>
> > >>>> Configure options:
> > >>>> ./configure --with-pmi=/usr/ --with-slurm=/usr/ --without-psm
> > >>>> --disable-debug --enable-mpirun-prefix-by-default
> > >>>>
> --prefix=/hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install
> > >>>>
> > >>>> Runtime command and output:
> > >>>> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:../lib ./mpirun --mca btl
> openib,self
> > >>>> --npernode 1 --host mir1,mir2 ./hello
> > >>>>
> > >>>> Hello, world, I am 0 of 2
> > >>>> Hello, world, I am 1 of 2
> > >>>> [mir1:05542] *** Process received signal ***
> > >>>> [mir1:05542] Signal: Segmentation fault (11)
> > >>>> [mir1:05542] Signal code: Address not mapped (1)
> > >>>> [mir1:05542] Failing at address: 0xe8
> > >>>> [mir2:10218] *** Process received signal ***
> > >>>> [mir2:10218] Signal: Segmentation fault (11)
> > >>>> [mir2:10218] Signal code: Address not mapped (1)
> > >>>> [mir2:10218] Failing at address: 0xe8
> > >>>> [mir1:05542] [ 0] /lib64/libpthread.so.0() [0x390d20f4c0]
> > >>>> [mir1:05542] [ 1]
> > >>>>
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(+0x1346a8)
> > >>>> [0x7f4588cee6a8]
> > >>>> [mir1:05542] [ 2]
> > >>>>
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_hwloc_base_close+0x32)
> > >>>> [0x7f4588cee700]
> > >>>> [mir1:05542] [ 3]
> > >>>>
> 

Re: [OMPI users] Possible bug in finalize, OpenMPI v1.5, head revision

2012-01-22 Thread Mike Dubman
it was compiled with the same ompi.
We see it occasionally on different clusters with different ompi folders.
(all v1.5)

On Thu, Jan 19, 2012 at 5:44 PM, Ralph Castain  wrote:

> I didn't commit anything to the v1.5 branch yesterday - just the trunk.
>
> As I told Mike off-list, I think it may have been that the binary was
> compiled against a different OMPI version by mistake. It looks very much
> like what I'd expect to have happen in that scenario.
>
> On Jan 19, 2012, at 7:52 AM, Jeff Squyres wrote:
>
> > Did you "svn up"?  I ask because Ralph committed some stuff yesterday
> that may have fixed this.
> >
> >
> > On Jan 18, 2012, at 5:19 PM, Andrew Senin wrote:
> >
> >> No, nothing specific. Only basic settings (--mca btl openib,self
> >> --npernode 1, etc).
> >>
> >> Actually I'm very confused by this error, because today it just
> >> disappeared. I had 2 separate folders where it was reproduced in 100%
> >> of test runs. Today I recompiled the source and it is gone in both
> >> folders. But yesterday I tried recompiling multiple times with no
> >> effect. So I believe this must be somehow related to some unknown
> >> settings in the lab which have been changed. Trying to reproduce the
> >> crash now...
> >>
> >> Regards,
> >> Andrew Senin.
> >>
> >> On Thu, Jan 19, 2012 at 12:05 AM, Jeff Squyres 
> wrote:
> >>> Jumping in pretty late in this thread here...
> >>>
> >>> I see that it's failing in opal_hwloc_base_close().  That's a little
> worrisome.
> >>>
> >>> I do see an odd path through the hwloc initialization that *could*
> cause an error during finalization -- but it would involve you setting an
> invalid value for an MCA parameter.  Are you setting
> hwloc_base_mem_bind_failure_action or
> >>> hwloc_base_mem_alloc_policy, perchance?
> >>>
> >>>
> >>> On Jan 16, 2012, at 1:56 PM, Andrew Senin wrote:
> >>>
>  Hi,
> 
>  I think I've found a bug in the head revision of the OpenMPI 1.5
>  branch. If it is configured with --disable-debug it crashes in
>  finalize on the hello_c.c example. Did I miss something out?
> 
>  Configure options:
>  ./configure --with-pmi=/usr/ --with-slurm=/usr/ --without-psm
>  --disable-debug --enable-mpirun-prefix-by-default
> 
> --prefix=/hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install
> 
>  Runtime command and output:
>  LD_LIBRARY_PATH=$LD_LIBRARY_PATH:../lib ./mpirun --mca btl openib,self
>  --npernode 1 --host mir1,mir2 ./hello
> 
>  Hello, world, I am 0 of 2
>  Hello, world, I am 1 of 2
>  [mir1:05542] *** Process received signal ***
>  [mir1:05542] Signal: Segmentation fault (11)
>  [mir1:05542] Signal code: Address not mapped (1)
>  [mir1:05542] Failing at address: 0xe8
>  [mir2:10218] *** Process received signal ***
>  [mir2:10218] Signal: Segmentation fault (11)
>  [mir2:10218] Signal code: Address not mapped (1)
>  [mir2:10218] Failing at address: 0xe8
>  [mir1:05542] [ 0] /lib64/libpthread.so.0() [0x390d20f4c0]
>  [mir1:05542] [ 1]
> 
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(+0x1346a8)
>  [0x7f4588cee6a8]
>  [mir1:05542] [ 2]
> 
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_hwloc_base_close+0x32)
>  [0x7f4588cee700]
>  [mir1:05542] [ 3]
> 
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_finalize+0x73)
>  [0x7f4588d1beb2]
>  [mir1:05542] [ 4]
> 
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(orte_finalize+0xfe)
>  [0x7f4588c81eb5]
>  [mir1:05542] [ 5]
> 
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(ompi_mpi_finalize+0x67a)
>  [0x7f4588c217c3]
>  [mir1:05542] [ 6]
> 
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(PMPI_Finalize+0x59)
>  [0x7f4588c39959]
>  [mir1:05542] [ 7] ./hello(main+0x69) [0x4008fd]
>  [mir1:05542] [ 8] /lib64/libc.so.6(__libc_start_main+0xfd)
> [0x390ca1ec5d]
>  [mir1:05542] [ 9] ./hello() [0x4007d9]
>  [mir1:05542] *** End of error message ***
>  [mir2:10218] [ 0] /lib64/libpthread.so.0() [0x3a6dc0f4c0]
>  [mir2:10218] [ 1]
> 
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(+0x1346a8)
>  [0x7f409f31d6a8]
>  [mir2:10218] [ 2]
> 
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_hwloc_base_close+0x32)
>  [0x7f409f31d700]
>  [mir2:10218] [ 3]
> 
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_finalize+0x73)
>  [0x7f409f34aeb2]
>  [mir2:10218] [ 4]
> 
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(orte_finalize+0xfe)
>  [0x7f409f2b0eb5]
>  [mir2:10218] [ 5]

Re: [OMPI users] Possible bug in finalize, OpenMPI v1.5, head revision

2012-01-17 Thread Mike Dubman
It happens for us on RHEL 6.0

On Tue, Jan 17, 2012 at 3:46 AM, Ralph Castain wrote:

> Well, I'm afraid I can't replicate your report. It runs fine for me.
>
> Sent from my iPad
>
> On Jan 16, 2012, at 4:25 PM, Ralph Castain  wrote:
>
> > Hmm, probably a bug. I haven't tested that branch yet. Will take a
> look.
> >
> > Sent from my iPad
> >
> > On Jan 16, 2012, at 11:56 AM, Andrew Senin 
> wrote:
> >
> >> Hi,
> >>
> >> I think I've found a bug in the head revision of the OpenMPI 1.5
> >> branch. If it is configured with --disable-debug it crashes in
> >> finalize on the hello_c.c example. Did I miss something out?
> >>
> >> Configure options:
> >> ./configure --with-pmi=/usr/ --with-slurm=/usr/ --without-psm
> >> --disable-debug --enable-mpirun-prefix-by-default
> >>
> --prefix=/hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install
> >>
> >> Runtime command and output:
> >> LD_LIBRARY_PATH=$LD_LIBRARY_PATH:../lib ./mpirun --mca btl openib,self
> >> --npernode 1 --host mir1,mir2 ./hello
> >>
> >> Hello, world, I am 0 of 2
> >> Hello, world, I am 1 of 2
> >> [mir1:05542] *** Process received signal ***
> >> [mir1:05542] Signal: Segmentation fault (11)
> >> [mir1:05542] Signal code: Address not mapped (1)
> >> [mir1:05542] Failing at address: 0xe8
> >> [mir2:10218] *** Process received signal ***
> >> [mir2:10218] Signal: Segmentation fault (11)
> >> [mir2:10218] Signal code: Address not mapped (1)
> >> [mir2:10218] Failing at address: 0xe8
> >> [mir1:05542] [ 0] /lib64/libpthread.so.0() [0x390d20f4c0]
> >> [mir1:05542] [ 1]
> >>
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(+0x1346a8)
> >> [0x7f4588cee6a8]
> >> [mir1:05542] [ 2]
> >>
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_hwloc_base_close+0x32)
> >> [0x7f4588cee700]
> >> [mir1:05542] [ 3]
> >>
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_finalize+0x73)
> >> [0x7f4588d1beb2]
> >> [mir1:05542] [ 4]
> >>
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(orte_finalize+0xfe)
> >> [0x7f4588c81eb5]
> >> [mir1:05542] [ 5]
> >>
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(ompi_mpi_finalize+0x67a)
> >> [0x7f4588c217c3]
> >> [mir1:05542] [ 6]
> >>
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(PMPI_Finalize+0x59)
> >> [0x7f4588c39959]
> >> [mir1:05542] [ 7] ./hello(main+0x69) [0x4008fd]
> >> [mir1:05542] [ 8] /lib64/libc.so.6(__libc_start_main+0xfd)
> [0x390ca1ec5d]
> >> [mir1:05542] [ 9] ./hello() [0x4007d9]
> >> [mir1:05542] *** End of error message ***
> >> [mir2:10218] [ 0] /lib64/libpthread.so.0() [0x3a6dc0f4c0]
> >> [mir2:10218] [ 1]
> >>
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(+0x1346a8)
> >> [0x7f409f31d6a8]
> >> [mir2:10218] [ 2]
> >>
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_hwloc_base_close+0x32)
> >> [0x7f409f31d700]
> >> [mir2:10218] [ 3]
> >>
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(opal_finalize+0x73)
> >> [0x7f409f34aeb2]
> >> [mir2:10218] [ 4]
> >>
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(orte_finalize+0xfe)
> >> [0x7f409f2b0eb5]
> >> [mir2:10218] [ 5]
> >>
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(ompi_mpi_finalize+0x67a)
> >> [0x7f409f2507c3]
> >> [mir2:10218] [ 6]
> >>
> /hpc/home/USERS/senina/projects/distribs/openmpi-svn_v1.5/install/lib/libmpi.so.1(PMPI_Finalize+0x59)
> >> [0x7f409f268959]
> >> [mir2:10218] [ 7] ./hello(main+0x69) [0x4008fd]
> >> [mir2:10218] [ 8] /lib64/libc.so.6(__libc_start_main+0xfd)
> [0x3a6d41ec5d]
> >> [mir2:10218] [ 9] ./hello() [0x4007d9]
> >> [mir2:10218] *** End of error message ***
> >>
> --
> >> mpirun noticed that process rank 0 with PID 5542 on node mir1 exited
> >> on signal 11 (Segmentation fault).
> >> -
> >>
> >> Thanks,
> >> Andrew Senin
> >> ___
> >> users mailing list
> >> us...@open-mpi.org
> >> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] MPI_Send doesn't work if the data >= 2GB

2010-12-06 Thread Mike Dubman
Hi,
What interconnect and command line do you use? For the InfiniBand openib
component there is a known issue with large transfers (>2 GB):

https://svn.open-mpi.org/trac/ompi/ticket/2623

try disabling memory pinning:
http://www.open-mpi.org/faq/?category=openfabrics#large-message-leave-pinned


regards
M


2010/12/6 孟宪军 

> hi,
>
> On my computers (x86-64), sizeof(int)=4, but
> sizeof(long)=sizeof(double)=sizeof(size_t)=8. When I checked my mpi.h file,
> I found that the definition of sizeof(int) is correct. Meanwhile, I think
> the mpi.h file was generated according to my compute environment when I
> compiled OpenMPI. So my code still doesn't work. :(
>
> Further, I found that the collective routines (such as
> MPI_Allgatherv(...)), which are implemented on top of point-to-point, don't
> work either when the data > 2GB.
>
> Thanks
> Xianjun
>
> 2010/12/6 Tim Prince 
>
> On 12/5/2010 7:13 PM, 孟宪军 wrote:
>>
>>> hi,
>>>
>>> I met a question recently when I tested the MPI_send and MPI_Recv
>>> functions. When I run the following codes, the processes hanged and I
>>> found there was not data transmission in my network at all.
>>>
>>> BTW: I finished this test on two X86-64 computers with 16GB memory and
>>> installed Linux.
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <string.h>
>>> #include <mpi.h>
>>>
>>> int main(int argc, char** argv)
>>> {
>>>     int localID;
>>>     int numOfPros;
>>>     size_t Gsize = (size_t)2 * 1024 * 1024 * 1024;
>>>
>>>     char* g = (char*)malloc(Gsize);
>>>
>>>     MPI_Init(&argc, &argv);
>>>     MPI_Comm_size(MPI_COMM_WORLD, &numOfPros);
>>>     MPI_Comm_rank(MPI_COMM_WORLD, &localID);
>>>
>>>     /* 2048-byte element type so the 2 GB buffer fits in an int count */
>>>     MPI_Datatype MPI_Type_lkchar;
>>>     MPI_Type_contiguous(2048, MPI_BYTE, &MPI_Type_lkchar);
>>>     MPI_Type_commit(&MPI_Type_lkchar);
>>>
>>>     if (localID == 0)
>>>     {
>>>         MPI_Send(g, 1024*1024, MPI_Type_lkchar, 1, 1, MPI_COMM_WORLD);
>>>     }
>>>
>>>     if (localID != 0)
>>>     {
>>>         MPI_Status status;
>>>         MPI_Recv(g, 1024*1024, MPI_Type_lkchar, 0, 1,
>>>                  MPI_COMM_WORLD, &status);
>>>     }
>>>
>>>     MPI_Finalize();
>>>
>>>     return 0;
>>> }
>>>
>>>  You supplied all your constants as 32-bit signed data, so, even if the
>> count for MPI_Send() and MPI_Recv() were a larger data type, you would see
>> this limit. Did you look at your <mpi.h>?
>>
>> --
>> Tim Prince
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
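
As a side note, a minimal sketch of the other common way to move a buffer
larger than 2 GB: split it into chunks whose counts stay inside a 32-bit int,
instead of (or in addition to) wrapping it in a contiguous datatype as in the
code above. This is an illustrative workaround assuming a plain MPI C
environment, not the fix for the openib ticket referenced above; the function
names and the 1 GiB chunk size are made up for the example.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Send/receive a large buffer in pieces so that no single MPI call needs a
 * count larger than a 32-bit int can hold (illustrative sketch). */
static void send_large(const char *buf, size_t total, int dest, MPI_Comm comm)
{
    const size_t chunk = (size_t)1 << 30;           /* 1 GiB per message */
    size_t off;
    for (off = 0; off < total; off += chunk) {
        int n = (int)((total - off < chunk) ? (total - off) : chunk);
        MPI_Send((void *)(buf + off), n, MPI_BYTE, dest, 0, comm);
    }
}

static void recv_large(char *buf, size_t total, int src, MPI_Comm comm)
{
    const size_t chunk = (size_t)1 << 30;
    size_t off;
    for (off = 0; off < total; off += chunk) {
        int n = (int)((total - off < chunk) ? (total - off) : chunk);
        MPI_Recv(buf + off, n, MPI_BYTE, src, 0, comm, MPI_STATUS_IGNORE);
    }
}

int main(int argc, char **argv)
{
    int rank;
    size_t total = (size_t)3 * 1024 * 1024 * 1024;  /* 3 GiB on purpose */
    char *buf = malloc(total);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        send_large(buf, total, 1, MPI_COMM_WORLD);
    else if (rank == 1)
        recv_large(buf, total, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    free(buf);
    return 0;
}

This keeps every individual transfer well below the 2 GB openib limit
mentioned in the ticket, at the cost of a few extra messages for very large
buffers.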


Re: [OMPI users] openib issues

2010-08-10 Thread Mike Dubman
Hey Eloi,

What HCA card do you have? Can you post code/instructions on how to
reproduce it?
10x
Mike

On Mon, Aug 9, 2010 at 5:22 PM, Eloi Gaudry  wrote:

> Hi,
>
> Could someone have a look on these two different error messages ? I'd like
> to know the reason(s) why they were displayed and their actual meaning.
>
> Thanks,
> Eloi
>
> On Monday 19 July 2010 16:38:57 Eloi Gaudry wrote:
> > Hi,
> >
> > I've been working on a random segmentation fault that seems to occur
> during
> > a collective communication when using the openib btl (see [OMPI users]
> > [openib] segfault when using openib btl).
> >
> > During my tests, I've come across different issues reported by
> > OpenMPI-1.4.2:
> >
> > 1/
> > [[12770,1],0][btl_openib_component.c:3227:handle_wc] from bn0103 to:
> bn0122
> > error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for
> > wr_id 560618664 opcode 1  vendor error 105 qp_idx 3
> >
> > 2/
> > [[992,1],6][btl_openib_component.c:3227:handle_wc] from pbn04 to: pbn05
> > error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for
> > wr_id 162858496 opcode 1  vendor error 136 qp_idx
> > 0[[992,1],5][btl_openib_component.c:3227:handle_wc] from pbn05 to: pbn04
> > error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number
> 5
> > for wr_id 485900928 opcode 0  vendor error 249 qp_idx 0
> >
> >
> --
> > The OpenFabrics stack has reported a network error event.  Open MPI will
> > try to continue, but your job may end up failing.
> >
> >   Local host:p'"
> >   MPI process PID:   20743
> >   Error number:  3 (IBV_EVENT_QP_ACCESS_ERR)
> >
> > This error may indicate connectivity problems within the fabric; please
> > contact your system administrator.
> >
> --
> >
> > I'd like to know what these two errors mean and where they come from.
> >
> > Thanks for your help,
> > Eloi
>
> --
>
>
> Eloi Gaudry
>
> Free Field Technologies
> Company Website: http://www.fft.be
> Company Phone:   +32 10 487 959
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


[OMPI users] Error: system limit exceeded on number of pipes that can be open

2009-08-11 Thread Mike Dubman
Hello guys,


When executing the following command with MTT and OMPI 1.3.3:

mpirun --host 
witch15,witch15,witch15,witch15,witch16,witch16,witch16,witch16,witch17,witch17,witch17,witch17,witch18,witch18,witch18,witch18,witch19,witch19,witch19,witch19
-np 20   --mca btl_openib_use_srq 1  --mca btl self,sm,openib
~mtt/mtt-scratch/20090809140816_dellix8_11812/installs/mnum/tests/ibm/ibm/dynamic/loop_spawn


getting following errors:

parent: MPI_Comm_spawn #0 return : 0
parent: MPI_Comm_spawn #20 return : 0
parent: MPI_Comm_spawn #40 return : 0
parent: MPI_Comm_spawn #60 return : 0
parent: MPI_Comm_spawn #80 return : 0
parent: MPI_Comm_spawn #100 return : 0
parent: MPI_Comm_spawn #120 return : 0
parent: MPI_Comm_spawn #140 return : 0
parent: MPI_Comm_spawn #160 return : 0
parent: MPI_Comm_spawn #180 return : 0
parent: MPI_Comm_spawn #200 return : 0
parent: MPI_Comm_spawn #220 return : 0
parent: MPI_Comm_spawn #240 return : 0
parent: MPI_Comm_spawn #260 return : 0
parent: MPI_Comm_spawn #280 return : 0
parent: MPI_Comm_spawn #300 return : 0
parent: MPI_Comm_spawn #320 return : 0
parent: MPI_Comm_spawn #340 return : 0
parent: MPI_Comm_spawn #360 return : 0
parent: MPI_Comm_spawn #380 return : 0
parent: MPI_Comm_spawn #400 return : 0
parent: MPI_Comm_spawn #420 return : 0
parent: MPI_Comm_spawn #440 return : 0
parent: MPI_Comm_spawn #460 return : 0
parent: MPI_Comm_spawn #480 return : 0
parent: MPI_Comm_spawn #500 return : 0
parent: MPI_Comm_spawn #520 return : 0
parent: MPI_Comm_spawn #540 return : 0
parent: MPI_Comm_spawn #560 return : 0
parent: MPI_Comm_spawn #580 return : 0
--
mpirun was unable to launch the specified application as it
encountered an error:

Error: system limit exceeded on number of pipes that can be open
Node: witch19

when attempting to start process rank 0.

This can be resolved by setting the mca parameter opal_set_max_sys_limits to 1,
increasing your limit descriptor setting (using limit or ulimit commands),
asking the system administrator for that node to increase the system limit, or
by rearranging your processes to place fewer of them on that node.




Do you know what OS param I should change in order to resolve it?

Thanks

Mike
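
Following the hints in the error text itself, the usual remedies look roughly
like this (the 65536 value is illustrative; the right limit depends on how
many processes and spawned children land on one node):

# raise the per-process descriptor limit (pipes count against it)
# before launching, e.g. in the shell or job script that runs mtt
ulimit -n 65536

# and/or let Open MPI try to raise the limits itself, as the message suggests
mpirun --mca opal_set_max_sys_limits 1 ...

If neither is enough, the remaining options from the message apply: ask the
admin to raise the system-wide limit on that node, or spread the spawning
ranks across more nodes.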


Re: [OMPI users] 1.3.1 -rf rankfile behaviour ??

2009-07-16 Thread Mike Dubman
Hello Ralph,

It seems that Option 2 is preferred, because it is more intuitive for the
end user to create a rankfile for an MPI job that is described by the -app
command line.

All host definitions used inside the appfile will be treated as a single
global host list, combined from all hosts appearing inside the "-app file",
and the rankfile will refer to any host appearing inside the "-app" directive.
Is that correct?


regards

Mike


P.S. man mpirun claims that:

-app <appfile>  Provide an appfile, ignoring all other command line options.

But it seems that it does not ignore all other command line options.
Moreover, it seems very convenient to specify per-job parameters on the
mpirun command line just before "-app appfile" and to put per-host params
inside the appfile. What do you think?








On Thu, Jul 16, 2009 at 2:00 AM, Ralph Castain  wrote:

> Hmmm...well actually, there isn't a bug in the code. This is an interesting
> question!
> Here is the problem. It has to do with how -host is processed. Remember, in
> the new scheme (as of 1.3.0), in the absence of any other info (e.g., an RM
> allocation or hostfile), we cycle across -all- the -host specifications to
> create a global pool of allocated nodes. Hence, you got the following:
>
> ==   ALLOCATED NODES   ==
>>
>>  Data for node: Name: dellix7   Num slots: 0    Max slots: 0
>>  Data for node: Name: witch1    Num slots: 1    Max slots: 0
>>  Data for node: Name: witch2    Num slots: 1    Max slots: 0
>>
>> =
>>
>
> When we start mapping, we call the base function to get the available nodes
> for this particular app_context. The function starts with the entire
> allocation. It then checks for a hostfile, which in this case it won't find.
>
> Subsequently, it looks at the -host spec and removes -all- nodes in the
> list that were not included in -host. In the case of app_context=0, the
> "-host witch1" causes us to remove dellix7 and witch2 from the list -
> leaving only witch1.
>
> This list is passed back to the rank_file mapper. The rf mapper then looks
> at your rankfile, which tells it to put rank=0 on the +n1 node on the list.
>
> But the list only has ONE node on the list, which would correspond to +n0!
> Hence the error message.
>
> We have two potential solutions I can see:
>
> Option 1. we can leave things as they are, and you adjust your rankfile to:
>
> rank 0=+n0 slot=0
> rank 1=+n0 slot=0
>
> Since you specified -host witch2 for the second app_context, this will work
> to put rank0 on witch1 and rank1 on witch2. However, I admit that it looks a
> little weird.
>
> Alternatively, you could adjust your appfile to:
>
> -np 1 -host witch1,witch2 ./hello_world
> -np 1 ./hello_world
>
> Note you could have -host witch1,witch2 on the second line too, if you
> wanted. Now your current rankfile would put rank0 on witch2 and rank1 on
> witch1.
>
> Option 2. we could modify your relative node syntax to be based on the
> eventual total allocation. In this case, we would not use the base function
> to give us a list, but instead would construct it from the allocated node
> pool.
> Your current rankfile would give you what you wanted since we wouldn't count 
> the HNP's node in the pool as it wasn't included in the allocation.
>
>
> Any thoughts on how you'd like to do this? I can make it work either way, but 
> have no personal preference.
> Ralph
>
> On Jul 15, 2009, at 7:38 AM, Ralph Castain wrote:
>
> Okay, I'll dig into it - must be a bug in my code.
>
> Sorry for the problem! Thanks for patience in tracking it down...
> Ralph
>
> On Wed, Jul 15, 2009 at 7:28 AM, Lenny Verkhovsky <
> lenny.verkhov...@gmail.com> wrote:
>
>> Thanks, Ralph,
>> I guess your guess was correct, here is the display map.
>>
>>
>> $cat rankfile
>> rank 0=+n1 slot=0
>> rank 1=+n0 slot=0
>> $cat appfile
>> -np 1 -host witch1 ./hello_world
>> -np 1 -host witch2 ./hello_world
>> $mpirun -np 2 -rf rankfile --display-allocation  -app appfile
>>
>> ==   ALLOCATED NODES   ==
>>
>>  Data for node: Name: dellix7   Num slots: 0    Max slots: 0
>>  Data for node: Name: witch1    Num slots: 1    Max slots: 0
>>  Data for node: Name: witch2    Num slots: 1    Max slots: 0
>>
>> =
>>
>> --
>> Rankfile claimed host +n1 by index that is bigger than number of allocated
>> hosts.
>>
>>
>> On Wed, Jul 15, 2009 at 4:10 PM, Ralph Castain  wrote:
>>
>>> What is supposed to happen is this:
>>>
>>> 1. each line of the appfile causes us to create a new app_context. We
>>> store the provided -host info in that object.
>>>
>>> 2. when we create the "allocation", we cycle through -all- the
>>> app_contexts and add -all- of their -host info into the list of allocated
>>> nodes
>>>
>>> 3. when we 
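
Pulling Ralph's Option 1 together in one place (hosts, binaries and flags
exactly as used earlier in this thread; treat it as a sketch, since the tail
of the thread is truncated above): keep the original appfile and change only
the rankfile, so that each app_context's +n0 resolves to its own -host entry
and rank 0 lands on witch1 while rank 1 lands on witch2.

$ cat appfile
-np 1 -host witch1 ./hello_world
-np 1 -host witch2 ./hello_world

$ cat rankfile
rank 0=+n0 slot=0
rank 1=+n0 slot=0

$ mpirun -np 2 -rf rankfile --display-allocation -app appfile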

Re: [OMPI users] Can 2 IB HCAs give twice the bandwidth?

2008-10-22 Thread Mike Dubman
Using 2 HCAs on the same PCI-Express bus (as well as 2 ports of the same HCA)
will not improve performance; PCI-Express is the bottleneck.


On Mon, Oct 20, 2008 at 2:28 AM, Mostyn Lewis  wrote:

> Well, here's what I see with the IMB PingPong test using two ConnectX DDR
> cards
> in each of 2 machines. I'm just quoting the last line at 10 repetitions of
> the 4194304 bytes.
>
> Scali_MPI_Connect-5.6.4-59151: (multi rail setup in /etc/dat.conf)
>   #bytes #repetitions  t[usec]   Mbytes/sec
>  4194304   10  2198.24  1819.63
> mvapich2-1.2rc2: (MV2_NUM_HCAS=2 MV2_NUM_PORTS=1)
>   #bytes #repetitions  t[usec]   Mbytes/sec
>  4194304   10  2427.24  1647.96
> OpenMPI SVN 19772:
>   #bytes #repetitions  t[usec]   Mbytes/sec
>  4194304   10  3676.35  1088.03
>
> Repeatable within bounds.
>
> This is OFED-1.3.1 and I peered at
> /sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_packets
> and
> /sys/class/infiniband/mlx4_1/ports/1/counters/port_rcv_packets
> on one of the 2 machines and looked at what happened for Scali
> and OpenMPI.
>
> Scali packets:
> HCA 0 port1 = 115116625 - 114903198 = 213427
> HCA 1 port1 =  78099566 -  77886143 = 213423
> 
>  426850
> OpenMPI packets:
> HCA 0 port1 = 115233624 - 115116625 = 116999
> HCA 1 port1 =  78216425 -  78099566 = 116859
> 
>  233858
>
> Scali is set up so that data larger than 8192 bytes is striped
> across the 2 HCAs using 8192 bytes per HCA in a round robin fashion.
>
> So, it seems that OpenMPI is using both ports but strangely ends
> up with a Mbytes/sec rate which is worse than a single HCA only.
> If I use a --mca btl_openib_if_exclude mlx4_1:1, we get
>   #bytes #repetitions  t[usec]   Mbytes/sec
>  4194304   10  3080.59  1298.45
>
> So, what's taking so long? Is this a threading question?
>
> DM
>
>
> On Sun, 19 Oct 2008, Jeff Squyres wrote:
>
>  On Oct 18, 2008, at 9:19 PM, Mostyn Lewis wrote:
>>
>>  Can OpenMPI do like Scali and MVAPICH2 and utilize 2 IB HCAs per machine
>>> to approach double the bandwidth on simple tests such as IMB PingPong?
>>>
>>
>>
>> Yes.  OMPI will automatically (and aggressively) use as many active ports
>> as you have.  So you shouldn't need to list devices+ports -- OMPI will
>> simply use all ports that it finds in the active state.  If your ports are
>> on physically separate IB networks, then each IB network will require a
>> different subnet ID so that OMPI can compute reachability properly.
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>
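
For anyone repeating Mostyn's counter check, a small sketch of the procedure
(sysfs paths as in the thread, run on one of the two hosts; the benchmark
command line is illustrative):

# snapshot the receive counters on both HCAs before the run
cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_packets
cat /sys/class/infiniband/mlx4_1/ports/1/counters/port_rcv_packets

# run the benchmark being measured, e.g. the IMB PingPong used above
mpirun -np 2 -host host1,host2 ./IMB-MPI1 PingPong

# read the same counters again; the per-HCA differences show how the
# traffic was split across the two cards
cat /sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_packets
cat /sys/class/infiniband/mlx4_1/ports/1/counters/port_rcv_packets

Roughly equal increments on both cards mean both HCAs carried traffic, as in
the packet counts Mostyn posted for Scali and Open MPI.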