Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-11-23 Thread Howard Pritchard via users
Hi All,

I opened a new issue to track the coll_perf failure in case it's not related
to the HDF5 problem reported earlier.

https://github.com/open-mpi/ompi/issues/8246

Howard


On Mon., 23 Nov. 2020 at 12:14, Dave Love via users <
users@lists.open-mpi.org>:

> Mark Dixon via users  writes:
>
> > Surely I cannot be the only one who cares about using a recent openmpi
> > with hdf5 on lustre?
>
> I generally have similar concerns.  I dug out the romio tests, assuming
> something more basic is useful.  I ran them with ompi 4.0.5+ucx on
> Mark's lustre system (similar to a few nodes of Summit, apart from the
> filesystem, but with quad-rail IB which doesn't give the bandwidth I
> expected).
>
> The perf test says romio performs a bit better.  Also -- from overall
> time -- it's faster on IMB-IO (which I haven't looked at in detail, and
> ran with suboptimal striping).
>
>   Test: perf
>   romio321
>   Access size per process = 4194304 bytes, ntimes = 5
>   Write bandwidth without file sync = 19317.372354 Mbytes/sec
>   Read bandwidth without prior file sync = 35033.325451 Mbytes/sec
>   Write bandwidth including file sync = 1081.096713 Mbytes/sec
>   Read bandwidth after file sync = 47135.349155 Mbytes/sec
>   ompio
>   Access size per process = 4194304 bytes, ntimes = 5
>   Write bandwidth without file sync = 18442.698536 Mbytes/sec
>   Read bandwidth without prior file sync = 31958.198676 Mbytes/sec
>   Write bandwidth including file sync = 1081.058583 Mbytes/sec
>   Read bandwidth after file sync = 31506.854710 Mbytes/sec
>
> However, romio coll_perf fails as follows, and ompio runs.  Isn't there
> mpi-io regression testing?
>
>   [gpu025:89063:0:89063] Caught signal 11 (Segmentation fault: address not
> mapped to object at address 0x1fffbc10)
>    backtrace (tid:  89063) 
>0 0x0005453c ucs_debug_print_backtrace()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucs/debug/debug.c:656
>1 0x00041b04 ucp_rndv_pack_data()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1335
>2 0x0001c814 uct_self_ep_am_bcopy()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:278
>3 0x0003f7ac uct_ep_am_bcopy()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/api/uct.h:2561
>4 0x0003f7ac ucp_do_am_bcopy_multi()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/proto/proto_am.inl:79
>5 0x0003f7ac ucp_rndv_progress_am_bcopy()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1352
>6 0x00041cb8 ucp_request_try_send()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:223
>7 0x00041cb8 ucp_request_send()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:258
>8 0x00041cb8 ucp_rndv_rtr_handler()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1754
>9 0x0001c984 uct_iface_invoke_am()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/base/uct_iface.h:635
>   10 0x0001c984 uct_self_iface_sendrecv_am()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:149
>   11 0x0001c984 uct_self_ep_am_short()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:262
>   12 0x0002ee30 uct_ep_am_short()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/api/uct.h:2549
>   13 0x0002ee30 ucp_do_am_single()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/proto/proto_am.c:68
>   14 0x00042908 ucp_proto_progress_rndv_rtr()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:172
>   15 0x0003f4c4 ucp_request_try_send()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:223
>   16 0x0003f4c4 ucp_request_send()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:258
>   17 0x0003f4c4 ucp_rndv_req_send_rtr()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:423
>   18 0x00045214 ucp_rndv_matched()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1262
>   19 0x00046158 ucp_rndv_process_rts()
> 

Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.

2020-08-15 Thread Howard Pritchard via users
Hi Martin,

Thanks, this is helpful.  Are you getting this timeout when you're running
the spawner process as a singleton?

Howard

On Fri., 14 Aug. 2020 at 17:44, Martín Morales <
martineduardomora...@hotmail.com>:

> Howard,
>
>
>
> I pasted below the error message that appears after a while of the hang I referred to.
>
> Regards,
>
>
>
> Martín
>
>
>
> -
>
>
>
> *A request has timed out and will therefore fail:*
>
>
>
> *  Operation:  LOOKUP: orted/pmix/pmix_server_pub.c:345*
>
>
>
> *Your job may terminate as a result of this problem. You may want to*
>
> *adjust the MCA parameter pmix_server_max_wait and try again. If this*
>
> *occurred during a connect/accept operation, you can adjust that time*
>
> *using the pmix_base_exchange_timeout parameter.*
>
>
> *--*
>
>
> *--*
>
> *It looks like MPI_INIT failed for some reason; your parallel process is*
>
> *likely to abort.  There are many reasons that a parallel process can*
>
> *fail during MPI_INIT; some of which are due to configuration or
> environment*
>
> *problems.  This failure appears to be an internal failure; here's some*
>
> *additional information (which may only be relevant to an Open MPI*
>
> *developer):*
>
>
>
> *  ompi_dpm_dyn_init() failed*
>
> *  --> Returned "Timeout" (-15) instead of "Success" (0)*
>
>
> *--*
>
> *[nos-GF7050VT-M:03767] *** An error occurred in MPI_Init*
>
> *[nos-GF7050VT-M:03767] *** reported by process [2337734658,0]*
>
> *[nos-GF7050VT-M:03767] *** on a NULL communicator*
>
> *[nos-GF7050VT-M:03767] *** Unknown error*
>
> *[nos-GF7050VT-M:03767] *** MPI_ERRORS_ARE_FATAL (processes in this
> communicator will now abort,*
>
> *[nos-GF7050VT-M:03767] ***and potentially your MPI job)*
>
> *[osboxes:02457] *** An error occurred in MPI_Comm_spawn*
>
> *[osboxes:02457] *** reported by process [2337734657,0]*
>
> *[osboxes:02457] *** on communicator MPI_COMM_WORLD*
>
> *[osboxes:02457] *** MPI_ERR_UNKNOWN: unknown error*
>
> *[osboxes:02457] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
> will now abort,*
>
> *[osboxes:02457] ***and potentially your MPI job)*
>
> *[osboxes:02458] 1 more process has sent help message help-orted.txt /
> timedout*
>
> *[osboxes:02458] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages*
>
>
>
>
>
>
>
>
>
> *From: *Martín Morales via users 
> *Sent: *viernes, 14 de agosto de 2020 19:40
> *To: *Howard Pritchard 
> *Cc: *Martín Morales ; Open MPI Users
> 
> *Subject: *Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with
> dynamically processes allocation. OMPI 4.0.1 don't.
>
>
>
> Hi Howard.
>
>
>
> Thanks for tracking this in GitHub. I have run with mpirun without “master” in
> the hostfile and it runs OK. The hang occurs when I run as a singleton
> (no mpirun), which is the way I need to run. If I run top on both
> machines, the processes are correctly mapped but hung. It seems the
> MPI_Init() function doesn’t return. Thanks for your help.
>
> Best regards,
>
>
>
> Martín
>
>
>
>
>
>
>
>
>
>
>
>
>
> *From: *Howard Pritchard 
> *Sent: *viernes, 14 de agosto de 2020 15:18
> *To: *Martín Morales 
> *Cc: *Open MPI Users 
> *Subject: *Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with
> dynamically processes allocation. OMPI 4.0.1 don't.
>
>
>
> Hi Martin,
>
>
>
> I opened an issue on Open MPI's github to track this
> https://github.com/open-mpi/ompi/issues/8005
>
>
>
> You may be seeing another problem if you removed master from the host
> file.
>
> Could you add the --debug-daemons option to the mpirun and post the output?
>
>
>
> Howard
>
>
>
>
>
> Am Di., 11. Aug. 2020 um 17:35 Uhr schrieb Martín Morales <
> martineduardomora...@hotmail.com>:
>
> Hi Howard.
>
>
>
> Great! That works for the crashing problem with OMPI 4.0.4. However, it
> still hangs if I remove “master” (the host which launches the spawning
> processes) from my hostfile.
>
> I need to spawn only on “worker”. Is there a way or workaround for doing this
> without mpirun?
>
> Thanks a lot for your assistance.
>
>
>
> Martín
>
>
>
>
>
>
>
>
>
> *From: *Howard Pritchard 
> *Sent: *lunes, 10 de agosto de 2020 19:13
> *To: *Martín

Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.

2020-08-14 Thread Howard Pritchard via users
Hi Martin,

I opened an issue on Open MPI's github to track this
https://github.com/open-mpi/ompi/issues/8005

You may be seeing another problem if you removed master from the host file.
Could you add the --debug-daemons option to the mpirun and post the output?

Howard


On Tue., 11 Aug. 2020 at 17:35, Martín Morales <
martineduardomora...@hotmail.com>:

> Hi Howard.
>
>
>
> Great! That works for the crashing problem with OMPI 4.0.4. However, it
> still hangs if I remove “master” (the host which launches the spawning
> processes) from my hostfile.
>
> I need to spawn only on “worker”. Is there a way or workaround for doing this
> without mpirun?
>
> Thanks a lot for your assistance.
>
>
>
> Martín
>
>
>
>
>
>
>
>
>
> *From: *Howard Pritchard 
> *Sent: *lunes, 10 de agosto de 2020 19:13
> *To: *Martín Morales 
> *Cc: *Open MPI Users 
> *Subject: *Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with
> dynamically processes allocation. OMPI 4.0.1 don't.
>
>
>
> Hi Martin,
>
>
>
> I was able to reproduce this with 4.0.x branch.  I'll open an issue.
>
>
>
> If you really want to use 4.0.4, then what you'll need to do is build an
> external PMIx 3.1.2 (the PMIx that was embedded in Open MPI 4.0.1), and
> then build Open MPI using the --with-pmix=where your pmix is installed
>
> You will also need to build both Open MPI and PMIx against the same
> libevent.   There's a configure option with both packages to use an
> external libevent installation.
>
>
>
> Howard
>
>
>
>
>
> Am Mo., 10. Aug. 2020 um 13:52 Uhr schrieb Martín Morales <
> martineduardomora...@hotmail.com>:
>
> Hi Howard. Unfortunately the issue persists in OMPI 4.0.5rc1. Do I have
> to post this on the bug section? Thanks and regards.
>
>
>
> Martín
>
>
>
> *From: *Howard Pritchard 
> *Sent: *lunes, 10 de agosto de 2020 14:44
> *To: *Open MPI Users 
> *Cc: *Martín Morales 
> *Subject: *Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with
> dynamically processes allocation. OMPI 4.0.1 don't.
>
>
>
> Hello Martin,
>
>
>
> Between Open MPI 4.0.1 and Open MPI 4.0.4 we upgraded the internal PMIx
> version that introduced a problem with spawn for the 4.0.2-4.0.4 versions.
>
> This is supposed to be fixed in the 4.0.5 release.  Could you try the
> 4.0.5rc1 tarball and see if that addresses the problem you're seeing?
>
>
>
> https://www.open-mpi.org/software/ompi/v4.0/
>
>
>
> Howard
>
>
>
>
>
>
>
> Am Do., 6. Aug. 2020 um 09:50 Uhr schrieb Martín Morales via users <
> users@lists.open-mpi.org>:
>
>
>
> Hello people!
>
> I'm using OMPI 4.0.4 in a very simple scenario: just 2 machines, one
> "master", one "worker", on an Ethernet LAN, both with Ubuntu 18.04. I built
> OMPI just like this:
>
>
>
> ./configure --prefix=/usr/local/openmpi-4.0.4/bin/
>
>
>
> My hostfile is this:
>
>
>
> master slots=2
> worker slots=2
>
>
>
> I'm trying to dynamically allocate the processes with MPI_Comm_Spawn().
>
> If I launch the processes only on the "master" machine, it's OK. But if I
> use the hostfile, it crashes with this:
>
> *--
> At least one pair of MPI processes are unable to reach each other for MPI
> communications.  This means that no Open MPI device has indicated that it
> can be used to communicate between these processes.  This is an error; Open
> MPI requires that all MPI processes be able to reach each other.  This
> error can sometimes be the result of forgetting to specify the "self" BTL.
>   Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M   Process 2
> ([[35155,1],0]) is on host: unknown!   BTLs attempted: tcp self Your MPI
> job is now going to abort; sorry.
> --
> [nos-GF7050VT-M:22526] [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file
> dpm/dpm.c at line 493
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can fail
> during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):   ompi_dpm_dyn_init() failed

Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.

2020-08-13 Thread Howard Pritchard via users
Hi Ralph,

I've not yet determined whether this is actually a PMIx issue or the way
the dpm stuff in OMPI is handling PMIx namespaces.

Howard


On Tue., 11 Aug. 2020 at 19:34, Ralph Castain via users <
users@lists.open-mpi.org>:

> Howard - if there is a problem in PMIx that is causing this problem, then
> we really could use a report on it ASAP as we are getting ready to release
> v3.1.6 and I doubt we have addressed anything relevant to what is being
> discussed here.
>
>
>
> On Aug 11, 2020, at 4:35 PM, Martín Morales via users <
> users@lists.open-mpi.org> wrote:
>
> Hi Howard.
>
> Great! That works for the crashing problem with OMPI 4.0.4. However, it
> still hangs if I remove “master” (the host which launches the spawning
> processes) from my hostfile.
> I need to spawn only on “worker”. Is there a way or workaround for doing this
> without mpirun?
> Thanks a lot for your assistance.
>
> Martín
>
>
>
>
> *From: *Howard Pritchard 
> *Sent: *lunes, 10 de agosto de 2020 19:13
> *To: *Martín Morales 
> *Cc: *Open MPI Users 
> *Subject: *Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with
> dynamically processes allocation. OMPI 4.0.1 don't.
>
> Hi Martin,
>
> I was able to reproduce this with 4.0.x branch.  I'll open an issue.
>
> If you really want to use 4.0.4, then what you'll need to do is build an
> external PMIx 3.1.2 (the PMIx that was embedded in Open MPI 4.0.1), and
> then build Open MPI using the --with-pmix=where your pmix is installed
> You will also need to build both Open MPI and PMIx against the same
> libevent.   There's a configure option with both packages to use an
> external libevent installation.
>
> Howard
>
>
> Am Mo., 10. Aug. 2020 um 13:52 Uhr schrieb Martín Morales <
> martineduardomora...@hotmail.com>:
>
> Hi Howard. Unfortunately the issue persists in OMPI 4.0.5rc1. Do I have
> to post this on the bug section? Thanks and regards.
>
>
> Martín
>
>
> *From: *Howard Pritchard 
> *Sent: *lunes, 10 de agosto de 2020 14:44
> *To: *Open MPI Users 
> *Cc: *Martín Morales 
> *Subject: *Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with
> dynamically processes allocation. OMPI 4.0.1 don't.
>
>
> Hello Martin,
>
>
> Between Open MPI 4.0.1 and Open MPI 4.0.4 we upgraded the internal PMIx
> version that introduced a problem with spawn for the 4.0.2-4.0.4 versions.
> This is supposed to be fixed in the 4.0.5 release.  Could you try the
> 4.0.5rc1 tarball and see if that addresses the problem you're seeing?
>
>
> https://www.open-mpi.org/software/ompi/v4.0/
>
>
> Howard
>
>
>
>
>
>
> Am Do., 6. Aug. 2020 um 09:50 Uhr schrieb Martín Morales via users <
> users@lists.open-mpi.org>:
>
>
> Hello people!
> I'm using OMPI 4.0.4 in a very simple scenario: just 2 machines, one
> "master", one "worker", on an Ethernet LAN, both with Ubuntu 18.04. I built
> OMPI just like this:
>
>
> ./configure --prefix=/usr/local/openmpi-4.0.4/bin/
>
>
> My hostfile is this:
>
>
> master slots=2
> worker slots=2
>
>
> I'm trying to dynamically allocate the processes with MPI_Comm_Spawn().
> If I launch the processes only on the "master" machine, it's OK. But if I
> use the hostfile, it crashes with this:
>
> --
> At least one pair of MPI processes are unable to reach each other for MPI
> communications.  This means that no Open MPI device has indicated that it
> can be used to communicate between these processes.  This is an error; Open
> MPI requires that all MPI processes be able to reach each other.  This error
> can sometimes be the result of forgetting to specify the "self" BTL.
>   Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M   Process 2
> ([[35155,1],0]) is on host: unknown!   BTLs attempted: tcp self
> Your MPI job is now going to abort; sorry.
> --
> [nos-GF7050VT-M:22526] [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file
> dpm/dpm.c at line 493
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can fail
> during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
> 

Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.

2020-08-10 Thread Howard Pritchard via users
Hi Martin,

I was able to reproduce this with 4.0.x branch.  I'll open an issue.

If you really want to use 4.0.4, then what you'll need to do is build an
external PMIx 3.1.2 (the PMIx that was embedded in Open MPI 4.0.1), and
then build Open MPI using the --with-pmix=where your pmix is installed
You will also need to build both Open MPI and PMIx against the same
libevent.   There's a configure option with both packages to use an
external libevent installation.
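
For reference, a rough sketch of the two builds being described; the install
prefixes below are illustrative assumptions, not recommendations:

  ./configure --prefix=/opt/pmix-3.1.2 --with-libevent=/usr        (for PMIx 3.1.2)
  ./configure --prefix=/opt/openmpi-4.0.4 \
              --with-pmix=/opt/pmix-3.1.2 --with-libevent=/usr     (for Open MPI 4.0.4)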

Howard


On Mon., 10 Aug. 2020 at 13:52, Martín Morales <
martineduardomora...@hotmail.com>:

> Hi Howard. Unfortunately the issue persists in OMPI 4.0.5rc1. Do I have
> to post this on the bug section? Thanks and regards.
>
>
>
> Martín
>
>
>
> *From: *Howard Pritchard 
> *Sent: *lunes, 10 de agosto de 2020 14:44
> *To: *Open MPI Users 
> *Cc: *Martín Morales 
> *Subject: *Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with
> dynamically processes allocation. OMPI 4.0.1 don't.
>
>
>
> Hello Martin,
>
>
>
> Between Open MPI 4.0.1 and Open MPI 4.0.4 we upgraded the internal PMIx
> version that introduced a problem with spawn for the 4.0.2-4.0.4 versions.
>
> This is supposed to be fixed in the 4.0.5 release.  Could you try the
> 4.0.5rc1 tarball and see if that addresses the problem you're seeing?
>
>
>
> https://www.open-mpi.org/software/ompi/v4.0/
>
>
>
> Howard
>
>
>
>
>
>
>
> Am Do., 6. Aug. 2020 um 09:50 Uhr schrieb Martín Morales via users <
> users@lists.open-mpi.org>:
>
>
>
> Hello people!
>
> I'm using OMPI 4.0.4 in a very simple scenario: just 2 machines, one
> "master", one "worker", on an Ethernet LAN, both with Ubuntu 18.04. I built
> OMPI just like this:
>
>
>
> ./configure --prefix=/usr/local/openmpi-4.0.4/bin/
>
>
>
> My hostfile is this:
>
>
>
> master slots=2
> worker slots=2
>
>
>
> I'm trying to dynamically allocate the processes with MPI_Comm_Spawn().
>
> If I launch the processes only on the "master" machine, it's OK. But if I
> use the hostfile, it crashes with this:
>
> *--
> At least one pair of MPI processes are unable to reach each other for MPI
> communications.  This means that no Open MPI device has indicated that it
> can be used to communicate between these processes.  This is an error; Open
> MPI requires that all MPI processes be able to reach each other.  This
> error can sometimes be the result of forgetting to specify the "self" BTL.
>   Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M   Process 2
> ([[35155,1],0]) is on host: unknown!   BTLs attempted: tcp self Your MPI
> job is now going to abort; sorry.
> --
> [nos-GF7050VT-M:22526] [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file
> dpm/dpm.c at line 493
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can fail
> during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):   ompi_dpm_dyn_init() failed   --> Returned "Unreachable" (-12)
> instead of "Success" (0)
> --
> [nos-GF7050VT-M:22526] *** An error occurred in MPI_Init
> [nos-GF7050VT-M:22526] *** reported by process [2303918082,1]
> [nos-GF7050VT-M:22526] *** on a NULL communicator [nos-GF7050VT-M:22526]
> *** Unknown error [nos-GF7050VT-M:22526] *** MPI_ERRORS_ARE_FATAL
> (processes in this communicator will now abort, [nos-GF7050VT-M:22526] ***
>and potentially your MPI job)*
>
>
>
> Note: host "nos-GF7050VT-M" is "worker"
>
>
>
> But if I run without "master" in the hostfile, the processes are launched but
> it hangs: MPI_Init() doesn't return.
>
> I launched the script (pasted below) in these 2 ways, with the same result:
>
>
>
> $ ./simple_spawn 2
>
> $ mpirun -np 1 ./simple_spawn 2
>
>
>
> The "simple_spawn" script:
>
&

Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.

2020-08-10 Thread Howard Pritchard via users
Hello Martin,

Between Open MPI 4.0.1 and Open MPI 4.0.4 we upgraded the internal PMIx
version that introduced a problem with spawn for the 4.0.2-4.0.4 versions.
This is supposed to be fixed in the 4.0.5 release.  Could you try the
4.0.5rc1 tarball and see if that addresses the problem you're seeing?

https://www.open-mpi.org/software/ompi/v4.0/

Howard



On Thu., 6 Aug. 2020 at 09:50, Martín Morales via users <
users@lists.open-mpi.org>:

>
>
> Hello people!
>
> I'm using OMPI 4.0.4 in a very simple scenario: just 2 machines, one
> "master", one "worker", on an Ethernet LAN, both with Ubuntu 18.04. I built
> OMPI just like this:
>
>
>
> ./configure --prefix=/usr/local/openmpi-4.0.4/bin/
>
>
>
> My hostfile is this:
>
>
>
> master slots=2
> worker slots=2
>
>
>
> I'm trying to dynamically allocate the processes with MPI_Comm_Spawn().
>
> If I launch the processes only on the "master" machine, it's OK. But if I
> use the hostfile, it crashes with this:
>
> *--
> At least one pair of MPI processes are unable to reach each other for MPI
> communications.  This means that no Open MPI device has indicated that it
> can be used to communicate between these processes.  This is an error; Open
> MPI requires that all MPI processes be able to reach each other.  This
> error can sometimes be the result of forgetting to specify the "self" BTL.
>   Process 1 ([[35155,2],1]) is on host: nos-GF7050VT-M   Process 2
> ([[35155,1],0]) is on host: unknown!   BTLs attempted: tcp self Your MPI
> job is now going to abort; sorry.
> --
> [nos-GF7050VT-M:22526] [[35155,2],1] ORTE_ERROR_LOG: Unreachable in file
> dpm/dpm.c at line 493
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can fail
> during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):   ompi_dpm_dyn_init() failed   --> Returned "Unreachable" (-12)
> instead of "Success" (0)
> --
> [nos-GF7050VT-M:22526] *** An error occurred in MPI_Init
> [nos-GF7050VT-M:22526] *** reported by process [2303918082,1]
> [nos-GF7050VT-M:22526] *** on a NULL communicator [nos-GF7050VT-M:22526]
> *** Unknown error [nos-GF7050VT-M:22526] *** MPI_ERRORS_ARE_FATAL
> (processes in this communicator will now abort, [nos-GF7050VT-M:22526] ***
>and potentially your MPI job)*
>
>
>
> Note: host "nos-GF7050VT-M" is "worker"
>
>
>
> But if I run without "master" in the hostfile, the processes are launched but
> it hangs: MPI_Init() doesn't return.
>
> I launched the script (pasted below) in these 2 ways, with the same result:
>
>
>
> $ ./simple_spawn 2
>
> $ mpirun -np 1 ./simple_spawn 2
>
>
>
> The "simple_spawn" script:
>
> *#include "mpi.h" #include  #include  int main(int
> argc, char ** argv){ int processesToRun; MPI_Comm parentcomm,
> intercomm; MPI_Info info; int rank, size, hostName_len; char
> hostName[200]; MPI_Init( ,  ); MPI_Comm_get_parent(
>  ); MPI_Comm_rank(MPI_COMM_WORLD, );
> MPI_Comm_size(MPI_COMM_WORLD, ); MPI_Get_processor_name(hostName,
> _len); if (parentcomm == MPI_COMM_NULL) {
> if(argc < 2 ){ printf("Processes number needed!");
> return 0; } processesToRun = atoi(argv[1]);
> MPI_Info_create(  ); MPI_Info_set( info, "hostfile",
> "./hostfile" ); MPI_Info_set( info, "map_by", "node" );
> MPI_Comm_spawn( argv[0], MPI_ARGV_NULL, processesToRun, info, 0,
> MPI_COMM_WORLD, , MPI_ERRCODES_IGNORE); printf("I'm the
> parent.\n"); } else { printf("I'm the spawned h: %s  r/s:
> %i/%i.\n", hostName, rank, size ); } fflush(stdout);
> MPI_Finalize(); return 0; }*
>
>
>
> I came from OMPI 4.0.1. In that version it works... with some
> inconsistencies, I'm afraid. That's why I decided to upgrade to OMPI 4.0.4.
>
> I tried several versions with no luck. Is there maybe an intrinsic problem
> with the OMPI dynamic allocation functionality?
>
> Any help will be very appreciated. Best regards.
>
>
>
> Martín
>
>
>


Re: [OMPI users] Differences 4.0.3 -> 4.0.4 (Regression?)

2020-08-08 Thread Howard Pritchard via users
Hello Michael,

Not sure what could be causing this in terms of delta between v4.0.3 and
v4.0.4.
Two things to try:

- add --debug-daemons and --mca pmix_base_verbose 100 to the mpirun line
and compare output from the v4.0.3 and v4.0.4 installs (see the example
below)
- perhaps try using the --enable-mpirun-prefix-by-default configure option
and reinstall v4.0.4
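
For the first item, the mpirun invocation might look something like this
(reusing the hostfile and binary names from the report below; illustrative only):

  /opt/openmpi/4.0.4/gcc/bin/mpirun --debug-daemons --mca pmix_base_verbose 100 \
      -np 16 -hostfile HOSTFILE_2x8 -nolocal ./OWnetbench.openmpi-4.0.4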

Howard


On Thu., 6 Aug. 2020 at 04:48, Michael Fuckner via users <
users@lists.open-mpi.org>:

> Hi,
>
> I have a small setup with one headnode and two compute nodes connected
> via IB-QDR running CentOS 8.2 and Mellanox OFED 4.9 LTS. I installed
> openmpi 3.0.6, 3.1.6, 4.0.3 and 4.0.4 with identical configuration
> (configure, compile, nothing configured in openmpi-mca-params.conf), the
> output from ompi-info and orte-info looks identical.
>
> There is a small benchmark basically just doing MPI_Send() and
> MPI_Recv(). I can invoke it directly like this (as 4.0.3 and 4.0.4)
>
> /opt/openmpi/4.0.3/gcc/bin/mpirun -np 16 -hostfile HOSTFILE_2x8 -nolocal
> ./OWnetbench.openmpi-4.0.3
>
> when running this job from slurm, it works with 4.0.3, but there is an
> error with 4.0.4. Any hint what to check?
>
>
> ### running ./OWnetbench/OWnetbench.openmpi-4.0.4 with
> /opt/openmpi/4.0.4/gcc/bin/mpirun ###
> [node002.cluster:04960] MCW rank 0 bound to socket 0[core 7[hwt 0-1]]:
> [../../../../../../../BB]
> [node002.cluster:04963] PMIX ERROR: OUT-OF-RESOURCE in file
> client/pmix_client.c at line 231
> [node002.cluster:04963] OPAL ERROR: Error in file pmix3x_client.c at
> line 112
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> [node002.cluster:04963] Local abort before MPI_INIT completed completed
> successfully, but am not able to aggregate error messages, and not able
> to guarantee that all other processes were kil
> led!
> --
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --
> --
> mpirun detected that one or more processes exited with non-zero status,
> thus causing
> the job to be terminated. The first process to do so was:
>
>Process name: [[15424,1],0]
>Exit code:1
> --
>
> Any hint why 4.0.4 behaves not like the other versions?
>
> --
> DELTA Computer Products GmbH
> Röntgenstr. 4
> D-21465 Reinbek bei Hamburg
> T: +49 40 300672-30
> F: +49 40 300672-11
> E: michael.fuck...@delta.de
>
> Internet: https://www.delta.de
> Handelsregister Lübeck HRB 3678-RE, Ust.-IdNr.: DE135110550
> Geschäftsführer: Hans-Peter Hellmann
>


Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-29 Thread Howard Pritchard via users
Collin,

A couple of things to try.  First, could you just configure without using
the Mellanox platform file and see if you can run the app with 100 or more
processes?
Another thing to try is to keep using the Mellanox platform file, but run
the app with

mpirun --mca pml ob1 -np 100 bin/xhpcg

and see if the app runs successfully.

Howard


On Mon., 27 Jan. 2020 at 09:29, Collin Strassburger <
cstrassbur...@bihrle.com>:

> Hello Howard,
>
>
>
> To remove potential interactions, I have found that the issue persists
> without ucx and hcoll support.
>
>
>
> Run command: mpirun -np 128 bin/xhpcg
>
> Output:
>
> --
>
> mpirun was unable to start the specified application as it encountered an
>
> error:
>
>
>
> Error code: 63
>
> Error name: (null)
>
> Node: Gen2Node4
>
>
>
> when attempting to start process rank 0.
>
> --
>
> 128 total processes failed to start
>
>
>
> It returns this error for any process I initialize with >100 processes per
> node.  I get the same error message for multiple different codes, so the
> error code is mpi related rather than being program specific.
>
>
>
> Collin
>
>
>
> *From:* Howard Pritchard 
> *Sent:* Monday, January 27, 2020 11:20 AM
> *To:* Open MPI Users 
> *Cc:* Collin Strassburger 
> *Subject:* Re: [OMPI users] OMPI returns error 63 on AMD 7742 when
> utilizing 100+ processors per node
>
>
>
> Hello Collin,
>
>
>
> Could you provide more information about the error?  Is there any output
> from either Open MPI or, maybe, UCX, that could provide more information
> about the problem you are hitting?
>
>
>
> Howard
>
>
>
>
>
> Am Mo., 27. Jan. 2020 um 08:38 Uhr schrieb Collin Strassburger via users <
> users@lists.open-mpi.org>:
>
> Hello,
>
>
>
> I am having difficulty with OpenMPI versions 4.0.2 and 3.1.5.  Both of
> these versions cause the same error (error code 63) when utilizing more
> than 100 cores on a single node.  The processors I am utilizing are AMD
> Epyc “Rome” 7742s.  The OS is CentOS 8.1.  I have tried compiling with both
> the default gcc 8 and locally compiled gcc 9.  I have already tried
> modifying the maximum name field values with no success.
>
>
>
> My compile options are:
>
> ./configure
>
>  --prefix=${HPCX_HOME}/ompi
>
>  --with-platform=contrib/platform/mellanox/optimized
>
>
>
> Any assistance would be appreciated,
>
> Collin
>
>
>
> Collin Strassburger
>
> Bihrle Applied Research Inc.
>
>
>
>


Re: [OMPI users] OMPI returns error 63 on AMD 7742 when utilizing 100+ processors per node

2020-01-27 Thread Howard Pritchard via users
Hello Collin,

Could you provide more information about the error?  Is there any output
from either Open MPI or, maybe, UCX, that could provide more information
about the problem you are hitting?

Howard


On Mon., 27 Jan. 2020 at 08:38, Collin Strassburger via users <
users@lists.open-mpi.org>:

> Hello,
>
>
>
> I am having difficulty with OpenMPI versions 4.0.2 and 3.1.5.  Both of
> these versions cause the same error (error code 63) when utilizing more
> than 100 cores on a single node.  The processors I am utilizing are AMD
> Epyc “Rome” 7742s.  The OS is CentOS 8.1.  I have tried compiling with both
> the default gcc 8 and locally compiled gcc 9.  I have already tried
> modifying the maximum name field values with no success.
>
>
>
> My compile options are:
>
> ./configure
>
>  --prefix=${HPCX_HOME}/ompi
>
>  --with-platform=contrib/platform/mellanox/optimized
>
>
>
> Any assistance would be appreciated,
>
> Collin
>
>
>
> Collin Strassburger
>
> Bihrle Applied Research Inc.
>
>
>


Re: [OMPI users] Do idle MPI threads consume clock cycles?

2019-02-25 Thread Howard Pritchard
Hello Mark,

You may want to checkout this package:

https://github.com/lanl/libquo

Another option would be to use an MPI_Ibarrier in the application, with all
the MPI processes except rank 0 going into a loop that waits for completion
of the barrier and sleeps between checks.  Once rank 0 has completed the
OpenMP work, it would then enter the barrier and wait for completion.
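
A minimal sketch of that Ibarrier-plus-sleep idea, with do_openmp_work()
standing in as a placeholder for the application's OpenMP section (names
here are illustrative, not part of any real API):

    /* Rank 0 runs the OpenMP-style work; every other rank parks on a
       non-blocking barrier and sleeps between completion checks so it
       does not spin on a core while rank 0 is busy. */
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>                          /* usleep() */

    static void do_openmp_work(void)             /* placeholder for the real OpenMP phase */
    {
        printf("rank 0: pretending to run the OpenMP section\n");
    }

    int main(int argc, char **argv)
    {
        MPI_Request req;
        int rank, done = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            do_openmp_work();                    /* finish the OpenMP work first */
            MPI_Ibarrier(MPI_COMM_WORLD, &req);  /* then enter the barrier ...    */
            MPI_Wait(&req, MPI_STATUS_IGNORE);   /* ... and wait for completion   */
        } else {
            MPI_Ibarrier(MPI_COMM_WORLD, &req);  /* park on the barrier ...       */
            while (!done) {
                MPI_Test(&req, &done, MPI_STATUS_IGNORE);
                if (!done)
                    usleep(1000);                /* ... sleeping between checks   */
            }
        }

        MPI_Finalize();
        return 0;
    }

How much CPU the sleeping ranks still consume depends on the sleep interval
you pick; the sleep only bounds how often MPI_Test polls the progress engine.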

This type of problem may be helped in a future MPI that supports the notion
of MPI Sessions.
With this approach, you would initialize one MPI session for normal
messaging behavior, using
polling for fast processing of messages.  Your MPI library would use this
for its existing messaging.
You could initialize a second MPI session to use blocking methods for
message receipt.  You would
use a communicator derived from the second session to do what's described
above for the loop
with sleep on an Ibarrier.

Good luck,

Howard


On Thu., 21 Feb. 2019 at 11:25, Mark McClure <
mark.w.m...@gmail.com>:

> I have the following, rather unusual, scenario...
>
> I have a program running with OpenMP on a multicore computer. At one point
> in the program, I want to use an external package that is written to
> exploit MPI, not OpenMP, parallelism. So a (rather awkward) solution could
> be to launch the program in MPI, but most of the time, everything is being
> done in a single MPI process, which is using OpenMP (ie, run my current
> program in a single MPI process). Then, when I get to the part where I need
> to use the external package, distribute out the information to all the MPI
> processes, run it across all, and then pull them back to the master
> process. This is awkward, but probably better than my current approach,
> which is running the external package on a single processor (ie, not
> exploiting parallelism in this time-consuming part of the code).
>
> If I use this strategy, I fear that the idle MPI processes may be
> consuming clock cycles while I am running the rest of the program on the
> master process with OpenMP. Thus, they may compete with the OpenMP threads.
> OpenMP does not close threads between every pragma, but OMP_WAIT_POLICY can
> be set to sleep idle threads (actually, this is the default behavior). I
> have not been able to find any equivalent documentation regarding the
> behavior of idle threads in MPI.
>
> Best regards,
> Mark
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Howard Pritchard
>
> - Adam LeBlanc
>
> On Wed, Feb 20, 2019 at 2:08 PM Jeff Squyres (jsquyres) via users <
> users@lists.open-mpi.org> wrote:
>
>> Can you try the latest 4.0.x nightly snapshot and see if the problem
>> still occurs?
>>
>> https://www.open-mpi.org/nightly/v4.0.x/
>>
>>
>> > On Feb 20, 2019, at 1:40 PM, Adam LeBlanc  wrote:
>> >
>> > I do here is the output:
>> >
>> > 2 total processes killed (some possibly by mpirun during cleanup)
>> > [pandora:12238] *** Process received signal ***
>> > [pandora:12238] Signal: Segmentation fault (11)
>> > [pandora:12238] Signal code: Invalid permissions (2)
>> > [pandora:12238] Failing at address: 0x7f5c8e31fff0
>> > [pandora:12238] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f5ca205f680]
>> > [pandora:12238] [ 1] [pandora:12237] *** Process received signal ***
>> > /usr/lib64/libc.so.6(+0x14c4a0)[0x7f5ca1dcc4a0]
>> > [pandora:12238] [ 2] [pandora:12237] Signal: Segmentation fault (11)
>> > [pandora:12237] Signal code: Invalid permissions (2)
>> > [pandora:12237] Failing at address: 0x7f6c4ab3fff0
>> > /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f5ca16fbe55]
>> > [pandora:12238] [ 3]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f5ca231798b]
>> > [pandora:12238] [ 4]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f5ca22eeda7]
>> > [pandora:12238] [ 5] IMB-MPI1[0x40b83b]
>> > [pandora:12238] [ 6] IMB-MPI1[0x407155]
>> > [pandora:12238] [ 7] IMB-MPI1[0x4022ea]
>> > [pandora:12238] [ 8]
>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f5ca1ca23d5]
>> > [pandora:12238] [ 9] IMB-MPI1[0x401d49]
>> > [pandora:12238] *** End of error message ***
>> > [pandora:12237] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6c5e73f680]
>> > [pandora:12237] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6c5e4ac4a0]
>> > [pandora:12237] [ 2]
>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6c5dddbe55]
>> > [pandora:12237] [ 3]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6c5e9f798b]
>> > [pandora:12237] [ 4]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6c5e9ceda7]
>> > [pandora:12237] [ 5] IMB-MPI1[0x40b83b]
>> > [pandora:12237] [ 6] IMB-MPI1[0x407155]
>> > [pandora:12237] [ 7] IMB-MPI1[0x4022ea]
>> > [pandora:12237] [ 8]
>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6c5e3823d5]
>> > [pandora:12237] [ 9] IMB-MPI1[0x401d49]
>> > [pandora:12237] *** End of error message ***
>> > [phoebe:07408] *** Process received signal ***
>> > [phoebe:07408] Signal: Segmentation fault (11)
>> > [phoebe:07408] Signal code: Invalid permissions (2)
>> > [phoebe:07408] Failing at address: 0x7f6b9ca9fff0
>> > [titan:07169] *** Process received signal ***
>> > [titan:07169] Signal: Segmentation fault (11)
>> > [titan:07169] Signal code: Invalid permissions (2)
>> > [titan:07169] Failing at address: 0x7fc01295fff0
>> > [phoebe:07408] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f6bb03b7680]
>> > [phoebe:07408] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f6bb01244a0]
>> > [phoebe:07408] [ 2] [titan:07169] [ 0]
>> /usr/lib64/libpthread.so.0(+0xf680)[0x7fc026117680]
>> > [titan:07169] [ 1]
>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f6bafa53e55]
>> > [phoebe:07408] [ 3] /usr/lib64/libc.so.6(+0x14c4a0)[0x7fc025e844a0]
>> > [titan:07169] [ 2]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f6bb066f98b]
>> > [phoebe:07408] [ 4]
>> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7fc0257b3e55]
>> > [titan:07169] [ 3]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f6bb0646da7]
>> > [phoebe:07408] [ 5] IMB-MPI1[0x40b83b]
>> > [phoebe:07408] [ 6] IMB-MPI1[0x407155]
>> >
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7fc0263cf98b]
>> > [titan:07169] [ 4] [phoebe:07408] [ 7] IMB-MPI1[0x4022ea]
>> > [phoebe:07408] [ 8]
>> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7fc0263a6da7]
>> > [titan:07169] [ 5] IMB-MPI1[0x40b83b]
>> > [titan:07169] [ 6] IMB-MPI1[0x407155]
>> > [titan:07169] [ 7]
>> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f6bafffa3d5]
>> > [phoebe:07408] [ 9] IMB-MPI1[0x401d49]
>> > [phoebe:07408] *** End of error message ***
>> > IMB-MP

Re: [OMPI users] OpenMPI v4.0.0 signal 11 (Segmentation fault)

2019-02-20 Thread Howard Pritchard
Hi Adam,

As a sanity check, if you try to use --mca btl self,vader,tcp

do you still see the segmentation fault?
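
For example, keeping the rest of your original command line the same
(illustrative only):

  mpirun --mca btl self,vader,tcp --mca pml ob1 --map-by node \
      -np 6 -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1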

Howard


On Wed., 20 Feb. 2019 at 08:50, Adam LeBlanc <
alebl...@iol.unh.edu>:

> Hello,
>
> When I do a run with OpenMPI v4.0.0 on Infiniband with this command:
> mpirun --mca btl_openib_warn_no_device_params_found 0 --map-by node --mca
> orte_base_help_aggregate 0 --mca btl openib,vader,self --mca pml ob1 --mca
> btl_openib_allow_ib 1 -np 6
>  -hostfile /home/aleblanc/ib-mpi-hosts IMB-MPI1
>
> I get this error:
>
> #
> # Benchmarking Reduce_scatter
> # #processes = 4
> # ( 2 additional processes waiting in MPI_Barrier)
> #
>#bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
> 0 1000 0.14 0.15 0.14
> 4 1000 5.00 7.58 6.28
> 8 1000 5.13 7.68 6.41
>16 1000 5.05 7.74 6.39
>32 1000 5.43 7.96 6.75
>64 1000 6.78 8.56 7.69
>   128 1000 7.77 9.55 8.59
>   256 1000 8.2810.96 9.66
>   512 1000 9.1912.4910.85
>  1024 100011.7815.0113.38
>  2048 100017.4119.5118.52
>  4096 100025.7328.2226.89
>  8192 100047.7549.4448.79
> 16384 100081.1090.1584.75
> 32768 1000   163.01   178.58   173.19
> 65536  640   315.63   340.51   333.18
>131072  320   475.48   528.82   510.85
>262144  160   979.70  1063.81  1035.61
>524288   80  2070.51  2242.58  2150.15
>   1048576   40  4177.36  4527.25  4431.65
>   2097152   20  8738.08  9340.50  9147.89
> [pandora:04500] *** Process received signal ***
> [pandora:04500] Signal: Segmentation fault (11)
> [pandora:04500] Signal code: Address not mapped (1)
> [pandora:04500] Failing at address: 0x7f310eb0
> [pandora:04499] *** Process received signal ***
> [pandora:04499] Signal: Segmentation fault (11)
> [pandora:04499] Signal code: Address not mapped (1)
> [pandora:04499] Failing at address: 0x7f28b110
> [pandora:04500] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f3126bef680]
> [pandora:04500] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f312695c4a0]
> [pandora:04500] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f312628be55]
> [pandora:04500] [ 3] [pandora:04499] [ 0]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f3126ea798b]
> [pandora:04500] [ 4] /usr/lib64/libpthread.so.0(+0xf680)[0x7f28c91ef680]
> [pandora:04499] [ 1]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f3126e7eda7]
> [pandora:04500] [ 5] IMB-MPI1[0x40b83b]
> [pandora:04500] [ 6] IMB-MPI1[0x407155]
> [pandora:04500] [ 7] IMB-MPI1[0x4022ea]
> [pandora:04500] [ 8] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f28c8f5c4a0]
> [pandora:04499] [ 2]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f31268323d5]
> [pandora:04500] [ 9] IMB-MPI1[0x401d49]
> [pandora:04500] *** End of error message ***
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f28c888be55]
> [pandora:04499] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f28c94a798b]
> [pandora:04499] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f28c947eda7]
> [pandora:04499] [ 5] IMB-MPI1[0x40b83b]
> [pandora:04499] [ 6] IMB-MPI1[0x407155]
> [pandora:04499] [ 7] IMB-MPI1[0x4022ea]
> [pandora:04499] [ 8]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f28c8e323d5]
> [pandora:04499] [ 9] IMB-MPI1[0x401d49]
> [pandora:04499] *** End of error message ***
> [phoebe:03779] *** Process received signal ***
> [phoebe:03779] Signal: Segmentation fault (11)
> [phoebe:03779] Signal code: Address not mapped (1)
> [phoebe:03779] Failing at address: 0x7f483d60
> [phoebe:03779] [ 0] /usr/lib64/libpthread.so.0(+0xf680)[0x7f48556c7680]
> [phoebe:03779] [ 1] /usr/lib64/libc.so.6(+0x14c4a0)[0x7f48554344a0]
> [phoebe:03779] [ 2]
> /opt/openmpi/4.0.0/lib/libopen-pal.so.40(+0x4be55)[0x7f4854d63e55]
> [phoebe:03779] [ 3]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(ompi_coll_base_reduce_scatter_intra_ring+0x23b)[0x7f485597f98b]
> [phoebe:03779] [ 4]
> /opt/openmpi/4.0.0/lib/libmpi.so.40(PMPI_Reduce_scatter+0x1c7)[0x7f4855956da7]
> [phoebe:03779] [ 5] IMB-MPI1[0x40b83b]
> [phoebe:03779] [ 6] IMB-MPI1[0x407155]
> [phoebe:03779] [ 7] 

Re: [OMPI users] Help Getting Started with Open MPI and PMIx and UCX

2019-01-20 Thread Howard Pritchard
Hi Matt

Definitely do not include the ucx option for an Omni-Path cluster.  Actually,
if you accidentally installed UCX in its default location on the
system, switch to this config option:

--with-ucx=no

Otherwise you will hit

https://github.com/openucx/ucx/issues/750

Howard


Gilles Gouaillardet  wrote on Sat., 19 Jan. 2019 at 18:41:

> Matt,
>
> There are two ways of using PMIx
>
> - if you use mpirun, then the MPI app (e.g. the PMIx client) will talk
> to mpirun and orted daemons (e.g. the PMIx server)
> - if you use SLURM srun, then the MPI app will directly talk to the
> PMIx server provided by SLURM. (note you might have to srun
> --mpi=pmix_v2 or something)
>
> In the former case, it does not matter whether you use the embedded or
> external PMIx.
> In the latter case, Open MPI and SLURM have to use compatible PMIx
> libraries, and you can either check the cross-version compatibility
> matrix,
> or build Open MPI with the same PMIx used by SLURM to be on the safe
> side (not a bad idea IMHO).
>
>
> Regarding the hang, I suggest you try different things
> - use mpirun in a SLURM job (e.g. sbatch instead of salloc so mpirun
> runs on a compute node rather than on a frontend node)
> - try something even simpler such as mpirun hostname (both with sbatch
> and salloc)
> - explicitly specify the network to be used for the wire-up. you can
> for example mpirun --mca oob_tcp_if_include 192.168.0.0/24 if this is
> the network subnet by which all the nodes (e.g. compute nodes and
> frontend node if you use salloc) communicate.
>
>
> Cheers,
>
> Gilles
>
> On Sat, Jan 19, 2019 at 3:31 AM Matt Thompson  wrote:
> >
> > On Fri, Jan 18, 2019 at 1:13 PM Jeff Squyres (jsquyres) via users <
> users@lists.open-mpi.org> wrote:
> >>
> >> On Jan 18, 2019, at 12:43 PM, Matt Thompson  wrote:
> >> >
> >> > With some help, I managed to build an Open MPI 4.0.0 with:
> >>
> >> We can discuss each of these params to let you know what they are.
> >>
> >> > ./configure --disable-wrapper-rpath --disable-wrapper-runpath
> >>
> >> Did you have a reason for disabling these?  They're generally good
> things.  What they do is add linker flags to the wrapper compilers (i.e.,
> mpicc and friends) that basically put a default path to find libraries at
> run time (that can/will in most cases override LD_LIBRARY_PATH -- but you
> can override these linked-in-default-paths if you want/need to).
> >
> >
> > I've had these in my Open MPI builds for a while now. The reason was one
> of the libraries I need for the climate model I work on went nuts if both
> of them weren't there. It was originally the rpath one but then eventually
> (Open MPI 3?) I had to add the runpath one. But I have been updating the
> libraries more aggressively recently (due to OS upgrades) so it's possible
> this is no longer needed.
> >
> >>
> >>
> >> > --with-psm2
> >>
> >> Ensure that Open MPI can include support for the PSM2 library, and
> abort configure if it cannot.
> >>
> >> > --with-slurm
> >>
> >> Ensure that Open MPI can include support for SLURM, and abort configure
> if it cannot.
> >>
> >> > --enable-mpi1-compatibility
> >>
> >> Add support for MPI_Address and other MPI-1 functions that have since
> been deleted from the MPI 3.x specification.
> >>
> >> > --with-ucx
> >>
> >> Ensure that Open MPI can include support for UCX, and abort configure
> if it cannot.
> >>
> >> > --with-pmix=/usr/nlocal/pmix/2.1
> >>
> >> Tells Open MPI to use the PMIx that is installed at
> /usr/nlocal/pmix/2.1 (instead of using the PMIx that is bundled internally
> to Open MPI's source code tree/expanded tarball).
> >>
> >> Unless you have a reason to use the external PMIx, the internal/bundled
> PMIx is usually sufficient.
> >
> >
> > Ah. I did not know that. I figured if our SLURM was built linked to a
> specific PMIx v2 that I should build Open MPI with the same PMIx. I'll
> build an Open MPI 4 without specifying this.
> >
> >>
> >>
> >> > --with-libevent=/usr
> >>
> >> Same as previous; change "pmix" to "libevent" (i.e., use the external
> libevent instead of the bundled libevent).
> >>
> >> > CC=icc CXX=icpc FC=ifort
> >>
> >> Specify the exact compilers to use.
> >>
> >> > The MPI 1 is because I need to build HDF5 eventually and I added psm2
> because it's an Omnipath cluster. The libevent was probably a red herring
> as libevent-devel wasn't installed on the system. It was eventually, and I
> just didn't remove the flag. And I saw no errors in the build!
> >>
> >> Might as well remove the --with-libevent if you don't need it.
> >>
> >> > However, I seem to have built an Open MPI that doesn't work:
> >> >
> >> > (1099)(master) $ mpirun --version
> >> > mpirun (Open MPI) 4.0.0
> >> >
> >> > Report bugs to http://www.open-mpi.org/community/help/
> >> > (1100)(master) $ mpirun -np 4 ./helloWorld.mpi3.SLES12.OMPI400.exe
> >> >
> >> > It just sits there...forever. Can the gurus here help me figure out
> what I managed to break? Perhaps I added too much to my configure 

Re: [OMPI users] Segmentation fault using openmpi-master-201901030305-ee26ed9

2019-01-04 Thread Howard Pritchard
Hi Sigmar,

I observed this problem yesterday myself and should have a fix in to master
later today.


Howard


On Fri., 4 Jan. 2019 at 05:30, Siegmar Gross <
siegmar.gr...@informatik.hs-fulda.de>:

> Hi,
>
> I've installed (tried to install) openmpi-master-201901030305-ee26ed9 on
> my "SUSE Linux Enterprise Server 12.3 (x86_64)" with gcc-7.3.0,
> icc-19.0.1.144
> pgcc-18.4-0, and Sun C 5.15 (Oracle Developer Studio 12.6). Unfortunately,
> I
> still cannot build it with Sun C and I get a segmentation fault for one of
> my small programs for the other compilers.
>
> I get the following error for Sun C that I reported some time ago.
> https://www.mail-archive.com/users@lists.open-mpi.org/msg32816.html
>
>
> The program runs as expected if I only use my local machine "loki" and it
> breaks if I add a remote machine (even if I only use the remote machine
> without "loki").
>
> loki hello_1 114 ompi_info | grep -e "Open MPI repo revision" -e"Configure
> command line"
>Open MPI repo revision: v2.x-dev-6601-gee26ed9
>Configure command line: '--prefix=/usr/local/openmpi-master_64_gcc'
> '--libdir=/usr/local/openmpi-master_64_gcc/lib64'
> '--with-jdk-bindir=/usr/local/jdk-11/bin'
> '--with-jdk-headers=/usr/local/jdk-11/include'
> 'JAVA_HOME=/usr/local/jdk-11'
> 'LDFLAGS=-m64 -L/usr/local/cuda/lib64' 'CC=gcc' 'CXX=g++' 'FC=gfortran'
> 'CFLAGS=-m64 -I/usr/local/cuda/include' 'CXXFLAGS=-m64
> -I/usr/local/cuda/include' 'FCFLAGS=-m64' 'CPP=cpp
> -I/usr/local/cuda/include'
> 'CXXCPP=cpp -I/usr/local/cuda/include' '--enable-mpi-cxx'
> '--enable-cxx-exceptions' '--enable-mpi-java'
> '--with-cuda=/usr/local/cuda'
> '--with-valgrind=/usr/local/valgrind' '--with-hwloc=internal'
> '--without-verbs'
> '--with-wrapper-cflags=-std=c11 -m64' '--with-wrapper-cxxflags=-m64'
> '--with-wrapper-fcflags=-m64' '--enable-debug'
>
>
> loki hello_1 115 mpiexec -np 4 --host loki:2,nfs2:2 hello_1_mpi
> Process 0 of 4 running on loki
> Process 1 of 4 running on loki
> Process 2 of 4 running on nfs2
> Process 3 of 4 running on nfs2
>
> Now 3 slave tasks are sending greetings.
>
> Greetings from task 1:
>message type:3
>msg length:  132 characters
> ... (complete output of my program)
>
> [nfs2:01336] *** Process received signal ***
> [nfs2:01336] Signal: Segmentation fault (11)
> [nfs2:01336] Signal code: Address not mapped (1)
> [nfs2:01336] Failing at address: 0x7feea4849268
> [nfs2:01336] [ 0] /lib64/libpthread.so.0(+0x10c10)[0x7feeacbbec10]
> [nfs2:01336] [ 1]
>
> /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0(+0x7cd34)[0x7feeadd94d34]
> [nfs2:01336] [ 2]
>
> /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0(+0x78673)[0x7feeadd90673]
> [nfs2:01336] [ 3]
>
> /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0(+0x7ac2c)[0x7feeadd92c2c]
> [nfs2:01336] [ 4]
>
> /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0(opal_finalize_cleanup_domain+0x3e)[0x7feeadd56507]
> [nfs2:01336] [ 5]
>
> /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0(opal_finalize_util+0x56)[0x7feeadd56667]
> [nfs2:01336] [ 6]
>
> /usr/local/openmpi-master_64_gcc/lib64/libopen-pal.so.0(opal_finalize+0xd3)[0x7feeadd567de]
> [nfs2:01336] [ 7]
>
> /usr/local/openmpi-master_64_gcc/lib64/libopen-rte.so.0(orte_finalize+0x1ba)[0x7feeae09d7ea]
> [nfs2:01336] [ 8]
>
> /usr/local/openmpi-master_64_gcc/lib64/libopen-rte.so.0(orte_daemon+0x3ddd)[0x7feeae0cf55d]
> [nfs2:01336] [ 9] orted[0x40086d]
> [nfs2:01336] [10] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7feeac829725]
> [nfs2:01336] [11] orted[0x400739]
> [nfs2:01336] *** End of error message ***
> Segmentation fault (core dumped)
> loki hello_1 116
>
>
> I would be grateful, if somebody can fix the problem. Do you need anything
> else? Thank you very much for any help in advance.
>
>
> Kind regards
>
> Siegmar
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Unable to build Open MPI with external PMIx library support

2018-12-17 Thread Howard Pritchard
Hi Eduardo,

The config.log looked nominal.  Could you try the following additional
options to the build with the internal PMIx builds:

--enable-orterun-prefix-by-default --disable-dlopen


?

Also, for the mpirun built using the internal PMIx,

could you check the output of ldd?


And just in case, check if the PMIX_INSTALL_PREFIX is

somehow being set?
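
For instance (illustrative commands):

  ldd $(which mpirun)
  env | grep PMIX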


Howard



On Mon., 17 Dec. 2018 at 03:29, Eduardo Rothe <
eduardo.ro...@yahoo.co.uk>:

> Hi Howard,
>
> Thank you for your reply. I have just re-executed the whole process and
> here is the config.log (in attachment to this message)!
>
> Just for restating, when I use internal PMIx I get the following error
> while running mpirun (using Open MPI 4.0.0):
>
> --
> We were unable to find any usable plugins for the BFROPS framework. This
> PMIx
> framework requires at least one plugin in order to operate. This can be
> caused
> by any of the following:
>
> * we were unable to build any of the plugins due to some combination
>   of configure directives and available system support
>
> * no plugin was selected due to some combination of MCA parameter
>   directives versus built plugins (i.e., you excluded all the plugins
>   that were built and/or could execute)
>
> * the PMIX_INSTALL_PREFIX environment variable, or the MCA parameter
>   "mca_base_component_path", is set and doesn't point to any location
>   that includes at least one usable plugin for this framework.
>
> Please check your installation and environment.
> ------
>
> Regards,
> Eduardo
>
>
> On Saturday, 15 December 2018, 18:35:44 CET, Howard Pritchard <
> hpprit...@gmail.com> wrote:
>
>
> Hi Eduardo
>
> Could you post the config.log for the build with internal PMIx so we can
> figure that out first.
>
> Howard
>
> Eduardo Rothe via users  schrieb am Fr. 14.
> Dez. 2018 um 09:41:
>
> Open MPI: 4.0.0
> PMIx: 3.0.2
> OS: Debian 9
>
> I'm building a debian package for Open MPI and either I get the following
> error messages while configuring:
>
>   undefined reference to symbol 'dlopen@@GLIBC_2.2.5'
>   undefined reference to symbol 'lt_dlopen'
>
> when using the configure option:
>
>   ./configure --with-pmix=/usr/lib/x86_64-linux-gnu/pmix
>
> or otherwise, if I use the following configure options:
>
>   ./configure --with-pmix=external
> --with-pmix-libdir=/usr/lib/x86_64-linux-gnu/pmix
>
> I have a successfull compile, but when running mpirun I get the following
> message:
>
> --
> We were unable to find any usable plugins for the BFROPS framework. This
> PMIx
> framework requires at least one plugin in order to operate. This can be
> caused
> by any of the following:
>
> * we were unable to build any of the plugins due to some combination
>   of configure directives and available system support
>
> * no plugin was selected due to some combination of MCA parameter
>   directives versus built plugins (i.e., you excluded all the plugins
>   that were built and/or could execute)
>
> * the PMIX_INSTALL_PREFIX environment variable, or the MCA parameter
>   "mca_base_component_path", is set and doesn't point to any location
>   that includes at least one usable plugin for this framework.
>
> Please check your installation and environment.
> --
>
> What I find most strange is that I get the same error message (unable to
> find
> any usable plugins for the BFROPS framework) even if I don't configure
> external PMIx support!
>
> Can someone please hint me about what's going on?
>
> Cheers!
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Unable to build Open MPI with external PMIx library support

2018-12-15 Thread Howard Pritchard
Hi Eduardo

Could you post the config.log for the build with internal PMIx so we can
figure that out first.

Howard

Eduardo Rothe via users  schrieb am Fr. 14. Dez.
2018 um 09:41:

> Open MPI: 4.0.0
> PMIx: 3.0.2
> OS: Debian 9
>
> I'm building a debian package for Open MPI and either I get the following
> error messages while configuring:
>
>   undefined reference to symbol 'dlopen@@GLIBC_2.2.5'
>   undefined reference to symbol 'lt_dlopen'
>
> when using the configure option:
>
>   ./configure --with-pmix=/usr/lib/x86_64-linux-gnu/pmix
>
> or otherwise, if I use the following configure options:
>
>   ./configure --with-pmix=external
> --with-pmix-libdir=/usr/lib/x86_64-linux-gnu/pmix
>
> I have a successful compile, but when running mpirun I get the following
> message:
>
> --
> We were unable to find any usable plugins for the BFROPS framework. This
> PMIx
> framework requires at least one plugin in order to operate. This can be
> caused
> by any of the following:
>
> * we were unable to build any of the plugins due to some combination
>   of configure directives and available system support
>
> * no plugin was selected due to some combination of MCA parameter
>   directives versus built plugins (i.e., you excluded all the plugins
>   that were built and/or could execute)
>
> * the PMIX_INSTALL_PREFIX environment variable, or the MCA parameter
>   "mca_base_component_path", is set and doesn't point to any location
>   that includes at least one usable plugin for this framework.
>
> Please check your installation and environment.
> --
>
> What I find most strange is that I get the same error message (unable to
> find
> any usable plugins for the BFROPS framework) even if I don't configure
> external PMIx support!
>
> Can someone please give me a hint about what's going on?
>
> Cheers!
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] [Open MPI Announce] Open MPI 4.0.0 Released

2018-11-14 Thread Howard Pritchard
Hi Bert,

If you'd prefer to return to the land of convenience and don't need to mix
MPI
and OpenSHMEM, then you may want to try the path I outlined in the email
archived at the following link

https://www.mail-archive.com/users@lists.open-mpi.org/msg32274.html

Howard


Am Di., 13. Nov. 2018 um 23:10 Uhr schrieb Bert Wesarg via users <
users@lists.open-mpi.org>:

> Dear Takahiro,
> On Wed, Nov 14, 2018 at 5:38 AM Kawashima, Takahiro
>  wrote:
> >
> > XPMEM moved to GitLab.
> >
> > https://gitlab.com/hjelmn/xpmem
>
> the first words from the README aren't very pleasant to read:
>
> This is an experimental version of XPMEM based on a version provided by
> Cray and uploaded to https://code.google.com/p/xpmem. This version
> supports
> any kernel 3.12 and newer. *Keep in mind there may be bugs and this version
> may cause kernel panics, code crashes, eat your cat, etc.*
>
> Installing this on my laptop, where I just want to develop with SHMEM,
> it would be a pity to lose work just because of that.
>
> Best,
> Bert
>
> >
> > Thanks,
> > Takahiro Kawashima,
> > Fujitsu
> >
> > > Hello Bert,
> > >
> > > What OS are you running on your notebook?
> > >
> > > If you are running Linux, and you have root access to your system,
> then
> > > you should be able to resolve the Open SHMEM support issue by
> installing
> > > the XPMEM device driver on your system, and rebuilding UCX so it picks
> > > up XPMEM support.
> > >
> > > The source code is on GitHub:
> > >
> > > https://github.com/hjelmn/xpmem
> > >
> > > Some instructions on how to build the xpmem device driver are at
> > >
> > > https://github.com/hjelmn/xpmem/wiki/Installing-XPMEM
> > >
> > > You will need to install the kernel source and symbols rpms on your
> > > system before building the xpmem device driver.
> > >
> > > Hope this helps,
> > >
> > > Howard
> > >
> > >
> > > Am Di., 13. Nov. 2018 um 15:00 Uhr schrieb Bert Wesarg via users <
> > > users@lists.open-mpi.org>:
> > >
> > > > Hi,
> > > >
> > > > On Mon, Nov 12, 2018 at 10:49 PM Pritchard Jr., Howard via announce
> > > >  wrote:
> > > > >
> > > > > The Open MPI Team, representing a consortium of research,
> academic, and
> > > > > industry partners, is pleased to announce the release of Open MPI
> version
> > > > > 4.0.0.
> > > > >
> > > > > v4.0.0 is the start of a new release series for Open MPI.
> Starting with
> > > > > this release, the OpenIB BTL supports only iWarp and RoCE by
> default.
> > > > > Starting with this release,  UCX is the preferred transport
> protocol
> > > > > for Infiniband interconnects. The embedded PMIx runtime has been
> updated
> > > > > to 3.0.2.  The embedded Romio has been updated to 3.2.1.  This
> > > > > release is ABI compatible with the 3.x release streams. There have
> been
> > > > numerous
> > > > > other bug fixes and performance improvements.
> > > > >
> > > > > Note that starting with Open MPI v4.0.0, prototypes for several
> > > > > MPI-1 symbols that were deleted in the MPI-3.0 specification
> > > > > (which was published in 2012) are no longer available by default in
> > > > > mpi.h. See the README for further details.
> > > > >
> > > > > Version 4.0.0 can be downloaded from the main Open MPI web site:
> > > > >
> > > > >   https://www.open-mpi.org/software/ompi/v4.0/
> > > > >
> > > > >
> > > > > 4.0.0 -- September, 2018
> > > > > 
> > > > >
> > > > > - OSHMEM updated to the OpenSHMEM 1.4 API.
> > > > > - Do not build OpenSHMEM layer when there are no SPMLs available.
> > > > >   Currently, this means the OpenSHMEM layer will only build if
> > > > >   a MXM or UCX library is found.
> > > >
> > > > so what is the most convenient way to get SHMEM working on a single
> > > > shared-memory node (i.e. a notebook)? I just realized that I don't have
> > > > a SHMEM since Open MPI 3.0. But building with UCX does not help
> > > > either. I tried with UCX 1.4 but Open MPI SHMEM
> > > > still does not work:
> > > >
> > > > $ oshcc -o shmem_hello_world-4.0.0
> openmpi-4.0.0/examples/hello_oshmem_c.c
> > > > $ oshrun -np 2 ./shmem_hello_world-4.0.0
> > > > [1542109710.217344] [tudtug:27715:0] select.c:406  UCX  ERROR
> > > > no remote registered memory access transport to tudtug:27716:
> > > > self/self - Destination is unreachable, tcp/enp0s31f6 - no put short,
> > > > tcp/wlp61s0 - no put short, mm/sysv - Destination is unreachable,
> > > > mm/posix - Destination is unreachable, cma/cma - no put short
> > > > [1542109710.217344] [tudtug:27716:0] select.c:406  UCX  ERROR
> > > > no remote registered memory access transport to tudtug:27715:
> > > > self/self - Destination is unreachable, tcp/enp0s31f6 - no put short,
> > > > tcp/wlp61s0 - no put short, mm/sysv - Destination is unreachable,
> > > > mm/posix - Destination is unreachable, cma/cma - no put short
> > > > [tudtug:27715] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:266
> > > > Error: ucp_ep_create(proc=1/2) failed: Destination is unreachable
> > > > [tudtug:27715] 

Re: [OMPI users] [Open MPI Announce] Open MPI 4.0.0 Released

2018-11-13 Thread Howard Pritchard
Hello Bert,

What OS are you running on your notebook?

If you are running Linux, and you have root access to your system,  then
you should be able to resolve the Open SHMEM support issue by installing
the XPMEM device driver on your system, and rebuilding UCX so it picks
up XPMEM support.

The source code is on GitHub:

https://github.com/hjelmn/xpmem

Some instructions on how to build the xpmem device driver are at

https://github.com/hjelmn/xpmem/wiki/Installing-XPMEM

You will need to install the kernel source and symbols rpms on your
system before building the xpmem device driver.

Hope this helps,

Howard


Am Di., 13. Nov. 2018 um 15:00 Uhr schrieb Bert Wesarg via users <
users@lists.open-mpi.org>:

> Hi,
>
> On Mon, Nov 12, 2018 at 10:49 PM Pritchard Jr., Howard via announce
>  wrote:
> >
> > The Open MPI Team, representing a consortium of research, academic, and
> > industry partners, is pleased to announce the release of Open MPI version
> > 4.0.0.
> >
> > v4.0.0 is the start of a new release series for Open MPI.  Starting with
> > this release, the OpenIB BTL supports only iWarp and RoCE by default.
> > Starting with this release,  UCX is the preferred transport protocol
> > for Infiniband interconnects. The embedded PMIx runtime has been updated
> > to 3.0.2.  The embedded Romio has been updated to 3.2.1.  This
> > release is ABI compatible with the 3.x release streams. There have been
> numerous
> > other bug fixes and performance improvements.
> >
> > Note that starting with Open MPI v4.0.0, prototypes for several
> > MPI-1 symbols that were deleted in the MPI-3.0 specification
> > (which was published in 2012) are no longer available by default in
> > mpi.h. See the README for further details.
> >
> > Version 4.0.0 can be downloaded from the main Open MPI web site:
> >
> >   https://www.open-mpi.org/software/ompi/v4.0/
> >
> >
> > 4.0.0 -- September, 2018
> > 
> >
> > - OSHMEM updated to the OpenSHMEM 1.4 API.
> > - Do not build OpenSHMEM layer when there are no SPMLs available.
> >   Currently, this means the OpenSHMEM layer will only build if
> >   a MXM or UCX library is found.
>
> so what is the most convenient way to get SHMEM working on a single
> shared-memory node (i.e. a notebook)? I just realized that I don't have
> a SHMEM since Open MPI 3.0. But building with UCX does not help
> either. I tried with UCX 1.4 but Open MPI SHMEM
> still does not work:
>
> $ oshcc -o shmem_hello_world-4.0.0 openmpi-4.0.0/examples/hello_oshmem_c.c
> $ oshrun -np 2 ./shmem_hello_world-4.0.0
> [1542109710.217344] [tudtug:27715:0] select.c:406  UCX  ERROR
> no remote registered memory access transport to tudtug:27716:
> self/self - Destination is unreachable, tcp/enp0s31f6 - no put short,
> tcp/wlp61s0 - no put short, mm/sysv - Destination is unreachable,
> mm/posix - Destination is unreachable, cma/cma - no put short
> [1542109710.217344] [tudtug:27716:0] select.c:406  UCX  ERROR
> no remote registered memory access transport to tudtug:27715:
> self/self - Destination is unreachable, tcp/enp0s31f6 - no put short,
> tcp/wlp61s0 - no put short, mm/sysv - Destination is unreachable,
> mm/posix - Destination is unreachable, cma/cma - no put short
> [tudtug:27715] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:266
> Error: ucp_ep_create(proc=1/2) failed: Destination is unreachable
> [tudtug:27715] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:305
> Error: add procs FAILED rc=-2
> [tudtug:27716] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:266
> Error: ucp_ep_create(proc=1/2) failed: Destination is unreachable
> [tudtug:27716] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:305
> Error: add procs FAILED rc=-2
> --
> It looks like SHMEM_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during SHMEM_INIT; some of which are due to configuration or
> environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open SHMEM
> developer):
>
>   SPML add procs failed
>   --> Returned "Out of resource" (-2) instead of "Success" (0)
> --
> [tudtug:27715] Error: pshmem_init.c:80 - _shmem_init() SHMEM failed to
> initialize - aborting
> [tudtug:27716] Error: pshmem_init.c:80 - _shmem_init() SHMEM failed to
> initialize - aborting
> --
> SHMEM_ABORT was invoked on rank 0 (pid 27715, host=tudtug) with errorcode
> -1.
> --
> --
> A SHMEM process is aborting at a time when it cannot guarantee that all
> of its peer processes in the job will be killed properly.  You should
> 

Re: [OMPI users] [EXTERNAL] Re: OpenMPI 3.1.0 Lock Up on POWER9 w/ CUDA9.2

2018-07-02 Thread Howard Pritchard
HI Si,

Could you add --disable-builtin-atomics

to the configure options and see if the hang goes away?
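For example, keeping your existing options and just appending the flag (a
sketch, not a definitive recipe):

  ./configure <your existing configure options> --disable-builtin-atomics
  make clean && make -j && make check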

Howard


2018-07-02 8:48 GMT-06:00 Jeff Squyres (jsquyres) via users <
users@lists.open-mpi.org>:

> Simon --
>
> You don't currently have another Open MPI installation in your PATH /
> LD_LIBRARY_PATH, do you?
>
> I have seen dependency library loads cause "make check" to get confused,
> and instead of loading the libraries from the build tree, actually load
> some -- but not all -- of the required OMPI/ORTE/OPAL/etc. libraries from
> an installation tree.  Hilarity ensues (to include symptoms such as running
> forever).
>
> Can you double check that you have no Open MPI libraries in your
> LD_LIBRARY_PATH before running "make check" on the build tree?
>
>
>
> > On Jun 30, 2018, at 3:18 PM, Hammond, Simon David via users <
> users@lists.open-mpi.org> wrote:
> >
> > Nathan,
> >
> > Same issue with OpenMPI 3.1.1 on POWER9 with GCC 7.2.0 and CUDA9.2.
> >
> > S.
> >
> > --
> > Si Hammond
> > Scalable Computer Architectures
> > Sandia National Laboratories, NM, USA
> > [Sent from remote connection, excuse typos]
> >
> >
> > On 6/16/18, 10:10 PM, "Nathan Hjelm"  wrote:
> >
> >Try the latest nightly tarball for v3.1.x. Should be fixed.
> >
> >> On Jun 16, 2018, at 5:48 PM, Hammond, Simon David via users <
> users@lists.open-mpi.org> wrote:
> >>
> >> The output from the test in question is:
> >>
> >> Single thread test. Time: 0 s 10182 us 10 nsec/poppush
> >> Atomics thread finished. Time: 0 s 169028 us 169 nsec/poppush
> >> 
> >>
> >> S.
> >>
> >> --
> >> Si Hammond
> >> Scalable Computer Architectures
> >> Sandia National Laboratories, NM, USA
> >> [Sent from remote connection, excuse typos]
> >>
> >>
> >> On 6/16/18, 5:45 PM, "Hammond, Simon David"  wrote:
> >>
> >>   Hi OpenMPI Team,
> >>
> >>   We have recently updated an install of OpenMPI on POWER9 system
> (configuration details below). We migrated from OpenMPI 2.1 to OpenMPI 3.1.
> We seem to have a symptom where code that ran before is now locking up and
> making no progress, getting stuck in wait-all operations. While I think
> it's prudent for us to root cause this a little more, I have gone back and
> rebuilt MPI and re-run the "make check" tests. The opal_fifo test appears
> to hang forever. I am not sure if this is the cause of our issue but wanted
> to report that we are seeing this on our system.
> >>
> >>   OpenMPI 3.1.0 Configuration:
> >>
> >>   ./configure --prefix=/home/projects/ppc64le-pwr9-nvidia/openmpi/3.
> 1.0-nomxm/gcc/7.2.0/cuda/9.2.88 --with-cuda=$CUDA_ROOT --enable-mpi-java
> --enable-java --with-lsf=/opt/lsf/10.1 --with-lsf-libdir=/opt/lsf/10.
> 1/linux3.10-glibc2.17-ppc64le/lib --with-verbs
> >>
> >>   GCC versions are 7.2.0, built by our team. CUDA is 9.2.88 from NVIDIA
> for POWER9 (standard download from their website). We enable IBM's JDK
> 8.0.0.
> >>   RedHat: Red Hat Enterprise Linux Server release 7.5 (Maipo)
> >>
> >>   Output:
> >>
> >>   make[3]: Entering directory `/home/sdhammo/openmpi/
> openmpi-3.1.0/test/class'
> >>   make[4]: Entering directory `/home/sdhammo/openmpi/
> openmpi-3.1.0/test/class'
> >>   PASS: ompi_rb_tree
> >>   PASS: opal_bitmap
> >>   PASS: opal_hash_table
> >>   PASS: opal_proc_table
> >>   PASS: opal_tree
> >>   PASS: opal_list
> >>   PASS: opal_value_array
> >>   PASS: opal_pointer_array
> >>   PASS: opal_lifo
> >>   
> >>
> >>   Output from Top:
> >>
> >>   20   0   73280   4224   2560 S 800.0  0.0  17:22.94 lt-opal_fifo
> >>
> >>   --
> >>   Si Hammond
> >>   Scalable Computer Architectures
> >>   Sandia National Laboratories, NM, USA
> >>   [Sent from remote connection, excuse typos]
> >>
> >>
> >>
> >>
> >> ___
> >> users mailing list
> >> users@lists.open-mpi.org
> >> https://lists.open-mpi.org/mailman/listinfo/users
> >
> >
> > ___
> > users mailing list
> > users@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/users
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] A couple of general questions

2018-06-14 Thread Howard Pritchard
Hello Charles

You are heading in the right direction.

First you might want to run the libfabric fi_info command to see what
capabilities you picked up from the libfabric RPMs.
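For example (fi_info ships with the libfabric RPMs; the provider name in the
second command is just an example):

  fi_info -l          # list the providers this libfabric build knows about
  fi_info -p verbs    # show the capabilities of one provider, if present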

Next, you may well not actually be using the OFI MTL.

Could you run your app with

export OMPI_MCA_mtl_base_verbose=100

and post the output?

It would also help if you described the system you are using: OS,
interconnect, CPU type, etc.

Howard

Charles A Taylor  schrieb am Do. 14. Juni 2018 um 06:36:

> Because of the issues we are having with OpenMPI and the openib BTL
> (questions previously asked), I’ve been looking into what other transports
> are available.  I was particularly interested in OFI/libfabric support but
> cannot find any information on it more recent than a reference to the usNIC
> BTL from 2015 (Jeff Squyres, Cisco).  Unfortunately, the openmpi-org
> website FAQ’s covering OpenFabrics support don’t mention anything beyond
> OpenMPI 1.8.  Given that 3.1 is the current stable version, that seems odd.
>
> That being the case, I thought I’d ask here. After laying down the
> libfabric-devel RPM and building (3.1.0) with —with-libfabric=/usr, I end
> up with an “ofi” MTL but nothing else.   I can run with OMPI_MCA_mtl=ofi
> and OMPI_MCA_btl=“self,vader,openib” but it eventually crashes in
> libopen-pal.so.   (mpi_waitall() higher up the stack).
>
> GIZMO:9185 terminated with signal 11 at PC=2b4d4b68a91d SP=7ffcfbde9ff0.
> Backtrace:
>
> /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libopen-pal.so.40(+0x9391d)[0x2b4d4b68a91d]
>
> /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libopen-pal.so.40(opal_progress+0x24)[0x2b4d4b632754]
>
> /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libmpi.so.40(ompi_request_default_wait_all+0x11f)[0x2b4d47be2a6f]
>
> /apps/mpi/intel/2018.1.163/openmpi/3.1.0/lib64/libmpi.so.40(PMPI_Waitall+0xbd)[0x2b4d47c2ce4d]
>
> Questions: Am I using the OFI MTL as intended?   Should there be an “ofi”
> BTL?   Does anyone use this?
>
> Thanks,
>
> Charlie Taylor
> UF Research Computing
>
> PS - If you could use some help updating the FAQs, I’d be willing to put
> in some time.  I’d probably learn a lot.
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Problem running with UCX/oshmem on single node?

2018-05-09 Thread Howard Pritchard
Hi Craig,

You are experiencing problems because you don't have a transport installed
that UCX can use for oshmem.

You either need to go and buy a ConnectX-4/5 HCA from Mellanox (and maybe a
switch), and install that
on your system, or else install xpmem (https://github.com/hjelmn/xpmem).
Note there is a bug right now
in UCX that you may hit if you try to go the xpmem-only route:

https://github.com/open-mpi/ompi/issues/5083
and
https://github.com/openucx/ucx/issues/2588

If you are just running on a single node and want to experiment with the
OpenSHMEM programming model,
and do not have Mellanox mlx5 equipment installed on the node, you are much
better off trying to use SOS
over OFI libfabric:

https://github.com/Sandia-OpenSHMEM/SOS
https://github.com/ofiwg/libfabric/releases

For SOS you will need to install the hydra launcher as well:

http://www.mpich.org/downloads/

I really wish google would do a better job at hitting my responses about
this type of problem.  I seem to
respond every couple of months to this exact problem on this mail list.


Howard


2018-05-09 13:11 GMT-06:00 Craig Reese :

>
> I'm trying to play with oshmem on a single node (just to have a way to do
> some simple
> experimentation and playing around) and having spectacular problems:
>
> CentOS 6.9 (gcc 4.4.7)
> built and installed ucx 1.3.0
> built and installed openmpi-3.1.0
>
> [cfreese]$ cat oshmem.c
>
> #include 
> int
> main() {
> shmem_init();
> }
>
> [cfreese]$ mpicc oshmem.c -loshmem
>
> [cfreese]$ shmemrun -np 2 ./a.out
>
> [ucs1l:30118] mca: base: components_register: registering framework spml
> components
> [ucs1l:30118] mca: base: components_register: found loaded component ucx
> [ucs1l:30119] mca: base: components_register: registering framework spml
> components
> [ucs1l:30119] mca: base: components_register: found loaded component ucx
> [ucs1l:30119] mca: base: components_register: component ucx register
> function successful
> [ucs1l:30118] mca: base: components_register: component ucx register
> function successful
> [ucs1l:30119] mca: base: components_open: opening spml components
> [ucs1l:30119] mca: base: components_open: found loaded component ucx
> [ucs1l:30118] mca: base: components_open: opening spml components
> [ucs1l:30118] mca: base: components_open: found loaded component ucx
> [ucs1l:30119] mca: base: components_open: component ucx open function
> successful
> [ucs1l:30118] mca: base: components_open: component ucx open function
> successful
> [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:107 -
> mca_spml_base_select() select: initializing spml component ucx
> [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173
> - mca_spml_ucx_component_init() in ucx, my priority is 21
> [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:107 -
> mca_spml_base_select() select: initializing spml component ucx
> [ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:173
> - mca_spml_ucx_component_init() in ucx, my priority is 21
> [ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184
> - mca_spml_ucx_component_init() *** ucx initialized 
> [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:119 -
> mca_spml_base_select() select: init returned priority 21
> [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:160 -
> mca_spml_base_select() selected ucx best priority 21
> [ucs1l:30118] ../../../../oshmem/mca/spml/base/spml_base_select.c:194 -
> mca_spml_base_select() select: component ucx selected
> [ucs1l:30118] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 -
> mca_spml_ucx_enable() *** ucx ENABLED 
> [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx_component.c:184
> - mca_spml_ucx_component_init() *** ucx initialized 
> [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:119 -
> mca_spml_base_select() select: init returned priority 21
> [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:160 -
> mca_spml_base_select() selected ucx best priority 21
> [ucs1l:30119] ../../../../oshmem/mca/spml/base/spml_base_select.c:194 -
> mca_spml_base_select() select: component ucx selected
> [ucs1l:30119] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:82 -
> mca_spml_ucx_enable() *** ucx ENABLED 
>
> here's where I think the real issue is
>
> [1525891910.424102] [ucs1l:30119:0] select.c:316  UCX  ERROR no
> remote registered memory access transport to : mm/posix -
> Destination is unreachable, mm/sysv - Destination is unreachable, tcp/eth0
> - no put short, self/self - Destination is unreachable
> [1525891910.424104] [ucs1l:30118:0] select.c:316  UCX  ERROR no
> remote registered memory access transport to : mm/posix -
> Destination is unreachable, mm/sysv - Destination is unreachable, tcp/eth0
> - no put short, self/self - Destination is unreachable
>
> [ucs1l:30119] Error 

Re: [OMPI users] Debug build of v3.0.1 tarball

2018-05-04 Thread Howard Pritchard
HI Adam,

I think you'll have better luck setting the CFLAGS on the configure line.

try

./configure CFLAGS="-g -O0" your other configury options.
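For example, a sketch that combines this with the options from your original
configure line (plus the --enable-debug flag discussed elsewhere in this
thread):

  ./configure --prefix=$installdir --enable-debug CFLAGS="-g -O0" \
      --disable-silent-rules --disable-new-dtags --enable-mpi-cxx \
      --enable-cxx-exceptions --with-pmi
  make -j VERBOSE=1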

Howard


2018-05-04 12:09 GMT-06:00 Moody, Adam T. <mood...@llnl.gov>:

> Hi Howard,
>
> I do have a make clean after the configure.  To be extra safe, I’m now
> also deleting the source directory and untarring for each build to make
> sure I have a clean starting point.
>
>
>
> I do get a successful build if I add --enable-debug to configure and then
> do a simple make that has no CFLAGS or LDFLAGS:
>
>
>
> make -j VERBOSE=1
>
>
>
> So that’s good.  However, looking at the compile lines that were used, I
> see a -g but no -O0.  I’m trying to force the -g -O0, because our debuggers
> show the best info at that optimization level.
>
>
>
> If I then also add a CFLAGS="-g -O0" to my make command, I see the "-g
> -O0" in the compile lines, but then the pthread link error shows up:
>
>
>
> make -j CFLAGS="-g -O0" VERBOSE=1
>
>
>
>   CC   opal_wrapper.o
>
>   GENERATE opal_wrapper.1
>
>   CCLD opal_wrapper
>
> ../../../opal/.libs/libopen-pal.so: undefined reference to
> `pthread_atfork'
>
> collect2: error: ld returned 1 exit status
>
> make[2]: *** [opal_wrapper] Error 1
>
>
>
> Also setting LDFLAGS fixes that up.  Just wondering whether I’m going
> about it the right way in trying to get -g -O0 in the build.
>
>
>
> Thanks for your help,
>
> -Adam
>
>
>
> *From: *users <users-boun...@lists.open-mpi.org> on behalf of Howard
> Pritchard <hpprit...@gmail.com>
> *Reply-To: *Open MPI Users <users@lists.open-mpi.org>
> *Date: *Friday, May 4, 2018 at 7:46 AM
> *To: *Open MPI Users <users@lists.open-mpi.org>
> *Subject: *Re: [OMPI users] Debug build of v3.0.1 tarball
>
>
>
> HI Adam,
>
>
>
> Sorry didn't notice you did try the --enable-debug flag.  That should not
> have
>
> led to the link error building the opal dso.  Did you do a make clean after
>
> rerunning configure?
>
>
>
> Howard
>
>
>
>
>
> 2018-05-04 8:22 GMT-06:00 Howard Pritchard <hpprit...@gmail.com>:
>
> Hi Adam,
>
>
>
> Did you try using the --enable-debug configure option along with your
> CFLAGS options?
>
> You may want to see if that simplifies your build.
>
>
>
> In any case, we'll fix the problems you found.
>
>
>
> Howard
>
>
>
>
>
> 2018-05-03 15:00 GMT-06:00 Moody, Adam T. <mood...@llnl.gov>:
>
> Hello Open MPI team,
>
> I'm looking for the recommended way to produce a debug build of Open MPI
> v3.0.1 that compiles with “-g -O0” so that I get accurate debug info under
> a debugger.
>
> So far, I've gone through the following sequence.  I started with
> CFLAGS="-g -O0" on make:
>
> shell$ ./configure --prefix=$installdir --disable-silent-rules \
>
>   --disable-new-dtags --enable-mpi-cxx --enable-cxx-exceptions --with-pmi
>
> shell$ make -j CFLAGS="-g -O0" VERBOSE=1
>
> That led to the following error:
>
> In file included from ../../../../opal/util/arch.h:26:0,
>
>  from btl_openib.h:43,
>
>  from btl_openib_component.c:79:
>
> btl_openib_component.c: In function 'progress_pending_frags_wqe':
>
> btl_openib_component.c:3351:29: error: 'opal_list_item_t' has no member named 
> 'opal_list_item_refcount'
>
>  assert(0 == frag->opal_list_item_refcount);
>
>  ^
>
> make[2]: *** [btl_openib_component.lo] Error 1
>
> make[2]: *** Waiting for unfinished jobs
>
> make[2]: Leaving directory `.../openmpi-3.0.1/opal/mca/btl/openib'
>
> So it seems the assert is referring to a field structure that is protected
> by a debug flag.  I then added --enable-debug to configure, which led to:
>
> make[2]: Entering directory `.../openmpi-3.0.1/opal/tools/wrappers'
>
>   CC   opal_wrapper.o
>
>   GENERATE opal_wrapper.1
>
>   CCLD opal_wrapper
>
> ../../../opal/.libs/libopen-pal.so: undefined reference to `pthread_atfork'
>
> collect2: error: ld returned 1 exit status
>
> make[2]: *** [opal_wrapper] Error 1
>
> make[2]: Leaving directory `.../openmpi-3.0.1/opal/tools/wrappers'
>
> Finally, if I also add LDFLAGS="-lpthread" to make, I get a build:
>
> shell$ ./configure --prefix=$installdir --enable-debug --disable-silent-rules 
> \
>
>   --disable-new-dtags --enable-mpi-cxx --enable-cxx-exceptions --with-pmi
>
> shell$ make -j CFLAGS="-g -O0" LDFLAGS="-lpthread" VERBOSE=1
>
> Am I doing this correct

Re: [OMPI users] Debug build of v3.0.1 tarball

2018-05-04 Thread Howard Pritchard
HI Adam,

Sorry didn't notice you did try the --enable-debug flag.  That should not
have
led to the link error building the opal dso.  Did you do a make clean after
rerunning configure?

Howard


2018-05-04 8:22 GMT-06:00 Howard Pritchard <hpprit...@gmail.com>:

> Hi Adam,
>
> Did you try using the --enable-debug configure option along with your
> CFLAGS options?
> You may want to see if that simplifies your build.
>
> In any case, we'll fix the problems you found.
>
> Howard
>
>
> 2018-05-03 15:00 GMT-06:00 Moody, Adam T. <mood...@llnl.gov>:
>
>> Hello Open MPI team,
>>
>> I'm looking for the recommended way to produce a debug build of Open MPI
>> v3.0.1 that compiles with “-g -O0” so that I get accurate debug info under
>> a debugger.
>>
>> So far, I've gone through the following sequence.  I started with
>> CFLAGS="-g -O0" on make:
>>
>> shell$ ./configure --prefix=$installdir --disable-silent-rules \
>>
>>   --disable-new-dtags --enable-mpi-cxx --enable-cxx-exceptions --with-pmi
>>
>> shell$ make -j CFLAGS="-g -O0" VERBOSE=1
>>
>> That led to the following error:
>>
>> In file included from ../../../../opal/util/arch.h:26:0,
>>
>>  from btl_openib.h:43,
>>
>>  from btl_openib_component.c:79:
>>
>> btl_openib_component.c: In function 'progress_pending_frags_wqe':
>>
>> btl_openib_component.c:3351:29: error: 'opal_list_item_t' has no member 
>> named 'opal_list_item_refcount'
>>
>>  assert(0 == frag->opal_list_item_refcount);
>>
>>  ^
>>
>> make[2]: *** [btl_openib_component.lo] Error 1
>>
>> make[2]: *** Waiting for unfinished jobs
>>
>> make[2]: Leaving directory `.../openmpi-3.0.1/opal/mca/btl/openib'
>>
>> So it seems the assert is referring to a field structure that is
>> protected by a debug flag.  I then added --enable-debug to configure, which
>> led to:
>>
>> make[2]: Entering directory `.../openmpi-3.0.1/opal/tools/wrappers'
>>
>>   CC   opal_wrapper.o
>>
>>   GENERATE opal_wrapper.1
>>
>>   CCLD opal_wrapper
>>
>> ../../../opal/.libs/libopen-pal.so: undefined reference to `pthread_atfork'
>>
>> collect2: error: ld returned 1 exit status
>>
>> make[2]: *** [opal_wrapper] Error 1
>>
>> make[2]: Leaving directory `.../openmpi-3.0.1/opal/tools/wrappers'
>>
>> Finally, if I also add LDFLAGS="-lpthread" to make, I get a build:
>>
>> shell$ ./configure --prefix=$installdir --enable-debug 
>> --disable-silent-rules \
>>
>>   --disable-new-dtags --enable-mpi-cxx --enable-cxx-exceptions --with-pmi
>>
>> shell$ make -j CFLAGS="-g -O0" LDFLAGS="-lpthread" VERBOSE=1
>>
>> Am I doing this correctly?
>>
>> Is there a pointer to the configure/make flags for this?
>>
>> I did find this page that describes the developer build from a git clone,
>> but that seemed a bit overkill since I am looking for a debug build from
>> the distribution tarball instead of the git clone (avoid the autotools
>> nightmare):
>>
>> https://www.open-mpi.org/source/building.php
>>
>> Thanks.
>>
>> -Adam
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Debug build of v3.0.1 tarball

2018-05-04 Thread Howard Pritchard
Hi Adam,

Did you try using the --enable-debug configure option along with your
CFLAGS options?
You may want to see if that simplifies your build.

In any case, we'll fix the problems you found.

Howard


2018-05-03 15:00 GMT-06:00 Moody, Adam T. :

> Hello Open MPI team,
>
> I'm looking for the recommended way to produce a debug build of Open MPI
> v3.0.1 that compiles with “-g -O0” so that I get accurate debug info under
> a debugger.
>
> So far, I've gone through the following sequence.  I started with
> CFLAGS="-g -O0" on make:
>
> shell$ ./configure --prefix=$installdir --disable-silent-rules \
>
>   --disable-new-dtags --enable-mpi-cxx --enable-cxx-exceptions --with-pmi
>
> shell$ make -j CFLAGS="-g -O0" VERBOSE=1
>
> That led to the following error:
>
> In file included from ../../../../opal/util/arch.h:26:0,
>
>  from btl_openib.h:43,
>
>  from btl_openib_component.c:79:
>
> btl_openib_component.c: In function 'progress_pending_frags_wqe':
>
> btl_openib_component.c:3351:29: error: 'opal_list_item_t' has no member named 
> 'opal_list_item_refcount'
>
>  assert(0 == frag->opal_list_item_refcount);
>
>  ^
>
> make[2]: *** [btl_openib_component.lo] Error 1
>
> make[2]: *** Waiting for unfinished jobs
>
> make[2]: Leaving directory `.../openmpi-3.0.1/opal/mca/btl/openib'
>
> So it seems the assert is referring to a field structure that is protected
> by a debug flag.  I then added --enable-debug to configure, which led to:
>
> make[2]: Entering directory `.../openmpi-3.0.1/opal/tools/wrappers'
>
>   CC   opal_wrapper.o
>
>   GENERATE opal_wrapper.1
>
>   CCLD opal_wrapper
>
> ../../../opal/.libs/libopen-pal.so: undefined reference to `pthread_atfork'
>
> collect2: error: ld returned 1 exit status
>
> make[2]: *** [opal_wrapper] Error 1
>
> make[2]: Leaving directory `.../openmpi-3.0.1/opal/tools/wrappers'
>
> Finally, if I also add LDFLAGS="-lpthread" to make, I get a build:
>
> shell$ ./configure --prefix=$installdir --enable-debug --disable-silent-rules 
> \
>
>   --disable-new-dtags --enable-mpi-cxx --enable-cxx-exceptions --with-pmi
>
> shell$ make -j CFLAGS="-g -O0" LDFLAGS="-lpthread" VERBOSE=1
>
> Am I doing this correctly?
>
> Is there a pointer to the configure/make flags for this?
>
> I did find this page that describes the developer build from a git clone,
> but that seemed a bit overkill since I am looking for a debug build from
> the distribution tarball instead of the git clone (avoid the autotools
> nightmare):
>
> https://www.open-mpi.org/source/building.php
>
> Thanks.
>
> -Adam
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Eager RDMA causing slow osu_bibw with 3.0.0

2018-04-05 Thread Howard Pritchard
Hello Ben,

Thanks for the info.   You would probably be better off installing UCX on
your cluster and rebuilding your Open MPI with the
--with-ucx
configure option.
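A minimal sketch of the rebuild, assuming UCX is installed under
/path/to/ucx (the path is only a placeholder; keep your existing configure
options):

  ./configure --with-ucx=/path/to/ucx <your other configure options>
  make -j && make install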

Here's what I'm seeing with Open MPI 3.0.1 on a ConnectX5 based cluster
using ob1/openib BTL:

mpirun -map-by ppr:1:node -np 2 ./osu_bibw

# OSU MPI Bi-Directional Bandwidth Test v5.1

# Size  Bandwidth (MB/s)

1   0.00

2   0.00

4   0.01

8   0.02

16  0.04

32  0.07

64  0.13

128   273.64

256   485.04

512   869.51

1024 1434.99

2048 2208.12

4096 3055.67

8192 3896.93

16384  89.29

32768 252.59

65536 614.42

131072  22878.74

262144  23846.93

524288  24256.23

1048576 24498.27

2097152 24615.64

4194304 24632.58


export OMPI_MCA_pml=ucx

# OSU MPI Bi-Directional Bandwidth Test v5.1

# Size  Bandwidth (MB/s)

1   4.57

2   8.95

4  17.67

8  35.99

16 71.99

32                141.56

64                208.86

128   410.32

256   495.56

512  1455.98

1024 2414.78

2048 3008.19

4096 5351.62

8192 5563.66

16384            5945.16

32768            6061.33

65536   21376.89

131072  23462.99

262144  24064.56

524288  24366.84

1048576 24550.75

2097152 24649.03

4194304 24693.77

You can get ucx off of GitHub

https://github.com/openucx/ucx/releases


There is also a pre-release version of UCX (1.3.0RCX?) packaged as an RPM

available in MOFED 4.3.  See


http://www.mellanox.com/page/products_dyn?product_family=26=linux_sw_drivers


I was using UCX 1.2.2 for the results above.


Good luck,


Howard




2018-04-05 1:12 GMT-06:00 Ben Menadue :

> Hi,
>
> Another interesting point. I noticed that the last two message sizes
> tested (2MB and 4MB) are lower than expected for both osu_bw and osu_bibw.
> Increasing the minimum size to use the RDMA pipeline to above these sizes
> brings those two data-points up to scratch for both benchmarks:
>
> *3.0.0, osu_bw, no rdma for large messages*
>
> > mpirun -mca btl_openib_min_rdma_pipeline_size 4194304 -map-by
> ppr:1:node -np 2 -H r6,r7 ./osu_bw -m 2097152:4194304
> # OSU MPI Bi-Directional Bandwidth Test v5.4.0
> # Size  Bandwidth (MB/s)
> 2097152  6133.22
> 4194304  6054.06
>
> *3.0.0, osu_bibw, eager rdma disabled, no rdma for large messages*
>
> > mpirun -mca btl_openib_min_rdma_pipeline_size 4194304 -mca
> btl_openib_use_eager_rdma 0 -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bibw -m
> 2097152:4194304
> # OSU MPI Bi-Directional Bandwidth Test v5.4.0
> # Size  Bandwidth (MB/s)
> 2097152 11397.85
> 4194304 11389.64
>
> This makes me think something odd is going on in the RDMA pipeline.
>
> Cheers,
> Ben
>
>
>
> On 5 Apr 2018, at 5:03 pm, Ben Menadue  wrote:
>
> Hi,
>
> We’ve just been running some OSU benchmarks with OpenMPI 3.0.0 and noticed
> that *osu_bibw* gives nowhere near the bandwidth I’d expect (this is on
> FDR IB). However, *osu_bw* is fine.
>
> If I disable eager RDMA, then *osu_bibw* gives the expected
> numbers. Similarly, if I increase the number of eager RDMA buffers, it
> gives the expected results.
>
> OpenMPI 1.10.7 gives consistent, reasonable numbers with default settings,
> but they’re not as good as 3.0.0 (when tuned) for large buffers. The same
> option changes produce no difference in the performance for 1.10.7.
>
> I was wondering if anyone else has noticed anything similar, and if this
> is unexpected, if anyone has a suggestion on how to investigate further?
>
> Thanks,
> Ben
>
>
> Here’s are the numbers:
>
> *3.0.0, osu_bw, default settings*
>
> > mpirun -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bw
> # OSU MPI Bandwidth Test v5.4.0
> # Size  Bandwidth (MB/s)
> 1   1.13
> 2   2.29
> 4   4.63
> 8   9.21
> 16 18.18
> 32 36.46
> 64 69.95
> 128   128.55
> 256   250.74
> 512   451.54
> 1024  829.44
> 2048 1475.87
> 4096 2119.99
> 8192 3452.37
> 16384            2866.51
> 32768            4048.17
> 65536            5030.54
> 131072   5573.81
> 262144   5861.61
> 524288   6015.15
> 1048576

Re: [OMPI users] OpenMPI with Portals4 transport

2018-02-08 Thread Howard Pritchard
HI Brian,

Thanks for the info.  I'm not sure I quite get the response, though.  Is
the race condition in the way the Open MPI Portals4 MTL is using Portals,
or is it a problem in the Portals implementation itself?

Howard


2018-02-08 9:20 GMT-07:00 D. Brian Larkins <brianlark...@gmail.com>:

> Howard,
>
> Looks like ob1 is working fine. When I looked into the problems with ob1,
> it looked like the progress thread was polling the Portals event queue
> before it had been initialized.
>
> b.
>
> $ mpirun -n 2 --mca pml ob1 --mca btl self,vader,openib osu_latency
> WARNING: Ummunotify not found: Not using ummunotify can result in
> incorrect results download and install ummunotify from:
>  http://support.systemfabricworks.com/downloads/ummunotify/
> ummunotify-v2.tar.bz2
> WARNING: Ummunotify not found: Not using ummunotify can result in
> incorrect results download and install ummunotify from:
>  http://support.systemfabricworks.com/downloads/ummunotify/
> ummunotify-v2.tar.bz2
> # OSU MPI Latency Test
> # Size            Latency (us)
> 0 1.87
> 1 1.93
> 2 1.90
> 4 1.94
> 8 1.94
> 161.96
> 321.97
> 641.99
> 128   2.43
> 256   2.50
> 512   2.71
> 1024  3.01
> 2048  3.45
> 4096  4.56
> 8192  6.39
> 16384 8.79
> 32768            11.50
> 65536            16.59
> 131072   27.10
> 262144   46.97
> 524288   87.55
> 1048576     168.89
> 2097152 331.40
> 4194304 654.08
>
>
> On Feb 7, 2018, at 9:04 PM, Howard Pritchard <hpprit...@gmail.com> wrote:
>
> HI Brian,
>
> As a sanity check, can you see if the ob1 pml works okay, i.e.
>
>  mpirun -n 2 --mca pml ob1 --mca btl self,vader,openib ./osu_latency
>
> Howard
>
>
> 2018-02-07 11:03 GMT-07:00 brian larkins <brianlark...@gmail.com>:
>
>> Hello,
>>
>> I’m doing some work with Portals4 and am trying to run some MPI programs
>> using the Portals 4 as the transport layer. I’m running into problems and
>> am hoping that someone can help me figure out how to get things working.
>> I’m using OpenMPI 3.0.0 with the following configuration:
>>
>> ./configure CFLAGS=-pipe --prefix=path/to/install --enable-picky
>> --enable-debug --enable-mpi-fortran --with-portals4=path/to/portals4
>> --disable-oshmem --disable-vt --disable-java --disable-mpi-io
>> --disable-io-romio --disable-libompitrace --disable-btl-portals4-flow-control
>> --disable-mtl-portals4-flow-control
>>
>> I have also tried the head from the git repo and 2.1.2 with the same
>> results. A simpler configure line (with --prefix and --with-portals4=) also gets
>> same results.
>>
>> Portals4 configuration is from github master and configured thus:
>>
>> ./configure --prefix=path/to/portals4 --with-ev=path/to/libev
>> --enable-transport-ib --enable-fast --enable-zero-mrs --enable-me-triggered
>>
>> If I specify the cm pml on the command-line, I can get examples/hello_c
>> to run correctly. Trying to get some latency numbers using the OSU
>> benchmarks is where my trouble begins:
>>
>> $ mpirun -n 2 --mca mtl portals4  --mca pml cm env
>> PTL_DISABLE_MEM_REG_CACHE=1 ./osu_latency
>> NOTE: Ummunotify and IB registered mem cache disabled, set
>> PTL_DISABLE_MEM_REG_CACHE=0 to re-enable.
>> NOTE: Ummunotify and IB registered mem cache disabled, set
>> PTL_DISABLE_MEM_REG_CACHE=0 to re-enable.
>> # OSU MPI Latency Test
>> # Size            Latency (us)
>> 0                       25.96
>> [node41:19740] *** An error occurred in MPI_Barrier
>> [node41:19740] *** reported by process [139815819542529,4294967297]
>> [node41:19740] *** on communicator MPI_COMM_WORLD
>> [node41:19740] *** MPI_ERR_OTHER: known error not in list
>> [node41:19740] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
>> will now abort,
>> [node41:19740] ***and potentially your MPI job)
>>
>> Not specifying CM gets an earlier segfault (defaults to ob1) and looks to
>> be a progress thread initialization problem.
>> Using PTL_IGNORE_UMMUNOTIFY=1  gets here:
>>
>> $ mpirun --mca pml cm -n 2 env PTL_IGNORE_UMMUNOTIFY=1 ./osu_latency
>> # OSU MPI Latency Test
>> # Size            Latency (us)
>> 0   

Re: [OMPI users] Using OpenSHMEM with Shared Memory

2018-02-07 Thread Howard Pritchard
HI Ben,

I'm afraid this is bad news for using UCX.  The problem is that when UCX
was configured/built, it did not
find a transport for doing one-sided put/get transfers.  If you're feeling
lucky, you may want to
install xpmem (https://github.com/hjelmn/xpmem) and rebuild UCX.  This
requires building a device driver against
your kernel source and taking steps to get the xpmem.ko loaded into the
kernel, etc.

There's an alternative however which works just fine on a laptop running
linux or osx. Check out

https://github.com/Sandia-OpenSHMEM/SOS/releases

and get the 1.4.0 release.

For build/install, follow the directions at

https://github.com/Sandia-OpenSHMEM/SOS/wiki/OFI-Build-Instructions

Note you will also need to install the MPICH hydra launcher as well.

Sandia OpenSHMEM over OFI libfabric uses TCP sockets as the fallback if
nothing else
is available.  I use this version of OpenSHMEM if I'm doing SHMEM stuff on
my Mac (no VMs).

Howard


2018-02-07 12:49 GMT-07:00 Benjamin Brock :

>
> Here's what I get with those environment variables:
>
> https://hastebin.com/ibimipuden.sql
>
> I'm running Arch Linux (but with OpenMPI/UCX installed from source as
> described in my earlier message).
>
> Ben
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI with Portals4 transport

2018-02-07 Thread Howard Pritchard
HI Brian,

As a sanity check, can you see if the ob1 pml works okay, i.e.

 mpirun -n 2 --mca pml ob1 --mca btl self,vader,openib ./osu_latency

Howard


2018-02-07 11:03 GMT-07:00 brian larkins :

> Hello,
>
> I’m doing some work with Portals4 and am trying to run some MPI programs
> using the Portals 4 as the transport layer. I’m running into problems and
> am hoping that someone can help me figure out how to get things working.
> I’m using OpenMPI 3.0.0 with the following configuration:
>
> ./configure CFLAGS=-pipe --prefix=path/to/install --enable-picky
> --enable-debug --enable-mpi-fortran --with-portals4=path/to/portals4
> --disable-oshmem --disable-vt --disable-java --disable-mpi-io
> --disable-io-romio --disable-libompitrace --disable-btl-portals4-flow-control
> --disable-mtl-portals4-flow-control
>
> I have also tried the head from the git repo and 2.1.2 with the same
> results. A simpler configure line (with --prefix and --with-portals4=) also gets
> same results.
>
> Portals4 configuration is from github master and configured thus:
>
> ./configure --prefix=path/to/portals4 --with-ev=path/to/libev
> --enable-transport-ib --enable-fast --enable-zero-mrs --enable-me-triggered
>
> If I specify the cm pml on the command-line, I can get examples/hello_c to
> run correctly. Trying to get some latency numbers using the OSU benchmarks
> is where my trouble begins:
>
> $ mpirun -n 2 --mca mtl portals4  --mca pml cm env
> PTL_DISABLE_MEM_REG_CACHE=1 ./osu_latency
> NOTE: Ummunotify and IB registered mem cache disabled, set
> PTL_DISABLE_MEM_REG_CACHE=0 to re-enable.
> NOTE: Ummunotify and IB registered mem cache disabled, set
> PTL_DISABLE_MEM_REG_CACHE=0 to re-enable.
> # OSU MPI Latency Test
> # Size            Latency (us)
> 0                       25.96
> [node41:19740] *** An error occurred in MPI_Barrier
> [node41:19740] *** reported by process [139815819542529,4294967297]
> [node41:19740] *** on communicator MPI_COMM_WORLD
> [node41:19740] *** MPI_ERR_OTHER: known error not in list
> [node41:19740] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
> will now abort,
> [node41:19740] ***and potentially your MPI job)
>
> Not specifying CM gets an earlier segfault (defaults to ob1) and looks to
> be a progress thread initialization problem.
> Using PTL_IGNORE_UMMUNOTIFY=1  gets here:
>
> $ mpirun --mca pml cm -n 2 env PTL_IGNORE_UMMUNOTIFY=1 ./osu_latency
> # OSU MPI Latency Test
> # Size            Latency (us)
> 0                       24.14
> 1                       26.24
> [node41:19993] *** Process received signal ***
> [node41:19993] Signal: Segmentation fault (11)
> [node41:19993] Signal code: Address not mapped (1)
> [node41:19993] Failing at address: 0x141
> [node41:19993] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7fa6ac73b710]
> [node41:19993] [ 1] /ascldap/users/dblarki/opt/portals4.master/lib/
> libportals.so.4(+0xcd65)[0x7fa69b770d65]
> [node41:19993] [ 2] /ascldap/users/dblarki/opt/portals4.master/lib/
> libportals.so.4(PtlPut+0x143)[0x7fa69b773fb3]
> [node41:19993] [ 3] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_
> portals4.so(+0xa961)[0x7fa698cf5961]
> [node41:19993] [ 4] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_
> portals4.so(+0xb0e5)[0x7fa698cf60e5]
> [node41:19993] [ 5] /ascldap/users/dblarki/opt/ompi/lib/openmpi/mca_mtl_
> portals4.so(ompi_mtl_portals4_send+0x90)[0x7fa698cf61d1]
> [node41:19993] [ 6] /ascldap/users/dblarki/opt/
> ompi/lib/openmpi/mca_pml_cm.so(+0x5430)[0x7fa69a794430]
> [node41:19993] [ 7] /ascldap/users/dblarki/opt/ompi/lib/libmpi.so.40(PMPI_
> Send+0x2b4)[0x7fa6ac9ff018]
> [node41:19993] [ 8] ./osu_latency[0x40106f]
> [node41:19993] [ 9] /lib64/libc.so.6(__libc_start_
> main+0xfd)[0x7fa6ac3b6d5d]
> [node41:19993] [10] ./osu_latency[0x400c59]
>
> This cluster is running RHEL 6.5 without ummunotify modules, but I get the
> same results on a local (small) cluster running ubuntu 16.04 with
> ummunotify loaded.
>
> Any help would be much appreciated.
> thanks,
>
> brian.
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Using OpenSHMEM with Shared Memory

2018-02-07 Thread Howard Pritchard
HI Ben,

Could you set these environment variables and post the output ?

export OMPI_MCA_spml=ucx
export OMPI_MCA_spml_base_verbose=100

then run your test?

Also,  what OS are you using?

Howard


2018-02-06 20:10 GMT-07:00 Jeff Hammond :

>
> On Tue, Feb 6, 2018 at 3:58 PM Benjamin Brock 
> wrote:
>
>> How can I run an OpenSHMEM program just using shared memory?  I'd like to
>> use OpenMPI to run SHMEM programs locally on my laptop.
>>
>
> It’s not Open-MPI itself but OSHMPI sits on top of any MPI-3 library and
> has a mode to bypass MPI for one-sided if only used within a shared-memory
> domain.
>
>
> See https://github.com/jeffhammond/oshmpi and use --enable-smp-optimizations.
> While I don’t actively maintain it and it doesn’t support the latest spec,
> I’ll fix bugs and implement features on demand if users file GitHub issues.
>
> Sorry for the shameless self-promotion but I know a few folks who use
> OSHMPI specifically because of the SMP feature.
>
> Sandia OpenSHMEM with OFI definitely works on shared-memory as well. I use
> it for all of my Travis CI testing of SHMEM code on both Mac and Linux.
>
> Jeff
>
>
>> I understand that the old SHMEM component (Yoda?) was taken out, and that
>> UCX is now required.  I have a build of OpenMPI with UCX as per the
>> directions on this random GitHub Page
>> 
>> .
>>
>> When I try to just `shmemrun`, I get a complaint about not haivng any
>> splm components available.
>>
>> [xiii@shini kmer_hash]$ shmemrun -np 2 ./kmer_generic_hash
>> 
>> --
>> No available spml components were found!
>>
>> This means that there are no components of this type installed on your
>> system or all the components reported that they could not be used.
>>
>> This is a fatal error; your SHMEM process is likely to abort.  Check the
>> output of the "ompi_info" command and ensure that components of this
>> type are available on your system.  You may also wish to check the
>> value of the "component_path" MCA parameter and ensure that it has at
>> least one directory that contains valid MCA components.
>> 
>> --
>> [shini:16341] SPML ikrit cannot be selected
>> [shini:16342] SPML ikrit cannot be selected
>> [shini:16336] 1 more process has sent help message
>> help-oshmem-memheap.txt / find-available:none-found
>> [shini:16336] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>> all help / error messages
>>
>>
>> I tried fiddling with the MCA command-line settings, but didn't have any
>> luck.  Is it possible to do this?  Can anyone point me to some
>> documentation?
>>
>> Thanks,
>>
>> Ben
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] About my GPU performance using Openmpi-2.0.4

2017-12-13 Thread Howard Pritchard
Hi Phanikumar

It’s unlikely the warning message you are seeing is related to GPU
performance.  Have you tried adding

--with-verbs=no

to your config line?  That should quash the openib complaint.
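For example, based on the configure line you posted below (the prefix is a
placeholder for whatever install location you used originally; keep your
other options):

  ./configure --prefix=<your install prefix> --with-verbs=no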

Howard

Phanikumar Pentyala  schrieb am Mo. 11. Dez. 2017
um 22:43:

> Dear users and developers,
>
> Currently I am using two Tesla K40m cards for my computational work on
> quantum espresso (QE) suite http://www.quantum-espresso.org/. My
> GPU-enabled QE code runs much slower than the normal version. When I
> submit my job on the GPU, it shows the error: "A high-performance
> Open MPI point-to-point messaging module was unable to find any relevant
> network interfaces:
>
> Module: OpenFabrics (openib)
>   Host: qmel
>
> Another transport will be used instead, although this may result in
> lower performance.
>
> Is this the reason for the diminished GPU performance?
>
> I done installation by
>
> 1. ./configure --prefix=/home//software/openmpi-2.0.4
> --disable-openib-dynamic-sl --disable-openib-udcm --disable-openib-rdmacm"
> because we don't have any InfiniBand adapter (HCA) in the server.
>
> 2. make all
>
> 3. make install
>
> Please correct me if I made any mistake in my installation, or do I have to
> use an InfiniBand adapter to use Open MPI?
>
> I read a lot of posts on the Open MPI forum about removing the above error
> when submitting a job. I added the "--mca btl ^openib" option, but it was of
> no use: the error vanished but the performance was the same.
>
> Current details of server are:
>
> Server: FUJITSU PRIMERGY RX2540 M2
> CUDA version: 9.0
> openmpi version: 2.0.4 with intel mkl libraries
> QE-gpu version (my application): 5.4.0
>
> P.S: Extra information attached
>
> Thanks in advance
>
> Regards
> Phanikumar
> Research scholar
> IIT Kharagpur
> Kharagpur, westbengal
> India
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] [EXTERNAL] Re: Using shmem_int_fadd() in OpenMPI\'s SHMEM

2017-11-22 Thread Howard Pritchard
Hi Ben,

Actually I did some checking about the brew install for OFI libfabric.
It looks like if your brew is up to date, it will pick up libfabric 1.5.2.

Howard


2017-11-22 15:21 GMT-07:00 Howard Pritchard <hpprit...@gmail.com>:

> HI Ben,
>
> Even on one box, the yoda component doesn't work any more.
>
> If you want to do OpenSHMEM programming on your MacBook Pro (like I do)
> and you don't want to set up a VM to use UCX, then you can use
> the Sandia OpenSHMEM implementation.
>
> https://github.com/Sandia-OpenSHMEM/SOS
>
> You will need to install the MPICH hydra launcher
>
> http://www.mpich.org/downloads/versions/
>
> as the SOS needs that for its oshrun launcher.
>
> I use hydra-3.2 on my mac with SOS.
>
> You will also need to install OFI libfabric:
>
> https://github.com/ofiwg/libfabric
>
> I'd suggest installing the OFI 1.5.1 tarball.  OFI is also available via
> brew
> but it's so old that I doubt it will work with recent versions of SOS.
>
> If you'd like to use UCX, you'll need to install it and Open MPI on a VM
> running a Linux distro.
>
> Howard
>
>
> 2017-11-21 12:47 GMT-07:00 Benjamin Brock <br...@cs.berkeley.edu>:
>
>> > What version of Open MPI are you trying to use?
>>
>> Open MPI 2.1.1-2 as distributed by Arch Linux.
>>
>> > Also, could you describe something about your system.
>>
>> This is all in shared memory on a MacBook Pro; no networking involved.
>>
>> The seg fault with the code example above looks like this:
>>
>> [xiii@shini kmer_hash]$ g++ minimal.cpp -o minimal `shmemcc
>> --showme:link`
>> [xiii@shini kmer_hash]$ !shm
>> shmemrun -n 2 ./minimal
>> [shini:08284] *** Process received signal ***
>> [shini:08284] Signal: Segmentation fault (11)
>> [shini:08284] Signal code: Address not mapped (1)
>> [shini:08284] Failing at address: 0x18
>> [shini:08284] [ 0] /usr/lib/libpthread.so.0(+0x11da0)[0x7f06fb763da0]
>> [shini:08284] [ 1] /usr/lib/openmpi/openmpi/mca_s
>> pml_yoda.so(mca_spml_yoda_get+0x7da)[0x7f06e0eef0aa]
>> [shini:08284] [ 2] /usr/lib/openmpi/openmpi/mca_a
>> tomic_basic.so(atomic_basic_lock+0xb2)[0x7f06e08d90d2]
>> [shini:08284] [ 3] /usr/lib/openmpi/openmpi/mca_a
>> tomic_basic.so(mca_atomic_basic_fadd+0x4a)[0x7f06e08d949a]
>> [shini:08284] [ 4] /usr/lib/openmpi/liboshmem.so.
>> 20(shmem_int_fadd+0x90)[0x7f06fc5a7660]
>> [shini:08284] [ 5] ./minimal(+0x94f)[0x55a5cde7e94f]
>> [shini:08284] [ 6] /usr/lib/libc.so.6(__libc_star
>> t_main+0xea)[0x7f06fb3baf6a]
>> [shini:08284] [ 7] ./minimal(+0x80a)[0x55a5cde7e80a]
>> [shini:08284] *** End of error message ***
>> 
>> --
>> shmemrun noticed that process rank 1 with PID 0 on node shini exited on
>> signal 11 (Segmentation fault).
>> 
>> --
>>
>> Cheers,
>>
>> Ben
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
>>
>
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] [EXTERNAL] Re: Using shmem_int_fadd() in OpenMPI\'s SHMEM

2017-11-22 Thread Howard Pritchard
HI Ben,

Even on one box, the yoda component doesn't work any more.

If you want to do OpenSHMEM programming on your MacBook Pro (like I do)
and you don't want to set up a VM to use UCX, then you can use
the Sandia OpenSHMEM implementation.

https://github.com/Sandia-OpenSHMEM/SOS

You will need to install the MPICH hydra launcher

http://www.mpich.org/downloads/versions/

as the SOS needs that for its oshrun launcher.

I use hydra-3.2 on my mac with SOS.

You will also need to install OFI libfabric:

https://github.com/ofiwg/libfabric

I'd suggest installing the OFI 1.5.1 tarball.  OFI is also available via
brew
but it's so old that I doubt it will work with recent versions of SOS.

If you'd like to use UCX, you'll need to install it and Open MPI on a VM
running a Linux distro.

Howard


2017-11-21 12:47 GMT-07:00 Benjamin Brock :

> > What version of Open MPI are you trying to use?
>
> Open MPI 2.1.1-2 as distributed by Arch Linux.
>
> > Also, could you describe something about your system.
>
> This is all in shared memory on a MacBook Pro; no networking involved.
>
> The seg fault with the code example above looks like this:
>
> [xiii@shini kmer_hash]$ g++ minimal.cpp -o minimal `shmemcc --showme:link`
> [xiii@shini kmer_hash]$ !shm
> shmemrun -n 2 ./minimal
> [shini:08284] *** Process received signal ***
> [shini:08284] Signal: Segmentation fault (11)
> [shini:08284] Signal code: Address not mapped (1)
> [shini:08284] Failing at address: 0x18
> [shini:08284] [ 0] /usr/lib/libpthread.so.0(+0x11da0)[0x7f06fb763da0]
> [shini:08284] [ 1] /usr/lib/openmpi/openmpi/mca_s
> pml_yoda.so(mca_spml_yoda_get+0x7da)[0x7f06e0eef0aa]
> [shini:08284] [ 2] /usr/lib/openmpi/openmpi/mca_a
> tomic_basic.so(atomic_basic_lock+0xb2)[0x7f06e08d90d2]
> [shini:08284] [ 3] /usr/lib/openmpi/openmpi/mca_a
> tomic_basic.so(mca_atomic_basic_fadd+0x4a)[0x7f06e08d949a]
> [shini:08284] [ 4] /usr/lib/openmpi/liboshmem.so.
> 20(shmem_int_fadd+0x90)[0x7f06fc5a7660]
> [shini:08284] [ 5] ./minimal(+0x94f)[0x55a5cde7e94f]
> [shini:08284] [ 6] /usr/lib/libc.so.6(__libc_star
> t_main+0xea)[0x7f06fb3baf6a]
> [shini:08284] [ 7] ./minimal(+0x80a)[0x55a5cde7e80a]
> [shini:08284] *** End of error message ***
> --
> shmemrun noticed that process rank 1 with PID 0 on node shini exited on
> signal 11 (Segmentation fault).
> --
>
> Cheers,
>
> Ben
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] [EXTERNAL] Re: Using shmem_int_fadd() in OpenMPI's SHMEM

2017-11-22 Thread Howard Pritchard
HI Folks,

For the Open MPI 2.1.1 release, the only OSHMEM SPMLs that work are ikrit
and ucx; yoda doesn't work.

Ikrit only works on systems with Mellanox interconnects and requires MXM
to be installed.
This combination is recommended for systems with ConnectX-3 or older HCAs.
For systems with ConnectX-4 or ConnectX-5 you should be using UCX.

You'll need to add --with-ucx (plus arguments as required) to the configure
command line when you build Open MPI/OSHMEM so that the ucx components get
picked up.
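
As a concrete sketch (the install prefix and UCX location below are only
placeholders, not taken from this thread), that might look like:

  ./configure --prefix=$HOME/ompi-oshmem --with-ucx=/usr/local/ucx
  make -j 8 install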

A gotcha is that by default, the ucx spml is not selected, so either on the
oshrun
command line add

--mca spml ucx

or via env. variable

export OMPI_MCA_spml=ucx
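
For example, a complete launch of your reproducer might then look like this
(the process count is just an example):

  oshrun --mca spml ucx -np 2 ./minimal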

I verified that a 2.1.1 release + UCX 1.2.0 builds your test (after fixing
the unusual include files) and that it passes on my Mellanox ConnectX-5
cluster.

Howard


2017-11-21 8:24 GMT-07:00 Hammond, Simon David <sdha...@sandia.gov>:

> Hi Howard/OpenMPI Users,
>
>
>
> I have had a similar seg-fault this week using OpenMPI 2.1.1 with GCC
> 4.9.3 so I tried to compile the example code in the email below. I see
> similar behavior to a small benchmark we have in house (but using inc not
> finc).
>
>
>
> When I run on a single node (both PE’s on the same node) I get the error
> below. But, if I run on multiple nodes (say 2 nodes with one PE per node)
> then the code runs fine. Same thing for my benchmark which uses
> shmem_longlong_inc. For reference, we are using InfiniBand on our cluster
> and dual-socket Haswell processors.
>
>
>
> Hope that helps,
>
>
>
> S.
>
>
>
> $ shmemrun -n 2 ./testfinc
>
> --
>
> WARNING: There is at least non-excluded one OpenFabrics device found,
>
> but there are no active ports detected (or Open MPI was unable to use
>
> them).  This is most certainly not what you wanted.  Check your
>
> cables, subnet manager configuration, etc.  The openib BTL will be
>
> ignored for this job.
>
>
>
>   Local host: shepard-lsm1
>
> --
>
> [shepard-lsm1:49505] *** Process received signal ***
>
> [shepard-lsm1:49505] Signal: Segmentation fault (11)
>
> [shepard-lsm1:49505] Signal code: Address not mapped (1)
>
> [shepard-lsm1:49505] Failing at address: 0x18
>
> [shepard-lsm1:49505] [ 0] /lib64/libpthread.so.0(+0xf710)[0x7ffc4cd9e710]
>
> [shepard-lsm1:49505] [ 1] /home/projects/x86-64-haswell/
> openmpi/2.1.1/gcc/4.9.3/lib/openmpi/mca_spml_yoda.so(mca_
> spml_yoda_get+0x86d)[0x7ffc337cf37d]
>
> [shepard-lsm1:49505] [ 2] /home/projects/x86-64-haswell/
> openmpi/2.1.1/gcc/4.9.3/lib/openmpi/mca_atomic_basic.so(
> atomic_basic_lock+0x9a)[0x7ffc32f190aa]
>
> [shepard-lsm1:49505] [ 3] /home/projects/x86-64-haswell/
> openmpi/2.1.1/gcc/4.9.3/lib/openmpi/mca_atomic_basic.so(
> mca_atomic_basic_fadd+0x39)[0x7ffc32f19409]
>
> [shepard-lsm1:49505] [ 4] /home/projects/x86-64-haswell/
> openmpi/2.1.1/gcc/4.9.3/lib/liboshmem.so.20(shmem_int_
> fadd+0x80)[0x7ffc4d2fc110]
>
> [shepard-lsm1:49505] [ 5] ./testfinc[0x400888]
>
> [shepard-lsm1:49505] [ 6] /lib64/libc.so.6(__libc_start_
> main+0xfd)[0x7ffc4ca19d5d]
>
> [shepard-lsm1:49505] [ 7] ./testfinc[0x400739]
>
> [shepard-lsm1:49505] *** End of error message ***
>
> --
>
> shmemrun noticed that process rank 1 with PID 0 on node shepard-lsm1
> exited on signal 11 (Segmentation fault).
>
> --
>
> [shepard-lsm1:49499] 1 more process has sent help message
> help-mpi-btl-openib.txt / no active ports found
>
> [shepard-lsm1:49499] Set MCA parameter "orte_base_help_aggregate" to 0 to
> see all help / error messages
>
>
>
> --
>
> Si Hammond
>
> Scalable Computer Architectures
>
> Sandia National Laboratories, NM, USA
>
>
>
>
>
> *From: *users <users-boun...@lists.open-mpi.org> on behalf of Howard
> Pritchard <hpprit...@gmail.com>
> *Reply-To: *Open MPI Users <users@lists.open-mpi.org>
> *Date: *Monday, November 20, 2017 at 4:11 PM
> *To: *Open MPI Users <users@lists.open-mpi.org>
> *Subject: *[EXTERNAL] Re: [OMPI users] Using shmem_int_fadd() in
> OpenMPI's SHMEM
>
>
>
> HI Ben,
>
>
>
> What version of Open MPI are you trying to use?
>
>
>
> Also, could you describe something about your system?  If it's a cluster,
>
> what sort of interconnect is being used?
>
>
>
> Howard
>
>
>
>
>
> 2017-11-20 14:13 GMT-07:00 Benjamin Brock <br...@cs.berkeley.edu>:
>
> What's the proper way to use

Re: [OMPI users] Using shmem_int_fadd() in OpenMPI's SHMEM

2017-11-20 Thread Howard Pritchard
HI Ben,

What version of Open MPI are you trying to use?

Also, could you describe something about your system?  If it's a cluster,
what sort of interconnect is being used?

Howard


2017-11-20 14:13 GMT-07:00 Benjamin Brock :

> What's the proper way to use shmem_int_fadd() in OpenMPI's SHMEM?
>
> A minimal example seems to seg fault:
>
> #include <stdio.h>
> #include <stdlib.h>
>
> #include <shmem.h>
>
> int main(int argc, char **argv) {
>   shmem_init();
>   const size_t shared_segment_size = 1024;
>   void *shared_segment = shmem_malloc(shared_segment_size);
>
>   int *arr = (int *) shared_segment;
>   int *local_arr = (int *) malloc(sizeof(int) * 10);
>
>   if (shmem_my_pe() == 1) {
>     shmem_int_fadd((int *) shared_segment, 1, 0);
>   }
>   shmem_barrier_all();
>
>   return 0;
> }
>
>
> Where am I going wrong here?  This sort of thing works in Cray SHMEM.
>
> Ben Bock
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] Problems building OpenMPI 2.1.1 on Intel KNL

2017-11-20 Thread Howard Pritchard
Hello Ake,

Would you mind opening an issue on Github so we can track this?

https://github.com/open-mpi/ompi/issues

There's a template to show what info we need to fix this.

Thanks very much for reporting this,

Howard


2017-11-20 3:26 GMT-07:00 Åke Sandgren :

> Hi!
>
> When the xppsl-libmemkind-dev package version 1.5.3 is installed
> building OpenMPI fails.
>
> opal/mca/mpool/memkind uses the macro MEMKIND_NUM_BASE_KIND which has
> been moved to memkind/internal/memkind_private.h
>
> Current master is also using that so I think that will also fail.
>
> Is anyone working on this?
>
> --
> Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
> Internet: a...@hpc2n.umu.se   Phone: +46 90 7866134 Fax: +46 90-580 14
> Mobile: +46 70 7716134 WWW: http://www.hpc2n.umu.se
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] OMPI 2.1.2 and SLURM compatibility

2017-11-17 Thread Howard Pritchard
Hello Bennet,

What you are trying to do using srun as the job launcher should work.
Could you post the contents
of /etc/slurm/slurm.conf for your system?

Could you also post the output of the following command:

ompi_info --all | grep pmix

to the mailing list.

the config.log from your build would also be useful.

Howard

2017-11-16 9:30 GMT-07:00 r...@open-mpi.org :

> What Charles said was true but not quite complete. We still support the
> older PMI libraries but you likely have to point us to wherever slurm put
> them.
>
> However, we definitely recommend using PMIx, as you will get a faster launch.
>
> Sent from my iPad
>
> > On Nov 16, 2017, at 9:11 AM, Bennet Fauber  wrote:
> >
> > Charlie,
> >
> > Thanks a ton!  Yes, we are missing two of the three steps.
> >
> > Will report back after we get pmix installed and after we rebuild
> > Slurm.  We do have a new enough version of it, at least, so we might
> > have missed the target, but we did at least hit the barn.  ;-)
> >
> >
> >
> >> On Thu, Nov 16, 2017 at 10:54 AM, Charles A Taylor 
> wrote:
> >> Hi Bennet,
> >>
> >> Three things...
> >>
> >> 1. OpenMPI 2.x requires PMIx in lieu of pmi1/pmi2.
> >>
> >> 2. You will need slurm 16.05 or greater built with --with-pmix
> >>
> >> 2a. You will need pmix 1.1.5 which you can get from github.
> >> (https://github.com/pmix/tarballs).
> >>
> >> 3. then, to launch your mpi tasks on the allocated resources,
> >>
> >>   srun --mpi=pmix ./hello-mpi
> >>
> >> I’m replying to the list because,
> >>
> >> a) this information is harder to find than you might think.
> >> b) someone/anyone can correct me if I'm giving a bum steer.
> >>
> >> Hope this helps,
> >>
> >> Charlie Taylor
> >> University of Florida
> >>
> >> On Nov 16, 2017, at 10:34 AM, Bennet Fauber  wrote:
> >>
> >> I think that OpenMPI is supposed to support SLURM integration such that
> >>
> >>   srun ./hello-mpi
> >>
> >> should work?  I built OMPI 2.1.2 with
> >>
> >> export CONFIGURE_FLAGS='--disable-dlopen --enable-shared'
> >> export COMPILERS='CC=gcc CXX=g++ FC=gfortran F77=gfortran'
> >>
> >> CMD="./configure \
> >>   --prefix=${PREFIX} \
> >>   --mandir=${PREFIX}/share/man \
> >>   --with-slurm \
> >>   --with-pmi \
> >>   --with-lustre \
> >>   --with-verbs \
> >>   $CONFIGURE_FLAGS \
> >>   $COMPILERS
> >>
> >> I have a simple hello-mpi.c (source included below), which compiles
> >> and runs with mpirun, both on the login node and in a job.  However,
> >> when I try to use srun in place of mpirun, I get instead a hung job,
> >> which upon cancellation produces this output.
> >>
> >> [bn2.stage.arc-ts.umich.edu:116377] PMI_Init [pmix_s1.c:162:s1_init]:
> >> PMI is not initialized
> >> [bn1.stage.arc-ts.umich.edu:36866] PMI_Init [pmix_s1.c:162:s1_init]:
> >> PMI is not initialized
> >> [warn] opal_libevent2022_event_active: event has no event_base set.
> >> [warn] opal_libevent2022_event_active: event has no event_base set.
> >> slurmstepd: error: *** STEP 86.0 ON bn1 CANCELLED AT
> 2017-11-16T10:03:24 ***
> >> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> >> slurmstepd: error: *** JOB 86 ON bn1 CANCELLED AT 2017-11-16T10:03:24
> ***
> >>
> >> The SLURM web page suggests that OMPI 2.x and later support PMIx, and
> >> to use `srun --mpi=pmix`; however, that no longer seems to be an
> >> option, and using the `openmpi` type isn't working (neither is pmi2).
> >>
> >> [bennet@beta-build hello]$ srun --mpi=list
> >> srun: MPI types are...
> >> srun: mpi/pmi2
> >> srun: mpi/lam
> >> srun: mpi/openmpi
> >> srun: mpi/mpich1_shmem
> >> srun: mpi/none
> >> srun: mpi/mvapich
> >> srun: mpi/mpich1_p4
> >> srun: mpi/mpichgm
> >> srun: mpi/mpichmx
> >>
> >> To get the Intel PMI to work with srun, I have to set
> >>
> >>   I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
> >>
> >> Is there a comparable environment variable that must be set to enable
> >> `srun` to work?
> >>
> >> Am I missing a build option or misspecifying one?
> >>
> >> -- bennet
> >>
> >>
> >> Source of hello-mpi.c
> >> ==
> >> #include <stdio.h>
> >> #include <stdlib.h>
> >> #include "mpi.h"
> >>
> >> int main(int argc, char **argv){
> >>
> >> int rank;  /* rank of process */
> >> int numprocs;  /* size of COMM_WORLD */
> >> int namelen;
> >> int tag=10;/* expected tag */
> >> int message;   /* Recv'd message */
> >> char processor_name[MPI_MAX_PROCESSOR_NAME];
> >> MPI_Status status; /* status of recv */
> >>
> >> /* call Init, size, and rank */
> >> MPI_Init(&argc, &argv);
> >> MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
> >> MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >> MPI_Get_processor_name(processor_name, &namelen);
> >>
> >> printf("Process %d on %s out of %d\n", rank, processor_name, numprocs);
> >>
> >> if(rank != 0){
> >> MPI_Recv(&message,/*buffer for message */
> >>   1,/*MAX count to recv */
> >> MPI_INT,/*type to recv */
> >>   0,

Re: [OMPI users] [OMPI devel] Open MPI 2.0.4rc2 available for testing

2017-11-02 Thread Howard Pritchard
HI Siegmar,

Could you check if you also see a similar problem with OMPI master when you
build with the Sun compiler?

I opened issue 4436 to track this issue.  Not sure we'll have time to fix
it for 2.0.4 though.

Howard


2017-11-02 3:49 GMT-06:00 Siegmar Gross <
siegmar.gr...@informatik.hs-fulda.de>:

> Hi,
>
> thank you very much for the fix. Unfortunately, I still get an error
> with Sun C 5.15.
>
>
> loki openmpi-2.0.4rc2-Linux.x86_64.64_cc 125 tail -30
> log.make.Linux.x86_64.64_cc
>   CC   src/client/pmix_client.lo
> "/export2/src/openmpi-2.0.4/openmpi-2.0.4rc2/opal/include/opal/sys/x86_64/atomic.h",
> line 161: warning: parameter in inline asm statement unused: %3
> "/export2/src/openmpi-2.0.4/openmpi-2.0.4rc2/opal/include/opal/sys/x86_64/atomic.h",
> line 207: warning: parameter in inline asm statement unused: %2
> "/export2/src/openmpi-2.0.4/openmpi-2.0.4rc2/opal/include/opal/sys/x86_64/atomic.h",
> line 228: warning: parameter in inline asm statement unused: %2
> "/export2/src/openmpi-2.0.4/openmpi-2.0.4rc2/opal/include/opal/sys/x86_64/atomic.h",
> line 249: warning: parameter in inline asm statement unused: %2
> "/export2/src/openmpi-2.0.4/openmpi-2.0.4rc2/opal/include/opal/sys/x86_64/atomic.h",
> line 270: warning: parameter in inline asm statement unused: %2
> "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c",
> line 235: redeclaration must have the same or more restrictive linker
> scoping: OPAL_PMIX_PMIX112_PMIx_Get_version
> "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c",
> line 240: redeclaration must have the same or more restrictive linker
> scoping: OPAL_PMIX_PMIX112_PMIx_Init
> "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c",
> line 408: redeclaration must have the same or more restrictive linker
> scoping: OPAL_PMIX_PMIX112_PMIx_Initialized
> "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c",
> line 416: redeclaration must have the same or more restrictive linker
> scoping: OPAL_PMIX_PMIX112_PMIx_Finalize
> "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c",
> line 488: redeclaration must have the same or more restrictive linker
> scoping: OPAL_PMIX_PMIX112_PMIx_Abort
> "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c",
> line 616: redeclaration must have the same or more restrictive linker
> scoping: OPAL_PMIX_PMIX112_PMIx_Put
> "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c",
> line 703: redeclaration must have the same or more restrictive linker
> scoping: OPAL_PMIX_PMIX112_PMIx_Commit
> "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c",
> line 789: redeclaration must have the same or more restrictive linker
> scoping: OPAL_PMIX_PMIX112_PMIx_Resolve_peers
> "../../../../../../openmpi-2.0.4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c",
> line 852: redeclaration must have the same or more restrictive linker
> scoping: OPAL_PMIX_PMIX112_PMIx_Resolve_nodes
> cc: acomp failed for ../../../../../../openmpi-2.0.
> 4rc2/opal/mca/pmix/pmix112/pmix/src/client/pmix_client.c
> Makefile:1242: recipe for target 'src/client/pmix_client.lo' failed
> make[4]: *** [src/client/pmix_client.lo] Error 1
> make[4]: Leaving directory '/export2/src/openmpi-2.0.4/op
> enmpi-2.0.4rc2-Linux.x86_64.64_cc/opal/mca/pmix/pmix112/pmix'
> Makefile:1486: recipe for target 'all-recursive' failed
> make[3]: *** [all-recursive] Error 1
> make[3]: Leaving directory '/export2/src/openmpi-2.0.4/op
> enmpi-2.0.4rc2-Linux.x86_64.64_cc/opal/mca/pmix/pmix112/pmix'
> Makefile:1935: recipe for target 'all-recursive' failed
> make[2]: *** [all-recursive] Error 1
> make[2]: Leaving directory '/export2/src/openmpi-2.0.4/op
> enmpi-2.0.4rc2-Linux.x86_64.64_cc/opal/mca/pmix/pmix112'
> Makefile:2301: recipe for target 'all-recursive' failed
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory '/export2/src/openmpi-2.0.4/op
> enmpi-2.0.4rc2-Linux.x86_64.64_cc/opal'
> Makefile:1800: recipe for target 'all-recursive' failed
> make: *** [all-recursive] Error 1
> loki openmpi-2.0.4rc2-Linux.x86_64.64_cc 125
>
>
>
> I would be grateful, if somebody can fix these problems as well.
> Thank you very much for any help in advance.
>
>
> Kind regards
>
> Siegmar
>
>
>
> On 11/01/17 23:18, Howard Pritchard wrote:
>
>> HI Folks,
>>
>> We decided to roll an rc2 to pick up a PMIx fix:
>>

Re: [OMPI users] Strange benchmarks at large message sizes

2017-09-19 Thread Howard Pritchard
Hello Cooper

Could you rerun your test with the following env. variable set

export OMPI_MCA_coll=self,basic,libnbc

and see if that helps?
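
The same selection can also be made on the mpirun command line; a sketch,
assuming the 96-process run you describe below (the binary name is a
placeholder):

  mpirun --mca coll self,basic,libnbc -np 96 ./your_allreduce_benchmark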

Also, what type of interconnect are you using - ethernet, IB, ...?

Howard



2017-09-19 8:56 GMT-06:00 Cooper Burns :

> Hello,
>
> I have been running some simple benchmarks and saw some strange behaviour:
> All tests are done on 4 nodes with 24 cores each (total of 96 mpi
> processes)
>
> When I run MPI_Allreduce() I see the run time spike up (about 10x) when I
> go from reducing a total of 4096KB to 8192KB for example, when count is
> 2^21 (8192 KB of 4-byte ints):
>
> MPI_Allreduce(send_buf, recv_buf, count, MPI_INT, MPI_SUM, MPI_COMM_WORLD)
>
> is slower than:
>
> MPI_Allreduce(send_buf, recv_buf, count/2, MPI_INT, MPI_SUM,
> MPI_COMM_WORLD)
> MPI_Allreduce(send_buf + count/2, recv_buf + count/2, count/2, MPI_INT,
> MPI_SUM, MPI_COMM_WORLD)
>
> Just wondering if anyone knows what the cause of this behaviour is.
>
> Thanks!
> Cooper
>
>
> Cooper Burns
> Senior Research Engineer
> (608) 230-1551
> convergecfd.com
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users

Re: [OMPI users] openmpi-2.1.2rc2: warnings from "make" and "make check"

2017-08-30 Thread Howard Pritchard
Hi Siegmar,

Opened issue 4151 to track this.

Thanks,

Howard


2017-08-21 7:13 GMT-06:00 Siegmar Gross <
siegmar.gr...@informatik.hs-fulda.de>:

> Hi,
>
> I've installed openmpi-2.1.2rc2 on my "SUSE Linux Enterprise Server 12.2
> (x86_64)" with Sun C 5.15 (Oracle Developer Studio 12.6) and gcc-7.1.0.
> Perhaps somebody wants to eliminate the following warnings.
>
>
> openmpi-2.1.2rc2-Linux.x86_64.64_gcc/log.make.Linux.x86_64.6
> 4_gcc:openmpi-2.1.2rc2/ompi/mca/io/romio314/romio/adio/common/utils.c:97:3:
> warning: passing argument 3 of 'PMPI_Type_hindexed' discards 'const'
> qualifier from pointer target type [-Wdiscarded-qualifiers]
> openmpi-2.1.2rc2-Linux.x86_64.64_gcc/log.make.Linux.x86_64.6
> 4_gcc:openmpi-2.1.2rc2/ompi/mpiext/cuda/c/mpiext_cuda_c.h:16:0: warning:
> "MPIX_CUDA_AWARE_SUPPORT" redefined
>
>
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/opal/mca/hwloc/hwloc1112/hwloc/src/topology-custom.c",
> line 88: warning: initializer will be sign-extended: -1
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/opal/mca/hwloc/hwloc1112/hwloc/src/topology-linux.c",
> line 2640: warning: initializer will be sign-extended: -1
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/opal/mca/hwloc/hwloc1112/hwloc/src/topology-synthetic.c",
> line 851: warning: initializer will be sign-extended: -1
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/opal/mca/hwloc/hwloc1112/hwloc/src/topology-x86.c",
> line 113: warning: initializer will be sign-extended: -1
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/opal/mca/hwloc/hwloc1112/hwloc/src/topology-xml.c",
> line 1667: warning: initializer will be sign-extended: -1
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/ompi/mca/io/romio314/romio/adio/common/ad_fstype.c",
> line 428: warning: statement not reached
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/ompi/mca/io/romio314/romio/adio/common/ad_threaded_io.c",
> line 31: warning: statement not reached
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/ompi/mca/io/romio314/romio/adio/common/utils.c",
> line 97: warning: argument #3 is incompatible with prototype:
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/opal/include/opal/sys/x86_64/atomic.h", line 161:
> warning: parameter in inline asm statement unused: %3
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/opal/include/opal/sys/x86_64/atomic.h", line 207:
> warning: parameter in inline asm statement unused: %2
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/opal/include/opal/sys/x86_64/atomic.h", line 228:
> warning: parameter in inline asm statement unused: %2
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/opal/include/opal/sys/x86_64/atomic.h", line 249:
> warning: parameter in inline asm statement unused: %2
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/opal/include/opal/sys/x86_64/atomic.h", line 270:
> warning: parameter in inline asm statement unused: %2
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/opal/mca/pmix/pmix112/pmix/src/client/pmi1.c", line
> 708: warning: null dimension: argvp
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c",
> line 266: warning: initializer will be sign-extended: -1
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/opal/mca/pmix/pmix112/pmix/src/server/pmix_server.c",
> line 267: warning: initializer will be sign-extended: -1
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/ompi/mpiext/cuda/c/mpiext_cuda_c.h", line 16:
> warning: macro redefined: MPIX_CUDA_AWARE_SUPPORT
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/opal/include/opal/sys/x86_64/timer.h", line 49:
> warning: initializer does not fit or is out of range: 0x8007
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/opal/mca/pmix/pmix112/pmix1_client.c", line 408:
> warning: enum type mismatch: arg #1
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"openmpi-2.1.2rc2/opal/mca/base/mca_base_component_repository.c",
> line 265: warning: statement not reached
> openmpi-2.1.2rc2-Linux.x86_64.64_cc/log.make.Linux.x86_64.64
> _cc:"/export2/src/openmpi-2.1.2/openmpi-2.1.2rc2/opal/mca/pm
> ix/pmix112/pmix/include/pmi.h", line 788: warning: null dimension: argvp
>
>
>
> openmpi-2.1.2rc2-Linux.x86_64.64_gcc/log.make-check.Linux.x8
> 

Re: [OMPI users] openmpi-master-201708190239-9d3f451: warnings from "make" and "make check"

2017-08-30 Thread Howard Pritchard
Hi Siegmar,

I opened issue 4151 to track this.  This is relevant to a project to get
open mpi to build with -Werror.

Thanks very much,

Howard


2017-08-21 7:27 GMT-06:00 Siegmar Gross <
siegmar.gr...@informatik.hs-fulda.de>:

> Hi,
>
> I've installed openmpi-master-201708190239-9d3f451 on my "SUSE Linux
> Enterprise
> Server 12.2 (x86_64)" with Sun C 5.15 (Oracle Developer Studio 12.6) and
> gcc-7.1.0. Perhaps somebody wants to eliminate the following warnings.
>
>
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_gcc/log.
> make.Linux.x86_64.64_gcc:../../../../../../../../../openmpi-
> master-201708190239-9d3f451/opal/mca/pmix/pmix2x/pmix/src/
> mca/bfrops/base/bfrop_base_copy.c:414:22: warning: statement will never
> be executed [-Wswitch-unreachable]
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_gcc/log.
> make.Linux.x86_64.64_gcc:../../../../../openmpi-master-20170
> 8190239-9d3f451/ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c:136:34:
> warning: passing argument 1 of '__xpg_basename' discards 'const' qualifier
> from pointer target type [-Wdiscarded-qualifiers]
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_gcc/log.
> make.Linux.x86_64.64_gcc:../../../../../openmpi-master-20170
> 8190239-9d3f451/ompi/mpiext/cuda/c/mpiext_cuda_c.h:16:0: warning:
> "MPIX_CUDA_AWARE_SUPPORT" redefined
>
>
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_gcc/log.
> make-check.Linux.x86_64.64_gcc:../../../openmpi-master-
> 201708190239-9d3f451/test/class/opal_fifo.c:109:26: warning: assignment
> discards 'volatile' qualifier from pointer target type
> [-Wdiscarded-qualifiers]
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_gcc/log.
> make-check.Linux.x86_64.64_gcc:../../../openmpi-master-
> 201708190239-9d3f451/test/class/opal_lifo.c:72:26: warning: assignment
> discards 'volatile' qualifier from pointer target type
> [-Wdiscarded-qualifiers]
>
>
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.
> make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451
> /opal/mca/pmix/pmix2x/pmix/src/mca/base/pmix_mca_base_component_repository.c",
> line 266: warning: statement not reached
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.
> make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451
> /opal/mca/pmix/pmix2x/pmix/src/mca/bfrops/base/bfrop_base_copy.c", line
> 414: warning: statement not reached
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.
> make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451
> /opal/mca/hwloc/hwloc2a/hwloc/hwloc/topology-linux.c", line 2797:
> warning: initializer will be sign-extended: -1
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.
> make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451
> /opal/mca/hwloc/hwloc2a/hwloc/hwloc/topology-synthetic.c", line 946:
> warning: initializer will be sign-extended: -1
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.
> make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451
> /opal/mca/hwloc/hwloc2a/hwloc/hwloc/topology-x86.c", line 238: warning:
> initializer will be sign-extended: -1
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.
> make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451
> /opal/mca/hwloc/hwloc2a/hwloc/hwloc/topology-xml.c", line 2404: warning:
> initializer will be sign-extended: -1
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.
> make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451
> /opal/mca/pmix/pmix2x/pmix/src/client/pmi1.c", line 711: warning: null
> dimension: argvp
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.
> make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451
> /ompi/mca/io/romio314/romio/adio/common/ad_fstype.c", line 428: warning:
> statement not reached
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.
> make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451
> /ompi/mca/io/romio314/romio/adio/common/ad_threaded_io.c", line 31:
> warning: statement not reached
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.
> make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451
> /ompi/mca/coll/monitoring/coll_monitoring_component.c", line 160:
> warning: improper pointer/integer combination: op "="
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.
> make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451
> /ompi/mca/sharedfp/sm/sharedfp_sm_file_open.c", line 136: warning:
> argument #1 is incompatible with prototype:
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.
> make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451
> /ompi/mca/topo/treematch/treematch/tm_malloc.c", line 71: warning:
> statement not reached
> openmpi-master-201708190239-9d3f451-Linux.x86_64.64_cc/log.
> make.Linux.x86_64.64_cc:"openmpi-master-201708190239-9d3f451
> /ompi/mca/topo/treematch/treematch/tm_tree.c", line 1188: warning:
> statement not reached
> 

Re: [OMPI users] pmix, lxc, hpcx

2017-05-26 Thread Howard Pritchard
Hi John,

In the 2.1.x release stream a shared memory capability was introduced into
the PMIx component.

I know nothing about LXC containers, but it looks to me like there's some
issue when PMIx tries
to create these shared memory segments.  I'd check to see if there's
something about your
container configuration that is preventing the creation of shared memory
segments.

Howard


2017-05-26 15:18 GMT-06:00 John Marshall :

> Hi,
>
> I have built openmpi 2.1.1 with hpcx-1.8 and tried to run some mpi code
> under
> ubuntu 14.04 and LXC (1.x) but I get the following:
>
> [ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file 
> src/dstore/pmix_esh.c at line 1651
> [ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file 
> src/dstore/pmix_esh.c at line 1751
> [ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file 
> src/dstore/pmix_esh.c at line 1114
> [ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file 
> src/common/pmix_jobdata.c at line 93
> [ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file 
> src/common/pmix_jobdata.c at line 333
> [ib7-bc2oo42-be10p16.science.gc.ca:16035] PMIX ERROR: OUT-OF-RESOURCE in file 
> src/server/pmix_server.c at line 606
>
> I do not get the same outside of the LXC container and my code runs fine.
>
> I've looked for more info on these messages but could not find anything
> helpful. Are these messages indicative of something missing in, or some
> incompatibility with, the container?
>
> When I build using 2.0.2, I do not have a problem running inside or
> outside of
> the container.
>
> Thanks,
> John
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-22 Thread Howard Pritchard
Forgot you probably need an equal sign after btl arg

Howard Pritchard <hpprit...@gmail.com> wrote on Wed., Mar. 22, 2017 at 18:11:

> Hi Goetz
>
> Thanks for trying these other versions.  Looks like a bug.  Could you post
> the config.log output from your build of the 2.1.0 to the list?
>
> Also could you try running the job using this extra command line arg to
> see if the problem goes away?
>
> mpirun --mca btl ^vader (rest of your args)
>
> Howard
>
Götz Waschk <goetz.was...@gmail.com> wrote on Wed., Mar. 22, 2017 at 13:09:
>
> On Wed, Mar 22, 2017 at 7:46 PM, Howard Pritchard <hpprit...@gmail.com>
> wrote:
> > Hi Goetz,
> >
> > Would you mind testing against the 2.1.0 release or the latest from the
> > 1.10.x series (1.10.6)?
>
> Hi Howard,
>
> after sending my mail I have tested both 1.10.6 and 2.1.0 and I have
> received the same error. I have also tested outside of slurm using
> ssh, same problem.
>
> Here's the message from 2.1.0:
> [pax11-10:21920] *** Process received signal ***
> [pax11-10:21920] Signal: Bus error (7)
> [pax11-10:21920] Signal code: Non-existant physical address (2)
> [pax11-10:21920] Failing at address: 0x2b5d5b752290
> [pax11-10:21920] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b5d446e9370]
> [pax11-10:21920] [ 1]
>
> /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x70)[0x2b5d531645e0]
> [pax11-10:21920] [ 2]
>
> /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x211)[0x2b5d44f607c1]
> [pax11-10:21920] [ 3]
>
> /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(+0x2b51)[0x2b5d53162b51]
> [pax11-10:21920] [ 4]
>
> /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x3f)[0x2b5d5bb0a17f]
> [pax11-10:21920] [ 5]
>
> /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xa7a)[0x2b5d5bafe0aa]
> [pax11-10:21920] [ 6]
>
> /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_allreduce_intra_ring+0x399)[0x2b5d44480429]
> [pax11-10:21920] [ 7]
>
> /opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Allreduce+0x17b)[0x2b5d86ab]
> [pax11-10:21920] [ 8] IMB-MPI1[0x40b2ff]
> [pax11-10:21920] [ 9] IMB-MPI1[0x402646]
> [pax11-10:21920] [10]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b5d44917b35]
> [pax11-10:21920] [11] IMB-MPI1[0x401f79]
> [pax11-10:21920] *** End of error message ***
> --
> mpirun noticed that process rank 320 with PID 21920 on node pax11-10
> exited on signal 7 (Bus error).
> --
>
>
> Regards, Götz Waschk
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-22 Thread Howard Pritchard
Hi Goetz

Thanks for trying these other versions.  Looks like a bug.  Could you post
the config.log output from your build of the 2.1.0 to the list?

Also could you try running the job using this extra command line arg to see
if the problem goes away?

mpirun --mca btl ^vader (rest of your args)

Howard

Götz Waschk <goetz.was...@gmail.com> wrote on Wed., Mar. 22, 2017 at 13:09:

On Wed, Mar 22, 2017 at 7:46 PM, Howard Pritchard <hpprit...@gmail.com>
wrote:
> Hi Goetz,
>
> Would you mind testing against the 2.1.0 release or the latest from the
> 1.10.x series (1.10.6)?

Hi Howard,

after sending my mail I have tested both 1.10.6 and 2.1.0 and I have
received the same error. I have also tested outside of slurm using
ssh, same problem.

Here's the message from 2.1.0:
[pax11-10:21920] *** Process received signal ***
[pax11-10:21920] Signal: Bus error (7)
[pax11-10:21920] Signal code: Non-existant physical address (2)
[pax11-10:21920] Failing at address: 0x2b5d5b752290
[pax11-10:21920] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b5d446e9370]
[pax11-10:21920] [ 1]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(mca_btl_vader_frag_init+0x70)[0x2b5d531645e0]
[pax11-10:21920] [ 2]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libopen-pal.so.20(opal_free_list_grow_st+0x211)[0x2b5d44f607c1]
[pax11-10:21920] [ 3]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_btl_vader.so(+0x2b51)[0x2b5d53162b51]
[pax11-10:21920] [ 4]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send_request_start_prepare+0x3f)[0x2b5d5bb0a17f]
[pax11-10:21920] [ 5]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0xa7a)[0x2b5d5bafe0aa]
[pax11-10:21920] [ 6]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(ompi_coll_base_allreduce_intra_ring+0x399)[0x2b5d44480429]
[pax11-10:21920] [ 7]
/opt/ohpc/pub/mpi/openmpi-gnu/2.1.0/lib/libmpi.so.20(PMPI_Allreduce+0x17b)[0x2b5d86ab]
[pax11-10:21920] [ 8] IMB-MPI1[0x40b2ff]
[pax11-10:21920] [ 9] IMB-MPI1[0x402646]
[pax11-10:21920] [10]
/usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b5d44917b35]
[pax11-10:21920] [11] IMB-MPI1[0x401f79]
[pax11-10:21920] *** End of error message ***
--
mpirun noticed that process rank 320 with PID 21920 on node pax11-10
exited on signal 7 (Bus error).
--


Regards, Götz Waschk
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Openmpi 1.10.4 crashes with 1024 processes

2017-03-22 Thread Howard Pritchard
Hi Goetz,

Would you mind testing against the 2.1.0 release or the latest from the
1.10.x series (1.10.6)?

Thanks,

Howard


2017-03-22 6:25 GMT-06:00 Götz Waschk :

> Hi everyone,
>
> I'm testing a new machine with 32 nodes of 32 cores each using the IMB
> benchmark. It is working fine with 512 processes, but it crashes with
> 1024 processes after a running for a minute:
>
> [pax11-17:16978] *** Process received signal ***
> [pax11-17:16978] Signal: Bus error (7)
> [pax11-17:16978] Signal code: Non-existant physical address (2)
> [pax11-17:16978] Failing at address: 0x2b147b785450
> [pax11-17:16978] [ 0] /usr/lib64/libpthread.so.0(+0xf370)[0x2b1473b13370]
> [pax11-17:16978] [ 1]
> /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_
> vader.so(mca_btl_vader_frag_init+0x8e)[0x2b14794a413e]
> [pax11-17:16978] [ 2]
> /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(ompi_
> free_list_grow+0x199)[0x2b147384f309]
> [pax11-17:16978] [ 3]
> /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_btl_
> vader.so(+0x270d)[0x2b14794a270d]
> [pax11-17:16978] [ 4]
> /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_
> ob1.so(mca_pml_ob1_send_request_start_prepare+0x43)[0x2b1479ae3a13]
> [pax11-17:16978] [ 5]
> /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_pml_
> ob1.so(mca_pml_ob1_send+0x89a)[0x2b1479ad90ca]
> [pax11-17:16978] [ 6]
> /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/openmpi/mca_coll_
> tuned.so(ompi_coll_tuned_allreduce_intra_ring+0x3f1)[0x2b147ad6ec41]
> [pax11-17:16978] [ 7]
> /opt/ohpc/pub/mpi/openmpi-gnu/1.10.4/lib/libmpi.so.12(MPI_
> Allreduce+0x17b)[0x2b147387d6bb]
> [pax11-17:16978] [ 8] IMB-MPI1[0x40b316]
> [pax11-17:16978] [ 9] IMB-MPI1[0x407284]
> [pax11-17:16978] [10] IMB-MPI1[0x40250e]
> [pax11-17:16978] [11]
> /usr/lib64/libc.so.6(__libc_start_main+0xf5)[0x2b1473d41b35]
> [pax11-17:16978] [12] IMB-MPI1[0x401f79]
> [pax11-17:16978] *** End of error message ***
> --
> mpirun noticed that process rank 552 with PID 0 on node pax11-17
> exited on signal 7 (Bus error).
> --
>
> The program is started from the slurm batch system using mpirun. The
> same application is working fine when using mvapich2 instead.
>
> Regards, Götz Waschk
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Shared Windows and MPI_Accumulate

2017-03-03 Thread Howard Pritchard
Hello Joseph,

I'm still unable to reproduce this problem on my SLES12 x86_64 node.

Are you building with CFLAGS=-O3?

If so, could you build without CFLAGS set and see if you still see the
failure?

Howard


2017-03-02 2:34 GMT-07:00 Joseph Schuchart <schuch...@hlrs.de>:

> Hi Howard,
>
> Thanks for trying to reproduce this. It seems that on master the issue
> occurs less frequently but is still there. I used the following bash
> one-liner on my laptop and on our Linux Cluster (single node, 4 processes):
>
> ```
> $ for i in $(seq 1 100) ; do echo $i && mpirun -n 4
> ./mpi_shared_accumulate | grep \! && break ; done
> 1
> 2
> [0] baseptr[0]: 1004 (expected 1010) [!!!]
> [0] baseptr[1]: 1005 (expected 1011) [!!!]
> [0] baseptr[2]: 1006 (expected 1012) [!!!]
> [0] baseptr[3]: 1007 (expected 1013) [!!!]
> [0] baseptr[4]: 1008 (expected 1014) [!!!]
> ```
>
> Sometimes the error occurs after one or two iterations (like above),
> sometimes only at iteration 20 or later. However, I can reproduce it within
> the 100 runs every time I run the statement above. I am attaching the
> config.log and output of ompi_info of master on my laptop. Please let me
> know if I can help with anything else.
>
> Thanks,
> Joseph
>
> On 03/01/2017 11:24 PM, Howard Pritchard wrote:
>
> Hi Joseph,
>
> I built this test with craypich (Cray MPI) and it passed.  I also tried
> with Open MPI master and the test passed.  I also tried with 2.0.2
> and can't seem to reproduce on my system.
>
> Could you post the output of config.log?
>
> Also, how intermittent is the problem?
>
>
> Thanks,
>
> Howard
>
>
>
>
> 2017-03-01 8:03 GMT-07:00 Joseph Schuchart <schuch...@hlrs.de>:
>
>> Hi all,
>>
>> We are seeing issues in one of our applications, in which processes in a
>> shared communicator allocate a shared MPI window and execute MPI_Accumulate
>> simultaneously on it to iteratively update each process' values. The test
>> boils down to the sample code attached. Sample output is as follows:
>>
>> ```
>> $ mpirun -n 4 ./mpi_shared_accumulate
>> [1] baseptr[0]: 1010 (expected 1010)
>> [1] baseptr[1]: 1011 (expected 1011)
>> [1] baseptr[2]: 1012 (expected 1012)
>> [1] baseptr[3]: 1013 (expected 1013)
>> [1] baseptr[4]: 1014 (expected 1014)
>> [2] baseptr[0]: 1005 (expected 1010) [!!!]
>> [2] baseptr[1]: 1006 (expected 1011) [!!!]
>> [2] baseptr[2]: 1007 (expected 1012) [!!!]
>> [2] baseptr[3]: 1008 (expected 1013) [!!!]
>> [2] baseptr[4]: 1009 (expected 1014) [!!!]
>> [3] baseptr[0]: 1010 (expected 1010)
>> [0] baseptr[0]: 1010 (expected 1010)
>> [0] baseptr[1]: 1011 (expected 1011)
>> [0] baseptr[2]: 1012 (expected 1012)
>> [0] baseptr[3]: 1013 (expected 1013)
>> [0] baseptr[4]: 1014 (expected 1014)
>> [3] baseptr[1]: 1011 (expected 1011)
>> [3] baseptr[2]: 1012 (expected 1012)
>> [3] baseptr[3]: 1013 (expected 1013)
>> [3] baseptr[4]: 1014 (expected 1014)
>> ```
>>
>> Each process should hold the same values but sometimes (not on all
>> executions) random processes diverge (marked through [!!!]).
>>
>> I made the following observations:
>>
>> 1) The issue occurs with both OpenMPI 1.10.6 and 2.0.2 but not with MPICH
>> 3.2.
>> 2) The issue occurs only if the window is allocated through
>> MPI_Win_allocate_shared, using MPI_Win_allocate works fine.
>> 3) The code assumes that MPI_Accumulate atomically updates individual
>> elements (please correct me if that is not covered by the MPI standard).
>>
>> Both OpenMPI and the example code were compiled using GCC 5.4.1 and run
>> on a Linux system (single node). OpenMPI was configure with
>> --enable-mpi-thread-multiple and --with-threads but the application is not
>> multi-threaded. Please let me know if you need any other information.
>>
>> Cheers
>> Joseph
>>
>> --
>> Dipl.-Inf. Joseph Schuchart
>> High Performance Computing Center Stuttgart (HLRS)
>> Nobelstr. 19
>> D-70569 Stuttgart
>>
>> Tel.: +49(0)711-68565890
>> Fax: +49(0)711-6856832
>> E-Mail: schuch...@hlrs.de
>>
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>>
>
>
>
> ___
> users mailing list
> us...@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
> --
> Dipl.-Inf. Joseph Schuchart
> High Performance Computing Center Stuttgart (HLRS)
> Nobelstr. 19
> D-70569 Stuttgart
>
> Tel.: +49(0)711-68565890 <+49%20711%2068565890>
> Fax: +49(0)711-6856832 <+49%20711%206856832>
> E-Mail: schuch...@hlrs.de
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] sharedfp/lockedfile collision between multiple program instances

2017-03-03 Thread Howard Pritchard
Hi Edgar

Please open an issue too so we can track the fix.

Howard


Edgar Gabriel  wrote on Fri., Mar. 3, 2017 at 07:45:

> Nicolas,
>
> thank you for the bug report, I can confirm the behavior. I will work on
> a patch and will try to get that into the next release, should hopefully
> not be too complicated.
>
> Thanks
>
> Edgar
>
>
> On 3/3/2017 7:36 AM, Nicolas Joly wrote:
> > Hi,
> >
> > We just got hit by a problem with sharedfp/lockedfile component under
> > v2.0.1 (should be identical with v2.0.2). We had 2 instances of an MPI
> > program running conccurrently on the same input file and using
> > MPI_File_read_shared() function ...
> >
> > If the shared file pointer is maintained with the lockedfile
> > component, a "XXX.lockedfile" is created near to the data
> > file. Unfortunately, this fixed name will collide with multiple tools
> > instances ;)
> >
> > Running 2 instances of the following command line (source code
> > attached) on the same machine will show the problematic behaviour.
> >
> > mpirun -n 1 --mca sharedfp lockedfile ./shrread -v input.dat
> >
> > Confirmed with lsof(8) output :
> >
> > njoly@tars [~]> lsof input.dat.lockedfile
> > COMMAND  PID  USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
> > shrread 5876 njoly   21w   REG   0,308 13510798885996031
> input.dat.lockedfile
> > shrread 5884 njoly   21w   REG   0,308 13510798885996031
> input.dat.lockedfile
> >
> > Thanks in advance.
> >
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Shared Windows and MPI_Accumulate

2017-03-01 Thread Howard Pritchard
Hi Joseph,

I built this test with craypich (Cray MPI) and it passed.  I also tried
with Open MPI master and the test passed.  I also tried with 2.0.2
and can't seem to reproduce on my system.

Could you post the output of config.log?

Also, how intermittent is the problem?


Thanks,

Howard




2017-03-01 8:03 GMT-07:00 Joseph Schuchart :

> Hi all,
>
> We are seeing issues in one of our applications, in which processes in a
> shared communicator allocate a shared MPI window and execute MPI_Accumulate
> simultaneously on it to iteratively update each process' values. The test
> boils down to the sample code attached. Sample output is as follows:
>
> ```
> $ mpirun -n 4 ./mpi_shared_accumulate
> [1] baseptr[0]: 1010 (expected 1010)
> [1] baseptr[1]: 1011 (expected 1011)
> [1] baseptr[2]: 1012 (expected 1012)
> [1] baseptr[3]: 1013 (expected 1013)
> [1] baseptr[4]: 1014 (expected 1014)
> [2] baseptr[0]: 1005 (expected 1010) [!!!]
> [2] baseptr[1]: 1006 (expected 1011) [!!!]
> [2] baseptr[2]: 1007 (expected 1012) [!!!]
> [2] baseptr[3]: 1008 (expected 1013) [!!!]
> [2] baseptr[4]: 1009 (expected 1014) [!!!]
> [3] baseptr[0]: 1010 (expected 1010)
> [0] baseptr[0]: 1010 (expected 1010)
> [0] baseptr[1]: 1011 (expected 1011)
> [0] baseptr[2]: 1012 (expected 1012)
> [0] baseptr[3]: 1013 (expected 1013)
> [0] baseptr[4]: 1014 (expected 1014)
> [3] baseptr[1]: 1011 (expected 1011)
> [3] baseptr[2]: 1012 (expected 1012)
> [3] baseptr[3]: 1013 (expected 1013)
> [3] baseptr[4]: 1014 (expected 1014)
> ```
>
> Each process should hold the same values but sometimes (not on all
> executions) random processes diverge (marked through [!!!]).
>
> I made the following observations:
>
> 1) The issue occurs with both OpenMPI 1.10.6 and 2.0.2 but not with MPICH
> 3.2.
> 2) The issue occurs only if the window is allocated through
> MPI_Win_allocate_shared, using MPI_Win_allocate works fine.
> 3) The code assumes that MPI_Accumulate atomically updates individual
> elements (please correct me if that is not covered by the MPI standard).
>
> Both OpenMPI and the example code were compiled using GCC 5.4.1 and run on
> a Linux system (single node). OpenMPI was configured with
> --enable-mpi-thread-multiple and --with-threads but the application is not
> multi-threaded. Please let me know if you need any other information.
>
> Cheers
> Joseph
>
> --
> Dipl.-Inf. Joseph Schuchart
> High Performance Computing Center Stuttgart (HLRS)
> Nobelstr. 19
> D-70569 Stuttgart
>
> Tel.: +49(0)711-68565890
> Fax: +49(0)711-6856832
> E-Mail: schuch...@hlrs.de
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Issues with different IB adapters and openmpi 2.0.2

2017-02-27 Thread Howard Pritchard
Hi Orion

Does the problem occur if you only use font2 and 3?  Do you have MXM
installed on the font1 node?

The 2.x series is using PMIx, and it could be that this is impacting the PML
sanity check.

Howard


Orion Poplawski  wrote on Mon., Feb. 27, 2017 at 14:50:

> We have a couple nodes with different IB adapters in them:
>
> font1/var/log/lspci:03:00.0 InfiniBand [0c06]: Mellanox Technologies
> MT25204
> [InfiniHost III Lx HCA] [15b3:6274] (rev 20)
> font2/var/log/lspci:03:00.0 InfiniBand [0c06]: QLogic Corp. IBA7220
> InfiniBand
> HCA [1077:7220] (rev 02)
> font3/var/log/lspci:03:00.0 InfiniBand [0c06]: QLogic Corp. IBA7220
> InfiniBand
> HCA [1077:7220] (rev 02)
>
> With 1.10.3 we saw the following errors with mpirun:
>
> [font2.cora.nwra.com:13982] [[23220,1],10] selected pml cm, but peer
> [[23220,1],0] on font1 selected pml ob1
>
> which crashed MPI_Init.
>
> We worked around this by passing "--mca pml ob1".  I notice now with
> openmpi
> 2.0.2 without that option I no longer see errors, but the mpi program will
> hang shortly after startup.  Re-adding the option makes it work, so I'm
> assuming the underlying problem is still the same, but openmpi appears to
> have
> stopped alerting me to the issue.
>
> Thoughts?
>
> --
> Orion Poplawski
> Technical Manager  720-772-5637
> NWRA, Boulder/CoRA Office FAX: 303-415-9702
> 3380 Mitchell Lane   or...@nwra.com
> Boulder, CO 80301   http://www.nwra.com
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] MPI_THREAD_MULTIPLE: Fatal error on MPI_Win_create

2017-02-18 Thread Howard Pritchard
Hi Joseph

What OS are you using when running the test?

Could you try running with

export OMPI_MCA_osc=^pt2pt
and
export OMPI_MCA_osc_base_verbose=10
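
Equivalently, those settings can be passed on the mpirun command line; a
sketch, assuming the 4-process run from your report (the binary name is a
placeholder):

  mpirun --mca osc ^pt2pt --mca osc_base_verbose 10 -np 4 ./win_create_test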

This error message was put into this OMPI release because this part of the
code has known problems when used multi-threaded.



Joseph Schuchart  wrote on Sat., Feb. 18, 2017 at 04:02:

> All,
>
> I am seeing a fatal error with OpenMPI 2.0.2 if requesting support for
> MPI_THREAD_MULTIPLE and afterwards creating a window using
> MPI_Win_create. I am attaching a small reproducer. The output I get is
> the following:
>
> ```
> MPI_THREAD_MULTIPLE supported: yes
> MPI_THREAD_MULTIPLE supported: yes
> MPI_THREAD_MULTIPLE supported: yes
> MPI_THREAD_MULTIPLE supported: yes
> --
> The OSC pt2pt component does not support MPI_THREAD_MULTIPLE in this
> release.
> Workarounds are to run on a single node, or to use a system with an RDMA
> capable network such as Infiniband.
> --
> [beryl:10705] *** An error occurred in MPI_Win_create
> [beryl:10705] *** reported by process [2149974017,2]
> [beryl:10705] *** on communicator MPI_COMM_WORLD
> [beryl:10705] *** MPI_ERR_WIN: invalid window
> [beryl:10705] *** MPI_ERRORS_ARE_FATAL (processes in this communicator
> will now abort,
> [beryl:10705] ***and potentially your MPI job)
> [beryl:10698] 3 more processes have sent help message help-osc-pt2pt.txt
> / mpi-thread-multiple-not-supported
> [beryl:10698] Set MCA parameter "orte_base_help_aggregate" to 0 to see
> all help / error messages
> [beryl:10698] 3 more processes have sent help message
> help-mpi-errors.txt / mpi_errors_are_fatal
> ```
>
> I am running on a single node (my laptop). Both OpenMPI and the
> application were compiled using GCC 5.3.0. Naturally, there is no
> support for Infiniband available. Should I signal OpenMPI that I am
> indeed running on a single node? If so, how can I do that? Can't this be
> detected by OpenMPI automatically? The test succeeds if I only request
> MPI_THREAD_SINGLE.
>
> OpenMPI 2.0.2 has been configured using only
> --enable-mpi-thread-multiple and --prefix configure parameters. I am
> attaching the output of ompi_info.
>
> Please let me know if you need any additional information.
>
> Cheers,
> Joseph
>
> --
> Dipl.-Inf. Joseph Schuchart
> High Performance Computing Center Stuttgart (HLRS)
> Nobelstr. 19
> D-70569 Stuttgart
>
> Tel.: +49(0)711-68565890
> Fax: +49(0)711-6856832
> E-Mail: schuch...@hlrs.de
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Problem with MPI_Comm_spawn using openmpi 2.0.x + sbatch

2017-02-15 Thread Howard Pritchard
Hi Anastasia,

Definitely check which mpirun is being used in the batch environment, but you
may also want to upgrade to Open MPI 2.0.2.

Howard

r...@open-mpi.org  wrote on Wed., Feb. 15, 2017 at 07:49:

> Nothing immediate comes to mind - all sbatch does is create an allocation
> and then run your script in it. Perhaps your script is using a different
> “mpirun” command than when you type it interactively?
>
> On Feb 14, 2017, at 5:11 AM, Anastasia Kruchinina <
> nastja.kruchin...@gmail.com> wrote:
>
> Hi,
>
> I am trying to use MPI_Comm_spawn function in my code. I am having trouble
> with openmpi 2.0.x + sbatch (batch system Slurm).
> My test program is located here:
> http://user.it.uu.se/~anakr367/files/MPI_test/
>
> When I am running my code I am getting an error:
>
> OPAL ERROR: Timeout in file
> ../../../../openmpi-2.0.1/opal/mca/pmix/base/pmix_base_fns.c at line 193
> *** An error occurred in MPI_Init_thread
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or
> environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>ompi_dpm_dyn_init() failed
>--> Returned "Timeout" (-15) instead of "Success" (0)
> --
>
> The interesting thing is that there is no error when I first allocate
> nodes with salloc and then run my program. So, I noticed that
> the program works fine using openmpi 1.x+sbach/salloc or openmpi
> 2.0.x+salloc but not openmpi 2.0.x+sbatch.
>
> The error was reproduced on three different computer clusters.
>
> Best regards,
> Anastasia
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI not running any job on Mac OS X 10.12

2017-02-06 Thread Howard Pritchard
Hi Michel,

Could you try running the app with

export TMPDIR=/tmp

set in the shell you are using?
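
i.e., something along these lines (binary name and process count taken from
your earlier mail):

  export TMPDIR=/tmp
  mpirun -np 2 ./mpitest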

Howard


2017-02-02 13:46 GMT-07:00 Michel Lesoinne <mlesoi...@cmsoftinc.com>:

Howard,

First, thanks to you and Jeff for looking into this with me. 
I tried ../configure --disable-shared --enable-static --prefix ~/.local
The result is the same as without --disable-shared. i.e. I get the
following error:

[Michels-MacBook-Pro.local:92780] [[46617,0],0] ORTE_ERROR_LOG: Bad
parameter in file ../../orte/orted/pmix/pmix_server.c at line 262

[Michels-MacBook-Pro.local:92780] [[46617,0],0] ORTE_ERROR_LOG: Bad
parameter in file ../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line
666

--

It looks like orte_init failed for some reason; your parallel process is

likely to abort.  There are many reasons that a parallel process can

fail during orte_init; some of which are due to configuration or

environment problems.  This failure appears to be an internal failure;

here's some additional information (which may only be relevant to an

Open MPI developer):


  pmix server init failed

  --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS

--

On Thu, Feb 2, 2017 at 12:29 PM, Howard Pritchard <hpprit...@gmail.com>
wrote:

Hi Michel

Try adding --enable-static to the configure.
That fixed the problem for me.

Howard

Michel Lesoinne <mlesoi...@cmsoftinc.com> wrote on Wed., Feb. 1, 2017 at
19:07:

I have compiled OpenMPI 2.0.2 on a new Macbook running OS X 10.12 and have
been trying to run a simple program.
I configured openmpi with
../configure --disable-shared --prefix ~/.local
make all install

Then I have a simple code containing only a call to MPI_Init.
I compile it, then run it with
mpirun -np 2 ./mpitest

The output is:

[Michels-MacBook-Pro.local:45101] mca_base_component_repository_open:
unable to open mca_patcher_overwrite: File not found (ignored)

[Michels-MacBook-Pro.local:45101] mca_base_component_repository_open:
unable to open mca_shmem_mmap: File not found (ignored)

[Michels-MacBook-Pro.local:45101] mca_base_component_repository_open:
unable to open mca_shmem_posix: File not found (ignored)

[Michels-MacBook-Pro.local:45101] mca_base_component_repository_open:
unable to open mca_shmem_sysv: File not found (ignored)

--

It looks like opal_init failed for some reason; your parallel process is

likely to abort.  There are many reasons that a parallel process can

fail during opal_init; some of which are due to configuration or

environment problems.  This failure appears to be an internal failure;

here's some additional information (which may only be relevant to an

Open MPI developer):


  opal_shmem_base_select failed

  --> Returned value -1 instead of OPAL_SUCCESS

--

Without the --disable-shared in the configuration, then I get:


[Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad
parameter in file ../../orte/orted/pmix/pmix_server.c at line 264

[Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad
parameter in file ../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line
666

--

It looks like orte_init failed for some reason; your parallel process is

likely to abort.  There are many reasons that a parallel process can

fail during orte_init; some of which are due to configuration or

environment problems.  This failure appears to be an internal failure;

here's some additional information (which may only be relevant to an

Open MPI developer):


  pmix server init failed

  --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS

--




Has anyone seen this? What am I missing?
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users


___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users



___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Open MPI over RoCE using breakout cable and switch

2017-02-03 Thread Howard Pritchard
Hello Brendan,

Sorry for the delay in responding.  I've been on travel the past two weeks.

I traced through the debug output you sent.  It provided enough information
to show that for some reason, when using the breakout cable, Open MPI
is unable to complete the initialization it needs to use the openib BTL.  It
correctly detects that the first port is not available, but for port 1, it
still fails to initialize.

To debug this further, I'd need to provide you with a custom Open MPI
to try that would have more debug output in the suspect area.

If you'd like to go this route, let me know and I'll build a one-off library
to try to debug this problem.

One thing to do just as a sanity check is to try tcp:

mpirun --mca btl tcp,self,sm 

with the breakout cable.  If that doesn't work, then I think there may
be some network setup problem that needs to be resolved first before
trying custom Open MPI tarballs.
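Filled in with the hostfile and benchmark binary from the command quoted
further down in this thread, that sanity check would be, for example:

  mpirun --mca btl tcp,self,sm -hostfile mpi-hosts-ce /usr/local/bin/IMB-MPI1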

Thanks,

Howard




2017-02-01 15:08 GMT-07:00 Brendan Myers <brendan.my...@soft-forge.com>:

> Hello Howard,
>
> I was wondering if you have been able to look at this issue at all, or if
> anyone has any ideas on what to try next.
>
>
>
> Thank you,
>
> Brendan
>
>
>
> *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Brendan
> Myers
> *Sent:* Tuesday, January 24, 2017 11:11 AM
>
> *To:* 'Open MPI Users' <users@lists.open-mpi.org>
> *Subject:* Re: [OMPI users] Open MPI over RoCE using breakout cable and
> switch
>
>
>
> Hello Howard,
>
> Here is the error output after building with debug enabled.  These CX4
> Mellanox cards view each port as a separate device and I am using port 1 on
> the card which is device mlx5_0.
>
>
>
> Thank you,
>
> Brendan
>
>
>
> *From:* users [mailto:users-boun...@lists.open-mpi.org
> <users-boun...@lists.open-mpi.org>] *On Behalf Of *Howard Pritchard
> *Sent:* Tuesday, January 24, 2017 8:21 AM
> *To:* Open MPI Users <users@lists.open-mpi.org>
> *Subject:* Re: [OMPI users] Open MPI over RoCE using breakout cable and
> switch
>
>
>
> Hello Brendan,
>
>
>
> This helps some, but looks like we need more debug output.
>
>
>
> Could you build a debug version of Open MPI by adding --enable-debug
>
> to the config options and rerun the test with the breakout cable setup
>
> and keeping the --mca btl_base_verbose 100 command line option?
>
>
>
> Thanks
>
>
>
> Howard
>
>
>
>
>
> 2017-01-23 8:23 GMT-07:00 Brendan Myers <brendan.my...@soft-forge.com>:
>
> Hello Howard,
>
> Thank you for looking into this. Attached is the output you requested.
> Also, I am using Open MPI 2.0.1.
>
>
>
> Thank you,
>
> Brendan
>
>
>
> *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Howard
> Pritchard
> *Sent:* Friday, January 20, 2017 6:35 PM
> *To:* Open MPI Users <users@lists.open-mpi.org>
> *Subject:* Re: [OMPI users] Open MPI over RoCE using breakout cable and
> switch
>
>
>
> Hi Brendan
>
>
>
> I doubt this kind of config has gotten any testing with OMPI.  Could you
> rerun with
>
>
>
> --mca btl_base_verbose 100
>
>
>
> added to the command line and post the output to the list?
>
>
>
> Howard
>
>
>
>
>
> Brendan Myers <brendan.my...@soft-forge.com> wrote on Fri., Jan. 20, 2017
> at 15:04:
>
> Hello,
>
> I am attempting to get Open MPI to run over 2 nodes using a switch and a
> single breakout cable with this design:
>
> (100GbE)QSFP <-> 2x (50GbE)QSFP
>
>
>
> Hardware Layout:
>
> Breakout cable module A connects to switch (100GbE)
>
> Breakout cable module B1 connects to node 1 RoCE NIC (50GbE)
>
> Breakout cable module B2 connects to node 2 RoCE NIC (50GbE)
>
> Switch is Mellanox SN 2700 100GbE RoCE switch
>
>
>
> · I  am able to pass RDMA traffic between the nodes with perftest
> (ib_write_bw) when using the breakout cable as the IC from both nodes to
> the switch.
>
> · When attempting to run a job using the breakout cable as the IC
> Open MPI aborts with failure to initialize open fabrics device errors.
>
> · If I replace the breakout cable with 2 standard QSFP cables the
> Open MPI job will complete correctly.
>
>
>
>
>
> This is the command I use, it works unless I attempt a run with the
> breakout cable used as IC:
>
> *mpirun --mca btl openib,self,sm --mca btl_openib_receive_queues
> P,65536,120,64,32 --mca btl_openib_cpc_include rdmacm  -hostfile
> mpi-hosts-ce /usr/local/bin/IMB-MPI1*
>
>
>
> If anyone has any idea as to why using a breakout cable is causing my jobs
> to fail please let me know.

Re: [OMPI users] OpenMPI not running any job on Mac OS X 10.12

2017-02-02 Thread Howard Pritchard
Hi Michel

Try adding --enable-static to the configure.
That fixed the problem for me.

Howard

Michel Lesoinne  wrote on Wed., Feb. 1, 2017 at
19:07:

> I have compiled OpenMPI 2.0.2 on a new Macbook running OS X 10.12 and have
> been trying to run simple program.
> I configured openmpi with
> ../configure --disable-shared --prefix ~/.local
> make all install
>
> Then I have  a simple code only containing a call to MPI_Init.
> I compile it with
> mpirun -np 2 ./mpitest
>
> The output is:
>
> [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open:
> unable to open mca_patcher_overwrite: File not found (ignored)
>
> [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open:
> unable to open mca_shmem_mmap: File not found (ignored)
>
> [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open:
> unable to open mca_shmem_posix: File not found (ignored)
>
> [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open:
> unable to open mca_shmem_sysv: File not found (ignored)
>
> --
>
> It looks like opal_init failed for some reason; your parallel process is
>
> likely to abort.  There are many reasons that a parallel process can
>
> fail during opal_init; some of which are due to configuration or
>
> environment problems.  This failure appears to be an internal failure;
>
> here's some additional information (which may only be relevant to an
>
> Open MPI developer):
>
>
>   opal_shmem_base_select failed
>
>   --> Returned value -1 instead of OPAL_SUCCESS
>
> --
>
> Without the --disable-shared in the configuration, then I get:
>
>
> [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad
> parameter in file ../../orte/orted/pmix/pmix_server.c at line 264
>
> [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad
> parameter in file ../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line
> 666
>
> --
>
> It looks like orte_init failed for some reason; your parallel process is
>
> likely to abort.  There are many reasons that a parallel process can
>
> fail during orte_init; some of which are due to configuration or
>
> environment problems.  This failure appears to be an internal failure;
>
> here's some additional information (which may only be relevant to an
>
> Open MPI developer):
>
>
>   pmix server init failed
>
>   --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
>
> --
>
>
>
>
> Has anyone seen this? What am I missing?
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI not running any job on Mac OS X 10.12

2017-02-02 Thread Howard Pritchard
Hi Michael,

I reproduced this problem on my Mac too:

pn1249323:~/ompi/examples (v2.0.x *)$ mpirun -np 2 ./ring_c

[pn1249323.lanl.gov:94283] mca_base_component_repository_open: unable to
open mca_patcher_overwrite: File not found (ignored)

[pn1249323.lanl.gov:94283] mca_base_component_repository_open: unable to
open mca_shmem_mmap: File not found (ignored)

[pn1249323.lanl.gov:94283] mca_base_component_repository_open: unable to
open mca_shmem_posix: File not found (ignored)

[pn1249323.lanl.gov:94283] mca_base_component_repository_open: unable to
open mca_shmem_sysv: File not found (ignored)

--

It looks like opal_init failed for some reason; your parallel process is

likely to abort.  There are many reasons that a parallel process can

fail during opal_init; some of which are due to configuration or

environment problems.  This failure appears to be an internal failure;

here's some additional information (which may only be relevant to an

Open MPI developer):


  opal_shmem_base_select failed

  --> Returned value -1 instead of OPAL_SUCCESS

Is there a reason why you are using the --disable-shared option?  Can you
use the --disable-dlopen instead?

I'll do some more investigating and open an issue.

Howard



2017-02-01 19:05 GMT-07:00 Michel Lesoinne :

> I have compiled OpenMPI 2.0.2 on a new Macbook running OS X 10.12 and have
> been trying to run simple program.
> I configured openmpi with
> ../configure --disable-shared --prefix ~/.local
> make all install
>
> Then I have  a simple code only containing a call to MPI_Init.
> I compile it with
> mpirun -np 2 ./mpitest
>
> The output is:
>
> [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open:
> unable to open mca_patcher_overwrite: File not found (ignored)
>
> [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open:
> unable to open mca_shmem_mmap: File not found (ignored)
>
> [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open:
> unable to open mca_shmem_posix: File not found (ignored)
>
> [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open:
> unable to open mca_shmem_sysv: File not found (ignored)
>
> --
>
> It looks like opal_init failed for some reason; your parallel process is
>
> likely to abort.  There are many reasons that a parallel process can
>
> fail during opal_init; some of which are due to configuration or
>
> environment problems.  This failure appears to be an internal failure;
>
> here's some additional information (which may only be relevant to an
>
> Open MPI developer):
>
>
>   opal_shmem_base_select failed
>
>   --> Returned value -1 instead of OPAL_SUCCESS
>
> --
>
> Without the --disable-shared in the configuration, then I get:
>
>
> [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad
> parameter in file ../../orte/orted/pmix/pmix_server.c at line 264
>
> [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad
> parameter in file ../../../../../orte/mca/ess/hnp/ess_hnp_module.c at
> line 666
>
> --
>
> It looks like orte_init failed for some reason; your parallel process is
>
> likely to abort.  There are many reasons that a parallel process can
>
> fail during orte_init; some of which are due to configuration or
>
> environment problems.  This failure appears to be an internal failure;
>
> here's some additional information (which may only be relevant to an
>
> Open MPI developer):
>
>
>   pmix server init failed
>
>   --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
>
> --
>
>
>
>
> Has anyone seen this? What am I missing?
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] OpenMPI not running any job on Mac OS X 10.12

2017-02-02 Thread Howard Pritchard
Hi Michel

It's somewhat unusual to use the --disable-shared configure option.  That
may be causing this.  Could you try to build without using this option and
see if you still see the problem?


Thanks,

Howard

Michel Lesoinne  wrote on Wed., Feb. 1, 2017 at
21:07:

> I have compiled OpenMPI 2.0.2 on a new Macbook running OS X 10.12 and have
> been trying to run simple program.
> I configured openmpi with
> ../configure --disable-shared --prefix ~/.local
> make all install
>
> Then I have  a simple code only containing a call to MPI_Init.
> I compile it with
> mpirun -np 2 ./mpitest
>
> The output is:
>
> [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open:
> unable to open mca_patcher_overwrite: File not found (ignored)
>
> [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open:
> unable to open mca_shmem_mmap: File not found (ignored)
>
> [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open:
> unable to open mca_shmem_posix: File not found (ignored)
>
> [Michels-MacBook-Pro.local:45101] mca_base_component_repository_open:
> unable to open mca_shmem_sysv: File not found (ignored)
>
> --
>
> It looks like opal_init failed for some reason; your parallel process is
>
> likely to abort.  There are many reasons that a parallel process can
>
> fail during opal_init; some of which are due to configuration or
>
> environment problems.  This failure appears to be an internal failure;
>
> here's some additional information (which may only be relevant to an
>
> Open MPI developer):
>
>
>   opal_shmem_base_select failed
>
>   --> Returned value -1 instead of OPAL_SUCCESS
>
> --
>
> Without the --disable-shared in the configuration, then I get:
>
>
> [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad
> parameter in file ../../orte/orted/pmix/pmix_server.c at line 264
>
> [Michels-MacBook-Pro.local:68818] [[53415,0],0] ORTE_ERROR_LOG: Bad
> parameter in file ../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line
> 666
>
> --
>
> It looks like orte_init failed for some reason; your parallel process is
>
> likely to abort.  There are many reasons that a parallel process can
>
> fail during orte_init; some of which are due to configuration or
>
> environment problems.  This failure appears to be an internal failure;
>
> here's some additional information (which may only be relevant to an
>
> Open MPI developer):
>
>
>   pmix server init failed
>
>   --> Returned value Bad parameter (-5) instead of ORTE_SUCCESS
>
> --
>
>
>
>
> Has anyone seen this? What am I missing?
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Error using hpcc benchmark

2017-01-31 Thread Howard Pritchard
Hi Wodel

The RandomAccess part of HPCC is probably causing this.

Perhaps set the PSM env. variable -

export PSM_MQ_RECVREQS_MAX=1000

or something like that.
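One way to make sure the variable reaches every rank is to forward it with
mpirun's -x option, added to the command from the report below. The particular
limit shown here is only an illustration; it presumably needs to be larger than
the default of 1048576 that the error message says was exhausted:

  mpirun -np 512 -x PSM_MQ_RECVREQS_MAX=4194304 --mca mtl psm \
      --hostfile hosts32 /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt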

Alternatively launch the job using

mpirun --mca pml ob1 --host 

to avoid use of psm.  Performance will probably suffer with this option
however.

Howard
wodel youchi  wrote on Tue., Jan. 31, 2017 at 08:27:

> Hi,
>
> I am n newbie in HPC world
>
> I am trying to execute the hpcc benchmark on our cluster, but every time I
> start the job, I get this error, then the job exits
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> compute017.22840 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> compute024.22840 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> compute019.22847 Exhausted 1048576 MQ irecv request descriptors, which
> usually indicates a user program error or insufficient request descriptors
> (PSM_MQ_RECVREQS_MAX=1048576)
> --
> Primary job terminated normally, but 1 process returned
> a non-zero exit code. Per user-direction, the job has been aborted.
> --
> mpirun detected that one or more processes exited with non-zero status,
> thus causing the job to be terminated. The first process to do so was:
>   Process name: [[19601,1],272]
>   Exit code:    255
> --
>
> Platform : IBM PHPC
> OS : RHEL 6.5
> one management node
> 32 compute node : 16 cores, 32GB RAM, intel qlogic QLE7340 one port QRD
> infiniband 40Gb/s
>
> I compiled hpcc against : IBM MPI, Openmpi 2.0.1 (compiled with gcc 4.4.7)
> and Openmpi 1.8.1 (compiled with gcc 4.4.7)
>
> I get the errors, but each time on different compute nodes.
>
> This is the command I used to start the job
>
> *mpirun -np 512 --mca mtl psm --hostfile hosts32
> /shared/build/hpcc-1.5.0b-blas-ompi-181/hpcc hpccinf.txt*
>
> Any help will be appreciated, and if you need more details, let me know.
> Thanks in advance.
>
>
> Regards.
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Open MPI over RoCE using breakout cable and switch

2017-01-24 Thread Howard Pritchard
Hello Brendan,

This helps some, but looks like we need more debug output.

Could you build a debug version of Open MPI by adding --enable-debug
to the config options and rerun the test with the breakout cable setup
and keeping the --mca btl_base_verbose 100 command line option?

Thanks

Howard


2017-01-23 8:23 GMT-07:00 Brendan Myers <brendan.my...@soft-forge.com>:

> Hello Howard,
>
> Thank you for looking into this. Attached is the output you requested.
> Also, I am using Open MPI 2.0.1.
>
>
>
> Thank you,
>
> Brendan
>
>
>
> *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Howard
> Pritchard
> *Sent:* Friday, January 20, 2017 6:35 PM
> *To:* Open MPI Users <users@lists.open-mpi.org>
> *Subject:* Re: [OMPI users] Open MPI over RoCE using breakout cable and
> switch
>
>
>
> Hi Brendan
>
>
>
> I doubt this kind of config has gotten any testing with OMPI.  Could you
> rerun with
>
>
>
> --mca btl_base_verbose 100
>
>
>
> added to the command line and post the output to the list?
>
>
>
> Howard
>
>
>
>
>
> Brendan Myers <brendan.my...@soft-forge.com> wrote on Fri., Jan. 20, 2017
> at 15:04:
>
> Hello,
>
> I am attempting to get Open MPI to run over 2 nodes using a switch and a
> single breakout cable with this design:
>
> (100GbE)QSFP <-> 2x (50GbE)QSFP
>
>
>
> Hardware Layout:
>
> Breakout cable module A connects to switch (100GbE)
>
> Breakout cable module B1 connects to node 1 RoCE NIC (50GbE)
>
> Breakout cable module B2 connects to node 2 RoCE NIC (50GbE)
>
> Switch is Mellanox SN 2700 100GbE RoCE switch
>
>
>
> · I  am able to pass RDMA traffic between the nodes with perftest
> (ib_write_bw) when using the breakout cable as the IC from both nodes to
> the switch.
>
> · When attempting to run a job using the breakout cable as the IC
> Open MPI aborts with failure to initialize open fabrics device errors.
>
> · If I replace the breakout cable with 2 standard QSFP cables the
> Open MPI job will complete correctly.
>
>
>
>
>
> This is the command I use, it works unless I attempt a run with the
> breakout cable used as IC:
>
> *mpirun --mca btl openib,self,sm --mca btl_openib_receive_queues
> P,65536,120,64,32 --mca btl_openib_cpc_include rdmacm  -hostfile
> mpi-hosts-ce /usr/local/bin/IMB-MPI1*
>
>
>
> If anyone has any idea as to why using a breakout cable is causing my jobs
> to fail please let me know.
>
>
>
> Thank you,
>
>
>
> Brendan T. W. Myers
>
> brendan.my...@soft-forge.com
>
> Software Forge Inc
>
>
>
> ___
>
> users mailing list
>
> users@lists.open-mpi.org
>
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Open MPI over RoCE using breakout cable and switch

2017-01-20 Thread Howard Pritchard
Hi Brendan

I doubt this kind of config has gotten any testing with OMPI.  Could you
rerun with

--mca btl_base_verbose 100

added to the command line and post the output to the list?

Howard


Brendan Myers  wrote on Fri., Jan. 20, 2017
at 15:04:

> Hello,
>
> I am attempting to get Open MPI to run over 2 nodes using a switch and a
> single breakout cable with this design:
>
> (100GbE)QSFP <-> 2x (50GbE)QSFP
>
>
>
> Hardware Layout:
>
> Breakout cable module A connects to switch (100GbE)
>
> Breakout cable module B1 connects to node 1 RoCE NIC (50GbE)
>
> Breakout cable module B2 connects to node 2 RoCE NIC (50GbE)
>
> Switch is Mellanox SN 2700 100GbE RoCE switch
>
>
>
> · I  am able to pass RDMA traffic between the nodes with perftest
> (ib_write_bw) when using the breakout cable as the IC from both nodes to
> the switch.
>
> · When attempting to run a job using the breakout cable as the IC
> Open MPI aborts with failure to initialize open fabrics device errors.
>
> · If I replace the breakout cable with 2 standard QSFP cables the
> Open MPI job will complete correctly.
>
>
>
>
>
> This is the command I use, it works unless I attempt a run with the
> breakout cable used as IC:
>
> *mpirun --mca btl openib,self,sm --mca btl_openib_receive_queues
> P,65536,120,64,32 --mca btl_openib_cpc_include rdmacm  -hostfile
> mpi-hosts-ce /usr/local/bin/IMB-MPI1*
>
>
>
> If anyone has any idea as to why using a breakout cable is causing my jobs
> to fail please let me know.
>
>
>
> Thank you,
>
>
>
> Brendan T. W. Myers
>
> brendan.my...@soft-forge.com
>
> Software Forge Inc
>
>
> ___
>
> users mailing list
>
> users@lists.open-mpi.org
>
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux

2017-01-09 Thread Howard Pritchard
HI Siegmar,

You have some config parameters I wasn't trying that may have some impact.
I'll give it a try with these parameters.

This should be enough info for now,

Thanks,

Howard


2017-01-09 0:59 GMT-07:00 Siegmar Gross <
siegmar.gr...@informatik.hs-fulda.de>:

> Hi Howard,
>
> I use the following commands to build and install the package.
> ${SYSTEM_ENV} is "Linux" and ${MACHINE_ENV} is "x86_64" for my
> Linux machine.
>
> mkdir openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
> cd openmpi-2.0.2rc3-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>
> ../openmpi-2.0.2rc3/configure \
>   --prefix=/usr/local/openmpi-2.0.2_64_cc \
>   --libdir=/usr/local/openmpi-2.0.2_64_cc/lib64 \
>   --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
>   --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
>   JAVA_HOME=/usr/local/jdk1.8.0_66 \
>   LDFLAGS="-m64 -mt -Wl,-z -Wl,noexecstack" CC="cc" CXX="CC" FC="f95" \
>   CFLAGS="-m64 -mt" CXXFLAGS="-m64" FCFLAGS="-m64" \
>   CPP="cpp" CXXCPP="cpp" \
>   --enable-mpi-cxx \
>   --enable-mpi-cxx-bindings \
>   --enable-cxx-exceptions \
>   --enable-mpi-java \
>   --enable-heterogeneous \
>   --enable-mpi-thread-multiple \
>   --with-hwloc=internal \
>   --without-verbs \
>   --with-wrapper-cflags="-m64 -mt" \
>   --with-wrapper-cxxflags="-m64" \
>   --with-wrapper-fcflags="-m64" \
>   --with-wrapper-ldflags="-mt" \
>   --enable-debug \
>   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>
> make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
> rm -r /usr/local/openmpi-2.0.2_64_cc.old
> mv /usr/local/openmpi-2.0.2_64_cc /usr/local/openmpi-2.0.2_64_cc.old
> make install |& tee log.make-install.$SYSTEM_ENV.$MACHINE_ENV.64_cc
> make check |& tee log.make-check.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>
>
> I get a different error if I run the program with gdb.
>
> loki spawn 118 gdb /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec
> GNU gdb (GDB; SUSE Linux Enterprise 12) 7.11.1
> Copyright (C) 2016 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.h
> tml>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-suse-linux".
> Type "show configuration" for configuration details.
> For bug reporting instructions, please see:
> <http://bugs.opensuse.org/>.
> Find the GDB manual and other documentation resources online at:
> <http://www.gnu.org/software/gdb/documentation/>.
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec...done.
> (gdb) r -np 1 --host loki --slot-list 0:0-5,1:0-5 spawn_master
> Starting program: /usr/local/openmpi-2.0.2_64_cc/bin/mpiexec -np 1 --host
> loki --slot-list 0:0-5,1:0-5 spawn_master
> Missing separate debuginfos, use: zypper install
> glibc-debuginfo-2.24-2.3.x86_64
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> [New Thread 0x73b97700 (LWP 13582)]
> [New Thread 0x718a4700 (LWP 13583)]
> [New Thread 0x710a3700 (LWP 13584)]
> [New Thread 0x7fffebbba700 (LWP 13585)]
> Detaching after fork from child process 13586.
>
> Parent process 0 running on loki
>   I create 4 slave processes
>
> Detaching after fork from child process 13589.
> Detaching after fork from child process 13590.
> Detaching after fork from child process 13591.
> [loki:13586] OPAL ERROR: Timeout in file ../../../../openmpi-2.0.2rc3/o
> pal/mca/pmix/base/pmix_base_fns.c at line 193
> [loki:13586] *** An error occurred in MPI_Comm_spawn
> [loki:13586] *** reported by process [2873294849,0]
> [loki:13586] *** on communicator MPI_COMM_WORLD
> [loki:13586] *** MPI_ERR_UNKNOWN: unknown error
> [loki:13586] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will
> now abort,
> [loki:13586] ***and potentially your MPI job)
> [Thread 0x7fffebbba700 (LWP 13585) exited]
> [Thread 0x710a3700 (LWP 13584) exited]
> [Thread 0x718a4700 (LWP 13583) exited]
> [Thread 0x73b97700 (LWP 13582) exited]
> [Inferior 1 (process 13567) exited with code 016]
> Missing separate debuginfos, use: zypper install
> libpciaccess0-debuginfo-0.13.2-5.1.x86_64 libudev1-debuginfo-210-116.3.3
> .x86_64
> (gdb) bt
> No stack.
> (gdb)
>
> Do you need anything else?
>
>

Re: [OMPI users] still segmentation fault with openmpi-2.0.2rc3 on Linux

2017-01-08 Thread Howard Pritchard
HI Siegmar,

Could you post the configury options you use when building the 2.0.2rc3?
Maybe that will help in trying to reproduce the segfault you are observing.

Howard


2017-01-07 2:30 GMT-07:00 Siegmar Gross <
siegmar.gr...@informatik.hs-fulda.de>:

> Hi,
>
> I have installed openmpi-2.0.2rc3 on my "SUSE Linux Enterprise
> Server 12 (x86_64)" with Sun C 5.14 and gcc-6.3.0. Unfortunately,
> I still get the same error that I reported for rc2.
>
> I would be grateful, if somebody can fix the problem before
> releasing the final version. Thank you very much for any help
> in advance.
>
>
> Kind regards
>
> Siegmar
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] segmentation fault with openmpi-2.0.2rc2 on Linux

2017-01-03 Thread Howard Pritchard
HI Siegmar,

Could you please rerun the spawn_slave program with 4 processes?
Your original traceback indicates a failure in the barrier in the slave
program.  I'm interested in seeing whether the barrier failure is also
observed when you run the slave program standalone with 4 processes.
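For concreteness, reusing the host and slot list from the invocation quoted
further down in this thread, that standalone run would be something like the
following (the host is repeated so that four slots are available):

  mpiexec -np 4 --host loki,loki,loki,loki --slot-list 0:0-5,1:0-5 spawn_slave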

Thanks,

Howard


2017-01-03 0:32 GMT-07:00 Siegmar Gross <
siegmar.gr...@informatik.hs-fulda.de>:

> Hi Howard,
>
> thank you very much that you try to solve my problem. I haven't
> changed the programs since 2013 so that you use the correct
> version. The program works as expected with the master trunk as
> you can see at the bottom of this email from my last mail. The
> slave program works when I launch it directly.
>
> loki spawn 122 mpicc --showme
> cc -I/usr/local/openmpi-2.0.2_64_cc/include -m64 -mt -mt -Wl,-rpath
> -Wl,/usr/local/openmpi-2.0.2_64_cc/lib64 -Wl,--enable-new-dtags
> -L/usr/local/openmpi-2.0.2_64_cc/lib64 -lmpi
> loki spawn 123 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
> Open MPI: 2.0.2rc2
>  C compiler absolute: /opt/solstudio12.5b/bin/cc
> loki spawn 124 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5 --mca
> btl_base_verbose 10 spawn_slave
> [loki:05572] mca: base: components_register: registering framework btl
> components
> [loki:05572] mca: base: components_register: found loaded component self
> [loki:05572] mca: base: components_register: component self register
> function successful
> [loki:05572] mca: base: components_register: found loaded component sm
> [loki:05572] mca: base: components_register: component sm register
> function successful
> [loki:05572] mca: base: components_register: found loaded component tcp
> [loki:05572] mca: base: components_register: component tcp register
> function successful
> [loki:05572] mca: base: components_register: found loaded component vader
> [loki:05572] mca: base: components_register: component vader register
> function successful
> [loki:05572] mca: base: components_open: opening btl components
> [loki:05572] mca: base: components_open: found loaded component self
> [loki:05572] mca: base: components_open: component self open function
> successful
> [loki:05572] mca: base: components_open: found loaded component sm
> [loki:05572] mca: base: components_open: component sm open function
> successful
> [loki:05572] mca: base: components_open: found loaded component tcp
> [loki:05572] mca: base: components_open: component tcp open function
> successful
> [loki:05572] mca: base: components_open: found loaded component vader
> [loki:05572] mca: base: components_open: component vader open function
> successful
> [loki:05572] select: initializing btl component self
> [loki:05572] select: init of component self returned success
> [loki:05572] select: initializing btl component sm
> [loki:05572] select: init of component sm returned failure
> [loki:05572] mca: base: close: component sm closed
> [loki:05572] mca: base: close: unloading component sm
> [loki:05572] select: initializing btl component tcp
> [loki:05572] select: init of component tcp returned success
> [loki:05572] select: initializing btl component vader
> [loki][[35331,1],0][../../../../../openmpi-2.0.2rc2/opal/mca
> /btl/vader/btl_vader_component.c:454:mca_btl_vader_component_init] No
> peers to communicate with. Disabling vader.
> [loki:05572] select: init of component vader returned failure
> [loki:05572] mca: base: close: component vader closed
> [loki:05572] mca: base: close: unloading component vader
> [loki:05572] mca: bml: Using self btl for send to [[35331,1],0] on node
> loki
> Slave process 0 of 1 running on loki
> spawn_slave 0: argv[0]: spawn_slave
> [loki:05572] mca: base: close: component self closed
> [loki:05572] mca: base: close: unloading component self
> [loki:05572] mca: base: close: component tcp closed
> [loki:05572] mca: base: close: unloading component tcp
> loki spawn 125
>
>
> Kind regards and thank you very much once more
>
> Siegmar
>
> Am 03.01.2017 um 00:17 schrieb Howard Pritchard:
>
>> HI Siegmar,
>>
>> I've attempted to reproduce this using gnu compilers and
>> the version of this test program(s) you posted earlier in 2016
>> but am unable to reproduce the problem.
>>
>> Could you double check that the slave program can be
>> successfully run when launched directly by mpirun/mpiexec?
>> It might also help to use --mca btl_base_verbose 10 when
>> running the slave program standalone.
>>
>> Thanks,
>>
>> Howard
>>
>>
>>
>> 2016-12-28 7:06 GMT-07:00 Siegmar Gross <siegmar.gr...@informatik.hs-f
>> ulda.de <mailto:siegmar.gr...@informatik.hs-fulda.de>>:
>>
>

Re: [OMPI users] segmentation fault with openmpi-2.0.2rc2 on Linux

2017-01-02 Thread Howard Pritchard
HI Siegmar,

I've attempted to reproduce this using gnu compilers and
the version of this test program(s) you posted earlier in 2016
but am unable to reproduce the problem.

Could you double check that the slave program can be
successfully run when launched directly by mpirun/mpiexec?
It might also help to use --mca btl_base_verbose 10 when
running the slave program standalone.

Thanks,

Howard



2016-12-28 7:06 GMT-07:00 Siegmar Gross <
siegmar.gr...@informatik.hs-fulda.de>:

> Hi,
>
> I have installed openmpi-2.0.2rc2 on my "SUSE Linux Enterprise
> Server 12 (x86_64)" with Sun C 5.14 beta and gcc-6.2.0. Unfortunately,
> I get an error when I run one of my programs. Everything works as
> expected with openmpi-master-201612232109-67a08e8. The program
> gets a timeout with openmpi-v2.x-201612232156-5ce66b0.
>
> loki spawn 144 ompi_info | grep -e "Open MPI:" -e "C compiler absolute:"
> Open MPI: 2.0.2rc2
>  C compiler absolute: /opt/solstudio12.5b/bin/cc
>
>
> loki spawn 145 mpiexec -np 1 --host loki --slot-list 0:0-5,1:0-5
> spawn_master
>
> Parent process 0 running on loki
>   I create 4 slave processes
>
> --
> A system call failed during shared memory initialization that should
> not have.  It is likely that your MPI job will now either abort or
> experience performance degradation.
>
>   Local host:  loki
>   System call: open(2)
>   Error:   No such file or directory (errno 2)
> --
> [loki:17855] *** Process received signal ***
> [loki:17855] Signal: Segmentation fault (11)
> [loki:17855] Signal code: Address not mapped (1)
> [loki:17855] Failing at address: 0x8
> [loki:17855] [ 0] /lib64/libpthread.so.0(+0xf870)[0x7f053d0e9870]
> [loki:17855] [ 1] /usr/local/openmpi-2.0.2_64_cc
> /lib64/openmpi/mca_pml_ob1.so(+0x990ae)[0x7f05325060ae]
> [loki:17855] [ 2] /usr/local/openmpi-2.0.2_64_cc
> /lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_req_start+0x1
> 96)[0x7f053250cb16]
> [loki:17855] [ 3] /usr/local/openmpi-2.0.2_64_cc
> /lib64/openmpi/mca_pml_ob1.so(mca_pml_ob1_irecv+0x2f8)[0x7f05324bd3d8]
> [loki:17855] [ 4] /usr/local/openmpi-2.0.2_64_cc
> /lib64/libmpi.so.20(ompi_coll_base_bcast_intra_generic+0x34c
> )[0x7f053e52300c]
> [loki:17855] [ 5] /usr/local/openmpi-2.0.2_64_cc
> /lib64/libmpi.so.20(ompi_coll_base_bcast_intra_binomial+
> 0x1ed)[0x7f053e523eed]
> [loki:17855] [ 6] /usr/local/openmpi-2.0.2_64_cc
> /lib64/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_
> intra_dec_fixed+0x1a3)[0x7f0531ea7c03]
> [loki:17855] [ 7] /usr/local/openmpi-2.0.2_64_cc
> /lib64/libmpi.so.20(ompi_dpm_connect_accept+0xab8)[0x7f053d484f38]
> [loki:17855] [ 8] [loki:17845] [[55817,0],0] ORTE_ERROR_LOG: Not found in
> file ../../openmpi-2.0.2rc2/orte/orted/pmix/pmix_server_fence.c at line
> 186
> /usr/local/openmpi-2.0.2_64_cc/lib64/libmpi.so.20(ompi_dpm_
> dyn_init+0xcd)[0x7f053d48aeed]
> [loki:17855] [ 9] /usr/local/openmpi-2.0.2_64_cc
> /lib64/libmpi.so.20(ompi_mpi_init+0xf93)[0x7f053d53d5f3]
> [loki:17855] [10] /usr/local/openmpi-2.0.2_64_cc
> /lib64/libmpi.so.20(PMPI_Init+0x8d)[0x7f053db209cd]
> [loki:17855] [11] spawn_slave[0x4009cf]
> [loki:17855] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f053cd53b25]
> [loki:17855] [13] spawn_slave[0x400892]
> [loki:17855] *** End of error message ***
> [loki:17845] [[55817,0],0] ORTE_ERROR_LOG: Not found in file
> ../../openmpi-2.0.2rc2/orte/orted/pmix/pmix_server_fence.c at line 186
> --
> At least one pair of MPI processes are unable to reach each other for
> MPI communications.  This means that no Open MPI device has indicated
> that it can be used to communicate between these processes.  This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other.  This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
>   Process 1 ([[55817,2],0]) is on host: loki
>   Process 2 ([[55817,2],1]) is on host: unknown!
>   BTLs attempted: self sm tcp vader
>
> Your MPI job is now going to abort; sorry.
> --
> *** An error occurred in MPI_Init
> *** on a NULL communicator
> *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
> ***and potentially your MPI job)
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>
>   ompi_dpm_dyn_init() failed
>   --> Returned "Unreachable" (-12) instead of "Success" (0)
> 

Re: [OMPI users] Segmentation Fault (Core Dumped) on mpif90 -v

2016-12-23 Thread Howard Pritchard
Hi Paul,

Thanks very much for the Christmas present.

The Open MPI README has been updated
to include a note about issues with the Intel 16.0.3-4 compiler suites.

Enjoy the holidays,

Howard


2016-12-23 3:41 GMT-07:00 Paul Kapinos :

> Hi all,
>
> we discussed this issue with Intel compiler support and it looks like they
> now know what the issue is and how to protect after. It is a known issue
> resulting from a backwards incompatibility in an OS/glibc update, cf.
> https://sourceware.org/bugzilla/show_bug.cgi?id=20019
>
> Affected versions of the Intel compilers: 16.0.3, 16.0.4
> Not affected versions: 16.0.2, 17.0
>
> So, simply do not use affected versions (and hope on an bugfix update in
> 16x series if you cannot immediately upgrade to 17x, like we, despite this
> is the favourite option from Intel).
>
> Have a nice Christmas time!
>
> Paul Kapinos
>
> On 12/14/16 13:29, Paul Kapinos wrote:
>
>> Hello all,
>> we seem to run into the same issue: 'mpif90' sigsegvs immediately for
>> Open MPI
>> 1.10.4 compiled using Intel compilers 16.0.4.258 and 16.0.3.210, while it
>> works
>> fine when compiled with 16.0.2.181.
>>
>> It seems to be a compiler issue (more exactly: library issue on libs
>> delivered
>> with 16.0.4.258 and 16.0.3.210 versions). Changing the version of compiler
>> loaded back to 16.0.2.181 (=> change of dynamically loaded libs) let the
>> previously-failing binary (compiled with newer compilers) work
>> properly.
>>
>> Compiling with -O0 does not help. As the issue is likely in the Intel
>> libs (as
>> said changing out these solves/raises the issue) we will do a failback to
>> 16.0.2.181 compiler version. We will try to open a case by Intel - let's
>> see...
>>
>> Have a nice day,
>>
>> Paul Kapinos
>>
>>
>>
>> On 05/06/16 14:10, Jeff Squyres (jsquyres) wrote:
>>
>>> Ok, good.
>>>
>>> I asked that question because typically when we see errors like this, it
>>> is
>>> usually either a busted compiler installation or inadvertently mixing the
>>> run-times of multiple different compilers in some kind of incompatible
>>> way.
>>> Specifically, the mpifort (aka mpif90) application is a fairly simple
>>> program
>>> -- there's no reason it should segv, especially with a stack trace that
>>> you
>>> sent that implies that it's dying early in startup, potentially even
>>> before it
>>> has hit any Open MPI code (i.e., it could even be pre-main).
>>>
>>> BTW, you might be able to get a more complete stack trace from the
>>> debugger
>>> that comes with the Intel compiler (idb?  I don't remember offhand).
>>>
>>> Since you are able to run simple programs compiled by this compiler, it
>>> sounds
>>> like the compiler is working fine.  Good!
>>>
>>> The next thing to check is to see if somehow the compiler and/or run-time
>>> environments are getting mixed up.  E.g., the apps were compiled for one
>>> compiler/run-time but are being used with another.  Also ensure that any
>>> compiler/linker flags that you are passing to Open MPI's configure
>>> script are
>>> native and correct for the platform for which you're compiling (e.g.,
>>> don't
>>> pass in flags that optimize for a different platform; that may result in
>>> generating machine code instructions that are invalid for your platform).
>>>
>>> Try recompiling/re-installing Open MPI from scratch, and if it still
>>> doesn't
>>> work, then send all the information listed here:
>>>
>>> https://www.open-mpi.org/community/help/
>>>
>>>
>>> On May 6, 2016, at 3:45 AM, Giacomo Rossi  wrote:

 Yes, I've tried three simple "Hello world" programs in fortan, C and
 C++ and
 the compile and run with intel 16.0.3. The problem is with the openmpi
 compiled from source.

 Giacomo Rossi Ph.D., Space Engineer

 Research Fellow at Dept. of Mechanical and Aerospace Engineering,
 "Sapienza"
 University of Rome
 p: (+39) 0692927207 | m: (+39) 3408816643 | e: giacom...@gmail.com

 Member of Fortran-FOSS-programmers


 2016-05-05 11:15 GMT+02:00 Giacomo Rossi :
  gdb /opt/openmpi/1.10.2/intel/16.0.3/bin/mpif90
 GNU gdb (GDB) 7.11
 Copyright (C) 2016 Free Software Foundation, Inc.
 License GPLv3+: GNU GPL version 3 or later <
 http://gnu.org/licenses/gpl.html>
 This is free software: you are free to change and redistribute it.
 There is NO WARRANTY, to the extent permitted by law.  Type "show
 copying"
 and "show warranty" for details.
 This GDB was configured as "x86_64-pc-linux-gnu".
 Type "show configuration" for configuration details.
 For bug reporting instructions, please see:
 .
 Find the GDB manual and other documentation resources online at:
 .
 For help, type "help".
 Type "apropos word" to search for commands related to "word"...
 Reading symbols from 

Re: [OMPI users] device failed to appear .. Connection timed out

2016-12-08 Thread Howard Pritchard
Hi Daniele,

I bet this psm2 got installed as part of MPSS 3.7.  I see something in the
readme for that about MPSS install with OFED support.
I think if you want to go the route of using the RHEL Open MPI RPMS, you
could use the mca-params.conf file approach
to disabling the use of psm2.

This file and a lot of other stuff about mca parameters is described here:

https://www.open-mpi.org/faq/?category=tuning
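As a sketch of that approach, a per-user file at $HOME/.openmpi/mca-params.conf
(or the system-wide <prefix>/etc/openmpi-mca-params.conf) could contain one of
the following lines to keep the psm2 MTL from being opened; which one is
appropriate depends on what else on the system uses PSM2:

  # do not load the psm2 MTL (the component that opens /dev/hfi1_0)
  mtl = ^psm2
  # or, more bluntly, force the ob1 PML so that no MTL is used at all
  # pml = ob1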

Alternatively, you could try and build/install Open MPI yourself from the
download page:

https://www.open-mpi.org/software/ompi/v1.10/

The simplest solution - but you need to be confident that nothing's using
the PSM2 software - would be just to
use yum to uninstall the psm2 RPM.

Good luck,

Howard




2016-12-08 14:17 GMT-07:00 Daniele Tartarini <d.tartar...@sheffield.ac.uk>:

> Hi,
> many thanks for tour reply.
>
> I have a S2600IP Intel motherboard. it is a stand alone server and I
> cannot see any omnipath device and so not such modules.
> opainfo is not available on my system
>
> missing anything?
> cheers
> Daniele
>
> On 8 December 2016 at 17:55, Cabral, Matias A <matias.a.cab...@intel.com>
> wrote:
>
>> >Anyway, * /dev/hfi1_0* doesn't exist.
>>
>> Make sure you have the hfi1 module/driver loaded.
>>
>> In addition, please confirm the links are in active state on all the
>> nodes `opainfo`
>>
>>
>>
>> _MAC
>>
>>
>>
>> *From:* users [mailto:users-boun...@lists.open-mpi.org] *On Behalf Of *Howard
>> Pritchard
>> *Sent:* Thursday, December 08, 2016 9:23 AM
>> *To:* Open MPI Users <users@lists.open-mpi.org>
>> *Subject:* Re: [OMPI users] device failed to appear .. Connection timed
>> out
>>
>>
>>
>> hello Daniele,
>>
>>
>>
>> Could you post the output from ompi_info command?  I'm noticing on the
>> RPMS that came with the rhel7.2 distro on
>>
>> one of our systems that it was built to support psm2/hfi-1.
>>
>>
>>
>> Two things, could you try running applications with
>>
>>
>>
>> mpirun --mca pml ob1 (all the rest of your args)
>>
>>
>>
>> and see if that works?
>>
>>
>>
>> Second,  what sort of system are you using?  Is this a cluster?  If it
>> is, you may want to check whether
>>
>> you have a situation where its an omnipath interconnect and you have the
>> psm2/hfi1 packages installed
>>
>> but for some reason the omnipath HCAs themselves are not active.
>>
>>
>>
>> On one of our omnipath systems the following hfi1 related pms are
>> installed:
>>
>>
>>
>> hfidiags-0.8-13.x86_64
>>
>> hfi1-psm-devel-0.7-244.x86_64
>> libhfi1verbs-0.5-16.el7.x86_64
>> hfi1-psm-0.7-244.x86_64
>> hfi1-firmware-0.9-36.noarch
>> hfi1-psm-compat-0.7-244.x86_64
>> libhfi1verbs-devel-0.5-16.el7.x86_64
>> hfi1-0.11.3.10.0_327.el7.x86_64-245.x86_64
>> hfi1-firmware_debug-0.9-36.noarch
>> hfi1-diagtools-sw-0.8-13.x86_64
>>
>>
>>
>> Howard
>>
>>
>>
>> 2016-12-08 8:45 GMT-07:00 r...@open-mpi.org <r...@open-mpi.org>:
>>
>> Sounds like something didn’t quite get configured right, or maybe you
>> have a library installed that isn’t quite setup correctly, or...
>>
>>
>>
>> Regardless, we generally advise building from source to avoid such
>> problems. Is there some reason not to just do so?
>>
>>
>>
>> On Dec 8, 2016, at 6:16 AM, Daniele Tartarini <
>> d.tartar...@sheffield.ac.uk> wrote:
>>
>>
>>
>> Hi,
>>
>> I've installed on a Red Hat 7.2 the OpenMPI distributed via Yum:
>>
>> *openmpi-devel.x86_64 1.10.3-3.el7  *
>>
>>
>>
>> any code I try to run (including the mpitests-*) I get the following
>> message with slight variants:
>>
>>
>>
>> * my_machine.171619hfi_wait_for_device: The /dev/hfi1_0 device
>> failed to appear after 15.0 seconds: Connection timed out*
>>
>>
>>
>> Is anyone able to help me in identifying the source of the problem?
>>
>> Anyway, * /dev/hfi1_0* doesn't exist.
>>
>>
>>
>> If I use an OpenMPI version compiled from source I have no issue (gcc
>> 4.8.5).
>>
>>
>>
>> many thanks in advance.
>>
>>
>>
>> cheers
>>
>> Daniele
>>
>> ___
>> users mailing list
>> users@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] device failed to appear .. Connection timed out

2016-12-08 Thread Howard Pritchard
hello Daniele,

Could you post the output from ompi_info command?  I'm noticing on the RPMS
that came with the rhel7.2 distro on
one of our systems that it was built to support psm2/hfi-1.

Two things, could you try running applications with

mpirun --mca pml ob1 (all the rest of your args)

and see if that works?

Second, what sort of system are you using?  Is this a cluster?  If it is,
you may want to check whether
you have a situation where it's an Omni-Path interconnect and you have the
psm2/hfi1 packages installed
but for some reason the Omni-Path HCAs themselves are not active.

On one of our Omni-Path systems the following hfi1-related RPMs are installed:

hfidiags-0.8-13.x86_64
hfi1-psm-devel-0.7-244.x86_64
libhfi1verbs-0.5-16.el7.x86_64
hfi1-psm-0.7-244.x86_64
hfi1-firmware-0.9-36.noarch
hfi1-psm-compat-0.7-244.x86_64
libhfi1verbs-devel-0.5-16.el7.x86_64
hfi1-0.11.3.10.0_327.el7.x86_64-245.x86_64
hfi1-firmware_debug-0.9-36.noarch
hfi1-diagtools-sw-0.8-13.x86_64


Howard

2016-12-08 8:45 GMT-07:00 r...@open-mpi.org :

> Sounds like something didn’t quite get configured right, or maybe you have
> a library installed that isn’t quite setup correctly, or...
>
> Regardless, we generally advise building from source to avoid such
> problems. Is there some reason not to just do so?
>
> On Dec 8, 2016, at 6:16 AM, Daniele Tartarini 
> wrote:
>
> Hi,
>
> I've installed on a Red Hat 7.2 the OpenMPI distributed via Yum:
>
> *openmpi-devel.x86_64 1.10.3-3.el7  *
>
> any code I try to run (including the mpitests-*) I get the following
> message with slight variants:
>
> * my_machine.171619hfi_wait_for_device: The /dev/hfi1_0 device
> failed to appear after 15.0 seconds: Connection timed out*
>
> Is anyone able to help me in identifying the source of the problem?
> Anyway, * /dev/hfi1_0* doesn't exist.
>
> If I use an OpenMPI version compiled from source I have no issue (gcc
> 4.8.5).
>
> many thanks in advance.
>
> cheers
> Daniele
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
>
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] Follow-up to Open MPI SC'16 BOF

2016-11-22 Thread Howard Pritchard
Hi Jeff,

I don't think it was the use of memkind itself that was the issue, but
rather a need to refactor the way Open MPI is using info objects.
I don't recall the details.

Howard


2016-11-22 16:27 GMT-07:00 Jeff Hammond :

>
>>
>>1. MPI_ALLOC_MEM integration with memkind
>>
>> It would make sense to prototype this as a standalone project that is
> integrated with any MPI library via PMPI.  It's probably a day or two of
> work to get that going.
>
> Jeff
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
>
> ___
> users mailing list
> users@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/users
>
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users
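As an illustration of the standalone PMPI prototype suggested in the reply
above, a bare-bones MPI_Alloc_mem/MPI_Free_mem shim over memkind could look
like the sketch below. This is not code from the thread: the info key name
"memkind_kind" is invented for the example, error handling is minimal, and
mixing memkind-backed and PMPI-backed allocations would need bookkeeping that
is only noted in a comment.

/* alloc_mem_memkind.c - illustrative PMPI interposer, not production code.
 * Build as a shared object (or link ahead of the MPI library) with -lmemkind.
 */
#include <mpi.h>
#include <memkind.h>
#include <string.h>

/* Pick a memkind "kind" from an (invented) info key, e.g. set by the app:
 *   MPI_Info_set(info, "memkind_kind", "hbw");
 */
static memkind_t kind_from_info(MPI_Info info)
{
    char value[64];
    int  flag = 0;

    value[0] = '\0';
    if (info != MPI_INFO_NULL) {
        MPI_Info_get(info, "memkind_kind", sizeof(value) - 1, value, &flag);
    }
    if (flag && strcmp(value, "hbw") == 0) {
        return MEMKIND_HBW_PREFERRED;  /* fall back to DDR if no HBW memory */
    }
    return MEMKIND_DEFAULT;
}

int MPI_Alloc_mem(MPI_Aint size, MPI_Info info, void *baseptr)
{
    void *p = memkind_malloc(kind_from_info(info), (size_t) size);

    if (p == NULL) {
        /* Let the real implementation try (and report errors) instead. */
        return PMPI_Alloc_mem(size, info, baseptr);
    }
    *(void **) baseptr = p;
    return MPI_SUCCESS;
}

int MPI_Free_mem(void *base)
{
    /* Recent memkind versions detect the kind when NULL is passed.  Memory
     * that fell through to PMPI_Alloc_mem above would need extra
     * bookkeeping, which this sketch omits.                             */
    memkind_free(NULL, base);
    return MPI_SUCCESS;
}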

[OMPI users] Follow-up to Open MPI SC'16 BOF

2016-11-22 Thread Howard Pritchard
Hello Folks,

This is a followup to the question posed at the SC’16 Open MPI BOF:  Would
the community prefer to have a v2.2.x limited feature but backwards
compatible release sometime in 2017, or would the community prefer a v3.x
(not backwards compatible but potentially more features) sometime in late
2017 to early 2018?

BOF attendees expressed an interest in having a list of features that might
make it in to v2.2.x and ones that the Open MPI developers think would be
too hard to back port from the development branch (master) to a v2.2.x
release stream.

Here are the requested lists:

Features that we anticipate we could port to a v2.2.x release

   1. Improved collective performance (a new “tuned” module)
   2. Enable Linux CMA shared memory support by default
   3. PMIx 3.0 (If new functionality were to be used in this release of
   Open MPI)

Features that we anticipate would be too difficult to port to a v2.2.x
release

   1. Revamped CUDA support
   2. MPI_ALLOC_MEM integration with memkind
   3. OpenMP affinity/placement integration
   4. THREAD_MULTIPLE improvements to MTLs (not so clear on the level of
   difficult for this one)

You can register your opinion on whether to go with a v2.2.x release next
year or to go from v2.1.x to v3.x in late 2017 or early 2018 at the link
below:

https://www.open-mpi.org/sc16/

Thanks very much,

Howard

-- 

Howard Pritchard

HPC-DES

Los Alamos National Laboratory
___
users mailing list
users@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/users

Re: [OMPI users] ScaLapack tester fails with 2.0.1, works with 1.10.4; Intel Omni-Path

2016-11-18 Thread Howard Pritchard
Hi Christof,

Thanks for trying out 2.0.1.  Sorry that you're hitting problems.
Could you try to run the tests using the 'ob1' PML in order to
bypass PSM2?

mpirun --mca pml ob1 (all the rest of the args)

and see if you still observe the failures?
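Spelled out against the two-node invocation quoted below, that would be, for
example:

  mpirun --mca pml ob1 -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS \
      -mca oob_tcp_if_include eth0,team0 \
      -host node009,node010,node009,node010 ./xdsyevr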

Howard


2016-11-18 9:32 GMT-07:00 Christof Köhler <
christof.koeh...@bccms.uni-bremen.de>:

> Hello everybody,
>
> I am observing failures in the xdsyevr (and xssyevr) ScaLapack self tests
> when running on one or two nodes with OpenMPI 2.0.1. With 1.10.4 no
> failures are observed. Also, with mvapich2 2.2 no failures are observed.
> The other testers appear to be working with all MPIs mentioned (have to
> triple check again). I somehow overlooked the failures below at first.
>
> The system is an Intel OmniPath system (newest Intel driver release 10.2),
> i.e. we are using the PSM2
> mtl I believe.
>
> I built the OpenMPIs with gcc 6.2 and the following identical options:
> ./configure  FFLAGS="-O1" CFLAGS="-O1" FCFLAGS="-O1" CXXFLAGS="-O1"
> --with-psm2 --with-tm --with-hwloc=internal --enable-static
> --enable-orterun-prefix-by-default
>
> The ScaLapack build is also with gcc 6.2, openblas 0.2.19 and using "-O1
> -g" as FCFLAGS and CCFLAGS identical for all tests, only wrapper compiler
> changes.
>
> With OpenMPI 1.10.4 I see on a single node
>
>  mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
> ./xdsyevr
> 136 tests completed and passed residual checks.
> 0 tests completed without checking.
> 0 tests skipped for lack of memory.
> 0 tests completed and failed.
>
> With OpenMPI 1.10.4 I see on two nodes
>
> mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
> ./xdsyevr
>   136 tests completed and passed residual checks.
> 0 tests completed without checking.
> 0 tests skipped for lack of memory.
> 0 tests completed and failed.
>
> With OpenMPI 2.0.1 I see on a single node
>
> mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> oob_tcp_if_include eth0,team0 -host node009,node009,node009,node009
> ./xdsyevr
> 32 tests completed and passed residual checks.
> 0 tests completed without checking.
> 0 tests skipped for lack of memory.
>   104 tests completed and failed.
>
> With OpenMPI 2.0.1 I see on two nodes
>
> mpirun -n 4 -x PATH -x LD_LIBRARY_PATH -x OMP_NUM_THREADS -mca
> oob_tcp_if_include eth0,team0 -host node009,node010,node009,node010
> ./xdsyevr
>32 tests completed and passed residual checks.
> 0 tests completed without checking.
> 0 tests skipped for lack of memory.
>   104 tests completed and failed.
>
> A typical failure looks like this in the output
>
> IL, IU, VL or VU altered by PDSYEVR
>500   1   1   1   8   Y 0.26-1.00  0.19E-02   15. FAILED
>500   1   2   1   8   Y 0.29-1.00  0.79E-03   3.9 PASSED
>  EVR
> IL, IU, VL or VU altered by PDSYEVR
>500   1   1   2   8   Y 0.52-1.00  0.82E-03   2.5 FAILED
>500   1   2   2   8   Y 0.41-1.00  0.79E-03   2.3 PASSED
>  EVR
>500   2   2   2   8   Y 0.18-1.00  0.78E-03   3.0 PASSED
>  EVR
> IL, IU, VL or VU altered by PDSYEVR
>500   4   1   4   8   Y 0.09-1.00  0.95E-03   4.1 FAILED
>500   4   4   1   8   Y 0.11-1.00  0.91E-03   2.8 PASSED
>  EVR
>
>
> The variable OMP_NUM_THREADS=1 to stop the openblas from threading.
> We see similar problems with intel 2016 compilers, but I believe gcc is a
> good baseline.
>
> Any ideas ? For us this is a real problem in that we do not know if this
> indicates a network (transport) issue in the intel software stack (libpsm2,
> hfi1 kernel module) which might affect our production codes or if this is
> an OpenMPI issue. We have some other problems I might ask about later on
> this list, but nothing which yields such a nice reproducer and especially
> these other problems might well be application related.
>
> Best Regards
>
> Christof
>
> --
> Dr. rer. nat. Christof Köhler   email: c.koeh...@bccms.uni-bremen.de
> Universitaet Bremen/ BCCMS  phone:  +49-(0)421-218-62334
> Am Fallturm 1/ TAB/ Raum 3.12   fax: +49-(0)421-218-62770
> 28359 Bremen
>
> PGP: http://www.bccms.uni-bremen.de/cms/people/c_koehler/
>

Re: [OMPI users] How to verify RDMA traffic (RoCE) is being sent over a fabric when running OpenMPI

2016-11-08 Thread Howard Pritchard
HI Brenda,

I should clarify as my response may confuse folks.  We had configured the
connectx4 cards to use
ethernet/RoCE rather than IB transport for these measurements.

Howard


2016-11-08 16:08 GMT-07:00 Howard Pritchard <hpprit...@gmail.com>:

> Hi Brenda,
>
> What type of ethernet device (is this a Mellanox HCA?) and ethernet switch
> are you using?  The mpirun configure
> options look correct to me.  Is it possible that you have all the mpi
> processes on a single node?
> It should be pretty obvious from the SendRecv IMB test if you're using
> RoCE.  The large message
> bandwidth will be much better than if you are going through the tcp btl.
>
> If you're using Mellanox cards, you might want to do a sanity check using
> the MXM libraries.
> You'd want to set MXM_TLS env. variable to "self,shm,rc".   We got close
> to 90 Gb/sec bandwidth using Connect X-4
> + MXM MTL on a cluster earlier this year.
>
> Howard
>
>
>
> 2016-11-08 15:15 GMT-07:00 Brendan Myers <brendan.my...@soft-forge.com>:
>
>> Hello,
>>
>> I am trying to figure out how I can verify that the OpenMPI traffic is
>> actually being transmitted over my RoCE fabric connecting my cluster.  My
>> MPI job runs quickly and error free but I cannot seem to verify that
>> significant amounts of data is being transferred to the other endpoint in
>> my RoCE fabric.  I am able to see what I believe to be the oob data when I
>> remove the oob exclusion from my command when I analyze my RoCE interface
>> using the tools listed below.
>>
>> Software:
>>
>> · CentOS 7.2
>>
>> · Open MPI 2.0.1
>>
>> Command:
>>
>> · mpirun   --mca btl openib,self,sm --mca oob_tcp_if_exclude
>> eth3 --mca btl_openib_receive_queues P,65536,120,64,32 --mca
>> btl_openib_cpc_include rdmacm -np 4 -hostfile mpi-hosts-ce
>> /usr/local/bin/IMB-MPI1
>>
>> o   Eth3 is my RoCE interface
>>
>> o   The 2 nodes involved RoCE interfaces are defined in my mpi-hosts-ce
>> file
>>
>> Ways I have looked to verify data transference:
>>
>> · Through the port counters on my RoCE switch
>>
>> o   Sees data being sent when using ib_write_bw but not when using Open
>> MPI
>>
>> · Through ibdump
>>
>> o   Sees data being sent when using ib_write_bw but not when using Open
>> MPI
>>
>> · Through Wireshark
>>
>> o   Sees data being sent when using ib_write_bw but not when using Open
>> MPI
>>
>>
>>
>> I do not have much experience with Open MPI and apologize if I have left
>> out necessary information.  I will respond with any data requested.  I
>> appreciate the time spent to read and respond to this.
>>
>>
>>
>>
>>
>> Thank you,
>>
>>
>>
>> Brendan T. W. Myers
>>
>> brendan.my...@soft-forge.com
>>
>> Software Forge Inc
>>
>>
>>

Re: [OMPI users] How to verify RDMA traffic (RoCE) is being sent over a fabric when running OpenMPI

2016-11-08 Thread Howard Pritchard
Hi Brenda,

What type of ethernet device (is this a Mellanox HCA?) and ethernet switch
are you using?  The mpirun configure
options look correct to me.  Is it possible that you have all the mpi
processes on a single node?
It should be pretty obvious from the SendRecv IMB test if you're using
RoCE.  The large message
bandwidth will be much better than if you are going through the tcp btl.
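
For example (a sketch reusing the command from the original post), comparing
the default openib run against a TCP-only baseline makes the difference easy
to see:

  # RoCE path (openib BTL)
  mpirun --mca btl openib,self,sm -np 4 -hostfile mpi-hosts-ce /usr/local/bin/IMB-MPI1 SendRecv
  # TCP baseline for comparison
  mpirun --mca btl tcp,self,sm -np 4 -hostfile mpi-hosts-ce /usr/local/bin/IMB-MPI1 SendRecv

At the largest message sizes the RoCE run should report several times the
bandwidth of the TCP run if the openib BTL is really carrying the traffic.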

If you're using Mellanox cards, you might want to do a sanity check using
the MXM libraries.
You'd want to set MXM_TLS env. variable to "self,shm,rc".   We got close to
90 Gb/sec bandwidth using Connect X-4
+ MXM MTL on a cluster earlier this year.
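
A sketch of that sanity check, assuming your Open MPI build includes the MXM
MTL:

  export MXM_TLS=self,shm,rc
  mpirun --mca pml cm --mca mtl mxm -np 4 -hostfile mpi-hosts-ce /usr/local/bin/IMB-MPI1 SendRecv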

Howard



2016-11-08 15:15 GMT-07:00 Brendan Myers :

> Hello,
>
> I am trying to figure out how I can verify that the OpenMPI traffic is
> actually being transmitted over my RoCE fabric connecting my cluster.  My
> MPI job runs quickly and error free but I cannot seem to verify that
> significant amounts of data is being transferred to the other endpoint in
> my RoCE fabric.  I am able to see what I believe to be the oob data when I
> remove the oob exclusion from my command when I analyze my RoCE interface
> using the tools listed below.
>
> Software:
>
> · CentOS 7.2
>
> · Open MPI 2.0.1
>
> Command:
>
> · mpirun   --mca btl openib,self,sm --mca oob_tcp_if_exclude eth3
> --mca btl_openib_receive_queues P,65536,120,64,32 --mca
> btl_openib_cpc_include rdmacm -np 4 -hostfile mpi-hosts-ce
> /usr/local/bin/IMB-MPI1
>
> o   Eth3 is my RoCE interface
>
> o   The 2 nodes involved RoCE interfaces are defined in my mpi-hosts-ce
> file
>
> Ways I have looked to verify data transference:
>
> · Through the port counters on my RoCE switch
>
> o   Sees data being sent when using ib_write_bw but not when using Open
> MPI
>
> · Through ibdump
>
> o   Sees data being sent when using ib_write_bw but not when using Open
> MPI
>
> · Through Wireshark
>
> o   Sees data being sent when using ib_write_bw but not when using Open
> MPI
>
>
>
> I do not have much experience with Open MPI and apologize if I have left
> out necessary information.  I will respond with any data requested.  I
> appreciate the time spent to read and respond to this.
>
>
>
>
>
> Thank you,
>
>
>
> Brendan T. W. Myers
>
> brendan.my...@soft-forge.com
>
> Software Forge Inc
>
>
>

Re: [OMPI users] how to tell if pmi or pmi2 is being used?

2016-10-13 Thread Howard Pritchard
HI David,

If you are using srun, you can
export OMPI_MCA_pmix_base_verbose=10
and there will be output to show which SLURM pmi library you are using.
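
For example (the exact messages vary by version, so just grep for the PMI
lines):

  export OMPI_MCA_pmix_base_verbose=10
  srun -n 2 ./a.out 2>&1 | grep -i pmi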

Howard


2016-10-13 12:55 GMT-06:00 David Shrader :

> That is really good to know. Thanks!
> David
>
>
> On 10/13/2016 12:27 PM, r...@open-mpi.org wrote:
>
>> If you are using mpirun, then neither PMI1 or PMI2 are involved at all.
>> ORTE has its own internal mechanism for handling wireup.
>>
>>
>> On Oct 13, 2016, at 10:43 AM, David Shrader  wrote:
>>>
>>> Hello All,
>>>
>>> I'm using Open MPI 1.10.3 with Slurm and would like to ask how do I find
>>> out if pmi1 or pmi2 was used for process launching? The Slurm installation
>>> is supposed to support both pmi1 and pmi2, but I would really like to know
>>> which one I fall in to. I tried using '-mca plm_base_verbose 100' on the
>>> mpirun line, but it didn't mention pmi specifically. Instead, all I could
>>> really find was that it was using the slurm component. Is there something
>>> else I can look at in the output that would have that detail?
>>>
>>> Thank you for your time,
>>> David
>>>
>>> --
>>> David Shrader
>>> HPC-ENV High Performance Computer Systems
>>> Los Alamos National Lab
>>> Email: dshrader  lanl.gov
>>>
>
> --
> David Shrader
> HPC-ENV High Performance Computer Systems
> Los Alamos National Lab
> Email: dshrader  lanl.gov
>

Re: [OMPI users] Regression: multiple memory regions in dynamic windows

2016-08-25 Thread Howard Pritchard
Hi Joseph,

Thanks for reporting this problem.

There's an issue now (#2012)
https://github.com/open-mpi/ompi/issues/2012

to track this.

Howard


2016-08-25 7:44 GMT-06:00 Christoph Niethammer :

> Hello,
>
> The error is not 100% reproducible for me every time, but it seems to
> disappear entirely if one excludes the rdma osc component
> -mca osc ^rdma
> or the openib btl component
> -mca btl ^openib
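
For example (an untested sketch; ./dynamic_win_test stands in for whatever
reproducer is being run):

  mpirun -np 2 --mca osc ^rdma ./dynamic_win_test
  mpirun -np 2 --mca btl ^openib ./dynamic_win_test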
>
> The error is present in 2.0.0 and also 2.0.1rc1.
>
> Best
> Christoph Niethammer
>
>
>
> - Original Message -
> From: "Joseph Schuchart" 
> To: users@lists.open-mpi.org
> Sent: Thursday, August 25, 2016 2:07:17 PM
> Subject: [OMPI users] Regression: multiple memory regions in dynamic
> windows
>
> All,
>
> It seems there is a regression in the handling of dynamic windows
> between Open MPI 1.10.3 and 2.0.0. I am attaching a test case that works
> fine with Open MPI 1.8.3 and fail with version 2.0.0 with the following
> output:
>
> ===
> [0] MPI_Get 0 -> 3200 on first memory region
> [cl3fr1:7342] *** An error occurred in MPI_Get
> [cl3fr1:7342] *** reported by process [908197889,0]
> [cl3fr1:7342] *** on win rdma window 3
> [cl3fr1:7342] *** MPI_ERR_RMA_RANGE: invalid RMA address range
> [cl3fr1:7342] *** MPI_ERRORS_ARE_FATAL (processes in this win will now
> abort,
> [cl3fr1:7342] ***and potentially your MPI job)
> ===
>
> Expected output is:
> ===
> [0] MPI_Get 0 -> 100 on first memory region:
> [0] Done.
> [0] MPI_Get 0 -> 100 on second memory region:
> [0] Done.
> ===
>
> The code allocates a dynamic window and attaches two memory regions to
> it before accessing both memory regions using MPI_Get. With Open MPI
> 2.0.0, only access to the both memory regions fails. Access to the first
> memory region only succeeds if the second memory region is not attached.
> With Open MPI 1.10.3, all MPI operations succeed.
>
> Please let me know if you need any additional information or think that
> my code example is not standard compliant.
>
> Best regards
> Joseph
>
>
> --
> Dipl.-Inf. Joseph Schuchart
> High Performance Computing Center Stuttgart (HLRS)
> Nobelstr. 19
> D-70569 Stuttgart
>
> Tel.: +49(0)711-68565890
> Fax: +49(0)711-6856832
> E-Mail: schuch...@hlrs.de
>
>

Re: [OMPI users] Java-OpenMPI returns with SIGSEGV

2016-07-08 Thread Howard Pritchard
Hi Gundram

Could you configure without the disable dlopen option and retry?

Howard

Am Freitag, 8. Juli 2016 schrieb Gilles Gouaillardet :

> the JVM sets its own signal handlers, and it is important that openmpi does
> not override them.
> this is what previously happened with PSM (infinipath) but this has been
> solved since.
> you might be linking with a third-party library that hijacks signal
> handlers and causes the crash
> (which would explain why I cannot reproduce the issue)
>
> the master branch has a revamped memory patcher (compared to v2.x or
> v1.10), and that could have some bad interactions with the JVM, so you
> might also give v2.x a try
>
> Cheers,
>
> Gilles
>
> On Friday, July 8, 2016, Gundram Leifert  > wrote:
>
>> You made the best of it... thanks a lot!
>>
>> Whithout MPI it runs.
>> Just adding MPI.init() causes the crash!
>>
>> maybe I installed something wrong...
>>
>> install newest automake, autoconf, m4, libtoolize in right order and same
>> prefix
>> check out ompi,
>> autogen
>> configure with same prefix, pointing to the same jdk, I later use
>> make
>> make install
>>
>> I will test some different configurations of ./configure...
>>
>>
>> On 07/08/2016 01:40 PM, Gilles Gouaillardet wrote:
>>
>> I am running out of ideas ...
>>
>> what if you do not run within slurm ?
>> what if you do not use '-cp executor.jar'
>> or what if you configure without --disable-dlopen --disable-mca-dso ?
>>
>> if you
>> mpirun -np 1 ...
>> then MPI_Bcast and MPI_Barrier are basically no-ops, so it is really weird
>> that your program is still crashing. Another test is to comment out MPI_Bcast
>> and MPI_Barrier and try again with -np 1
>>
>> Cheers,
>>
>> Gilles
>>
>> On Friday, July 8, 2016, Gundram Leifert 
>> wrote:
>>
>>> In any cases the same error.
>>> this is my code:
>>>
>>> salloc -n 3
>>> export IPATH_NO_BACKTRACE
>>> ulimit -s 10240
>>> mpirun -np 3 java -cp executor.jar
>>> de.uros.citlab.executor.test.TestSendBigFiles2
>>>
>>>
>>> also for 1 or two cores, the process crashes.
>>>
>>>
>>> On 07/08/2016 12:32 PM, Gilles Gouaillardet wrote:
>>>
>>> you can try
>>> export IPATH_NO_BACKTRACE
>>> before invoking mpirun (that should not be needed though)
>>>
>>> Another test is to
>>> ulimit -s 10240
>>> before invoking mpirun.
>>>
>>> btw, do you use mpirun or srun ?
>>>
>>> can you reproduce the crash with 1 or 2 tasks ?
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On Friday, July 8, 2016, Gundram Leifert 
>>> wrote:
>>>
 Hello,

 configure:
 ./configure --enable-mpi-java
 --with-jdk-dir=/home/gl069/bin/jdk1.7.0_25 --disable-dlopen
 --disable-mca-dso


 1 node with 3 cores. I use SLURM to allocate one node. I changed --mem,
 but it has no effect.
 salloc -n 3


 core file size  (blocks, -c) 0
 data seg size   (kbytes, -d) unlimited
 scheduling priority (-e) 0
 file size   (blocks, -f) unlimited
 pending signals (-i) 256564
 max locked memory   (kbytes, -l) unlimited
 max memory size (kbytes, -m) unlimited
 open files  (-n) 10
 pipe size(512 bytes, -p) 8
 POSIX message queues (bytes, -q) 819200
 real-time priority  (-r) 0
 stack size  (kbytes, -s) unlimited
 cpu time   (seconds, -t) unlimited
 max user processes  (-u) 4096
 virtual memory  (kbytes, -v) unlimited
 file locks  (-x) unlimited

 uname -a
 Linux titan01.service 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31
 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

 cat /etc/system-release
 CentOS Linux release 7.2.1511 (Core)

 what else do you need?

 Cheers, Gundram

 On 07/07/2016 10:05 AM, Gilles Gouaillardet wrote:

 Gundram,


 can you please provide more information on your environment :

 - configure command line

 - OS

 - memory available

 - ulimit -a

 - number of nodes

 - number of tasks used

 - interconnect used (if any)

 - batch manager (if any)


 Cheers,


 Gilles
 On 7/7/2016 4:17 PM, Gundram Leifert wrote:

 Hello Gilles,

 I tried you code and it crashes after 3-15 iterations (see (1)). It is
 always the same error (only the "94" varies).

 Meanwhile I think Java and MPI use the same memory because when I
 delete the hash-call, the program runs sometimes more than 9k iterations.
 When it crashes, there are different lines (see (2) and (3)). The
 crashes also occurs on rank 0.

 # (1)#
 # Problematic frame:
 # J 94 C2 

Re: [OMPI users] problem with exceptions in Java interface

2016-05-24 Thread Howard Pritchard
Hi Siegmar,

Sorry for the delay, I seem to have missed this one.

It looks like there's an error in the way the native methods are processing
java exceptions.  The code correctly builds up an exception message for
cases where MPI 'c' returns non-success, but not if the problem occurred
in one of the JNI utilities.

Issue filed:
https://github.com/open-mpi/ompi/issues/1698


Thanks for reporting this.


Howard


2016-05-20 9:25 GMT-06:00 Siegmar Gross <
siegmar.gr...@informatik.hs-fulda.de>:

> Hi,
>
> I tried MPI.ERRORS_RETURN in a small Java program with Open MPI
> 1.10.2 and master. I get the expected behaviour, if I use a
> wrong value for the root process in "bcast". Unfortunately I
> get an MPI or Java error message if I try to broadcast more data
> than available. Is this intended or is it a problem in the Java
> interface of Open MPI? I would be grateful if somebody can answer
> my question.
>
> loki java 194 mpijavac Exception_1_Main.java
> loki java 195 mpijavac Exception_2_Main.java
>
> loki java 196 mpiexec -np 1 java Exception_1_Main
> Set error handler for MPI.COMM_WORLD to MPI.ERRORS_RETURN.
> Call "bcast" with wrong "root" process.
> Caught an exception.
> MPI_ERR_ROOT: invalid root
>
>
> loki java 197 mpiexec -np 1 java Exception_2_Main
> Set error handler for MPI.COMM_WORLD to MPI.ERRORS_RETURN.
> Call "bcast" with index out-of bounds.
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException
> at mpi.Comm.bcast(Native Method)
> at mpi.Comm.bcast(Comm.java:1231)
> at Exception_2_Main.main(Exception_2_Main.java:44)
> ---
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> ---
> --
> mpiexec detected that one or more processes exited with non-zero status,
> thus causing
> the job to be terminated. The first process to do so was:
>
>   Process name: [[38300,1],0]
>   Exit code:1
> --
> loki java 198
>
>
> Kind regards and thank you very much for any help in advance
>
> Siegmar
>


Re: [OMPI users] mpirun java

2016-05-23 Thread Howard Pritchard
HI Ralph,

Yep, If you could handle this that would be great.  I guess we'd like a fix
in 1.10.x and for 2.0.1
that would be great.

Howard


2016-05-23 14:59 GMT-06:00 Ralph Castain <r...@open-mpi.org>:

> Looks to me like there is a bug in the orterun parser that is trying to
> add java library paths - I can take a look at it
>
> On May 23, 2016, at 1:05 PM, Claudio Stamile <claudiostam...@gmail.com>
> wrote:
>
> Hi Howard.
>
> Thank you for your reply.
>
> I'm using version 1.10.2
>
> I executed the following command:
>
> mpirun -np 2 --mca odls_base_verbose 100 java -cp alot:of:jarfile
> -Djava.library.path=/Users/stamile/Applications/IBM/ILOG/CPLEX_Studio1263/cplex/bin/x86-64_osx
> clustering.TensorClusterinCplexMPI
>
>
> the output is:
>
> * Num procs: 2 FirstRank: 0 Recovery: DEFAULT Max Restarts: 0*
>
> *  Argv[0]: java*
>
> *  Argv[1]: -cp*
>
> *  Argv[2]:
> /Applications/Eclipse.app/Contents/MacOS:/Users/stamile/Documents/workspace_newJava/TensorFactorization/bin:/Users/stamile/Applications/IBM/ILOG/CPLEX_Studio1263/cplex/lib/cplex.jar:/Users/stamile/Downloads/commons-lang3-3.4/commons-lang3-3.4.jar:/Users/stamile/Downloads/Jama-1.0.3.jar:/Users/stamile/Downloads/hyperdrive-master/hyperdrive.jar:/usr/local/lib:/usr/local/lib/mpi.jar*
>
> *  Argv[3]:
> /Users/stamile/Applications/IBM/ILOG/CPLEX_Studio1263/cplex/bin/x86-64_osx*
>
> *  Argv[4]:
> -Djava.library.path=-Djava.library.path=/Users/stamile/Applications/IBM/ILOG/CPLEX_Studio1263/cplex/bin/x86-64_osx:/usr/local/lib*
>
> *  Argv[5]: clustering.TensorClusterinCplexMPI*
>
> *  Env[0]: OMPI_MCA_odls_base_verbose=100*
>
> *  Env[1]: OMPI_COMMAND=clustering.TensorClusterinCplexMPI*
>
> *  Env[2]:
> OMPI_MCA_orte_precondition_transports=e6a8891c458c267b-c079810b4abe7ebf*
>
> *  Env[3]: OMPI_MCA_orte_peer_modex_id=0*
>
> *  Env[4]: OMPI_MCA_orte_peer_init_barrier_id=1*
>
> *  Env[5]: OMPI_MCA_orte_peer_fini_barrier_id=2*
>
> *  Env[6]: TMPDIR=/var/folders/5t/6tqp003x4fn09fzgtx46tjdhgn/T/*
>
>
> Argv[4] looks strange. Indeed if I execute:
>
> mpirun -np 2 --mca odls_base_verbose 100 java -cp alot:of:jarfile
> clustering.TensorClusterinCplexMPI
> The same as before without
> ( 
> -Djava.library.path=/Users/stamile/Applications/IBM/ILOG/CPLEX_Studio1263/cplex/bin/x86-64_osx
>  )
> i obtain:
>
> *Argv[0]: java*
>
> *  Argv[1]: -Djava.library.path=/usr/local/lib*
>
> *  Argv[2]: -cp*
>
> *  Argv[3]:
> /Applications/Eclipse.app/Contents/MacOS:/Users/stamile/Documents/workspace_newJava/TensorFactorization/bin:/Users/stamile/Applications/IBM/ILOG/CPLEX_Studio1263/cplex/lib/cplex.jar:/Users/stamile/Downloads/commons-lang3-3.4/commons-lang3-3.4.jar:/Users/stamile/Downloads/Jama-1.0.3.jar:/Users/stamile/Downloads/hyperdrive-master/hyperdrive.jar:/usr/local/lib:/usr/local/lib/mpi.jar*
>
> *  Argv[4]: clustering.TensorClusterinCplexMPI*
>
> *  Env[0]: OMPI_MCA_odls_base_verbose=100*
>
> *  Env[1]: OMPI_COMMAND=clustering.TensorClusterinCplexMPI*
>
> *  Env[2]:
> OMPI_MCA_orte_precondition_transports=92248561306f2b2e-601ae65dc34a347c*
>
> *  Env[3]: OMPI_MCA_orte_peer_modex_id=0*
>
> *  Env[4]: OMPI_MCA_orte_peer_init_barrier_id=1*
>
> *  Env[5]: OMPI_MCA_orte_peer_fini_barrier_id=2*
>
> *  Env[6]: TMPDIR=/var/folders/5t/6tqp003x4fn09fzgtx46tjdhgn/T/*
>
> *  Env[7]: __CF_USER_TEXT_ENCODING=0x1F5:0x0:0x4*
>
>
> What do you think ?
>
> Best,
>
> Claudio
>
> 2016-05-23 19:38 GMT+02:00 Howard Pritchard <hpprit...@gmail.com>:
>
>> Hello Claudio,
>>
>> mpirun should be combining your java.library.path option with the one
>> needed to add
>> the Open MPI's java bindings as well.
>>
>> Which version of Open MPI are you using?
>>
>> Could you first try to compile the Ring.java code in ompi/examples and
>> run it with the
>> following additional mpirun parameter?
>>
>> mpirun -np 1 --mca odls_base_verbose 100 java Ring
>>
>> then try your application with the same "odls_base_verbose" mpirun option
>>
>> and post the output from the two runs to the mail list?
>>
>> I suspect there may be a bug with building the combined java.library.path
>> in the Open MPI code.
>>
>> Howard
>>
>>
>> 2016-05-23 9:47 GMT-06:00 Claudio Stamile <claudiostam...@gmail.com>:
>>
>>> Dear all,
>>>
>>> I'm using openmpi for Java.
>>> I've a problem when I try to use more option parameters in my java
>>> command. More in detail I run mpirun as follow:
>>>
>>> mpirun -n 5 java -cp path1:

Re: [OMPI users] mpirun java

2016-05-23 Thread Howard Pritchard
Hello Claudio,

mpirun should be combining your java.library.path option with the one
needed to add
Open MPI's Java bindings as well.

Which version of Open MPI are you using?

Could you first try to compile the Ring.java code in ompi/examples and run
it with the
following additional mpirun parameter?

mpirun -np 1 --mca odls_base_verbose 100 java Ring

then try your application with the same "odls_base_verbose" mpirun option

and post the output from the two runs to the mail list?

I suspect there may be a bug with building the combined java.library.path
in the Open MPI code.

Howard


2016-05-23 9:47 GMT-06:00 Claudio Stamile :

> Dear all,
>
> I'm using openmpi for Java.
> I've a problem when I try to use more option parameters in my java
> command. More in detail I run mpirun as follow:
>
> mpirun -n 5 java -cp path1:path2 -Djava.library.path=pathLibs
> classification.MyClass
>
> It seems that the option "-Djava.library.path" is ignored when I execute
> the command.
>
> Is it normal ?
>
> Do you know how to solve this problem ?
>
> Thank you.
>
> Best,
> Claudio
>
> --
> C.
>


Re: [OMPI users] libfabric verb provider for iWARP RNIC

2016-04-04 Thread Howard Pritchard
Hi Durga,

I'd suggest reposting this to the libfabric-users mail list.
You can join that list at
http://lists.openfabrics.org/mailman/listinfo/libfabric-users

I'd suggest including the output of config.log.  If you installed
OFED in a non-canonical location, you may need to give an explicit
path as an argument to the --enable-verbs configury option.
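
For example (a sketch; /opt/ofed is only a placeholder for wherever your verbs
stack is installed):

  ./configure --enable-verbs=/opt/ofed && make && make install
  fi_info | grep -A4 verbs    # the verbs provider should now be listed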

Note if you're trying to use libfabric with the Open MPI ofi
mtl, you will need to get literally the freshest version of
libfabric, either at github or the 1.3rc2 tarball at

http://www.openfabrics.org/downloads/ofi/

Good luck,

Howard


2016-04-02 13:41 GMT-06:00 dpchoudh . :

> Hello all
>
> My machine has 3 network cards:
>
> 1. Broadcom GbE (vanilla type, with some offload capability)
> 2. Chelsion S310 10Gb iWARP
> 3. Qlogic DDR 4X Infiniband.
>
> With this setup, I built libfabric like this:
>
> ./configure --enable-udp=auto --enable-gni=auto --enable-mxm=auto
> --enable-usnic=auto --enable-verbs=auto --enable-sockets=auto
> --enable-psm2=auto --enable-psm=auto && make && sudo make install
>
> However, in the built libfabric, I do not see a verb provider, which I'd
> expect for the iWARP card, at least.
>
> [durga@smallMPI libfabric]$ fi_info
> psm: psm
> version: 0.9
> type: FI_EP_RDM
> protocol: FI_PROTO_PSMX
> UDP: UDP-IP
> version: 1.0
> type: FI_EP_DGRAM
> protocol: FI_PROTO_UDP
> sockets: IP
> version: 1.0
> type: FI_EP_MSG
> protocol: FI_PROTO_SOCK_TCP
> sockets: IP
> version: 1.0
> type: FI_EP_DGRAM
> protocol: FI_PROTO_SOCK_TCP
> sockets: IP
> version: 1.0
> type: FI_EP_RDM
> protocol: FI_PROTO_SOCK_TCP
>
>
> Am I doing something wrong or misunderstanding how libfabric works?
>
> Thanks in advance
> Durga
>
> We learn from history that we never learn from history.
>


Re: [OMPI users] Java MPI Code for NAS Benchmarks

2016-03-11 Thread Howard Pritchard
Hello Saliya,

Sorry i did not see this email earlier.  There are a bunch of java test
codes including performance tests like used in the paper at

https://github.com/open-mpi/ompi-java-test

Howard


2016-02-27 23:01 GMT-07:00 Saliya Ekanayake :

> Hi,
>
> I see this paper from Oscar refers to a Java implementation of NAS
> benchmarks. Is this work publicly available (the code?)
>
> I've done some work in benchmarking NAS with some optimizations provided
> in our work (
> https://www.researchgate.net/publication/291695433_SPIDAL_Java_High_Performance_Data_Analytics_with_Java_and_MPI_on_Large_Multicore_HPC_Clusters)
> and would like to test out the work in the above paper as well.
>
> Thank you,
> Saliya
>
> --
> Saliya Ekanayake
> Ph.D. Candidate | Research Assistant
> School of Informatics and Computing | Digital Science Center
> Indiana University, Bloomington
> Cell 812-391-4914
> http://saliya.org
>


Re: [OMPI users] Issues Building Open MPI static with Intel Fortran 16

2016-01-22 Thread Howard Pritchard
HI Matt,

If you don't need oshmem, you could try again with --disable-oshmem added
to the config line
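
In other words, something along these lines (a sketch based on your original
configure invocation):

  ./configure --disable-shared --enable-static --disable-oshmem --disable-wrapper-rpath \
      CC=gcc CXX=g++ FC=ifort CFLAGS='-fPIC -m64' CXXFLAGS='-fPIC -m64' FCFLAGS='-fPIC -m64' \
      --prefix=/ford1/share/gmao_SIteam/MPI/openmpi-1.10.2-ifort-16.0.0.109-static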

Howard


2016-01-22 12:15 GMT-07:00 Matt Thompson :

> All,
>
> I'm trying to duplicate an issue I had with ESMF long ago (not sure if I
> reported it here or at ESMF, but...). It had been a while, so I started
> from scratch. I first built Open MPI 1.10.2 with Intel Fortran 16.0.0.109
> and my system GCC (4.8.5 from RHEL7) with mostly defaults:
>
> # ./configure --disable-wrapper-rpath CC=gcc CXX=g++ FC=ifort \
> #CFLAGS='-fPIC -m64' CXXFLAGS='-fPIC -m64' FCFLAGS='-fPIC -m64' \
> #
> --prefix=/ford1/share/gmao_SIteam/MPI/openmpi-1.10.2-ifort-16.0.0.109-shared
> | & tee configure.intel16.0.0.109-shared.log
>
> This built and checked just fine. Huzzah! And, indeed, it died in ESMF
> during a link in an odd way (ESMF is looking at it).
>
> As a thought, I decided to see if building Open MPI statically might help
> or not. So, I tried to build Open MPI with:
>
> # ./configure --disable-shared --enable-static --disable-wrapper-rpath
> CC=gcc CXX=g++ FC=ifort \
> #CFLAGS='-fPIC -m64' CXXFLAGS='-fPIC -m64' FCFLAGS='-fPIC -m64' \
> #
> --prefix=/ford1/share/gmao_SIteam/MPI/openmpi-1.10.2-ifort-16.0.0.109-static
> | & tee configure.intel16.0.0.109-static.log
>
> I just added --disable-shared --enable-static being lazy. But, when I do
> this, I get this (when built with make V=1):
>
> Making all in tools/oshmem_info
> make[2]: Entering directory
> `/ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/oshmem/tools/oshmem_info'
> /bin/sh ../../../libtool  --tag=CC   --mode=link gcc -std=gnu99  -O3
> -DNDEBUG -fPIC -m64 -finline-functions -fno-strict-aliasing -pthread   -o
> oshmem_info oshmem_info.o param.o ../../../ompi/libmpi.la ../../../oshmem/
> liboshmem.la ../../../orte/libopen-rte.la ../../../opal/libopen-pal.la
> -lrt -lm -lutil
> libtool: link: gcc -std=gnu99 -O3 -DNDEBUG -fPIC -m64 -finline-functions
> -fno-strict-aliasing -pthread -o oshmem_info oshmem_info.o param.o
>  ../../../ompi/.libs/libmpi.a ../../../oshmem/.libs/liboshmem.a
> /ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/ompi/.libs/libmpi.a
> -libverbs
> /ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/orte/.libs/libopen-rte.a
> ../../../orte/.libs/libopen-rte.a
> /ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/opal/.libs/libopen-pal.a
> ../../../opal/.libs/libopen-pal.a -lnuma -ldl -lrt -lm -lutil -pthread
> /usr/bin/ld: ../../../oshmem/.libs/liboshmem.a(memheap_base_static.o):
> undefined reference to symbol '_end'
> /usr/bin/ld: note: '_end' is defined in DSO /lib64/libnl-route-3.so.200 so
> try adding it to the linker command line
> /lib64/libnl-route-3.so.200: could not read symbols: Invalid operation
> collect2: error: ld returned 1 exit status
> make[2]: *** [oshmem_info] Error 1
> make[2]: Leaving directory
> `/ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/oshmem/tools/oshmem_info'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory
> `/ford1/share/gmao_SIteam/MPI/src/openmpi-1.10.2/oshmem'
> make: *** [all-recursive] Error 1
>
> So, what did I do wrong? Or is there something I need to add to the
> configure line? I have built static versions of Open MPI in the past (say
> 1.8.7 era with Intel Fortran 15), but this is a new OS (RHEL 7 instead of
> 6) so I can see issues possible.
>
> Anyone seen this before? As I said, the "usual" build way is just fine.
> Perhaps I need an extra RPM that isn't installed? I do have libnl-devel
> installed.
>
> --
> Matt Thompson
>
> Man Among Men
> Fulcrum of History
>
>


Re: [OMPI users] How to allocate more memory to java OpenMPI

2016-01-19 Thread Howard Pritchard
HI Ibrahim,

Are you using a 32bit or 64bit JVM?

I don't think this is an Open MPI issue, but likely something owing to your
app or your java setup.
You may want to checkout

http://javaeesupportpatterns.blogspot.com/2012/09/outofmemoryerror-unable-to-create-new.html

If you'd like to post the java code to the list, I can try it out on some
of the servers I use.
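
In the meantime, a few quick checks that are often relevant to "unable to
create new native thread" (suggestions only; that error is usually a per-user
process/thread limit or a 32-bit JVM rather than heap exhaustion):

  java -version    # look for "64-Bit Server VM"
  ulimit -u        # max user processes; each Java thread counts against this
  mpirun -np 20 java -Xss512k -Xmx1g Multiplikation    # smaller per-thread stacks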

Howard


2016-01-19 8:03 GMT-07:00 Ibrahim Ikhlawi :

>
> Hallo,
>
> I'm working with Java Open MPI on a server with 64 GB of memory. But when I
> run the Java class I can only run it with up to 15 processes (with this
> command: mpirun -np 15 java Multiplikation). Although there is 64 GB of memory,
> only about 3 GB is used (I can see that with the top command; the first two
> lines are below). When I run more than 15 processes I get this error:
>
> Error occurred during initialization of VM
> java.lang.OutOfMemoryError: unable to create new native thread
>
>
> But I want to run it with more than 15 processes and use more than 3 GB. In
> addition, after searching on Google I have tried to run it with this
> command:
>
> mpirun -np 20 java -Xmx2096M -Xms1048M Multiplikation
>
> but I still get the same error.
>
> My question: how can I give Java more memory, so that I can run my program
> with more than 15 processes and use more than 3 GB of memory?
>
> thanks in advance
> Ibrahim
>
> PS:
> It may help, these are the first two lines from the top command:
>
> PID   PRI  VIRTRESSHR   S  CPU%  MEM%
> 23255   20   0 20.7G  103M 11916 S  2.0   0.2  0:52.14 java
> 23559   20   0 20.7G 33772 11916 S  1.0   0.1  0:50.82 java
>


Re: [OMPI users] problem with execstack and openmpi-v1.10.1-140-g31ff573

2016-01-14 Thread Howard Pritchard
HI Sigmar,

Would you mind posting your MsgSendRecvMain to the mail list?  I'd like to
see if I can
reproduce it on my linux box.

Thanks,

Howard




2016-01-14 7:30 GMT-07:00 Siegmar Gross <
siegmar.gr...@informatik.hs-fulda.de>:

> Hi,
>
> I've successfully built openmpi-v1.10.1-140-g31ff573 on my machine
> (SUSE Linux Enterprise Server 12.0 x86_64) with gcc-5.2.0 and
> Sun C 5.13. Unfortunately I get warnings if I use my cc version
> running a Java program, although I added "-z noexecstack" to
> CFLAGS. I used the following commands to build the package.
>
>
> mkdir openmpi-v1.10.1-140-g31ff573-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
> cd openmpi-v1.10.1-140-g31ff573-${SYSTEM_ENV}.${MACHINE_ENV}.64_cc
>
> ../openmpi-v1.10.1-140-g31ff573/configure \
>   --prefix=/usr/local/openmpi-1.10.2_64_cc \
>   --libdir=/usr/local/openmpi-1.10.2_64_cc/lib64 \
>   --with-jdk-bindir=/usr/local/jdk1.8.0_66/bin \
>   --with-jdk-headers=/usr/local/jdk1.8.0_66/include \
>   JAVA_HOME=/usr/local/jdk1.8.0_66 \
>   LDFLAGS="-m64 -mt" \
>   CC="cc" CXX="CC" FC="f95" \
>   CFLAGS="-m64 -mt -z noexecstack" CXXFLAGS="-m64 -library=stlport4"
> FCFLAGS="-m64" \
>   CPP="cpp" CXXCPP="cpp" \
>   --enable-mpi-cxx \
>   --enable-cxx-exceptions \
>   --enable-mpi-java \
>   --enable-heterogeneous \
>   --enable-mpi-thread-multiple \
>   --with-hwloc=internal \
>   --without-verbs \
>   --with-wrapper-cflags="-m64 -mt" \
>   --with-wrapper-cxxflags="-m64 -library=stlport4" \
>   --with-wrapper-fcflags="-m64" \
>   --with-wrapper-ldflags="-mt" \
>   --enable-debug \
>   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>
> make |& tee log.make.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>
>
>
>
>
> loki java 115 ompi_info | egrep -e "Open MPI repo revision:" -e "C
> compiler absolute:"
>   Open MPI repo revision: v1.10.1-140-g31ff573
>  C compiler absolute: /opt/solstudio12.4/bin/cc
>
> loki java 116 mpiexec -np 4 --host loki --slot-list 0:0-5,1:0-5 java
> MsgSendRecvMain
> Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library
> /usr/local/openmpi-1.10.2_64_cc/lib64/libmpi_java.so.1.2.0 which might have
> disabled stack guard. The VM will try to fix the stack guard now.
> It's highly recommended that you fix the library with 'execstack -c
> ', or link it with '-z noexecstack'.
> Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library
> /usr/local/openmpi-1.10.2_64_cc/lib64/libmpi_java.so.1.2.0 which might have
> disabled stack guard. The VM will try to fix the stack guard now.
> It's highly recommended that you fix the library with 'execstack -c
> ', or link it with '-z noexecstack'.
> Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library
> /usr/local/openmpi-1.10.2_64_cc/lib64/libmpi_java.so.1.2.0 which might have
> disabled stack guard. The VM will try to fix the stack guard now.
> It's highly recommended that you fix the library with 'execstack -c
> ', or link it with '-z noexecstack'.
> Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library
> /usr/local/openmpi-1.10.2_64_cc/lib64/libmpi_java.so.1.2.0 which might have
> disabled stack guard. The VM will try to fix the stack guard now.
> It's highly recommended that you fix the library with 'execstack -c
> ', or link it with '-z noexecstack'.
>
> Now 3 processes are sending greetings.
>
> Greetings from process 1:
>   message tag:3
>   message length: 4
>   message:loki
> ...
>
>
> Does anybody know how I can get rid of the messages or can somebody
> fix the problem directly in the distribution? Please let me know if
> you need anything else. Thank you very much for any help in advance.
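
As the JVM hint itself suggests, one possible workaround (untested here) is to
clear the executable-stack flag on the installed library:

  execstack -q /usr/local/openmpi-1.10.2_64_cc/lib64/libmpi_java.so.1.2.0   # query the flag
  execstack -c /usr/local/openmpi-1.10.2_64_cc/lib64/libmpi_java.so.1.2.0   # clear it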
>
>
> Best regards
>
> Siegmar


Re: [OMPI users] RMA operations with java buffers

2016-01-13 Thread Howard Pritchard
Hi Marko,

You can probably find examples of what you'd like to do on github:

https://github.com/open-mpi/ompi-java-test

There are numerous MPI-2 RMA examples in the one-sided subdirectory.

If you've never used GitHub before, just click on the "Download ZIP"
button in the upper right-hand corner and you will not need to think
about GitHub again.
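
Or, from a shell:

  git clone https://github.com/open-mpi/ompi-java-test.git
  ls ompi-java-test/one-sided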

Hope this helps,

Howard


2016-01-13 9:04 GMT-07:00 Marko Blatzheim :

> Hello,
>
> I work with the java open mpi version and I want to send byte arrays with
> the mpi get function. The window provides a large buffer containing the
> array values and a single call of get should provide the process with a
> small part of that buffer but not necessarily starting at position 0 but at
> an arbitrary starting point. Is this possible with java or do I need to use
> put or even switch to the send/recv functions?
>
> Thanks for your help
>
> Marko
>


Re: [OMPI users] help understand unhelpful ORTE error message

2015-11-19 Thread Howard Pritchard
Hi Jeff,

I finally got an allocation on cori - its one busy machine.

Anyway, using the ompi i'd built on edison with the above recommended
configure options
I was able to run using either srun or mpirun on cori provided that in the
later case I used

mpirun -np X -N Y --mca plm slurm ./my_favorite_app

I will make an adjustment to the alps plm launcher to disqualify itself if
the wlm_detect
facility on the cray reports that srun is the launcher.  That's a minor fix
and should make
it in to v2.x in a week or so.  It will be a runtime selection so you only
have to build ompi
once for use either on edison or cori.

Howard


2015-11-19 17:11 GMT-07:00 Howard Pritchard <hpprit...@gmail.com>:

> Hi Jeff H.
>
> Why don't you just try configuring with
>
> ./configure --prefix=my_favorite_install_dir
> --with-libfabric=install_dir_for_libfabric
> make -j 8 install
>
> and see what happens?
>
> Make sure before you configure that you have PrgEnv-gnu or PrgEnv-intel
> module loaded.
>
> Those were the configure/compiler options I used to do testing of ofi mtl
> on cori.
>
> Jeff S. - this thread has gotten intermingled with mpich setup as well,
> hence
> the suggestion for the mpich shm mechanism.
>
>
> Howard
>
>
>
> 2015-11-19 16:59 GMT-07:00 Jeff Hammond <jeff.scie...@gmail.com>:
>
>>
>>> How did you configure for Cori?  You need to be using the slurm plm
>>> component for that system.  I know this sounds like gibberish.
>>>
>>>
>> ../configure --with-libfabric=$HOME/OFI/install-ofi-gcc-gni-cori \
>>  --enable-mca-static=mtl-ofi \
>>  --enable-mca-no-build=btl-openib,btl-vader,btl-ugni,btl-tcp \
>>  --enable-static --disable-shared --disable-dlopen \
>>  --prefix=$HOME/MPI/install-ompi-ofi-gcc-gni-xpmem-cori \
>>  --with-cray-pmi --with-alps --with-cray-xpmem --with-slurm \
>>  --without-verbs --without-fca --without-mxm --without-ucx \
>>  --without-portals4 --without-psm --without-psm2 \
>>  --without-udreg --without-ugni --without-munge \
>>  --without-sge --without-loadleveler --without-tm --without-lsf \
>>  --without-pvfs2 --without-plfs \
>>  --without-cuda --disable-oshmem \
>>  --disable-mpi-fortran --disable-oshmem-fortran \
>>  LDFLAGS="-L/opt/cray/ugni/default/lib64 -lugni \
>>   -L/opt/cray/alps/default/lib64 -lalps -lalpslli -lalpsutil \   
>>-ldl -lrt"
>>
>>
>> This is copied from
>> https://github.com/jeffhammond/HPCInfo/blob/master/ofi/README.md#open-mpi,
>> which I note in case you want to see what changes I've made at any point in
>> the future.
>>
>>
>>> There should be a with-slurm configure option to pick up this component.
>>>
>>> Indeed there is.
>>
>>
>>> Doesn't mpich have the option to use sysv memory?  You may want to try
>>> that
>>>
>>>
>> MPICH?  Look, I may have earned my way onto Santa's naughty list more
>> than a few times, but at least I have the decency not to post MPICH
>> questions to the Open-MPI list ;-)
>>
>> If there is a way to tell Open-MPI to use shm_open without filesystem
>> backing (if that is even possible) at configure time, I'd love to do that.
>>
>>
>>> Oh for tuning params you can use env variables.  For example lets say
>>> rather than using the gni provider in ofi mtl you want to try sockets. Then
>>> do
>>>
>>> Export OMPI_MCA_mtl_ofi_provider_include=sockets
>>>
>>>
>> Thanks.  I'm glad that there is an option to set them this way.
>>
>>
>>> In the spirit OMPI - may the force be with you.
>>>
>>>
>> All I will say here is that Open-MPI has a Vader BTL :-)
>>
>>>
>>> > On Thu 19.11.2015 09:44:20 Jeff Hammond wrote:
>>> > > I have no idea what this is trying to tell me. Help?
>>> > >
>>> > > jhammond@nid00024:~/MPI/qoit/collectives> mpirun -n 2 ./driver.x 64
>>> > > [nid00024:00482] [[46168,0],0] ORTE_ERROR_LOG: Not found in file
>>> > > ../../../../../orte/mca/plm/alps/plm_alps_module.c at line 418
>>> > >
>>> > > I can run the same job with srun without incident:
>>> > >
>>> > > jhammond@nid00024:~/MPI/qoit/collectives> srun -n 2 ./driver.x 64
>>> > > MPI was initialized.
>>> > >
>>> > > This is on the NERSC Cori Cray XC40 system. 

Re: [OMPI users] mpijavac doesn't compile any thing

2015-11-19 Thread Howard Pritchard
Hi Ibrahim,

If you just try to compile with javac directly, do you at least see an
"error: package mpi ... does not exist" message?
Adding the "-verbose" option may also help with diagnosing the problem.

If the javac doesn't get that far then your problem is with the java
install.
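
For example (a sketch; adjust the mpi.jar path to wherever your Open MPI is
installed):

  javac -verbose -cp /usr/local/lib/mpi.jar MyClass.java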

Howard



2015-11-19 6:45 GMT-07:00 Ibrahim Ikhlawi :

>
> Hello,
>
> thank you for answering.
>
> the command mpijavac --verbose Hello.java gives me the same result as
> yours.
> JAVA_HOME is set correctly for me, but I have neither JAVA_BINDIR nor
> JAVA_ROOT.
> I don't think those two variables cause the problem, because I was
> able to compile Hello.java three days ago without any problem, but now I
> can't.
>
> Ibrahim
>
>
> --
> Date: Wed, 18 Nov 2015 20:16:31 -0700
> From: hpprit...@gmail.com
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] mpijavac doesn't compile any thing
>
>
> Hello Ibrahim
>
> As a sanity check, could you try to compile the Hello.java in examples?
> mpijavac --verbose Hello.java
>
> you should see something like:
> /usr/bin/javac -cp
> /global/homes/h/hpp/ompi_install/lib/mpi.jar:/global/homes/h/hpp/ompi_install/lib/shmem.jar
> Hello.java
>
> You may also want to double check what your java env. variables, e.g.
> JAVA_HOME, JAVA_ROOT, and JAVA_BINDIR
> are set to.
> Howard
>
>
>
>
> --
>
> sent from my smart phonr so no good type.
>
> Howard
> On Nov 18, 2015 7:26 AM, "Ibrahim Ikhlawi" 
> wrote:
>
>
>
> Hello,
>
> I am trying to compile java classes with mpijavac, but it doesn't compile
> any class, for example:
> Usually when I write the following line (mpijavac MyClass.java) in the
> console, it compiles and gives me any errors (e.g. a missing
> semicolon) and the .class file will be created.
>
> But now when I compile any class with the same command (mpijavac
> AnyClass.java), it doesn't give me any error and the file AnyClass.class
> will not be created.
>
> What could be the problem?
>
> Thanks in advance
> Ibrahim
>
>


Re: [OMPI users] mpijavac doesn't compile any thing

2015-11-18 Thread Howard Pritchard
Hello Ibrahim

As a sanity check, could you try to compile the Hello.java in examples?

mpijavac --verbose Hello.java

you should see something like:

/usr/bin/javac -cp
/global/homes/h/hpp/ompi_install/lib/mpi.jar:/global/homes/h/hpp/ompi_install/lib/shmem.jar
Hello.java

You may also want to double check what your java env. variables, e.g.
JAVA_HOME, JAVA_ROOT, and JAVA_BINDIR

are set to.
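
For example:

  echo $JAVA_HOME $JAVA_ROOT $JAVA_BINDIR
  which java javac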

Howard



--

sent from my smart phonr so no good type.

Howard
On Nov 18, 2015 7:26 AM, "Ibrahim Ikhlawi"  wrote:

>
>
> Hello,
>
> I am trying to compile java classes with mpijavac, but it doesn't compile
> any class, for example:
> Usually when I write the following line (mpijavac MyClass.java) in the
> console, it compiles and gives me any errors (e.g. a missing
> semicolon) and the .class file will be created.
>
> But now when I compile any class with the same command (mpijavac
> AnyClass.java), it doesn't give me any error and the file AnyClass.class
> will not be created.
>
> What could be the problem?
>
> Thanks in advance
> Ibrahim
>
>


Re: [OMPI users] libfabric/usnic does not compile in 2.x

2015-09-30 Thread Howard Pritchard
Hi Marcin,


2015-09-30 9:19 GMT-06:00 marcin.krotkiewski <marcin.krotkiew...@gmail.com>:

> Thank you, and Jeff, for clarification.
>
> Before I bother you all more without the need, I should probably say I was
> hoping to use libfabric/OpenMPI on an InfiniBand cluster. Somehow now I
> feel I have confused this altogether, so maybe I should go one step back:
>
>  1. libfabric is hardware independent, and does support Infiniband, right?
>

The short answer is yes, libfabric is hardware independent (and does work, on
good days, on OS X as well as Linux).
The longer answer is that the providers (the plugins that interface libfabric
to different networks) have received varying amounts of implementation work.

There is a socket provider.  That gets a good amount of attention because
it's a base reference provider.
psm/psm2 providers are available.  I have used the psm provider some on a
truescale cluster.  It doesn't
offer better performance than just using psm directly, but it does appear
to work.

There is an mxm provider but it was not implemented by mellanox, and I
can't get it to compile on my
connectx3 system using mxm 1.5.

There is a vanilla verbs provider but it doesn't support FI_EP_RDM endpoint
type, which is used by
the non-cisco component of Open MPI (ofi mtl) which is available.

When you build and install libfabric, there should be an fi_info binary
installed in $(LIBFABRIC_INSTALL_DIR)/bin
On my truescale cluster the output is:

psm: psm

version: 0.9

type: FI_EP_RDM

protocol: FI_PROTO_PSMX

verbs: IB-0x80fe

version: 1.0

type: FI_EP_MSG

protocol: FI_PROTO_RDMA_CM_IB_RC

sockets: IP

version: 1.0

type: FI_EP_MSG

protocol: FI_PROTO_SOCK_TCP

sockets: IP

version: 1.0

type: FI_EP_DGRAM

protocol: FI_PROTO_SOCK_TCP

sockets: IP

version: 1.0

type: FI_EP_RDM

protocol: FI_PROTO_SOCK_TCP

In order to use the mtl/ofi, at a minimum a provider needs to support
FI_EP_RDM type (see above).  Note that on the truescale
cluster the verbs provider is built, but it only supports FI_EP_MSG
endpoint types.  So mtl/ofi can't use that.
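
When a provider with FI_EP_RDM support is present, the ofi mtl can be selected
explicitly, e.g. (a sketch):

  mpirun --mca pml cm --mca mtl ofi --mca mtl_ofi_provider_include psm -np 2 ./a.out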



>  2. I read that OpenMPI provides interface to libfabric through btl/usnic
> and mtl/ofi.  can any of those use libfabric on Infiniband networks?
>

if you have intel truescale or its follow-on then the answer is yes,
although the default is for Open MPI to use mtl/psm on that network.



>
> Please forgive my ignorance, the amount of different options is rather
> overwhelming..
>
> Marcin
>
>
>
> On 09/30/2015 04:26 PM, Howard Pritchard wrote:
>
> Hello Marcin
>
> What configure options are you using besides with-libfabric?
>
> Could you post your config.log file to the list?
>
> It looks like fi_ext_usnic.h is only installed if the usnic libfabric
> provider could be built.  When you configured libfabric, what providers were
> listed at the end of the configure run? Maybe attach the config.log from the
> libfabric build?
>
> If your cluster has cisco usnics you should probably be using
> libfabric/cisco openmpi.  If you are using intel omnipath you may want to
> try the ofi mtl.  It's not selected by default, however.
>
> Howard
>
> --
>
> sent from my smart phonr so no good type.
>
> Howard
> On Sep 30, 2015 5:35 AM, "Marcin Krotkiewski" <
> marcin.krotkiew...@gmail.com> wrote:
>
>> Hi,
>>
>> I am trying to compile the 2.x branch with libfabric support, but get
>> this error during configure:
>>
>> configure:100708: checking rdma/fi_ext_usnic.h presence
>> configure:100708: gcc -E
>> -I/cluster/software/VERSIONS/openmpi.gnu.2.x/include
>> -I/usit/abel/u1/marcink/software/ompi-release-2.x/opal/mca/hwloc/hwloc1110/hwloc/include
>> conftest.c
>> conftest.c:688:31: fatal error: rdma/fi_ext_usnic.h: No such file or
>> directory
>> [...]
>> configure:100708: checking for rdma/fi_ext_usnic.h
>> configure:100708: result: no
>> configure:101253: checking if MCA component btl:usnic can compile
>> configure:101255: result: no
>>
>> Which is correct - the file is not there. I have downloaded fresh
>> libfabric-1.1.0.tar.bz2 and it does not have this file. Probably OpenMPI
>> needs some updates?
>>
>> I am also wondering what is the state of libfabric support in OpenMPI
>> nowadays. I have seen recent (March) presentation about it, so it seems to
>> be an actively developed feature. Is this correct? It seemed from the
>> presentation that there are benefits to this approach, but is it mature
>> enough in OpenMPI, or it will yet take some time?
>>
>> Thanks!
>>
>> Marcin

Re: [OMPI users] libfabric/usnic does not compile in 2.x

2015-09-30 Thread Howard Pritchard
Hello Marcin

What configure options are you using besides with-libfabric?

Could you post your config.log file to the list?

It looks like fi_ext_usnic.h is only installed if the usnic libfabric
provider could be built.  When you configured libfabric, what providers were
listed at the end of the configure run? Maybe attach the config.log from the
libfabric build?

If your cluster has cisco usnics you should probably be using
libfabric/cisco openmpi.  If you are using intel omnipath you may want to
try the ofi mtl.  It's not selected by default, however.

Howard

--

sent from my smart phonr so no good type.

Howard
On Sep 30, 2015 5:35 AM, "Marcin Krotkiewski" 
wrote:

> Hi,
>
> I am trying to compile the 2.x branch with libfabric support, but get this
> error during configure:
>
> configure:100708: checking rdma/fi_ext_usnic.h presence
> configure:100708: gcc -E
> -I/cluster/software/VERSIONS/openmpi.gnu.2.x/include
> -I/usit/abel/u1/marcink/software/ompi-release-2.x/opal/mca/hwloc/hwloc1110/hwloc/include
> conftest.c
> conftest.c:688:31: fatal error: rdma/fi_ext_usnic.h: No such file or
> directory
> [...]
> configure:100708: checking for rdma/fi_ext_usnic.h
> configure:100708: result: no
> configure:101253: checking if MCA component btl:usnic can compile
> configure:101255: result: no
>
> Which is correct - the file is not there. I have downloaded fresh
> libfabric-1.1.0.tar.bz2 and it does not have this file. Probably OpenMPI
> needs some updates?
>
> I am also wondering what is the state of libfabric support in OpenMPI
> nowadays. I have seen recent (March) presentation about it, so it seems to
> be an actively developed feature. Is this correct? It seemed from the
> presentation that there are benefits to this approach, but is it mature
> enough in OpenMPI, or it will yet take some time?
>
> Thanks!
>
> Marcin


Re: [OMPI users] segfault on java binding from MPI.init()

2015-08-15 Thread Howard Pritchard
Gilles,

On hopper there aren't any psm libraries - it's an infiniband/infinipath-free
system - at least on the compute nodes.

For my own work, I never use things like the platform files, I just do
./configure --prefix=blahblah --enable-mpi-java (and whatever else I want
to test this time)

Thanks for the ideas though,

Howard


2015-08-14 19:20 GMT-06:00 Gilles Gouaillardet <
gilles.gouaillar...@gmail.com>:

> Howard,
>
> I have no infinipath hardware, but the infinipath libraries are installed.
> I tried to run with --mca mtl_psm_priority 0 instead of --mca mtl ^psm
> but that did not work.
> without psm mtl, I was unable to reproduce the persistent communication
> issue,
> so I concluded there was only one issue here.
>
> do you configure with --disable-dlopen on hopper ?
> I wonder whether --mca mtl ^psm is effective if dlopen is disabled
>
> Cheers,
>
> Gilles
>
> On Saturday, August 15, 2015, Howard Pritchard <hpprit...@gmail.com>
> wrote:
>
>> Hi Jeff,
>>
>> I don't know why Gilles keeps picking on the persistent request problem
>> and mixing
>> it up with this user bug.  I do think for this user the psm probably is
>> the problem.
>>
>>
>> They don't have anything to do with each other.
>>
>> I can reproduce the persistent request problem on hopper consistently.
>> As I said
>> on the telecon last week it has something to do with memory corruption
>> with the
>> receive buffer that is associated with the persistent request.
>>
>> Howard
>>
>>
>> 2015-08-14 11:21 GMT-06:00 Jeff Squyres (jsquyres) <jsquy...@cisco.com>:
>>
>>> Hmm.  Oscar's not around to ask any more, but I'd be greatly surprised
>>> if he had InfiniPath on his systems where he ran into this segv issue...?
>>>
>>>
>>> > On Aug 14, 2015, at 1:08 PM, Howard Pritchard <hpprit...@gmail.com>
>>> wrote:
>>> >
>>> > Hi Gilles,
>>> >
>>> > Good catch!  Nate, we hadn't been testing on an infinipath system.
>>> >
>>> > Howard
>>> >
>>> >
>>> > 2015-08-14 0:20 GMT-06:00 Gilles Gouaillardet <gil...@rist.or.jp>:
>>> > Nate,
>>> >
>>> > i could get rid of the problem by not using the psm mtl.
>>> > the infinipath library (used by the psm mtl) sets some signal handlers
>>> that conflict with the JVM
>>> > that can be seen by running
>>> > mpirun -np 1 java -Xcheck:jni MPITestBroke data/
>>> >
>>> > so instead of running
>>> > mpirun -np 1 java MPITestBroke data/
>>> > please run
>>> > mpirun --mca mtl ^psm -np 1 java MPITestBroke data/
>>> >
>>> > that solved the issue for me
>>> >
>>> > Cheers,
>>> >
>>> > Gilles
>>> >
>>> > On 8/13/2015 9:19 AM, Nate Chambers wrote:
>>> >> I appreciate you trying to help! I put the Java and its compiled
>>> .class file on Dropbox. The directory contains the .java and .class files,
>>> as well as a data/ directory:
>>> >>
>>> >>
>>> http://www.dropbox.com/sh/pds5c5wecfpb2wk/AAAcz17UTDQErmrUqp2SPjpqa?dl=0
>>> >>
>>> >> You can run it with and without MPI:
>>> >>
>>> >> >  java MPITestBroke data/
>>> >> >  mpirun -np 1 java MPITestBroke data/
>>> >>
>>> >> Attached is a text file of what I see when I run it with mpirun and
>>> your debug flag. Lots of debug lines.
>>> >>
>>> >>
>>> >> Nate
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> On Wed, Aug 12, 2015 at 11:09 AM, Howard Pritchard <
>>> hpprit...@gmail.com> wrote:
>>> >> Hi Nate,
>>> >>
>>> >> Sorry for the delay in getting back to you.
>>> >> We're somewhat stuck on how to help you, but here are two suggestions.
>>> >>
>>> >> Could you add the following to your launch command line
>>> >>
>>> >> --mca odls_base_verbose 100
>>> >>
>>> >> so we can see exactly what arguments are being feed to java when
>>> launching
>>> >> your app.
>>> >>
>>> >> Also, if you could put your MPITestBroke.class file somewhere (like
>>> google drive)
>>> >> where we could get it and try to run locally or at NERSC, that migh

Re: [OMPI users] segfault on java binding from MPI.init()

2015-08-14 Thread Howard Pritchard
Hi Jeff,

I don't know why Gilles keeps picking on the persistent request problem and
mixing
it up with this user bug.  I do think for this user the psm probably is the
problem.


They don't have anything to do with each other.

I can reproduce the persistent request problem on hopper consistently.  As
I said
on the telecon last week it has something to do with memory corruption with
the
receive buffer that is associated with the persistent request.

Howard


2015-08-14 11:21 GMT-06:00 Jeff Squyres (jsquyres) <jsquy...@cisco.com>:

> Hmm.  Oscar's not around to ask any more, but I'd be greatly surprised if
> he had InfiniPath on his systems where he ran into this segv issue...?
>
>
> > On Aug 14, 2015, at 1:08 PM, Howard Pritchard <hpprit...@gmail.com>
> wrote:
> >
> > Hi Gilles,
> >
> > Good catch!  Nate, we hadn't been testing on an InfiniPath system.
> >
> > Howard
> >
> >
> > 2015-08-14 0:20 GMT-06:00 Gilles Gouaillardet <gil...@rist.or.jp>:
> > Nate,
> >
> > I could get rid of the problem by not using the psm mtl.
> > The InfiniPath library (used by the psm mtl) sets some signal handlers
> > that conflict with the JVM; this can be seen by running
> > mpirun -np 1 java -Xcheck:jni MPITestBroke data/
> >
> > so instead of running
> > mpirun -np 1 java MPITestBroke data/
> > please run
> > mpirun --mca mtl ^psm -np 1 java MPITestBroke data/
> >
> > that solved the issue for me
> >
> > Cheers,
> >
> > Gilles
> >
> > On 8/13/2015 9:19 AM, Nate Chambers wrote:
> >> I appreciate you trying to help! I put the Java and its compiled .class
> file on Dropbox. The directory contains the .java and .class files, as well
> as a data/ directory:
> >>
> >>
> http://www.dropbox.com/sh/pds5c5wecfpb2wk/AAAcz17UTDQErmrUqp2SPjpqa?dl=0
> >>
> >> You can run it with and without MPI:
> >>
> >> >  java MPITestBroke data/
> >> >  mpirun -np 1 java MPITestBroke data/
> >>
> >> Attached is a text file of what I see when I run it with mpirun and
> your debug flag. Lots of debug lines.
> >>
> >>
> >> Nate
> >>
> >>
> >>
> >>
> >>
> >> On Wed, Aug 12, 2015 at 11:09 AM, Howard Pritchard <hpprit...@gmail.com>
> wrote:
> >> Hi Nate,
> >>
> >> Sorry for the delay in getting back to you.
> >> We're somewhat stuck on how to help you, but here are two suggestions.
> >>
> >> Could you add the following to your launch command line
> >>
> >> --mca odls_base_verbose 100
> >>
> >> so we can see exactly what arguments are being fed to java when
> >> launching
> >> your app.
> >>
> >> Also, if you could put your MPITestBroke.class file somewhere (like
> google drive)
> >> where we could get it and try to run locally or at NERSC, that might
> help us
> >> narrow down the problem.  Better yet, if you have the class or jar
> file for
> >> the entire app plus some data sets, we could try that out as well.
> >>
> >> All the config outputs, etc. you've sent so far indicate a correct
> installation
> >> of open mpi.
> >>
> >> Howard
> >>
> >>
> >> On Aug 6, 2015 1:54 PM, "Nate Chambers" <ncham...@usna.edu> wrote:
> >> Howard,
> >>
> >> I tried the nightly build openmpi-dev-2223-g731cfe3 and it still
> segfaults as before. I must admit I am new to MPI, so is it possible I'm
> just configuring or running incorrectly? Let me list my steps for you, and
> maybe something will jump out? Also attached is my config.log.
> >>
> >>
> >> CONFIGURE
> >> ./configure --prefix= --enable-mpi-java CC=gcc
> >>
> >> MAKE
> >> make all install
> >>
> >> RUN
> >> /mpirun -np 1 java MPITestBroke twitter/
> >>
> >>
> >> DEFAULT JAVA AND GCC
> >>
> >> $ java -version
> >> java version "1.7.0_21"
> >> Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
> >> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
> >>
> >> $ gcc --v
> >> Using built-in specs.
> >> Target: x86_64-redhat-linux
> >> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
> --infodir=/usr/share/info --with-bugurl=
> http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared
> --enable-threads=posix --enable-checking=release --with-system-zlib
> -

Re: [OMPI users] segfault on java binding from MPI.init()

2015-08-14 Thread Howard Pritchard
Hi Gilles,

Good catch!  Nate, we hadn't been testing on an InfiniPath system.

Howard


2015-08-14 0:20 GMT-06:00 Gilles Gouaillardet <gil...@rist.or.jp>:

> Nate,
>
> I could get rid of the problem by not using the psm mtl.
> The InfiniPath library (used by the psm mtl) sets some signal handlers
> that conflict with the JVM; this can be seen by running
> mpirun -np 1 java -Xcheck:jni MPITestBroke data/
>
> so instead of running
> mpirun -np 1 java MPITestBroke data/
> please run
> mpirun --mca mtl ^psm -np 1 java MPITestBroke data/
>
> that solved the issue for me
>
> Cheers,
>
> Gilles
>
> On 8/13/2015 9:19 AM, Nate Chambers wrote:
>
> *I appreciate you trying to help! I put the Java and its compiled .class
> file on Dropbox. The directory contains the .java and .class files, as well
> as a data/ directory:*
>
> http://www.dropbox.com/sh/pds5c5wecfpb2wk/AAAcz17UTDQErmrUqp2SPjpqa?dl=0
>
> *You can run it with and without MPI:*
>
> >  java MPITestBroke data/
> >  mpirun -np 1 java MPITestBroke data/
>
> *Attached is a text file of what I see when I run it with mpirun and your
> debug flag. Lots of debug lines.*
>
>
> Nate
>
>
>
>
>
> On Wed, Aug 12, 2015 at 11:09 AM, Howard Pritchard < <hpprit...@gmail.com>
> hpprit...@gmail.com> wrote:
>
>> Hi Nate,
>>
>> Sorry for the delay in getting back to you.
>>
>> We're somewhat stuck on how to help you, but here are two suggestions.
>>
>> Could you add the following to your launch command line
>>
>> --mca odls_base_verbose 100
>>
>> so we can see exactly what arguments are being fed to java when launching
>> your app.
>>
>> Also, if you could put your MPITestBroke.class file somewhere (like
>> google drive)
>> where we could get it and try to run locally or at NERSC, that might help
>> us
>> narrow down the problem.  Better yet, if you have the class or jar file
>> for
>> the entire app plus some data sets, we could try that out as well.
>>
>> All the config outputs, etc. you've sent so far indicate a correct
>> installation
>> of open mpi.
>>
>> Howard
>>
>>
>> On Aug 6, 2015 1:54 PM, "Nate Chambers" <ncham...@usna.edu> wrote:
>>
>>> Howard,
>>>
>>> I tried the nightly build openmpi-dev-2223-g731cfe3 and it still
>>> segfaults as before. I must admit I am new to MPI, so is it possible I'm
>>> just configuring or running incorrectly? Let me list my steps for you, and
>>> maybe something will jump out? Also attached is my config.log.
>>>
>>>
>>> CONFIGURE
>>> ./configure --prefix= --enable-mpi-java CC=gcc
>>>
>>> MAKE
>>> make all install
>>>
>>> RUN
>>> /mpirun -np 1 java MPITestBroke twitter/
>>>
>>>
>>> DEFAULT JAVA AND GCC
>>>
>>> $ java -version
>>> java version "1.7.0_21"
>>> Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
>>> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
>>>
>>> $ gcc --v
>>> Using built-in specs.
>>> Target: x86_64-redhat-linux
>>> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
>>> --infodir=/usr/share/info --with-bugurl=
>>> <http://bugzilla.redhat.com/bugzilla>http://bugzilla.redhat.com/bugzilla
>>> --enable-bootstrap --enable-shared --enable-threads=posix
>>> --enable-checking=release --with-system-zlib --enable-__cxa_atexit
>>> --disable-libunwind-exceptions --enable-gnu-unique-object
>>> --enable-languages=c,c++,objc,obj-c++,java,fortran,ada
>>> --enable-java-awt=gtk --disable-dssi
>>> --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre
>>> --enable-libgcj-multifile --enable-java-maintainer-mode
>>> --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib
>>> --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686
>>> --build=x86_64-redhat-linux
>>> Thread model: posix
>>> gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC)
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Aug 6, 2015 at 7:58 AM, Howard Pritchard < <hpprit...@gmail.com>
>>> hpprit...@gmail.com> wrote:
>>>
>>>> HI Nate,
>>>>
>>>> We're trying this out on a mac running mavericks and a cray xc system.
>>>>   the mac has java 8
>>>> while the cray xc has java 7.
>>>>
>>>> We could not get the code to run

Re: [OMPI users] segfault on java binding from MPI.init()

2015-08-13 Thread Howard Pritchard
Hi Nate,

The odls output helps some.  You have a really big CLASSPATH.  Also, there
might be a small chance that the shmem.jar is causing problems.
Could you try unsetting your CLASSPATH just to run the test case?

If the little test case still doesn't work, could you reconfigure the Open MPI
build without oshmem (i.e. configure with --disable-oshmem)?

We've never tested the oshmem jar.

Thanks,

Howard


2015-08-12 18:19 GMT-06:00 Nate Chambers <ncham...@usna.edu>:

> *I appreciate you trying to help! I put the Java and its compiled .class
> file on Dropbox. The directory contains the .java and .class files, as well
> as a data/ directory:*
>
> http://www.dropbox.com/sh/pds5c5wecfpb2wk/AAAcz17UTDQErmrUqp2SPjpqa?dl=0
>
> *You can run it with and without MPI:*
>
> >  java MPITestBroke data/
> >  mpirun -np 1 java MPITestBroke data/
>
> *Attached is a text file of what I see when I run it with mpirun and your
> debug flag. Lots of debug lines.*
>
>
> Nate
>
>
>
>
>
> On Wed, Aug 12, 2015 at 11:09 AM, Howard Pritchard <hpprit...@gmail.com>
> wrote:
>
>> Hi Nate,
>>
>> Sorry for the delay in getting back to you.
>>
>> We're somewhat stuck on how to help you, but here are two suggestions.
>>
>> Could you add the following to your launch command line
>>
>> --mca odls_base_verbose 100
>>
>> so we can see exactly what arguments are being fed to java when launching
>> your app.
>>
>> Also, if you could put your MPITestBroke.class file somewhere (like
>> google drive)
>> where we could get it and try to run locally or at NERSC, that might help
>> us
>> narrow down the problem.  Better yet, if you have the class or jar file
>> for
>> the entire app plus some data sets, we could try that out as well.
>>
>> All the config outputs, etc. you've sent so far indicate a correct
>> installation
>> of open mpi.
>>
>> Howard
>>
>>
>> On Aug 6, 2015 1:54 PM, "Nate Chambers" <ncham...@usna.edu> wrote:
>>
>>> Howard,
>>>
>>> I tried the nightly build openmpi-dev-2223-g731cfe3 and it still
>>> segfaults as before. I must admit I am new to MPI, so is it possible I'm
>>> just configuring or running incorrectly? Let me list my steps for you, and
>>> maybe something will jump out? Also attached is my config.log.
>>>
>>>
>>> CONFIGURE
>>> ./configure --prefix= --enable-mpi-java CC=gcc
>>>
>>> MAKE
>>> make all install
>>>
>>> RUN
>>> /mpirun -np 1 java MPITestBroke twitter/
>>>
>>>
>>> DEFAULT JAVA AND GCC
>>>
>>> $ java -version
>>> java version "1.7.0_21"
>>> Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
>>> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
>>>
>>> $ gcc --v
>>> Using built-in specs.
>>> Target: x86_64-redhat-linux
>>> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
>>> --infodir=/usr/share/info --with-bugurl=
>>> http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared
>>> --enable-threads=posix --enable-checking=release --with-system-zlib
>>> --enable-__cxa_atexit --disable-libunwind-exceptions
>>> --enable-gnu-unique-object
>>> --enable-languages=c,c++,objc,obj-c++,java,fortran,ada
>>> --enable-java-awt=gtk --disable-dssi
>>> --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre
>>> --enable-libgcj-multifile --enable-java-maintainer-mode
>>> --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib
>>> --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686
>>> --build=x86_64-redhat-linux
>>> Thread model: posix
>>> gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC)
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Aug 6, 2015 at 7:58 AM, Howard Pritchard <hpprit...@gmail.com>
>>> wrote:
>>>
>>>> HI Nate,
>>>>
>>>> We're trying this out on a mac running mavericks and a cray xc system.
>>>>   the mac has java 8
>>>> while the cray xc has java 7.
>>>>
>>>> We could not get the code to run just using the java launch command,
>>>> although we noticed if you add
>>>>
>>>> catch(NoClassDefFoundError e) {
>>>>
>>>>   System.out.println("Not using MPI its out to lunch for now");
>>>>
>>>> }
>>>>
>>>> as one 

Re: [OMPI users] segfault on java binding from MPI.init()

2015-08-12 Thread Howard Pritchard
Hi Nate,

Sorry for the delay in getting back to you.

We're somewhat stuck on how to help you, but here are two suggestions.

Could you add the following to your launch command line

--mca odls_base_verbose 100

so we can see exactly what arguments are being fed to java when launching
your app.

Also, if you could put your MPITestBroke.class file somewhere (like google
drive)
where we could get it and try to run locally or at NERSC, that might help
us
narrow down the problem.  Better yet, if you have the class or jar file
for
the entire app plus some data sets, we could try that out as well.

All the config outputs, etc. you've sent so far indicate a correct
installation of Open MPI.

Howard


On Aug 6, 2015 1:54 PM, "Nate Chambers" <ncham...@usna.edu> wrote:

> Howard,
>
> I tried the nightly build openmpi-dev-2223-g731cfe3 and it still segfaults
> as before. I must admit I am new to MPI, so is it possible I'm just
> configuring or running incorrectly? Let me list my steps for you, and maybe
> something will jump out? Also attached is my config.log.
>
>
> CONFIGURE
> ./configure --prefix= --enable-mpi-java CC=gcc
>
> MAKE
> make all install
>
> RUN
> /mpirun -np 1 java MPITestBroke twitter/
>
>
> DEFAULT JAVA AND GCC
>
> $ java -version
> java version "1.7.0_21"
> Java(TM) SE Runtime Environment (build 1.7.0_21-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 23.21-b01, mixed mode)
>
> $ gcc --v
> Using built-in specs.
> Target: x86_64-redhat-linux
> Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
> --infodir=/usr/share/info --with-bugurl=
> http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared
> --enable-threads=posix --enable-checking=release --with-system-zlib
> --enable-__cxa_atexit --disable-libunwind-exceptions
> --enable-gnu-unique-object
> --enable-languages=c,c++,objc,obj-c++,java,fortran,ada
> --enable-java-awt=gtk --disable-dssi
> --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-1.5.0.0/jre
> --enable-libgcj-multifile --enable-java-maintainer-mode
> --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --disable-libjava-multilib
> --with-ppl --with-cloog --with-tune=generic --with-arch_32=i686
> --build=x86_64-redhat-linux
> Thread model: posix
> gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC)
>
>
>
>
>
> On Thu, Aug 6, 2015 at 7:58 AM, Howard Pritchard <hpprit...@gmail.com>
> wrote:
>
>> HI Nate,
>>
>> We're trying this out on a mac running mavericks and a cray xc system.
>> the mac has java 8
>> while the cray xc has java 7.
>>
>> We could not get the code to run just using the java launch command,
>> although we noticed if you add
>>
>> catch(NoClassDefFoundError e) {
>>
>>   System.out.println("Not using MPI its out to lunch for now");
>>
>> }
>>
>> as one of the catches after the try for firing up MPI, you can get
>> further.
>>
>> Instead we tried on the two systems using
>>
>> mpirun -np 1 java MPITestBroke tweets repeat.txt
>>
>> and, you guessed it, we can't reproduce the error, at least using master.
>>
>> Would you mind trying to get a copy of nightly master build off of
>>
>> http://www.open-mpi.org/nightly/master/
>>
>> and install that version and give it a try.
>>
>> If that works, then I'd suggest using master (or v2.0) for now.
>>
>> Howard
>>
>>
>>
>>
>> 2015-08-05 14:41 GMT-06:00 Nate Chambers <ncham...@usna.edu>:
>>
>>> Howard,
>>>
>>> Thanks for looking at all this. Adding System.gc() did not cause it to
>>> segfault. The segfault still comes much later in the processing.
>>>
>>> I was able to reduce my code to a single test file without other
>>> dependencies. It is attached. This code simply opens a text file and reads
>>> its lines, one by one. Once finished, it closes and opens the same file and
>>> reads the lines again. On my system, it does this about 4 times until the
>>> segfault fires. Obviously this code makes no sense, but it's based on our
>>> actual code that reads millions of lines of data and does various
>>> processing to it.
>>>
>>> Attached is a tweets.tgz file that you can uncompress to have an input
>>> directory. The text file is just the same line over and over again. Run it
>>> as:
>>>
>>> *java MPITestBroke tweets/*
>>>
>>>
>>> Nate
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Aug 5, 2015 at 8:29 AM, Howard Pritchard <hpprit...@gmail.com>

Re: [OMPI users] segfault on java binding from MPI.init()

2015-08-05 Thread Howard Pritchard
Thanks, Nate.  We will give the test a try.

--

sent from my smart phone so no good typing.

Howard
On Aug 5, 2015 2:42 PM, "Nate Chambers" <ncham...@usna.edu> wrote:

> Howard,
>
> Thanks for looking at all this. Adding System.gc() did not cause it to
> segfault. The segfault still comes much later in the processing.
>
> I was able to reduce my code to a single test file without other
> dependencies. It is attached. This code simply opens a text file and reads
> its lines, one by one. Once finished, it closes and opens the same file and
> reads the lines again. On my system, it does this about 4 times until the
> segfault fires. Obviously this code makes no sense, but it's based on our
> actual code that reads millions of lines of data and does various
> processing to it.
>
> Attached is a tweets.tgz file that you can uncompress to have an input
> directory. The text file is just the same line over and over again. Run it
> as:
>
> *java MPITestBroke tweets/*
>
>
> Nate
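
(The attached MPITestBroke.java is not preserved in this archive.  From Nate's
description, a rough sketch of what it does, with invented names and an
invented file name, would be something like the following; only MPI.Init and
MPI.Finalize are MPI calls, the loop itself never touches MPI.)

    import java.io.BufferedReader;
    import java.io.FileReader;
    import mpi.*;

    // Hypothetical reconstruction of the attached test: initialize MPI, then
    // re-read the same text file several times without any MPI traffic.
    public class MPITestBrokeSketch {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            String dir = args.length > 0 ? args[0] : "data";
            for (int pass = 0; pass < 10; pass++) {      // segfault reportedly appears around pass 4
                BufferedReader in = new BufferedReader(new FileReader(dir + "/repeat.txt"));
                long lines = 0;
                while (in.readLine() != null) {
                    lines++;                             // no MPI calls anywhere in this loop
                }
                in.close();
                System.out.println("pass " + pass + ": " + lines + " lines");
            }
            MPI.Finalize();
        }
    }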
>
>
>
>
>
> On Wed, Aug 5, 2015 at 8:29 AM, Howard Pritchard <hpprit...@gmail.com>
> wrote:
>
>> Hi Nate,
>>
>> Sorry for the delay in getting back.  Thanks for the sanity check.  You
>> may have a point about the args string to MPI.init -
>> there's nothing the Open MPI is needing from this but that is a
>> difference with your use case - your app has an argument.
>>
>> Would you mind adding a
>>
>> System.gc()
>>
>> call immediately after MPI.init call and see if the gc blows up with a
>> segfault?
>>
>> Also, may be interesting to add the -verbose:jni to your command line.
>>
>> We'll do some experiments here with the init string arg.
>>
>> Is your app open source where we could download it and try to reproduce
>> the problem locally?
>>
>> thanks,
>>
>> Howard
>>
>>
>> 2015-08-04 18:52 GMT-06:00 Nate Chambers <ncham...@usna.edu>:
>>
>>> Sanity checks pass. Both Hello and Ring.java run correctly with the
>>> expected program's output.
>>>
>>> Does MPI.init(args) expect anything from those command-line args?
>>>
>>>
>>> Nate
>>>
>>>
>>> On Tue, Aug 4, 2015 at 12:26 PM, Howard Pritchard <hpprit...@gmail.com>
>>> wrote:
>>>
>>>> Hello Nate,
>>>>
>>>> As a sanity check of your installation, could you try to compile the
>>>> examples/*.java codes using the mpijavac you've installed and see that
>>>> those run correctly?
>>>> I'd be just interested in the Hello.java and Ring.java?
>>>>
>>>> Howard
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> 2015-08-04 14:34 GMT-06:00 Nate Chambers <ncham...@usna.edu>:
>>>>
>>>>> Sure, I reran the configure with CC=gcc and then make install. I think
>>>>> that's the proper way to do it. Attached is my config log. The behavior
>>>>> when running our code appears to be the same. The output is the same error
>>>>> I pasted in my email above. It occurs when calling MPI.init().
>>>>>
>>>>> I'm not great at debugging this sort of stuff, but happy to try things
>>>>> out if you need me to.
>>>>>
>>>>> Nate
>>>>>
>>>>>
>>>>> On Tue, Aug 4, 2015 at 5:09 AM, Howard Pritchard <hpprit...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello Nate,
>>>>>>
>>>>>> As a first step to addressing this, could you please try using gcc
>>>>>> rather than the Intel compilers to build Open MPI?
>>>>>>
>>>>>> We've been doing a lot of work recently on the java bindings, etc.
>>>>>> but have never tried using any compilers other
>>>>>> than gcc when working with the java bindings.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Howard
>>>>>>
>>>>>>
>>>>>> 2015-08-03 17:36 GMT-06:00 Nate Chambers <ncham...@usna.edu>:
>>>>>>
>>>>>>> We've been struggling with this error for a while, so hoping someone
>>>>>>> more knowledgeable can help!
>>>>>>>
>>>>>>> Our java MPI code exits with a segfault during its normal operation, 
>>>>>>> *but

Re: [OMPI users] segfault on java binding from MPI.init()

2015-08-05 Thread Howard Pritchard
Hi Nate,

Sorry for the delay in getting back.  Thanks for the sanity check.  You may
have a point about the args string passed to MPI.init:
there's nothing Open MPI needs from it, but it is a difference from your use
case, since your app has an argument.

Would you mind adding a

System.gc()

call immediately after the MPI.init call and seeing if the gc blows up with a
segfault?

Also, it may be interesting to add -verbose:jni to your java command line.
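
(Put together, the check being asked for amounts to something like the sketch
below; the class name is invented, and the calls are the standard ones from
the Open MPI Java bindings.)

    import mpi.*;

    // Minimal sketch of the suggested check: force a GC immediately after MPI
    // initialization so a bad interaction between the JVM and a signal handler
    // installed during init shows up right away, rather than much later in the
    // application.
    public class InitGcCheck {
        public static void main(String[] args) throws MPIException {
            MPI.Init(args);
            System.gc();          // does the collector survive right after init?
            System.out.println("rank " + MPI.COMM_WORLD.getRank()
                    + " survived System.gc() after MPI.Init");
            MPI.Finalize();
        }
    }

Run it along the lines of:  mpirun -np 1 java -verbose:jni InitGcCheck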

We'll do some experiments here with the init string arg.

Is your app open source where we could download it and try to reproduce the
problem locally?

thanks,

Howard


2015-08-04 18:52 GMT-06:00 Nate Chambers <ncham...@usna.edu>:

> Sanity checks pass. Both Hello and Ring.java run correctly with the
> expected program's output.
>
> Does MPI.init(args) expect anything from those command-line args?
>
>
> Nate
>
>
> On Tue, Aug 4, 2015 at 12:26 PM, Howard Pritchard <hpprit...@gmail.com>
> wrote:
>
>> Hello Nate,
>>
>> As a sanity check of your installation, could you try to compile the
>> examples/*.java codes using the mpijavac you've installed and see that
>> those run correctly?
>> I'd be just interested in the Hello.java and Ring.java?
>>
>> Howard
>>
>>
>>
>>
>>
>>
>>
>> 2015-08-04 14:34 GMT-06:00 Nate Chambers <ncham...@usna.edu>:
>>
>>> Sure, I reran the configure with CC=gcc and then make install. I think
>>> that's the proper way to do it. Attached is my config log. The behavior
>>> when running our code appears to be the same. The output is the same error
>>> I pasted in my email above. It occurs when calling MPI.init().
>>>
>>> I'm not great at debugging this sort of stuff, but happy to try things
>>> out if you need me to.
>>>
>>> Nate
>>>
>>>
>>> On Tue, Aug 4, 2015 at 5:09 AM, Howard Pritchard <hpprit...@gmail.com>
>>> wrote:
>>>
>>>> Hello Nate,
>>>>
>>>> As a first step to addressing this, could you please try using gcc
>>>> rather than the Intel compilers to build Open MPI?
>>>>
>>>> We've been doing a lot of work recently on the java bindings, etc. but
>>>> have never tried using any compilers other
>>>> than gcc when working with the java bindings.
>>>>
>>>> Thanks,
>>>>
>>>> Howard
>>>>
>>>>
>>>> 2015-08-03 17:36 GMT-06:00 Nate Chambers <ncham...@usna.edu>:
>>>>
>>>>> We've been struggling with this error for a while, so hoping someone
>>>>> more knowledgeable can help!
>>>>>
>>>>> Our java MPI code exits with a segfault during its normal operation, *but
>>>>> the segfault occurs before our code ever uses MPI functionality like
>>>>> sending/receiving. *We've removed all message calls and any use of
>>>>> MPI.COMM_WORLD from the code. The segfault occurs if we call 
>>>>> MPI.init(args)
>>>>> in our code, and does not if we comment that line out. Further vexing us,
>>>>> the crash doesn't happen at the point of the MPI.init call, but later on 
>>>>> in
>>>>> the program. I don't have an easy-to-run example here because our non-MPI
>>>>> code is so large and complicated. We have run simpler test programs with
>>>>> MPI and the segfault does not occur.
>>>>>
>>>>> We have isolated the line where the segfault occurs. However, if we
>>>>> comment that out, the program will run longer, but then randomly (but
>>>>> deterministically) segfault later on in the code. Does anyone have tips on
>>>>> how to debug this? We have tried several flags with mpirun, but no good
>>>>> clues.
>>>>>
>>>>> We have also tried several MPI versions, including stable 1.8.7 and
>>>>> the most recent 1.8.8rc1
>>>>>
>>>>>
>>>>> ATTACHED
>>>>> - config.log from installation
>>>>> - output from `ompi_info -all`
>>>>>
>>>>>
>>>>> OUTPUT FROM RUNNING
>>>>>
>>>>> > mpirun -np 2 java -mx4g FeaturizeDay datadir/ days.txt
>>>>> ...
>>>>> some normal output from our code
>>>>> ...
>>>>>
>>>>> --
>>>>> mpirun noticed that process rank 0 with PID 29646 on node 

Re: [OMPI users] segfault on java binding from MPI.init()

2015-08-04 Thread Howard Pritchard
Hello Nate,

As a sanity check of your installation, could you try compiling the
examples/*.java codes using the mpijavac you've installed and verifying that
they run correctly?
I'd just be interested in Hello.java and Ring.java.

Howard
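
(Those examples ship in the examples/ directory of the Open MPI source tree.
The shape of the sanity check is roughly the sketch below; this is not the
shipped file itself, just an illustration of what should compile with mpijavac
and print one line per rank under mpirun.)

    import mpi.*;

    // Rough equivalent of the examples/Hello.java sanity check: if this
    // compiles and prints one line per rank, the installation and the Java
    // bindings are basically working.
    public class HelloSketch {
        public static void main(String[] args) throws MPIException {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.getRank();
            int size = MPI.COMM_WORLD.getSize();
            System.out.println("Hello from rank " + rank + " of " + size);
            MPI.Finalize();
        }
    }

    mpijavac HelloSketch.java
    mpirun -np 2 java HelloSketch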







2015-08-04 14:34 GMT-06:00 Nate Chambers <ncham...@usna.edu>:

> Sure, I reran the configure with CC=gcc and then make install. I think
> that's the proper way to do it. Attached is my config log. The behavior
> when running our code appears to be the same. The output is the same error
> I pasted in my email above. It occurs when calling MPI.init().
>
> I'm not great at debugging this sort of stuff, but happy to try things out
> if you need me to.
>
> Nate
>
>
> On Tue, Aug 4, 2015 at 5:09 AM, Howard Pritchard <hpprit...@gmail.com>
> wrote:
>
>> Hello Nate,
>>
>> As a first step to addressing this, could you please try using gcc rather
>> than the Intel compilers to build Open MPI?
>>
>> We've been doing a lot of work recently on the java bindings, etc. but
>> have never tried using any compilers other
>> than gcc when working with the java bindings.
>>
>> Thanks,
>>
>> Howard
>>
>>
>> 2015-08-03 17:36 GMT-06:00 Nate Chambers <ncham...@usna.edu>:
>>
>>> We've been struggling with this error for a while, so hoping someone
>>> more knowledgeable can help!
>>>
>>> Our java MPI code exits with a segfault during its normal operation, *but
>>> the segfault occurs before our code ever uses MPI functionality like
>>> sending/receiving. *We've removed all message calls and any use of
>>> MPI.COMM_WORLD from the code. The segfault occurs if we call MPI.init(args)
>>> in our code, and does not if we comment that line out. Further vexing us,
>>> the crash doesn't happen at the point of the MPI.init call, but later on in
>>> the program. I don't have an easy-to-run example here because our non-MPI
>>> code is so large and complicated. We have run simpler test programs with
>>> MPI and the segfault does not occur.
>>>
>>> We have isolated the line where the segfault occurs. However, if we
>>> comment that out, the program will run longer, but then randomly (but
>>> deterministically) segfault later on in the code. Does anyone have tips on
>>> how to debug this? We have tried several flags with mpirun, but no good
>>> clues.
>>>
>>> We have also tried several MPI versions, including stable 1.8.7 and the
>>> most recent 1.8.8rc1
>>>
>>>
>>> ATTACHED
>>> - config.log from installation
>>> - output from `ompi_info -all`
>>>
>>>
>>> OUTPUT FROM RUNNING
>>>
>>> > mpirun -np 2 java -mx4g FeaturizeDay datadir/ days.txt
>>> ...
>>> some normal output from our code
>>> ...
>>>
>>> --
>>> mpirun noticed that process rank 0 with PID 29646 on node r9n69 exited
>>> on signal 11 (Segmentation fault).
>>>
>>> --
>>>
>>>
>>>
>>>
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>>> Link to this post:
>>> http://www.open-mpi.org/community/lists/users/2015/08/27386.php
>>>
>>
>>
>> ___
>> users mailing list
>> us...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2015/08/27389.php
>>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/08/27391.php
>


Re: [OMPI users] segfault on java binding from MPI.init()

2015-08-04 Thread Howard Pritchard
Hello Nate,

As a first step to addressing this, could you please try using gcc rather
than the Intel compilers to build Open MPI?

We've been doing a lot of work recently on the java bindings, etc. but have
never tried using any compilers other
than gcc when working with the java bindings.

Thanks,

Howard


2015-08-03 17:36 GMT-06:00 Nate Chambers :

> We've been struggling with this error for a while, so hoping someone more
> knowledgeable can help!
>
> Our java MPI code exits with a segfault during its normal operation, *but
> the segfault occurs before our code ever uses MPI functionality like
> sending/receiving. *We've removed all message calls and any use of
> MPI.COMM_WORLD from the code. The segfault occurs if we call MPI.init(args)
> in our code, and does not if we comment that line out. Further vexing us,
> the crash doesn't happen at the point of the MPI.init call, but later on in
> the program. I don't have an easy-to-run example here because our non-MPI
> code is so large and complicated. We have run simpler test programs with
> MPI and the segfault does not occur.
>
> We have isolated the line where the segfault occurs. However, if we
> comment that out, the program will run longer, but then randomly (but
> deterministically) segfault later on in the code. Does anyone have tips on
> how to debug this? We have tried several flags with mpirun, but no good
> clues.
>
> We have also tried several MPI versions, including stable 1.8.7 and the
> most recent 1.8.8rc1
>
>
> ATTACHED
> - config.log from installation
> - output from `ompi_info -all`
>
>
> OUTPUT FROM RUNNING
>
> > mpirun -np 2 java -mx4g FeaturizeDay datadir/ days.txt
> ...
> some normal output from our code
> ...
> --
> mpirun noticed that process rank 0 with PID 29646 on node r9n69 exited on
> signal 11 (Segmentation fault).
> --
>
>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/08/27386.php
>


Re: [OMPI users] Running with native ugni on a Cray XC

2015-06-30 Thread Howard Pritchard
Hi Nick

No.  You have to use mpirun in this case, and you need to ask for a larger
batch allocation than the initial mpirun requires (you do still need a batch
allocation).  Also note that mpirun doesn't currently work with nativized
slurm; it's on my to-do list to fix.

Howard

--

sent from my smart phone so no good typing.

Howard
On Jun 30, 2015 3:51 PM, "Nick Radcliffe" <nradc...@cray.com> wrote:

>  Howard,
>
> I have one more question. Is it possible to use MPI_Comm_spawn when
> launching an OpenMPI job with aprun? I'm getting this error when I try:
>
> nradclif@kay:/lus/scratch/nradclif> aprun -n 1 -N 1 ./manager
> [nid00036:21772] [[14952,0],0] ORTE_ERROR_LOG: Not available in file
> dpm_orte.c at line 1190
> [36:21772] *** An error occurred in MPI_Comm_spawn
> [36:21772] *** reported by process [979894272,0]
> [36:21772] *** on communicator MPI_COMM_SELF
> [36:21772] *** MPI_ERR_UNKNOWN: unknown error
> [36:21772] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will
> now abort,
> [36:21772] ***and potentially your MPI job)
> aborting job:
> N/A
>
>
> Nick Radcliffe
> Software Engineer
> Cray, Inc.
>  ------
> *From:* users [users-boun...@open-mpi.org] on behalf of Howard Pritchard [
> hpprit...@gmail.com]
> *Sent:* Thursday, June 25, 2015 11:00 PM
> *To:* Open MPI Users
> *Subject:* Re: [OMPI users] Running with native ugni on a Cray XC
>
>   Hi Nick,
>
>  I will endeavor to put together a wiki for the master/v2.x series
> specific to Cray systems
> (sans those customers who choose to neither 1) use Cray supported eslogin
> setup nor 2)  permit users to directly log in to and build apps on service
> nodes)  that explains best practices for
> using Open MPI on Cray XE/XK/XC systems.
>
>  A significant  amount of work went in to master, and now the v2.x release
> stream to rationalize support for Open MPI on Cray XE/XK/XC systems using
> either aprun
> or native slurm launch.
>
>  General advice for all on this mailing list, do not use the Open MPI
> 1.8.X release
> series with direct ugni access enabled on Cray XE/XK/XC .  Rather use
> master, or as soon as
> a release is available, from v2.x.   Note that if you are using CCM,  the
> performance
> of Open MPI 1.8.X over the Cray IAA (simulated ibverbs) is pretty good.  I
> suggest this
> as the preferred route for using the 1.8.X release stream on Cray XE/XK/XC.
>
>  Howard
>
>
> 2015-06-25 19:35 GMT-06:00 Nick Radcliffe <nradc...@cray.com>:
>
>>  Thanks Howard, using master worked for me.
>>
>> Nick Radcliffe
>> Software Engineer
>> Cray, Inc.
>>  --
>> *From:* users [users-boun...@open-mpi.org] on behalf of Howard Pritchard
>> [hpprit...@gmail.com]
>> *Sent:* Thursday, June 25, 2015 5:11 PM
>> *To:* Open MPI Users
>> *Subject:* Re: [OMPI users] Running with native ugni on a Cray XC
>>
>>   Hi Nick
>>
>> use master not 1.8.x. for cray xc.  also for config do not pay attention
>> to cray/lanl platform files.  just do config.  also if using nativized
>> slurm launch with srun not mpirun.
>>
>> howard
>>
>> --
>>
>> sent from my smart phone so no good typing.
>>
>> Howard
>> On Jun 25, 2015 2:56 PM, "Nick Radcliffe" <nradc...@cray.com> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to build and run Open MPI 1.8.5 with native ugni on a Cray
>>> XC. The build works, but I'm getting this error when I run:
>>>
>>> nradclif@kay:/lus/scratch/nradclif> aprun -n 2 -N 1 ./osu_latency
>>> [nid00014:28784] [db_pmi.c:174:pmi_commit_packed] PMI_KVS_Put: Operation
>>> failed
>>> [nid00014:28784] [db_pmi.c:457:commit] PMI_KVS_Commit: Operation failed
>>> [nid00012:12788] [db_pmi.c:174:pmi_commit_packed] PMI_KVS_Put: Operation
>>> failed
>>> [nid00012:12788] [db_pmi.c:457:commit] PMI_KVS_Commit: Operation failed
>>> # OSU MPI Latency Test
>>> # SizeLatency (us)
>>> osu_latency: btl_ugni_endpoint.c:87: mca_btl_ugni_ep_connect_start:
>>> Assertion `0' failed.
>>> [nid00012:12788] *** Process received signal ***
>>> [nid00012:12788] Signal: Aborted (6)
>>> [nid00012:12788] Signal code:  (-6)
>>> [nid00012:12788] [ 0] /lib64/libpthread.so.0(+0xf850)[0x2b42b850]
>>> [nid00012:12788] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b66b885]
>>> [nid00012:12788] [ 2] /lib64/libc.so.6(abort+0x181)[0x2b66ce61]
>>> [nid00012:12788] [ 3]
>>> /lib64/libc.so.6(__assert_fail+0xf0)[0x2b664740]
>>> 

Re: [OMPI users] Running with native ugni on a Cray XC

2015-06-26 Thread Howard Pritchard
Hi Nick,

I will endeavor to put together a wiki for the master/v2.x series, specific
to Cray systems, that explains best practices for using Open MPI on Cray
XE/XK/XC systems (excluding sites that neither 1) use the Cray-supported
eslogin setup nor 2) permit users to log in to and build apps directly on
service nodes).

A significant amount of work went into master, and now into the v2.x release
stream, to rationalize support for Open MPI on Cray XE/XK/XC systems using
either aprun or native slurm launch.

General advice for all on this mailing list: do not use the Open MPI 1.8.X
release series with direct ugni access enabled on Cray XE/XK/XC.  Rather, use
master or, as soon as a release is available, v2.x.  Note that if you are
using CCM, the performance of Open MPI 1.8.X over the Cray IAA (simulated
ibverbs) is pretty good; I suggest this as the preferred route for using the
1.8.X release stream on Cray XE/XK/XC.

Howard


2015-06-25 19:35 GMT-06:00 Nick Radcliffe <nradc...@cray.com>:

>  Thanks Howard, using master worked for me.
>
> Nick Radcliffe
> Software Engineer
> Cray, Inc.
>  --
> *From:* users [users-boun...@open-mpi.org] on behalf of Howard Pritchard [
> hpprit...@gmail.com]
> *Sent:* Thursday, June 25, 2015 5:11 PM
> *To:* Open MPI Users
> *Subject:* Re: [OMPI users] Running with native ugni on a Cray XC
>
>   Hi Nick
>
> use master not 1.8.x. for cray xc.  also for config do not pay attention
> to cray/lanl platform files.  just do config.  also if using nativized
> slurm launch with srun not mpirun.
>
> howard
>
> --
>
> sent from my smart phone so no good typing.
>
> Howard
> On Jun 25, 2015 2:56 PM, "Nick Radcliffe" <nradc...@cray.com> wrote:
>
>> Hi,
>>
>> I'm trying to build and run Open MPI 1.8.5 with native ugni on a Cray XC.
>> The build works, but I'm getting this error when I run:
>>
>> nradclif@kay:/lus/scratch/nradclif> aprun -n 2 -N 1 ./osu_latency
>> [nid00014:28784] [db_pmi.c:174:pmi_commit_packed] PMI_KVS_Put: Operation
>> failed
>> [nid00014:28784] [db_pmi.c:457:commit] PMI_KVS_Commit: Operation failed
>> [nid00012:12788] [db_pmi.c:174:pmi_commit_packed] PMI_KVS_Put: Operation
>> failed
>> [nid00012:12788] [db_pmi.c:457:commit] PMI_KVS_Commit: Operation failed
>> # OSU MPI Latency Test
>> # SizeLatency (us)
>> osu_latency: btl_ugni_endpoint.c:87: mca_btl_ugni_ep_connect_start:
>> Assertion `0' failed.
>> [nid00012:12788] *** Process received signal ***
>> [nid00012:12788] Signal: Aborted (6)
>> [nid00012:12788] Signal code:  (-6)
>> [nid00012:12788] [ 0] /lib64/libpthread.so.0(+0xf850)[0x2b42b850]
>> [nid00012:12788] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b66b885]
>> [nid00012:12788] [ 2] /lib64/libc.so.6(abort+0x181)[0x2b66ce61]
>> [nid00012:12788] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x2b664740]
>> [nid00012:12788] [ 4]
>> /lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(mca_btl_ugni_ep_connect_progress+0x6c9)[0x2aff9869]
>> [nid00012:12788] [ 5]
>> /lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(+0x5ae32)[0x2af46e32]
>> [nid00012:12788] [ 6]
>> /lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(mca_btl_ugni_sendi+0x8bd)[0x2affaf7d]
>> [nid00012:12788] [ 7]
>> /lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(+0x1f0c17)[0x2b0dcc17]
>> [nid00012:12788] [ 8]
>> /lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(mca_pml_ob1_isend+0xa8)[0x2b0dd488]
>> [nid00012:12788] [ 9]
>> /lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(ompi_coll_tuned_barrier_intra_two_procs+0x11b)[0x2b07e84b]
>> [nid00012:12788] [10]
>> /lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(PMPI_Barrier+0xb6)[0x2af8a7c6]
>> [nid00012:12788] [11] ./osu_latency[0x401114]
>> [nid00012:12788] [12]
>> /lib64/libc.so.6(__libc_start_main+0xe6)[0x2b657c36]
>> [nid00012:12788] [13] ./osu_latency[0x400dd9]
>> [nid00012:12788] *** End of error message ***
>> osu_latency: btl_ugni_endpoint.c:87: mca_btl_ugni_ep_connect_start:
>> Assertion `0' failed.
>>
>>
>> Here's how I build:
>>
>> export FC=ftn (I'm not using Fortran, but the configure fails if
>> it can't find a Fortran compiler)
>> ./configure --prefix=/lus/scratch/nradclif/openmpi_install
>> --enable-mpi-fortran=none
>> --with-platform=contrib/platform/lanl/cray_xe6/debug-lustre
>> make install
>>
>> I didn't modify the debug-lustre file, but I did change cray-common to
>> remove the hard-coding, e.g., rather than using the gemini-specific path
>> "wit

Re: [OMPI users] Running with native ugni on a Cray XC

2015-06-25 Thread Howard Pritchard
Hi Nick

Use master, not 1.8.x, for Cray XC.  Also, for configuration, do not pay
attention to the cray/lanl platform files; just run a plain configure.  Also,
if using nativized slurm, launch with srun, not mpirun.

howard

--

sent from my smart phone so no good typing.

Howard
On Jun 25, 2015 2:56 PM, "Nick Radcliffe"  wrote:

> Hi,
>
> I'm trying to build and run Open MPI 1.8.5 with native ugni on a Cray XC.
> The build works, but I'm getting this error when I run:
>
> nradclif@kay:/lus/scratch/nradclif> aprun -n 2 -N 1 ./osu_latency
> [nid00014:28784] [db_pmi.c:174:pmi_commit_packed] PMI_KVS_Put: Operation
> failed
> [nid00014:28784] [db_pmi.c:457:commit] PMI_KVS_Commit: Operation failed
> [nid00012:12788] [db_pmi.c:174:pmi_commit_packed] PMI_KVS_Put: Operation
> failed
> [nid00012:12788] [db_pmi.c:457:commit] PMI_KVS_Commit: Operation failed
> # OSU MPI Latency Test
> # SizeLatency (us)
> osu_latency: btl_ugni_endpoint.c:87: mca_btl_ugni_ep_connect_start:
> Assertion `0' failed.
> [nid00012:12788] *** Process received signal ***
> [nid00012:12788] Signal: Aborted (6)
> [nid00012:12788] Signal code:  (-6)
> [nid00012:12788] [ 0] /lib64/libpthread.so.0(+0xf850)[0x2b42b850]
> [nid00012:12788] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x2b66b885]
> [nid00012:12788] [ 2] /lib64/libc.so.6(abort+0x181)[0x2b66ce61]
> [nid00012:12788] [ 3] /lib64/libc.so.6(__assert_fail+0xf0)[0x2b664740]
> [nid00012:12788] [ 4]
> /lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(mca_btl_ugni_ep_connect_progress+0x6c9)[0x2aff9869]
> [nid00012:12788] [ 5]
> /lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(+0x5ae32)[0x2af46e32]
> [nid00012:12788] [ 6]
> /lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(mca_btl_ugni_sendi+0x8bd)[0x2affaf7d]
> [nid00012:12788] [ 7]
> /lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(+0x1f0c17)[0x2b0dcc17]
> [nid00012:12788] [ 8]
> /lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(mca_pml_ob1_isend+0xa8)[0x2b0dd488]
> [nid00012:12788] [ 9]
> /lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(ompi_coll_tuned_barrier_intra_two_procs+0x11b)[0x2b07e84b]
> [nid00012:12788] [10]
> /lus/scratch/nradclif/openmpi_install/lib/libmpi.so.1(PMPI_Barrier+0xb6)[0x2af8a7c6]
> [nid00012:12788] [11] ./osu_latency[0x401114]
> [nid00012:12788] [12]
> /lib64/libc.so.6(__libc_start_main+0xe6)[0x2b657c36]
> [nid00012:12788] [13] ./osu_latency[0x400dd9]
> [nid00012:12788] *** End of error message ***
> osu_latency: btl_ugni_endpoint.c:87: mca_btl_ugni_ep_connect_start:
> Assertion `0' failed.
>
>
> Here's how I build:
>
> export FC=ftn (I'm not using Fortran, but the configure fails if
> it can't find a Fortran compiler)
> ./configure --prefix=/lus/scratch/nradclif/openmpi_install
> --enable-mpi-fortran=none
> --with-platform=contrib/platform/lanl/cray_xe6/debug-lustre
> make install
>
> I didn't modify the debug-lustre file, but I did change cray-common to
> remove the hard-coding, e.g., rather than using the gemini-specific path
> "with_pmi=/opt/cray/pmi/2.1.4-1..8596.8.9.gem", I used
> "with_pmi=/opt/cray/pmi/default".
>
> I've tried running different executables with different numbers of
> ranks/nodes, but they all seem to run into problems with PMI_KVS_Put.
>
> Any ideas what could be going wrong?
>
> Thanks for any help,
> Nick
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/06/27197.php
>


Re: [OMPI users] hybrid programming and OpenMPI compilation

2015-06-25 Thread Howard Pritchard
Hello Fedele,

Would it be possible to build the Open MPI package with gfortran
and run the test again?

Do you observe this problem if you build an OpenMP-only (OpenMP, not MPI)
version of the test case?

I can't reproduce this problem using gfortran.  I don't have access to an
Intel compiler at the moment.

Also, please send the output of ompi_info.

Thanks,

Howard


2015-06-25 10:37 GMT-06:00 Fedele Stabile :

> Hello to all,
> I'm trying hybrid OpenMP + MPI programming, when I run the simple code
> listed below I have an error:
> forrtl: severe (40): recursive I/O operation, unit -1, file unknown
> Image  PCRoutineLine
> Source
> aa 00403D8E  Unknown   Unknown
> Unknown
> aa 00403680  Unknown   Unknown
> Unknown
> libiomp5.so2B705F7C5BB3  Unknown   Unknown
> Unknown
> libiomp5.so2B705F79A617  Unknown   Unknown
> Unknown
> libiomp5.so2B705F799D3A  Unknown   Unknown
> Unknown
> libiomp5.so2B705F7C5EAD  Unknown   Unknown
> Unknown
> libpthread.so.02B705FA699D1  Unknown   Unknown
> Unknown
> libc.so.6  2B705FD688FD  Unknown   Unknown
> Unknown
> ---
> Primary job  terminated normally, but 1 process returned
> a non-zero exit code.. Per user-direction, the job has been aborted.
> ---
> --
> mpirun detected that one or more processes exited with non-zero status,
> thus causing
> the job to be terminated. The first process to do so was:
>
>   Process name: [[61634,1],0]
>   Exit code:40
>
> I have compiled OpenMPI using this configuration options:
> ./configure --prefix=/data/apps/mpi/openmpi-1.8.4-intel
> -enable-mpirun-prefix-by-default --enable-mpi-fortran
> --enable-mpi-thread-multiple
> --with-tm=/usr/local/torque-5.1.0-1_4048f77c/src --with-verbs
> --with-openib --with-cuda=/usr/local/cuda-6.5
>
> This is the listing of the simple code:
> program hello
> include "mpif.h"
>
> integer numprocs, rank, namelen, ierr
> character*(MPI_MAX_PROCESSOR_NAME) processor_name
> integer iam, np
> integer omp_get_num_threads, omp_get_thread_num
>
> call MPI_Init(ierr)
> call MPI_Comm_size(MPI_COMM_WORLD, numprocs, ierr)
> call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
> call MPI_Get_processor_name(processor_name, namelen, ierr)
> iam = 0
> np = 1
> !$omp parallel default(shared) private(iam, np)
>
> np = omp_get_num_threads()
> iam = omp_get_thread_num();
> write(*,*)"Hello from thread ", iam," out of ", np,
>  %  " from process ", rank," out of ", numprocs,
>  %  " on ", processor_name
>
> !$omp end parallel
> call MPI_Finalize(ierr)
> stop
> end
>
> Can you help me to solve the problem?
> Thank you,
> Fedele Stabile
>
>
> ___
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2015/06/27192.php
>

