Howard,
On Wed, Nov 14, 2018 at 5:26 AM Howard Pritchard <hpprit...@gmail.com> wrote:
>
> Hello Bert,
>
> What OS are you running on your notebook?

Ubuntu 18.04

>
> If you are running Linux, and you have root access to your system,  then
> you should be able to resolve the Open SHMEM support issue by installing
> the XPMEM device driver on your system, and rebuilding UCX so it picks
> up XPMEM support.
>
> The source code is on GitHub:
>
> https://github.com/hjelmn/xpmem
>
> Some instructions on how to build the xpmem device driver are at
>
> https://github.com/hjelmn/xpmem/wiki/Installing-XPMEM
>
> You will need to install the kernel source and symbols rpms on your
> system before building the xpmem device driver.
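
The steps above might look roughly like this on Ubuntu, where the kernel headers package stands in for the source/symbols rpms. This is only a sketch: the install prefixes, the module path `kernel/xpmem.ko`, and the UCX source directory are assumptions; the wiki page above is the authoritative reference.

```shell
# Prerequisite on Ubuntu: headers for the running kernel
sudo apt-get install linux-headers-"$(uname -r)"

# Build and install the XPMEM kernel module and user-space library
git clone https://github.com/hjelmn/xpmem.git
cd xpmem
./autogen.sh
./configure --prefix=/opt/xpmem
make
sudo make install
sudo insmod kernel/xpmem.ko      # load the device driver (path may differ)
sudo chmod 666 /dev/xpmem        # make the device accessible to users

# Rebuild UCX so it picks up XPMEM, then rebuild Open MPI against that UCX
cd /path/to/ucx-1.4.0
./configure --prefix=/opt/ucx --with-xpmem=/opt/xpmem
make -j"$(nproc)" && sudo make install
```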

I will try that. I already tried KNEM, which also did not work.
Though that definitely leaves the territory of convenience. For a
development machine where performance doesn't matter, it's a huge step
back for Open MPI, I think.

I will report back whether that works.
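
For the record, a quick way to check whether a rebuilt UCX actually picked up XPMEM before retrying, assuming the install above succeeds (the transport name to grep for may vary by UCX version):

```shell
# List the transports UCX knows about and look for xpmem among them
ucx_info -d | grep -i xpmem

# Then retry the OpenSHMEM hello world
oshcc -o shmem_hello_world-4.0.0 openmpi-4.0.0/examples/hello_oshmem_c.c
oshrun -np 2 ./shmem_hello_world-4.0.0
```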

Thanks.

Best,
Bert

>
> Hope this helps,
>
> Howard
>
>
> On Tue, Nov 13, 2018 at 15:00, Bert Wesarg via users 
> <users@lists.open-mpi.org> wrote:
>>
>> Hi,
>>
>> On Mon, Nov 12, 2018 at 10:49 PM Pritchard Jr., Howard via announce
>> <annou...@lists.open-mpi.org> wrote:
>> >
>> > The Open MPI Team, representing a consortium of research, academic, and
>> > industry partners, is pleased to announce the release of Open MPI version
>> > 4.0.0.
>> >
>> > v4.0.0 is the start of a new release series for Open MPI.  Starting with
>> > this release, the OpenIB BTL supports only iWarp and RoCE by default.
>> > Starting with this release,  UCX is the preferred transport protocol
>> > for Infiniband interconnects. The embedded PMIx runtime has been updated
>> > to 3.0.2.  The embedded Romio has been updated to 3.2.1.  This
>> > release is ABI compatible with the 3.x release streams. There have been 
>> > numerous
>> > other bug fixes and performance improvements.
>> >
>> > Note that starting with Open MPI v4.0.0, prototypes for several
>> > MPI-1 symbols that were deleted in the MPI-3.0 specification
>> > (which was published in 2012) are no longer available by default in
>> > mpi.h. See the README for further details.
>> >
>> > Version 4.0.0 can be downloaded from the main Open MPI web site:
>> >
>> >   https://www.open-mpi.org/software/ompi/v4.0/
>> >
>> >
>> > 4.0.0 -- September, 2018
>> > ------------------------
>> >
>> > - OSHMEM updated to the OpenSHMEM 1.4 API.
>> > - Do not build OpenSHMEM layer when there are no SPMLs available.
>> >   Currently, this means the OpenSHMEM layer will only build if
>> >   a MXM or UCX library is found.
>>
>> so what is the most convenient way to get SHMEM working on a single
>> shared-memory node (i.e., a notebook)? I just realized that I haven't
>> had a working SHMEM since Open MPI 3.0. Building with UCX does not
>> help either: I tried with UCX 1.4, but Open MPI SHMEM
>> still does not work:
>>
>> $ oshcc -o shmem_hello_world-4.0.0 openmpi-4.0.0/examples/hello_oshmem_c.c
>> $ oshrun -np 2 ./shmem_hello_world-4.0.0
>> [1542109710.217344] [tudtug:27715:0]         select.c:406  UCX  ERROR
>> no remote registered memory access transport to tudtug:27716:
>> self/self - Destination is unreachable, tcp/enp0s31f6 - no put short,
>> tcp/wlp61s0 - no put short, mm/sysv - Destination is unreachable,
>> mm/posix - Destination is unreachable, cma/cma - no put short
>> [1542109710.217344] [tudtug:27716:0]         select.c:406  UCX  ERROR
>> no remote registered memory access transport to tudtug:27715:
>> self/self - Destination is unreachable, tcp/enp0s31f6 - no put short,
>> tcp/wlp61s0 - no put short, mm/sysv - Destination is unreachable,
>> mm/posix - Destination is unreachable, cma/cma - no put short
>> [tudtug:27715] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:266
>> Error: ucp_ep_create(proc=1/2) failed: Destination is unreachable
>> [tudtug:27715] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:305
>> Error: add procs FAILED rc=-2
>> [tudtug:27716] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:266
>> Error: ucp_ep_create(proc=1/2) failed: Destination is unreachable
>> [tudtug:27716] ../../../../../oshmem/mca/spml/ucx/spml_ucx.c:305
>> Error: add procs FAILED rc=-2
>> --------------------------------------------------------------------------
>> It looks like SHMEM_INIT failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during SHMEM_INIT; some of which are due to configuration or environment
>> problems.  This failure appears to be an internal failure; here's some
>> additional information (which may only be relevant to an Open SHMEM
>> developer):
>>
>>   SPML add procs failed
>>   --> Returned "Out of resource" (-2) instead of "Success" (0)
>> --------------------------------------------------------------------------
>> [tudtug:27715] Error: pshmem_init.c:80 - _shmem_init() SHMEM failed to
>> initialize - aborting
>> [tudtug:27716] Error: pshmem_init.c:80 - _shmem_init() SHMEM failed to
>> initialize - aborting
>> --------------------------------------------------------------------------
>> SHMEM_ABORT was invoked on rank 0 (pid 27715, host=tudtug) with errorcode -1.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> A SHMEM process is aborting at a time when it cannot guarantee that all
>> of its peer processes in the job will be killed properly.  You should
>> double check that everything has shut down cleanly.
>>
>> Local host: tudtug
>> PID:        27715
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> Primary job  terminated normally, but 1 process returned
>> a non-zero exit code. Per user-direction, the job has been aborted.
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> oshrun detected that one or more processes exited with non-zero
>> status, thus causing
>> the job to be terminated. The first process to do so was:
>>
>>   Process name: [[2212,1],1]
>>   Exit code:    255
>> --------------------------------------------------------------------------
>> [tudtug:27710] 1 more process has sent help message
>> help-shmem-runtime.txt / shmem_init:startup:internal-failure
>> [tudtug:27710] Set MCA parameter "orte_base_help_aggregate" to 0 to
>> see all help / error messages
>> [tudtug:27710] 1 more process has sent help message help-shmem-api.txt
>> / shmem-abort
>> [tudtug:27710] 1 more process has sent help message
>> help-shmem-runtime.txt / oshmem shmem abort:cannot guarantee all
>> killed
>>
>> MPI works as expected:
>>
>> $ mpicc -o mpi_hello_world-4.0.0 openmpi-4.0.0/examples/hello_c.c
>> $ mpirun -np 2 ./mpi_hello_world-4.0.0
>> Hello, world, I am 0 of 2, (Open MPI v4.0.0, package: Open MPI
>> wesarg@tudtug Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12,
>> 2018, 108)
>> Hello, world, I am 1 of 2, (Open MPI v4.0.0, package: Open MPI
>> wesarg@tudtug Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12,
>> 2018, 108)
>>
>> I'm attaching the output from 'ompi_info -a' and also from 'ucx_info
>> -b -d -c -s'.
>>
>> Thanks for the help.
>>
>> Best,
>> Bert
>>
>> _______________________________________________
>> users mailing list
>> users@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/users
