Re: [OMPI users] Debug OMPI errors

2019-08-04 Thread Passant A. Hafez via users
Hello Jeff,

In short, yes.

To explain further: I see many different problems that all end with the MPI 
job being terminated and all share the same error message (which only says 
that the process aborted), even though the underlying causes differ. 
Sometimes the cause is in the code; other times it is the hardware, the 
network, or the InfiniBand configuration.

When I get such an error, I would like details that point me to the area I 
should investigate, but without very detailed logs such as strace output, so 
that the actual output of the MPI job does not become harder to read.

I assume this could be either something enabled when compiling OMPI itself 
or something passed at runtime (which would be better).
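
As a rough sketch of the runtime route, assuming an Open MPI 4.x mpirun: MCA 
parameters can be passed on the command line to get more diagnostics without 
rebuilding. The parameters used below (mpi_abort_print_stack, 
pml_base_verbose, btl_base_verbose) should be checked against the output of 
"ompi_info --all" on the installation in question. The helper name 
ompi_debug_cmd, the verbosity level 10, and the ./my_app binary are 
placeholders for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: print an mpirun command line with extra diagnostics.
# $1 is the verbosity level; the remaining arguments are the application
# and its arguments. Level and application name are placeholders.
ompi_debug_cmd() {
    level="$1"; shift
    # Ask Open MPI to print a stack trace when MPI_ABORT is invoked.
    printf 'mpirun --mca mpi_abort_print_stack 1'
    # Raise verbosity of the point-to-point (pml) and transport (btl) layers.
    printf ' --mca pml_base_verbose %s' "$level"
    printf ' --mca btl_base_verbose %s' "$level"
    printf ' %s' "$@"
    printf '\n'
}

# Print the command for inspection before running it.
ompi_debug_cmd 10 ./my_app
```

Printing the command first keeps the sketch harmless on a machine without 
Open MPI; the printed line can then be pasted into a job script. Whether the 
stack trace is actually available depends on how Open MPI was built.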


All the best,
--
Passant 


From: Jeff Squyres (jsquyres) 
Sent: Sunday, July 28, 2019 5:52 PM
To: Open MPI User's List
Cc: Passant A. Hafez
Subject: Re: [OMPI users] Debug OMPI errors

I'm not sure exactly what you are asking -- can you be more specific?

Are you asking if Open MPI can emit more detail when an error occurs and the 
job aborts?


> On Jul 28, 2019, at 4:12 AM, Passant A. Hafez via users wrote:
>
> Hello all,
>
> I was wondering if I can enable some reasonable level of debugging for OMPI 
> errors, especially in the cases that just report that a process is killed 
> (for example MPI_ABORT was invoked) and that's it.
>
>
>
> All the best,
>
> --
>
> Passant
> ___
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users


--
Jeff Squyres
jsquy...@cisco.com



[OMPI users] Debug OMPI errors

2019-07-28 Thread Passant A. Hafez via users
Hello all,

I was wondering if I can enable some reasonable level of debugging for OMPI 
errors, especially in the cases that just report that a process is killed (for 
example MPI_ABORT was invoked) and that's it.



All the best,

--

Passant

Re: [OMPI users] undefined reference error related to ucx

2019-06-25 Thread Passant A. Hafez via users
Thanks Gilles!

The thing is, I'm getting this error
ud_iface.c:271  UCX Assertion `qp_init_attr.cap.max_inline_data >= 
UCT_UD_MIN_INLINE' failed
along with core files.

I looked it up, and it was suggested at 
https://github.com/openucx/ucx/issues/3336 that UCX 1.6 might solve this 
issue, so I tried the pre-release version just to check whether it does.




All the best,
--
Passant 


From: users on behalf of Gilles Gouaillardet via users
Sent: Tuesday, June 25, 2019 11:27 AM
To: Open MPI Users
Cc: Gilles Gouaillardet
Subject: Re: [OMPI users] undefined reference error related to ucx

Passant,

UCX 1.6.0 has not been officially released yet, and it seems Open MPI
4.0.1 does not support it yet; some porting is needed.
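
One way to check for this kind of mismatch is to see whether the symbol
from the link error is exported by the installed UCX at all. The sketch
below assumes that the UCT library was installed as lib/libuct.so under
the prefix from the configure line and that binutils nm is available.
The helper name exports_symbol is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if shared library $1 defines dynamic symbol $2.
exports_symbol() {
    nm -D --defined-only "$1" 2>/dev/null | awk -v sym="$2" '
        { name = $NF; sub(/@.*/, "", name) }   # drop any @VERSION suffix
        name == sym { found = 1 }
        END { exit !found }'
}

lib=/opt/ucx-1.6.0/lib/libuct.so     # assumed install location
sym=uct_ep_create_connected          # symbol named in the link error
if exports_symbol "$lib" "$sym"; then
    echo "$sym is exported by $lib"
else
    echo "$sym not found in $lib (library missing or symbol not exported)"
fi
```

If the symbol is not found, the link error is what one would expect when
Open MPI calls a function that this UCX build does not provide.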

Cheers,

Gilles

On Tue, Jun 25, 2019 at 5:13 PM Passant A. Hafez via users wrote:
>
> Hello,
>
>
> I'm trying to build ompi 4.0.1 with external ucx 1.6.0 but I'm getting
>
>
> ../../../opal/.libs/libopen-pal.so: undefined reference to 
> `uct_ep_create_connected'
> collect2: error: ld returned 1 exit status
>
> configure line for ompi
> ./configure --prefix=/opt/ompi401_ucx16 --with-slurm --with-hwloc=internal 
> --with-pmix=internal --enable-shared --enable-static --with-x 
> --with-ucx=/opt/ucx-1.6.0
>
> configure line for ucx
> ./configure --prefix=/opt/ucx-1.6.0
>
>
> What could be the reason?
>
>
>
>
>
>
> All the best,
> --
> Passant


[OMPI users] undefined reference error related to ucx

2019-06-25 Thread Passant A. Hafez via users
Hello,


I'm trying to build ompi 4.0.1 with external ucx 1.6.0 but I'm getting

../../../opal/.libs/libopen-pal.so: undefined reference to 
`uct_ep_create_connected'
collect2: error: ld returned 1 exit status

configure line for ompi
./configure --prefix=/opt/ompi401_ucx16 --with-slurm --with-hwloc=internal 
--with-pmix=internal --enable-shared --enable-static --with-x 
--with-ucx=/opt/ucx-1.6.0

configure line for ucx
./configure --prefix=/opt/ucx-1.6.0


What could be the reason?





All the best,
--
Passant