Hello

The hwloc/X11 stuff is caused by OpenMPI using a hwloc that was built
with the GL backend enabled (in your case, it's because package
libhwloc-plugins is installed). That backend is used for querying the
locality of X11 displays running on NVIDIA GPUs (using libxnvctrl). Does
running "lstopo" fail/hang too? (it will basically run hwloc without
OpenMPI).

One workaround should be to set HWLOC_COMPONENTS=-gl in your environment
so that this backend is ignored. Recent hwloc releases have a way to
exclude some plugins at runtime through the C interface. We should
probably use it to blacklist all plugins that are already blacklisted
at compile time when OMPI builds its own hwloc.
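
To make the workaround concrete, here is a minimal sketch of how it could be
checked from a shell. It assumes a POSIX shell and the coreutils "timeout"
command. The run_check helper and the 20-second limit are illustrative
choices only, not part of hwloc or Open MPI.

```shell
#!/bin/sh
# Ask hwloc to skip its "gl" component, so that it does not try to
# open an X11 display while loading the topology.
export HWLOC_COMPONENTS=-gl

# run_check is a hypothetical helper: it runs a command under a time
# limit (20 s is an arbitrary choice) and reports how it ended.
run_check() {
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "$1: not installed, skipped"
        return 0
    fi
    if timeout 20 "$@" >/dev/null 2>&1; then
        echo "$1: completed"
    else
        echo "$1: failed or timed out"
    fi
}

# Check hwloc on its own first, then a trivial non-MPI job through mpirun.
run_check lstopo --of console
run_check mpirun -np 1 true
```

If both commands complete with the variable set but hang without it, the GL
backend is the likely culprit. The export only affects the current shell
session, so it would have to go into a shell startup file to persist.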

Brice



On 14/11/2020 at 12:33, Jorge Silva via users wrote:
> Hello,
>
> In spite of the delay, I was not able to solve my problem. Thanks to
> Joseph and Prentice for their interesting suggestions.
>
> I uninstalled AppArmor (SELinux is not installed) as suggested by
> Prentice, but nothing changed: mpirun still hangs.
>
> The result of gdb stack trace is the following:
>
>
> $ sudo gdb -batch -ex "thread apply all bt" -p $(ps -C mpirun -o pid= | head -n 1)
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library
> "/lib/x86_64-linux-gnu/libthread_db.so.1".
> 0x00007f9289544307 in __libc_connect (fd=9, addr=..., len=16) at
> ../sysdeps/unix/sysv/linux/connect.c:26
> 26      ../sysdeps/unix/sysv/linux/connect.c: No such file or
> directory.
>
> Thread 1 (Thread 0x7f92891f4e80 (LWP 4948)):
> #0  0x00007f9289544307 in __libc_connect (fd=9, addr=..., len=16) at
> ../sysdeps/unix/sysv/linux/connect.c:26
> #1  0x00007f9288fff59d in ?? () from /lib/x86_64-linux-gnu/libxcb.so.1
> #2  0x00007f9288fffc49 in xcb_connect_to_display_with_auth_info ()
> from /lib/x86_64-linux-gnu/libxcb.so.1
> #3  0x00007f928906cb7a in _XConnectXCB () from
> /lib/x86_64-linux-gnu/libX11.so.6
> #4  0x00007f928905d319 in XOpenDisplay () from
> /lib/x86_64-linux-gnu/libX11.so.6
> #5  0x00007f92897de4fb in ?? () from
> /usr/lib/x86_64-linux-gnu/hwloc/hwloc_gl.so
> #6  0x00007f92893b901e in ?? () from /lib/x86_64-linux-gnu/libhwloc.so.15
> #7  0x00007f92893c13a0 in hwloc_topology_load () from
> /lib/x86_64-linux-gnu/libhwloc.so.15
> #8  0x00007f92896df564 in opal_hwloc_base_get_topology () from
> /lib/x86_64-linux-gnu/libopen-pal.so.40
> #9  0x00007f92891da6be in ?? () from
> /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_hnp.so
> #10 0x00007f92897a22fc in orte_init () from
> /lib/x86_64-linux-gnu/libopen-rte.so.40
> #11 0x00007f92897a6c86 in orte_submit_init () from
> /lib/x86_64-linux-gnu/libopen-rte.so.40
> #12 0x000055819fd0b3a3 in ?? ()
> #13 0x00007f92894480b3 in __libc_start_main (main=0x55819fd0b1c0,
> argc=1, argv=0x7fff5334fe48, init=<optimized out>, fini=<optimized
> out>, rtld_fini=<optimized out>, stack_end=0x7fff5334fe38) at
> ../csu/libc-start.c:308
> #14 0x000055819fd0b1fe in ?? ()
> [Inferior 1 (process 4948) detached]
>
>
> So it seems to be a problem with the connection via libxcb (a socket?),
> but this is beyond my system administration skills. Is some
> authorization needed?
>
> Since libX11 is at the origin of the call, I tried running mpirun in a
> bare terminal (Ctrl-Alt-F2, and via ssh), but the result is the same. I
> also tried recompiling and reinstalling the whole package, with the
> same result.
>
> Thank you for your help.
>
> Jorge
>
> On 22/10/2020 at 12:16, Joseph Schuchart via users wrote:
>> Hi Jorge,
>>
>> Can you try to get a stack trace of mpirun using the following
>> command in a separate terminal?
>>
>> sudo gdb -batch -ex "thread apply all bt" -p $(ps -C mpirun -o pid= | head -n 1)
>>
>> Maybe that will give some insight where mpirun is hanging.
>>
>> Cheers,
>> Joseph
>>
>> On 10/21/20 9:58 PM, Jorge SILVA via users wrote:
>>> Hello Jeff,
>>>
>>> The program is not executed; mpirun seems to be waiting for something
>>> to connect to (and why does it take Ctrl-C twice?)
>>>
>>> jorge@gcp26:~/MPIRUN$ mpirun -np 1 touch /tmp/foo
>>> ^C^C
>>>
>>> jorge@gcp26:~/MPIRUN$ ls -l /tmp/foo
>>> ls: cannot access '/tmp/foo': No such file or directory
>>>
>>> No file is created.
>>>
>>> In fact, my question was whether there are differences in mpirun
>>> usage between these versions. The command
>>>
>>> mpirun -help
>>>
>>> gives a different output, as expected, but I tried a lot of options
>>> without any success.
>>>
>>>
>>> On 21/10/2020 at 21:16, Jeff Squyres (jsquyres) wrote:
>>>> There are huge differences between Open MPI v2.1.1 and v4.0.3 (i.e.,
>>>> years of development effort); it would be very hard to categorize
>>>> them all; sorry!
>>>>
>>>> What happens if you run
>>>>
>>>>     mpirun -np 1 touch /tmp/foo
>>>>
>>>> (Yes, you can run non-MPI apps through mpirun)
>>>>
>>>> Is /tmp/foo created?  (i.e., did the job run, and mpirun is somehow
>>>> not terminating)
>>>>
>>>>
>>>>
>>>>> On Oct 21, 2020, at 12:22 PM, Jorge SILVA via users
>>>>> <users@lists.open-mpi.org> wrote:
>>>>>
>>>>> Hello Gus,
>>>>>
>>>>> Thank you for your answer. Unfortunately, my problem is much
>>>>> more basic. I am not trying to run the program across both
>>>>> computers, just to run something on one computer. I installed the
>>>>> new OS and openmpi on two different computers, in the standard
>>>>> way, with the same result on each.
>>>>>
>>>>> For example:
>>>>>
>>>>> In Kubuntu 20.04.1 LTS with openmpi 4.0.3-0ubuntu
>>>>>
>>>>> jorge@gcp26:~/MPIRUN$ cat hello.f90
>>>>>  print*,"Hello World!"
>>>>> end
>>>>> jorge@gcp26:~/MPIRUN$ mpif90 hello.f90 -o hello
>>>>> jorge@gcp26:~/MPIRUN$ ./hello
>>>>>  Hello World!
>>>>> jorge@gcp26:~/MPIRUN$ mpirun -np 1 hello <---here  the program
>>>>> hangs with no output
>>>>> ^C^Cjorge@gcp26:~/MPIRUN$
>>>>>
>>>>> The mpirun process sleeps with no output, and only pressing Ctrl-C
>>>>> twice ends it:
>>>>>
>>>>> jorge       5540  0.1  0.0 44768  8472 pts/8    S+   17:54 0:00
>>>>> mpirun -np 1 hello
>>>>>
>>>>> In kubuntu 18.04.5 LTS with openmpi 2.1.1, of course, the same
>>>>> program gives
>>>>>
>>>>> jorge@gcp30:~/MPIRUN$ cat hello.f90
>>>>>  print*, "Hello World!"
>>>>>  END
>>>>> jorge@gcp30:~/MPIRUN$ mpif90 hello.f90 -o hello
>>>>> jorge@gcp30:~/MPIRUN$ ./hello
>>>>>  Hello World!
>>>>> jorge@gcp30:~/MPIRUN$ mpirun -np 1 hello
>>>>>  Hello World
>>>>> jorge@gcp30:~/MPIRUN$
>>>>>
>>>>>
>>>>> Even just typing mpirun hangs without the usual error message.
>>>>>
>>>>> Are there any changes between the two versions of openmpi that I
>>>>> am missing? Is some package that mpirun needs not installed?
>>>>>
>>>>> Thank you again for your help
>>>>>
>>>>> Jorge
>>>>>
>>>>>
>>>>> On 21/10/2020 at 00:20, Gus Correa wrote:
>>>>>> Hi Jorge
>>>>>>
>>>>>> You may have an active firewall protecting either computer or both,
>>>>>> and preventing mpirun from starting the connection.
>>>>>> Your /etc/hosts file may also not have the computer IP addresses.
>>>>>> You may also want to try the --hostfile option.
>>>>>> Likewise, the --verbose option may also help diagnose the problem.
>>>>>>
>>>>>> It would help if you send the mpirun command line, the hostfile
>>>>>> (if any),
>>>>>> error message if any, etc.
>>>>>>
>>>>>>
>>>>>> These FAQs may help diagnose and solve the problem:
>>>>>>
>>>>>> https://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
>>>>>>
>>>>>> https://www.open-mpi.org/faq/?category=running#mpirun-hostfile
>>>>>> https://www.open-mpi.org/faq/?category=running
>>>>>>
>>>>>> I hope this helps,
>>>>>> Gus Correa
>>>>>>
>>>>>> On Tue, Oct 20, 2020 at 4:47 PM Jorge SILVA via users
>>>>>> <users@lists.open-mpi.org> wrote:
>>>>>>
>>>>>>     Hello,
>>>>>>
>>>>>>     I installed Kubuntu 20.04.1 with openmpi 4.0.3-0ubuntu on two
>>>>>>     different computers in the standard way. Compiling with mpif90
>>>>>>     works, but mpirun hangs with no output on both systems. Even
>>>>>>     the mpirun command without parameters hangs, and only typing
>>>>>>     Ctrl-C twice ends the sleeping program. Only the command
>>>>>>
>>>>>>          mpirun --help
>>>>>>
>>>>>>     gives the usual output.
>>>>>>
>>>>>>     It seems to be something related to the terminal output, but the
>>>>>>     command worked well on Kubuntu 18.04. Is there a way to debug or
>>>>>>     fix this problem (without recompiling from sources, etc.)? Is it
>>>>>>     a known problem?
>>>>>>
>>>>>>     Thanks,
>>>>>>
>>>>>>       Jorge
>>>>>>
>>>>
>>>>
>>>> -- 
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>>
