Hello,

The hwloc/X11 hang is caused by Open MPI using a hwloc that was built with the GL backend enabled (in your case, because the libhwloc-plugins package is installed). That backend is used for querying the locality of X11 displays running on NVIDIA GPUs (using libxnvctrl). Does running "lstopo" fail/hang too? (It basically runs hwloc without Open MPI.)
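The check and the workaround described in this thread can be sketched as follows (a hedged sketch: `lstopo` ships with hwloc, and `HWLOC_COMPONENTS` is hwloc's component-filtering environment variable; `./hello` stands in for your own program):

```shell
# If lstopo also hangs, the problem is in hwloc itself, not in Open MPI:
lstopo

# Workaround: tell hwloc to ignore the gl component, then run mpirun as usual.
export HWLOC_COMPONENTS=-gl
mpirun -np 1 ./hello
```

Setting the variable for a single command also works, e.g. `HWLOC_COMPONENTS=-gl mpirun -np 1 ./hello`.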
One workaround should be to set HWLOC_COMPONENTS=-gl in your environment so that this backend is ignored. Recent hwloc releases have a way to avoid some plugins at runtime through the C interface; we should likely blacklist all plugins that are already blacklisted at compile time when OMPI builds its own hwloc.

Brice

On 14/11/2020 at 12:33, Jorge Silva via users wrote:
> Hello,
>
> In spite of the delay, I was not able to solve my problem. Thanks to
> Joseph and Prentice for their interesting suggestions.
>
> I uninstalled AppArmor (SELinux is not installed) as suggested by
> Prentice, but there were no changes: mpirun still hangs.
>
> The result of the gdb stack trace is the following:
>
> $ sudo gdb -batch -ex "thread apply all bt" -p $(ps -C mpirun -o pid= | head -n 1)
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> 0x00007f9289544307 in __libc_connect (fd=9, addr=..., len=16) at ../sysdeps/unix/sysv/linux/connect.c:26
> 26      ../sysdeps/unix/sysv/linux/connect.c: Aucun fichier ou dossier de ce type.
>
> Thread 1 (Thread 0x7f92891f4e80 (LWP 4948)):
> #0  0x00007f9289544307 in __libc_connect (fd=9, addr=..., len=16) at ../sysdeps/unix/sysv/linux/connect.c:26
> #1  0x00007f9288fff59d in ?? () from /lib/x86_64-linux-gnu/libxcb.so.1
> #2  0x00007f9288fffc49 in xcb_connect_to_display_with_auth_info () from /lib/x86_64-linux-gnu/libxcb.so.1
> #3  0x00007f928906cb7a in _XConnectXCB () from /lib/x86_64-linux-gnu/libX11.so.6
> #4  0x00007f928905d319 in XOpenDisplay () from /lib/x86_64-linux-gnu/libX11.so.6
> #5  0x00007f92897de4fb in ?? () from /usr/lib/x86_64-linux-gnu/hwloc/hwloc_gl.so
> #6  0x00007f92893b901e in ?? () from /lib/x86_64-linux-gnu/libhwloc.so.15
> #7  0x00007f92893c13a0 in hwloc_topology_load () from /lib/x86_64-linux-gnu/libhwloc.so.15
> #8  0x00007f92896df564 in opal_hwloc_base_get_topology () from /lib/x86_64-linux-gnu/libopen-pal.so.40
> #9  0x00007f92891da6be in ?? () from /usr/lib/x86_64-linux-gnu/openmpi/lib/openmpi3/mca_ess_hnp.so
> #10 0x00007f92897a22fc in orte_init () from /lib/x86_64-linux-gnu/libopen-rte.so.40
> #11 0x00007f92897a6c86 in orte_submit_init () from /lib/x86_64-linux-gnu/libopen-rte.so.40
> #12 0x000055819fd0b3a3 in ?? ()
> #13 0x00007f92894480b3 in __libc_start_main (main=0x55819fd0b1c0, argc=1, argv=0x7fff5334fe48, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff5334fe38) at ../csu/libc-start.c:308
> #14 0x000055819fd0b1fe in ?? ()
> [Inferior 1 (process 4948) detached]
>
> So it seems to be a problem in the connection via libxcb (a socket?),
> but this is beyond my system-administration skills. Is any
> authorization needed?
>
> Since libX11 is at the origin of the call, I tried to execute it from a bare
> terminal (Ctrl-Alt-F2 and via ssh), but the result is the same. I also
> tried to recompile/reinstall the whole package, with the same result.
>
> Thank you for your help.
>
> Jorge
>
> On 22/10/2020 at 12:16, Joseph Schuchart via users wrote:
>> Hi Jorge,
>>
>> Can you try to get a stack trace of mpirun using the following
>> command in a separate terminal?
>>
>> sudo gdb -batch -ex "thread apply all bt" -p $(ps -C mpirun -o pid= | head -n 1)
>>
>> Maybe that will give some insight where mpirun is hanging.
>>
>> Cheers,
>> Joseph
>>
>> On 10/21/20 9:58 PM, Jorge SILVA via users wrote:
>>> Hello Jeff,
>>>
>>> The program is not executed; it seems to wait for something to
>>> connect to (why Ctrl-C twice?):
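Given the quoted trace (XOpenDisplay() blocking in connect() inside hwloc_gl.so), two further checks one might try. This is a sketch under assumptions: `HWLOC_COMPONENTS_VERBOSE` is hwloc's documented component-debugging variable in recent releases, and `libhwloc-plugins` is the Debian/Ubuntu package mentioned at the top of the thread.

```shell
# Show which components hwloc registers and loads, to confirm gl is among them:
HWLOC_COMPONENTS_VERBOSE=1 lstopo --version

# Or remove the package that provides the dynamically loaded hwloc plugins,
# including hwloc_gl.so (Debian/Ubuntu):
sudo apt remove libhwloc-plugins
```

Removing the package is the heavier hammer; the HWLOC_COMPONENTS=-gl environment workaround leaves the other plugins available.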
>>>
>>> jorge@gcp26:~/MPIRUN$ mpirun -np 1 touch /tmp/foo
>>> ^C^C
>>>
>>> jorge@gcp26:~/MPIRUN$ ls -l /tmp/foo
>>> ls: impossible d'accéder à '/tmp/foo': Aucun fichier ou dossier de ce type
>>>
>>> No file is created.
>>>
>>> In fact, my question was whether there are differences in mpirun usage
>>> between these versions. The
>>>
>>> mpirun -help
>>>
>>> gives a different output, as expected, but I tried a lot of options
>>> without any success.
>>>
>>> On 21/10/2020 at 21:16, Jeff Squyres (jsquyres) wrote:
>>>> There are huge differences between Open MPI v2.1.1 and v4.0.3 (i.e.,
>>>> years of development effort); it would be very hard to categorize
>>>> them all; sorry!
>>>>
>>>> What happens if you
>>>>
>>>> mpirun -np 1 touch /tmp/foo
>>>>
>>>> (Yes, you can run non-MPI apps through mpirun.)
>>>>
>>>> Is /tmp/foo created? (I.e., did the job run, and mpirun is somehow
>>>> not terminating?)
>>>>
>>>>> On Oct 21, 2020, at 12:22 PM, Jorge SILVA via users
>>>>> <users@lists.open-mpi.org> wrote:
>>>>>
>>>>> Hello Gus,
>>>>>
>>>>> Thank you for your answer. Unfortunately, my problem is much
>>>>> more basic. I didn't try to run the program on both computers,
>>>>> but just to run something on one computer. I just installed the
>>>>> new OS and openmpi on two different computers, in the standard way,
>>>>> with the same result.
>>>>>
>>>>> For example:
>>>>>
>>>>> In kubuntu 20.4.1 LTS with openmpi 4.0.3-0ubuntu:
>>>>>
>>>>> jorge@gcp26:~/MPIRUN$ cat hello.f90
>>>>> print*,"Hello World!"
>>>>> end
>>>>> jorge@gcp26:~/MPIRUN$ mpif90 hello.f90 -o hello
>>>>> jorge@gcp26:~/MPIRUN$ ./hello
>>>>> Hello World!
>>>>> jorge@gcp26:~/MPIRUN$ mpirun -np 1 hello    <--- here the program hangs with no output
>>>>> ^C^Cjorge@gcp26:~/MPIRUN$
>>>>>
>>>>> The mpirun task sleeps with no output, and only pressing Ctrl-C twice
>>>>> ends the execution:
>>>>>
>>>>> jorge  5540  0.1  0.0  44768  8472 pts/8  S+  17:54  0:00 mpirun -np 1 hello
>>>>>
>>>>> In kubuntu 18.04.5 LTS with openmpi 2.1.1, of course, the same
>>>>> program gives:
>>>>>
>>>>> jorge@gcp30:~/MPIRUN$ cat hello.f90
>>>>> print*, "Hello World!"
>>>>> END
>>>>> jorge@gcp30:~/MPIRUN$ mpif90 hello.f90 -o hello
>>>>> jorge@gcp30:~/MPIRUN$ ./hello
>>>>> Hello World!
>>>>> jorge@gcp30:~/MPIRUN$ mpirun -np 1 hello
>>>>> Hello World
>>>>> jorge@gcp30:~/MPIRUN$
>>>>>
>>>>> Even just typing mpirun hangs without the usual error message.
>>>>>
>>>>> Are there any changes between the two versions of openmpi that I
>>>>> missed? Some package missing for mpirun?
>>>>>
>>>>> Thank you again for your help.
>>>>>
>>>>> Jorge
>>>>>
>>>>> On 21/10/2020 at 00:20, Gus Correa wrote:
>>>>>> Hi Jorge,
>>>>>>
>>>>>> You may have an active firewall protecting either computer or both,
>>>>>> preventing mpirun from starting the connection.
>>>>>> Your /etc/hosts file may also not have the computer IP addresses.
>>>>>> You may also want to try the --hostfile option.
>>>>>> Likewise, the --verbose option may help diagnose the problem.
>>>>>>
>>>>>> It would help if you send the mpirun command line, the hostfile
>>>>>> (if any), the error message if any, etc.
>>>>>>
>>>>>> These FAQs may help diagnose and solve the problem:
>>>>>>
>>>>>> https://www.open-mpi.org/faq/?category=running#diagnose-multi-host-problems
>>>>>> https://www.open-mpi.org/faq/?category=running#mpirun-hostfile
>>>>>> https://www.open-mpi.org/faq/?category=running
>>>>>>
>>>>>> I hope this helps,
>>>>>> Gus Correa
>>>>>>
>>>>>> On Tue, Oct 20, 2020 at 4:47 PM Jorge SILVA via users
>>>>>> <users@lists.open-mpi.org> wrote:
>>>>>>
>>>>>>     Hello,
>>>>>>
>>>>>>     I installed kubuntu 20.4.1 with openmpi 4.0.3-0ubuntu on two
>>>>>>     different computers in the standard way. Compiling with mpif90
>>>>>>     works, but mpirun hangs with no output on both systems. Even the
>>>>>>     mpirun command without parameters hangs, and only pressing Ctrl-C
>>>>>>     twice can end the sleeping program. Only the command
>>>>>>
>>>>>>     mpirun --help
>>>>>>
>>>>>>     gives the usual output.
>>>>>>
>>>>>>     It seems to be something related to terminal output, but the
>>>>>>     command worked well on Kubuntu 18.04. Is there a way to debug or
>>>>>>     fix this problem (without re-compiling from sources, etc.)? Is it
>>>>>>     a known problem?
>>>>>>
>>>>>>     Thanks,
>>>>>>
>>>>>>     Jorge
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com