Hello all, we have had issues with a bug in OpenMPI's vader BTL on "slow" systems.
See: https://bitbucket.org/einsteintoolkit/tickets/issues/2287/add-openmpi-env-vars-to-notebook-to-avoid for the ET ticket explaining this (the slow system being the tutorial VM) and the OpenMPI ticket here: https://github.com/open-mpi/ompi/issues/6568

Yours,
Roland

> On Fri, Apr 9, 2021 at 12:43 PM Hee Il Kim <[email protected]> wrote:
> >
> > Thanks Erik.
> >
> >
> > On Fri, Apr 9, 2021, 23:45 Erik Schnetter <[email protected]> wrote:
> >>
> >> Hee Il
> >>
> >> Yes, that has happened to me several times. Usually, the problem is
> >> either MPI or I/O.
> >
> >
> > Ever experienced under UCX?
>
> No, but I think UCX and MPI are about equivalent in this context here.
>
> -erik
>
> >> It might be that there is a file system problem, and one process is
> >> trying to write to a file, but is blocked indefinitely. The other
> >> processes then also stop making progress since they wait on
> >> communication.
> >>
> >> It could also be that there is an MPI problem, either caused by a
> >> problem in the code, or by an error in the system, that makes MPI
> >> hang.
> >
> >
> > I think I haven't seen the issue when I use 'sm' btl. At least vader was
> > used for all the problematic runs.
> >
> >>
> >> In both cases, restarting from a checkpoint might solve the problem.
> >> If the problem is reproducible, then it would make sense to dig deeper
> >> to find out what's wrong, and whether there is a work-around (e.g.
> >> changing the grid structure a bit to avoid triggering the bug).
> >>
> >> -erik
> >
> >
> > Yes. restarting could solve the issue.
> >
> > Hee Il
> >
> >
> >>
> >> On Fri, Apr 9, 2021 at 8:19 AM Hee Il Kim <[email protected]> wrote:
> >> >
> >> > Hi,
> >> >
> >> > Though it might not be an issue of ET. Have you ever seen ET runs
> >> > stopped making every output (even the stdout), even though the processes
> >> > are running?
> >> >
> >> > I have seen this issue on new and old NVMe storages with various
> >> > versions of OpenMPI. It happened in more than a day of runs.
> >> >
> >> > Oh, not all the processes are running. One process is in Dl state, so
> >> > the every output stopped. Do you have any hints on this issue? There's
> >> > no specific limits set for the files. The other write/read tasks on the
> >> > disks are ok.
> >> >
> >> > Thanks for your help in advance.
> >> >
> >> > Hee Il
> >> >
> >> > _______________________________________________
> >> > Users mailing list
> >> > [email protected]
> >> > https://urldefense.com/v3/__http://lists.einsteintoolkit.org/mailman/listinfo/users__;!!DZ3fjg!tgd_rDKIJitACUnLixHB2PND01Yf7MisM-hbW7PEzYIVhk3Rao1sdCz_tOPj3NlB$
> >> >
> >>
> >>
> >> --
> >> Erik Schnetter <[email protected]>
> >> https://urldefense.com/v3/__http://www.perimeterinstitute.ca/personal/eschnetter/__;!!DZ3fjg!tgd_rDKIJitACUnLixHB2PND01Yf7MisM-hbW7PEzYIVhk3Rao1sdCz_tIt3aW2E$
> >>
> >
>

--
My email is as private as my paper mail. I therefore support encrypting
and signing email messages. Get my PGP key from http://pgp.mit.edu .
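
P.S. For anyone who wants to test whether vader is involved in their hangs, here is a minimal sketch of one way to keep OpenMPI from selecting it. It relies on two OpenMPI conventions: MCA parameters can be set through environment variables named OMPI_MCA_<param>, and a leading "^" in a component list excludes the listed components. The helper names, executable and parameter file below are hypothetical, and this is not necessarily the set of variables chosen in the ET ticket above. Component names also differ between OpenMPI releases, so check ompi_info on your system first.

```python
import os
import shlex


def openmpi_env_without_vader(base_env=None):
    """Return a copy of the environment that asks OpenMPI to skip the vader BTL.

    Hypothetical helper for illustration; it does not modify os.environ.
    """
    env = dict(os.environ if base_env is None else base_env)
    # OMPI_MCA_btl sets the "btl" MCA parameter; "^vader" means
    # "use any available BTL except vader".
    env["OMPI_MCA_btl"] = "^vader"
    return env


def mpirun_command(executable, parfile, nprocs):
    """Build an mpirun command line (executable and parfile are placeholders)."""
    return ["mpirun", "-np", str(nprocs), executable, parfile]


if __name__ == "__main__":
    env = openmpi_env_without_vader()
    cmd = mpirun_command("./cactus_sim", "run.par", 2)
    # Only print what would be run; pass env=env to subprocess.run to launch it.
    print("OMPI_MCA_btl=" + env["OMPI_MCA_btl"], shlex.join(cmd))
```

If the hang disappears with vader excluded, that points at the MPI transport rather than the file system, which is the distinction Erik describes above.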
_______________________________________________
Users mailing list
[email protected]
http://lists.einsteintoolkit.org/mailman/listinfo/users
