Thanks Erik.
On Fri, Apr 9, 2021, 23:45 Erik Schnetter <[email protected]> wrote:

> Hee Il
>
> Yes, that has happened to me several times. Usually, the problem is
> either MPI or I/O.

Have you ever experienced this under UCX?

> It might be that there is a file system problem, and one process is
> trying to write to a file, but is blocked indefinitely. The other
> processes then also stop making progress since they wait on
> communication.
>
> It could also be that there is an MPI problem, either caused by a
> problem in the code, or by an error in the system, that makes MPI
> hang.

I don't think I have seen the issue when using the 'sm' BTL. At least, vader was used for all the problematic runs.

> In both cases, restarting from a checkpoint might solve the problem.
> If the problem is reproducible, then it would make sense to dig deeper
> to find out what's wrong, and whether there is a work-around (e.g.
> changing the grid structure a bit to avoid triggering the bug).
>
> -erik

Yes, restarting could solve the issue.

Hee Il

> On Fri, Apr 9, 2021 at 8:19 AM Hee Il Kim <[email protected]> wrote:
> >
> > Hi,
> >
> > Though it might not be an issue of ET. Have you ever seen ET runs
> > stopped making every output (even the stdout), even though the
> > processes are running?
> >
> > I have seen this issue on new and old NVMe storages with various
> > versions of OpenMPI. It happened in more than a day of runs.
> >
> > Oh, not all the processes are running. One process is in Dl state, so
> > the every output stopped. Do you have any hints on this issue? There's
> > no specific limits set for the files. The other write/read tasks on
> > the disks are ok.
> >
> > Thanks for your help in advance.
> >
> > Hee Il
> >
> > _______________________________________________
> > Users mailing list
> > [email protected]
> > http://lists.einsteintoolkit.org/mailman/listinfo/users
>
> --
> Erik Schnetter <[email protected]>
> http://www.perimeterinstitute.ca/personal/eschnetter/
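
For anyone hitting the same hang, here is a minimal sketch of how the task stuck in D state (uninterruptible sleep) might be located on a compute node. It assumes Linux with /proc mounted. The helper names parse_stat and blocked_tasks are made up for illustration and are not part of ET or OpenMPI. Note that ps prints the state as "Dl" because it appends "l" for multi-threaded processes, while /proc/<pid>/stat holds only the single letter "D". The wchan file may be empty or "0" on some kernels, so treat it as a hint only.

```python
import os


def parse_stat(text):
    """Return (pid, comm, state) from the contents of a /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the LAST ')' instead of splitting on spaces.
    lpar = text.index("(")
    rpar = text.rindex(")")
    pid = int(text[:lpar])
    comm = text[lpar + 1:rpar]
    state = text[rpar + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, tid, comm, wchan) for every thread currently in state D."""
    found = []
    if not os.path.isdir(proc_root):
        return found  # not a Linux-style /proc, nothing to report
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited or is not readable
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as f:
                    _, comm, state = parse_stat(f.read())
            except (OSError, ValueError, IndexError):
                continue  # thread vanished or stat line was malformed
            if state != "D":
                continue
            try:
                # wchan names the kernel function the thread is sleeping in
                with open(os.path.join(task_dir, tid, "wchan")) as f:
                    wchan = f.read().strip()
            except OSError:
                wchan = "?"
            found.append((int(pid), int(tid), comm, wchan))
    return found


if __name__ == "__main__":
    for pid, tid, comm, wchan in blocked_tasks():
        print(f"pid={pid} tid={tid} comm={comm} wchan={wchan}")
```

Running this a few times during a hang should show whether the same thread stays blocked. If wchan points into file system code, that would suggest the I/O explanation rather than MPI, though it does not prove it.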
