Hee Il

Yes, that has happened to me several times. Usually, the problem is either MPI or I/O.

It might be that there is a file system problem, and one process is trying to write to a file, but is blocked indefinitely. The other processes then also stop making progress since they wait on communication.

It could also be that there is an MPI problem, either caused by a problem in the code, or by an error in the system, that makes MPI hang.

In both cases, restarting from a checkpoint might solve the problem. If the problem is reproducible, then it would make sense to dig deeper to find out what's wrong, and whether there is a work-around (e.g. changing the grid structure a bit to avoid triggering the bug).

-erik

On Fri, Apr 9, 2021 at 8:19 AM Hee Il Kim <[email protected]> wrote:
>
> Hi,
>
> Though it might not be an issue of ET. Have you ever seen ET runs stopped
> making every output (even the stdout), even though the processes are running?
>
> I have seen this issue on new and old NVMe storages with various versions of
> OpenMPI. It happened in more than a day of runs.
>
> Oh, not all the processes are running. One process is in Dl state, so the
> every output stopped. Do you have any hints on this issue? There's no
> specific limits set for the files. The other write/read tasks on the disks
> are ok.
>
> Thanks for your help in advance.
>
> Hee Il
>
> _______________________________________________
> Users mailing list
> [email protected]
> http://lists.einsteintoolkit.org/mailman/listinfo/users

--
Erik Schnetter <[email protected]>
http://www.perimeterinstitute.ca/personal/eschnetter/
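
To tell the two cases apart (a process blocked in I/O versus an MPI hang), it may help to first confirm which process is in uninterruptible sleep and what it is waiting on. The following is only a minimal sketch, assuming a Linux compute node with a readable /proc file system; the helper names `parse_stat_line` and `find_blocked` are made up for illustration and are not part of the Einstein Toolkit or OpenMPI. The "D" reported by `ps` (shown as "Dl" for a multi-threaded process) appears as a plain "D" in /proc/<pid>/stat, and /proc/<pid>/wchan may be empty or "0" depending on the kernel configuration.

```python
import os


def parse_stat_line(line):
    """Split one /proc/<pid>/stat line into (pid, command, state)."""
    # The command name sits in parentheses and may itself contain spaces
    # or ')' characters, so split on the last ')' in the line.
    left, _, right = line.rpartition(")")
    pid_text, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_text), comm, state


def find_blocked(proc_root="/proc"):
    """Return (pid, command, wchan) for every process in state 'D'."""
    if not os.path.isdir(proc_root):
        return []
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or stat was unreadable
        if state != "D":
            continue
        try:
            # Kernel symbol the process sleeps in, if the kernel exposes it.
            with open(os.path.join(proc_root, entry, "wchan")) as f:
                wchan = f.read().strip() or "?"
        except OSError:
            wchan = "?"
        blocked.append((pid, comm, wchan))
    return blocked


if __name__ == "__main__":
    for pid, comm, wchan in find_blocked():
        print(f"{pid}\t{comm}\twaiting in: {wchan}")
```

If the stuck rank shows up here with a wait channel that points into the file system or block layer, that would suggest the I/O explanation; if no rank is in state D and all are spinning, an MPI hang seems more likely. Either way this is a diagnostic aid only, not a fix.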
