On Fri, Apr 9, 2021 at 12:43 PM Hee Il Kim <[email protected]> wrote:
>
> Thanks Erik.
>
>
> On Fri, Apr 9, 2021, 23:45 Erik Schnetter <[email protected]> wrote:
>>
>> Hee Il
>>
>> Yes, that has happened to me several times. Usually, the problem is
>> either MPI or I/O.
>
>
> Have you ever experienced this under UCX?

No, but I think UCX and MPI are about equivalent in this context.

-erik

>> It might be that there is a file system problem, and one process is
>> trying to write to a file, but is blocked indefinitely. The other
>> processes then also stop making progress since they wait on
>> communication.
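A quick way to narrow this down, assuming a Linux compute node you can log
into as the job owner, is to list the job's processes together with their
scheduler state and the kernel function they are sleeping in. The sketch
below is only illustrative; the executable name is a placeholder you would
adjust to your run:

#!/usr/bin/env python3
# Minimal sketch (assumes a Linux node, run as the owner of the job):
# print each matching process, its scheduler state, and the kernel wait
# channel it is sleeping in.
import os

TARGET = "cactus_sim"  # hypothetical executable name; adjust to your run

for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open(f"/proc/{pid}/comm") as f:
            comm = f.read().strip()
        if TARGET not in comm:
            continue
        with open(f"/proc/{pid}/stat") as f:
            state = f.read().rsplit(")", 1)[1].split()[0]  # field after "(comm)"
        with open(f"/proc/{pid}/wchan") as f:
            wchan = f.read().strip() or "-"
        print(f"pid {pid:>7}  state {state}  wchan {wchan}  ({comm})")
    except OSError:
        continue  # process exited or is not readable; skip it

If the stuck rank shows state "D" with a filesystem- or network-storage-related
wait channel, that supports the I/O explanation; if all ranks are simply
sleeping inside MPI calls, it points at communication instead.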
>>
>> It could also be that there is an MPI problem, either caused by a
>> problem in the code, or by an error in the system, that makes MPI
>> hang.
>
>
> I don't think I have seen the issue when I use the 'sm' BTL. At least vader
> was used for all the problematic runs.
>
>>
>> In both cases, restarting from a checkpoint might solve the problem.
>> If the problem is reproducible, then it would make sense to dig deeper
>> to find out what's wrong, and whether there is a work-around (e.g.
>> changing the grid structure a bit to avoid triggering the bug).
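If the hang happened while a checkpoint was being written, it can also be worth
verifying that the checkpoint files you intend to recover from are readable
before resubmitting. A minimal sketch, assuming CarpetIOHDF5-style file names
("checkpoint.chkpt.it_*.h5") and that h5py is available; adjust the glob to
your setup:

#!/usr/bin/env python3
# Minimal sketch: sanity-check checkpoint files in a recovery directory by
# opening each one and reading its metadata; unreadable files are reported.
import glob
import sys

import h5py

recover_dir = sys.argv[1] if len(sys.argv) > 1 else "."
files = sorted(glob.glob(f"{recover_dir}/checkpoint.chkpt.it_*.h5"))
if not files:
    sys.exit(f"no checkpoint files found under {recover_dir}")

bad = []
for path in files:
    try:
        with h5py.File(path, "r") as f:
            _ = len(f.keys())  # forces a read of the file's metadata
    except OSError:
        bad.append(path)  # truncated or otherwise unreadable

print(f"{len(files)} checkpoint files, {len(bad)} unreadable")
for path in bad:
    print("  BAD:", path)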
>>
>> -erik
>
>
> Yes, restarting could solve the issue.
>
> Hee Il
>
>
>>
>>
>>
>> On Fri, Apr 9, 2021 at 8:19 AM Hee Il Kim <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > Though it might not be an ET issue: have you ever seen ET runs stop producing
>> > any output (even stdout), even though the processes are still running?
>> >
>> > I have seen this issue on both new and old NVMe storage with various versions
>> > of OpenMPI. It happened after more than a day of running.
>> >
>> > Actually, not all the processes are running. One process is in the Dl state
>> > (uninterruptible sleep), so all output has stopped. Do you have any hints on
>> > this issue? There are no specific limits set for the files. Other read/write
>> > tasks on the disks are fine.
>> >
>> > Thanks for your help in advance.
>> >
>> > Hee Il
>> >
>>



-- 
Erik Schnetter <[email protected]>
http://www.perimeterinstitute.ca/personal/eschnetter/
_______________________________________________
Users mailing list
[email protected]
http://lists.einsteintoolkit.org/mailman/listinfo/users
