[
https://issues.apache.org/jira/browse/YARN-5436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15397739#comment-15397739
]
Zhiyuan Yang commented on YARN-5436:
------------------------------------
[~rohithsharma] Thanks for reviewing! You are right that this patch mostly
stops DrainDispatcher from reusing AsyncDispatcher's drained field, but the
fix for YARN-2991 is still there.
bq. does small tiny race is causing TEZ test failures?
Yes. In the Tez unit tests, dispatcher.await() returned before all events had
been handled, so the assertion after dispatcher.await() failed. This race
condition only happens when the queue is almost empty, which is exactly the
case in the Tez unit tests.
bq. If so would it be good to fix in AsyncDispatcher rather adding full
duplicate code.
The root cause of the race is that we cannot guarantee that enqueuing an event
and updating drained happen atomically. I didn't find a way to fix this without
adding more synchronization, which is a very expensive fix for minimal benefit.
YARN-3878 discussed this race and decided to ignore it for the same reason.
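To make the non-atomic update concrete, here is a minimal single-threaded
sketch that replays one problematic interleaving step by step. It is not the
AsyncDispatcher source: the class name, the method names and the exact order
of the two producer steps are assumptions made for illustration only.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for a dispatcher that tracks "drained" separately
// from the queue itself. Names are illustrative, not taken from Hadoop.
class DrainedRaceSketch {
    static final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    static volatile boolean drained = true;

    // Producer, step 1 (assumed order): announce that work is pending.
    static void producerMarkNotDrained() {
        drained = false;
    }

    // Producer, step 2: actually enqueue the event.
    static void producerEnqueue(String event) {
        queue.add(event);
    }

    // Dispatcher thread: refresh the flag from the current queue state.
    static void dispatcherRefreshDrained() {
        drained = queue.isEmpty();
    }

    // Replays the interleaving in which the dispatcher refreshes the flag
    // between the two producer steps. Returns true if the flag claims
    // "drained" while an event is still waiting in the queue.
    static boolean replayBadInterleaving() {
        queue.clear();
        drained = true;

        producerMarkNotDrained();    // producer: drained = false
        dispatcherRefreshDrained();  // dispatcher: queue still empty, drained = true
        producerEnqueue("event");    // producer: event lands after the refresh

        return drained && !queue.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println("stale drained flag: " + replayBadInterleaving());
    }
}
```

In this replay, a caller polling the flag (as an await()-style helper would)
sees drained == true and returns although one event is still unhandled. Ruling
the interleaving out means guarding both producer steps and the refresh with
the same lock, which is the extra synchronization cost mentioned above.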
bq. How about adding additional check before adding into event queue to avoid a
race?
While this may avoid enqueuing the last event, the race can still happen
without invoking dispatcher.serviceStop(). In fact, the Tez unit tests never
invoke dispatcher.serviceStop().
> Race in AsyncDispatcher can cause random test failures in Tez(probably YARN
> also )
> ----------------------------------------------------------------------------------
>
> Key: YARN-5436
> URL: https://issues.apache.org/jira/browse/YARN-5436
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Zhiyuan Yang
> Assignee: Zhiyuan Yang
> Attachments: YARN-5436.1.patch, YARN-5436.2.patch, YARN-5436.3.patch,
> YARN-5436.4.patch
>
>
> In YARN-2264, a race in DrainDispatcher was fixed. Unfortunately, it also
> exists in AsyncDispatcher (this was found and ignored in YARN-3878 but never
> documented...). In YARN-2991, another DrainDispatcher bug was fixed by
> letting DrainDispatcher reuse some AsyncDispatcher methods because
> AsyncDispatcher doesn't have that issue. However, this shadows the YARN-2264
> fix, and now a similar race reappears in Tez unit tests (and probably in YARN
> unit tests as well).