Qihong Wu created YARN-10851:
--------------------------------
Summary: Tez session close does not interrupt yarn's async thread
Key: YARN-10851
URL: https://issues.apache.org/jira/browse/YARN-10851
Project: Hadoop YARN
Issue Type: Bug
Affects Versions: 2.10.1, 2.8.5
Environment: An HA cluster where RM1 is not the active RM,
running YARN 2.8.5 and configured with Tez
Reporter: Qihong Wu
Attachments: hive.log
Hi, I would like to ask for expert advice on how YARN behaves when
handling `InterruptedIOException`.
The issue occurs on an HA cluster where RM1 is NOT the active RM. If a YARN
request made to RM1 fails, failover to the other RM should happen. However, if
an interrupted exception is thrown while connecting to RM1, the thread should
try to [bail
out|https://dzone.com/articles/how-to-handle-the-interruptedexception] as soon
as possible to [respect the interrupt
request|https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ExecutorService.html#shutdownNow--],
rather than moving on to another RM.
However, I found that my application (Hive) continued on to RM2 after an
`InterruptedIOException` was thrown while it was trying to connect to RM1. How
does YARN handle `InterruptedIOException`? Shouldn't the async thread be
interrupted and shut down when Tez's close() triggers the interrupt request?
*Reproduction steps:*
1. Use an HA cluster that runs YARN 2.8.5 and is configured with Tez.
2. Make sure RM1 is not the active RM by checking `yarn rmadmin
-getAllServiceState`. If it is, manually [transition RM2 to the active
state|https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html#Admin_commands].
3. Apply the following failover-retry properties to yarn-site.xml:
{quote}<property>
<name>yarn.client.failover-retries</name>
<value>4</value>
</property>
<property>
<name>yarn.client.failover-retries-on-socket-timeouts</name>
<value>4</value>
</property>
<property>
<name>yarn.client.failover-max-attempts</name>
<value>4</value>
</property>
{quote}
4. Run a simple application that uses the YARN client (for example, a simple
Hive DDL command):
{quote}hive --hiveconf hive.root.logger=TRACE,console -e "create table tez_test
(id int, name string);"
{quote}
5. In the application's log (for example, hive.log), you can see that
`RetryInvocationHandler` caught the `InterruptedIOException` while the request
was going to rm1, but the thread did not bail out immediately and instead
continued on to rm2.
*More information:*
The interrupted exception is triggered via
[TezSessionState#close|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/TezSessionState.java#L689]
and
[Future#cancel|https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/Future.html#cancel-boolean-].
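As a standalone illustration of that mechanism, the sketch below shows how `Future#cancel(true)` interrupts the worker thread that is running the task. It uses only `java.util.concurrent`; the class `CancelInterruptDemo` and the `Thread.sleep` call standing in for a blocking RM request are assumptions made for the example and are not taken from the Hive or YARN code.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo of Future#cancel(true) interrupting a blocked worker. */
final class CancelInterruptDemo {

    /** Returns true if the worker observed the interrupt caused by cancel(true). */
    static boolean workerSeesInterrupt() throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch interrupted = new CountDownLatch(1);
        try {
            Future<?> future = pool.submit(() -> {
                started.countDown();
                try {
                    Thread.sleep(60_000); // stands in for a blocking request
                } catch (InterruptedException e) {
                    // This is the point where the task should stop its work.
                    interrupted.countDown();
                    Thread.currentThread().interrupt();
                }
            });
            started.await();      // make sure the task is running
            future.cancel(true);  // true = interrupt the running thread
            return interrupted.await(5, TimeUnit.SECONDS);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The question in this issue is what happens after that point: whether code further down the call stack, once it sees the interruption, stops or keeps retrying against the next RM.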