Thomas Friedrich created YARN-5309:
--------------------------------------
Summary: SSLFactory truststore reloader thread leak in TimelineClientImpl
Key: YARN-5309
URL: https://issues.apache.org/jira/browse/YARN-5309
Project: Hadoop YARN
Issue Type: Bug
Components: timelineserver, yarn
Affects Versions: 2.7.1
Reporter: Thomas Friedrich
We found a similar issue to HADOOP-11368 in TimelineClientImpl. The class
creates an SSLFactory instance in newSslConnConfigurator, which in turn creates
a ReloadingX509TrustManager instance that starts a truststore reloader thread.
However, the SSLFactory is never destroyed, so the truststore reloader threads
are never stopped.
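As a rough illustration of the leak pattern, here is a minimal standalone sketch (not the actual TimelineClientImpl code; the class and method names are made up, and it assumes ssl-client.xml is configured with a truststore so that SSLFactory.init() actually starts a reloader thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.ssl.SSLFactory;

// Hypothetical reproduction of the leak: each call initializes an SSLFactory,
// which starts a "Truststore reloader thread", but destroy() is never called,
// so the daemon thread stays alive for the life of the JVM.
public class TruststoreReloaderLeakDemo {

  static void leakOnce(Configuration conf) throws Exception {
    SSLFactory sslFactory = new SSLFactory(SSLFactory.Mode.CLIENT, conf);
    sslFactory.init();   // ReloadingX509TrustManager starts its reloader thread here
    // The factory is handed off (e.g. to a connection configurator) and then
    // dropped; sslFactory.destroy() is never invoked, leaking the thread.
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    for (int i = 0; i < 5; i++) {
      leakOnce(conf);    // every iteration leaves one more reloader thread behind
    }
  }
}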
This problem was observed by a customer who had SSL enabled in Hadoop and
submitted many queries against HiveServer2. After a few days, the HS2 instance
crashed, and the Java thread dump showed many (over 13,000) threads like this:
"Truststore reloader thread" #126 daemon prio=5 os_prio=0
tid=0x00007f680d2e3000 nid=0x98fd waiting on
condition [0x00007f67e482c000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.security.ssl.ReloadingX509TrustManager.run
(ReloadingX509TrustManager.java:225)
at java.lang.Thread.run(Thread.java:745)
HiveServer2 uses the JobClient to submit a job:
Thread [HiveServer2-Background-Pool: Thread-188] (Suspended (breakpoint at line 89 in ReloadingX509TrustManager))
owns: Object (id=464)
owns: Object (id=465)
owns: Object (id=466)
owns: ServiceLoader<S> (id=210)
ReloadingX509TrustManager.<init>(String, String, String, long) line: 89
FileBasedKeyStoresFactory.init(SSLFactory$Mode) line: 209
SSLFactory.init() line: 131
TimelineClientImpl.newSslConnConfigurator(int, Configuration) line: 532
TimelineClientImpl.newConnConfigurator(Configuration) line: 507
TimelineClientImpl.serviceInit(Configuration) line: 269
TimelineClientImpl(AbstractService).init(Configuration) line: 163
YarnClientImpl.serviceInit(Configuration) line: 169
YarnClientImpl(AbstractService).init(Configuration) line: 163
ResourceMgrDelegate.serviceInit(Configuration) line: 102
ResourceMgrDelegate(AbstractService).init(Configuration) line: 163
ResourceMgrDelegate.<init>(YarnConfiguration) line: 96
YARNRunner.<init>(Configuration) line: 112
YarnClientProtocolProvider.create(Configuration) line: 34
Cluster.initialize(InetSocketAddress, Configuration) line: 95
Cluster.<init>(InetSocketAddress, Configuration) line: 82
Cluster.<init>(Configuration) line: 75
JobClient.init(JobConf) line: 475
JobClient.<init>(JobConf) line: 454
MapRedTask(ExecDriver).execute(DriverContext) line: 401
MapRedTask.execute(DriverContext) line: 137
MapRedTask(Task<T>).executeTask() line: 160
TaskRunner.runSequential() line: 88
Driver.launchTask(Task<Serializable>, String, boolean, String, int, DriverContext) line: 1653
Driver.execute() line: 1412
For every job, a new instance of JobClient/YarnClientImpl/TimelineClientImpl is
created. But because the HS2 process stays up for days, the truststore reloader
threads from previous jobs keep hanging around in the HS2 process and
eventually exhaust the available resources.
It seems a fix similar to HADOOP-11368 is needed in TimelineClientImpl, but the
class doesn't have a destroy method to begin with.
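For reference, a fix along the lines of HADOOP-11368 would presumably keep the SSLFactory as a field and destroy it when the service is stopped. The sketch below only illustrates that shape; the class SslAwareClientService is invented and this is not the actual patch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.ssl.SSLFactory;
import org.apache.hadoop.service.AbstractService;

// Illustrative service (not the real TimelineClientImpl) showing the shape of a
// HADOOP-11368-style fix: hold on to the SSLFactory and destroy it in
// serviceStop() so the truststore reloader thread dies with the service.
public class SslAwareClientService extends AbstractService {

  private SSLFactory sslFactory;

  public SslAwareClientService() {
    super("SslAwareClientService");
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    sslFactory = new SSLFactory(SSLFactory.Mode.CLIENT, conf);
    sslFactory.init();       // starts the truststore reloader thread
    super.serviceInit(conf);
  }

  @Override
  protected void serviceStop() throws Exception {
    if (sslFactory != null) {
      sslFactory.destroy();  // stops the truststore reloader thread
      sslFactory = null;
    }
    super.serviceStop();
  }
}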
One option to avoid this problem is to disable the YARN timeline service
(yarn.timeline-service.enabled=false).
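For example, the workaround can be applied in yarn-site.xml on the client (HiveServer2) side; this is only an illustration and assumes nothing else in the deployment needs the timeline service:

<property>
  <name>yarn.timeline-service.enabled</name>
  <value>false</value>
</property>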