[
https://issues.apache.org/jira/browse/YARN-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16511867#comment-16511867
]
Eric Yang edited comment on YARN-8414 at 6/14/18 3:38 PM:
----------------------------------------------------------
The root cause of the node manager crash is a socket leak between
TimelineV2Client and the timeline collector.
The serviceam.log shows:
{code}
2018-06-13 23:50:13,509 [AMRM Callback Handler Thread] INFO impl.TimelineV2ClientImpl - Updated timeline service address to host5.example.com:46473
{code}
Based on the netstat output, the leaked CLOSE_WAIT sockets come from
TimelineV2ClientImpl:
{code}
tcp 1 0 172.26.32.105:39180 172.26.32.105:46473 CLOSE_WAIT 3310868/java
tcp 1 0 172.26.32.105:51496 172.26.32.105:46473 CLOSE_WAIT 3310868/java
tcp 1 0 172.26.32.105:45463 172.26.32.105:46473 CLOSE_WAIT 3310868/java
tcp 1 0 172.26.32.105:58193 172.26.32.105:46473 CLOSE_WAIT 3310868/java
tcp 1 0 172.26.32.105:41297 172.26.32.105:46473 CLOSE_WAIT 3310868/java
tcp 1 0 172.26.32.105:34530 172.26.32.105:46473 CLOSE_WAIT 3310868/java
tcp 1 0 172.26.32.105:55300 172.26.32.105:46473 CLOSE_WAIT 3310868/java
{code}
The excessive push of metrics to the timeline collector without closing the
sockets causes the node manager to hang on to more open sockets than it
should; eventually the node manager runs out of file descriptors (too many
open files) and crashes.
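For illustration only, here is a minimal sketch of the pattern that avoids
this kind of leak: every response from the collector must be drained and
closed so the underlying socket is released. It uses plain
java.net.HttpURLConnection with a hypothetical method and placeholder
URI/payload; it is not the actual TimelineV2ClientImpl code path, which goes
through a Jersey client.
{code}
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Hypothetical sketch, not NM code. Skipping the drain/close steps below (or
 * abandoning the connection after an error) is what leaves sockets stuck in
 * CLOSE_WAIT and burns one file descriptor per call, as in the netstat output
 * above.
 */
public class CollectorPutSketch {

  // collectorUri stands in for the address from the log line above, e.g.
  // "http://host5.example.com:46473/ws/v2/timeline/" (placeholder path).
  static void putEntities(String collectorUri, byte[] jsonPayload)
      throws IOException {
    HttpURLConnection conn =
        (HttpURLConnection) new URL(collectorUri).openConnection();
    conn.setRequestMethod("PUT");
    conn.setDoOutput(true);
    conn.setRequestProperty("Content-Type", "application/json");
    try (OutputStream out = conn.getOutputStream()) {
      out.write(jsonPayload);
    }
    try (InputStream in = conn.getInputStream()) {
      // Fully consume the response body so the connection can be closed cleanly.
      byte[] buf = new byte[4096];
      while (in.read(buf) != -1) {
        // discard
      }
    } finally {
      // Release the socket instead of leaving our end half-open in CLOSE_WAIT.
      conn.disconnect();
    }
  }
}
{code}
disconnect() gives up HTTP keep-alive reuse for that connection, but when the
alternative is an unbounded pile of CLOSE_WAIT sockets, closing eagerly is the
safer default for a sketch like this.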
was (Author: eyang):
The root cause of the node manager crash is a socket leak between
TimelineV2Client and the timeline collector.
The serviceam.log shows:
{code}
2018-06-13 23:50:13,509 [AMRM Callback Handler Thread] INFO impl.TimelineV2ClientImpl - Updated timeline service address to y005.l42scl.hortonworks.com:46473
{code}
Based on the netstat output, the leaked CLOSE_WAIT sockets come from
TimelineV2ClientImpl:
{code}
tcp 1 0 172.26.32.105:39180 172.26.32.105:46473 CLOSE_WAIT 3310868/java
tcp 1 0 172.26.32.105:51496 172.26.32.105:46473 CLOSE_WAIT 3310868/java
tcp 1 0 172.26.32.105:45463 172.26.32.105:46473 CLOSE_WAIT 3310868/java
tcp 1 0 172.26.32.105:58193 172.26.32.105:46473 CLOSE_WAIT 3310868/java
tcp 1 0 172.26.32.105:41297 172.26.32.105:46473 CLOSE_WAIT 3310868/java
tcp 1 0 172.26.32.105:34530 172.26.32.105:46473 CLOSE_WAIT 3310868/java
tcp 1 0 172.26.32.105:55300 172.26.32.105:46473 CLOSE_WAIT 3310868/java
{code}
The excessive push of metrics to the timeline collector without closing the
sockets causes the node manager to hang on to more open sockets than it
should; eventually the node manager runs out of file descriptors (too many
open files) and crashes.
> Nodemanager crashes soon if ATSv2 HBase is either down or absent
> ----------------------------------------------------------------
>
> Key: YARN-8414
> URL: https://issues.apache.org/jira/browse/YARN-8414
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: yarn
> Affects Versions: 3.1.0
> Reporter: Eric Yang
> Priority: Critical
>
> The test cluster had 1000 apps running, and a user triggered capacity
> scheduler queue changes. This crashed all node managers. It looks like the
> node managers encountered too many open files while aggregating logs for
> containers (a small descriptor-usage sketch follows the stack trace):
> {code}
> 2018-06-07 21:17:59,307 WARN server.AbstractConnector
> (AbstractConnector.java:handleAcceptFailure(544)) -
> java.io.IOException: Too many open files
> at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> at
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
> at
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
> at
> org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:371)
> at
> org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:601)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
> at java.lang.Thread.run(Thread.java:745)
> 2018-06-07 21:17:59,758 WARN util.SysInfoLinux
> (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo;
> can't determine memory settings
> 2018-06-07 21:17:59,758 WARN util.SysInfoLinux
> (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo;
> can't determine memory settings
> 2018-06-07 21:18:00,842 WARN client.ConnectionUtils
> (ConnectionUtils.java:getStubKey(236)) - Can not resolve host12.example.com,
> please check your network
> java.net.UnknownHostException: host1.example.com: System error
> at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
> at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
> at
> java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
> at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
> at java.net.InetAddress.getAllByName(InetAddress.java:1192)
> at java.net.InetAddress.getAllByName(InetAddress.java:1126)
> at java.net.InetAddress.getByName(InetAddress.java:1076)
> at
> org.apache.hadoop.hbase.client.ConnectionUtils.getStubKey(ConnectionUtils.java:233)
> at
> org.apache.hadoop.hbase.client.ConnectionImplementation.getClient(ConnectionImplementation.java:1189)
> at
> org.apache.hadoop.hbase.client.ReversedScannerCallable.prepare(ReversedScannerCallable.java:111)
> at
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:399)
> at
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
> at
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
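> As a purely illustrative aside (not part of the node manager), a descriptor
> check like the hypothetical helper below shows how close the process is to
> its limit before Jetty's accept() starts failing with "Too many open files":
> {code}
> import java.lang.management.ManagementFactory;
> import java.lang.management.OperatingSystemMXBean;
> import com.sun.management.UnixOperatingSystemMXBean;
>
> /** Hypothetical helper: log open vs. maximum file descriptors for this JVM. */
> public class FdUsageCheck {
>   public static void main(String[] args) {
>     OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
>     if (os instanceof UnixOperatingSystemMXBean) {
>       UnixOperatingSystemMXBean unix = (UnixOperatingSystemMXBean) os;
>       System.out.printf("open fds: %d / max fds: %d%n",
>           unix.getOpenFileDescriptorCount(),
>           unix.getMaxFileDescriptorCount());
>     }
>   }
> }
> {code}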
> The timeline service log shows thousands of exceptions:
> {code}
> 2018-06-07 21:18:34,182 ERROR client.AsyncProcess
> (AsyncProcess.java:submit(291)) - Failed to get region location
> java.io.InterruptedIOException
> at
> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:265)
> at
> org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:437)
> at
> org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:312)
> at
> org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:597)
> at
> org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:834)
> at
> org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:732)
> at
> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:281)
> at
> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:236)
> at
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:307)
> at
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:212)
> at
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:170)
> at
> org.apache.hadoop.yarn.server.timelineservice.storage.common.TypedBufferedMutator.mutate(TypedBufferedMutator.java:54)
> at
> org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.store(ColumnRWHelper.java:153)
> at
> org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.store(ColumnRWHelper.java:107)
> at
> org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.store(HBaseTimelineWriterImpl.java:395)
> at
> org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.write(HBaseTimelineWriterImpl.java:198)
> at
> org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.writeTimelineEntities(TimelineCollector.java:164)
> at
> org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.putEntitiesAsync(TimelineCollector.java:196)
> at
> org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorWebService.putEntities(TimelineCollectorWebService.java:173)
> at sun.reflect.GeneratedMethodAccessor145.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
> at
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
> at
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
> at
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
> at
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
> at
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
> at
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
> at
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
> at
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
> at
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
> at
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
> at
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
> at
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
> at
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
> at
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
> at
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
> at
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
> at
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:304)
> at
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at
> org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
> at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
> at org.eclipse.jetty.server.Server.handle(Server.java:534)
> at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
> at
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
> at
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
> at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
> at
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
> at
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
> at
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
> at
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
> at java.lang.Thread.run(Thread.java:745)
> 2018-06-07 21:18:36,266 INFO retry.RetryInvocationHandler
> (RetryInvocationHandler.java:log(411)) - java.net.UnknownHostException:
> Invalid host name: local host is: (unknown); destination host is:
> "host1.example.com":8020; java.net.UnknownHostException; For more details
> see: http://wiki.apache.org/hadoop/UnknownHost, while invoking
> ClientNamenodeProtocolTranslatorPB.getServerDefaults over
> host1.example.com:8020 after 10 failover attempts. Trying to failover after
> sleeping for 9634ms.
> 2018-06-07 21:18:36,612 WARN storage.HBaseTimelineWriterImpl
> (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of:
> flowName=null appId=application_1528316765723_0030 userId=csingh
> clusterId=yarn-cluster . Not proceeding with writing to hbase
> 2018-06-07 21:18:38,396 INFO client.RpcRetryingCallerImpl
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=6,
> retries=6, started=4213 ms ago, cancelled=false, msg=Call to
> host1.example.com/142.26.32.112:17020 failed on connection exception:
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
> Connection refused: host12.example.com/142.26.32.112:17020, details=row
> 'prod.timelineservice.entity,csingh!yarn-cluster!scale-1-182!^?���(�^@<!^?���)8��^?���!COMPONENT!^@^@^@^@^@^@^@^@!simple,99999999999999'
> on table 'hbase:meta' at region=hbase:meta,,1.1588230740,
> hostname=host12.example.com,17020,1528302866813, seqNum=-1
> 2018-06-07 21:18:38,662 ERROR util.ShutdownHookManager
> (ShutdownHookManager.java:run(82)) - ShutdownHookManger shutdown forcefully
> {code}
> Nodes were temporarily unable to resolve hostnames to IP addresses.