[ https://issues.apache.org/jira/browse/YARN-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512779#comment-16512779 ]
Rohith Sharma K S commented on YARN-8414:
-----------------------------------------

Thanks [~eyang] for the analysis.

Container event publishing is distributed per application: each NM publishes container entities to the timeline collector of the corresponding application. For example, if an NM is running containers for 5 applications, it creates 5 timeline clients, each connected to that application's timeline collector. The per-application timeline collector runs on the same machine as the application's master container. Every timeline v2 client requires the collector address, which the RM distributes in the NM/AM heartbeat.

A collector address update is a one-time operation. In the case above, was the collector address updated because the master container went down, or because the NM hosting the master container went down? Once a connection to the collector is established, it stays open indefinitely. However, I do see a potential issue: when the collector address is updated, we do not close the existing connection.

How many applications are running in the cluster? How many NMs are there in the cluster? Can we also get the complete output of netstat -tnapl | grep <NM-PID>?

As a rough calculation, let X be the number of running applications and Y the number of NMs in the cluster. In the worst case, the number of connections per NM is X * Y, provided every application has a container running on every NM.

> Nodemanager crashes soon if ATSv2 HBase is either down or absent
> ----------------------------------------------------------------
>
>                 Key: YARN-8414
>                 URL: https://issues.apache.org/jira/browse/YARN-8414
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 3.1.0
>            Reporter: Eric Yang
>            Priority: Critical
>
> Test cluster has 1000 apps running, and a user triggers capacity scheduler queue changes. This crashes all node managers.
> It looks like the node manager encounters too many open files while aggregating logs for containers:
> {code}
> 2018-06-07 21:17:59,307 WARN server.AbstractConnector (AbstractConnector.java:handleAcceptFailure(544)) -
> java.io.IOException: Too many open files
>     at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>     at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
>     at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
>     at org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:371)
>     at org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:601)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>     at java.lang.Thread.run(Thread.java:745)
> 2018-06-07 21:17:59,758 WARN util.SysInfoLinux (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo; can't determine memory settings
> 2018-06-07 21:17:59,758 WARN util.SysInfoLinux (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo; can't determine memory settings
> 2018-06-07 21:18:00,842 WARN client.ConnectionUtils (ConnectionUtils.java:getStubKey(236)) - Can not resolve host12.example.com, please check your network
> java.net.UnknownHostException: host1.example.com: System error
>     at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
>     at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
>     at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
>     at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
>     at java.net.InetAddress.getAllByName(InetAddress.java:1192)
>     at java.net.InetAddress.getAllByName(InetAddress.java:1126)
>     at java.net.InetAddress.getByName(InetAddress.java:1076)
>     at org.apache.hadoop.hbase.client.ConnectionUtils.getStubKey(ConnectionUtils.java:233)
>     at org.apache.hadoop.hbase.client.ConnectionImplementation.getClient(ConnectionImplementation.java:1189)
>     at org.apache.hadoop.hbase.client.ReversedScannerCallable.prepare(ReversedScannerCallable.java:111)
>     at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:399)
>     at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
>     at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> {code}
> Timeline service has thousands of exceptions:
> {code}
> 2018-06-07 21:18:34,182 ERROR client.AsyncProcess (AsyncProcess.java:submit(291)) - Failed to get region location
> java.io.InterruptedIOException
>     at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:265)
>     at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:437)
>     at org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:312)
>     at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:597)
>     at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:834)
>     at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:732)
>     at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:281)
>     at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:236)
>     at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:307)
>     at org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:212)
>     at org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:170)
>     at org.apache.hadoop.yarn.server.timelineservice.storage.common.TypedBufferedMutator.mutate(TypedBufferedMutator.java:54)
>     at org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.store(ColumnRWHelper.java:153)
>     at org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.store(ColumnRWHelper.java:107)
>     at org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.store(HBaseTimelineWriterImpl.java:395)
>     at org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.write(HBaseTimelineWriterImpl.java:198)
>     at org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.writeTimelineEntities(TimelineCollector.java:164)
>     at org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.putEntitiesAsync(TimelineCollector.java:196)
>     at org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorWebService.putEntities(TimelineCollectorWebService.java:173)
>     at sun.reflect.GeneratedMethodAccessor145.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>     at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
>     at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>     at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>     at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>     at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>     at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>     at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>     at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
>     at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
>     at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
>     at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
>     at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
>     at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
>     at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>     at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
>     at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:304)
>     at org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
>     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
>     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>     at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>     at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>     at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>     at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>     at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>     at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>     at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>     at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>     at org.eclipse.jetty.server.Server.handle(Server.java:534)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
>     at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>     at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>     at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
>     at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>     at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>     at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>     at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>     at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>     at java.lang.Thread.run(Thread.java:745)
> 2018-06-07 21:18:36,266 INFO retry.RetryInvocationHandler (RetryInvocationHandler.java:log(411)) - java.net.UnknownHostException: Invalid host name: local host is: (unknown); destination host is: "host1.example.com":8020; java.net.UnknownHostException; For more details see: http://wiki.apache.org/hadoop/UnknownHost, while invoking ClientNamenodeProtocolTranslatorPB.getServerDefaults over host1.example.com:8020 after 10 failover attempts. Trying to failover after sleeping for 9634ms.
> 2018-06-07 21:18:36,612 WARN storage.HBaseTimelineWriterImpl (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: flowName=null appId=application_1528316765723_0030 userId=csingh clusterId=yarn-cluster . Not proceeding with writing to hbase
> 2018-06-07 21:18:38,396 INFO client.RpcRetryingCallerImpl (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=6, retries=6, started=4213 ms ago, cancelled=false, msg=Call to host1.example.com/142.26.32.112:17020 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: host12.example.com/142.26.32.112:17020, details=row 'prod.timelineservice.entity,csingh!yarn-cluster!scale-1-182!^?���(�^@<!^?���)8��^?���!COMPONENT!^@^@^@^@^@^@^@^@!simple,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=host12.example.com,17020,1528302866813, seqNum=-1
> 2018-06-07 21:18:38,662 ERROR util.ShutdownHookManager (ShutdownHookManager.java:run(82)) - ShutdownHookManger shutdown forcefully
> {code}
> Nodes were temporarily unable to resolve hostname to IP mapping.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
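
The rough X * Y estimate from Rohith's comment can be put into numbers with a small sketch. Only the 1000 running applications come from the issue description; the NM count and the file-descriptor limit below are hypothetical placeholders, and both function names are made up for illustration. This is a back-of-the-envelope check, not a model of what the NodeManager actually does.

```python
def worst_case_connections_per_nm(num_apps: int, num_nms: int) -> int:
    """Upper bound from the comment: X applications times Y NodeManagers."""
    if num_apps < 0 or num_nms < 0:
        raise ValueError("counts must be non-negative")
    return num_apps * num_nms


def exceeds_fd_limit(num_apps: int, num_nms: int, fd_limit: int) -> bool:
    """True if the worst-case estimate alone is above the descriptor limit."""
    return worst_case_connections_per_nm(num_apps, num_nms) > fd_limit


if __name__ == "__main__":
    apps = 1000      # X: from the issue description
    nms = 10         # Y: hypothetical cluster size, not stated in the issue
    fd_limit = 4096  # hypothetical per-process open-file limit
    print("worst case:", worst_case_connections_per_nm(apps, nms))
    print("over limit:", exceeds_fd_limit(apps, nms, fd_limit))
```

Plugging in the real NM count and the NM process's actual open-file limit, and comparing against the netstat output requested in the comment, would show how far the observed connection count is from this bound.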