[ 
https://issues.apache.org/jira/browse/YARN-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509826#comment-16509826
 ] 

Ted Yu commented on YARN-8414:
------------------------------

In {{ClientScanner}} ctor :
{code}
    this.retries = conf.getInt(HConstants.HBASE_CLIENT_RETRIES_NUMBER,
      HConstants.DEFAULT_HBASE_CLIENT_RETRIES_NUMBER);
{code}
Config is "hbase.client.retries.number" with default of 15. You can tune this 
parameter so that client side fails earlier in this scenario.



> Nodemanager crashes soon if ATSv2 HBase is either down or absent
> ----------------------------------------------------------------
>
>                 Key: YARN-8414
>                 URL: https://issues.apache.org/jira/browse/YARN-8414
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 3.1.0
>            Reporter: Eric Yang
>            Priority: Critical
>
> Test cluster has 1000 apps running, and a user trigger capacity scheduler 
> queue changes.  This crashes all node managers.  It looks like node manager 
> encounter too many files open while aggregating logs for containers:
> {code}
> 2018-06-07 21:17:59,307 WARN  server.AbstractConnector 
> (AbstractConnector.java:handleAcceptFailure(544)) -
> java.io.IOException: Too many open files
>         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>         at 
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
>         at 
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
>         at 
> org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:371)
>         at 
> org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:601)
>         at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>         at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>         at java.lang.Thread.run(Thread.java:745)
> 2018-06-07 21:17:59,758 WARN  util.SysInfoLinux 
> (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo; 
> can't determine memory settings
> 2018-06-07 21:17:59,758 WARN  util.SysInfoLinux 
> (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo; 
> can't determine memory settings
> 2018-06-07 21:18:00,842 WARN  client.ConnectionUtils 
> (ConnectionUtils.java:getStubKey(236)) - Can not resolve host12.example.com, 
> please check your network
> java.net.UnknownHostException: host1.example.com: System error
>         at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
>         at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
>         at 
> java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
>         at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1192)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1126)
>         at java.net.InetAddress.getByName(InetAddress.java:1076)
>         at 
> org.apache.hadoop.hbase.client.ConnectionUtils.getStubKey(ConnectionUtils.java:233)
>         at 
> org.apache.hadoop.hbase.client.ConnectionImplementation.getClient(ConnectionImplementation.java:1189)
>         at 
> org.apache.hadoop.hbase.client.ReversedScannerCallable.prepare(ReversedScannerCallable.java:111)
>         at 
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:399)
>         at 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
>         at 
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> Timeline service has thousands of exceptions:
> {code}
> 2018-06-07 21:18:34,182 ERROR client.AsyncProcess 
> (AsyncProcess.java:submit(291)) - Failed to get region location
> java.io.InterruptedIOException
>         at 
> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:265)
>         at 
> org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:437)
>         at 
> org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:312)
>         at 
> org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:597)
>         at 
> org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:834)
>         at 
> org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:732)
>         at 
> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:281)
>         at 
> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:236)
>         at 
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:307)
>         at 
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:212)
>         at 
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:170)
>         at 
> org.apache.hadoop.yarn.server.timelineservice.storage.common.TypedBufferedMutator.mutate(TypedBufferedMutator.java:54)
>         at 
> org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.store(ColumnRWHelper.java:153)
>         at 
> org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.store(ColumnRWHelper.java:107)
>         at 
> org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.store(HBaseTimelineWriterImpl.java:395)
>         at 
> org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.write(HBaseTimelineWriterImpl.java:198)
>         at 
> org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.writeTimelineEntities(TimelineCollector.java:164)
>         at 
> org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.putEntitiesAsync(TimelineCollector.java:196)
>         at 
> org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorWebService.putEntities(TimelineCollectorWebService.java:173)
>         at sun.reflect.GeneratedMethodAccessor145.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>         at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
>         at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>         at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>         at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>         at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>         at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>         at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>         at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
>         at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
>         at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
>         at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
>         at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
>         at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
>         at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>         at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>         at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>         at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
>         at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:304)
>         at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
>         at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>         at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
>         at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>         at 
> org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>         at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>         at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>         at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>         at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>         at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>         at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>         at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>         at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>         at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>         at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>         at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>         at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>         at org.eclipse.jetty.server.Server.handle(Server.java:534)
>         at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
>         at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>         at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>         at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
>         at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>         at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>         at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>         at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>         at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>         at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>         at java.lang.Thread.run(Thread.java:745)
> 2018-06-07 21:18:36,266 INFO  retry.RetryInvocationHandler 
> (RetryInvocationHandler.java:log(411)) - java.net.UnknownHostException: 
> Invalid host name: local host is: (unknown); destination host is: 
> "host1.example.com":8020; java.net.UnknownHostException; For more details 
> see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking 
> ClientNamenodeProtocolTranslatorPB.getServerDefaults over 
> host1.example.com:8020 after 10 failover attempts. Trying to failover after 
> sleeping for 9634ms.
> 2018-06-07 21:18:36,612 WARN  storage.HBaseTimelineWriterImpl 
> (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: 
> flowName=null appId=application_1528316765723_0030 userId=csingh 
> clusterId=yarn-cluster . Not proceeding with writing to hbase
> 2018-06-07 21:18:38,396 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=6, 
> retries=6, started=4213 ms ago, cancelled=false, msg=Call to 
> host1.example.com/142.26.32.112:17020 failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: host12.example.com/142.26.32.112:17020, details=row 
> 'prod.timelineservice.entity,csingh!yarn-cluster!scale-1-182!^?���(�^@<!^?���)8��^?���!COMPONENT!^@^@^@^@^@^@^@^@!simple,99999999999999'
>  on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=host12.example.com,17020,1528302866813, seqNum=-1
> 2018-06-07 21:18:38,662 ERROR util.ShutdownHookManager 
> (ShutdownHookManager.java:run(82)) - ShutdownHookManger shutdown forcefully
> {code}
> Nodes were temporarily unable to resolve hostname to IP mapping.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to