[ 
https://issues.apache.org/jira/browse/YARN-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509816#comment-16509816
 ] 

Eric Yang commented on YARN-8414:
---------------------------------

Timeline Service v2 implements BufferedMutator API from HBase for buffering 
data to be sent to HBase as batches.  It might be possible to add separate 
logic in YARN code to throttle connection retries, but it will make code more 
complex in YARN to handle connection retries.  This seems like a frequently 
asked question that HBase client API can handle connection retries to throttle 
attempts to HBase or ZooKeeper.  I am not sure if there is existing API or 
configuration that can help to back off the retries from HBase point of view.  
[~te...@apache.org] any suggestion that could help us here?

> Nodemanager crashes soon if ATSv2 HBase is either down or absent
> ----------------------------------------------------------------
>
>                 Key: YARN-8414
>                 URL: https://issues.apache.org/jira/browse/YARN-8414
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: yarn
>    Affects Versions: 3.1.0
>            Reporter: Eric Yang
>            Priority: Critical
>
> Test cluster has 1000 apps running, and a user trigger capacity scheduler 
> queue changes.  This crashes all node managers.  It looks like node manager 
> encounter too many files open while aggregating logs for containers:
> {code}
> 2018-06-07 21:17:59,307 WARN  server.AbstractConnector 
> (AbstractConnector.java:handleAcceptFailure(544)) -
> java.io.IOException: Too many open files
>         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>         at 
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
>         at 
> sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
>         at 
> org.eclipse.jetty.server.ServerConnector.accept(ServerConnector.java:371)
>         at 
> org.eclipse.jetty.server.AbstractConnector$Acceptor.run(AbstractConnector.java:601)
>         at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>         at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>         at java.lang.Thread.run(Thread.java:745)
> 2018-06-07 21:17:59,758 WARN  util.SysInfoLinux 
> (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo; 
> can't determine memory settings
> 2018-06-07 21:17:59,758 WARN  util.SysInfoLinux 
> (SysInfoLinux.java:readProcMemInfoFile(238)) - Couldn't read /proc/meminfo; 
> can't determine memory settings
> 2018-06-07 21:18:00,842 WARN  client.ConnectionUtils 
> (ConnectionUtils.java:getStubKey(236)) - Can not resolve host12.example.com, 
> please check your network
> java.net.UnknownHostException: host1.example.com: System error
>         at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
>         at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
>         at 
> java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
>         at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1192)
>         at java.net.InetAddress.getAllByName(InetAddress.java:1126)
>         at java.net.InetAddress.getByName(InetAddress.java:1076)
>         at 
> org.apache.hadoop.hbase.client.ConnectionUtils.getStubKey(ConnectionUtils.java:233)
>         at 
> org.apache.hadoop.hbase.client.ConnectionImplementation.getClient(ConnectionImplementation.java:1189)
>         at 
> org.apache.hadoop.hbase.client.ReversedScannerCallable.prepare(ReversedScannerCallable.java:111)
>         at 
> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.prepare(ScannerCallableWithReplicas.java:399)
>         at 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
>         at 
> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:80)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> Timeline service has thousands of exceptions:
> {code}
> 2018-06-07 21:18:34,182 ERROR client.AsyncProcess 
> (AsyncProcess.java:submit(291)) - Failed to get region location
> java.io.InterruptedIOException
>         at 
> org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:265)
>         at 
> org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:437)
>         at 
> org.apache.hadoop.hbase.client.ClientScanner.nextWithSyncCache(ClientScanner.java:312)
>         at 
> org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:597)
>         at 
> org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:834)
>         at 
> org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:732)
>         at 
> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:281)
>         at 
> org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:236)
>         at 
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:307)
>         at 
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:212)
>         at 
> org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:170)
>         at 
> org.apache.hadoop.yarn.server.timelineservice.storage.common.TypedBufferedMutator.mutate(TypedBufferedMutator.java:54)
>         at 
> org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.store(ColumnRWHelper.java:153)
>         at 
> org.apache.hadoop.yarn.server.timelineservice.storage.common.ColumnRWHelper.store(ColumnRWHelper.java:107)
>         at 
> org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.store(HBaseTimelineWriterImpl.java:395)
>         at 
> org.apache.hadoop.yarn.server.timelineservice.storage.HBaseTimelineWriterImpl.write(HBaseTimelineWriterImpl.java:198)
>         at 
> org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.writeTimelineEntities(TimelineCollector.java:164)
>         at 
> org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollector.putEntitiesAsync(TimelineCollector.java:196)
>         at 
> org.apache.hadoop.yarn.server.timelineservice.collector.TimelineCollectorWebService.putEntities(TimelineCollectorWebService.java:173)
>         at sun.reflect.GeneratedMethodAccessor145.invoke(Unknown Source)
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at 
> com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
>         at 
> com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$ResponseOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:205)
>         at 
> com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
>         at 
> com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
>         at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>         at 
> com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
>         at 
> com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
>         at 
> com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
>         at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
>         at 
> com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
>         at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
>         at 
> com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
>         at 
> com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
>         at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
>         at 
> com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>         at 
> org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848)
>         at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1772)
>         at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:644)
>         at 
> org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:304)
>         at 
> org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:592)
>         at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>         at 
> org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1604)
>         at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>         at 
> org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>         at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1759)
>         at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
>         at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>         at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>         at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>         at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>         at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
>         at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>         at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>         at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>         at 
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>         at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>         at org.eclipse.jetty.server.Server.handle(Server.java:534)
>         at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
>         at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>         at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>         at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
>         at 
> org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>         at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>         at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>         at 
> org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>         at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>         at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>         at java.lang.Thread.run(Thread.java:745)
> 2018-06-07 21:18:36,266 INFO  retry.RetryInvocationHandler 
> (RetryInvocationHandler.java:log(411)) - java.net.UnknownHostException: 
> Invalid host name: local host is: (unknown); destination host is: 
> "host1.example.com":8020; java.net.UnknownHostException; For more details 
> see:  http://wiki.apache.org/hadoop/UnknownHost, while invoking 
> ClientNamenodeProtocolTranslatorPB.getServerDefaults over 
> host1.example.com:8020 after 10 failover attempts. Trying to failover after 
> sleeping for 9634ms.
> 2018-06-07 21:18:36,612 WARN  storage.HBaseTimelineWriterImpl 
> (HBaseTimelineWriterImpl.java:write(170)) - Found null for one of: 
> flowName=null appId=application_1528316765723_0030 userId=csingh 
> clusterId=yarn-cluster . Not proceeding with writing to hbase
> 2018-06-07 21:18:38,396 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=6, 
> retries=6, started=4213 ms ago, cancelled=false, msg=Call to 
> host1.example.com/142.26.32.112:17020 failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: host12.example.com/142.26.32.112:17020, details=row 
> 'prod.timelineservice.entity,csingh!yarn-cluster!scale-1-182!^?���(�^@<!^?���)8��^?���!COMPONENT!^@^@^@^@^@^@^@^@!simple,99999999999999'
>  on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=host12.example.com,17020,1528302866813, seqNum=-1
> 2018-06-07 21:18:38,662 ERROR util.ShutdownHookManager 
> (ShutdownHookManager.java:run(82)) - ShutdownHookManger shutdown forcefully
> {code}
> Nodes were temporarily unable to resolve hostname to IP mapping.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

Reply via email to