[ 
https://issues.apache.org/jira/browse/YARN-8302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528173#comment-16528173
 ] 

Billie Rinaldi commented on YARN-8302:
--------------------------------------

Attaching a patch that runs a monitor thread to check whether HBase queries 
succeed. It adds a new parameter, 
yarn.timeline-service.reader.storage-monitor.interval-ms, that governs how often 
the check is performed (the same value is also used to configure the HBase 
client timeouts). If HBase is detected to be down, further calls to 
HBaseTimelineReaderImpl fail immediately until the monitor thread determines 
that HBase is back up. Having the reader wait for HBase to become healthy again 
requires less admin intervention than shutting down the TimelineReader would.
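
For readers who want a picture of the approach described above, here is a 
minimal sketch of a fail-fast storage monitor. It is not the code in 
YARN-8302.1.patch: the class name StorageMonitorSketch, the HealthCheck 
interface, and the method names are all hypothetical, and the HBase query is 
replaced by a pluggable check so the example stays self-contained. The HBase 
client timeout configuration mentioned above is not modeled.

```java
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical sketch of a storage monitor; not the actual patch. */
class StorageMonitorSketch implements AutoCloseable {

  /** Stand-in for a cheap HBase query; throws if storage is unreachable. */
  interface HealthCheck {
    void run() throws Exception;
  }

  private final HealthCheck check;
  private final long intervalMs;
  // Assume storage is up until a check proves otherwise.
  private final AtomicBoolean storageUp = new AtomicBoolean(true);
  private final ScheduledExecutorService executor =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "storage-monitor");
        t.setDaemon(true); // do not keep the JVM alive for the monitor
        return t;
      });

  StorageMonitorSketch(HealthCheck check, long intervalMs) {
    this.check = check;
    this.intervalMs = intervalMs;
  }

  /** Starts the periodic check; intervalMs plays the role of interval-ms. */
  void start() {
    executor.scheduleWithFixedDelay(
        this::checkOnce, 0, intervalMs, TimeUnit.MILLISECONDS);
  }

  /** Runs one health check and records the result. */
  void checkOnce() {
    try {
      check.run();
      storageUp.set(true);
    } catch (Exception e) {
      storageUp.set(false);
    }
  }

  /** Called at the top of each reader method to fail fast when down. */
  void checkStorageIsUp() throws IOException {
    if (!storageUp.get()) {
      throw new IOException("Storage is down; failing request immediately");
    }
  }

  @Override
  public void close() {
    executor.shutdownNow();
  }
}
```

In this sketch a reader would call checkStorageIsUp() before touching storage, 
so a request made while the flag is false returns an error right away instead 
of waiting through client retries. Once a later check succeeds, the flag flips 
back and requests proceed without restarting anything.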

> ATS v2 should handle HBase connection issue properly
> ----------------------------------------------------
>
>                 Key: YARN-8302
>                 URL: https://issues.apache.org/jira/browse/YARN-8302
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: ATSv2
>    Affects Versions: 3.1.0
>            Reporter: Yesha Vora
>            Assignee: Billie Rinaldi
>            Priority: Major
>         Attachments: YARN-8302.1.patch
>
>
> An ATS v2 call times out with the error below when it can't connect to the 
> HBase instance.
> {code}
> bash-4.2$ curl -i -k -s -1 -H 'Content-Type: application/json' \
>   -H 'Accept: application/json' --max-time 5 --negotiate -u : \
>   'https://xxx:8199/ws/v2/timeline/apps/application_1526357251888_0022/entities/YARN_CONTAINER?fields=ALL&_=1526425686092'
> curl: (28) Operation timed out after 5002 milliseconds with 0 bytes received
> {code}
> {code:title=ATS log}
> 2018-05-15 23:10:03,623 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=7, 
> retries=7, started=8165 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:13,651 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=8, 
> retries=8, started=18192 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:23,730 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=9, 
> retries=9, started=28272 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1
> 2018-05-15 23:10:33,788 INFO  client.RpcRetryingCallerImpl 
> (RpcRetryingCallerImpl.java:callWithRetries(134)) - Call exception, tries=10, 
> retries=10, started=38330 ms ago, cancelled=false, msg=Call to xxx/xxx:17020 
> failed on connection exception: 
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException:
>  Connection refused: xxx/xxx:17020, details=row 
> 'prod.timelineservice.app_flow,
> ,99999999999999' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, 
> hostname=xxx,17020,1526348294182, seqNum=-1
> {code}
> There are two issues here.
> 1) Check why ATS can't connect to HBase.
> 2) In case of a connection error, the ATS call should not time out. It should 
> fail with a proper error.


