I'm running into an issue that occurs fairly often (but not consistently
reproducible) where yarn reports a negative value for memory allocation eg
(-2048) and a 0 vcore allocation despite the AM actually running. For
example the AM reports a runtime of 1hrs, 29mins, 40sec while the dag only
880 seconds.
After some investigating I've noticed that the AM has repeated issues
contacting the timeline server after the dag is complete (error trace
below). This seems to be delaying the shutdown sequence. It seems to retry
every minute before either giving up or succeeding but I'm not sure which.
What's the best way to debug why this would be happening and potentially
shortening the timeout retry period as I'm more concerned with job
completion than logging it to the timeline server. This doesn't seem to be
happening consistently to all tez jobs only some.
I'm using hive 1.1.0 and tez 0.7.1 on cdh5.4.10 (hadoop 2.6).
2016-08-03 19:18:22,881 [INFO] [ContainerLauncher #112]
|impl.ContainerManagementProtocolProxy|: Opening proxy : nodexxxx:45454
2016-08-03 19:18:23,292 [WARN] [HistoryEventHandlingThread]
|security.UserGroupInformation|: PriviledgedActionException as:xxxxx
(auth:SIMPLE) cause:java.net.SocketTimeoutException: Read timed out
2016-08-03 19:18:23,292 [ERROR] [HistoryEventHandlingThread]
|impl.TimelineClientImpl|: Failed to get the response from the timeline
server.
com.sun.jersey.api.client.ClientHandlerException:
java.net.SocketTimeoutException: Read timed out
at
com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149)
at
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:226)
at
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:162)
at
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:237)
at com.sun.jersey.api.client.Client.handle(Client.java:648)
at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
at
com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
at
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:472)
at
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:321)
at
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:301)
at
org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.handleEvents(ATSHistoryLoggingService.java:349)
at
org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.access$700(ATSHistoryLoggingService.java:53)
at
org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService$1.run(ATSHistoryLoggingService.java:190)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:689)
at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
at
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1324)
at
java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
at
org.apache.hadoop.security.authentication.client.AuthenticatedURL.extractToken(AuthenticatedURL.java:253)
at
org.apache.hadoop.security.authentication.client.PseudoAuthenticator.authenticate(PseudoAuthenticator.java:77)
at
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.authenticate(DelegationTokenAuthenticator.java:127)
at
org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:216)
at
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.openConnection(DelegationTokenAuthenticatedURL.java:322)
at
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineURLConnectionFactory$1.run(TimelineClientImpl.java:501)
at
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineURLConnectionFactory$1.run(TimelineClientImpl.java:498)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1707)
at
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineURLConnectionFactory.getHttpURLConnection(TimelineClientImpl.java:498)
at
com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:159)
at
com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147)
... 14 more
and finally
2016-08-03 20:32:51,041 [INFO] [AMShutdownThread]
|ats.ATSHistoryLoggingService|: Event queue empty, stopping ATS Service
2016-08-03 20:32:51,131 [INFO] [AMShutdownThread]
|launcher.ContainerLauncherImpl|: Stopping
container_e12_1470097176422_30703_01_002211
Thanks,
Slava
--
Slava Markeyev | Engineering | Upsight