Thanks for the info, Hitesh. Unfortunately it seems that RollingLevelDB is
only in trunk, so I may have to backport it to 2.6.2 (the version I use). I
did notice that the leveldb grows to tens of GB, which may be an indication
that pruning is not happening often enough (or at all?). I also need to fix
the logging, as the timeline server logs don't seem to be very active
beyond startup.

For the job I posted before, here is the associated eventQueueBacklog log
line.

2016-08-03 19:23:27,932 [INFO] [AMShutdownThread] |ats.ATSHistoryLoggingService|: Stopping ATSService, eventQueueBacklog=17553
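
To check the backlog across many AM logs, something like the following could pull the value out. This is only a rough sketch: the regular expression is an assumption based on the single sample line above, and find_backlogs is a made-up helper name, not part of Tez.

```python
import re

# Pattern inferred from the one sample log line above (an assumption);
# \s* tolerates the line being wrapped after "Stopping ATSService,".
BACKLOG_RE = re.compile(r"Stopping ATSService,\s*eventQueueBacklog=(\d+)")


def find_backlogs(log_text):
    """Return every eventQueueBacklog value found in the given AM log text."""
    return [int(m.group(1)) for m in BACKLOG_RE.finditer(log_text)]


sample = (
    "2016-08-03 19:23:27,932 [INFO] [AMShutdownThread] "
    "|ats.ATSHistoryLoggingService|: Stopping ATSService, "
    "eventQueueBacklog=17553"
)
print(find_backlogs(sample))
```

A large value here means many history events were still queued for the timeline server when the AM began shutting down.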

I'll look into lowering tez.yarn.ats.event.flush.timeout.millis while I
investigate the timeline server.
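
For reference, my understanding of the flush timeout from the description quoted below is roughly the following. This is a simplified sketch and not the actual Tez code; flush_with_timeout, events and send are hypothetical names used only for illustration.

```python
import queue
import time


def flush_with_timeout(events, send, timeout_millis, clock=time.monotonic):
    """Send queued events until the queue is empty or the deadline passes.

    Returns the number of events still queued when the function stops.
    """
    deadline = clock() + timeout_millis / 1000.0
    while clock() < deadline:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return 0  # everything was flushed before the deadline
        send(event)  # in this sketch, a slow send simply eats into the budget
    return events.qsize()  # gave up; these events are dropped at shutdown


if __name__ == "__main__":
    pending = queue.Queue()
    for i in range(3):
        pending.put(i)
    left = flush_with_timeout(pending, send=lambda e: None, timeout_millis=60000)
    print("events left unflushed:", left)
```

With a lower timeout the AM would stop waiting sooner, at the cost of losing whatever history events are still queued.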

Thanks for your help,
Slava

On Wed, Aug 3, 2016 at 2:45 PM, Hitesh Shah <[email protected]> wrote:

> Hello Slava,
>
> Can you check for a log line along the lines of "Stopping ATSService,
> eventQueueBacklog=" to see how backed up the event queue to YARN
> timeline is?
>
> I have noticed this in quite a few installs with YARN Timeline where YARN
> Timeline is using the simple Level DB impl and not the RollingLevelDB
> storage class. YARN Timeline ends up hitting some bottlenecks around
> the time when the data purging happens (it takes a global lock on level db).
> The rolling level db storage impl solved this problem by using separate
> level dbs for different time intervals and just throwing out the level db
> instead of trying to do a full scan+purge.
>
> Another workaround, though not a great one, is to set
> “tez.yarn.ats.event.flush.timeout.millis” to a value such as 60000, i.e. 1 min.
> This implies that the Tez AM will try for at most 1 min to flush the queue
> to YARN timeline before giving up and shutting down the Tez AM.
>
> A longer term option is the YARN Timeline version 1.5 work currently
> slated to be released in hadoop 2.8.0 which uses HDFS for writes instead of
> the current web service based approach. This has a far better perf
> throughput for writes albeit with a delay on the read path as the Timeline
> server scans HDFS for new updates. The tez changes for this are already
> available in the source code under the hadoop28 profile though the
> documentation for this is still pending.
>
> thanks
> — Hitesh
>
> > On Aug 3, 2016, at 2:02 PM, Slava Markeyev <[email protected]> wrote:
> >
> > I'm running into an issue that occurs fairly often (but is not consistently
> > reproducible) where yarn reports a negative value for memory allocation, e.g.
> > (-2048), and a 0 vcore allocation despite the AM actually running. For
> > example, the AM reports a runtime of 1hrs, 29mins, 40sec while the dag only
> > ran for 880 seconds.
> >
> > After some investigating I've noticed that the AM has repeated issues
> > contacting the timeline server after the dag is complete (error trace
> > below). This seems to be delaying the shutdown sequence. It seems to retry
> > every minute before either giving up or succeeding, but I'm not sure which.
> > What's the best way to debug why this is happening, and can the
> > timeout/retry period be shortened? I'm more concerned with job
> > completion than with logging it to the timeline server. This doesn't seem to
> > happen consistently to all tez jobs, only some.
> >
> > I'm using hive 1.1.0 and tez 0.7.1 on cdh5.4.10 (hadoop 2.6).
> >
> > 2016-08-03 19:18:22,881 [INFO] [ContainerLauncher #112] |impl.ContainerManagementProtocolProxy|: Opening proxy : nodexxxx:45454
> > 2016-08-03 19:18:23,292 [WARN] [HistoryEventHandlingThread] |security.UserGroupInformation|: PriviledgedActionException as:xxxxx (auth:SIMPLE) cause:java.net.SocketTimeoutException: Read timed out
> > 2016-08-03 19:18:23,292 [ERROR] [HistoryEventHandlingThread] |impl.TimelineClientImpl|: Failed to get the response from the timeline server.
> > com.sun.jersey.api.client.ClientHandlerException: java.net.SocketTimeoutException: Read timed out
> >     at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149)
> >     at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:226)
> >     at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:162)
> >     at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:237)
> >     at com.sun.jersey.api.client.Client.handle(Client.java:648)
> >     at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
> >     at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
> >     at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
> >     at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:472)
> >     at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:321)
> >     at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:301)
> >     at org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.handleEvents(ATSHistoryLoggingService.java:349)
> >     at org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.access$700(ATSHistoryLoggingService.java:53)
> >     at org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService$1.run(ATSHistoryLoggingService.java:190)
> >     at java.lang.Thread.run(Thread.java:745)
> > Caused by: java.net.SocketTimeoutException: Read timed out
> >     at java.net.SocketInputStream.socketRead0(Native Method)
> >     at java.net.SocketInputStream.read(SocketInputStream.java:152)
> >     at java.net.SocketInputStream.read(SocketInputStream.java:122)
> >     at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> >     at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
> >     at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> >     at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:689)
> >     at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:633)
> >     at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1324)
> >     at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:468)
> >     at org.apache.hadoop.security.authentication.client.AuthenticatedURL.extractToken(AuthenticatedURL.java:253)
> >     at org.apache.hadoop.security.authentication.client.PseudoAuthenticator.authenticate(PseudoAuthenticator.java:77)
> >     at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticator.authenticate(DelegationTokenAuthenticator.java:127)
> >     at org.apache.hadoop.security.authentication.client.AuthenticatedURL.openConnection(AuthenticatedURL.java:216)
> >     at org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticatedURL.openConnection(DelegationTokenAuthenticatedURL.java:322)
> >     at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineURLConnectionFactory$1.run(TimelineClientImpl.java:501)
> >     at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineURLConnectionFactory$1.run(TimelineClientImpl.java:498)
> >     at java.security.AccessController.doPrivileged(Native Method)
> >     at javax.security.auth.Subject.doAs(Subject.java:415)
> >     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1707)
> >     at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineURLConnectionFactory.getHttpURLConnection(TimelineClientImpl.java:498)
> >     at com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:159)
> >     at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147)
> >     ... 14 more
> >
> > and finally
> >
> > 2016-08-03 20:32:51,041 [INFO] [AMShutdownThread] |ats.ATSHistoryLoggingService|: Event queue empty, stopping ATS Service
> > 2016-08-03 20:32:51,131 [INFO] [AMShutdownThread] |launcher.ContainerLauncherImpl|: Stopping container_e12_1470097176422_30703_01_002211
> >
> >
> > Thanks,
> > Slava
> >
> > --
> > Slava Markeyev | Engineering | Upsight
>
>


-- 

Slava Markeyev | Engineering | Upsight
<http://www.linkedin.com/in/slavamarkeyev>