aha. i sense we're getting closer. here are my settings for yarn.timeline-service.*
yarn.timeline-service.address=${yarn.timeline-service.hostname}:10200
yarn.timeline-service.client.max-retries=30
yarn.timeline-service.client.retry-interval-ms=1000
yarn.timeline-service.enabled=true
yarn.timeline-service.handler-thread-count=10
yarn.timeline-service.hostname=XXXXX.sv2.trulia.com
yarn.timeline-service.http-authentication.simple.anonymous.allowed=true
yarn.timeline-service.http-authentication.type=simple
yarn.timeline-service.http-cross-origin.enabled=true
yarn.timeline-service.keytab=/etc/krb5.keytab
yarn.timeline-service.leveldb-timeline-store.path=${hadoop.tmp.dir}/yarn/timeline
yarn.timeline-service.leveldb-timeline-store.read-cache-size=104857600
yarn.timeline-service.leveldb-timeline-store.start-time-read-cache-size=10000
yarn.timeline-service.leveldb-timeline-store.start-time-write-cache-size=10000
yarn.timeline-service.leveldb-timeline-store.ttl-interval-ms=300dd
*yarn.timeline-service.store-class=org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore*
yarn.timeline-service.ttl-enable=true
yarn.timeline-service.ttl-ms=604800000
yarn.timeline-service.webapp.address=XXXXX.sv2.trulia.com:8188
yarn.timeline-service.webapp.https.address=${yarn.timeline-service.hostname}:8190

Gopal, you propose that setting that to "RollingLevelDbTimelineStore" might fix the issue?

Cheers,
Stephen.

On Tue, Dec 13, 2016 at 9:50 PM, Gopal Vijayaraghavan <gop...@apache.org> wrote:

> > well we are seeing these sessions sitting around for over an hour
>
> This could be one of the causes for this issue - a stuck ATS. Tez won't
> kill a session till all the ATS info has been submitted out of the process.
>
> RollingLevelDbTimelineStore & EntityGroupFSTimelineStore was written to
> fix this issue, but AFAIK those are not the default in the Apache Hadoop
> installs (but Ambari does set them up).
>
> Check your yarn.timeline-service.store-class in yarn-site.xml, if it says
> LeveldbTimelineStore, you might see this behavior exactly 30 days after the
> cluster goes operational.
>
> Cheers,
> Gopal
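
For anyone else who wants to run the check Gopal describes without reading yarn-site.xml by hand, here is a rough sketch in Python. It assumes the usual Hadoop <configuration>/<property>/<name>/<value> layout. The helper names (read_property, uses_plain_leveldb_store) and the inline SAMPLE document are made up for illustration and are not part of any Hadoop tooling. On a real node you would read the file contents from wherever your distribution keeps yarn-site.xml.

```python
import xml.etree.ElementTree as ET

STORE_CLASS_KEY = "yarn.timeline-service.store-class"


def read_property(xml_text, key):
    """Return the value of `key` from Hadoop-style configuration XML.

    Returns None when the property is not set in this document.
    """
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == key:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


def uses_plain_leveldb_store(xml_text):
    """True when the store class is explicitly set to LeveldbTimelineStore."""
    value = read_property(xml_text, STORE_CLASS_KEY)
    if value is None:
        return False
    # Compare only the simple class name, ignoring the package prefix.
    return value.rsplit(".", 1)[-1] == "LeveldbTimelineStore"


# Hypothetical stand-in for the contents of yarn-site.xml.
SAMPLE = """<configuration>
  <property>
    <name>yarn.timeline-service.store-class</name>
    <value>org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(read_property(SAMPLE, STORE_CLASS_KEY))
    print(uses_plain_leveldb_store(SAMPLE))
```

Note that a None result only means the property is absent from that file. It says nothing about which default the cluster falls back to.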