> well we are seeing these sessions sitting around for over an hour

This could be one of the causes of this issue - a stuck ATS. Tez won't kill a session until all the ATS info has been submitted out of the process.
RollingLevelDBTimelineStore and EntityGroupFSTimelineStore were written to fix this issue, but AFAIK they are not the default in stock Apache Hadoop installs (Ambari does set them up). Check yarn.timeline-service.store-class in your yarn-site.xml; if it says LeveldbTimelineStore, you might see this behavior exactly 30 days after the cluster goes operational.

Cheers,
Gopal
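
For the yarn-site.xml check mentioned above, here is a minimal sketch in Python, not an official tool. It assumes the usual Hadoop config layout (`<property>` elements with `<name>` and `<value>` children). The helper names are hypothetical, and only the simple class name is compared, so the package prefix does not matter.

```python
import xml.etree.ElementTree as ET

PROPERTY = "yarn.timeline-service.store-class"


def _store_class_from_root(root):
    """Return the configured store class, or None if the property is absent."""
    for prop in root.iter("property"):
        name = (prop.findtext("name", default="") or "").strip()
        if name == PROPERTY:
            return (prop.findtext("value", default="") or "").strip()
    return None


def store_class_from_xml(text):
    """Look up the property in yarn-site.xml content passed as a string."""
    return _store_class_from_root(ET.fromstring(text))


def store_class_from_file(path):
    """Look up the property in a yarn-site.xml file on disk."""
    return _store_class_from_root(ET.parse(path).getroot())


def uses_plain_leveldb_store(store_class):
    """True when the simple class name is exactly LeveldbTimelineStore."""
    if not store_class:
        return False
    return store_class.rsplit(".", 1)[-1] == "LeveldbTimelineStore"
```

To use it, call `store_class_from_file()` with the path to your yarn-site.xml (often somewhere like /etc/hadoop/conf/yarn-site.xml, but that location is an assumption about your install) and pass the result to `uses_plain_leveldb_store()`. A result of `None` only means the property is not set in that file, in which case whatever default your Hadoop version ships with applies.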