Yeah, looks related to the timeline-service alright, I think. Here's the jstack output.
2017-03-14 11:27:46
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.25-b02 mixed mode):

"Attach Listener" #536 daemon prio=9 os_prio=0 tid=0x00007f46c00da000 nid=0x2be6 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
        - None

"AMShutdownThread" #517 daemon prio=5 os_prio=0 tid=0x00007f46b8059000 nid=0x6cfa runnable [0x00007f46a73af000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:150)
        at java.net.SocketInputStream.read(SocketInputStream.java:121)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        - locked <0x00000000fe71a6e0> (a java.io.BufferedInputStream)
        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:703)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1534)
        - locked <0x00000000fe70bd68> (a sun.net.www.protocol.http.HttpURLConnection)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1439)
        - locked <0x00000000fe70bd68> (a sun.net.www.protocol.http.HttpURLConnection)
        at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
        at com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:240)
        at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147)
        at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:226)
        at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:162)
        at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:237)
        at com.sun.jersey.api.client.Client.handle(Client.java:648)
        at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
        at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
        at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
        at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:472)
        at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:321)
        at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:301)
        at org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.handleEvents(ATSHistoryLoggingService.java:357)
        at org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.serviceStop(ATSHistoryLoggingService.java:233)
        - locked <0x00000000cd56b968> (a java.lang.Object)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x00000000cd4042a0> (a java.lang.Object)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
        at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
        at org.apache.tez.dag.history.HistoryEventHandler.serviceStop(HistoryEventHandler.java:85)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x00000000cd0cf878> (a java.lang.Object)
        at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:65)
        at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1938)
        at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:2121)
        - locked <0x00000000ccf30038> (a org.apache.tez.dag.app.DAGAppMaster)
        at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x00000000ccf301d0> (a java.lang.Object)
        at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHandler$AMShutdownRunnable.run(DAGAppMaster.java:952)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - None

"ContainerLauncher #31" #210 daemon prio=5 os_prio=0 tid=0x00007f46c42c0000 nid=0x1601 waiting on condition [0x00007f46a4f90000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00000000cd633740> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Now the funny thing is: I have one query that runs a "select count(*) from <table>" and the Tez session exits just fine. On the other hand, we have some complex Tez queries and it's those that stick around. Kinda strange why that is. I guess I'll bounce the timeline server and see what happens next. If you see other insights in that stack trace, please let me know.

Cheers,
Stephen

On Tue, Mar 14, 2017 at 7:06 AM, Stephen Sprague <sprag...@gmail.com> wrote:

> Thanks Gopal. Lemme see what I can do with your insights and report back
> with my findings.
>
> Cheers,
> Stephen.
>
> On Tue, Mar 14, 2017 at 6:19 AM, Gopal Vijayaraghavan <gop...@apache.org>
> wrote:
>
>> > Looking at the doc I thought this config setting would keep those
>> > Tez jobs from hanging around (tez.session.am.dag.submit.timeout.secs),
>> > but testing proved otherwise. It didn't seem to have any effect.
>> > So I ask: how to force off those Tez jobs organically? Or is there
>> > perhaps something else I'm missing?
>>
>> A jstack of the Tez AM would be useful.
>>
>> My guess is that this is related to ATS.
>>
>> tez.yarn.ats.event.flush.timeout.millis=-1L;
>>
>> is the default, and if ATS is down for whatever reason, Tez queries will
>> wait an infinite time to flush all events to ATS.
>>
>> You can probably set that to 600000L and see if the AMs disappear after
>> 10 minutes.
>>
>> Before TEZ-1701, this was set to 3 seconds, which broke the UI when the
>> ATS instance was temporarily unavailable.
>>
>> Cheers,
>> Gopal
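P.S. For the record, here's roughly how I plan to wire in that override, just a sketch of a tez-site.xml entry (the 600000 value is straight from your 10-minute suggestion, and I'm assuming a fresh Tez session is needed for the AM to pick it up):

    <property>
      <!-- how long the AM waits to flush pending events to ATS at shutdown; -1 means wait forever -->
      <name>tez.yarn.ats.event.flush.timeout.millis</name>
      <value>600000</value>
    </property>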