FWIW, setting

tez.yarn.ats.event.flush.timeout.millis=60000;

seems to have worked in our case. Thanks again, Gopal.

On Tue, Mar 14, 2017 at 11:42 AM, Stephen Sprague <sprag...@gmail.com> wrote:

> yeah. looks related to the timeline-service alright - i think.
>
> here's the jstack output.
>
> 2017-03-14 11:27:46
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.25-b02 mixed mode):
>
> "Attach Listener" #536 daemon prio=9 os_prio=0 tid=0x00007f46c00da000 nid=0x2be6 waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
>    Locked ownable synchronizers:
>         - None
>
> "AMShutdownThread" #517 daemon prio=5 os_prio=0 tid=0x00007f46b8059000 nid=0x6cfa runnable [0x00007f46a73af000]
>    java.lang.Thread.State: RUNNABLE
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:150)
>         at java.net.SocketInputStream.read(SocketInputStream.java:121)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>         - locked <0x00000000fe71a6e0> (a java.io.BufferedInputStream)
>         at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:703)
>         at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
>         at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1534)
>         - locked <0x00000000fe70bd68> (a sun.net.www.protocol.http.HttpURLConnection)
>         at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1439)
>         - locked <0x00000000fe70bd68> (a sun.net.www.protocol.http.HttpURLConnection)
>         at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
>         at com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:240)
>         at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147)
>         at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:226)
>         at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:162)
>         at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:237)
>         at com.sun.jersey.api.client.Client.handle(Client.java:648)
>         at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
>         at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
>         at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
>         at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:472)
>         at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:321)
>         at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:301)
>         at org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.handleEvents(ATSHistoryLoggingService.java:357)
>         at org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.serviceStop(ATSHistoryLoggingService.java:233)
>         - locked <0x00000000cd56b968> (a java.lang.Object)
>         at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>         - locked <0x00000000cd4042a0> (a java.lang.Object)
>         at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>         at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>         at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
>         at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
>         at org.apache.tez.dag.history.HistoryEventHandler.serviceStop(HistoryEventHandler.java:85)
>         at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>         - locked <0x00000000cd0cf878> (a java.lang.Object)
>         at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>         at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>         at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:65)
>         at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1938)
>         at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:2121)
>         - locked <0x00000000ccf30038> (a org.apache.tez.dag.app.DAGAppMaster)
>         at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>         - locked <0x00000000ccf301d0> (a java.lang.Object)
>         at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHandler$AMShutdownRunnable.run(DAGAppMaster.java:952)
>         at java.lang.Thread.run(Thread.java:745)
>
>    Locked ownable synchronizers:
>         - None
>
> "ContainerLauncher #31" #210 daemon prio=5 os_prio=0 tid=0x00007f46c42c0000 nid=0x1601 waiting on condition [0x00007f46a4f90000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00000000cd633740> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
>
> now the funny thing is i have one query that runs a "select count(*) from <table>" and the Tez session exits just fine. On the other hand we have some complex Tez queries and it's those that stick around. kinda strange why that is.
>
> i guess i'll bounce the timeline server and see what happens next.
>
> if you see other insights from that stack trace please let me know.
>
> Cheers!
> Stephen.
>
> On Tue, Mar 14, 2017 at 7:06 AM, Stephen Sprague <sprag...@gmail.com> wrote:
>
>> Thanks Gopal. lemme see what i can do with your insights and report back with my findings.
>>
>> Cheers,
>> Stephen.
>>
>> On Tue, Mar 14, 2017 at 6:19 AM, Gopal Vijayaraghavan <gop...@apache.org> wrote:
>>
>>> > Looking at the doc i thought this config setting would keep those Tez jobs from hanging around (tez.session.am.dag.submit.timeout.secs) but testing proved otherwise. It didn't seem to have any effect.
>>> > So i ask. How to force off those Tez jobs organically? Or is there perhaps something else i'm missing?
>>>
>>> A jstack of the Tez AM would be useful.
>>>
>>> My guess is that this is related to ATS.
>>>
>>> tez.yarn.ats.event.flush.timeout.millis=-1L;
>>>
>>> is the default, and if ATS is down for whatever reason, Tez queries will wait indefinitely to flush all events to ATS.
>>>
>>> You can probably set that to 600000L and see if the AMs disappear after 10 minutes.
>>>
>>> Before TEZ-1701, this was set to 3 seconds, which broke the UI when the ATS instance was temporarily unavailable.
>>>
>>> Cheers,
>>> Gopal
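
For anyone landing on this thread later: a minimal sketch of how the flush timeout discussed above might be pinned cluster-wide. It assumes the property is supplied through tez-site.xml, which is an assumption on my part; the thread does not say where the value was actually set. The property name and the 60000 ms value are taken from the messages above.

```xml
<!-- Sketch only: assumes the setting lives in tez-site.xml (not stated in the thread). -->
<property>
  <name>tez.yarn.ats.event.flush.timeout.millis</name>
  <!-- 60000 ms = 60 s, the value reported to work above.
       Per Gopal, the default of -1 waits indefinitely for ATS. -->
  <value>60000</value>
</property>
```

A finite value bounds how long the AM waits for ATS on shutdown. Gopal's note about TEZ-1701 suggests that a very short timeout (3 seconds) caused UI problems when ATS was briefly unavailable, so the value is a trade-off rather than something to minimize.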