yeah, it does look related to the timeline service - the AMShutdownThread below is stuck in ATSHistoryLoggingService.serviceStop(), blocked on a socket read while posting events to the timeline server.

here's the jstack output.
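(fwiw, that's just a plain thread dump of the running AM - something along the lines of "jstack -l <am_pid>" run as the container's user; the pid placeholder is mine.)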


2017-03-14 11:27:46
Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.25-b02 mixed mode):

"Attach Listener" #536 daemon prio=9 os_prio=0 tid=0x00007f46c00da000
nid=0x2be6 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

   Locked ownable synchronizers:
        - None

"AMShutdownThread" #517 daemon prio=5 os_prio=0 tid=0x00007f46b8059000
nid=0x6cfa runnable [0x00007f46a73af000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:150)
        at java.net.SocketInputStream.read(SocketInputStream.java:121)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
        - locked <0x00000000fe71a6e0> (a java.io.BufferedInputStream)
        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:703)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
        at
sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1534)
        - locked <0x00000000fe70bd68> (a
sun.net.www.protocol.http.HttpURLConnection)
        at
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1439)
        - locked <0x00000000fe70bd68> (a
sun.net.www.protocol.http.HttpURLConnection)
        at
java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
        at
com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:240)
        at
com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147)
        at
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:226)
        at
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:162)
        at
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:237)
        at com.sun.jersey.api.client.Client.handle(Client.java:648)
        at
com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
        at
com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
        at
com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
        at
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:472)
        at
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:321)
        at
org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:301)
        at
org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.handleEvents(ATSHistoryLoggingService.java:357)
        at
org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.serviceStop(ATSHistoryLoggingService.java:233)
        - locked <0x00000000cd56b968> (a java.lang.Object)
        at
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x00000000cd4042a0> (a java.lang.Object)
        at
org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at
org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at
org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
        at
org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
        at
org.apache.tez.dag.history.HistoryEventHandler.serviceStop(HistoryEventHandler.java:85)
        at
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x00000000cd0cf878> (a java.lang.Object)
        at
org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
        at
org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
        at
org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:65)
        at
org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1938)
        at
org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:2121)
        - locked <0x00000000ccf30038> (a
org.apache.tez.dag.app.DAGAppMaster)
        at
org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
        - locked <0x00000000ccf301d0> (a java.lang.Object)
        at
org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHandler$AMShutdownRunnable.run(DAGAppMaster.java:952)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - None

"ContainerLauncher #31" #210 daemon prio=5 os_prio=0 tid=0x00007f46c42c0000
nid=0x1601 waiting on condition [0x00007f46a4f90000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000cd633740> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at
java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)



now the funny thing is i have one query that runs a "select count(*) from
<table>" and its Tez session exits just fine.  on the other hand we have
some more complex Tez queries, and it's those that stick around.  kinda
strange why that is.

i guess i'll bounce the timeline server and see what happens next.
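(for the record, here's roughly what i plan to try. the flush-timeout
override is straight from your mail - i'm assuming a tez.* override set in
the hive session actually makes it into the Tez AM's config when the
session launches, which i haven't verified on this cluster:

    -- in the hive session, before the Tez session is launched (10 minutes)
    set tez.yarn.ats.event.flush.timeout.millis=600000;

and for bouncing the timeline server itself, the stock hadoop 2.x daemon
script - exact path and service user will vary by install:

    yarn-daemon.sh stop timelineserver
    yarn-daemon.sh start timelineserver
)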

if you see any other insights in that stack trace, please let me know.

Cheers!
Stephen.

On Tue, Mar 14, 2017 at 7:06 AM, Stephen Sprague <sprag...@gmail.com> wrote:

> Thanks Gopal.   lemme see what i can do with your insights and report back
> with my findings.
>
> Cheers,
> Stephen.
>
> On Tue, Mar 14, 2017 at 6:19 AM, Gopal Vijayaraghavan <gop...@apache.org>
> wrote:
>
>> > Looking at the doc i thought this config setting would keep those
>> Tez jobs from hanging around (tez.session.am.dag.submit.timeout.secs),
>> but testing proved otherwise. It didn't seem to have any effect.
>> > So i ask. How to force off those Tez jobs organically? Or is there
>> perhaps something else i'm missing?
>>
>> A jstack of the Tez AM would be useful.
>>
>> My guess is that this is related to ATS.
>>
>> tez.yarn.ats.event.flush.timeout.millis=-1L;
>>
>> That is the default; if ATS is down for whatever reason, Tez queries will
>> wait indefinitely to flush all events to ATS.
>>
>> You can probably set that to 600000L and see if the AMs disappear after
>> 10 minutes.
>>
>> Before TEZ-1701, this was set to 3 seconds, which broke the UI when the
>> ATS instance was temporarily unavailable.
>>
>> Cheers,
>> Gopal
>>
>>
>>
>
