FWIW.  setting:

tez.yarn.ats.event.flush.timeout.millis=60000;

seems to have worked in our case.
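
for anyone who finds this thread later, here is how i understand the timeout behaving. This is only a rough Java sketch under my own assumptions, not the actual Tez code: the class FlushTimeoutSketch, the EventSink interface and the flushOnStop method are names made up for illustration. A negative timeout means "keep flushing until the queue is empty" (what Gopal describes for the default of -1); otherwise the loop gives up once the timeout has elapsed.

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class FlushTimeoutSketch {

    /** Hypothetical stand-in for posting one history event to the timeline server. */
    interface EventSink {
        void post(String event);
    }

    /**
     * Drains pending events at shutdown.
     * timeoutMillis < 0 : no deadline, loop until the queue is empty.
     * timeoutMillis >= 0: stop once that many milliseconds have elapsed.
     * Returns the number of events that were left unflushed.
     */
    static int flushOnStop(Queue<String> pending, EventSink sink, long timeoutMillis) {
        final long start = System.nanoTime();
        while (!pending.isEmpty()) {
            if (timeoutMillis >= 0) {
                long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
                if (elapsedMillis >= timeoutMillis) {
                    break; // deadline reached: give up on whatever is still queued
                }
            }
            sink.post(pending.poll());
        }
        return pending.size();
    }

    public static void main(String[] args) {
        Queue<String> pending = new ArrayDeque<>();
        for (int i = 0; i < 100; i++) {
            pending.add("event-" + i);
        }
        // A deliberately slow sink, imitating a timeline server that is struggling.
        EventSink slowSink = event -> {
            try {
                Thread.sleep(20);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        };
        int left = flushOnStop(pending, slowSink, 100L);
        System.out.println("events left unflushed after timeout: " + left);
    }
}
```

note that the sketch only checks the clock between posts, so a single post that is blocked on the socket (like the socketRead0 frame in the jstack quoted below) would not be cut short by it.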

thanks again Gopal.

On Tue, Mar 14, 2017 at 11:42 AM, Stephen Sprague <sprag...@gmail.com>
wrote:

> yeah. looks related to the timeline-service alright - i think.
>
> here's the jstack output.
>
>
> 2017-03-14 11:27:46
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.25-b02 mixed mode):
>
> "Attach Listener" #536 daemon prio=9 os_prio=0 tid=0x00007f46c00da000 nid=0x2be6 waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
>    Locked ownable synchronizers:
>         - None
>
> "AMShutdownThread" #517 daemon prio=5 os_prio=0 tid=0x00007f46b8059000 nid=0x6cfa runnable [0x00007f46a73af000]
>    java.lang.Thread.State: RUNNABLE
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:150)
>         at java.net.SocketInputStream.read(SocketInputStream.java:121)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
>         - locked <0x00000000fe71a6e0> (a java.io.BufferedInputStream)
>         at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:703)
>         at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:647)
>         at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1534)
>         - locked <0x00000000fe70bd68> (a sun.net.www.protocol.http.HttpURLConnection)
>         at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1439)
>         - locked <0x00000000fe70bd68> (a sun.net.www.protocol.http.HttpURLConnection)
>         at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
>         at com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(URLConnectionClientHandler.java:240)
>         at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:147)
>         at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run(TimelineClientImpl.java:226)
>         at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(TimelineClientImpl.java:162)
>         at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(TimelineClientImpl.java:237)
>         at com.sun.jersey.api.client.Client.handle(Client.java:648)
>         at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
>         at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
>         at com.sun.jersey.api.client.WebResource$Builder.post(WebResource.java:563)
>         at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(TimelineClientImpl.java:472)
>         at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(TimelineClientImpl.java:321)
>         at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(TimelineClientImpl.java:301)
>         at org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.handleEvents(ATSHistoryLoggingService.java:357)
>         at org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.serviceStop(ATSHistoryLoggingService.java:233)
>         - locked <0x00000000cd56b968> (a java.lang.Object)
>         at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>         - locked <0x00000000cd4042a0> (a java.lang.Object)
>         at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>         at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>         at org.apache.hadoop.service.CompositeService.stop(CompositeService.java:157)
>         at org.apache.hadoop.service.CompositeService.serviceStop(CompositeService.java:131)
>         at org.apache.tez.dag.history.HistoryEventHandler.serviceStop(HistoryEventHandler.java:85)
>         at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>         - locked <0x00000000cd0cf878> (a java.lang.Object)
>         at org.apache.hadoop.service.ServiceOperations.stop(ServiceOperations.java:52)
>         at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:80)
>         at org.apache.hadoop.service.ServiceOperations.stopQuietly(ServiceOperations.java:65)
>         at org.apache.tez.dag.app.DAGAppMaster.stopServices(DAGAppMaster.java:1938)
>         at org.apache.tez.dag.app.DAGAppMaster.serviceStop(DAGAppMaster.java:2121)
>         - locked <0x00000000ccf30038> (a org.apache.tez.dag.app.DAGAppMaster)
>         at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
>         - locked <0x00000000ccf301d0> (a java.lang.Object)
>         at org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHandler$AMShutdownRunnable.run(DAGAppMaster.java:952)
>         at java.lang.Thread.run(Thread.java:745)
>
>    Locked ownable synchronizers:
>         - None
>
> "ContainerLauncher #31" #210 daemon prio=5 os_prio=0 tid=0x00007f46c42c0000 nid=0x1601 waiting on condition [0x00007f46a4f90000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x00000000cd633740> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
>
>
>
> now the funny thing is i have one query that runs a "select count(*) from
> <table>" and the Tez session exits just fine.  On the other hand we have
> some complex Tez queries and it's those that stick around.  kinda strange
> why that is.
>
> i guess i'll bounce the timeline server and see what happens next.
>
> if you see other insights from that stack trace please let me know.
>
> Cheers!
> Stephen.
>
> On Tue, Mar 14, 2017 at 7:06 AM, Stephen Sprague <sprag...@gmail.com>
> wrote:
>
>> Thanks Gopal.   lemme see what i can do with your insights and report
>> back with my findings.
>>
>> Cheers,
>> Stephen.
>>
>> On Tue, Mar 14, 2017 at 6:19 AM, Gopal Vijayaraghavan <gop...@apache.org>
>> wrote:
>>
>>> > Looking at the doc i thought this config setting would influence those
>>> Tez jobs from hanging around (tez.session.am.dag.submit.timeout.secs)
>>> but testing proved otherwise. It didn't seem to have any effect.
>>> > So i ask. How to force off those Tez jobs organically? Or is there
>>> perhaps something else i'm missing?
>>>
>>> A jstack of the Tez AM would be useful.
>>>
>>> My guess is that this is related to ATS.
>>>
>>> tez.yarn.ats.event.flush.timeout.millis=-1L;
>>>
>>> Is the default and if ATS is down for whatever reason, Tez queries will
>>> wait indefinitely to flush all events to ATS.
>>>
>>> You can probably set that to 600000L and see if the AMs disappear after
>>> 10 minutes.
>>>
>>> Before TEZ-1701, this was set to 3 seconds, which broke the UI when the
>>> ATS instance was temporarily unavailable.
>>>
>>> Cheers,
>>> Gopal
>>>
>>>
>>>
>>
>
