Hey folks,

encountering a strange problem. With a certain Tez job we end up with this 
situation:

- AppMaster starts, launches a container (seen in logs)
- NodeManager spawns the container (seen in logs)
- nothing happens anymore, the AppMaster just sits an waits there forever
- the YARN UI shows 0 running container
- resource-manager and node-manager do not have any trace from the container 
except the it traversed to running state
- 2 container JVMS are running:

[root@cmaster ~]# ps -ef | grep TezChild
root      6144  5547  0 18:14 pts/2    00:00:00 grep TezChild
yarn     26964 28254  0 16:11 ?        00:00:00 /bin/bash -c 
/usr/local/sun-jdk/bin/java  -Xmx819m -server -Djava.net.preferIPv4Stack=true 
-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA 
-XX:+UseParallelGC -Dhadoop.metrics.log.level=WARN  -Dapple.awt.UIElement=true 
-Djava.awt.headless=true  -Dlog4j.configuration=tez-container-log4j.properties 
-Dyarn.app.container.log.dir=/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412235368760_0122_01_000002
 -Dtez.root.logger=INFO,CLA 
-Djava.io.tmpdir=/cdh_data/local-dir/usercache/qa/appcache/application_1412235368760_0122/container_1412235368760_0122_01_000002/tmp
 org.apache.tez.runtime.task.TezChild 192.168.179.151 60298 
container_1412235368760_0122_01_000002 application_1412235368760_0122 1 
1>/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412235368760_0122_01_000002/stdout
 
2>/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412235368760_0122_01_000002/stderr
 
yarn     27046 26964  1 16:11 ?        00:01:49 /usr/local/sun-jdk/bin/java 
-Xmx819m -server -Djava.net.preferIPv4Stack=true -XX:+PrintGCDetails 
-verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC 
-Dhadoop.metrics.log.level=WARN -Dapple.awt.UIElement=true 
-Djava.awt.headless=true -Dlog4j.configuration=tez-container-log4j.properties 
-Dyarn.app.container.log.dir=/cdh_data/hadoop-yarn/logs/application_1412235368760_0122/container_1412235368760_0122_01_000002
 -Dtez.root.logger=INFO,CLA 
-Djava.io.tmpdir=/cdh_data/local-dir/usercache/qa/appcache/application_1412235368760_0122/container_1412235368760_0122_01_000002/tmp
 org.apache.tez.runtime.task.TezChild 192.168.179.151 60298 
container_1412235368760_0122_01_000002 application_1412235368760_0122 1
[root@cmaster ~]# jps
28254 NodeManager
13597 -- process information unavailable
5757 -- process information unavailable
28723 start.jar
17192 HMaster
27261 JournalNode
27046 TezChild
6154 Jps
17043 QuorumPeerMain
27418 NameNode
28057 JobHistoryServer
27716 ResourceManager
27094 DataNode

Looks like there is one TezChild which is a child process from the NodeManager. 
We cannot get any thread dump from this one. 
And the 2nd TezChild is a child process from the first TezChild.

Find the thread dump of the 2nd TezChild at the bottom of this mail.
Any idea whats going on here ?

best
Johannes

-------------------------------------
Attaching to process ID 27046, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 24.55-b03
Deadlock Detection:

No deadlocks found.

Thread 4399: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may 
be imprecise)
 - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) 
@bci=20, line=226 (Compiled frame)
 - 
java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(java.util.concurrent.SynchronousQueue$TransferStack$SNode,
 boolean, long) @bci=174, line=460 (Compiled frame)
 - 
java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.lang.Object, 
boolean, long) @bci=102, line=359 (Compiled frame)
 - java.util.concurrent.SynchronousQueue.poll(long, 
java.util.concurrent.TimeUnit) @bci=11, line=942 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=141, line=1068 
(Compiled frame)
 - 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
 @bci=26, line=1130 (Compiled frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 
(Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27089: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
line=186 (Interpreted frame)
 - 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() 
@bci=42, line=2043 (Interpreted frame)
 - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 
(Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 
(Interpreted frame)
 - 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
 @bci=26, line=1130 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 
(Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27088: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
line=186 (Interpreted frame)
 - 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() 
@bci=42, line=2043 (Interpreted frame)
 - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 
(Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 
(Interpreted frame)
 - 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
 @bci=26, line=1130 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 
(Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27087: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
line=186 (Interpreted frame)
 - 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() 
@bci=42, line=2043 (Interpreted frame)
 - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 
(Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 
(Interpreted frame)
 - 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
 @bci=26, line=1130 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 
(Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27086: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
line=186 (Interpreted frame)
 - 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() 
@bci=42, line=2043 (Interpreted frame)
 - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 
(Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 
(Interpreted frame)
 - 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
 @bci=26, line=1130 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 
(Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27084: (state = BLOCKED)
 - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
 - org.apache.hadoop.hdfs.PeerCache.run() @bci=41, line=245 (Interpreted frame)
 - 
org.apache.hadoop.hdfs.PeerCache.access$000(org.apache.hadoop.hdfs.PeerCache) 
@bci=1, line=41 (Interpreted frame)
 - org.apache.hadoop.hdfs.PeerCache$1.run() @bci=4, line=119 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27082: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
line=186 (Interpreted frame)
 - 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() 
@bci=42, line=2043 (Interpreted frame)
 - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 
(Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor.getTask() @bci=156, line=1068 
(Interpreted frame)
 - 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
 @bci=26, line=1130 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 
(Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27080: (state = BLOCKED)
 - java.lang.Thread.sleep(long) @bci=0 (Interpreted frame)
 - org.apache.hadoop.hdfs.LeaseRenewer.run(int) @bci=429, line=438 (Interpreted 
frame)
 - 
org.apache.hadoop.hdfs.LeaseRenewer.access$700(org.apache.hadoop.hdfs.LeaseRenewer,
 int) @bci=2, line=71 (Interpreted frame)
 - org.apache.hadoop.hdfs.LeaseRenewer$1.run() @bci=69, line=298 (Interpreted 
frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27079: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be 
imprecise)
 - org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run() @bci=265, line=502 
(Interpreted frame)


Thread 27077: (state = IN_NATIVE)
 - org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(int, 
org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=0 (Interpreted frame)
 - org.apache.hadoop.net.unix.DomainSocketWatcher.access$800(int, 
org.apache.hadoop.net.unix.DomainSocketWatcher$FdSet) @bci=2, line=52 
(Interpreted frame)
 - org.apache.hadoop.net.unix.DomainSocketWatcher$1.run() @bci=551, line=457 
(Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27076: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
line=186 (Interpreted frame)
 - 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() 
@bci=42, line=2043 (Interpreted frame)
 - java.util.concurrent.LinkedBlockingQueue.take() @bci=29, line=442 
(Interpreted frame)
 - org.apache.tez.runtime.LogicalIOProcessorRuntimeTask$1.run() @bci=26, 
line=668 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27071: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may 
be imprecise)
 - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) 
@bci=20, line=226 (Compiled frame)
 - 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(long,
 java.util.concurrent.TimeUnit) @bci=106, line=2176 (Compiled frame)
 - org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call() @bci=61, 
line=183 (Compiled frame)
 - org.apache.tez.runtime.task.TaskReporter$HeartbeatCallable.call() @bci=1, 
line=118 (Interpreted frame)
 - java.util.concurrent.FutureTask.run() @bci=42, line=262 (Interpreted frame)
 - 
java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker)
 @bci=95, line=1145 (Interpreted frame)
 - java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 
(Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)


Thread 27065: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be 
imprecise)
 - org.apache.hadoop.ipc.Client$Connection.waitForWork() @bci=59, line=902 
(Compiled frame)
 - org.apache.hadoop.ipc.Client$Connection.run() @bci=55, line=947 (Compiled 
frame)


Thread 27064: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
line=186 (Interpreted frame)
 - 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await() 
@bci=42, line=2043 (Interpreted frame)
 - some user code


Thread 27063: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be 
imprecise)
 - java.util.TimerThread.mainLoop() @bci=201, line=552 (Interpreted frame)
 - java.util.TimerThread.run() @bci=1, line=505 (Interpreted frame)


Thread 27055: (state = BLOCKED)


Thread 27054: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Interpreted frame)
 - java.lang.ref.ReferenceQueue.remove(long) @bci=44, line=135 (Compiled frame)
 - java.lang.ref.ReferenceQueue.remove() @bci=2, line=151 (Compiled frame)
 - java.lang.ref.Finalizer$FinalizerThread.run() @bci=16, line=189 (Compiled 
frame)


Thread 27053: (state = BLOCKED)
 - java.lang.Object.wait(long) @bci=0 (Compiled frame; information may be 
imprecise)
 - java.lang.Object.wait() @bci=2, line=503 (Interpreted frame)
 - java.lang.ref.Reference$ReferenceHandler.run() @bci=46, line=133 
(Interpreted frame)


Thread 27047: (state = BLOCKED)
 - sun.misc.Unsafe.park(boolean, long) @bci=0 (Interpreted frame)
 - java.util.concurrent.locks.LockSupport.park(java.lang.Object) @bci=14, 
line=186 (Interpreted frame)
 - java.util.concurrent.FutureTask.awaitDone(boolean, long) @bci=165, line=425 
(Interpreted frame)
 - java.util.concurrent.FutureTask.get() @bci=13, line=187 (Interpreted frame)
 - org.apache.tez.runtime.task.TezTaskRunner.run() @bci=37, line=96 
(Interpreted frame)
 - org.apache.tez.runtime.task.TezChild.run() @bci=372, line=218 (Interpreted 
frame)
 - org.apache.tez.runtime.task.TezChild.main(java.lang.String[]) @bci=116, 
line=433 (Interpreted frame)


Reply via email to