I have recently been testing the conversion of an existing Pig M/R application 
to run on Tez.  I’ve had to work around a few issues, but the performance 
improvement is significant (~25 minutes on M/R vs. ~5 minutes on Tez).

Currently the problem I’m running into is that occasionally when processing a 
DAG the application hangs.  When this happens, I find the following in the 
syslog for that dag:

2016-03-21 16:39:01,643 [INFO] [DelayedContainerManager] 
|rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay 
expired or is new. Releasing container, 
containerId=container_e11_1437886552023_169758_01_000822, 
containerExpiryTime=1458603541415, idleTimeout=5000, taskRequestsCount=0, 
heldContainers=112, delayedContainers=27, isNew=false
2016-03-21 16:39:01,825 [INFO] [DelayedContainerManager] 
|rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay 
expired or is new. Releasing container, 
containerId=container_e11_1437886552023_169758_01_000824, 
containerExpiryTime=1458603541692, idleTimeout=5000, taskRequestsCount=0, 
heldContainers=111, delayedContainers=26, isNew=false
2016-03-21 16:39:01,990 [INFO] [Socket Reader #1 for port 53324] |ipc.Server|: 
Socket Reader #1 for port 53324: readAndProcess from client 10.102.173.86 threw 
exception [java.io.IOException: Connection reset by peer]
java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:197)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at org.apache.hadoop.ipc.Server.channelRead(Server.java:2593)
        at org.apache.hadoop.ipc.Server.access$2800(Server.java:135)
        at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1471)
        at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:762)
        at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:636)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:607)
2016-03-21 16:39:02,032 [INFO] [DelayedContainerManager] 
|rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay 
expired or is new. Releasing container, 
containerId=container_e11_1437886552023_169758_01_000811, 
containerExpiryTime=1458603541828, idleTimeout=5000, taskRequestsCount=0, 
heldContainers=110, delayedContainers=25, isNew=false
2016-03-21 16:39:02,266 [INFO] [DelayedContainerManager] 
|rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay 
expired or is new. Releasing container, 
containerId=container_e11_1437886552023_169758_01_000963, 
containerExpiryTime=1458603542166, idleTimeout=5000, taskRequestsCount=0, 
heldContainers=109, delayedContainers=24, isNew=false
2016-03-21 16:39:02,305 [INFO] [DelayedContainerManager] 
|rm.YarnTaskSchedulerService|: No taskRequests. Container's idle timeout delay 
expired or is new. Releasing container, 
containerId=container_e11_1437886552023_169758_01_000881, 
containerExpiryTime=1458603542119, idleTimeout=5000, taskRequestsCount=0, 
heldContainers=108, delayedContainers=23, isNew=false


It continues logging several more ‘Releasing container’ messages, then soon 
stops all logging and stops submitting tasks.  I also do not see any errors or 
exceptions in the container logs for the host identified in the IOException.  
Is there some other place I should look on that host for an indication of 
what’s going wrong?

Any thoughts on what’s going on here?  Is this a state from which an 
application should be able to recover?  We do not see the application hang when 
running on M/R.
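
For reference, the idleTimeout=5000 in the messages above matches what I 
believe is the default minimum idle-release timeout for Tez container reuse.  
We are running with the defaults; to the best of my understanding (property 
names taken from my reading of TezConfiguration — please correct me if I have 
them wrong), the relevant knobs in tez-site.xml would be:

```xml
<!-- My assumption: these container-reuse settings govern the
     "Container's idle timeout delay expired" releases in the log.
     Values shown are what I understand the defaults to be. -->
<property>
  <name>tez.am.container.reuse.enabled</name>
  <value>true</value>
</property>
<property>
  <name>tez.am.container.idle.release-timeout-min.millis</name>
  <value>5000</value>
</property>
<property>
  <name>tez.am.container.idle.release-timeout-max.millis</name>
  <value>10000</value>
</property>
```

So the container releases themselves look like normal idle reuse behavior to 
me; it’s the subsequent silence from the AM that I can’t explain.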

Any insights most appreciated,
Kurt