Task hang problem

2015-12-29 Thread Darren Govoni


Hi,

  I've had this nagging problem where a task hangs and then the
entire job hangs with it. I'm using PySpark on Spark 1.5.1.
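
One possible stopgap while debugging, not a fix: with speculation enabled
the scheduler can re-launch a straggling task on another executor. A minimal
pyspark sketch; the config keys are real Spark settings, the values and app
name are only illustrative:

    from pyspark import SparkConf, SparkContext

    # Hedged sketch: ask the scheduler to re-launch suspiciously slow tasks.
    conf = (SparkConf()
            .setAppName("hang-debug")                   # hypothetical app name
            .set("spark.speculation", "true")           # speculatively re-run stragglers
            .set("spark.speculation.multiplier", "2"))  # "slow" = 2x median task time
    sc = SparkContext(conf=conf)

Speculation only helps if a retry elsewhere can succeed, so it works around
a flaky straggler, not a deterministic hang.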



The job output looks like this, and hangs after the last task:



..

15/12/29 17:00:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in
memory on 10.65.143.174:34385 (size: 5.8 KB, free: 2.1 GB)

15/12/29 17:00:39 INFO TaskSetManager: Finished task 15.0 in stage
0.0 (TID 15) in 11668 ms on 10.65.143.174 (29/32)

15/12/29 17:00:39 INFO TaskSetManager: Finished task 23.0 in stage
0.0 (TID 23) in 11684 ms on 10.65.143.174 (30/32)

15/12/29 17:00:39 INFO TaskSetManager: Finished task 7.0 in stage
0.0 (TID 7) in 11717 ms on 10.65.143.174 (31/32)

{nothing here for a while, ~6 mins}





Here is the task status from the UI:

Index  ID  Attempt  Status   Locality Level  Executor ID / Host  Launch Time          Duration
31     31  0        RUNNING  PROCESS_LOCAL   2 / 10.65.143.174   2015/12/29 17:00:28  6.8 min
(remaining task metric columns: 0 ms, 0 ms, 60 ms, 0 ms, 0 ms, 0.0 B)



Here is the log from executor 2 on 10.65.143.174. I never see task 31
reach the executor. Any ideas?



.

15/12/29 17:00:38 INFO TorrentBroadcast: Started reading broadcast
variable 0

15/12/29 17:00:38 INFO MemoryStore: ensureFreeSpace(5979) called
with curMem=0, maxMem=2223023063

15/12/29 17:00:38 INFO MemoryStore: Block broadcast_0_piece0 stored
as bytes in memory (estimated size 5.8 KB, free 2.1 GB)

15/12/29 17:00:38 INFO TorrentBroadcast: Reading broadcast variable
0 took 208 ms

15/12/29 17:00:38 INFO MemoryStore: ensureFreeSpace(8544) called
with curMem=5979, maxMem=2223023063

15/12/29 17:00:38 INFO MemoryStore: Block broadcast_0 stored as
values in memory (estimated size 8.3 KB, free 2.1 GB)

15/12/29 17:00:39 INFO PythonRunner: Times: total = 913, boot = 747,
init = 166, finish = 0

15/12/29 17:00:39 INFO Executor: Finished task 15.0 in stage 0.0
(TID 15). 967 bytes result sent to driver

15/12/29 17:00:39 INFO PythonRunner: Times: total = 955, boot = 735,
init = 220, finish = 0

15/12/29 17:00:39 INFO Executor: Finished task 23.0 in stage 0.0
(TID 23). 967 bytes result sent to driver

15/12/29 17:00:39 INFO PythonRunner: Times: total = 970, boot = 812,
init = 158, finish = 0

15/12/29 17:00:39 INFO Executor: Finished task 7.0 in stage 0.0 (TID
7). 967 bytes result sent to driver

[root@ip-10-65-143-174 2]$ 


Sent from my Verizon Wireless 4G LTE smartphone

Re: Task hang problem

2015-12-29 Thread Ted Yu
Can you log onto 10.65.143.174, find task 31, and take a stack trace?
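
For example, something along these lines run on that node; a sketch,
assuming the executor shows up in jps as the usual
CoarseGrainedExecutorBackend process and the JDK's jps/jstack tools are
on the PATH:

    import subprocess

    # Find the executor JVM with jps, then dump all of its threads with jstack.
    for line in subprocess.check_output(["jps", "-l"]).decode().splitlines():
        pid, _, name = line.partition(" ")
        if "CoarseGrainedExecutorBackend" in name:
            print(subprocess.check_output(["jstack", pid]).decode())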

Thanks

On Tue, Dec 29, 2015 at 9:19 AM, Darren Govoni  wrote:

> Hi,
>   I've had this nagging problem where a task will hang and the entire job
> hangs. Using pyspark. Spark 1.5.1
>
> [snip]


Re: Task hang problem

2015-12-29 Thread Darren Govoni

Here's the executor trace.

Thread 58: Executor task launch worker-3 (RUNNABLE)
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.read(SocketInputStream.java:152)
java.net.SocketInputStream.read(SocketInputStream.java:122)
java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
java.io.BufferedInputStream.read(BufferedInputStream.java:254)
java.io.DataInputStream.readInt(DataInputStream.java:387)
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:139)
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
org.apache.spark.scheduler.Task.run(Task.scala:88)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

Thread 41: BLOCK_MANAGER cleanup timer (WAITING)
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:503)
java.util.TimerThread.mainLoop(Timer.java:526)
java.util.TimerThread.run(Timer.java:505)

Thread 42: BROADCAST_VARS cleanup timer (WAITING)
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:503)
java.util.TimerThread.mainLoop(Timer.java:526)
java.util.TimerThread.run(Timer.java:505)

Thread 54: driver-heartbeater (TIMED_WAITING)
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1090)
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

Thread 3: Finalizer (WAITING)
java.lang.Object.wait(Native Method)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:209)

Thread 25: ForkJoinPool-3-worker-15 (WAITING)
sun.misc.Unsafe.park(Native Method)
scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Thread 35: Hashed wheel timer #2 (TIMED_WAITING)
java.lang.Thread.sleep(Native Method)
org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:483)
org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:392)
org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
java.lang.Thread.run(Thread.java:745)

Thread 68: Idle Worker Monitor for /usr/bin/python2.7 (TIMED_WAITING)
java.lang.Thread.sleep(Native Method)
org.apache.spark.api.python.PythonWorkerFactory$MonitorThread.run(PythonWorkerFactory.scala:229)

Thread 1: main (WAITING)
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
java.util.concurrent.CountDownLatch.await(CountDownLatch.java:236)
akka.actor.ActorSystemImpl$TerminationCallbacks.ready(ActorSystem.scala:819)
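
For what it's worth, the RUNNABLE frame above (DataInputStream.readInt via
PythonRDD.scala:139) means the JVM side of the task is blocked waiting for
the Python worker process to write its next message on the local socket,
which suggests task 31 did reach the executor and it is the Python side
that is stuck. A hedged sketch of one way to see where: register
faulthandler inside the partition function so `kill -USR1 <worker pid>` on
the node dumps the worker's Python traceback (assumes the faulthandler
backport is installed for Python 2.7, since it is stdlib only from 3.3;
do_work and rdd are placeholders for the job's own code):

    import faulthandler
    import signal

    def process_partition(rows):
        # Dump this worker's Python traceback to stderr on SIGUSR1.
        # (faulthandler is a pip backport on Python 2.7.)
        faulthandler.register(signal.SIGUSR1)
        for row in rows:
            yield do_work(row)  # do_work: placeholder for the job's logic

    result = rdd.mapPartitions(process_partition).collect()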