Hi Leo,

Did you run Spark on YARN or Mesos?
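On the black-box point: the driver also serves a web UI (on port 4040 in 0.8+)
with per-stage and per-task progress, and you can register a SparkListener on
the SparkContext to log task events yourself. A rough sketch against the
0.8-era API (event and field names may differ in other versions):

    import org.apache.spark.scheduler._

    // Logs every finished task with its host, duration and end reason,
    // which makes it easy to spot one machine hogging or failing tasks.
    class TaskLogListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
        val info = taskEnd.taskInfo
        println("task " + info.taskId + " on " + info.host + " took " +
          info.duration + " ms, reason: " + taskEnd.reason)
      }
    }

    // Register it on the SparkContext before kicking off the job:
    // sc.addSparkListener(new TaskLogListener)

Also, the NullPointerException in revertPartialWrites on the machine that is
short on disk looks consistent with a shuffle write failing there.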
On Wed, Dec 25, 2013 at 6:58 PM, [email protected] <[email protected]> wrote:

> hi all:
> I ran an example three times. It just reads data from HDFS, does a map and
> a reduce, then writes back to HDFS.
> The first and second runs worked well, reading almost 7 GB of data and
> finishing in 15 minutes, but there was a problem when I ran it the third
> time: one machine in my cluster is low on disk space.
> The job began at 17:11:15 but was never able to finish; I waited an hour
> and then killed it. Here are the logs:
>
> LogA (from the sick machine):
> .....
> 13/12/25 17:13:56 INFO Executor: Its epoch is 0
> 13/12/25 17:13:56 ERROR Executor: Exception in task ID 26
> java.lang.NullPointerException
>         at org.apache.spark.storage.DiskStore$DiskBlockObjectWriter.revertPartialWrites(DiskStore.scala:99)
>         at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$run$2.apply(ShuffleMapTask.scala:175)
>         at org.apache.spark.scheduler.ShuffleMapTask$$anonfun$run$2.apply(ShuffleMapTask.scala:175)
>         at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:34)
>         at scala.collection.mutable.ArrayOps.foreach(ArrayOps.scala:38)
>         at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:175)
>         at org.apache.spark.scheduler.ShuffleMapTask.run(ShuffleMapTask.scala:88)
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:158)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:722)
> END
>
> LogB (a healthy machine):
> ......
> 13/12/25 17:12:35 INFO Executor: Serialized size of result for 54 is 817
> 13/12/25 17:12:35 INFO Executor: Finished task ID 54
> [GC 4007686K->7782K(15073280K), 0.0287070 secs]
> END
>
> LogC (the master and a worker):
> .......
> [GC 4907439K->110393K(15457280K), 0.0533800 secs]
> 13/12/25 17:13:23 INFO Executor: Serialized size of result for 203 is 817
> 13/12/25 17:13:23 INFO Executor: Finished task ID 203
> 13/12/25 17:13:24 INFO Executor: Serialized size of result for 202 is 817
> 13/12/25 17:13:24 INFO Executor: Finished task ID 202
> END
>
> I don't understand why the job never shuts down: no new log messages were
> written after the job had been running for about 2 minutes.
> Why was one machine assigned so many more tasks than the others?
> How can I see the job and task status while running a big job? It looks
> like a black box...
>
> ------------------------------
> [email protected]
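For reference, a job of the shape you describe takes only a few lines against
the 0.8 API; the paths and the key function below are made up, since the
original code isn't shown:

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // implicits that add reduceByKey

    object HdfsMapReduce {
      def main(args: Array[String]) {
        val sc = new SparkContext("spark://master:7077", "HdfsMapReduce")
        sc.textFile("hdfs://namenode:9000/input")         // hypothetical input path
          .map(line => (line.split("\t")(0), 1L))         // hypothetical key extraction
          .reduceByKey(_ + _)
          .saveAsTextFile("hdfs://namenode:9000/output")  // hypothetical output path
        sc.stop()
      }
    }

If a run with that shape hangs only when one node is out of disk, the shuffle
files written during the map side would be the first thing I'd check.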
