Re: Need Help Diagnosing/operating/tuning
You should check why the executor is being killed; as soon as it's killed you can get all kinds of strange exceptions. Either give your executors more memory (4G is rather small for Spark), or try to decrease your input, or split it into more partitions in the input format. 23G in LZO might expand to several times that in memory — it depends on your format.

In general, each executor has 4G of memory, and only part of it is used for caching/shuffling (see the Spark configuration for the various fraction parameters). Divide that memory by the number of cores in each executor and you get an approximate upper bound on partition size. You can also run the arithmetic the other way, from the size of a partition to the memory needed by each executor.

There is no point in making 300 retries — there is no magic in Spark. If it fails after 3 retries, it will fail. The UI metrics can give you hints regarding partition size, etc.

On 23 November 2015 at 03:30, Jeremy Davis wrote:

> It seems like the problem is related to —executor-cores. Is there possibly some sort of race condition when using multiple cores per executor?
>
> On Nov 22, 2015, at 12:38 PM, Jeremy Davis wrote:
>
> Hello,
> I'm at a loss trying to diagnose why my Spark job is failing (it works fine on small data). It is failing during the repartition, or on the subsequent steps, which then seem to fail and fall back to repartitioning. I've tried adjusting every parameter I can find, but have had no success. Input is only 23GB of LZO (probably 8x compression), and I've verified all files are valid (not corrupted).
> I've tried more and less of: memory, partitions, executors, cores...
> I've set maxFailures up to 300.
> Setting a 4GB heap usually makes it through repartitioning, but fails on subsequent steps (sometimes being killed for running past memory limits). Larger heaps usually don't even make it through the first repartition due to all kinds of weird errors that look like read errors...
> I'm at a loss on how to debug this thing.
> Is there a tutorial somewhere?
>
> ——
>
> Spark 1.4.1
> Java 7
> Cluster has 3TB of memory, and 400 cores.
>
> [quoted stack traces trimmed]
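To make the partition-size arithmetic above concrete, here is a minimal sketch. All the numbers are assumptions for illustration — the LZO expansion factor, the usable heap fraction, and the core count are placeholders to plug your own values into, not measurements:

```python
# Rough sketch of the arithmetic described above: memory available per
# concurrently running task vs. the approximate in-memory size of one
# decompressed partition. Expansion factor and fractions are guesses.

def partition_size_gb(input_gb, expansion_factor, num_partitions):
    """Approximate in-memory size of one partition after decompression."""
    return input_gb * expansion_factor / num_partitions

def memory_per_task_gb(executor_mem_gb, usable_fraction, cores_per_executor):
    """Memory each concurrent task can use, after Spark's internal
    fraction parameters carve storage/shuffle space out of the heap."""
    return executor_mem_gb * usable_fraction / cores_per_executor

# Assumed: 23 GB LZO input, 8x expansion, 4 GB executors, 4 cores each,
# and roughly half the heap usable for task data.
per_task = memory_per_task_gb(4, 0.5, 4)   # 0.5 GB per task
for n in (200, 400, 800):
    per_part = partition_size_gb(23, 8, n)
    print(n, round(per_part, 2), per_part < per_task)
# -> 200 0.92 False
#    400 0.46 True
#    800 0.23 True
```

Under these assumed numbers, partitions only fit within a task's share of the heap once the input is split into several hundred partitions — which is the "more partitions or more memory" trade-off in words above.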
Re: Need Help Diagnosing/operating/tuning
It seems like the problem is related to —executor-cores. Is there possibly some sort of race condition when using multiple cores per executor?

On Nov 22, 2015, at 12:38 PM, Jeremy Davis wrote:

> Hello,
> I'm at a loss trying to diagnose why my Spark job is failing (it works fine on small data). It is failing during the repartition, or on the subsequent steps, which then seem to fail and fall back to repartitioning. I've tried adjusting every parameter I can find, but have had no success. Input is only 23GB of LZO (probably 8x compression), and I've verified all files are valid (not corrupted).
> I've tried more and less of: memory, partitions, executors, cores...
> I've set maxFailures up to 300.
> Setting a 4GB heap usually makes it through repartitioning, but fails on subsequent steps (sometimes being killed for running past memory limits). Larger heaps usually don't even make it through the first repartition due to all kinds of weird errors that look like read errors...
> I'm at a loss on how to debug this thing. Is there a tutorial somewhere?
>
> ——
>
> Spark 1.4.1
> Java 7
> Cluster has 3TB of memory, and 400 cores.
> Here are a collection of exceptions [quoted stack traces trimmed]
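One non-race explanation worth ruling out for the —executor-cores correlation: with N cores per executor, N tasks run concurrently inside the same JVM and share a single heap, so per-task memory shrinks as the core count rises. A minimal sketch (the 4 GB heap is an assumed example matching the message above, not a measured value):

```python
# Sketch: why more --executor-cores can surface memory errors even
# without any race. N concurrent tasks share one executor heap, so
# each task's share drops as cores per executor increase.

def per_task_heap_gb(executor_heap_gb, executor_cores):
    """Rough per-task share of a single executor JVM's heap."""
    return executor_heap_gb / executor_cores

heap = 4  # assumed 4 GB executor heap
for cores in (1, 2, 4, 8):
    print(cores, per_task_heap_gb(heap, cores))
# -> 1 4.0
#    2 2.0
#    4 1.0
#    8 0.5
```

If the failures disappear with fewer cores per executor (or more executor memory at the same core count), memory pressure is the more likely culprit than a concurrency bug.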
Need Help Diagnosing/operating/tuning
Hello,
I'm at a loss trying to diagnose why my Spark job is failing (it works fine on small data). It is failing during the repartition, or on the subsequent steps, which then seem to fail and fall back to repartitioning. I've tried adjusting every parameter I can find, but have had no success. Input is only 23GB of LZO (probably 8x compression), and I've verified all files are valid (not corrupted).
I've tried more and less of: memory, partitions, executors, cores...
I've set maxFailures up to 300.
Setting a 4GB heap usually makes it through repartitioning, but fails on subsequent steps (sometimes being killed for running past memory limits). Larger heaps usually don't even make it through the first repartition due to all kinds of weird errors that look like read errors...
I'm at a loss on how to debug this thing. Is there a tutorial somewhere?

——

Spark 1.4.1
Java 7
Cluster has 3TB of memory, and 400 cores.

Here are a collection of exceptions:

java.io.FileNotFoundException: /var/storage/sdd3/nm-local/usercache/jeremy/appcache/application_1447722466442_1649/blockmgr-9ed5583f-cac1-4701-9f70-810c215b954f/13/shuffle_0_5_0.data (No such file or directory)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.(FileOutputStream.java:221)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:128)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:215)
    at org.apache.spark.util.collection.ChainedBuffer.read(ChainedBuffer.scala:56)
    at org.apache.spark.util.collection.PartitionedSerializedPairBuffer$$anon$2.writeNext(PartitionedSerializedPairBuffer.scala:137)
    at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:757)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

java.lang.InternalError: lzo1x_decompress_safe returned: -6
    at com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)
    at com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:315)
    at com.hadoop.compression.lzo.LzopDecompressor.decompress(LzopDecompressor.java:122)
    at com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:247)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
    at java.io.InputStream.read(InputStream.java:101)
    at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
    at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
    at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
    at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:216)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

java.io.IOException: Filesystem closed
    at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
    at