Re: Need Help Diagnosing/operating/tuning

2015-11-23 Thread Igor Berman
You should check why the executor is killed; as soon as it's killed you can get
all kinds of strange exceptions...
Either give your executors more memory (4G is rather small for Spark),
or try to decrease your input, or split it into more partitions in the
input format.
23G of LZO might expand by some factor x in memory - it depends on your format.
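
A minimal sketch of the "more partitions" option; the path and partition
counts are hypothetical, and note that plain (non-indexed) .lzo files are not
splittable, so repartition() after loading is the more reliable lever:

import org.apache.spark.{SparkConf, SparkContext}

object SplitInputSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("split-input-sketch"))

    // Hypothetical input path; minPartitions asks the input format for more
    // splits, but .lzo files without an index cannot be split further.
    val raw = sc.textFile("hdfs:///data/input-lzo", minPartitions = 2000)

    // repartition() works regardless of how the file was split and keeps
    // each shuffle partition small relative to executor memory.
    val smallParts = raw.repartition(2000)

    println(s"partitions after load:        ${raw.partitions.length}")
    println(s"partitions after repartition: ${smallParts.partitions.length}")
    sc.stop()
  }
}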

In general each executor has 4G of memory, and only part of it is used for
caching/shuffling (see the Spark configuration for the different fraction params).
Divide that memory by the number of cores in each executor and you get an
approximation of your maximum partition size. You can also run the arithmetic
the opposite way, from the size of a partition to the memory needed by each
executor.
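
As an illustration of that arithmetic (the numbers below are assumptions, not
measurements from this job; 0.2 is the Spark 1.4 default for
spark.shuffle.memoryFraction):

object PartitionSizingSketch {
  def main(args: Array[String]): Unit = {
    val executorMemGB    = 4.0  // --executor-memory
    val shuffleFraction  = 0.2  // spark.shuffle.memoryFraction, 1.4 default
    val coresPerExecutor = 4    // --executor-cores, tasks run concurrently

    // Rough shuffle budget per concurrently running task.
    val perTaskGB = executorMemGB * shuffleFraction / coresPerExecutor
    println(f"~$perTaskGB%.2f GB of shuffle memory per task")  // ~0.20 GB here

    // The opposite direction: if 23 GB of LZO expands ~8x in memory,
    // how many partitions keep each one under that budget?
    val expandedGB = 23.0 * 8
    println(s"need roughly ${math.ceil(expandedGB / perTaskGB).toInt} partitions")
  }
}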

There is no point in making 300 retries... there is no magic in Spark; if it
fails after 3 retries it will keep failing.
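
A hypothetical configuration sketch along those lines (the memory and core
values are just illustrative; the point is to leave maxFailures near its
default so the real error surfaces quickly):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("tuning-sketch")
  .set("spark.task.maxFailures", "4")   // the default; 300 only hides the root cause
  .set("spark.executor.memory", "8g")   // assumption: more headroom than 4g
  .set("spark.executor.cores", "2")     // assumption: fewer concurrent tasks per JVM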

The UI metrics can also give you hints about partition size, etc.

On 23 November 2015 at 03:30, Jeremy Davis wrote:

> It seems like the problem is related to --executor-cores. Is there possibly
> some sort of race condition when using multiple cores per executor?
>
>
> On Nov 22, 2015, at 12:38 PM, Jeremy Davis wrote:
>
>
> Hello,
> I’m at a loss trying to diagnose why my spark job is failing. (works fine
> on small data)
> It is failing during the repartition, or on the subsequent steps.. which
> then seem to fail and fall back to repartitioning..
> I’ve tried adjusting every parameter I can find, but have had no success.
> Input is only 23GB of LZO (probably 8x compression), and I’ve verified all
> files are valid (not corrupted).
> I’ve tried more and less of: memory, partitions, executors, cores...
> I’ve set maxFailures up to 300.
> Setting 4GB heap usually makes it through repartitioning, but fails on
> subsequent steps (Sometimes being killed from running past memory limits).
> Larger Heaps usually don’t even make it through the first repartition due
> to all kinds of weird errors that look like read errors...
>
> I’m at a loss on how to debug this thing.
> Is there a tutorial somewhere?
>
> ——
>
>
> Spark 1.4.1
> Java 7
> Cluster has 3TB of memory, and 400 cores.
>
>
> Here is a collection of exceptions:
>
>
> java.io.FileNotFoundException: 
> /var/storage/sdd3/nm-local/usercache/jeremy/appcache/application_1447722466442_1649/blockmgr-9ed5583f-cac1-4701-9f70-810c215b954f/13/shuffle_0_5_0.data
>  (No such file or directory)
>   at java.io.FileOutputStream.open(Native Method)
>   at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:128)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:215)
>   at 
> org.apache.spark.util.collection.ChainedBuffer.read(ChainedBuffer.scala:56)
>   at 
> org.apache.spark.util.collection.PartitionedSerializedPairBuffer$$anon$2.writeNext(PartitionedSerializedPairBuffer.scala:137)
>   at 
> org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:757)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:70)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
>
>
>
>
> java.lang.InternalError: lzo1x_decompress_safe returned: -6
>   at 
> com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native 
> Method)
>   at 
> com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:315)
>   at 
> com.hadoop.compression.lzo.LzopDecompressor.decompress(LzopDecompressor.java:122)
>   at 
> com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:247)
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
>   at java.io.InputStream.read(InputStream.java:101)
>   at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
>   at 
> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
>   at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>   at 
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
>   at 
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
>   at 
> 

Re: Need Help Diagnosing/operating/tuning

2015-11-22 Thread Jeremy Davis
It seems like the problem is related to --executor-cores. Is there possibly some 
sort of race condition when using multiple cores per executor?


On Nov 22, 2015, at 12:38 PM, Jeremy Davis wrote:


Hello,
I’m at a loss trying to diagnose why my spark job is failing. (works fine on 
small data)
It is failing during the repartition, or on the subsequent steps.. which then 
seem to fail and fall back to repartitioning..
I’ve tried adjusting every parameter I can find, but have had no success.
Input is only 23GB of LZO (probably 8x compression), and I’ve verified all 
files are valid (not corrupted).
I’ve tried more and less of: memory, partitions, executors, cores...
I’ve set maxFailures up to 300.
Setting 4GB heap usually makes it through repartitioning, but fails on 
subsequent steps (Sometimes being killed from running past memory limits). 
Larger Heaps usually don’t even make it through the first repartition due to 
all kinds of weird errors that look like read errors...

I’m at a loss on how to debug this thing.
Is there a tutorial somewhere?

——


Spark 1.4.1
Java 7
Cluster has 3TB of memory, and 400 cores.


Here is a collection of exceptions:



java.io.FileNotFoundException: 
/var/storage/sdd3/nm-local/usercache/jeremy/appcache/application_1447722466442_1649/blockmgr-9ed5583f-cac1-4701-9f70-810c215b954f/13/shuffle_0_5_0.data
 (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
at 
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:128)
at 
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:215)
at 
org.apache.spark.util.collection.ChainedBuffer.read(ChainedBuffer.scala:56)
at 
org.apache.spark.util.collection.PartitionedSerializedPairBuffer$$anon$2.writeNext(PartitionedSerializedPairBuffer.scala:137)
at 
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:757)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)




java.lang.InternalError: lzo1x_decompress_safe returned: -6
at 
com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)
at 
com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:315)
at 
com.hadoop.compression.lzo.LzopDecompressor.decompress(LzopDecompressor.java:122)
at 
com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:247)
at 
org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at 
org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:216)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)



java.io.IOException: Filesystem closed
at 

Need Help Diagnosing/operating/tuning

2015-11-22 Thread Jeremy Davis

Hello,
I’m at a loss trying to diagnose why my spark job is failing. (works fine on 
small data)
It is failing during the repartition, or on the subsequent steps.. which then 
seem to fail and fall back to repartitioning..
I’ve tried adjusting every parameter I can find, but have had no success.
Input is only 23GB of LZO (probably 8x compression), and I’ve verified all 
files are valid (not corrupted).
I’ve tried more and less of: memory, partitions, executors, cores...
I’ve set maxFailures up to 300.
Setting 4GB heap usually makes it through repartitioning, but fails on 
subsequent steps (Sometimes being killed from running past memory limits). 
Larger Heaps usually don’t even make it through the first repartition due to 
all kinds of weird errors that look like read errors...

I’m at a loss on how to debug this thing.
Is there a tutorial somewhere?

——


Spark 1.4.1
Java 7
Cluster has 3TB of memory, and 400 cores.


Here is a collection of exceptions:



java.io.FileNotFoundException: 
/var/storage/sdd3/nm-local/usercache/jeremy/appcache/application_1447722466442_1649/blockmgr-9ed5583f-cac1-4701-9f70-810c215b954f/13/shuffle_0_5_0.data
 (No such file or directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
at 
org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:128)
at 
org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:215)
at 
org.apache.spark.util.collection.ChainedBuffer.read(ChainedBuffer.scala:56)
at 
org.apache.spark.util.collection.PartitionedSerializedPairBuffer$$anon$2.writeNext(PartitionedSerializedPairBuffer.scala:137)
at 
org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:757)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)




java.lang.InternalError: lzo1x_decompress_safe returned: -6
at 
com.hadoop.compression.lzo.LzoDecompressor.decompressBytesDirect(Native Method)
at 
com.hadoop.compression.lzo.LzoDecompressor.decompress(LzoDecompressor.java:315)
at 
com.hadoop.compression.lzo.LzopDecompressor.decompress(LzopDecompressor.java:122)
at 
com.hadoop.compression.lzo.LzopInputStream.decompress(LzopInputStream.java:247)
at 
org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
at java.io.InputStream.read(InputStream.java:101)
at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
at 
org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:216)
at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:62)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)



java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:707)
at 
org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:776)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:837)
at