> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> scala.collection.Iterator$class.isEmpty(Iterator.scala:256)
> scala.collection.AbstractIterator.isEmpty(Iterator.scala:1157)
> org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:139)
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:...)
I actually submitted a patch to do this yesterday:
https://github.com/apache/spark/pull/2493
Can you tell us more about your configuration? In particular, how much
memory and how many cores do the executors have, and what does the schema
of your data look like?
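For reference, the executor settings being asked about are usually configured along these lines (a sketch; the app name and values are placeholders, not a recommendation, and the same settings can be passed via spark-submit flags or spark-defaults.conf):

```scala
import org.apache.spark.SparkConf

// Placeholder values: report whatever your job actually uses.
val conf = new SparkConf()
  .setAppName("parquet-read")                // hypothetical app name
  .set("spark.executor.memory", "4g")        // heap per executor
  .set("spark.cores.max", "8")               // total cores (standalone mode)
```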
On Tue, Sep 23, 2014 at 7:39 AM, Cody Koeninger
This may be related: https://github.com/Parquet/parquet-mr/issues/211
Perhaps if we change our configuration settings for Parquet it would get
better, but the performance characteristics of Snappy are pretty bad here
under some circumstances.
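One workaround along those lines is to pin the codec back to gzip explicitly rather than relying on the default. A minimal sketch, assuming an existing SparkContext `sc` is in scope:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Write Parquet with gzip instead of the new snappy default while the
// snappy issue is investigated.
sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip")
```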
On Tue, Sep 23, 2014 at 10:13 AM, Cody Koeninger
After commit 8856c3d8 switched from gzip to snappy as the default Parquet
compression codec, I'm seeing the following when trying to read Parquet
files saved using the new default (same schema and roughly the same size as
files that were previously working):
java.lang.OutOfMemoryError: Direct buffer memory
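For context, direct-buffer OutOfMemoryErrors are bounded by the JVM's `-XX:MaxDirectMemorySize` limit, which can be raised on the executors as a stopgap. A sketch only; the 2g value is a placeholder, not a tuned recommendation:

```scala
import org.apache.spark.SparkConf

// Raise the off-heap direct-buffer ceiling on each executor JVM.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:MaxDirectMemorySize=2g")
```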