Re: df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-12-02 Thread Cheng Lian
You may try setting the Hadoop conf "parquet.enable.summary-metadata" to false to disable writing Parquet summary files (_metadata and _common_metadata). By default, Parquet writes the summary files by collecting the footers of all part-files in the dataset while committing the job. Spark also follows
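
A minimal sketch of applying this before the write, assuming the Spark 1.x SQLContext/DataFrameWriter API used elsewhere in this thread; `sc` and `df` stand for the job's existing SparkContext and DataFrame, and the partition columns and output path are placeholders:

  // Disable Parquet summary files (_metadata / _common_metadata) so the commit
  // step no longer collects footers from every part-file on the driver.
  sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

  // The write itself is unchanged.
  df.write
    .partitionBy("year", "month")      // placeholder partition columns
    .parquet("hdfs:///path/to/output") // placeholder output path

The same property should also work when passed at submission time, e.g. --conf spark.hadoop.parquet.enable.summary-metadata=false, since spark.hadoop.* settings are copied into the Hadoop configuration.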

Re: df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-12-02 Thread Adrien Mogenet
Very interested in that topic too, thanks Cheng for the direction! We'll give it a try as well. On 3 December 2015 at 01:40, Cheng Lian wrote: > You may try to set Hadoop conf "parquet.enable.summary-metadata" to false > to disable writing Parquet summary files

df.partitionBy().parquet() java.lang.OutOfMemoryError: GC overhead limit exceeded

2015-11-28 Thread Don Drake
I have a 2TB dataset in a DataFrame that I am attempting to partition by 2 fields, and my YARN job seems to write the partitioned dataset successfully. I can see the output in HDFS once all Spark tasks are done. After the Spark tasks are done, the job appears to be running for over an
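
The write pattern being described, sketched with the Spark 1.5-era DataFrameWriter API; the partition columns and output path below are placeholders, not details from the thread:

  // Partition the large DataFrame by two fields and write it to HDFS as Parquet.
  // Many distinct partition values means many part-files, which is what makes
  // the summary-file commit step (see Cheng Lian's reply above) expensive.
  df.write
    .partitionBy("fieldA", "fieldB")
    .parquet("hdfs:///path/to/output")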