You may try setting the Hadoop conf "parquet.enable.summary-metadata" to
false to disable writing Parquet summary files (_metadata and
_common_metadata).
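A minimal sketch of applying that suggestion, assuming a live `SparkContext` named `sc` (the property name comes from the message above; everything else here is illustrative):

```scala
// Disable Parquet summary files (_metadata / _common_metadata)
// by setting the Hadoop property on the context's configuration.
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

// Equivalently, it can be passed at launch time via the
// spark.hadoop.* prefix (hypothetical invocation):
//   spark-submit --conf spark.hadoop.parquet.enable.summary-metadata=false ...
```

Either route should reach the Parquet output committer, since Spark forwards `spark.hadoop.*` entries into the job's Hadoop configuration.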
By default, Parquet writes the summary files by collecting the footers of
all part-files in the dataset while committing the job. Spark also follows
Very interested in this topic too; thanks, Cheng, for the direction!
We'll give it a try as well.
On 3 December 2015 at 01:40, Cheng Lian wrote:
> You may try to set Hadoop conf "parquet.enable.summary-metadata" to false
> to disable writing Parquet summary files
I have a 2 TB dataset in a DataFrame that I am attempting to partition by
two fields, and my YARN job seems to write the partitioned dataset
successfully. I can see the output in HDFS once all Spark tasks are done.
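For reference, a write like the one described might look as follows; the column names and output path are hypothetical, since the original message does not give them:

```scala
// Partition a DataFrame by two fields and write it out as Parquet.
// "year" and "month" stand in for the two unnamed partition columns.
df.write
  .partitionBy("year", "month")
  .parquet("hdfs:///path/to/output")
```

With a 2 TB input, the number of (partition directory x task) output files produced by such a write is often what makes the commit phase, and any summary-file generation, expensive.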
After the Spark tasks are done, the job appears to be running for over an