Hey Gavin,
Could you please provide a snippet of your code showing how you
disabled "parquet.enable.summary-metadata" and wrote the files?
In particular, you mentioned that you saw "3000 jobs" fail. Were you
writing each Parquet file with an individual job? (Usually people use
write.partitionBy(...).parquet(...) to write multiple Parquet files.)
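For concreteness, the usual pattern looks roughly like this (a sketch;
the paths and the partition column "id" here are placeholders):

    // Read the source data and write it out as Parquet in a single
    // job, letting Spark split the output by the partition column.
    val df = sqlContext.read.json("/path/to/json")
    df.write.partitionBy("id").parquet("/path/to/parquet")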
Cheng
On 1/10/16 10:12 PM, Gavin Yue wrote:
Hey,
I am trying to convert a bunch of JSON files into Parquet, which would
output over 7000 Parquet files. That is too many files, so I want to
repartition by id down to 3000.
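Roughly like this (a sketch; the paths and the column name are
placeholders, and the column-based repartition signature needs Spark
1.6+):

    // Read the JSON input, repartition by id into 3000 partitions,
    // then write the result as Parquet.
    val df = sqlContext.read.json("/data/input")
    df.repartition(3000, df("id"))
      .write
      .parquet("/data/output")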
But I hit a GC error like this one:
https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=l6xq9+9sjl3wgiispzyfh2xmt...@mail.gmail.com%3E#archives
So I set parquet.enable.summary-metadata to false. But when I call
write.parquet, I can still see 3000 jobs run after the Parquet write,
and they fail due to GC.
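(For concreteness, a sketch of one typical way to set this flag in
Spark 1.x; the exact placement in my code may differ:)

    // Disable Parquet summary metadata on the underlying Hadoop
    // configuration before writing.
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
    df.repartition(3000, df("id")).write.parquet("/data/output")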
Basically, repartition has never succeeded for me. Are there any other
settings that could be tuned?
Thanks,
Gavin