Hey Gavin,

Could you please provide a snippet of your code showing how you disabled "parquet.enable.summary-metadata" and wrote the files? In particular, you mentioned that you saw "3000 jobs" fail. Were you writing each Parquet file with an individual job? (Usually people use write.partitionBy(...).parquet(...) to write multiple Parquet files.)
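That is, assuming a DataFrame df and a made-up output path, something like:

    // a single write job; output is laid out as one directory per distinct "id" value
    df.write.partitionBy("id").parquet("/path/to/output")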

Cheng

On 1/10/16 10:12 PM, Gavin Yue wrote:
Hey,

I am trying to convert a bunch of JSON files into Parquet, which would output over 7000 Parquet files. That is too many files, so I want to repartition by id down to 3000.

But I ran into a GC problem like the one described here: https://mail-archives.apache.org/mod_mbox/spark-user/201512.mbox/%3CCAB4bC7_LR2rpHceQw3vyJ=l6xq9+9sjl3wgiispzyfh2xmt...@mail.gmail.com%3E#archives

So I set parquet.enable.summary-metadata to false. But when I call write.parquet, I still see the 3000 jobs run after the Parquet write, and they fail due to GC.
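In case it helps, the pattern I am using is roughly this (simplified, run in spark-shell where sc and sqlContext are predefined; paths are placeholders, and the repartition-by-column overload assumes Spark 1.6):

    // disable the Parquet summary files (_metadata / _common_metadata)
    sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

    // read the JSON input (placeholder path)
    val df = sqlContext.read.json("/path/to/json")

    // repartition by id down to 3000 partitions, then write Parquet
    df.repartition(3000, df("id")).write.parquet("/path/to/parquet")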

Basically, repartition has never succeeded for me. Are there any other settings that could be optimized?

Thanks,
Gavin

