Parquet partitioning for unique identifier

Kohki Nishio Wed, 02 Sep 2015 17:12:07 -0700

Hello experts,

I have a huge JSON file (> 40 GB) and am trying to use Parquet as the file
format. Each entry has a unique identifier, but other than that there is no
column with well-balanced values to partition on. Right now the job just
throws an OOM error, and I couldn't figure out what to do about it.


It would be ideal if I could provide a partitioner based on the unique
identifier value, for example by computing its hash. One option would be
to produce a hash value and add it as a separate column, but that doesn't
sound right to me. Are there any other ways I can try?
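
To make the hash idea concrete, here is a minimal sketch in plain Python
(not Spark code). It only shows how a unique identifier could be mapped to
one of a fixed number of buckets. The bucket count, the identifier format,
and the name bucket_for are assumptions made up for illustration.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 16  # hypothetical bucket count, chosen only for illustration


def bucket_for(identifier: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a unique identifier to a stable bucket in [0, num_buckets)."""
    # Use a digest instead of the built-in hash(), which is randomized
    # per process for strings and so is not stable across runs.
    digest = hashlib.md5(identifier.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce it.
    return int.from_bytes(digest[:8], "big") % num_buckets


# Made-up identifiers standing in for the real ones.
ids = [f"entry-{i}" for i in range(10000)]

# Count how many identifiers land in each bucket.
counts = Counter(bucket_for(i) for i in ids)

for bucket in sorted(counts):
    print(bucket, counts[bucket])
```

The bucket number computed this way is the kind of value that would go
into the separate column mentioned above. Because the digest is
deterministic, the same identifier always lands in the same bucket.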

Regards,
-- 
Kohki Nishio
