Best way to imitate Hive Partition

David Ortiz Mon, 21 Nov 2016 07:01:04 -0800

Hello,

     I am working on a Crunch pipeline where the output is going to be read
by subsequent Hive jobs.  I want to partition it by the timezone contained
in the data records.  What is the best way to support this in Crunch?


     From the googling I did, it looked like one approach would be to write
the data out into a PTable keyed by the timezone, then use the
AvroPathPerKeyTarget.  However, from what I can tell this only works if I
am writing to an Avro output.  Is there similar functionality available for
Parquet output?

     Alternatively, is there a better way to do this?  I imagine I could
filter the collection for each timezone, but that doesn't seem like it
would be an efficient way to bucket the data.
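
     To make the "bucket in one pass" idea concrete, here is a rough sketch
in plain Java. It does not use the Crunch API. The Event class, the
"timezone" column name, and the helper methods are hypothetical names made up
for illustration. It only shows grouping records by key into Hive-style
column=value directory names in a single pass, instead of filtering the whole
collection once per timezone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PartitionSketch {
    // Hypothetical record type; only the timezone field matters here.
    static final class Event {
        final String timezone;
        final String payload;

        Event(String timezone, String payload) {
            this.timezone = timezone;
            this.payload = payload;
        }
    }

    // Hive lays out partitions as directories named column=value.
    // Values containing '/' would need escaping, which this sketch skips.
    static String partitionDir(String column, String value) {
        return column + "=" + value;
    }

    // Single pass over the data: each record goes straight to its bucket,
    // rather than scanning the full collection once per timezone.
    static Map<String, List<Event>> bucketByTimezone(List<Event> events) {
        Map<String, List<Event>> buckets = new TreeMap<>();
        for (Event e : events) {
            String dir = partitionDir("timezone", e.timezone);
            buckets.computeIfAbsent(dir, k -> new ArrayList<>()).add(e);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("UTC", "a"),
                new Event("PST", "b"),
                new Event("UTC", "c"));
        // Print each partition directory with its record count.
        for (Map.Entry<String, List<Event>> entry : bucketByTimezone(events).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().size() + " record(s)");
        }
    }
}
```

     What I am hoping for is the Crunch equivalent of the loop above, where
the key decides the output directory.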

Thanks,
     Dave
