That looks like exactly what I'm after. Even better, it looks like it was
backported to 0.11.0 in CDH 5.7.
Thanks!
Dave
On Mon, Nov 21, 2016 at 11:34 AM Josh Wills wrote:
There is an AvroParquetPathPerKeyTarget, IIRC. I'm on my phone at the
moment, so I can't check the docs. That approach is still the best option
at the moment.
On Mon, Nov 21, 2016 at 6:59 AM David Ortiz wrote:
Hello,
I am working on a Crunch pipeline where the output is going to be read
by subsequent Hive jobs. I want to partition it by the timezone contained
in the data records. What is the best way to support this in Crunch?
From the googling I did, it looked like one approach would be to write
the data out into a PTable keyed by the timezone, then use the
AvroPathPerKeyTarget. However, from what I can tell this only works if I
am writing to an Avro output. Is there similar functionality available for
Parquet output?
Alternatively, is there a better way to do this? I imagine I could
filter the collection for each timezone, but that doesn't seem like it
would be an efficient way to bucket the data.
Thanks,
Dave
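
For readers following the thread, here is a minimal plain-Java sketch of the idea behind keying the data by timezone and writing one output directory per key. It is not Crunch code: the Event type, the bucketByTimezone and outputDir helpers, and the base path are all hypothetical names made up for illustration, and the directory layout is an assumption rather than the documented behavior of AvroPathPerKeyTarget or AvroParquetPathPerKeyTarget. It only shows why keying the records buckets them in a single pass, whereas filtering the collection once per timezone would scan the input once for every timezone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these names come from Crunch.
class PathPerKeySketch {

    // Hypothetical record type: a timezone plus an opaque payload.
    static final class Event {
        final String timezone;
        final String payload;

        Event(String timezone, String payload) {
            this.timezone = timezone;
            this.payload = payload;
        }
    }

    // Single pass over the input: each record is appended to the bucket
    // for its key, so the cost does not grow with the number of timezones.
    static Map<String, List<Event>> bucketByTimezone(List<Event> events) {
        Map<String, List<Event>> buckets = new LinkedHashMap<>();
        for (Event e : events) {
            buckets.computeIfAbsent(e.timezone, k -> new ArrayList<>()).add(e);
        }
        return buckets;
    }

    // Assumed layout for illustration: one subdirectory per key under a base path.
    static String outputDir(String basePath, String key) {
        return basePath + "/" + key;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("UTC", "a"),
                new Event("EST", "b"),
                new Event("UTC", "c"));

        // Print the target directory and record count for each bucket.
        for (Map.Entry<String, List<Event>> entry : bucketByTimezone(events).entrySet()) {
            System.out.println(outputDir("/data/events", entry.getKey())
                    + " <- " + entry.getValue().size() + " record(s)");
        }
    }
}
```

In an actual pipeline the same shape would be a PTable keyed by the timezone string and handed to the path-per-key target mentioned above; check the Crunch Javadoc for the exact constructors and key-type requirements before relying on it.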