There is an AvroParquetPathPerKeyTarget, IIRC. I'm on my phone at the moment, so I can't check the docs. Keying the output by timezone and writing it with a path-per-key target is still the best option at the moment.

On Mon, Nov 21, 2016 at 6:59 AM David Ortiz <[email protected]> wrote:
> Hello,
>
> I am working on a Crunch pipeline where the output is going to be
> read by subsequent Hive jobs. I want to partition it by the timezone
> contained in the data records. What is the best way to support this in
> Crunch?
>
> From the googling I did, it looked like one approach would be to
> write the data out into a PTable keyed by the timezone, then use the
> AvroPathPerKeyTarget. However, from what I can tell this only works if I
> am writing to an Avro output. Is there similar functionality available for
> parquet output?
>
> Alternatively, is there a better way to do this? I imagine I could
> filter the collection for each timezone, but that doesn't seem like it
> would be an efficient way to bucket the data.
>
> Thanks,
> Dave
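
For illustration, here is a rough sketch of the bucketing idea from the question: key each record by its timezone in a single pass and collect it under a per-key output directory, rather than filtering the whole collection once per timezone. This is plain Java, not Crunch code. The class `TimezoneBuckets`, the `Event` type, its fields, and the `baseDir/<key>` directory layout are all assumptions made up for the example, so treat it as a picture of the data flow and not as the library's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: no Crunch or Hadoop classes are used here.
class TimezoneBuckets {

    // A record reduced to the two fields that matter for the example
    // (both field names are assumptions).
    static final class Event {
        final String timezone;
        final String payload;

        Event(String timezone, String payload) {
            this.timezone = timezone;
            this.payload = payload;
        }
    }

    // Single pass over the input: each record is appended to the bucket for
    // its timezone. The map key is the assumed output directory for that
    // timezone (baseDir + "/" + timezone).
    static Map<String, List<String>> bucketByTimezone(List<Event> events, String baseDir) {
        Map<String, List<String>> buckets = new TreeMap<>();
        for (Event e : events) {
            String dir = baseDir + "/" + e.timezone;
            buckets.computeIfAbsent(dir, k -> new ArrayList<>()).add(e.payload);
        }
        return buckets;
    }
}
```

Each record is visited once no matter how many distinct timezones there are, whereas the filter-per-timezone alternative scans the full collection once per timezone. Note that a timezone ID containing a slash would produce nested directories under this assumed layout, so the key may need sanitizing before it is used as a path component.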
