Hi Elliot,

On Tue, Apr 22, 2014 at 1:11 PM, Elliot West <[email protected]> wrote:
> Hello,
>
> I'm evaluating Apache Crunch as a possible replacement for some of our data
> processing frameworks that run on Hadoop. I can find crunch constructs that
> map to most types of operation that we perform in our processes. However, we
> frequently bin data by a date field for the purpose of generating
> partitioned Hive tables - a fairly common operation I believe. I can't find
> a similar binning operation in the crunch user manual and was wondering
> if/how this would be achieved with Apache Crunch?

There is currently some support for something like this in Crunch,
provided that you're using Avro for your output files. The
AvroPathPerKeyTarget[1] takes a PTable<String,T>, where T is a type
that can be serialized by Avro, and writes the Avro values in a
subdirectory whose name is given by the String value for that record
in the PTable.

As pointed out in the javadoc for AvroPathPerKeyTarget, it's a good
idea to ensure that all values for the same key are together (i.e.
that the elements in the PTable are sorted by key) before using the
AvroPathPerKeyTarget.

- Gabriel

1. http://crunch.apache.org/apidocs/0.9.0/org/apache/crunch/io/avro/AvroPathPerKeyTarget.html
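
To make the layout described above concrete, here is a minimal plain-Java sketch of the binning idea. It does not use the Crunch API: the class `DateBinningSketch`, the method `binByKey`, the `/data/events` root and the sample records are all hypothetical, and the sketch only models which subdirectory each value would land in when the String key is a date. The comment in `main` shows what the corresponding Crunch call would roughly look like, which is an assumption to check against the javadoc.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class DateBinningSketch {

    // Models the "one subdirectory per key" layout: each (key, value) pair is
    // assigned to "<outputRoot>/<key>". The TreeMap keeps the subdirectories
    // sorted, so all values for the same key end up together.
    static SortedMap<String, List<String>> binByKey(
            String outputRoot, List<Map.Entry<String, String>> records) {
        SortedMap<String, List<String>> bins = new TreeMap<>();
        for (Map.Entry<String, String> record : records) {
            String subdirectory = outputRoot + "/" + record.getKey();
            bins.computeIfAbsent(subdirectory, k -> new ArrayList<>())
                .add(record.getValue());
        }
        return bins;
    }

    public static void main(String[] args) {
        // Hypothetical records keyed by a date string (the bin).
        List<Map.Entry<String, String>> records = Arrays.asList(
            new SimpleEntry<String, String>("2014-04-22", "event-b"),
            new SimpleEntry<String, String>("2014-04-21", "event-a"),
            new SimpleEntry<String, String>("2014-04-22", "event-c"));

        // In Crunch itself, the analogous step would be roughly (assumed usage,
        // where `table` is a PTable<String, T> already sorted by key):
        //   pipeline.write(table, new AvroPathPerKeyTarget("/data/events"));
        binByKey("/data/events", records).forEach(
            (dir, values) -> System.out.println(dir + " -> " + values));
    }
}
```

In this model the date string plays the role of the PTable key, so choosing the key format is what decides the partition directory names.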
