Re: Binning operation for the generation of Hive partitioned data

Gabriel Reid Tue, 22 Apr 2014 04:32:31 -0700

Hi Elliot,

On Tue, Apr 22, 2014 at 1:11 PM, Elliot West <[email protected]> wrote:
> Hello,
>
> I'm evaluating Apache Crunch as a possible replacement for some of our data
> processing frameworks that run on Hadoop. I can find crunch constructs that
> map to most types of operation that we perform in our processes. However, we
> frequently bin data by a date field for the purpose of generating
> partitioned Hive tables - a fairly common operation I believe. I can't find
> a similar binning operation in the crunch user manual and was wondering
> if/how this would be achieved with Apache Crunch?


There is currently some support for something like this in Crunch,
provided that you're using Avro for your output files.

The AvroPathPerKeyTarget[1] takes a PTable<String,T>, where T is a
type that can be serialized by Avro, and writes the Avro values in a
subdirectory whose name is given by the String value for that record
in the PTable. As pointed out in the javadoc for AvroPathPerKeyTarget,
it's a good idea to ensure that all values for the same key are
together (i.e. that the elements in the PTable are sorted by key)
before using the AvroPathPerKeyTarget.
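
To make the binning step concrete, here is a minimal sketch in plain Java
of deriving a Hive-style partition name from a date field and keeping all
values for the same key together. The class name, the helper names and the
"dt=" prefix are hypothetical and only for illustration; the Crunch wiring
in the trailing comment is an untested assumption based on the javadoc[1],
not a verified recipe.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class HivePartitionBinning {

    // Hypothetical helper: maps an epoch-millisecond timestamp to a
    // Hive-style partition directory name such as "dt=2014-04-22" (UTC).
    static String partitionKey(long epochMillis) {
        LocalDate day = Instant.ofEpochMilli(epochMillis)
                .atZone(ZoneOffset.UTC)
                .toLocalDate();
        return "dt=" + day; // LocalDate.toString() is ISO-8601 (yyyy-MM-dd)
    }

    // Bins timestamps by partition key. The TreeMap keeps keys sorted, so
    // all values for one key sit together, which is the property the
    // javadoc recommends before writing.
    static Map<String, List<Long>> bin(List<Long> timestamps) {
        Map<String, List<Long>> bins = new TreeMap<>();
        for (long ts : timestamps) {
            bins.computeIfAbsent(partitionKey(ts), k -> new ArrayList<>()).add(ts);
        }
        return bins;
    }

    public static void main(String[] args) {
        List<Long> timestamps = Arrays.asList(1398166351000L, 0L, 1398124800000L);
        bin(timestamps).forEach((dir, values) ->
                System.out.println(dir + " -> " + values));
    }

    // Assumed Crunch wiring (not compiled here), given a PTable<String, T>
    // named "table" whose keys were produced by partitionKey():
    //   pipeline.write(table.groupByKey().ungroup(),
    //                  new AvroPathPerKeyTarget("/path/to/output"));
}
```

In a real pipeline the key would be computed inside a DoFn or MapFn that
emits (partitionKey, record) pairs. Whether a key containing "=" is
accepted as a subdirectory name is worth checking before relying on it.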

- Gabriel

1. http://crunch.apache.org/apidocs/0.9.0/org/apache/crunch/io/avro/AvroPathPerKeyTarget.html
