Ha, yes, but don't we all love static typing when it comes for free? :)
> On Jan 5, 2014, at 11:26 AM, Josh Wills <[email protected]> wrote:
>
>> On Sat, Jan 4, 2014 at 7:43 PM, Jay Vyas <[email protected]> wrote:
>> BTW, thanks Josh! That worked!
>>
>> Here is an example of how easy it is to do aggregations in Crunch :)
>>
>> https://github.com/jayunit100/bigpetstore/commit/03a59fc88680d8926aba4c8d00760436c8cafb69
>>
>> PS: Are you sure Pig/Hive is really better for this kind of stuff? I like the
>> IDE-friendly, statically validated, strongly typed, functional API a lot more
>> than the Russian roulette that I always seem to play with my Pig/Hive code :)
>
> That may be a function of your comfort level with IDE-supported static strong
> typing. ;-)
>
>>> On Sat, Jan 4, 2014 at 7:49 PM, Jay Vyas <[email protected]> wrote:
>>> Thanks Josh, that was very helpful! I like the Avro mapper intermediate
>>> solution; I'll try it out.
>>>
>>> Also: would you be interested in contributing a new "section" of the
>>> BigPetStore workflow, a module which really shows where Crunch's
>>> differentiating factors are valuable?
>>>
>>> The idea is that BigPetStore should show the differences between different
>>> ecosystem components so that people can pick for themselves which tool is
>>> best for which job. So I think it would be cool to have a phase in the
>>> BigPetStore workflow which used some nested, strongly typed data and
>>> processed it with Crunch versus Pig, to demonstrate (in code) the comments
>>> you've made.
>>>
>>> Right now I only have Pig and Hive, but I want to add Cascading and
>>> (obviously) Crunch as well.
>>>
>>>> On Jan 4, 2014, at 4:57 PM, Josh Wills <[email protected]> wrote:
>>>>
>>>> Hey Jay,
>>>>
>>>> Crunch isn't big into tuples; it's mostly used to process some sort of
>>>> structured, complex record data like Avro, protocol buffers, or Thrift. I
>>>> certainly don't speak for everyone in the community, but I think that
>>>> using one of these rich, evolvable formats is the best way to work with
>>>> data on Hadoop. For the problem you gave, where the data is in CSV text,
>>>> there are a couple of options.
>>>>
>>>> One option would be to use the TupleN type to represent a record and the
>>>> Extractor API in crunch-contrib to parse the lines of strings into typed
>>>> tokens, so you would do something like this to your PCollection<String>:
>>>>
>>>> PCollection<String> rawData = ...;
>>>> TokenizerFactory tokenize = TokenizerFactory.builder().delim(",").build();
>>>> PCollection<TupleN> tuples = Parse.parse(
>>>>     "bigpetshop",    // a name to use for the counters used in parsing
>>>>     rawData,
>>>>     xtupleN(tokenize,
>>>>         xstring(),   // big pet store
>>>>         xstring(),   // store code
>>>>         xint(),      // line item
>>>>         xstring(),   // first name
>>>>         xstring(),   // last name
>>>>         xstring(),   // timestamp
>>>>         xdouble(),   // price
>>>>         xstring())); // item description
>>>>
>>>> You could also create a POJO to represent a LineItem (which is what I
>>>> assume this is) and then use Avro reflection-based serialization to
>>>> serialize it with Crunch:
>>>>
>>>> public static class LineItem {
>>>>   String appName;
>>>>   String storeCode;
>>>>   int lineId;
>>>>   String firstName;
>>>>   String lastName;
>>>>   String timestamp;
>>>>   double price;
>>>>   String description;
>>>>
>>>>   public LineItem() {
>>>>     // Avro reflection needs a zero-arg constructor
>>>>   }
>>>>
>>>>   // other constructors, parsers, etc.
>>>> }
>>>>
>>>> and then you would have something like this:
>>>>
>>>> PCollection<LineItem> lineItems = rawData.parallelDo(
>>>>     new MapFn<String, LineItem>() {
>>>>       @Override
>>>>       public LineItem map(String input) {
>>>>         // parse line to LineItem object
>>>>       }
>>>>     }, Avros.reflects(LineItem.class));
>>>>
>>>> I'm not quite sure what you're doing in the grouping clause you have here:
>>>>
>>>> groupBy(0).count();
>>>>
>>>> ...I assume you want to count the distinct values of the first field in
>>>> your tuple, which you would do like this for line items:
>>>>
>>>> PTable<String, Long> counts = lineItems.parallelDo(
>>>>     new MapFn<LineItem, String>() {
>>>>       public String map(LineItem lineItem) { return lineItem.appName; }
>>>>     }, Avros.strings()).count();
>>>>
>>>> and similarly for TupleN, although you would call get(0) on the TupleN and
>>>> have to cast the returned Object to a String, because TupleN methods don't
>>>> have type information.
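>>>>
>>>> Untested, but roughly, the TupleN version of that count might look like:
>>>>
>>>> // Count occurrences of the first field of each TupleN; get(0) returns an
>>>> // Object, so it has to be cast back to a String.
>>>> PTable<String, Long> tupleCounts = tuples.parallelDo(
>>>>     new MapFn<TupleN, String>() {
>>>>       public String map(TupleN tuple) { return (String) tuple.get(0); }
>>>>     }, Avros.strings()).count();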
>>>>
>>>> I hope that helps. In general, I don't really recommend Crunch for this
>>>> sort of data processing; Hive, Pig, and Cascading are fine alternatives.
>>>> But I think Crunch is superior to any of them if you were trying to, say,
>>>> create an Order record that aggregated the results of multiple LineItems:
>>>>
>>>> Order {
>>>>   List<LineItem> lineItems;
>>>>   // global order attributes
>>>> }
>>>>
>>>> or a Customer type that aggregated multiple Orders for a single customer:
>>>>
>>>> Customer {
>>>>   List<Order> orders;
>>>>   // other customer fields
>>>> }
>>>>
>>>> ...especially if this was the sort of processing task you had to do
>>>> regularly, because lots of other downstream processing tasks required
>>>> these standard aggregations to exist so that they could do their own
>>>> calculations. I would also recommend Crunch if you were building
>>>> BigPetStore on top of HBase using custom schemas that you needed to
>>>> periodically MapReduce over in order to calculate statistics, clean up
>>>> stale data, or fix any consistency issues.
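>>>>
>>>> As a rough, untested sketch, assuming a hypothetical orderId field on
>>>> LineItem and an Order POJO with a zero-arg constructor so that Avro
>>>> reflection works, that first aggregation might look like:
>>>>
>>>> // Key each LineItem by its order id, collect the values for each key, and
>>>> // fold each collection of line items into a single Order record.
>>>> PCollection<Order> orders = lineItems
>>>>     .by(new MapFn<LineItem, String>() {
>>>>           public String map(LineItem lineItem) { return lineItem.orderId; }
>>>>         }, Avros.strings())
>>>>     .collectValues()
>>>>     .parallelDo(new MapFn<Pair<String, Collection<LineItem>>, Order>() {
>>>>           public Order map(Pair<String, Collection<LineItem>> grouped) {
>>>>             Order order = new Order();
>>>>             order.lineItems = new ArrayList<LineItem>(grouped.second());
>>>>             return order;
>>>>           }
>>>>         }, Avros.reflects(Order.class));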
>>>>
>>>> Best,
>>>> Josh
>>>>
>>>>> On Sat, Jan 4, 2014 at 12:34 PM, Jay Vyas <[email protected]> wrote:
>>>>> Hi Crunch!
>>>>>
>>>>> I want to process a list in Crunch, something like this:
>>>>>
>>>>> PCollection<String> lines = MemPipeline.collectionOf(
>>>>>     "BigPetStore,storeCode_AK,1 lindsay,franco,Sat Jan 10 00:11:10 EST 1970,10.5,dog-food",
>>>>>     "BigPetStore,storeCode_AZ,1 tom,giles,Sun Dec 28 23:08:45 EST 1969,10.5,dog-food",
>>>>>     "BigPetStore,storeCode_CA,1 brandon,ewing,Mon Dec 08 20:23:57 EST 1969,16.5,organic-dog-food",
>>>>>     "BigPetStore,storeCode_CA,2 angie,coleman,Thu Dec 11 07:00:31 EST 1969,10.5,dog-food",
>>>>>     "BigPetStore,storeCode_CA,3 angie,coleman,Tue Jan 20 06:24:23 EST 1970,7.5,cat-food",
>>>>>     "BigPetStore,storeCode_CO,1 sharon,trevino,Mon Jan 12 07:52:10 EST 1970,30.1,antelope snacks",
>>>>>     "BigPetStore,storeCode_CT,1 kevin,fitzpatrick,Wed Dec 10 05:24:13 EST 1969,10.5,dog-food",
>>>>>     "BigPetStore,storeCode_NY,1 dale,holden,Mon Jan 12 23:02:13 EST 1970,19.75,fish-food",
>>>>>     "BigPetStore,storeCode_NY,2 dale,holden,Tue Dec 30 12:29:52 EST 1969,10.5,dog-food",
>>>>>     "BigPetStore,storeCode_OK,1 donnie,tucker,Sun Jan 18 04:50:26 EST 1970,7.5,cat-food");
>>>>>
>>>>> PCollection coll = lines.parallelDo(
>>>>>     "split lines into words",
>>>>>     new DoFn<String, String>() {
>>>>>       @Override
>>>>>       public void process(String line, Emitter emitter) {
>>>>>         // not sure this regex will work, but you get the idea:
>>>>>         // split by tabs and commas
>>>>>         emitter.emit(Arrays.asList(line.split("\t,")));
>>>>>       }
>>>>>     },
>>>>>     Writables.lists()
>>>>> ).groupBy(0).count();
>>>>>
>>>>> What is the correct abstraction in Crunch to convert raw text into tuples
>>>>> and access the fields by an index, which you then use to group and count on?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> ** FYI ** This is for the BigPetStore project; I'd like to show Crunch
>>>>> examples in it if I can get them working, as the API is a nice example of
>>>>> a lower-level MapReduce paradigm that is more Java-friendly.
>>>>>
>>>>> See https://issues.apache.org/jira/browse/BIGTOP-1089 and
>>>>> https://github.com/jayunit100/bigpetstore for details.
>>>>
>>>>
>>>> --
>>>> Director of Data Science
>>>> Cloudera
>>>> Twitter: @josh_wills
>>
>>
>> --
>> Jay Vyas
>> http://jayunit100.blogspot.com
>
>
> --
> Director of Data Science
> Cloudera
> Twitter: @josh_wills
