crunch : correct way to think about tuple abstractions for aggregations?

Jay Vyas Sat, 04 Jan 2014 12:36:00 -0800

Hi crunch !

I want to process a list in crunch:


Something like this:

        PCollection<String> lines = MemPipeline.collectionOf(
                "BigPetStore,storeCode_AK,1  lindsay,franco,Sat Jan 10 00:11:10 EST 1970,10.5,dog-food",
                "BigPetStore,storeCode_AZ,1  tom,giles,Sun Dec 28 23:08:45 EST 1969,10.5,dog-food",
                "BigPetStore,storeCode_CA,1  brandon,ewing,Mon Dec 08 20:23:57 EST 1969,16.5,organic-dog-food",
                "BigPetStore,storeCode_CA,2  angie,coleman,Thu Dec 11 07:00:31 EST 1969,10.5,dog-food",
                "BigPetStore,storeCode_CA,3  angie,coleman,Tue Jan 20 06:24:23 EST 1970,7.5,cat-food",
                "BigPetStore,storeCode_CO,1  sharon,trevino,Mon Jan 12 07:52:10 EST 1970,30.1,antelope snacks",
                "BigPetStore,storeCode_CT,1  kevin,fitzpatrick,Wed Dec 10 05:24:13 EST 1969,10.5,dog-food",
                "BigPetStore,storeCode_NY,1  dale,holden,Mon Jan 12 23:02:13 EST 1970,19.75,fish-food",
                "BigPetStore,storeCode_NY,2  dale,holden,Tue Dec 30 12:29:52 EST 1969,10.5,dog-food",
                "BigPetStore,storeCode_OK,1  donnie,tucker,Sun Jan 18 04:50:26 EST 1970,7.5,cat-food");

        PCollection coll = lines.parallelDo(
              "split lines into words",
              new DoFn<String, String>() {
                  @Override
                  public void process(String line, Emitter emitter) {
                    // not sure this regex will work, but you get the idea:
                    // split by tabs and commas
                    emitter.emit(Arrays.asList(line.split("\t,")));
                  }
              },
              Writables.lists()
        ).groupBy(0).count();

        }

What is the correct abstraction in crunch to convert raw text into tuples,
and access them by an index - which you then use to group and count on?
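
To make the intent concrete, here is a minimal plain-Java sketch of the
split, group and count logic I am after. It uses no Crunch classes at all;
the class name TupleCountSketch and the helpers splitFields and countByField
are hypothetical names made up for this illustration. It assumes the gap
after the transaction number in each record is a tab, and it splits on the
character class "[\t,]", because the pattern "\t," only matches a tab that is
immediately followed by a comma.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java sketch only: no Crunch classes are used here.
class TupleCountSketch {

    // Split one record on a tab OR a comma.
    // (The pattern "\t," would match only a tab directly followed by a comma.)
    static List<String> splitFields(String line) {
        return Arrays.asList(line.split("[\t,]"));
    }

    // Group records by the field at keyIndex and count the records per group.
    static Map<String, Long> countByField(List<String> lines, int keyIndex) {
        return lines.stream()
                .map(TupleCountSketch::splitFields)
                .collect(Collectors.groupingBy(
                        fields -> fields.get(keyIndex),
                        TreeMap::new,
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        // Assumption: the separator after the transaction number is a tab.
        List<String> lines = Arrays.asList(
                "BigPetStore,storeCode_CA,1\tbrandon,ewing,Mon Dec 08 20:23:57 EST 1969,16.5,organic-dog-food",
                "BigPetStore,storeCode_CA,2\tangie,coleman,Thu Dec 11 07:00:31 EST 1969,10.5,dog-food",
                "BigPetStore,storeCode_NY,1\tdale,holden,Mon Jan 12 23:02:13 EST 1970,19.75,fish-food");
        // Field 1 is the store code, so this counts transactions per store.
        System.out.println(countByField(lines, 1));
    }
}
```

With the three sample records above, field 1 is the store code, so the
result maps storeCode_CA to 2 and storeCode_NY to 1. What I am looking for
is the Crunch equivalent of countByField.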

thanks !

** FYI ** this is for the bigpetstore project. I'd like to show crunch
examples in it if I can get them working, as the API is a nice example of
a lower-level MapReduce paradigm which is more Java-friendly.

See https://issues.apache.org/jira/browse/BIGTOP-1089 and
https://github.com/jayunit100/bigpetstore for details.
