Hi Crunch!
I want to process a list in Crunch, something like this:
PCollection<String> lines = MemPipeline.collectionOf(
    "BigPetStore,storeCode_AK,1 lindsay,franco,Sat Jan 10 00:11:10 EST 1970,10.5,dog-food",
    "BigPetStore,storeCode_AZ,1 tom,giles,Sun Dec 28 23:08:45 EST 1969,10.5,dog-food",
    "BigPetStore,storeCode_CA,1 brandon,ewing,Mon Dec 08 20:23:57 EST 1969,16.5,organic-dog-food",
    "BigPetStore,storeCode_CA,2 angie,coleman,Thu Dec 11 07:00:31 EST 1969,10.5,dog-food",
    "BigPetStore,storeCode_CA,3 angie,coleman,Tue Jan 20 06:24:23 EST 1970,7.5,cat-food",
    "BigPetStore,storeCode_CO,1 sharon,trevino,Mon Jan 12 07:52:10 EST 1970,30.1,antelope snacks",
    "BigPetStore,storeCode_CT,1 kevin,fitzpatrick,Wed Dec 10 05:24:13 EST 1969,10.5,dog-food",
    "BigPetStore,storeCode_NY,1 dale,holden,Mon Jan 12 23:02:13 EST 1970,19.75,fish-food",
    "BigPetStore,storeCode_NY,2 dale,holden,Tue Dec 30 12:29:52 EST 1969,10.5,dog-food",
    "BigPetStore,storeCode_OK,1 donnie,tucker,Sun Jan 18 04:50:26 EST 1970,7.5,cat-food");
PCollection coll = lines.parallelDo(
    "split lines into words",
    new DoFn<String, String>() {
        @Override
        public void process(String line, Emitter emitter) {
            // Not sure this regex is exactly right, but you get the idea:
            // split by tabs and commas.
            emitter.emit(Arrays.asList(line.split("[\t,]")));
        }
    },
    Writables.lists()
).groupBy(0).count();
}
What is the correct abstraction in Crunch for converting raw text into
tuples and accessing their fields by index, so that I can then group and
count on one of those fields?
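To make the intent concrete, here is a minimal plain-Java sketch (no Crunch
involved) of the result I am after: split each line on tabs and commas, pick
one field by index, and count the lines per value of that field. The class and
method names (FieldCountSketch, toFields, countByField) are made up for
illustration, and it uses only three of the records above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: plain Java, not the Crunch API.
class FieldCountSketch {

    // Split a raw line on tabs or commas into an index-addressable list of fields.
    static List<String> toFields(String line) {
        return Arrays.asList(line.split("[\t,]"));
    }

    // Group lines by the field at the given index and count the lines in each group.
    static Map<String, Long> countByField(List<String> lines, int index) {
        Map<String, Long> counts = new TreeMap<>(); // sorted keys for stable output
        for (String line : lines) {
            String key = toFields(line).get(index);
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "BigPetStore,storeCode_AK,1 lindsay,franco,Sat Jan 10 00:11:10 EST 1970,10.5,dog-food",
            "BigPetStore,storeCode_CA,1 brandon,ewing,Mon Dec 08 20:23:57 EST 1969,16.5,organic-dog-food",
            "BigPetStore,storeCode_CA,2 angie,coleman,Thu Dec 11 07:00:31 EST 1969,10.5,dog-food");

        // Field 1 is the store code.
        System.out.println(countByField(lines, 1));
        // prints {storeCode_AK=1, storeCode_CA=2}
    }
}
```

So the question is really which Crunch types play the role of toFields (text
to tuple) and countByField (group on a tuple field, then count).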
Thanks!
** FYI ** This is for the bigpetstore project. I'd like to show Crunch
examples in it if I can get them working, as the API is a nice example of
a lower-level MapReduce paradigm that is more Java-friendly.
See https://issues.apache.org/jira/browse/BIGTOP-1089 and
https://github.com/jayunit100/bigpetstore for details.