grunt> data = LOAD 'data.txt' using PigJsonLoader as (json: map[]);
grunt> data = FOREACH data GENERATE json#'tags';
grunt> describe data;
data: {bytearray}
grunt> data = FOREACH data GENERATE FLATTEN($0);
grunt> describe data;
data: {bytearray}
grunt> dump data;
([["amenity","parking"],["name","Vuurtoren"]])

It's not removing the outside brackets, and then ToBag doesn't work
correctly. Instead I wrote my own "SplitIntoBag", which creates a tuple out
of each key/value pair, adds it to a bag, and returns the bag. Not sure if
this is the most efficient way, but it works so I'll roll with it.

-Kim

On Thu, Oct 14, 2010 at 1:33 PM, Dmitriy Ryaboy wrote:

> I see.. I think you can flatten each row in data to un-nest, so you will
> get (["amenity","parking"],["name","Vuurtoren"]); then for each resulting
> row call ToBag(*), getting ({["amenity","parking"],["name","Vuurtoren"]});
> then flatten *that*, getting a row per pair. Now you can group and count.
>
> Haven't tried it, let me know how it goes.
>
> -D
>
> On Thu, Oct 14, 2010 at 12:09 PM, Kim Vogt wrote:
>
> > Hey Dmitriy,
> >
> > I tried to keep my example simple but maybe that doesn't work so here it
> > goes. I'm trying to do group/counts on the "tag" key/values in json data
> > that looks like this:
> >
> > {
> >   "type":"Feature",
> >   "id":61561312,
> >   "geometry":{
> >     "type":"Polygon",
> >     "coordinates":[
> >       [
> >         [
> >           "53.18119",
> >           "4.85247"
> >         ],
> >         [
> >           "53.180908",
> >           "4.8518934"
> >         ],
> >         [
> >           "53.1807441",
> >           "4.8520919"
> >         ],
> >         [
> >           "53.181027",
> >           "4.8526444"
> >         ],
> >         [
> >           "53.18119",
> >           "4.85247"
> >         ]
> >       ]
> >     ]
> >   },
> >   "properties":{
> >     "uid":26959,
> >     "timestamp":"2010-06-09T12:25:02Z",
> >     "changeset":4944796,
> >     "user":"ttwimlex",
> >     "version":1
> >   },
> >   "tags":[
> >     [
> >       "amenity",
> >       "parking"
> >     ],
> >     [
> >       "name",
> >       "Vuurtoren"
> >     ]
> >   ]
> > }
> >
> > So in the end, I want stats on how many unique key/value tag pairs I see.
> > For just this one record, my output would be something like:
> >
> > (["amenity", "parking"], 1)
> > (["name", "Vuurtoren"], 1)
> >
> > I load the data using my jsonLoader and grab just the tags, like this:
> >
> > data = LOAD 'data.txt' using PigJsonLoader as (json: map[]);
> > data = FOREACH data GENERATE json#'tags';
> > dump data;
> >
> > ([["amenity","parking"],["name","Vuurtoren"]])
> >
> > and then I get stuck, because there can be any number of these tags in
> > each json record. I thought if I could split them out into multiple bags,
> > I could group and count. Or maybe I'm missing something obvious :-)
> >
> > -Kim
> >
> > On Thu, Oct 14, 2010 at 11:33 AM, Dmitriy Ryaboy wrote:
> >
> > > Kim,
> > > You can't just flatten it? Not sure I am following the example right.
> > >
> > > -D
> > >
> > > On Thu, Oct 14, 2010 at 10:30 AM, Kim Vogt wrote:
> > >
> > > > Hi,
> > > >
> > > > If I have bags that have a dynamic number of fields that look
> > > > something like this:
> > > >
> > > > ("park", "building", "office")
> > > > ("store", "school")
> > > > ("building", "school", "restaurant", "hotel")
> > > >
> > > > Is it possible to transform this into one tuple per bag so my data
> > > > looks like this and then I can do group bys and counts? Maybe I can
> > > > do this in an eval udf?
> > > >
> > > > ("park")
> > > > ("building")
> > > > ("office")
> > > > ("store")
> > > > ...
> > > >
> > > > -Kim
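
For reference, the transformation discussed in this thread (split each record's tag list into one key/value pair per row, then group and count) can be sketched outside Pig. This is only a plain-Python illustration of the logic, not the "SplitIntoBag" UDF itself, whose source is not shown in the thread; the function names `split_into_pairs` and `count_tag_pairs` are hypothetical, and the sample record is trimmed down to the "id" and "tags" fields from the JSON above.

```python
import json
from collections import Counter


def split_into_pairs(tags):
    """Turn one record's list of [key, value] lists into (key, value) tuples.

    Hypothetical stand-in for the role the thread's "SplitIntoBag" UDF plays:
    one tuple per tag pair, however many pairs the record has.
    """
    return [tuple(pair) for pair in tags]


def count_tag_pairs(records):
    """Group identical (key, value) pairs across all records and count them."""
    counts = Counter()
    for record in records:
        counts.update(split_into_pairs(record.get("tags", [])))
    return counts


# Sample record reduced to the fields that matter for the count.
records = [
    json.loads(
        '{"id": 61561312, "tags": [["amenity", "parking"], ["name", "Vuurtoren"]]}'
    )
]

for pair, n in count_tag_pairs(records).items():
    print(pair, n)
# ('amenity', 'parking') 1
# ('name', 'Vuurtoren') 1
```

The per-record split corresponds to producing a bag of tuples and flattening it; the `Counter` stands in for the subsequent group and count step.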
