I see. I think you can flatten each row in data to un-nest it, which gives you
(["amenity","parking"],["name","Vuurtoren"]); then call ToBag(*) on each
resulting row, getting ({["amenity","parking"],["name","Vuurtoren"]}); then
flatten *that*, getting one row per pair. Now you can group and count.
I haven't tried it, so let me know how it goes.
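
To make the intended data flow concrete, here is a rough sketch in Python
rather than Pig, so it says nothing about how PigJsonLoader actually types
the tags field. The names RECORD and count_tag_pairs are made up for
illustration, and the record is trimmed to the fields used here.

```python
import json
from collections import Counter

# One record shaped like the example in this thread, trimmed to the fields used here.
RECORD = '{"type": "Feature", "id": 61561312, "tags": [["amenity", "parking"], ["name", "Vuurtoren"]]}'


def count_tag_pairs(lines):
    """Count each (key, value) tag pair across JSON records, one record per line."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        # "Flatten": emit one row per tag pair, however many a record has.
        for key, value in record.get("tags", []):
            # "Group and count": identical pairs land on the same Counter key.
            counts[(key, value)] += 1
    return counts


if __name__ == "__main__":
    for pair, n in count_tag_pairs([RECORD]).items():
        print(list(pair), n)
```

The point is only that once every tag pair is its own row, the count is an
ordinary group-by on the pair, regardless of how many tags a record has.
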
-D
On Thu, Oct 14, 2010 at 12:09 PM, Kim Vogt <[email protected]> wrote:
> Hey Dmitriy,
>
> I tried to keep my example simple but maybe that doesn't work so here it
> goes. I'm trying to do group/counts on the "tag" key/values in json data
> that looks like this:
>
> {
>   "type": "Feature",
>   "id": 61561312,
>   "geometry": {
>     "type": "Polygon",
>     "coordinates": [
>       [
>         ["53.18119", "4.85247"],
>         ["53.180908", "4.8518934"],
>         ["53.1807441", "4.8520919"],
>         ["53.181027", "4.8526444"],
>         ["53.18119", "4.85247"]
>       ]
>     ]
>   },
>   "properties": {
>     "uid": 26959,
>     "timestamp": "2010-06-09T12:25:02Z",
>     "changeset": 4944796,
>     "user": "ttwimlex",
>     "version": 1
>   },
>   "tags": [
>     ["amenity", "parking"],
>     ["name", "Vuurtoren"]
>   ]
> }
>
> So in the end, I want stats on how many unique key/value tag pairs I see.
> For just this one record, my output would be something like:
>
> (["amenity", "parking"], 1)
> (["name", "Vuurtoren"], 1)
>
> I load the data using my jsonLoader and grab just the tags, like this:
>
> data = LOAD 'data.txt' using PigJsonLoader as (json: map[]);
> data = FOREACH data GENERATE json#'tags';
> dump data;
>
> ([["amenity","parking"],["name","Vuurtoren"]])
>
> and then I get stuck, because there can be any number of these tags in each
> json record. I thought if I could split them out into multiple bags, I
> could group and count. Or maybe I'm missing something obvious :-)
>
> -Kim
>
> On Thu, Oct 14, 2010 at 11:33 AM, Dmitriy Ryaboy <[email protected]> wrote:
>
> > Kim,
> > You can't just flatten it? Not sure I am following the example right.
> >
> > -D
> >
> > On Thu, Oct 14, 2010 at 10:30 AM, Kim Vogt <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > If I have bags that have a dynamic number of fields that look
> > > something like this:
> > >
> > > ("park", "building", "office")
> > > ("store", "school")
> > > ("building", "school", "restaurant", "hotel")
> > >
> > > Is it possible to transform this into one tuple per bag so my data
> > > looks like this and then I can do group bys and counts? Maybe I can
> > > do this in an eval udf?
> > >
> > > ("park")
> > > ("building")
> > > ("office")
> > > ("store")
> > > ...
> > >
> > >
> > > -Kim
> > >
> >
>