Hey Dmitriy,

I tried to keep my example simple, but maybe that doesn't work, so here
goes.  I'm trying to do group/counts on the "tags" key/value pairs in JSON
data that looks like this:

{
   "type":"Feature",
   "id":61561312,
   "geometry":{
      "type":"Polygon",
      "coordinates":[
         [
            [
               "53.18119",
               "4.85247"
            ],
            [
               "53.180908",
               "4.8518934"
            ],
            [
               "53.1807441",
               "4.8520919"
            ],
            [
               "53.181027",
               "4.8526444"
            ],
            [
               "53.18119",
               "4.85247"
            ]
         ]
      ]
   },
   "properties":{
      "uid":26959,
      "timestamp":"2010-06-09T12:25:02Z",
      "changeset":4944796,
      "user":"ttwimlex",
      "version":1
   },
   "tags":[
      [
         "amenity",
         "parking"
      ],
      [
         "name",
         "Vuurtoren"
      ]
   ]
}

So in the end, I want stats on how often each unique key/value tag pair occurs.
For just this one record, my output would be something like:

(["amenity", "parking"], 1)
(["name", "Vuurtoren"], 1)

I load the data using my jsonLoader and grab just the tags, like this:

data = LOAD 'data.txt' using PigJsonLoader as (json: map[]);
data = FOREACH data GENERATE json#'tags';
dump data;

([["amenity","parking"],["name","Vuurtoren"]])

and then I get stuck, because there can be any number of these tags in each
json record.  I thought if I could split them out into multiple bags, I
could group and count.  Or maybe I'm missing something obvious :-)
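
To pin down what I'm after, here is the same flatten/group/count logic as a
small Python sketch.  This is not Pig, just an illustration of the result I
want; the function name count_tag_pairs and the second record are made up
for the example.

```python
from collections import Counter


def count_tag_pairs(records):
    """Count how often each (key, value) tag pair occurs across all records."""
    counts = Counter()
    for record in records:
        # "Flatten": emit one (key, value) tuple per tag, however many there are.
        for key, value in record.get("tags", []):
            counts[(key, value)] += 1
    return counts


# The first record carries the tags from the sample above;
# the second one is hypothetical, added only to show counts above 1.
records = [
    {"id": 61561312, "tags": [["amenity", "parking"], ["name", "Vuurtoren"]]},
    {"id": 2, "tags": [["amenity", "parking"]]},
]

for pair, n in sorted(count_tag_pairs(records).items()):
    print(list(pair), n)
```

The inner loop is the part I can't figure out how to express in Pig: turning
a variable-length list of tags into one row per tag so that a plain group
and count works on it.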

-Kim

On Thu, Oct 14, 2010 at 11:33 AM, Dmitriy Ryaboy <[email protected]> wrote:

> Kim,
> You can't just flatten it? Not sure I am following the example right.
>
> -D
>
> On Thu, Oct 14, 2010 at 10:30 AM, Kim Vogt <[email protected]> wrote:
>
> > Hi,
> >
> > If I have bags that have a dynamic number of fields that look something
> > like this:
> >
> > ("park", "building", "office")
> > ("store", "school")
> > ("building", "school", "restaurant", "hotel")
> >
> > Is it possible to transform this into one tuple per bag so my data looks
> > like this and then I can do group bys and counts?  Maybe I can do this in
> > an eval udf?
> >
> > ("park")
> > ("building")
> > ("office")
> > ("store")
> > ...
> >
> >
> > -Kim
> >
>
