Hey Dmitriy,
I tried to keep my example simple, but maybe that doesn't work, so here
goes. I'm trying to do a group/count on the "tags" key/value pairs in JSON
data that looks like this:
{
  "type":"Feature",
  "id":61561312,
  "geometry":{
    "type":"Polygon",
    "coordinates":[
      [
        [
          "53.18119",
          "4.85247"
        ],
        [
          "53.180908",
          "4.8518934"
        ],
        [
          "53.1807441",
          "4.8520919"
        ],
        [
          "53.181027",
          "4.8526444"
        ],
        [
          "53.18119",
          "4.85247"
        ]
      ]
    ]
  },
  "properties":{
    "uid":26959,
    "timestamp":"2010-06-09T12:25:02Z",
    "changeset":4944796,
    "user":"ttwimlex",
    "version":1
  },
  "tags":[
    [
      "amenity",
      "parking"
    ],
    [
      "name",
      "Vuurtoren"
    ]
  ]
}
So in the end, I want stats on how many unique key/value tag pairs I see.
For just this one record, my output would be something like:
(["amenity", "parking"], 1)
(["name", "Vuurtoren"], 1)
I load the data using my jsonLoader and grab just the tags, like this:
data = LOAD 'data.txt' using PigJsonLoader as (json: map[]);
data = FOREACH data GENERATE json#'tags';
dump data;
([["amenity","parking"],["name","Vuurtoren"]])
And then I get stuck, because each JSON record can contain any number of
these tags. I thought if I could split each tag out into its own tuple, I
could group and count. Or maybe I'm missing something obvious :-)
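
To make the goal concrete, here is a rough sketch of the flatten, group,
and count logic in plain Python rather than Pig. It is only an illustration
of the intended result, not a Pig solution. It assumes one JSON object per
input line; the function name count_tag_pairs is made up, and the second and
third sample records are invented just to show a count above 1 and an empty
tag list.

```python
import json
from collections import Counter


def count_tag_pairs(lines):
    """Count each unique (key, value) tag pair across JSON records.

    Assumes one JSON object per line, with "tags" holding a list of
    [key, value] pairs as in the sample record (hypothetical helper).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # "Flatten": emit one (key, value) tuple per tag, however many
        # tags the record has, then count identical tuples.
        for key, value in record.get("tags", []):
            counts[(key, value)] += 1
    return counts


# First record mirrors the sample above (trimmed); the others are invented.
records = [
    '{"id": 61561312, "tags": [["amenity", "parking"], ["name", "Vuurtoren"]]}',
    '{"id": 2, "tags": [["amenity", "parking"]]}',
    '{"id": 3, "tags": []}',
]

for pair, n in sorted(count_tag_pairs(records).items()):
    print(list(pair), n)
```

Each tag becomes its own tuple before counting, which is the same shape of
result I'm after from a flatten followed by a group and count.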
-Kim
On Thu, Oct 14, 2010 at 11:33 AM, Dmitriy Ryaboy <[email protected]> wrote:
> Kim,
> You can't just flatten it? Not sure I am following the example right.
>
> -D
>
> On Thu, Oct 14, 2010 at 10:30 AM, Kim Vogt <[email protected]> wrote:
>
> > Hi,
> >
> > If I have bags that have a dynamic number of fields that look something
> > like this:
> >
> > ("park", "building", "office")
> > ("store", "school")
> > ("building", "school", "restaurant", "hotel")
> >
> > Is it possible to transform this into one tuple per bag so my data looks
> > like this and then I can do group bys and counts? Maybe I can do this in
> > an eval udf?
> >
> > ("park")
> > ("building")
> > ("office")
> > ("store")
> > ...
> >
> >
> > -Kim
> >
>