Re: extract dynamic number of tuples from bags

Kim Vogt Fri, 15 Oct 2010 11:05:49 -0700

grunt> data = LOAD 'data.txt' using PigJsonLoader as (json: map[]);
grunt> data = FOREACH data GENERATE json#'tags';
grunt> describe data;
data: {bytearray}
grunt> data = FOREACH data GENERATE FLATTEN($0);
grunt> describe data;
data: {bytearray}
grunt> dump data;
([["amenity","parking"],["name","Vuurtoren"]])


FLATTEN isn't removing the outer brackets (describe still shows a
bytearray), so ToBag doesn't work correctly afterwards.

Instead I wrote my own "SplitIntoBag" UDF that creates a tuple out of each
key/value pair, adds it to a bag, and returns the bag.  Not sure if this is
the most efficient way, but it works so I'll roll with it.
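
For anyone following along, the logic described above can be sketched roughly
as follows. This is a Python illustration only, not the actual Pig UDF (which
was not posted). The function name split_into_bag is hypothetical, and it
assumes the tags value arrives as a JSON-encoded string of [key, value] pairs,
as the dump output suggests.

```python
import json
from collections import Counter


def split_into_bag(raw):
    """Turn a JSON array of [key, value] pairs into a list of (key, value) tuples.

    Hypothetical stand-in for the SplitIntoBag UDF: the returned list plays
    the role of the bag, and each element plays the role of a tuple.
    """
    return [tuple(pair) for pair in json.loads(raw)]


# One raw string per record, matching the dump output shown above.
rows = ['[["amenity","parking"],["name","Vuurtoren"]]']

# Un-nest every bag into one pair per row, then group identical pairs and count.
counts = Counter(pair for raw in rows for pair in split_into_bag(raw))

for pair, n in counts.items():
    print(pair, n)
```

Because each record yields its own list, the number of tags per record can
vary freely; the counting step only ever sees individual pairs.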

-Kim

On Thu, Oct 14, 2010 at 1:33 PM, Dmitriy Ryaboy <[email protected]> wrote:

> I see.. I think you can flatten each row in data to un-nest, so you will get
> (["amenity","parking"],["name","Vuurtoren"]); then for each resulting row
> call ToBag(*), getting ({["amenity","parking"],["name","Vuurtoren"]}); then
> flatten *that*, getting a row per pair. Now you can group and count.
>
> Haven't tried it, let me know how it goes.
>
> -D
>
> On Thu, Oct 14, 2010 at 12:09 PM, Kim Vogt <[email protected]> wrote:
>
> > Hey Dmitriy,
> >
> > I tried to keep my example simple but maybe that doesn't work so here it
> > goes.  I'm trying to do group/counts on the "tag" key/values in json data
> > that looks like this:
> >
> > {
> >   "type":"Feature",
> >   "id":61561312,
> >   "geometry":{
> >      "type":"Polygon",
> >      "coordinates":[
> >         [
> >            [
> >               "53.18119",
> >               "4.85247"
> >            ],
> >            [
> >               "53.180908",
> >               "4.8518934"
> >            ],
> >            [
> >               "53.1807441",
> >               "4.8520919"
> >            ],
> >            [
> >               "53.181027",
> >               "4.8526444"
> >            ],
> >            [
> >               "53.18119",
> >               "4.85247"
> >            ]
> >         ]
> >      ]
> >   },
> >   "properties":{
> >      "uid":26959,
> >      "timestamp":"2010-06-09T12:25:02Z",
> >      "changeset":4944796,
> >      "user":"ttwimlex",
> >      "version":1
> >   },
> >   "tags":[
> >      [
> >         "amenity",
> >         "parking"
> >      ],
> >      [
> >         "name",
> >         "Vuurtoren"
> >      ]
> >   ]
> > }
> >
> > So in the end, I want stats on how many unique key/value tag pairs I see.
> > For just this one record, my output would be something like:
> >
> > (["amenity", "parking"], 1)
> > (["name", "Vuurtoren"], 1)
> >
> > I load the data using my jsonLoader and grab just the tags, like this:
> >
> > data = LOAD 'data.txt' using PigJsonLoader as (json: map[]);
> > data = FOREACH data GENERATE json#'tags';
> > dump data;
> >
> > ([["amenity","parking"],["name","Vuurtoren"]])
> >
> > and then I get stuck, because there can be any number of these tags in each
> > json record.  I thought if I could split them out into multiple bags, I
> > could group and count.  Or maybe I'm missing something obvious :-)
> >
> > -Kim
> >
> > On Thu, Oct 14, 2010 at 11:33 AM, Dmitriy Ryaboy <[email protected]> wrote:
> >
> > > Kim,
> > > You can't just flatten it? Not sure I am following the example right.
> > >
> > > -D
> > >
> > > On Thu, Oct 14, 2010 at 10:30 AM, Kim Vogt <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > If I have bags that have a dynamic number of fields that look something
> > > > like this:
> > > >
> > > > ("park", "building", "office")
> > > > ("store", "school")
> > > > ("building", "school", "restaurant", "hotel")
> > > >
> > > > Is it possible to transform this into one tuple per bag so my data looks
> > > > like this and then I can do group bys and counts?  Maybe I can do this in
> > > > an eval udf?
> > > >
> > > > ("park")
> > > > ("building")
> > > > ("office")
> > > > ("store")
> > > > ...
> > > >
> > > >
> > > > -Kim
> > > >
> > >
> >
>
