Re: JSONToTuple for pig UDF

Dmitriy Ryaboy Tue, 19 Apr 2011 11:56:15 -0700

YES :)

On Tue, Apr 19, 2011 at 11:49 AM, John Hui <[email protected]> wrote:


> I have a JSON library and pig script working.  Should I just contribute it
> instead of reinventing the wheel?
>
> John
>
> On Tue, Apr 19, 2011 at 2:44 PM, Daniel Eklund <[email protected]> wrote:
>
> > Bill,  thanks...
> >
> > So that is a confirmation... people have rolled their own, and it's not in
> > piggybank.
> > I would absolutely be willing to work with you to get a contribution going,
> > but (as a warning) I am extremely new to Pig.
> >
> > I was looking at this:
> > http://wiki.apache.org/pig/UDFManual
> > to get my mind wrapped around the framework.  And I also discovered this
> >
> > https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/piggybank/JsonStringToMap.java
> > (I am assuming this was the UDF you mentioned that inspired you)...
> >
> > A quick question about the UDFs registered at the top of a Pig script:
> >
> > does
> > REGISTER myJar.jar
> > distribute the jar across HDFS (like a Hadoop job jar) so that the
> > distribution of the code to the cluster nodes is transparent?
> > In other words, do we NOT have to distribute myJar.jar to each node on the
> > cluster?
> >
> > thanks more,
> > daniel
> >
> >
> >
> > On Tue, Apr 19, 2011 at 1:57 PM, Bill Graham <[email protected]> wrote:
> >
> > > We're doing the same thing using a JsonToMap UDF followed by a
> > > MapToBag UDF. The former was similarly inspired by the elephant bird
> > > JSONLoader. I'd be glad to collaborate on a contribution if you'd
> > > like.
> > >
> > > Here's what our scripts look like:
> > >
> > > define mapToBag cnwk.hadoop.mapreduce.pig.udf.MapToBag();
> > > define jsonToMap cnwk.hadoop.mapreduce.pig.udf.JsonToMap();
> > > define concat org.apache.pig.builtin.StringConcat();
> > >
> > > raw = LOAD 'hbase://user_info'
> > >      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('events:*')
> > >      AS (events_map:map[]);
> > >
> > > -- Convert our maps to bags so we can flatten them out
> > > B = FOREACH raw GENERATE mapToBag(events_map) AS event_bag;
> > >
> > > C = FOREACH B GENERATE FLATTEN(event_bag) AS (event_k:chararray,
> > > event_v:chararray);
> > >
> > > -- Convert the JSON events into maps
> > > D = FOREACH C GENERATE event_k, jsonToMap(event_v) AS event_map:map[];
> > >
> > > -- Example showing how to filter on a given field
> > > E = FILTER D BY (event_map#'levt.astid' IS NOT NULL AND
> > > event_map#'levt.asid' IS NOT NULL);
> > >
> > > -- Example showing how to pull data out of a map
> > > F = FOREACH E GENERATE event_map#'levt.asid' AS asid,
> > >                        event_map#'levt.astid' AS astid;
> > >
> > >
> > > thanks,
> > > Bill
> > >
> > > On Tue, Apr 19, 2011 at 10:08 AM, Daniel Eklund <[email protected]>
> > > wrote:
> > > > I noticed that there is a Pig JSON Loader (which might or might not be in
> > > > piggybank).
> > > > Could anyone confirm the existence or absence of a JSONToTuple UDF (not a
> > > > loader)?
> > > >
> > > > I am inspired by the UDF mentioned on Slide 23 here:
> > > > http://www.slideshare.net/danharvey/hbase-at-mendeley
> > > >
> > > >  doc = FOREACH rawdocs GENERATE DocumentProtobufBytesToTuple(protodoc) as DOC;
> > > >
> > > > My desire is to store a raw JSON doc in a cell in HBase and run Pig queries
> > > > against the tuples generated by the UDF.
> > > > I have already used the HBase loader to get the cell data, and now I need a
> > > > JSON deserializer.
> > > >
> > > > I would be willing to roll my own (and contribute it), but I figured I'd see
> > > > if there was anything out there first.
> > > >
> > > > thanks,
> > > > daniel
> > > >
> > >
> >
>
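
For readers following the thread, here is a minimal sketch of the core logic a JSONToTuple-style function might perform: parse a JSON string and pull out a fixed list of fields, in order, as a tuple. It is only an illustration and not any of the UDFs discussed above. The function name json_to_tuple, the fields parameter, and the sample values are all assumptions made for the example; the field names 'levt.asid' and 'levt.astid' are borrowed from Bill's script. Wiring this into Pig (as a Java EvalFunc or a scripted UDF) is deliberately left out.

```python
import json


def json_to_tuple(doc, fields):
    """Return the values of `fields` from a JSON object string, in order.

    Hypothetical helper for illustration only. Missing fields become None.
    Input that is null, unparseable, or not a JSON object yields None,
    so a bad record does not raise an error.
    """
    if doc is None:
        return None
    try:
        parsed = json.loads(doc)
    except (TypeError, ValueError):
        # Malformed JSON or a non-string input: treat the record as null.
        return None
    if not isinstance(parsed, dict):
        # Only top-level JSON objects map naturally onto named fields.
        return None
    return tuple(parsed.get(name) for name in fields)


if __name__ == "__main__":
    # Sample values are made up for the example.
    event = '{"levt.asid": "a1", "levt.astid": "t9"}'
    print(json_to_tuple(event, ["levt.asid", "levt.astid"]))
```

Returning None for bad input, rather than raising, is one possible design choice: it keeps a single malformed record from stopping the whole run, at the cost of hiding parse failures unless they are counted separately.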
