YES :)

On Tue, Apr 19, 2011 at 11:49 AM, John Hui <[email protected]> wrote:
> I have a JSON library and pig script working. Should I just contribute it
> instead of reinventing the wheel?
>
> John
>
> On Tue, Apr 19, 2011 at 2:44 PM, Daniel Eklund <[email protected]> wrote:
>
> > Bill, thanks...
> >
> > so that is a confirmation... people have rolled their own, and it's not
> > in piggybank.
> > I would absolutely be willing to work with you to get a contribution
> > going, but (as a warning) I am extremely new to Pig.
> >
> > I was looking at this:
> > http://wiki.apache.org/pig/UDFManual
> > to get my mind wrapped around the framework. And I also discovered this:
> >
> > https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/piggybank/JsonStringToMap.java
> > (I am assuming this was the UDF you mentioned that inspired you)...
> >
> > A quick question about the UDFs registered at the top of a pig script:
> >
> > does
> > REGISTER myJar.jar
> > distribute the jar across HDFS (like a Hadoop job jar) so that the
> > distribution of the code to the cluster nodes is transparent?
> > In other words, do we NOT have to distribute myJar.jar to each node on
> > the cluster?
> >
> > thanks more,
> > daniel
> >
> > On Tue, Apr 19, 2011 at 1:57 PM, Bill Graham <[email protected]> wrote:
> >
> > > We're doing the same thing using a JsonToMap UDF followed by a
> > > MapToBag UDF. The former was similarly inspired by the elephant bird
> > > JSONLoader. I'd be glad to collaborate on a contribution if you'd
> > > like.
> > >
> > > Here's what our scripts look like:
> > >
> > > define mapToBag cnwk.hadoop.mapreduce.pig.udf.MapToBag();
> > > define jsonToMap cnwk.hadoop.mapreduce.pig.udf.JsonToMap();
> > > define concat org.apache.pig.builtin.StringConcat();
> > >
> > > raw = LOAD 'hbase://user_info'
> > >     USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('events:*')
> > >     AS (events_map:map[]);
> > >
> > > -- Convert our maps to bags so we can flatten them out
> > > B = FOREACH raw GENERATE mapToBag(events_map) AS event_bag;
> > >
> > > C = FOREACH B GENERATE FLATTEN(event_bag) AS (event_k:chararray,
> > >     event_v:chararray);
> > >
> > > -- Convert the JSON events into maps
> > > D = FOREACH C GENERATE event_k, jsonToMap(event_v) AS event_map:map[];
> > >
> > > -- Example showing how to filter on a given field
> > > E = FILTER D BY (event_map#'levt.astid' IS NOT NULL AND
> > >     event_map#'levt.asid' IS NOT NULL);
> > >
> > > -- Example showing how to pull data out of a map
> > > F = FOREACH E GENERATE event_map#'levt.asid' AS asid,
> > >     event_map#'levt.astid' AS astid;
> > >
> > >
> > > thanks,
> > > Bill
> > >
> > > On Tue, Apr 19, 2011 at 10:08 AM, Daniel Eklund <[email protected]> wrote:
> > > > I noticed that there is a Pig JSON Loader (which might or might not
> > > > be in piggybank).
> > > > Could anyone confirm the existence or absence of a JSONToTuple UDF?
> > > > (not a loader)
> > > >
> > > > I am inspired by the UDF mentioned on Slide 23 here:
> > > > http://www.slideshare.net/danharvey/hbase-at-mendeley
> > > >
> > > > doc = FOREACH rawdocs GENERATE DocumentProtobufBytesToTuple(protodoc)
> > > >     as DOC;
> > > >
> > > > My desire is to store a raw JSON doc in a cell in HBase and run pig
> > > > queries against the tuples generated by the UDF.
> > > > I used the HBase Loader already to get the cell-data, and now I need
> > > > a JSON-deserializer.
> > > >
> > > > I would be willing to roll my own (and contribute), but I figure I'd
> > > > see if there was anything out there first.
> > > >
> > > > thanks,
> > > > daniel
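
For reference, the JSON-deserializer step discussed in this thread can be sketched roughly as follows. This is only an illustration of the idea (a JSON string in, a flat string-to-string map out, null on bad input). It is not Bill's JsonToMap, not the elephant-bird JsonStringToMap, and not anything in piggybank. The function name json_to_map and the choice to re-serialize nested values as JSON text are assumptions made for the example, and wiring it into Pig as an actual UDF is left out.

```python
import json


def json_to_map(text):
    """Parse a JSON object string into a flat dict with string values.

    Hypothetical helper for illustration only. Returns None for null,
    malformed, or non-object input so a later IS NOT NULL style filter
    can drop bad records.
    """
    if text is None:
        return None
    try:
        parsed = json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(parsed, dict):
        return None

    result = {}
    for key, value in parsed.items():
        if value is None:
            # Skip JSON nulls so the key simply looks absent in the map.
            continue
        if isinstance(value, str):
            result[key] = value
        else:
            # Assumption: non-string values (numbers, lists, nested
            # objects) are stored as their JSON text.
            result[key] = json.dumps(value)
    return result
```

Called on an event string with keys such as 'levt.asid' and 'levt.astid', this would yield a map that can be indexed by those keys, which is the role jsonToMap plays in the script quoted above.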
