Re: JSONToTuple for pig UDF

Xavier Stevens Tue, 19 Apr 2011 11:58:19 -0700

For what it's worth, I have one as well. This one uses Jackson to parse
everything.


https://github.com/xstevens/akela/blob/master/src/java/com/mozilla/pig/eval/json/JsonMap.java


On 4/19/11 11:55 AM, Dmitriy Ryaboy wrote:
> YES :)
>
> On Tue, Apr 19, 2011 at 11:49 AM, John Hui <[email protected]> wrote:
>
>> I have a JSON library and pig script working.  Should I just contribute it
>> instead of reinventing the wheel?
>>
>> John
>>
>> On Tue, Apr 19, 2011 at 2:44 PM, Daniel Eklund <[email protected]> wrote:
>>
>>> Bill,  thanks...
>>>
>>> so that is a confirmation... people have rolled their own, and it's not in
>>> piggybank.
>>> I would absolutely be willing to work with you to get a contribution going,
>>> but (as a warning) I am extremely new to Pig.
>>>
>>> I was looking at this:
>>> http://wiki.apache.org/pig/UDFManual
>>> to get my mind wrapped around the framework.  And I also discovered this:
>>>
>>> https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/piggybank/JsonStringToMap.java
>>> (I am assuming this was the UDF you mentioned that inspired you)...
>>>
>>> A quick question about the UDFs registered at the top of a pig script:
>>>
>>> Does
>>> REGISTER myJar.jar
>>> distribute the jar across HDFS (like a Hadoop job jar) so that the
>>> distribution of the code to the cluster nodes is transparent?
>>> In other words, do we NOT have to distribute myJar.jar to each node on the
>>> cluster?
>>>
>>> thanks more,
>>> daniel
>>>
>>>
>>>
>>> On Tue, Apr 19, 2011 at 1:57 PM, Bill Graham <[email protected]> wrote:
>>>> We're doing the same thing using a JsonToMap UDF followed by a
>>>> MapToBag UDF. The former was similarly inspired by the elephant bird
>>>> JSONLoader. I'd be glad to collaborate on a contribution if you'd
>>>> like.
>>>>
>>>> Here's what our scripts look like:
>>>>
>>>> define mapToBag cnwk.hadoop.mapreduce.pig.udf.MapToBag();
>>>> define jsonToMap cnwk.hadoop.mapreduce.pig.udf.JsonToMap();
>>>> define concat org.apache.pig.builtin.StringConcat();
>>>>
>>>> raw = LOAD 'hbase://user_info'
>>>>      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('events:*')
>>>>      AS (events_map:map[]);
>>>>
>>>> -- Convert our maps to bags so we can flatten them out
>>>> B = FOREACH raw GENERATE mapToBag(events_map) AS event_bag;
>>>>
>>>> C = FOREACH B GENERATE FLATTEN(event_bag) AS (event_k:chararray,
>>>> event_v:chararray);
>>>>
>>>> -- Convert the JSON events into maps
>>>> D = FOREACH C GENERATE event_k, jsonToMap(event_v) AS event_map:map[];
>>>>
>>>> -- Example showing how to filter on a given field
>>>> E = FILTER D BY (event_map#'levt.astid' IS NOT NULL AND
>>>> event_map#'levt.asid' IS NOT NULL);
>>>>
>>>> -- Example showing how to pull data out of a map
>>>> F = FOREACH E GENERATE event_map#'levt.asid' AS asid,
>>>>                        event_map#'levt.astid' AS astid;
>>>>
>>>>
>>>> thanks,
>>>> Bill
>>>>
>>>> On Tue, Apr 19, 2011 at 10:08 AM, Daniel Eklund <[email protected]> wrote:
>>>>> I noticed that there is a Pig JSON Loader (which might or might not be in
>>>>> piggybank).
>>>>> Could anyone confirm the existence or absence of a JSONToTuple UDF (not a
>>>>> loader)?
>>>>>
>>>>> I am inspired by the UDF mentioned on Slide 23 here:
>>>>> http://www.slideshare.net/danharvey/hbase-at-mendeley
>>>>>
>>>>>  doc = FOREACH rawdocs GENERATE DocumentProtobufBytesToTuple(protodoc) as DOC;
>>>>>
>>>>> My desire is to store a raw JSON doc in a cell in HBase and run pig queries
>>>>> against the tuples generated by the UDF.
>>>>> I used the HBase Loader already to get the cell-data, and now I need a
>>>>> JSON-deserializer.
>>>>>
>>>>> I would be willing to roll my own (and contribute), but I figure I'd see if
>>>>> there was anything out there first.
>>>>>
>>>>> thanks,
>>>>> daniel
>>>>>
