We're doing the same thing using a MapToBag UDF followed by a
JsonToMap UDF. The latter was similarly inspired by the Elephant Bird
JSONLoader. I'd be glad to collaborate on a contribution if you'd
like.
Here's what our scripts look like:
define mapToBag cnwk.hadoop.mapreduce.pig.udf.MapToBag();
define jsonToMap cnwk.hadoop.mapreduce.pig.udf.JsonToMap();
define concat org.apache.pig.builtin.StringConcat();
raw = LOAD 'hbase://user_info'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('events:*')
    AS (events_map:map[]);
-- Convert our maps to bags so we can flatten them out
B = FOREACH raw GENERATE mapToBag(events_map) AS event_bag;
C = FOREACH B GENERATE FLATTEN(event_bag) AS (event_k:chararray,
    event_v:chararray);
-- Convert the JSON events into maps
D = FOREACH C GENERATE event_k, jsonToMap(event_v) AS event_map:map[];
-- Example showing how to filter on a given field
E = FILTER D BY (event_map#'levt.astid' IS NOT NULL AND
    event_map#'levt.asid' IS NOT NULL);
-- Example showing how to pull data out of a map
F = FOREACH E GENERATE event_map#'levt.asid' AS asid,
    event_map#'levt.astid' AS astid;
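
For anyone reading along without the UDFs on hand, the data flow in the script above can be roughly sketched in plain Python. This is only an illustration under assumptions: the function names json_to_map, map_to_bag, and extract_ids are hypothetical, the real JsonToMap and MapToBag are Java UDFs whose source isn't shown here, and how they treat malformed JSON or empty maps is a guess on the sketch's part.

```python
import json


def json_to_map(json_text):
    """Parse a JSON object string into a dict; return None on bad input.

    Assumption: malformed or non-object JSON yields None (a Pig null).
    """
    if json_text is None:
        return None
    try:
        parsed = json.loads(json_text)
    except ValueError:
        return None
    return parsed if isinstance(parsed, dict) else None


def map_to_bag(mapping):
    """Turn a map into a list of (key, value) tuples, one per entry."""
    if not mapping:
        return []
    return [(str(key), value) for key, value in mapping.items()]


def extract_ids(events_map):
    """Mirror relations B through F of the script for a single row."""
    results = []
    # B and C: map -> bag -> flattened (event_k, event_v) pairs
    for event_k, event_v in map_to_bag(events_map):
        # D: parse the JSON value into a map
        event_map = json_to_map(event_v)
        if event_map is None:
            continue
        asid = event_map.get('levt.asid')
        astid = event_map.get('levt.astid')
        # E: keep only events that carry both fields
        if asid is not None and astid is not None:
            # F: project the two fields
            results.append((asid, astid))
    return results
```

Calling extract_ids on a dict of column name to JSON string stands in for one row of the events_map column; events missing either field, or holding unparseable JSON, are dropped just as the FILTER step drops them.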
thanks,
Bill
On Tue, Apr 19, 2011 at 10:08 AM, Daniel Eklund <[email protected]> wrote:
> I noticed that there is a Pig JSON Loader (which might or might not be in
> piggybank).
> Could anyone confirm the existence or absence of a JSONToTuple UDF? (not a
> loader)
>
> I am inspired by the UDF mentioned on Slide 23 here:
> http://www.slideshare.net/danharvey/hbase-at-mendeley
>
> doc = FOREACH rawdocs GENERATE DocumentProtobufBytesToTuple(protodoc) as
> DOC;
>
> My desire is to store a raw JSON doc in a cell in HBase and run pig queries
> against the tuples generated by the UDF.
> I used the HBase Loader already to get the cell-data, and now I need a
> JSON-deserializer.
>
> I would be willing to roll my own (and contribute it), but I figured I'd see if
> there was anything out there first.
>
> thanks,
> daniel
>