This is what finally worked:

1. Download elephant-bird-pig.jar and put it in HDFS.
2. REGISTER 'elephant-bird-pig.jar'; in the grunt shell.
3. Use com.twitter.elephantbird.pig.piggybank.JsonStringToMap(attributes#'md') AS metadata
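
Combined with the statements quoted further down in this thread, the steps above might look roughly like the sketch below. The HDFS location of the elephant-bird jar is a made-up placeholder, and depending on the elephant-bird version you may have to REGISTER additional jars it depends on (that part is an assumption, not something verified here). The metadata#'cId' and metadata#'sId' lookups assume the map returned by JsonStringToMap is keyed by the JSON field names.

```pig
-- Sketch only: jar locations are placeholders, adjust them to your cluster.
REGISTER 'hdfs://hadoop-prod-master.vpc:8020/user/hdfs/libs/prod.jar';  -- custom jar with Loader()
REGISTER 'hdfs:///user/hdfs/libs/elephant-bird-pig.jar';                -- hypothetical path

-- Short alias for the elephant-bird UDF.
DEFINE JsonStringToMap com.twitter.elephantbird.pig.piggybank.JsonStringToMap();

records_log = LOAD 'hdfs://hadoop-prod-master.vpc:8020/data/{prod}/{2013-06-20-11}/*'
              USING com.example.Loader()
              AS (date:chararray, type:chararray, attributes:[]);

http = FILTER records_log BY type == 'm' AND attributes#'st' == 'http';

-- Turn the JSON string stored under attributes#'md' into a Pig map.
X = FOREACH http GENERATE JsonStringToMap(attributes#'md') AS metadata;

-- Dereference the map to get cId and sId as separate columns.
Y = FOREACH X GENERATE metadata#'cId' AS cId, metadata#'sId' AS sId;

DESCRIBE Y;
```

Unlike the regex approach, this does not depend on the order of the keys inside the JSON string.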
Works brilliantly.

HTH
Ss

On Tue, Jun 25, 2013 at 6:35 PM, Abhinav Neelam <[email protected]> wrote:

> Use REGEX_EXTRACT_ALL
> Something like this should work (untested, please verify)
>
> rel2 = foreach rel1 generate
> FLATTEN(REGEX_EXTRACT_ALL(attributes#'md','\\{"cld":"(\\w+)","sld":"(\\w+)"\\}'))
> AS (cld: chararray, sld: chararray);
>
> Tighten up the regex appropriately.
>
>
> On 24 June 2013 14:55, Suresh Saggar <[email protected]> wrote:
>
> > *Thanks a lot* for your reply but the problem still exists. To clarify
> > further, the exact sequence of pig statements is shown below:
> >
> > REGISTER 'hdfs://hadoop-prod-master.vpc:8020/user/hdfs/libs/prod.jar'; <<<<< *Our custom jar containing the Loader() code.*
> > records_log = LOAD 'hdfs://hadoop-prod-master.vpc:8020/data/{prod}/{2013-06-20-11}/*' USING com.example.Loader() AS (date:chararray, type:chararray, attributes:[]);
> > http = FILTER records_log BY type == 'm' AND attributes#'st' == 'http';
> > X = FOREACH http GENERATE attributes#'md' AS metadata;
> > Y = FOREACH X GENERATE FLATTEN(metadata);
> >
> > grunt> describe Y
> > Y: {metadata: bytearray}
> > grunt> describe X
> > X: {metadata: bytearray}
> >
> > Once I dump either X or Y, both result in the same. Further I tried FLATTEN
> > directly on records_log too, but no help i.e.
> > Z = FOREACH records_log GENERATE FLATTEN(attributes);
> >
> > Similarly JsonStorage() can't be used directly as my raw data (the one stored
> > in HDFS) is not json, but a custom format as shown below:
> > 2013-06-20-11|m|{'st':'http','md':{'cId':'a','sId':'b'}}
> >
> > Here our Loader() takes the above raw data as input and returns the output in
> > the format: (date:chararray, type:chararray, attributes:[]). Now since
> > attributes#'md' is a JSON here, I'm having problems getting the 'cId' &
> > 'sId' values. Hope this clarifies the context. I assume that the FLATTEN
> > operator couldn't 'un-nest' the attributes#'md' as that is represented as
> > {'cId':'a','sId':'b'} but not as ['cId'#'a','sId'#'b'] (map in pig) or
> > {('cId'#'a'),('sId'#'b')} (bag in pig).
> >
> > TIA
> > Ss
> >
> > On Fri, Jun 21, 2013 at 6:12 PM, Pradeep Gollakota <[email protected]> wrote:
> >
> > > Suresh,
> > >
> > > Look into using JsonStorage(). This seems to be what you're looking for.
> > > http://pig.apache.org/docs/r0.10.0/func.html#jsonloadstore
> > >
> > >
> > > On Fri, Jun 21, 2013 at 8:35 AM, Shahab Yunus <[email protected]> wrote:
> > >
> > > > Have you tried flattening the bag first?
> > > >
> > > >
> > > > On Fri, Jun 21, 2013 at 5:43 AM, Suresh Saggar <[email protected]> wrote:
> > > >
> > > > > Facing a similar challenge. Here X contains one column named 'metadata'
> > > > > of type bytearray. But the actual content is a JSON i.e. the value of
> > > > > the metadata field is a JSON (keys as sId & cId) as shown below:
> > > > >
> > > > > grunt> describe X
> > > > > X: {metadata: bytearray}
> > > > >
> > > > > grunt> dump X
> > > > > ({"sId":"003_w","cId":"k"})
> > > > > ({"sId":"001_rf","cId":"r"})
> > > > > ({"sId":"001_rf","cId":"r"})
> > > > > ({"sId":"004_rf","cId":"r"})
> > > > >
> > > > > Any idea how I can generate cId & sId as separate chararray columns?
> > > > > TIA
> > > > >
> > > > > Ss
> > > > >
> > > > > On Tue, Jun 18, 2013 at 5:52 AM, Pradeep Gollakota <[email protected]> wrote:
> > > > >
> > > > > > What's the error you are seeing? What does your bag of maps look like?
> > > > > > What exactly is a userId? Is it a field or is it a key in the map?
> > > > > >
> > > > > >
> > > > > > On Mon, Jun 17, 2013 at 8:18 PM, Jerry Lam <[email protected]> wrote:
> > > > > >
> > > > > > > Hi Pig users,
> > > > > > >
> > > > > > > Does anyone have experience in dereferencing a bag of maps? For instance
> > > > > > > (in the example below), doc in B contains maps of userId and time. I
> > > > > > > want to keep only userId in C. Pig throws an exception on C. Any help is
> > > > > > > appreciated.
> > > > > > >
> > > > > > > A = LOAD 'data' AS doc:bytearray;
> > > > > > >
> > > > > > > B = FOREACH A GENERATE (bag{})doc;
> > > > > > >
> > > > > > > -- C = FOREACH B GENERATE doc.userId; // this doesn't work.
> > > > > > >
> > > > > > > Best Regards,
> > > > > > >
> > > > > > > Jerry
