OK, I'll go back to my original question (although this time I know what
tools I'm using).
I am using Pig + ElephantBird.
I have JSON documents with the following structure:
{
  g : "some-group-identifier",
  sg : "some-subgroup-identifier",
  j : "some-job-identifier",
  page : 23,
  ... // other fields omitted
  important-data : [
    {
      f1 : "abc",
      f2 : "a",
      f3 : "/",
      ...
    },
    ...
    {
      f1 : "xyz",
      f2 : "q",
      f3 : "/",
      ...
    },
  ],
  ... // other fields omitted
}
I want Pig to GENERATE a tuple for each element of the "important-data"
array. For the example above, I would like to generate:
( "some-group-identifier" , "some-subgroup-identifier", 23, "abc", "a", "/"
)
( "some-group-identifier" , "some-subgroup-identifier", 23, "xyz", "q", "/"
)
This is what I have tried:
doc = LOAD '/example.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
flat = FOREACH doc GENERATE (chararray)json#'g' AS g, (chararray)json#'sg' AS sg, (long)json#'page' AS page, FLATTEN(json#'important-data');
DUMP flat;
but that produces:
( "some-group-identifier" , "some-subgroup-identifier", 23, [ f1#abc, f2#a,
f3#/ ] )
( "some-group-identifier" , "some-subgroup-identifier", 23, [ f1#xyz, f2#q,
f3#/ ] )
Close, but not exactly what I want.
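I suppose the missing piece is a second FOREACH that projects each flattened map into plain fields. This is just an untested sketch (in particular, I'm not sure whether FLATTEN needs the explicit bag-of-maps cast; the field names f1/f2/f3 are from the example above):

doc = LOAD '/example.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
-- one row per element of the "important-data" array; the cast tells Pig the value is a bag of maps
rows = FOREACH doc GENERATE (chararray)json#'g' AS g, (chararray)json#'sg' AS sg, (long)json#'page' AS page, FLATTEN((bag{tuple(map[])})json#'important-data') AS item;
-- project each element's map entries into plain tuple fields
flat = FOREACH rows GENERATE g, sg, page, (chararray)item#'f1' AS f1, (chararray)item#'f2' AS f2, (chararray)item#'f3' AS f3;
DUMP flat;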
Or do I need to use ProtoBuf?
-Jorge
On Wed, Jun 19, 2013 at 3:44 PM, Tecno Brain <[email protected]> wrote:
> Ok, I found that the elephant-bird JsonLoader cannot handle JSON documents
> that are pretty-printed (expanding over multiple lines). The entire JSON
> document has to be on a single line.
>
> After I reformatted some of the source files, I am now getting the
> expected output.
>
> On Wed, Jun 19, 2013 at 2:47 PM, Tecno Brain <[email protected]> wrote:
>
>> I also tried:
>>
>> doc = LOAD '/json-pcr/pcr-000001.json' USING com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map[]);
>> flat = FOREACH doc GENERATE (chararray)json#'a' AS first, (long)json#'b' AS second;
>> DUMP flat;
>>
>> but I got no output either.
>>
>> Input(s):
>> Successfully read 0 records (35863 bytes) from:
>> "/json-pcr/pcr-000001.json"
>>
>> Output(s):
>> Successfully stored 0 records in:
>> "hdfs://localhost:9000/tmp/temp-1239058872/tmp-1260892210"
>>
>>
>>
>> On Wed, Jun 19, 2013 at 2:36 PM, Tecno Brain <[email protected]> wrote:
>>
>>> I got Pig and Hive working on a single node, and I am able to run some
>>> scripts/queries over regular text files (access log files), with a record
>>> per line.
>>>
>>> Now, I want to process some JSON files.
>>>
>>> As mentioned before, it seems that ElephantBird would be a good solution
>>> for reading JSON files.
>>>
>>> I uploaded 5 files to HDFS. Each file contains only a single JSON
>>> document. The documents are NOT on a single line; they contain
>>> pretty-printed JSON spanning multiple lines.
>>>
>>> I'm trying something simple, extracting two (primitive) attributes at
>>> the top of the document:
>>> {
>>>   a : "some value",
>>>   ...
>>>   b : 133,
>>>   ...
>>> }
>>>
>>> So, let's start with a LOAD of a single file (a single JSON document):
>>>
>>> REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
>>> doc = LOAD '/json-pcr/pcr-000001.json' USING com.twitter.elephantbird.pig.load.JsonLoader();
>>> flat = FOREACH doc GENERATE (chararray)$0#'a' AS first, (long)$0#'b' AS second;
>>> DUMP flat;
>>>
>>> Apparently the job runs without problems, but I get no output. The
>>> output I get includes this message:
>>>
>>> Input(s):
>>> Successfully read 0 records (35863 bytes) from:
>>> "/json-pcr/pcr-000001.json"
>>>
>>> I was expecting to get
>>>
>>> ( "some value", 133 )
>>>
>>> Any idea on what I am doing wrong?
>>>
>>> On Thu, Jun 13, 2013 at 3:05 PM, Michael Segel <[email protected]> wrote:
>>>
>>>> I think you have a misconception of HBase.
>>>>
>>>> You don't actually need mutable data for it to be effective.
>>>> The key is that you need access to specific records and to work with a
>>>> very small subset of the data, not the complete data set.
>>>>
>>>>
>>>> On Jun 13, 2013, at 11:59 AM, Tecno Brain <[email protected]>
>>>> wrote:
>>>>
>>>> Hi Mike,
>>>>
>>>> Yes, I have also thought about HBase or Cassandra, but my data is pretty
>>>> much a snapshot; it does not require updates. Most of my aggregations will
>>>> also need to be computed only once and won't change over time, with the
>>>> exception of some aggregations based on the last N days of data. Should I
>>>> still consider HBase? I think it will probably be good for the aggregated
>>>> data.
>>>>
>>>> I have no idea what sequence files are, but I will take a look. My raw
>>>> data is stored in the cloud, not in my Hadoop cluster.
>>>>
>>>> I'll keep looking at Pig with ElephantBird.
>>>> Thanks,
>>>>
>>>> -Jorge
>>>>
>>>> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <[email protected]> wrote:
>>>>
>>>>> Hi..
>>>>>
>>>>> Have you thought about HBase?
>>>>>
>>>>> If you're using Hive or Pig, I would suggest taking these files and
>>>>> putting the JSON records into a sequence file, or a set of sequence
>>>>> files. (Then look at HBase to help index them...) 200KB is small.
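>>>>> Reading them back in Pig would then look roughly like this; I'm going
>>>>> from memory on the converter arguments (and the path is made up), so
>>>>> double-check against the elephant-bird examples:
>>>>>
>>>>> -- each record: key ignored, value = one JSON document on a single line
>>>>> raw = LOAD '/json-seq' USING com.twitter.elephantbird.pig.load.SequenceFileLoader(
>>>>>     '-c com.twitter.elephantbird.pig.util.TextConverter',
>>>>>     '-c com.twitter.elephantbird.pig.util.TextConverter')
>>>>>   AS (key:chararray, json:chararray);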
>>>>>
>>>>> That would be the same for either Pig or Hive.
>>>>>
>>>>> In terms of SerDes, I've worked with Pig and ElephantBird; it's pretty
>>>>> nice. And yes, you get each record as a row; however, you can always
>>>>> flatten them as needed, as in the sketch below.
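>>>>> Rough sketch (untested, with a made-up path and array name):
>>>>>
>>>>> docs = LOAD '/your-json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
>>>>> -- FLATTEN turns the nested array into one output row per element
>>>>> rows = FOREACH docs GENERATE FLATTEN(json#'some-array');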
>>>>>
>>>>> Hive?
>>>>> I haven't worked with the latest SerDe, but maybe Dean Wampler or
>>>>> Edward Capriolo could give you a better answer.
>>>>> Going from memory, I don't know that there is a good Hive SerDe that
>>>>> would write JSON, just read it.
>>>>>
>>>>> IMHO Pig/ElephantBird is the best so far, but then again I may be
>>>>> dated and biased.
>>>>>
>>>>> I think you're on the right track or at least train of thought.
>>>>>
>>>>> HTH
>>>>>
>>>>> -Mike
>>>>>
>>>>>
>>>>> On Jun 12, 2013, at 7:57 PM, Tecno Brain <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Hello,
>>>>> I'm new to Hadoop.
>>>>> I have a large quantity of JSON documents with a structure similar
>>>>> to what is shown below.
>>>>>
>>>>> {
>>>>>   g : "some-group-identifier",
>>>>>   sg : "some-subgroup-identifier",
>>>>>   j : "some-job-identifier",
>>>>>   page : 23,
>>>>>   ... // other fields omitted
>>>>>   important-data : [
>>>>>     {
>>>>>       f1 : "abc",
>>>>>       f2 : "a",
>>>>>       f3 : "/",
>>>>>       ...
>>>>>     },
>>>>>     ...
>>>>>     {
>>>>>       f1 : "xyz",
>>>>>       f2 : "q",
>>>>>       f3 : "/",
>>>>>       ...
>>>>>     },
>>>>>   ],
>>>>>   ... // other fields omitted
>>>>>   other-important-data : [
>>>>>     {
>>>>>       x1 : "ford",
>>>>>       x2 : "green",
>>>>>       x3 : 35,
>>>>>       map : {
>>>>>         "free-field" : "value",
>>>>>         "other-free-field" : "value2"
>>>>>       }
>>>>>     },
>>>>>     ...
>>>>>     {
>>>>>       x1 : "vw",
>>>>>       x2 : "red",
>>>>>       x3 : 54,
>>>>>       ...
>>>>>     },
>>>>>   ]
>>>>> }
>>>>>
>>>>>
>>>>> Each file contains a single JSON document (gzip-compressed; roughly
>>>>> 200KB of uncompressed, pretty-printed JSON text per document).
>>>>>
>>>>> I am interested in analyzing only the "important-data" array and the
>>>>> "other-important-data" array.
>>>>> My source data would ideally be easier to analyze if it looked like a
>>>>> couple of tables with a fixed set of columns. Only the column "map" would
>>>>> be a complex column; all the others would be primitives.
>>>>>
>>>>> ( g, sg, j, page, f1, f2, f3 )
>>>>>
>>>>> ( g, sg, j, page, x1, x2, x3, map )
>>>>>
>>>>> So, for each JSON document, I would like to "create" several rows, but I
>>>>> would like to avoid the intermediate step of persisting (and duplicating)
>>>>> the "flattened" data.
>>>>>
>>>>> To avoid persisting the flattened data, I thought I had to write my own
>>>>> MapReduce job in Java, but I discovered that others have had the same
>>>>> problem of using JSON as a source and that there are somewhat "standard"
>>>>> solutions.
>>>>>
>>>>> From reading about the SerDe approach for Hive, I get the impression
>>>>> that each JSON document is transformed into a single "row" of the table,
>>>>> with some columns being an array, a map, or other nested structures.
>>>>> a) Is there a way to break each JSON document into several "rows" for
>>>>> a Hive external table?
>>>>> b) It seems there are too many JSON SerDe libraries! Is any of them
>>>>> considered the de-facto standard?
>>>>>
>>>>> The Pig approach using ElephantBird also seems promising. Does anybody
>>>>> have pointers to more user documentation on this project? Or is browsing
>>>>> through the examples on GitHub my only source?
>>>>>
>>>>> Thanks