OK, I'll go back to my original question (although this time I know what
tools I'm using).
I am using Pig + ElephantBird.
I have JSON documents with the following structure:
{
  g : "some-group-identifier",
  sg : "some-subgroup-identifier",
  j : "some-job-identifier",
  page : 23,
  ... // other fields omitted
  important-data : [
    {
      f1 : "abc",
      f2 : "a",
      f3 : "/",
      ...
    },
    ...
    {
      f1 : "xyz",
      f2 : "q",
      f3 : "/",
      ...
    },
  ],
  ... // other fields omitted
}
I want Pig to GENERATE a tuple for each element of the "important-data"
array. For the example above, I would like to generate:
( "some-group-identifier" , "some-subgroup-identifier", 23, "abc", "a", "/"
)
( "some-group-identifier" , "some-subgroup-identifier", 23, "xyz", "q", "/"
)
This is what I have tried:
doc = LOAD '/example.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
flat = FOREACH doc GENERATE (chararray)json#'g' AS g, (chararray)json#'sg' AS sg, (long)json#'page' AS page, FLATTEN(json#'important-data');
DUMP flat;
but that produces:
( "some-group-identifier" , "some-subgroup-identifier", 23, [ f1#abc, f2#a,
f3#/ ] )
( "some-group-identifier" , "some-subgroup-identifier", 23, [ f1#xyz, f2#q,
f3#/ ] )
Close, but not exactly what I want.
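I suppose the missing piece is a second FOREACH that projects each flattened map into plain fields. This is just an untested sketch (in particular, I'm not sure whether FLATTEN needs the explicit bag-of-maps cast; the field names f1/f2/f3 are from the example above):

doc = LOAD '/example.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
-- one row per element of the "important-data" array; the cast tells Pig the value is a bag of maps
rows = FOREACH doc GENERATE (chararray)json#'g' AS g, (chararray)json#'sg' AS sg, (long)json#'page' AS page, FLATTEN((bag{tuple(map[])})json#'important-data') AS item;
-- project each element's map entries into plain tuple fields
flat = FOREACH rows GENERATE g, sg, page, (chararray)item#'f1' AS f1, (chararray)item#'f2' AS f2, (chararray)item#'f3' AS f3;
DUMP flat;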
Or do I need to use ProtoBuf?
-Jorge
On Wed, Jun 19, 2013 at 3:44 PM, Tecno Brain <[email protected]> wrote:
> Ok, I found that the elephant-bird JsonLoader cannot handle JSON documents
> that are pretty-printed (expanding over multiple lines). The entire JSON
> document has to be on a single line.
>
> After I reformatted some of the source files, I am now getting the
> expected output.
>
> On Wed, Jun 19, 2013 at 2:47 PM, Tecno Brain <[email protected]> wrote:
>
>> I also tried:
>>
>> doc = LOAD '/json-pcr/pcr-000001.json' USING com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map[]);
>> flat = FOREACH doc GENERATE (chararray)json#'a' AS first, (long)json#'b' AS second;
>> DUMP flat;
>>
>> but I got no output either.
>>
>> Input(s):
>> Successfully read 0 records (35863 bytes) from:
>> "/json-pcr/pcr-000001.json"
>>
>> Output(s):
>> Successfully stored 0 records in:
>> "hdfs://localhost:9000/tmp/temp-1239058872/tmp-1260892210"
>>
>>
>>
>> On Wed, Jun 19, 2013 at 2:36 PM, Tecno Brain <[email protected]> wrote:
>>
>>> I got Pig and Hive working on a single node, and I am able to run some
>>> scripts/queries over regular text files (access log files), with a record
>>> per line.
>>>
>>> Now, I want to process some JSON files.
>>>
>>> As mentioned before, it seems that ElephantBird would be a good solution
>>> for reading JSON files.
>>>
>>> I uploaded 5 files to HDFS. Each file contains only a single JSON
>>> document. The documents are NOT on a single line; they contain
>>> pretty-printed JSON spanning multiple lines.
>>>
>>> I'm trying something simple, extracting two (primitive) attributes at
>>> the top of the document:
>>> {
>>>   a : "some value",
>>>   ...
>>>   b : 133,
>>>   ...
>>> }
>>>
>>> So, let's start with a LOAD of a single file (a single JSON document):
>>>
>>> REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
>>> doc = LOAD '/json-pcr/pcr-000001.json' USING com.twitter.elephantbird.pig.load.JsonLoader();
>>> flat = FOREACH doc GENERATE (chararray)$0#'a' AS first, (long)$0#'b' AS second;
>>> DUMP flat;
>>>
>>> Apparently the job runs without problems, but I get no output. The
>>> output I get includes this message:
>>>
>>> Input(s):
>>> Successfully read 0 records (35863 bytes) from:
>>> "/json-pcr/pcr-000001.json"
>>>
>>> I was expecting to get
>>>
>>> ( "some value", 133 )
>>>
>>> Any idea on what I am doing wrong?
>>>
>>> On Thu, Jun 13, 2013 at 3:05 PM, Michael Segel <[email protected]> wrote:
>>>
>>>> I think you have a misconception of HBase.
>>>>
>>>> You don't actually need mutable data for it to be effective.
>>>> The key is that you need access to specific records and to work with a
>>>> very small subset of the data, not the complete data set.
>>>>
>>>>
>>>> On Jun 13, 2013, at 11:59 AM, Tecno Brain <[email protected]>
>>>> wrote:
>>>>
>>>> Hi Mike,
>>>>
>>>> Yes, I have also thought about HBase or Cassandra, but my data is pretty
>>>> much a snapshot; it does not require updates. Most of my aggregations will
>>>> also need to be computed only once and won't change over time, with the
>>>> exception of some aggregations based on the last N days of data. Should I
>>>> still consider HBase? I think it will probably be good for the aggregated
>>>> data.
>>>>
>>>> I have no idea what sequence files are, but I will take a look. My raw
>>>> data is stored in the cloud, not in my Hadoop cluster.
>>>>
>>>> I'll keep looking at Pig with ElephantBird.
>>>> Thanks,
>>>>
>>>> -Jorge
>>>>
>>>> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <[email protected]> wrote:
>>>>
>>>>> Hi..
>>>>>
>>>>> Have you thought about HBase?
>>>>>
>>>>> If you're using Hive or Pig, I would suggest taking these files and
>>>>> putting the JSON records into a sequence file, or a set of sequence
>>>>> files. (Then look at HBase to help index them...) 200KB is small.
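>>>>> Reading them back in Pig would then look roughly like this; I'm going
>>>>> from memory on the converter arguments (and the path is made up), so
>>>>> double-check against the elephant-bird examples:
>>>>>
>>>>> -- each record: key ignored, value = one JSON document on a single line
>>>>> raw = LOAD '/json-seq' USING com.twitter.elephantbird.pig.load.SequenceFileLoader(
>>>>>     '-c com.twitter.elephantbird.pig.util.TextConverter',
>>>>>     '-c com.twitter.elephantbird.pig.util.TextConverter')
>>>>>   AS (key:chararray, json:chararray);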
>>>>>
>>>>> That would be the same for either Pig or Hive.
>>>>>
>>>>> In terms of SerDes, I've worked with Pig and ElephantBird; it's pretty
>>>>> nice. And yes, you get each record as a row; however, you can always
>>>>> flatten them as needed, as in the sketch below.
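>>>>> Rough sketch (untested, with a made-up path and array name):
>>>>>
>>>>> docs = LOAD '/your-json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
>>>>> -- FLATTEN turns the nested array into one output row per element
>>>>> rows = FOREACH docs GENERATE FLATTEN(json#'some-array');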
>>>>>
>>>>> Hive?
>>>>> I haven't worked with the latest SerDe, but maybe Dean Wampler or
>>>>> Edward Capriolo could give you a better answer.
>>>>> Going from memory, I don't know that there is a good Hive SerDe that
>>>>> would write JSON, just read it.
>>>>>
>>>>> IMHO Pig/ElephantBird is the best so far, but then again I may be
>>>>> dated and biased.
>>>>>
>>>>> I think you're on the right track or at least train of thought.
>>>>>
>>>>> HTH
>>>>>
>>>>> -Mike
>>>>>
>>>>>
>>>>> On Jun 12, 2013, at 7:57 PM, Tecno Brain <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Hello,
>>>>> I'm new to Hadoop.
>>>>> I have a large quantity of JSON documents with a structure similar
>>>>> to what is shown below.
>>>>>
>>>>> {
>>>>>   g : "some-group-identifier",
>>>>>   sg : "some-subgroup-identifier",
>>>>>   j : "some-job-identifier",
>>>>>   page : 23,
>>>>>   ... // other fields omitted
>>>>>   important-data : [
>>>>>     {
>>>>>       f1 : "abc",
>>>>>       f2 : "a",
>>>>>       f3 : "/",
>>>>>       ...
>>>>>     },
>>>>>     ...
>>>>>     {
>>>>>       f1 : "xyz",
>>>>>       f2 : "q",
>>>>>       f3 : "/",
>>>>>       ...
>>>>>     },
>>>>>   ],
>>>>>   ... // other fields omitted
>>>>>   other-important-data : [
>>>>>     {
>>>>>       x1 : "ford",
>>>>>       x2 : "green",
>>>>>       x3 : 35,
>>>>>       map : {
>>>>>         "free-field" : "value",
>>>>>         "other-free-field" : "value2"
>>>>>       }
>>>>>     },
>>>>>     ...
>>>>>     {
>>>>>       x1 : "vw",
>>>>>       x2 : "red",
>>>>>       x3 : 54,
>>>>>       ...
>>>>>     },
>>>>>   ]
>>>>> }
>>>>>
>>>>>
>>>>> Each file contains a single JSON document (gzip-compressed; roughly
>>>>> 200KB of uncompressed, pretty-printed JSON text per document).
>>>>>
>>>>> I am interested in analyzing only the "important-data" array and the
>>>>> "other-important-data" array.
>>>>> My source data would ideally be easier to analyze if it looked like a
>>>>> couple of tables with a fixed set of columns. Only the column "map" would
>>>>> be a complex column; all the others would be primitives.
>>>>>
>>>>> ( g, sg, j, page, f1, f2, f3 )
>>>>>
>>>>> ( g, sg, j, page, x1, x2, x3, map )
>>>>>
>>>>> So, for each JSON document, I would like to "create" several rows, but I
>>>>> would like to avoid the intermediate step of persisting (and duplicating)
>>>>> the "flattened" data.
>>>>>
>>>>> To avoid persisting the flattened data, I thought I had to write my own
>>>>> MapReduce job in Java, but I discovered that others have had the same
>>>>> problem of using JSON as a source and that there are somewhat "standard"
>>>>> solutions.
>>>>>
>>>>> From reading about the SerDe approach for Hive, I get the impression
>>>>> that each JSON document is transformed into a single "row" of the table,
>>>>> with some columns being an array, a map, or other nested structures.
>>>>> a) Is there a way to break each JSON document into several "rows" for
>>>>> a Hive external table?
>>>>> b) It seems there are too many JSON SerDe libraries! Is any of them
>>>>> considered the de-facto standard?
>>>>>
>>>>> The Pig approach using ElephantBird also seems promising. Does anybody
>>>>> have pointers to more user documentation on this project? Or is browsing
>>>>> through the examples on GitHub my only source?
>>>>>
>>>>> Thanks