Never mind, I got the solution!
uberflat = FOREACH flat GENERATE g, sg,
    FLATTEN(important-data#'f1') as f1,
    FLATTEN(important-data#'f2') as f2;
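
For reference, here is the whole pipeline in one place (a lightly cleaned-up,
untested sketch; I give the flattened array element an explicit alias, d,
because a hyphenated name like important-data cannot be referenced directly as
a Pig identifier, and I add page and f3 to match the output I originally asked
for):

doc = LOAD '/example.json'
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
    AS (json:map[]);

-- pull the scalar fields out of the top-level map and produce one row
-- per element of the important-data array
flat = FOREACH doc GENERATE
    (chararray)json#'g'  AS g,
    (chararray)json#'sg' AS sg,
    (int)json#'page'     AS page,
    FLATTEN(json#'important-data') AS d;

-- dereference each element's map to get plain columns
uberflat = FOREACH flat GENERATE
    g, sg, page,
    (chararray)d#'f1' AS f1,
    (chararray)d#'f2' AS f2,
    (chararray)d#'f3' AS f3;

DUMP uberflat;

The same pattern should work for the other-important-data array, with a second
FOREACH/FLATTEN over json#'other-important-data'.
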
-Jorge
On Thu, Jun 20, 2013 at 11:54 AM, Tecno Brain
<[email protected]> wrote:
> OK, I'll go back to my original question ( although this time I know what
> tools I'm using).
>
> I am using Pig + ElephantBird.
>
> I have JSON documents with the following structure:
> {
>   g : "some-group-identifier",
>   sg: "some-subgroup-identifier",
>   j : "some-job-identifier",
>   page : 23,
>   ... // other fields omitted
>   important-data : [
>     {
>       f1 : "abc",
>       f2 : "a",
>       f3 : "/",
>       ...
>     },
>     ...
>     {
>       f1 : "xyz",
>       f2 : "q",
>       f3 : "/",
>       ...
>     }
>   ],
>   ... // other fields omitted
> }
>
> I want Pig to GENERATE a tuple for each element of the "important-data"
> array attribute. For the example above, I would like to generate:
>
> ( "some-group-identifier" , "some-subgroup-identifier", 23, "abc", "a",
> "/" )
> ( "some-group-identifier" , "some-subgroup-identifier", 23, "xyz", "q",
> "/" )
>
> This is what I have tried:
>
> doc = LOAD '/example.json'
>     USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
>     AS (json:map[]);
> flat = FOREACH doc GENERATE (chararray)json#'g' AS g, (chararray)json#'sg' AS sg,
>     (int)json#'page' AS page, FLATTEN(json#'important-data');
> DUMP flat;
>
> but that produces:
>
> ( "some-group-identifier" , "some-subgroup-identifier", 23, [ f1#abc,
> f2#a, f3#/ ] )
> ( "some-group-identifier" , "some-subgroup-identifier", 23, [ f1#xyz,
> f2#q, f3#/ ] )
>
> Close, but not exactly what I want.
>
> Do I need to use ProtoBuf?
>
> -Jorge
>
>
> On Wed, Jun 19, 2013 at 3:44 PM, Tecno Brain <[email protected]
> > wrote:
>
>> OK, I found that the elephant-bird JsonLoader cannot handle JSON documents
>> that are pretty-printed (expanded over multiple lines). The entire JSON
>> document has to be on a single line.
>>
>> After I reformatted some of the source files, I am now getting the
>> expected output.
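>>
>> In other words, each input file now has to be a single line of compact
>> JSON. Using the placeholder structure from my first message (not my real
>> data), a file would look roughly like this:
>>
>> {"g":"some-group-identifier","sg":"some-subgroup-identifier","j":"some-job-identifier","page":23,"important-data":[{"f1":"abc","f2":"a","f3":"/"},{"f1":"xyz","f2":"q","f3":"/"}]}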
>>
>>
>>
>>
>> On Wed, Jun 19, 2013 at 2:47 PM, Tecno Brain <
>> [email protected]> wrote:
>>
>>> I also tried:
>>>
>>> doc = LOAD '/json-pcr/pcr-000001.json'
>>>     USING com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map[]);
>>> flat = FOREACH doc GENERATE (chararray)json#'a' AS first,
>>>     (long)json#'b' AS second;
>>> DUMP flat;
>>>
>>> but I got no output either.
>>>
>>> Input(s):
>>> Successfully read 0 records (35863 bytes) from:
>>> "/json-pcr/pcr-000001.json"
>>>
>>> Output(s):
>>> Successfully stored 0 records in:
>>> "hdfs://localhost:9000/tmp/temp-1239058872/tmp-1260892210"
>>>
>>>
>>>
>>> On Wed, Jun 19, 2013 at 2:36 PM, Tecno Brain <
>>> [email protected]> wrote:
>>>
>>>> I got Pig and Hive working on a single node, and I am able to run some
>>>> scripts/queries over regular text files (access log files), with a record
>>>> per line.
>>>>
>>>> Now, I want to process some JSON files.
>>>>
>>>> As mentioned before, it seems that ElephantBird would be a good solution
>>>> to read JSON files.
>>>>
>>>> I uploaded 5 files to HDFS. Each file contains only a single JSON
>>>> document. The documents are NOT on a single line, but rather are
>>>> pretty-printed JSON spanning multiple lines.
>>>>
>>>> I'm trying something simple, extracting two (primitive) attributes at
>>>> the top level of the document:
>>>> {
>>>>   a : "some value",
>>>>   ...
>>>>   b : 133,
>>>>   ...
>>>> }
>>>>
>>>> So, let's start with a LOAD of a single file (a single JSON document):
>>>>
>>>> REGISTER 'bunch of JAR files from elephant-bird and its dependencies';
>>>> doc = LOAD '/json-pcr/pcr-000001.json'
>>>>     USING com.twitter.elephantbird.pig.load.JsonLoader();
>>>> flat = FOREACH doc GENERATE (chararray)$0#'a' AS first,
>>>>     (long)$0#'b' AS second;
>>>> DUMP flat;
>>>>
>>>> Apparently the job runs without problems, but I get no output. The job
>>>> summary includes this message:
>>>>
>>>> Input(s):
>>>> Successfully read 0 records (35863 bytes) from:
>>>> "/json-pcr/pcr-000001.json"
>>>>
>>>> I was expecting to get
>>>>
>>>> ( "some value", 133 )
>>>>
>>>> Any idea on what I am doing wrong?
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jun 13, 2013 at 3:05 PM, Michael Segel <
>>>> [email protected]> wrote:
>>>>
>>>>> I think you have a misconception of HBase.
>>>>>
>>>>> You don't actually need mutable data for it to be effective.
>>>>> The key is that you need access to specific records and work with a
>>>>> very small subset of the data, not the complete data set.
>>>>>
>>>>>
>>>>> On Jun 13, 2013, at 11:59 AM, Tecno Brain <
>>>>> [email protected]> wrote:
>>>>>
>>>>> Hi Mike,
>>>>>
>>>>> Yes, I have also thought about HBase or Cassandra, but my data is
>>>>> pretty much a snapshot; it does not require updates. Most of my
>>>>> aggregations will also be computed once and won't change over time,
>>>>> with the exception of some aggregations based on the last N days of
>>>>> data. Should I still consider HBase? I think it will probably be
>>>>> good for the aggregated data.
>>>>>
>>>>> I have no idea what sequence files are, but I will take a look. My
>>>>> raw data is stored in the cloud, not in my Hadoop cluster.
>>>>>
>>>>> I'll keep looking at Pig with ElephantBird.
>>>>> Thanks,
>>>>>
>>>>> -Jorge
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Hi..
>>>>>>
>>>>>> Have you thought about HBase?
>>>>>>
>>>>>> If you're using Hive or Pig, I would suggest taking these files and
>>>>>> putting the JSON records into a sequence file, or a set of sequence
>>>>>> files. (Then look at HBase to help index them.) 200KB is small.
>>>>>>
>>>>>> That would be the same for either Pig or Hive.
>>>>>>
>>>>>> In terms of SerDes, I've worked with Pig and ElephantBird; it's pretty
>>>>>> nice. And yes, you get each record as a row, but you can always flatten
>>>>>> them as needed.
>>>>>>
>>>>>> Hive?
>>>>>> I haven't worked with the latest SerDe, but maybe Dean Wampler or
>>>>>> Edward Capriolo could give you a better answer.
>>>>>> Going from memory, I don't know that there is a good Hive SerDe that
>>>>>> would write JSON, just read it.
>>>>>>
>>>>>> IMHO Pig/ElephantBird is the best so far, but then again I may be
>>>>>> dated and biased.
>>>>>>
>>>>>> I think you're on the right track or at least train of thought.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> -Mike
>>>>>>
>>>>>>
>>>>>> On Jun 12, 2013, at 7:57 PM, Tecno Brain <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>> Hello,
>>>>>> I'm new to Hadoop.
>>>>>> I have a large quantity of JSON documents with a structure similar
>>>>>> to what is shown below.
>>>>>>
>>>>>> {
>>>>>>   g : "some-group-identifier",
>>>>>>   sg: "some-subgroup-identifier",
>>>>>>   j : "some-job-identifier",
>>>>>>   page : 23,
>>>>>>   ... // other fields omitted
>>>>>>   important-data : [
>>>>>>     {
>>>>>>       f1 : "abc",
>>>>>>       f2 : "a",
>>>>>>       f3 : "/",
>>>>>>       ...
>>>>>>     },
>>>>>>     ...
>>>>>>     {
>>>>>>       f1 : "xyz",
>>>>>>       f2 : "q",
>>>>>>       f3 : "/",
>>>>>>       ...
>>>>>>     }
>>>>>>   ],
>>>>>>   ... // other fields omitted
>>>>>>   other-important-data : [
>>>>>>     {
>>>>>>       x1 : "ford",
>>>>>>       x2 : "green",
>>>>>>       x3 : 35,
>>>>>>       map : {
>>>>>>         "free-field" : "value",
>>>>>>         "other-free-field" : "value2"
>>>>>>       }
>>>>>>     },
>>>>>>     ...
>>>>>>     {
>>>>>>       x1 : "vw",
>>>>>>       x2 : "red",
>>>>>>       x3 : 54,
>>>>>>       ...
>>>>>>     }
>>>>>>   ]
>>>>>> }
>>>>>>
>>>>>>
>>>>>> Each file contains a single JSON document (gzip compressed, roughly
>>>>>> 200KB of uncompressed pretty-printed JSON text per document).
>>>>>>
>>>>>> I am interested in analyzing only the "important-data" array and the
>>>>>> "other-important-data" array.
>>>>>> My source data would be easier to analyze if it looked like a couple of
>>>>>> tables with a fixed set of columns. Only the "map" column would be
>>>>>> complex; all the others would be primitives.
>>>>>>
>>>>>> ( g, sg, j, page, f1, f2, f3 )
>>>>>>
>>>>>> ( g, sg, j, page, x1, x2, x3, map )
>>>>>>
>>>>>> So, for each JSON document, I would like to "create" several rows, but I
>>>>>> would like to avoid the intermediate step of persisting (and duplicating)
>>>>>> the "flattened" data.
>>>>>>
>>>>>> To avoid persisting the flattened data, I thought I had to write my own
>>>>>> MapReduce job in Java, but I discovered that others have had the same
>>>>>> problem of using JSON as the source and that there are somewhat
>>>>>> "standard" solutions.
>>>>>>
>>>>>> From what I have read about the SerDe approach for Hive, I get the
>>>>>> impression that each JSON document is transformed into a single "row" of
>>>>>> the table, with some columns being an array, a map, or other nested
>>>>>> structures.
>>>>>> a) Is there a way to break each JSON document into several "rows" for
>>>>>> a Hive external table?
>>>>>> b) It seems there are too many JSON SerDe libraries! Is any of them
>>>>>> considered the de facto standard?
>>>>>>
>>>>>> The Pig approach using Elephant Bird also seems promising. Does anybody
>>>>>> have pointers to more user documentation for this project? Or is browsing
>>>>>> through the examples on GitHub my only source?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>