I think you have a misconception of HBase. 

You don't actually need mutable data for it to be effective. 
The key is that you need access to specific records and to work with a very 
small subset of the data, not the complete data set. 
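
For example, pulling one pre-computed aggregate back out by row key is a 
single point read. Here's a rough, untested sketch against the 0.94-era Java 
client; the table, column family, and row-key layout are just placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class AggregateLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "aggregates" table with a "d" column family: placeholder names.
        HTable table = new HTable(conf, "aggregates");
        try {
            // Row key encodes whatever you look records up by,
            // e.g. group + job identifiers.
            Get get = new Get(Bytes.toBytes("some-group-identifier|some-job-identifier"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("pageCount"));
            if (value != null) {
                System.out.println(Bytes.toLong(value));
            }
        } finally {
            table.close();
        }
    }
}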


On Jun 13, 2013, at 11:59 AM, Tecno Brain <[email protected]> wrote:

> Hi Mike,
> 
> Yes, I have also thought about HBase or Cassandra, but my data is pretty much 
> a snapshot; it does not require updates. Most of my aggregations will also 
> need to be computed only once and won't change over time, with the exception 
> of some aggregations that are based on the last N days of data. Should I 
> still consider HBase? I think it will probably be a good fit for the 
> aggregated data. 
> 
> I have no idea what sequence files are, but I will take a look. My raw data 
> is stored in the cloud, not in my Hadoop cluster. 
> 
> I'll keep looking at Pig with ElephantBird. 
> Thanks,
> 
> -Jorge 
> 
> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <[email protected]> 
> wrote:
> Hi..
> 
> Have you thought about HBase? 
> 
> If you're using Hive or Pig, I would suggest taking these files and putting 
> the JSON records into a sequence file, or a set of sequence files. (Then look 
> at HBase to help index them...) 200KB is small. 
> 
> That approach would be the same for either Pig or Hive.
> 
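> Loading them in is just a small standalone program. Something along these 
> lines would pack the gzipped JSON files into one block-compressed sequence 
> file keyed by file name (an untested sketch using the old createWriter() 
> signature; the paths are made up):
> 
> import java.io.BufferedReader;
> import java.io.InputStreamReader;
> import java.util.zip.GZIPInputStream;
> 
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.SequenceFile;
> import org.apache.hadoop.io.Text;
> 
> public class PackJsonDocs {
>     public static void main(String[] args) throws Exception {
>         Configuration conf = new Configuration();
>         FileSystem fs = FileSystem.get(conf);
>         Path input = new Path("/staging/json");        // dir of *.json.gz files (placeholder)
>         Path output = new Path("/data/json-docs.seq"); // placeholder
> 
>         SequenceFile.Writer writer = SequenceFile.createWriter(
>                 fs, conf, output, Text.class, Text.class,
>                 SequenceFile.CompressionType.BLOCK);
>         try {
>             for (FileStatus status : fs.listStatus(input)) {
>                 BufferedReader reader = new BufferedReader(new InputStreamReader(
>                         new GZIPInputStream(fs.open(status.getPath()))));
>                 StringBuilder doc = new StringBuilder();
>                 String line;
>                 while ((line = reader.readLine()) != null) {
>                     doc.append(line).append('\n');
>                 }
>                 reader.close();
>                 // key = original file name, value = the whole JSON document
>                 writer.append(new Text(status.getPath().getName()), new Text(doc.toString()));
>             }
>         } finally {
>             writer.close();
>         }
>     }
> }
> 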
> In terms of SerDes, I've worked with Pig and ElephantBird; it's pretty nice. 
> And yes, you get each record as a row, but you can always flatten them as 
> needed. 
> 
> Hive? 
> I haven't worked with the latest SerDe, but maybe Dean Wampler or Edward 
> Capriolo could give you a better answer. 
> Going from memory, I don't know that there is a good Hive SerDe that can 
> write JSON, only read it.
> 
> IMHO Pig/ElephantBird is the best so far, but then again I may be dated and 
> biased. 
> 
> I think you're on the right track or at least train of thought. 
> 
> HTH
> 
> -Mike
> 
> 
> On Jun 12, 2013, at 7:57 PM, Tecno Brain <[email protected]> wrote:
> 
>> Hello, 
>>    I'm new to Hadoop. 
>>    I have a large quantity of JSON documents with a structure similar to 
>> what is shown below.  
>> 
>>    {
>>      g    : "some-group-identifier",
>>      sg   : "some-subgroup-identifier",
>>      j    : "some-job-identifier",
>>      page : 23,
>>      ... // other fields omitted
>>      important-data : [
>>          {
>>            f1 : "abc",
>>            f2 : "a",
>>            f3 : "/",
>>            ...
>>          },
>>          ...
>>          {
>>            f1 : "xyz",
>>            f2 : "q",
>>            f3 : "/",
>>            ...
>>          }
>>      ],
>>      ... // other fields omitted
>>      other-important-data : [
>>          {
>>            x1 : "ford",
>>            x2 : "green",
>>            x3 : 35,
>>            map : {
>>                "free-field"       : "value",
>>                "other-free-field" : "value2"
>>            }
>>          },
>>          ...
>>          {
>>            x1 : "vw",
>>            x2 : "red",
>>            x3 : 54,
>>            ...
>>          }
>>      ]
>>    }
>>  
>> 
>> Each file contains a single JSON document (gzip compressed, roughly 200KB 
>> of uncompressed, pretty-printed JSON text per document).
>> 
>> I am interested in analyzing only the  "important-data" array and the 
>> "other-important-data" array.
>> My data would be easier to analyze if it looked like a couple of tables 
>> with a fixed set of columns. Only the "map" column would be complex; all 
>> others would be primitives.
>> 
>> ( g, sg, j, page, f1, f2, f3 )
>>  
>> ( g, sg, j, page, x1, x2, x3, map )
>> 
>> So, for each JSON document, I would like to "create" several rows, but I 
>> would like to avoid the intermediate step of persisting (and duplicating) 
>> the "flattened" data.
>> 
>> To avoid persisting the flattened data, I thought I had to write my own 
>> MapReduce job in Java, but I discovered that others have had the same 
>> problem of using JSON as a source, and that there are somewhat "standard" 
>> solutions.
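>> 
>> (For reference, the hand-rolled mapper I had in mind looks roughly like the 
>> untested sketch below. It uses Jackson, and assumes each map() call receives 
>> one whole JSON document as its value, e.g. read out of a sequence file.)
>> 
>> import java.io.IOException;
>> 
>> import com.fasterxml.jackson.databind.JsonNode;
>> import com.fasterxml.jackson.databind.ObjectMapper;
>> 
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.Mapper;
>> 
>> // Emits one tab-separated row (g, sg, j, page, f1, f2, f3) per element
>> // of the "important-data" array in each JSON document.
>> public class FlattenImportantDataMapper extends Mapper<Text, Text, Text, Text> {
>>     private final ObjectMapper mapper = new ObjectMapper();
>> 
>>     @Override
>>     protected void map(Text key, Text value, Context context)
>>             throws IOException, InterruptedException {
>>         JsonNode doc = mapper.readTree(value.toString());
>>         String prefix = doc.path("g").asText() + "\t"
>>                       + doc.path("sg").asText() + "\t"
>>                       + doc.path("j").asText() + "\t"
>>                       + doc.path("page").asText();
>>         for (JsonNode item : doc.path("important-data")) {
>>             context.write(key, new Text(prefix + "\t"
>>                     + item.path("f1").asText() + "\t"
>>                     + item.path("f2").asText() + "\t"
>>                     + item.path("f3").asText()));
>>         }
>>     }
>> }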
>> 
>> From reading about the SerDe approach for Hive, I get the impression that 
>> each JSON document is transformed into a single "row" of the table, with 
>> some columns being arrays or maps of other nested structures.
>> a) Is there a way to break each JSON document into several "rows" for a Hive 
>> external table?
>> b) It seems there are too many JSON SerDe libraries! Is any of them 
>> considered the de facto standard?
>> 
>> The Pig approach using Elephant Bird also seems promising. Does anybody have 
>> pointers to more user documentation for this project? Or is browsing through 
>> the examples on GitHub my only option?
>> 
>> Thanks
>> 
> 
> 
