Hi,

Have you thought about HBase? 

If you're using Hive or Pig, I would suggest taking these files and putting the 
JSON records into a sequence file, or a set of sequence files. (Then look at 
HBase to help index them.) 200KB is small. 

That would be the same for either Pig or Hive.
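
For example, on the Pig side, once the documents are packed into a sequence file 
as (doc-id, json-text) pairs, Elephant Bird's SequenceFileLoader can read the 
pairs straight back into a relation. Rough sketch, untested and from memory; the 
path, the key/value layout, and the converter choices are just my assumptions 
about how you'd pack them:

    -- read (id, json) Text/Text pairs out of the packed sequence file;
    -- TextConverter tells Elephant Bird how to turn each Writable into a Pig value
    docs = LOAD '/data/json-docs.seq'
           USING com.twitter.elephantbird.pig.load.SequenceFileLoader (
               '-c com.twitter.elephantbird.pig.util.TextConverter',
               '-c com.twitter.elephantbird.pig.util.TextConverter')
           AS (doc_id: chararray, json_text: chararray);

The packing step itself (lots of small .gz files into one big sequence file) is 
usually just a small Java or MapReduce job; nothing fancy, and it gets you away 
from the many-small-files problem on HDFS.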

In terms of SerDes, I've worked with Pig and Elephant Bird; it's pretty nice. 
And yes, you get each record as a row, but you can always flatten them as 
needed. 
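
To give you an idea of the flattening, the Elephant Bird flow looks roughly like 
the sketch below. It's untested and from memory, so treat it as a sketch: the 
jar versions and input path are placeholders, the loader is line-oriented as far 
as I remember (so pretty-printed multi-line docs would need to be collapsed or 
packed first), and the cast on the nested bag may need tweaking for your 
Pig/Elephant Bird versions.

    -- register the Elephant Bird jars (names/versions here are just an example)
    REGISTER 'elephant-bird-core-4.4.jar';
    REGISTER 'elephant-bird-pig-4.4.jar';
    REGISTER 'elephant-bird-hadoop-compat-4.4.jar';
    REGISTER 'json-simple-1.1.jar';

    -- '-nestedLoad' keeps nested objects/arrays; each record comes back as one map
    docs = LOAD '/data/json-records'
           USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
           AS (json: map[]);

    -- keep the top-level fields and flatten the important-data array,
    -- so you get one output row per array element
    items = FOREACH docs GENERATE
                (chararray) json#'g'    AS g,
                (chararray) json#'sg'   AS sg,
                (chararray) json#'j'    AS j,
                (int)       json#'page' AS page,
                FLATTEN((bag{tuple(map[])}) json#'important-data') AS item:map[];

    -- project the element fields into plain columns: (g, sg, j, page, f1, f2, f3)
    rows = FOREACH items GENERATE
               g, sg, j, page,
               (chararray) item#'f1' AS f1,
               (chararray) item#'f2' AS f2,
               (chararray) item#'f3' AS f3;

Same idea for the other-important-data array; the free-form map field just stays 
a Pig map.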

Hive? 
I haven't worked with the latest SerDe, but maybe Dean Wampler or Edward 
Capriolo could give you a better answer. 
Going from memory, I don't know that there is a good Hive SerDe that writes 
JSON, just ones that read it.

IMHO Pig/Elephant Bird is the best so far, but then again I may be dated and 
biased. 

I think you're on the right track or at least train of thought. 

HTH

-Mike


On Jun 12, 2013, at 7:57 PM, Tecno Brain <[email protected]> wrote:

> Hello, 
>    I'm new to Hadoop. 
>    I have a large quantity of JSON documents with a structure similar to what 
> is shown below.  
> 
>    {
>      g : "some-group-identifier",
>      sg: "some-subgroup-identifier",
>      j      : "some-job-identifier",
>      page     : 23,
>      ... // other fields omitted
>      important-data : [
>          {
>            f1  : "abc",
>            f2  : "a",
>            f3  : "/"
>            ...
>          },
>          ...
>          {
>            f1 : "xyz",
>            f2  : "q",
>            f3  : "/",
>            ... 
>          },
>      ],
>     ... // other fields omitted 
>      other-important-data : [
>         {
>            x1  : "ford",
>            x2  : "green",
>            x3  : 35,
>            map : {
>                "free-field" : "value",
>                "other-free-field" : "value2"
>               }
>          },
>          ...
>          {
>            x1 : "vw",
>            x2  : "red",
>            x3  : 54,
>            ... 
>          },    
>      ]
>    }
>  
> 
> Each file contains a single JSON document (gzip compressed; roughly 200KB of 
> pretty-printed JSON text per document when uncompressed).
> 
> I am interested in analyzing only the  "important-data" array and the 
> "other-important-data" array.
> My source data would ideally be easier to analyze if it looked like a couple 
> of tables with a fixed set of columns. Only the column "map" would be a 
> complex column, all others would be primitives.
> 
> ( g, sg, j, page, f1, f2, f3 )
>  
> ( g, sg, j, page, x1, x2, x3, map )
> 
> So, for each JSON document, I would like to "create" several rows, but I 
> would like to avoid the intermediate step of persisting (and duplicating) the 
> "flattened" data.
> 
> To avoid persisting the flattened data, I thought I had to write my own 
> MapReduce job in Java, but I discovered that others have had the same problem 
> of using JSON as the source, and that there are somewhat "standard" solutions. 
> 
> From reading about the SerDe approach for Hive, I get the impression that each 
> JSON document is transformed into a single "row" of the table, with some 
> columns being an array, a map, or other nested structures. 
> a) Is there a way to break each JSON document into several "rows" for a Hive 
> external table?
> b) It seems there are too many JSON SerDe libraries! Is any of them 
> considered the de facto standard? 
> 
> The Pig approach using Elephant Bird also seems promising. Does anybody have 
> pointers to more user documentation for this project? Or is browsing through 
> the examples on GitHub my only source?
> 
> Thanks
> 
