Hi. Have you thought about HBase?
If you're using Hive or Pig, I would suggest taking these files and putting the JSON records into a sequence file, or a set of sequence files. (Then look at HBase to help index them.) 200KB is small. That would be the same for either Pig or Hive.

In terms of SerDes, I've worked with Pig and Elephant Bird; it's pretty nice. And yes, you get each record as a row, but you can always flatten it as needed. I've put a rough Pig/Elephant Bird sketch at the bottom of this mail, below your quoted message.

Hive? I haven't worked with the latest SerDe, so maybe Dean Wampler or Edward Capriolo could give you a better answer. Going from memory, I don't know of a good Hive SerDe that will write JSON, only read it. There's a rough Hive sketch at the bottom as well, right after the Pig one. IMHO Pig/Elephant Bird is the best so far, but then again I may be dated and biased.

I think you're on the right track, or at least the right train of thought.

HTH

-Mike

On Jun 12, 2013, at 7:57 PM, Tecno Brain <[email protected]> wrote:

> Hello,
>   I'm new to Hadoop.
>   I have a large quantity of JSON documents with a structure similar to what is shown below.
>
>   {
>     g : "some-group-identifier",
>     sg: "some-subgroup-identifier",
>     j : "some-job-identifier",
>     page : 23,
>     ... // other fields omitted
>     important-data : [
>       {
>         f1 : "abc",
>         f2 : "a",
>         f3 : "/",
>         ...
>       },
>       ...
>       {
>         f1 : "xyz",
>         f2 : "q",
>         f3 : "/",
>         ...
>       },
>     ],
>     ... // other fields omitted
>     other-important-data : [
>       {
>         x1 : "ford",
>         x2 : "green",
>         x3 : 35,
>         map : {
>           "free-field" : "value",
>           "other-free-field" : "value2"
>         }
>       },
>       ...
>       {
>         x1 : "vw",
>         x2 : "red",
>         x3 : 54,
>         ...
>       },
>     ]
>   }
>
> Each file contains a single JSON document (gzip compressed, roughly 200KB uncompressed of pretty-printed JSON text per document).
>
> I am interested in analyzing only the "important-data" array and the "other-important-data" array.
> My source data would ideally be easier to analyze if it looked like a couple of tables with a fixed set of columns. Only the column "map" would be a complex column; all others would be primitives.
>
>   ( g, sg, j, page, f1, f2, f3 )
>
>   ( g, sg, j, page, x1, x2, x3, map )
>
> So, for each JSON document, I would like to "create" several rows, but I would like to avoid the intermediate step of persisting -and duplicating- the "flattened" data.
>
> In order to avoid persisting the data flattened, I thought I had to write my own map-reduce in Java code, but discovered that others have had the same problem of using JSON as the source and there are somewhat "standard" solutions.
>
> By reading about the SerDe approach for Hive, I get the impression that each JSON document is transformed into a single "row" of the table, with some columns being an array, a map, or other nested structures.
> a) Is there a way to break each JSON document into several "rows" for a Hive external table?
> b) It seems there are too many JSON SerDe libraries! Is there any of them considered the de-facto standard?
>
> The Pig approach also seems promising, using Elephant Bird. Does anybody have pointers to more user documentation on this project? Or is browsing through the examples on GitHub my only source?
>
> Thanks
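Here's the Pig/Elephant Bird sketch I mentioned above. It's from memory and I haven't run it against your data, so treat it as a starting point: the jar names, the input path, and the exact bag cast in front of FLATTEN are assumptions you'll need to adapt to whatever Elephant Bird version you pull in. The '-nestedLoad' option is the part that matters; it keeps nested objects as maps and nested arrays as bags so you can FLATTEN them into one row per element.

-- Rough sketch from memory; jar names and paths are placeholders.
REGISTER elephant-bird-core.jar;
REGISTER elephant-bird-pig.jar;
REGISTER elephant-bird-hadoop-compat.jar;
REGISTER json-simple.jar;

-- '-nestedLoad' returns nested objects as maps and nested arrays as bags.
docs = LOAD '/data/json-docs'
       USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad')
       AS (json: map[]);

-- One row per element of the "important-data" array; the cast tells Pig the
-- field is a bag of single-map tuples so FLATTEN can explode it.
flat = FOREACH docs GENERATE
         (chararray) json#'g'    AS g,
         (chararray) json#'sg'   AS sg,
         (chararray) json#'j'    AS j,
         (int)       json#'page' AS page,
         FLATTEN((bag{tuple(map[])}) json#'important-data') AS item;

-- Pull the fields you care about out of each array element.
rows = FOREACH flat GENERATE
         g, sg, j, page,
         (chararray) item#'f1' AS f1,
         (chararray) item#'f2' AS f2,
         (chararray) item#'f3' AS f3;

DUMP rows;

You'd repeat the same FLATTEN step for "other-important-data". If you go the sequence-file route first, only the LOAD line changes; the flattening stays the same.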

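And the Hive sketch, with the caveat that I haven't used the newer SerDes myself, so double-check this with the Hive folks. It assumes rcongiu's openx JSON SerDe (org.openx.data.jsonserde.JsonSerDe), that the SerDe jar name below is whatever your build produces, and that each document is readable as a single record (one doc per line, or via the sequence-file route). The column names and the mapping property for the hyphenated JSON field are my guesses at the wiring. The LATERAL VIEW / explode() part is what addresses your question (a): one row per array element, with no intermediate flattened copy persisted.

-- Sketch only; jar name and paths are placeholders.
ADD JAR json-serde-with-dependencies.jar;

CREATE EXTERNAL TABLE raw_docs (
  g              STRING,
  sg             STRING,
  j              STRING,
  page           INT,
  important_data ARRAY<STRUCT<f1:STRING, f2:STRING, f3:STRING>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
-- maps the Hive column name to the hyphenated JSON attribute
WITH SERDEPROPERTIES ("mapping.important_data" = "important-data")
LOCATION '/data/json-docs';

-- One output row per element of the array, no flattened table persisted.
SELECT g, sg, j, page, d.f1, d.f2, d.f3
FROM raw_docs
LATERAL VIEW explode(important_data) t AS d;

The "other-important-data" side would be handled the same way, with the free-form "map" field declared as MAP<STRING,STRING> and the same LATERAL VIEW trick.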