I think you have a misconception about HBase. You don't actually need mutable data for it to be effective. The key is that you need access to specific records, working with a very small subset of the data rather than the complete data set.
On Jun 13, 2013, at 11:59 AM, Tecno Brain <[email protected]> wrote:

> Hi Mike,
>
> Yes, I have also thought about HBase or Cassandra, but my data is pretty much
> a snapshot; it does not require updates. Most of my aggregations will also
> need to be computed only once and won't change over time, with the exception
> of some aggregations based on the last N days of data. Should I still
> consider HBase? I think it will probably be a good fit for the aggregated
> data.
>
> I have no idea what sequence files are, but I will take a look. My raw data
> is stored in the cloud, not in my Hadoop cluster.
>
> I'll keep looking at Pig with ElephantBird.
> Thanks,
>
> -Jorge
>
>
> On Wed, Jun 12, 2013 at 7:26 PM, Michael Segel <[email protected]> wrote:
>
> Hi..
>
> Have you thought about HBase?
>
> I would suggest that if you're using Hive or Pig, you look at taking these
> files and putting the JSON records into a sequence file, or a set of
> sequence files. (Then look at HBase to help index them...) 200KB is small.
>
> That would be the same for either Pig or Hive.
>
> In terms of SerDes, I've worked with Pig and ElephantBird; it's pretty nice.
> And yes, you get each record as a row; however, you can always flatten them
> as needed.
>
> Hive?
> I haven't worked with the latest SerDe, but maybe Dean Wampler or Edward
> Capriolo could give you a better answer.
> Going from memory, I don't know that there is a good SerDe that would write
> JSON, just read it. (Hive)
>
> IMHO Pig/ElephantBird is the best so far, but then again I may be dated and
> biased.
>
> I think you're on the right track, or at least the right train of thought.
>
> HTH
>
> -Mike
>
>
> On Jun 12, 2013, at 7:57 PM, Tecno Brain <[email protected]> wrote:
>
>> Hello,
>>   I'm new to Hadoop.
>>   I have a large quantity of JSON documents with a structure similar to
>> what is shown below.
>>
>> {
>>   g : "some-group-identifier",
>>   sg : "some-subgroup-identifier",
>>   j : "some-job-identifier",
>>   page : 23,
>>   ... // other fields omitted
>>   important-data : [
>>     {
>>       f1 : "abc",
>>       f2 : "a",
>>       f3 : "/",
>>       ...
>>     },
>>     ...
>>     {
>>       f1 : "xyz",
>>       f2 : "q",
>>       f3 : "/",
>>       ...
>>     }
>>   ],
>>   ... // other fields omitted
>>   other-important-data : [
>>     {
>>       x1 : "ford",
>>       x2 : "green",
>>       x3 : 35,
>>       map : {
>>         "free-field" : "value",
>>         "other-free-field" : "value2"
>>       }
>>     },
>>     ...
>>     {
>>       x1 : "vw",
>>       x2 : "red",
>>       x3 : 54,
>>       ...
>>     }
>>   ]
>> }
>>
>> Each file contains a single JSON document (gzip compressed, roughly 200KB
>> of pretty-printed JSON text per document when uncompressed).
>>
>> I am interested in analyzing only the "important-data" array and the
>> "other-important-data" array.
>> My source data would ideally be easier to analyze if it looked like a
>> couple of tables with a fixed set of columns. Only the column "map" would
>> be a complex column; all others would be primitives.
>>
>> ( g, sg, j, page, f1, f2, f3 )
>>
>> ( g, sg, j, page, x1, x2, x3, map )
>>
>> So, for each JSON document, I would like to "create" several rows, but I
>> would like to avoid the intermediate step of persisting -and duplicating-
>> the "flattened" data.
>>
>> To avoid persisting the flattened data, I thought I had to write my own
>> MapReduce job in Java, but then discovered that others have had the same
>> problem of using JSON as the source and that there are somewhat "standard"
>> solutions.
>>
>> From reading about the SerDe approach for Hive, I get the impression that
>> each JSON document is transformed into a single "row" of the table, with
>> some columns being an array or a map of other nested structures.
>> a) Is there a way to break each JSON document into several "rows" for a
>> Hive external table?
>> b) It seems there are too many JSON SerDe libraries! Are any of them
>> considered the de-facto standard?
>>
>> The Pig approach also seems promising, using Elephant Bird. Does anybody
>> have pointers to more user documentation on this project? Or is browsing
>> through the examples on GitHub my only source?
>>
>> Thanks
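
[Editor's note] The flattening Jorge describes — turning one nested document into several rows keyed by the shared top-level fields — can be sketched in a few lines of plain Python. This is a minimal illustration, not an ElephantBird or Hive SerDe invocation; the field names come from the example document above, and the sample values here are made up to match its shape:

```python
import json

def flatten(doc):
    """Split one document into two row lists, each row prefixed with the
    shared key columns (g, sg, j, page) from the example above."""
    key = (doc["g"], doc["sg"], doc["j"], doc["page"])
    important = [key + (r["f1"], r["f2"], r["f3"])
                 for r in doc.get("important-data", [])]
    other = [key + (r["x1"], r["x2"], r["x3"], r.get("map", {}))
             for r in doc.get("other-important-data", [])]
    return important, other

# Toy document shaped like the one in the thread.
doc = json.loads("""
{
  "g": "g1", "sg": "sg1", "j": "j1", "page": 23,
  "important-data": [
    {"f1": "abc", "f2": "a", "f3": "/"},
    {"f1": "xyz", "f2": "q", "f3": "/"}
  ],
  "other-important-data": [
    {"x1": "ford", "x2": "green", "x3": 35,
     "map": {"free-field": "value"}}
  ]
}
""")

important, other = flatten(doc)
print(important[0])  # ('g1', 'sg1', 'j1', 23, 'abc', 'a', '/')
print(other[0])      # ('g1', 'sg1', 'j1', 23, 'ford', 'green', 35, {'free-field': 'value'})
```

In Pig or Hive terms, this is what a FLATTEN (Pig) or LATERAL VIEW explode (Hive) over the array column produces, without persisting the duplicated key columns to disk.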
