Thanks Doug. 30 million is the size to start with; the growth rate is about 1 million per week.
You mention HBase being used to generate summaries into an RDBMS; I am not quite sure I understood this approach very well. How would you generate the summaries from raw HBase data and update them into an RDBMS? Would we need to accomplish this using a MapReduce job, maybe? Could you please point me to an example use case scenario that has taken this approach?

Thanks
Vivek

On Thu, Oct 27, 2011 at 1:27 AM, Doug Meil <[email protected]> wrote:

> re: "30 million records."
>
> We're obviously pro-HBase on this dist-list, but one of the challenges of
> HBase (and Hadoop in general) is that the architecture can tend to be
> overkill on smaller datasets. That doesn't mean you shouldn't try HBase,
> but expectations should be tempered.
>
> Especially with your requirements #5 and #6, RDBMSs are actually pretty
> good at that for smaller volumes, which is why HBase tends to be used to
> generate summaries into RDBMSs for further slicing and dicing.
>
> If you had an arrival rate of 30 million a day or something, then it would
> be a different story.
>
> On 10/26/11 3:31 PM, "viva v" <[email protected]> wrote:
>
> >Hi,
> >
> >I am working on a use case that has the following characteristics.
> >1) Data volume is on the order of 30 million records
> >2) Data schema is known & is fixed (for the application we are building)
> >3) Data is NOT multi-format. A single key will have integer data for
> >different aspects of that key
> >4) Data will be incrementally updated (some column values will be updated
> >at different points of time)
> >5) There is a need to support ad hoc querying of data (queries are not
> >known ahead of time, and without writing MapReduce jobs)
> >6) Queries are likely to have a lot of joins & aggregations
> >
> >Could you please help me with suggestions on whether I should use
> >1) Hive
> >2) HBase
> >3) Hive over HBase
> >4) Pig over HBase
> >
> >Thanks
> >Vivek
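[Editor's note] The "summaries into an RDBMS" approach being asked about is usually done as a MapReduce job that scans the HBase table (e.g. via HBase's TableMapper/TableMapReduceUtil classes), aggregates in the reduce phase, and writes the results to the RDBMS over JDBC (or via Hadoop's DBOutputFormat, or a Sqoop export of an intermediate HDFS file). Below is a minimal Python sketch of the pattern only, not a real HBase job: the HBase scan is stood in for by a list of (row_key, column, value) tuples, SQLite plays the RDBMS, and all table and column names are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Stand-in for rows returned by an HBase scan. In a real job, the map
# phase would emit these (row_key, column_qualifier, integer_value)
# records from the HBase table.
scanned_rows = [
    ("user1", "clicks", 5),
    ("user1", "clicks", 3),
    ("user2", "clicks", 7),
    ("user1", "views", 10),
]

# "Reduce" phase: aggregate values per (row key, column).
summaries = defaultdict(int)
for row_key, col, value in scanned_rows:
    summaries[(row_key, col)] += value

# Load the summaries into the RDBMS (SQLite here as a stand-in).
# INSERT OR REPLACE keyed on (row_key, col) keeps incremental
# re-runs of the job idempotent.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE summary (row_key TEXT, col TEXT, total INTEGER, "
    "PRIMARY KEY (row_key, col))"
)
for (row_key, col), total in summaries.items():
    conn.execute(
        "INSERT OR REPLACE INTO summary (row_key, col, total) "
        "VALUES (?, ?, ?)",
        (row_key, col, total),
    )
conn.commit()
```

Ad hoc slicing and dicing (requirements #5 and #6) then happens with plain SQL against the small summary table, while HBase keeps the raw data.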
