It would look something like this... http://hbase.apache.org/book.html#mapreduce.example.summary
... Except your output would be to an RDBMS, instead of HBase.

On 10/27/11 2:51 PM, "viva v" <[email protected]> wrote:

>Thanks Doug.
>
>30 million is the size to start with, growth rate is about 1 million per
>week
>
>You mention HBase being used to generate summaries into an RDBMS, i am not
>quite sure i understood this approach very well.
>How would you generate the summaries from raw HBase data & update into a
>RDBMS, would we need to accomplish this using a mapreduce job maybe?
>
>Could you please point me to an example use case scenario that has taken
>this approach?
>
>Thanks
>Vivek
>
>On Thu, Oct 27, 2011 at 1:27 AM, Doug Meil
><[email protected]> wrote:
>
>>
>> re: "30 million records."
>>
>> We're obviously pro-HBase on this dist-list but one of the challenges of
>> HBase (and Hadoop in general) is that the architecture can tend to be
>> overkill on smaller datasets. That doesn't mean you shouldn't try HBase,
>> but expectations should be tempered.
>>
>>
>> Especially with your requirements #5 and #6, RDBMS are actually pretty
>> good at that for smaller volumes, which is why HBase tends to be used to
>> generate summaries into RDBMSs for further slicing and dicing.
>>
>> If you had an arrival rate of 30 million a day or something, then it
>> would be a different story.
>>
>>
>> On 10/26/11 3:31 PM, "viva v" <[email protected]> wrote:
>>
>> >Hi,
>> >
>> >I am working on a use case that has the following characteristics.
>> >1) Data volume is in the order 30 million records
>> >2) Data schema is known & is fixed (for the application we are building)
>> >3) Data is NOT multi format. A single key will have integer data for
>> >different aspects of that key
>> >4) Data will be incrementally updated (some column values will be updated
>> >at different points of time)
>> >5) There is a need to support adhoc (queries are not known ahead of time)
>> >querying of data (without writing map reduce jobs)
>> >6) Queries are likely to have a lot of joins & aggregations
>> >
>> >Could you please help me with suggestions on whether i should use
>> >1) Hive
>> >2) HBase
>> >3) Hive over HBase
>> >4) Pig over HBase
>> >
>> >Thanks
>> >Vivek
>>
>>
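
P.S. To make the "summaries into an RDBMS" idea a bit more concrete, here is a rough sketch of the shape of such a job. It is not the example from the HBase book and it does not use the HBase or Hadoop APIs: plain Python stands in for the map and reduce steps, an in-memory SQLite database stands in for the RDBMS, and the row-key layout (`user#date`), the column names, and the `summary` table are all made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical stand-in for rows scanned from an HBase table:
# (row_key, {column: integer value}). The "entity#date" key layout is assumed.
rows = [
    ("user1#20111026", {"clicks": 3, "views": 10}),
    ("user2#20111026", {"clicks": 1, "views": 4}),
    ("user1#20111027", {"clicks": 2, "views": 5}),
]


def map_row(row_key, columns):
    """Map step: emit ((entity, metric), value) for every column of a row."""
    entity = row_key.split("#", 1)[0]
    for metric, value in columns.items():
        yield (entity, metric), value


def reduce_values(values):
    """Reduce step: sum all values emitted for one (entity, metric) key."""
    return sum(values)


def summarize(rows):
    """Group the mapped pairs by key and reduce each group (the 'shuffle')."""
    grouped = defaultdict(list)
    for row_key, columns in rows:
        for key, value in map_row(row_key, columns):
            grouped[key].append(value)
    return {key: reduce_values(values) for key, values in grouped.items()}


def write_summary(conn, summary):
    """Write the reduced totals into a relational table (the job's output)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS summary ("
        "entity TEXT, metric TEXT, total INTEGER, "
        "PRIMARY KEY (entity, metric))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO summary (entity, metric, total) VALUES (?, ?, ?)",
        [(entity, metric, total) for (entity, metric), total in summary.items()],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # SQLite is only a stand-in for the RDBMS
write_summary(conn, summarize(rows))

# Ad hoc slicing and dicing then happens in SQL, with no MapReduce job needed.
clicks = conn.execute(
    "SELECT entity, total FROM summary WHERE metric = 'clicks' ORDER BY entity"
).fetchall()
print(clicks)
```

In a real deployment the map and reduce functions would live in a MapReduce job that scans the HBase table, and the write step would go through whatever database driver you use. The point is only that the raw data stays in HBase while the much smaller summary table is what gets queried ad hoc.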
