Hi Shushant,

Hive and HBase are two different things. You cannot really pick one as a replacement for the other.
Hive is a query engine over data stored in HDFS. That data can be stored in different formats, such as flat text, sequence files, Parquet files, or even HBase tables.

HBase is both a query engine (gets and scans) and a storage engine on top of HDFS, which allows you to store data for random reads and random writes.

You can also add tools like Phoenix (SQL on top of HBase) and Impala (SQL over data in HDFS or HBase) to the picture to query the data.

A good way to know whether HBase is a good fit is to ask yourself how you are going to write into HBase and how you are going to read from it. HBase is good for random reads and random writes. If you only do bulk loads and aggregations (full table scans), HBase is not a good fit. If you do random access (client information, event details, etc.), HBase is a good fit.

It's a bit oversimplified, but that should give you some starting points.

2014-04-30 4:34 GMT-04:00 Shushant Arora <[email protected]>:

> I have a requirement to process huge weblogs on a daily basis.
>
> 1. Data will arrive incrementally in the datastore on a daily basis, and I
> need cumulative and daily distinct user counts from the logs. After that,
> the aggregated data will be loaded into an RDBMS like MySQL.
>
> 2. Data will be loaded into an HDFS data warehouse on a daily basis, and the
> same data will be fetched from the HDFS warehouse, after some filtering,
> into an RDBMS like MySQL and processed there.
>
> Which data warehouse is suitable for approaches 1 and 2, and why?
>
> Thanks
> Shushant
>
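To make the rule of thumb in the reply concrete, here is a minimal sketch in plain Python of the two access patterns. It is only a toy model and does not use the Hive or HBase APIs. The record layout and the names LOGS, daily_distinct_users, cumulative_distinct_users and build_user_index are assumptions made up for illustration. The two distinct-count functions have to read every record (the full-scan, aggregation pattern that suits Hive over HDFS), while the keyed lookup fetches one user's rows directly (the random-read pattern that suits HBase).

```python
from collections import defaultdict

# Hypothetical weblog records: (day, user_id, url). Invented sample data.
LOGS = [
    ("2014-04-28", "u1", "/home"),
    ("2014-04-28", "u2", "/home"),
    ("2014-04-29", "u1", "/cart"),
    ("2014-04-29", "u3", "/home"),
]


def daily_distinct_users(logs):
    """Aggregation pattern: every record is read once (a full scan)."""
    users_by_day = defaultdict(set)
    for day, user, _url in logs:
        users_by_day[day].add(user)
    return {day: len(users) for day, users in users_by_day.items()}


def cumulative_distinct_users(logs):
    """Running distinct-user count, day by day, also a full scan."""
    users_by_day = defaultdict(set)
    for day, user, _url in logs:
        users_by_day[day].add(user)
    seen = set()
    result = {}
    for day in sorted(users_by_day):
        seen |= users_by_day[day]
        result[day] = len(seen)
    return result


def build_user_index(logs):
    """Random-access pattern: rows are stored under a key (here the user id)
    so that one user's events can be fetched without scanning everything."""
    index = defaultdict(list)
    for day, user, url in logs:
        index[user].append((day, url))
    return index


if __name__ == "__main__":
    print(daily_distinct_users(LOGS))
    print(cumulative_distinct_users(LOGS))
    print(build_user_index(LOGS)["u1"])
```

Both counts asked for in approach 1 fall in the first category: they touch all of the data and never look up a single row by key.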
