Thanks Jean! A few more questions: what are good practices for row key design in HBase? Say my web logs contain a timestamp and a request id which together uniquely identify each row.

1. Shall I make YYYY-MM-DD-HH-MM-SS_REQ_ID the row key? The scenario is that this data will be fetched from HBase daily and loaded into a MySQL DB: my ETL runs every day and fetches records with keycol >= lastdate and keycol <= today. Will this key design overload one region server, or will the load be divided equally among the region servers?
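If a plain time-leading key does end up hotspotting one region, would salting the key be the right fix? A rough sketch of what I mean, assuming the plain HBase Java client of that era; the table name "weblogs", column family "d", and the bucket count are placeholders I made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedWeblogKeys {

    // Number of salt buckets; pick roughly the number of regions to spread over.
    static final int BUCKETS = 8;

    // Prefix the time-ordered key with a salt derived from the request id,
    // so rows with consecutive timestamps land on different regions.
    static byte[] rowKey(String timestamp, String reqId) {
        int salt = Math.abs(reqId.hashCode() % BUCKETS);
        return Bytes.toBytes(String.format("%02d_%s_%s", salt, timestamp, reqId));
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "weblogs"); // table name is a placeholder

        // Write one log row.
        Put put = new Put(rowKey("2014-04-30-17-55-00", "REQ123"));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("url"), Bytes.toBytes("/index.html"));
        table.put(put);

        // Daily ETL read: the single range [lastdate, today) becomes one
        // range scan per salt bucket (stop row is exclusive).
        for (int salt = 0; salt < BUCKETS; salt++) {
            Scan scan = new Scan();
            scan.setStartRow(Bytes.toBytes(String.format("%02d_%s", salt, "2014-04-29")));
            scan.setStopRow(Bytes.toBytes(String.format("%02d_%s", salt, "2014-04-30")));
            ResultScanner scanner = table.getScanner(scan);
            for (Result r : scanner) {
                // feed each row into the MySQL load here
            }
            scanner.close();
        }
        table.close();
    }
}

Would 8 buckets be a reasonable starting point, or would I also need to pre-split the table on the salt prefixes?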
On Wed, Apr 30, 2014 at 5:55 PM, Jean-Marc Spaggiari <[email protected]> wrote:

> With HBase you have some overhead. The Region Server will do a lot for
> you: manage all the column families, the columns, the delete markers, the
> compactions, etc. If you read a file directly from HDFS it will be faster
> for sure, because you will not have all those validations and all this
> extra memory usage.
>
> HBase is absolutely perfect and excellent at what it's built for. But if
> you are doing only full table scans, that's not its primary use case. It
> can still do it if you want, but if that is all you do, it's not the most
> efficient option.
>
> If your use case is a mix of full scans and random reads/random writes,
> then yes, go with it!
>
> Last, some full table scans can be a good fit for HBase if you use some of
> its specific features, like TTL on certain column families when using more
> than one, etc.
>
> HTH
>
>
> 2014-04-30 8:13 GMT-04:00 Shushant Arora <[email protected]>:
>
> > Hi Jean,
> >
> > Thanks for the explanation.
> >
> > I still have one doubt: why is HBase not good for bulk loads and
> > aggregations (full table scans)? Hive will also read each row for an
> > aggregation, just as HBase will. Can you explain more?
> >
> >
> > On Wed, Apr 30, 2014 at 5:15 PM, Jean-Marc Spaggiari <
> > [email protected]> wrote:
> >
> > > Hi Shushant,
> > >
> > > Hive and HBase are 2 different things. You cannot really use one
> > > versus the other.
> > >
> > > Hive is a query engine against HDFS data. Data can be stored in
> > > different formats like flat text, sequence files, Parquet files, or
> > > even HBase tables. HBase is both a query engine (gets and scans) and
> > > a storage engine on top of HDFS which allows you to store data for
> > > random reads and random writes.
> > >
> > > Then you can also add tools like Phoenix and Impala to the picture,
> > > which will allow you to query the data from HDFS or HBase too.
> > >
> > > A good way to know whether HBase is a good fit is to ask yourself how
> > > you are going to write into HBase and read from HBase. HBase is good
> > > for random reads and random writes. If you only do bulk loads and
> > > aggregations (full table scans), HBase is not a good fit. If you do
> > > random access (client information, event details, etc.), HBase is a
> > > good fit.
> > >
> > > It's a bit over-simplified, but that should give you some starting
> > > points.
> > >
> > >
> > > 2014-04-30 4:34 GMT-04:00 Shushant Arora <[email protected]>:
> > >
> > > > I have a requirement to process huge weblogs on a daily basis.
> > > >
> > > > 1. Data will come incrementally into the datastore on a daily
> > > > basis, and I need cumulative and daily distinct user counts from
> > > > the logs; after that, the aggregated data will be loaded into an
> > > > RDBMS like MySQL.
> > > >
> > > > 2. Data will be loaded into an HDFS data warehouse on a daily
> > > > basis, and the same will be fetched from the HDFS warehouse, after
> > > > some filtering, into an RDBMS like MySQL and processed there.
> > > >
> > > > Which data warehouse is suitable for approach 1 and approach 2, and
> > > > why?
> > > >
> > > > Thanks
> > > > Shushant
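A note on the TTL feature mentioned above: TTL is set per column family in the table descriptor, so raw events and long-lived aggregates can expire on different schedules. A minimal sketch, assuming the same-era Java admin API; the table name "weblogs" and the family names "raw" and "agg" are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class TtlExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Two families: raw events expire after 7 days, aggregates are kept forever.
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("weblogs"));
        HColumnDescriptor raw = new HColumnDescriptor("raw");
        raw.setTimeToLive(7 * 24 * 60 * 60); // TTL is given in seconds
        HColumnDescriptor agg = new HColumnDescriptor("agg"); // default TTL: never expires
        desc.addFamily(raw);
        desc.addFamily(agg);

        admin.createTable(desc);
        admin.close();
    }
}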
