With HBase you have some overhead. The Region Server does a lot for you:
it manages all the column families, the columns, the delete markers, the
compactions, etc. If you read a file directly from HDFS it will certainly
be faster, because you skip all those validations and all that extra
memory usage.

HBase is excellent at what it's built for. But if you are doing only full
table scans, that's not its primary use case. It can still do them if you
want, but if that is all you do, it's not the most efficient option.

If your use case is a mix of full scans and random reads/random writes,
then yes, go with it!
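To make the two access patterns concrete, here is a quick sketch in the
HBase shell (the 'users' table and row key are made-up names, just for
illustration):

    # Random read: fetch one row by key -- HBase's sweet spot
    get 'users', 'user123'

    # Full table scan: touches every region; it works, but a job
    # reading the files from HDFS directly is usually cheaper
    scan 'users'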

Lastly, some full table scans can be a good fit with HBase if you use some
of its specific features, like TTL on certain column families when you have
more than one, etc.
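As an illustration, a table like the following (the table and family names
are invented) lets old raw cells expire automatically; TTL is in seconds:

    # Two column families; cells in 'raw' expire after 7 days and are
    # dropped at compaction time, no explicit delete pass needed
    create 'weblogs', {NAME => 'raw', TTL => 604800}, {NAME => 'summary'}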

HTH


2014-04-30 8:13 GMT-04:00 Shushant Arora <[email protected]>:

> Hi Jean
>
> Thanks for the explanation.
>
> I still have one doubt:
> why is HBase not good for bulk loads and aggregations (full table
> scans)? Hive will also read each row for an aggregation, just like
> HBase.
> Can you explain more?
>
>
> On Wed, Apr 30, 2014 at 5:15 PM, Jean-Marc Spaggiari <
> [email protected]> wrote:
>
> > Hi Shushant,
> >
> > Hive and HBase are 2 different things. You cannot really use one
> > versus the other.
> >
> > Hive is a query engine against HDFS data. The data can be stored in
> > different formats like flat text, sequence files, Parquet files, or
> > even HBase tables. HBase is both a query engine (Gets and Scans) and a
> > storage engine on top of HDFS which allows you to store data for
> > random reads and random writes.
> >
> > Then you can also add tools like Phoenix and Impala to the picture,
> > which will allow you to query the data from HDFS or HBase as well.
> >
> > A good way to know whether HBase is a good fit is to ask yourself how
> > you are going to write into HBase and read from it. HBase is good for
> > random reads and random writes. If you only do bulk loads and
> > aggregations (full table scans), HBase is not a good fit. If you do
> > random access (client information, event details, etc.), HBase is a
> > good fit.
> >
> > It's a bit oversimplified, but that should give you some starting
> > points.
> >
> >
> > 2014-04-30 4:34 GMT-04:00 Shushant Arora <[email protected]>:
> >
> > > I have a requirement of processing huge weblogs on a daily basis.
> > >
> > > 1. Data will arrive incrementally in the datastore on a daily
> > > basis, and I need cumulative and daily distinct user counts from
> > > the logs; after that, the aggregated data will be loaded into an
> > > RDBMS like mysql.
> > >
> > > 2. Data will be loaded into the HDFS data warehouse on a daily
> > > basis, and the same will be fetched from the HDFS warehouse, after
> > > some filtering, into an RDBMS like mysql and processed there.
> > >
> > > Which data warehouse is suitable for approaches 1 and 2, and why?
> > >
> > > Thanks
> > > Shushant
> > >
> >
>
