Thanks, Jean!

A few more questions.
What are good practices for row key design in HBase?
Say my web logs contain a timestamp and a request ID which together
uniquely identify each row.

1. Shall I make YYYY-MM-DD-HH-MM-SS_REQ_ID the row key? In my scenario
the data will be fetched from HBase on a daily basis and loaded into a
MySQL DB: my ETL runs daily and fetches records with keycol >= lastdate
and keycol <= today. Will this key design overload one region server, or
will the load be divided equally among the region servers?
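To make the hotspotting concern concrete, here is a minimal, self-contained Python sketch of one common mitigation, key salting. The bucket count, the MD5 choice, and the exact key layout are all assumptions for illustration, not HBase APIs; the point is only that HBase compares row keys as raw bytes, lexicographically, so a purely time-leading key sends all current writes to one region:

```python
import hashlib

NUM_BUCKETS = 8  # assumed: roughly the number of region servers

def salted_key(timestamp, req_id, buckets=NUM_BUCKETS):
    """Prefix the time-ordered key with a hash bucket so that
    monotonically increasing timestamps spread across regions."""
    base = f"{timestamp}_{req_id}"
    bucket = int(hashlib.md5(base.encode()).hexdigest(), 16) % buckets
    return f"{bucket:02d}_{base}"

def daily_scan_ranges(day, buckets=NUM_BUCKETS):
    """The daily ETL must now issue one range scan per bucket;
    '~' sorts after '-' and the digits, so it closes the day's range."""
    return [(f"{b:02d}_{day}", f"{b:02d}_{day}~") for b in range(buckets)]
```

The trade-off: writes no longer pile onto a single region, but every date-range read fans out into NUM_BUCKETS scans whose results must be merged.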

On Wed, Apr 30, 2014 at 5:55 PM, Jean-Marc Spaggiari <
[email protected]> wrote:

> With HBase you have some overhead. The Region Server does a lot for you:
> it manages all the column families, the columns, the delete markers, the
> compactions, etc. If you read a file directly from HDFS it will be faster
> for sure, because you will not have all those validations and all this
> extra memory usage.
>
> HBase is absolutely excellent at what it's built for. But if you are
> doing only full table scans, that's not its primary use case. It can
> still do them if you want, but if that's all you do, HBase is not the
> most efficient option.
>
> If your use case is a mix of full scans and random reads/random writes,
> then yes, go with it!
>
> Last, some full table scans can be a good fit for HBase if you use some
> of its specific features, like a TTL on certain column families when
> using more than one, etc.
>
> HTH
>
>
> 2014-04-30 8:13 GMT-04:00 Shushant Arora <[email protected]>:
>
> > Hi Jean
> >
> > Thanks for the explanation.
> >
> > I still have one doubt: why is HBase not good for bulk loads and
> > aggregations (full table scans)? Hive also reads each row for an
> > aggregation, just as HBase does.
> > Can you explain more?
> >
> >
> > On Wed, Apr 30, 2014 at 5:15 PM, Jean-Marc Spaggiari <
> > [email protected]> wrote:
> >
> > > Hi Shushant,
> > >
> > > Hive and HBase are two different things; you cannot really use one
> > > versus the other.
> > >
> > > Hive is a query engine over HDFS data. The data can be stored in
> > > different formats like flat text, sequence files, Parquet files, or
> > > even HBase tables. HBase is both a query engine (gets and scans) and
> > > a storage engine on top of HDFS which allows you to store data for
> > > random reads and random writes.
> > >
> > > Then you can also add tools like Phoenix and Impala to the picture,
> > > which will allow you to query the data from HDFS or HBase too.
> > >
> > > A good way to know whether HBase is a good fit is to ask yourself
> > > how you are going to write into HBase and read from it. HBase is
> > > good for random reads and random writes. If you only do bulk loads
> > > and aggregations (full table scans), HBase is not a good fit. If you
> > > do random access (client information, event details, etc.), HBase is
> > > a good fit.
> > >
> > > It's a bit oversimplified, but that should give you some starting
> > > points.
> > >
> > >
> > > 2014-04-30 4:34 GMT-04:00 Shushant Arora <[email protected]>:
> > >
> > > > I have a requirement to process huge web logs on a daily basis.
> > > >
> > > > 1. Data will arrive incrementally in the datastore daily, and I
> > > > need cumulative and daily distinct user counts from the logs;
> > > > after that, the aggregated data will be loaded into an RDBMS like
> > > > MySQL.
> > > >
> > > > 2. Data will be loaded into an HDFS data warehouse daily, and the
> > > > same will be fetched from the HDFS warehouse after some filtering
> > > > into an RDBMS like MySQL and processed there.
> > > >
> > > > Which data warehouse is suitable for approaches 1 and 2, and why?
> > > >
> > > > Thanks
> > > > Shushant
> > > >
> > >
> >
>
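Coming back to requirement 1 quoted above (daily and cumulative distinct user counts), the aggregation itself can be sketched in a few lines of Python. The sample tuples and field layout are invented for illustration; in practice the records would come from the daily log files in HDFS (or be produced by a Hive query):

```python
from collections import defaultdict

# Invented sample records: (date, user_id) pairs parsed from web logs.
logs = [
    ("2014-04-29", "u1"), ("2014-04-29", "u2"),
    ("2014-04-30", "u1"), ("2014-04-30", "u3"),
]

users_by_day = defaultdict(set)
for day, user in logs:
    users_by_day[day].add(user)

seen = set()                       # all users observed so far
daily_counts, cumulative_counts = {}, {}
for day in sorted(users_by_day):   # process days in order
    daily_counts[day] = len(users_by_day[day])
    seen |= users_by_day[day]
    cumulative_counts[day] = len(seen)
```

Here daily_counts holds the per-day distinct users and cumulative_counts the running total; those two small result sets are what would then be loaded into MySQL each day.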
