I know that BigInsights comes with BigSQL, which interacts with HBase as well; have you considered that option? We have a similar use case using BigInsights 2.1.2.
On Thu, May 14, 2015 at 4:56 AM, Nick Dimiduk <ndimi...@gmail.com> wrote:
> + Swarnim, who's an expert on HBase/Hive integration.
>
> Yes, snapshots may be interesting for you. I believe Hive can access HBase
> timestamps, exposed as a "virtual" column. It's assumed across the whole
> row, however, not per cell.
>
> On Sun, May 10, 2015 at 9:14 PM, Jerry He <jerry...@gmail.com> wrote:
>
> > Hi, Yong
> >
> > You have a good understanding of the benefits of HBase already.
> > Generally speaking, HBase is suitable for real-time read/write access
> > to your big data set.
> > Regarding the HBase performance evaluation tool, the 'read' test uses
> > HBase 'get'. For 1M rows, the test issues 1M 'get' calls (and RPCs) to
> > the server.
> > The 'scan' test scans the table and transfers the rows to the client in
> > batches (e.g. 100 rows at a time), so the whole test completes in less
> > time for the same number of rows.
> > The Hive/HBase integration, as you said, needs more consideration.
> > 1) Performance. Hive accesses HBase via the HBase client API, which
> > involves going to the HBase server for all data access. This will slow
> > things down.
> >    There are a couple of things you can explore, e.g. Hive/HBase
> > snapshot integration. This would provide direct access to HBase hfiles.
> > 2) In your email, you are interested in HBase's capability of storing
> > multiple versions of data. You need to consider whether Hive supports
> > this HBase feature, i.e. whether it gives you access to multiple
> > versions. As far as I can remember, it does not fully.
> >
> > Jerry
> >
> >
> > On Thu, May 7, 2015 at 6:18 PM, java8964 <java8...@hotmail.com> wrote:
> >
> > > Hi,
> > > I am kind of new to HBase. Currently our production runs IBM
> > > BigInsights V3, which comes with Hadoop 2.2 and HBase 0.96.0.
> > > We are mostly using HDFS and Hive/Pig for our big data projects,
> > > and they work very well for our big datasets.
> > > Right now, we have one dataset that needs to be loaded from MySQL,
> > > about 100G, with a few GBs of changes daily. This is a very
> > > important slowly changing dimension dataset, and we'd like to keep
> > > it in sync between MySQL and the big data platform.
> > > I am thinking of using HBase to store it, instead of refreshing the
> > > whole dataset in HDFS, because:
> > > 1) HBase makes merging the changes very easy.
> > > 2) HBase can store all the changes in history, as an out-of-the-box
> > > feature. We will replicate all the changes from the MySQL binlog
> > > level, and we could keep all changes in HBase (or a long history),
> > > which can give us insight that cannot be obtained easily in HDFS.
> > > 3) HBase gives us fast access to the data by key, for some cases.
> > > 4) HBase is available out of the box.
> > > What I am not sure about is the Hive/HBase integration. Hive is the
> > > top tool in our environment. If one dataset is stored in HBase (even
> > > only about 100G as now), joining it with the other big datasets in
> > > HDFS worries me. I have read quite a bit about Hive/HBase
> > > integration, and feel that it is not really mature, as there are not
> > > many usage cases I can find online, especially on performance. There
> > > are quite a few JIRAs about making Hive use HBase efficiently in MR
> > > jobs that are still pending.
> > > I want to know other people's experience using HBase in this way. I
> > > understand HBase is not designed as a storage system for a data
> > > warehouse component or analytics engine. But the benefits of using
> > > HBase in this case still attract me. If my use of HBase is mostly
> > > reads or full scans of the data, how bad is it compared to HDFS in
> > > the same cluster? 3x? 5x?
> > > To help me understand the read throughput of HBase, I used the HBase
> > > performance evaluation tool, but the output is quite confusing.
> > > I have 2 clusters. One has 5 nodes with 3 slaves, all running on VMs
> > > (each with 24G + 4 cores, so the cluster has 12 mapper slots + 6
> > > reducer slots). The other is a real cluster with 5 nodes and 3
> > > slaves, each with 64G + 24 cores (48 mapper slots + 24 reducer
> > > slots). Below is the result of running "sequentialRead 3" on the
> > > better cluster:
> > > 15/05/07 17:26:50 INFO mapred.JobClient: Counters: 30
> > > 15/05/07 17:26:50 INFO mapred.JobClient:   File System Counters
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     FILE: BYTES_READ=546
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     FILE: BYTES_WRITTEN=7425074
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     HDFS: BYTES_READ=2700
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     HDFS: BYTES_WRITTEN=405
> > > 15/05/07 17:26:50 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.JobCounter
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     TOTAL_LAUNCHED_MAPS=30
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     TOTAL_LAUNCHED_REDUCES=1
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=2905167
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=11340
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     FALLOW_SLOTS_MILLIS_MAPS=0
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     FALLOW_SLOTS_MILLIS_REDUCES=0
> > > 15/05/07 17:26:50 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.TaskCounter
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_INPUT_RECORDS=30
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_OUTPUT_RECORDS=30
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_OUTPUT_BYTES=480
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_OUTPUT_MATERIALIZED_BYTES=720
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     SPLIT_RAW_BYTES=2700
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     COMBINE_INPUT_RECORDS=0
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     COMBINE_OUTPUT_RECORDS=0
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_INPUT_GROUPS=30
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_SHUFFLE_BYTES=720
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_INPUT_RECORDS=30
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_OUTPUT_RECORDS=30
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     SPILLED_RECORDS=60
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     CPU_MILLISECONDS=1631450
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     PHYSICAL_MEMORY_BYTES=14031888384
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     VIRTUAL_MEMORY_BYTES=64139960320
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     COMMITTED_HEAP_BYTES=33822867456
> > > 15/05/07 17:26:50 INFO mapred.JobClient:   HBase Performance Evaluation
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     Elapsed time in milliseconds=2489217
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     Row count=3145710
> > > 15/05/07 17:26:50 INFO mapred.JobClient:   File Input Format Counters
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     Bytes Read=0
> > > 15/05/07 17:26:50 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.lib.output.FileOutputFormat$Counter
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     BYTES_WRITTEN=405
> > > First, what is the throughput I should get from the above result?
> > > Does it mean it took 2489 seconds to sequentially read 3.1G of data
> > > (I assume every record is 1k)? That is about 1.2M/s, which is very
> > > low compared to HDFS. Here is the output for the scan operation on
> > > the same cluster:
> > > 15/05/07 17:32:46 INFO mapred.JobClient:   HBase Performance Evaluation
> > > 15/05/07 17:32:46 INFO mapred.JobClient:     Elapsed time in milliseconds=383021
> > > 15/05/07 17:32:46 INFO mapred.JobClient:     Row count=3145710
> > > Does it mean scanning 3.1G of data can be done in 383 seconds on
> > > this cluster? What is the difference between scan and sequential
> > > read?
> > > Of course, all these tests were done with the default settings that
> > > come out of the box with HBase on BigInsights. I am trying to learn
> > > how to tune it.
> > > What I am interested to know is: for a cluster of N nodes, what is
> > > the reasonable read throughput I can expect?
> > > Thanks for your time.
> > > Yong
> > >
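[Editor's note: the throughput figures discussed in the thread can be checked with simple arithmetic from the PerformanceEvaluation counters quoted above. The sketch below assumes ~1 KB of data per row, as the original poster does; the tool's actual bytes per row may differ slightly.]

```python
# Back-of-the-envelope read throughput from the PerformanceEvaluation
# counters in the thread. ROW_BYTES is an assumption (~1 KB per row).
ROW_BYTES = 1024

def throughput_mb_per_s(row_count, elapsed_ms):
    """Aggregate throughput in MB/s across the whole MR job."""
    total_mb = row_count * ROW_BYTES / (1024 * 1024)
    return total_mb / (elapsed_ms / 1000.0)

# sequentialRead: 3,145,710 rows in 2,489,217 ms (get-per-row, one RPC each)
seq = throughput_mb_per_s(3145710, 2489217)

# scan: same 3,145,710 rows in 383,021 ms (rows batched per RPC)
scan = throughput_mb_per_s(3145710, 383021)

print(f"sequentialRead: {seq:.2f} MB/s")  # roughly 1.2 MB/s
print(f"scan:           {scan:.2f} MB/s")  # roughly 8 MB/s
```

This matches the poster's estimate of about 1.2 MB/s for sequentialRead, and shows the ~6.5x advantage of scan, which comes from amortizing RPC round trips over batches of rows rather than issuing one 'get' per row.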