I know that BigInsights comes with BigSQL, which interacts with HBase as well; have you considered that option? We have a similar use case using BigInsights 2.1.2.
On Thu, May 14, 2015 at 4:56 AM, Nick Dimiduk <ndimi...@gmail.com> wrote:
> + Swarnim, who's an expert on HBase/Hive integration.
>
> Yes, snapshots may be interesting for you. I believe Hive can access HBase
> timestamps, exposed as a "virtual" column. It's assumed across the whole
> row, however, not per cell.
>
> On Sun, May 10, 2015 at 9:14 PM, Jerry He <jerry...@gmail.com> wrote:
>
> > Hi, Yong
> >
> > You have a good understanding of the benefits of HBase already.
> > Generally speaking, HBase is suitable for real-time read/write access
> > to your big data set.
> > Regarding the HBase performance evaluation tool, the 'read' test uses
> > HBase 'get'. For 1M rows, the test issues 1M 'get' calls (and RPCs) to
> > the server.
> > The 'scan' test scans the table and transfers the rows to the client in
> > batches (e.g. 100 rows at a time), so the whole test completes in less
> > time for the same number of rows.
> > The Hive/HBase integration, as you said, needs more consideration.
> > 1) Performance. Hive accesses HBase via the HBase client API, which
> > involves going to the HBase server for all data access. This will slow
> > things down.
> >    There are a couple of things you can explore, e.g. Hive/HBase
> > snapshot integration. This would provide direct access to HBase hfiles.
> > 2) In your email, you are interested in HBase's capability of storing
> > multiple versions of data. You need to consider whether Hive supports
> > this HBase feature, i.e. whether it gives you access to multiple
> > versions. As far as I can remember, it does not fully.
> >
> > Jerry
> >
> >
> > On Thu, May 7, 2015 at 6:18 PM, java8964 <java8...@hotmail.com> wrote:
> >
> > > Hi,
> > > I am kind of new to HBase. Currently our production runs IBM
> > > BigInsights V3, which comes with Hadoop 2.2 and HBase 0.96.0.
> > > We are mostly using HDFS and Hive/Pig for our big data projects,
> > > and they work very well for our big datasets.
> > > Right now, we have one dataset that needs to be loaded from MySQL,
> > > about 100G, with a few GBs of changes daily. This is a very
> > > important slowly changing dimension dataset, and we'd like to keep
> > > it in sync between MySQL and the big data platform.
> > > I am thinking of using HBase to store it, instead of refreshing the
> > > whole dataset in HDFS, because:
> > > 1) HBase makes merging the changes very easy.
> > > 2) HBase can store all the changes in history, as an out-of-the-box
> > > feature. We will replicate all the changes from the MySQL binlog
> > > level, and we could keep all changes in HBase (or a long history),
> > > which can give us insight that cannot be obtained easily in HDFS.
> > > 3) HBase gives us fast access to the data by key, for some cases.
> > > 4) HBase is available out of the box.
> > > What I am not sure about is the Hive/HBase integration. Hive is the
> > > top tool in our environment. If one dataset is stored in HBase (even
> > > only about 100G as now), joining it with the other big datasets in
> > > HDFS worries me. I have read quite a bit about Hive/HBase
> > > integration, and feel that it is not really mature, as there are not
> > > many usage cases I can find online, especially on performance. There
> > > are quite a few JIRAs about making Hive use HBase efficiently in MR
> > > jobs that are still pending.
> > > I want to know other people's experience using HBase in this way. I
> > > understand HBase is not designed as a storage system for a data
> > > warehouse component or analytics engine. But the benefits of using
> > > HBase in this case still attract me. If my use of HBase is mostly
> > > reads or full scans of the data, how bad is it compared to HDFS in
> > > the same cluster? 3x? 5x?
> > > To help me understand the read throughput of HBase, I used the HBase
> > > performance evaluation tool, but the output is quite confusing.
> > > I have 2 clusters. One has 5 nodes with 3 slaves, all running on VMs
> > > (each with 24G + 4 cores, so the cluster has 12 mapper slots + 6
> > > reducer slots). The other is a real cluster with 5 nodes and 3
> > > slaves, each with 64G + 24 cores (48 mapper slots + 24 reducer
> > > slots). Below is the result of running "sequentialRead 3" on the
> > > better cluster:
> > > 15/05/07 17:26:50 INFO mapred.JobClient: Counters: 30
> > > 15/05/07 17:26:50 INFO mapred.JobClient:   File System Counters
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     FILE: BYTES_READ=546
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     FILE: BYTES_WRITTEN=7425074
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     HDFS: BYTES_READ=2700
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     HDFS: BYTES_WRITTEN=405
> > > 15/05/07 17:26:50 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.JobCounter
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     TOTAL_LAUNCHED_MAPS=30
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     TOTAL_LAUNCHED_REDUCES=1
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=2905167
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=11340
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     FALLOW_SLOTS_MILLIS_MAPS=0
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     FALLOW_SLOTS_MILLIS_REDUCES=0
> > > 15/05/07 17:26:50 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.TaskCounter
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_INPUT_RECORDS=30
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_OUTPUT_RECORDS=30
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_OUTPUT_BYTES=480
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     MAP_OUTPUT_MATERIALIZED_BYTES=720
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     SPLIT_RAW_BYTES=2700
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     COMBINE_INPUT_RECORDS=0
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     COMBINE_OUTPUT_RECORDS=0
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_INPUT_GROUPS=30
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_SHUFFLE_BYTES=720
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_INPUT_RECORDS=30
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     REDUCE_OUTPUT_RECORDS=30
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     SPILLED_RECORDS=60
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     CPU_MILLISECONDS=1631450
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     PHYSICAL_MEMORY_BYTES=14031888384
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     VIRTUAL_MEMORY_BYTES=64139960320
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     COMMITTED_HEAP_BYTES=33822867456
> > > 15/05/07 17:26:50 INFO mapred.JobClient:   HBase Performance Evaluation
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     Elapsed time in milliseconds=2489217
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     Row count=3145710
> > > 15/05/07 17:26:50 INFO mapred.JobClient:   File Input Format Counters
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     Bytes Read=0
> > > 15/05/07 17:26:50 INFO mapred.JobClient:   org.apache.hadoop.mapreduce.lib.output.FileOutputFormat$Counter
> > > 15/05/07 17:26:50 INFO mapred.JobClient:     BYTES_WRITTEN=405
> > > First, what is the throughput I should get from the above result?
> > > Does it mean it took 2489 seconds to sequentially read 3.1G of data
> > > (I assume every record is 1k)? That is about 1.2M/s, which is very
> > > low compared to HDFS. Here is the output for the scan operation on
> > > the same cluster:
> > > 15/05/07 17:32:46 INFO mapred.JobClient:   HBase Performance Evaluation
> > > 15/05/07 17:32:46 INFO mapred.JobClient:     Elapsed time in milliseconds=383021
> > > 15/05/07 17:32:46 INFO mapred.JobClient:     Row count=3145710
> > > Does it mean scanning 3.1G of data can be done in 383 seconds on
> > > this cluster? What is the difference between scan and sequential
> > > read?
> > > Of course, all these tests were done with the default settings that
> > > come out of the box with HBase on BigInsights. I am trying to learn
> > > how to tune it.
> > > What I am interested to know is: for a cluster of N nodes, what is
> > > the reasonable read throughput I can expect?
> > > Thanks for your time.
> > > Yong
> > >
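[Editor's note: the throughput figures discussed in the thread can be checked with simple arithmetic from the PerformanceEvaluation counters quoted above. The sketch below assumes ~1 KB of data per row, as the original poster does; the tool's actual bytes per row may differ slightly.]

```python
# Back-of-the-envelope read throughput from the PerformanceEvaluation
# counters in the thread. ROW_BYTES is an assumption (~1 KB per row).
ROW_BYTES = 1024

def throughput_mb_per_s(row_count, elapsed_ms):
    """Aggregate throughput in MB/s across the whole MR job."""
    total_mb = row_count * ROW_BYTES / (1024 * 1024)
    return total_mb / (elapsed_ms / 1000.0)

# sequentialRead: 3,145,710 rows in 2,489,217 ms (get-per-row, one RPC each)
seq = throughput_mb_per_s(3145710, 2489217)

# scan: same 3,145,710 rows in 383,021 ms (rows batched per RPC)
scan = throughput_mb_per_s(3145710, 383021)

print(f"sequentialRead: {seq:.2f} MB/s")  # roughly 1.2 MB/s
print(f"scan:           {scan:.2f} MB/s")  # roughly 8 MB/s
```

This matches the poster's estimate of about 1.2 MB/s for sequentialRead, and shows the ~6.5x advantage of scan, which comes from amortizing RPC round trips over batches of rows rather than issuing one 'get' per row.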