Mich, thanks for the detailed instructions.

While I am aware of the Hive method, I have a few questions/concerns:

1) The Hive method is an "INSERT ... FROM SELECT", which usually does not
perform as well as a bulk load, though I am not familiar with the actual
implementation.

2) I have another SQL-on-Hadoop engine working well with ORC files, so if
possible I'd like to avoid a system dependency on Hive (one fewer component
to maintain). A sketch of what a Hive-free export could look like follows
this list.

3) HBase already has well-running background processes for Replication
(HBASE-1295) and Backup (HBASE-7912), so I am wondering whether anything
can piggyback on them to deal with the day-to-day work; see the second
sketch below.

The goal is to have HBase as the OLTP front end (to receive data) and the
ORC files (with a SQL engine) as the OLAP end for reporting/analytics. The
ORC files will also serve as my backup in the DR case.
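To make 2) concrete, here is a minimal sketch of the Hive-free route, using
only the HBase client and the Apache ORC core Writer API, and assuming the
marketDataHbase layout from your example below. The output path is made up,
and the commented-out lastRunTs watermark is a hypothetical mechanism for
delta exports:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class HBaseToOrc {

  private static final byte[] CF = Bytes.toBytes("price_info");

  public static void main(String[] args) throws Exception {
    // Schema mirrors the Hive table in your example below.
    TypeDescription schema = TypeDescription.fromString(
        "struct<key:string,ticker:string,timecreated:string,price:string>");

    Configuration conf = HBaseConfiguration.create();
    Writer writer = OrcFile.createWriter(
        new Path("/data/marketdata/part-00000.orc"),  // made-up output path
        OrcFile.writerOptions(conf).setSchema(schema));
    VectorizedRowBatch batch = schema.createRowBatch();

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("marketDataHbase"))) {
      Scan scan = new Scan().addFamily(CF);
      // For a delta export, limit the scan to cells written since the last
      // run; lastRunTs is a watermark the job would have to persist itself.
      // scan.setTimeRange(lastRunTs, System.currentTimeMillis());
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          int row = batch.size++;
          setString(batch, 0, row, r.getRow());
          setString(batch, 1, row, r.getValue(CF, Bytes.toBytes("ticker")));
          setString(batch, 2, row, r.getValue(CF, Bytes.toBytes("timecreated")));
          setString(batch, 3, row, r.getValue(CF, Bytes.toBytes("price")));
          if (batch.size == batch.getMaxSize()) {  // batch is full: flush it
            writer.addRowBatch(batch);
            batch.reset();
          }
        }
      }
    }
    if (batch.size > 0) {          // flush the final partial batch
      writer.addRowBatch(batch);
    }
    writer.close();
  }

  // Copy a byte[] value into a string column of the batch (null -> empty).
  private static void setString(VectorizedRowBatch batch, int col, int row,
                                byte[] val) {
    ((BytesColumnVector) batch.cols[col]).setVal(
        row, val == null ? new byte[0] : val);
  }
}

This is essentially a full-table (or time-ranged) scan turned into one ORC
file per run, so it is closer to a bulk export than to the Hive
INSERT ... SELECT path.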
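And for 3): since HBase 1.0 the replication framework is pluggable
(HBASE-11367), so in theory a custom endpoint could consume WAL edits and
spill them into ORC instead of shipping them to a peer cluster. A rough
sketch under that assumption; OrcDeltaWriter is a hypothetical helper (say,
wrapping the Writer logic above), and all error handling is elided:

import java.util.UUID;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.replication.BaseReplicationEndpoint;
import org.apache.hadoop.hbase.wal.WAL;

// Custom replication endpoint: instead of shipping WAL edits to a slave
// cluster, buffer them and roll them into ORC files.
public class OrcReplicationEndpoint extends BaseReplicationEndpoint {

  // Hypothetical helper, not an HBase class: would buffer cells and
  // periodically write them out as ORC (e.g. with the Writer code above).
  static class OrcDeltaWriter {
    void append(Cell cell) { /* convert the cell to a row and buffer it */ }
    void close()           { /* flush whatever is still buffered */ }
  }

  private final UUID id = UUID.randomUUID();  // stands in for a real peer id
  private OrcDeltaWriter orcWriter;

  @Override
  public UUID getPeerUUID() {
    return id;
  }

  @Override
  public boolean replicate(ReplicateContext context) {
    for (WAL.Entry entry : context.getEntries()) {
      for (Cell cell : entry.getEdit().getCells()) {
        orcWriter.append(cell);
      }
    }
    return true;  // acknowledge, so replication advances past these entries
  }

  @Override
  protected void doStart() {
    orcWriter = new OrcDeltaWriter();
    notifyStarted();
  }

  @Override
  protected void doStop() {
    orcWriter.close();
    notifyStopped();
  }
}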
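If that pans out, I believe the endpoint could be registered from the hbase
shell with something like add_peer '1', ENDPOINT_CLASSNAME =>
'OrcReplicationEndpoint' (1.x syntax; I have not verified). Replication only
captures changes from that point forward, so an initial bulk export like the
scan above would still be needed to seed the ORC side.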
Demai

On Fri, Oct 21, 2016 at 1:57 PM, Mich Talebzadeh <[email protected]>
wrote:

> Create an external table in Hive on the HBase table. Pretty
> straightforward.
>
> hive> CREATE EXTERNAL TABLE marketDataHbase (key STRING, ticker STRING,
>       timecreated STRING, price STRING)
>       STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>       WITH SERDEPROPERTIES ("hbase.columns.mapping" =
>       ":key,price_info:ticker,price_info:timecreated,price_info:price")
>       TBLPROPERTIES ("hbase.table.name" = "marketDataHbase");
>
> Then create a normal table in Hive stored as ORC:
>
> CREATE TABLE IF NOT EXISTS marketData (
>   KEY string
> , TICKER string
> , TIMECREATED string
> , PRICE float
> )
> PARTITIONED BY (DateStamp string)
> STORED AS ORC
> TBLPROPERTIES (
>   "orc.create.index"="true",
>   "orc.bloom.filter.columns"="KEY",
>   "orc.bloom.filter.fpp"="0.05",
>   "orc.compress"="SNAPPY",
>   "orc.stripe.size"="16777216",
>   "orc.row.index.stride"="10000" );
>
> --show create table marketData;
> --Populate target table
> INSERT OVERWRITE TABLE marketData PARTITION (DateStamp = "${TODAY}")
> SELECT
>   KEY
> , TICKER
> , TIMECREATED
> , PRICE
> FROM marketDataHbase;
>
> Run this job as a cron job at a suitable interval.
>
> HTH
>
> Dr Mich Talebzadeh
>
> On 21 October 2016 at 21:48, Demai Ni <[email protected]> wrote:
>
> > hi,
> >
> > I am wondering whether there are existing methods to ETL HBase data to
> > ORC (or other open-source columnar) files?
> >
> > I understand that in Hive an "INSERT INTO Hive_ORC_Table SELECT * FROM
> > HIVE_HBase_Table" can probably get the job done. Is this the common way
> > to do so? Is the performance acceptable, and can it handle the delta
> > update in the case the HBase table changes?
> >
> > I did a bit of googling and found this, which is the other way around:
> > https://community.hortonworks.com/questions/2632/loading-hbase-from-hive-orc-tables.html
> >
> > Will it perform better (compared to the Hive statement above) to use
> > either the replication logic or a snapshot backup to generate ORC files
> > from HBase tables, with incremental update ability?
> >
> > I hope to have as few dependencies as possible. In the example of ORC,
> > it would only depend on Apache ORC's API, and not depend on Hive.
> >
> > Demai
