Some thoughts on this:

First, there’s no plan to remove the option of using an RDBMS such as Oracle as 
your backend.  Hive’s RawStore interface is designed so that various 
implementations of the metadata storage can easily coexist.  Obviously 
different users will make different choices about which metadata store makes 
sense for them.
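For context, the metastore implementation is selected by a single config 
property, hive.metastore.rawstore.impl (the default is the ORM-backed 
ObjectStore).  A sketch of what choosing the HBase-backed store looks like in 
hive-site.xml — the property name is real, but the HBaseStore class name below 
is from the hbase-metastore development branch and may differ in your build:

```xml
<property>
  <name>hive.metastore.rawstore.impl</name>
  <!-- default: org.apache.hadoop.hive.metastore.ObjectStore (ORM-backed) -->
  <value>org.apache.hadoop.hive.metastore.hbase.HBaseStore</value>
</property>
```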

As to why HBase:
1) We desperately need to get rid of the ORM layer.  It’s causing us 
performance problems, as evidenced by things like it taking several minutes to 
fetch all of the partition data for queries that span many partitions.  HBase 
is one way to achieve this, not the only way.  See in particular Yahoo’s work 
on optimizing Oracle access: https://issues.apache.org/jira/browse/HIVE-14870 
The question around this is whether we can optimize for Oracle, MySQL, 
Postgres, and SQLServer without creating a maintenance and testing nightmare 
for ourselves.  I’m skeptical, but others think it’s possible.  See the 
comments on that JIRA.
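To make the ORM cost concrete, here is a small self-contained illustration of 
the round-trip pattern (this uses sqlite and made-up table names, not Hive’s 
actual metastore schema): fetching partition metadata one row at a time versus 
one batched query returns the same data, but the per-row version pays a query 
round trip per partition.

```python
# Illustration of ORM-style per-row fetches vs. one batched direct-SQL query.
# Schema and names are hypothetical, purely for demonstration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE partitions (part_id INTEGER PRIMARY KEY, tbl TEXT, location TEXT)"
)
conn.executemany(
    "INSERT INTO partitions VALUES (?, ?, ?)",
    [(i, "sales", f"/warehouse/sales/part={i}") for i in range(1000)],
)

# ORM-style access: one query per partition -> 1000 round trips.
ids = [r[0] for r in conn.execute(
    "SELECT part_id FROM partitions WHERE tbl = 'sales' ORDER BY part_id")]
orm_style = [
    conn.execute(
        "SELECT location FROM partitions WHERE part_id = ?", (i,)
    ).fetchone()[0]
    for i in ids
]

# Direct SQL: everything in a single batched query.
batched = [r[0] for r in conn.execute(
    "SELECT location FROM partitions WHERE tbl = 'sales' ORDER BY part_id")]

assert orm_style == batched  # same data, ~1000x fewer round trips
```

Against a remote RDBMS each of those round trips adds network latency, which is 
where the minutes-long partition fetches come from.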

2) We’d like to scale to much larger sizes, both in terms of data and access 
from nodes.  It’s not that we’re worried about the amount of metadata, but 
we’d like to be able to cache more stats, file splits, etc.  And we’d like to 
allow nodes in the cluster to contact the metastore directly, which we do not 
allow today since many RDBMSs don’t handle a thousand-plus simultaneous 
connections well.  Obviously both data and connection scale can be met with 
high-end commercial stores.  But saying that we have this great open source 
database, yet you have to pay for an expensive commercial license to make the 
metadata really work well, is a non-starter.

3) By using tools from within the Hadoop ecosystem, like HBase, we are helping 
to drive improvement in those systems.

To explain the HBase work a little more: it doesn’t use Phoenix, but works 
directly against HBase, with the help of a transaction manager (Omid).  In the 
performance tests we’ve done so far it’s faster than Hive 1 with the ORM 
layer, but not yet in the 10x range that we’d like to see.  We haven’t yet 
done the work to put in co-processors and such that we expect would speed it 
up further.
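One reason an ordered key-value store fits this workload: with composite row 
keys, all partitions of a table sit in one contiguous key range, so a single 
prefix scan replaces many point lookups.  A sketch of that access pattern — 
the key layout and values below are made up for illustration, not HBaseStore’s 
actual schema:

```python
# Mimic HBase's sorted-key prefix scan over a small in-memory "table".
from bisect import bisect_left, bisect_right

store = {  # rowkey "db:table:partition" -> serialized metadata (made-up)
    "default:clicks:ds=2016-10-23": "loc=/warehouse/clicks/ds=2016-10-23",
    "default:sales:ds=2016-10-22": "loc=/warehouse/sales/ds=2016-10-22",
    "default:sales:ds=2016-10-23": "loc=/warehouse/sales/ds=2016-10-23",
}
keys = sorted(store)  # HBase keeps row keys in sorted order on disk

def prefix_scan(prefix):
    # A scan over a contiguous key range; bisect locates it in O(log n).
    lo = bisect_left(keys, prefix)
    hi = bisect_right(keys, prefix + "\xff")
    return [(k, store[k]) for k in keys[lo:hi]]

# One scan fetches every partition of default.sales.
parts = prefix_scan("default:sales:")
assert [k for k, _ in parts] == [
    "default:sales:ds=2016-10-22",
    "default:sales:ds=2016-10-23",
]
```

Co-processors would push further work (filtering, aggregation) server-side, 
next to the data, which is where we expect the additional speedup to come from.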

Alan.

> On Oct 23, 2016, at 15:46, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> 
> A while back there were some notes on having the Hive metastore on HBase as 
> opposed to conventional RDBMSs.
> 
> I am currently involved with some hefty work with HBase and Phoenix for batch 
> ingestion of trade data. As long as you define your HBase table through 
> Phoenix, with secondary Phoenix indexes on HBase, the speed is impressive.
> 
> I am not sure how much having HBase as the Hive metastore is going to add to 
> Hive performance. We use Oracle 12c as the Hive metastore and the Hive 
> database/schema is built on solid state disks. We have never had any issues 
> with locking and concurrency.
> 
> Therefore I am not sure what one is going to gain by having HBase as the Hive 
> metastore. I trust that we can still use our existing schemas on Oracle.
> 
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
