Re: on duplicate update equivalent?

Gopal Vijayaraghavan Fri, 23 Sep 2016 11:37:55 -0700

> Dimensions change, and I'd rather do update than recreate a snapshot.

Slowly changing dimensions are the common use case for Hive's ACID MERGE.


The feature you need is most likely covered by 

https://issues.apache.org/jira/browse/HIVE-10924

The second comment on that JIRA describes the use case:

"Once an hour, a set of inserts and updates (up to 500k rows) for various 
dimension tables (eg. customer, inventory, stores) needs to be processed. The 
dimension tables have primary keys and are typically bucketed and sorted on 
those keys."

Any other approach would need a full snapshot re-materialization, whereas ACID 
can generate DELETE + INSERT instead of rewriting the original file for a 2% 
upsert.
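
As a rough sketch of what that looks like, assuming a Hive build that already 
includes HIVE-10924: the table and column names below (customer, 
customer_updates, customer_id, name, city) and the bucket count are made up 
for illustration, so adjust them to your schema.

```sql
-- Hypothetical ACID dimension table: bucketed on the key, stored as ORC,
-- and flagged transactional so UPDATE/DELETE/MERGE are allowed.
CREATE TABLE customer (
  customer_id INT,
  name        STRING,
  city        STRING
)
CLUSTERED BY (customer_id) INTO 8 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- Hypothetical staging table holding the hourly batch of changes.
CREATE TABLE customer_updates (
  customer_id INT,
  name        STRING,
  city        STRING
);

-- Upsert: rows whose key already exists are updated in place,
-- rows with a new key are inserted.
MERGE INTO customer AS t
USING customer_updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET name = s.name, city = s.city
WHEN NOT MATCHED THEN INSERT VALUES (s.customer_id, s.name, s.city);
```

Note that the bucketing column (customer_id here) is only used for matching 
and is not in the SET list, since Hive does not allow updating it.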

If you do not have any isolation concerns (as in, a query doing a read when 
only 50% of your update has been applied), using HBase-backed dimension tables 
in Hive is possible, but that does not offer the same transactional consistency 
as the ACID MERGE will.
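
For comparison, a minimal sketch of the HBase route, using the Hive HBase 
storage handler. The table names, the column family "d" and the 
customer_updates staging table are assumptions for illustration only.

```sql
-- Hypothetical Hive table backed by an HBase table named 'customer'.
-- The first Hive column maps to the HBase row key, the rest to columns
-- in column family 'd'.
CREATE TABLE customer_hbase (
  customer_id INT,
  name        STRING,
  city        STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:name,d:city')
TBLPROPERTIES ('hbase.table.name' = 'customer');

-- Each inserted row becomes an HBase put keyed on customer_id, so a row
-- with an existing key replaces the old values and a new key adds a row.
-- Readers can see the batch partially applied while this runs.
INSERT INTO TABLE customer_hbase
SELECT customer_id, name, city
FROM customer_updates;
```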

Cheers,
Gopal
