[ https://issues.apache.org/jira/browse/HIVE-14841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sushanth Sowmyan updated HIVE-14841:
------------------------------------
    Description: 
Per the email sent to the dev list, the current implementation of replication
in Hive has certain drawbacks, for instance:

* Replication follows a rubberbanding pattern, wherein different
tables/partitions can be in different, mixed states on the destination, so
that unless all events have been caught up on, we do not have an equivalent
warehouse. Thus, this only satisfies disaster-recovery (DR) cases, not
load-balancing use cases, and the secondary warehouse is really only seen as a
backup, rather than as a live warehouse that trails the primary.
* The base implementation is naive and has several performance problems. These
include a large amount of data duplication for subsequent events, as mentioned
in HIVE-13348, and having to copy out entire partitions/tables when just a
delta of files might be sufficient. Also, using EXPORT/IMPORT allows us a
simple implementation, but at the cost of a large amount of temporary space,
much of which is not actually applied at the destination.

Thus, we are creating a new branch (repl2) and an uber-JIRA (this one) to
track experimental development towards improving this situation.

  was:
Per the email sent to the dev list, the current implementation of replication
in Hive has certain drawbacks, for instance:

* Replication follows a rubberbanding pattern, wherein different
tables/partitions can be in different, mixed states on the
destination, so that unless all events have been caught up on, we do
not have an equivalent warehouse. Thus, this only satisfies
disaster-recovery (DR) cases, not load-balancing use cases, and the
secondary warehouse is really only seen as a backup, rather than as a
live warehouse that trails the primary.
* The base implementation is naive and has several performance
problems. These include a large amount of data duplication for
subsequent events, as mentioned in HIVE-13348, and having to copy out
entire partitions/tables when just a delta of files might be
sufficient. Also, using EXPORT/IMPORT allows us a simple
implementation, but at the cost of a large amount of temporary space,
much of which is not actually applied at the destination.

Thus, we are creating a new branch (repl2) and an uber-JIRA (this one) to
track experimental development towards improving this situation.


> Replication - Phase 2
> ---------------------
>
>                 Key: HIVE-14841
>                 URL: https://issues.apache.org/jira/browse/HIVE-14841
>             Project: Hive
>          Issue Type: New Feature
>          Components: repl
>    Affects Versions: 2.1.0
>            Reporter: Sushanth Sowmyan
>
> Per the email sent to the dev list, the current implementation of
> replication in Hive has certain drawbacks, for instance:
> * Replication follows a rubberbanding pattern, wherein different
> tables/partitions can be in different, mixed states on the destination, so
> that unless all events have been caught up on, we do not have an equivalent
> warehouse. Thus, this only satisfies disaster-recovery (DR) cases, not
> load-balancing use cases, and the secondary warehouse is really only seen as
> a backup, rather than as a live warehouse that trails the primary.
> * The base implementation is naive and has several performance problems.
> These include a large amount of data duplication for subsequent events, as
> mentioned in HIVE-13348, and having to copy out entire partitions/tables
> when just a delta of files might be sufficient. Also, using EXPORT/IMPORT
> allows us a simple implementation, but at the cost of a large amount of
> temporary space, much of which is not actually applied at the destination.
> Thus, we are creating a new branch (repl2) and an uber-JIRA (this one) to
> track experimental development towards improving this situation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
