Hi,

 

The primary key is required for updates/deletes to uniquely identify the record 
that needs to be changed; otherwise you will get unpredictable results, as you 
may update or delete more rows than intended.
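
For example (hypothetical table and column names), an update keyed on a unique 
column touches exactly one row, whereas a predicate on a non-unique column may 
touch many:

    -- trade_id is unique; status is not
    UPDATE trades SET status = 'CANCELLED' WHERE trade_id = 1001;    -- exactly one row
    UPDATE trades SET status = 'CANCELLED' WHERE status = 'PENDING'; -- possibly many rows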

 

This is indeed a requirement for real-time delivery of data to data warehouses 
(DW) such as Sybase IQ using Replication Server. The replicate database (in 
this case the DW) does not need to have that primary key, but the replication 
definition for the target table must include a primary key or unique index.

 

If the idea is to get data from an RDBMS (read: ACID-compliant database) to 
Hive in real time, I would prefer this unique key to be built into the 
replication definition. Hive does not have to have a unique key, as any data 
arriving in Hive from the RDBMS is expected to be transactional, so there is no 
need for a sanity check in Hive. What you are describing as a temp table is (if 
I understand correctly) what is referred to as a staging table in a DW. 
However, there is now a requirement for DWs to receive replicated data from 
transactional databases through SAP Replication Server or Oracle GoldenGate.

 

HTH,

 

Mich Talebzadeh

 

http://talebzadehmich.wordpress.com

 

Publications due shortly:

Creating in-memory Data Grid for Trading Systems with Oracle TimesTen and 
Coherence Cache

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Ltd, its subsidiaries 
or their employees, unless expressly so stated. It is the responsibility of the 
recipient to ensure that this email is virus free, therefore neither Peridale 
Ltd, its subsidiaries nor their employees accept any responsibility.

 

From: Alan Gates [mailto:alanfga...@gmail.com] 
Sent: 26 March 2015 21:48
To: user@hive.apache.org
Subject: Re: Adding update/delete to the hive-hcatalog-streaming API

 

The missing piece for adding update and delete to the streaming API is a 
primary key.  Updates and deletes in SQL work by scanning the table or 
partition where the record resides.  This is assumed to be OK since we are not 
supporting transactional workloads, and thus updates/deletes are assumed to be 
infrequent.  But needing to scan for each update or delete will not perform 
adequately in the streaming case.

I've had a few discussions with others recently who are thinking of adding 
merge-like functionality, where you would upload all changes to a temp table 
and then apply those changes in one scan/transaction.  This is a common way to 
handle these situations in data warehouses, and is much easier than adding a 
primary key concept to Hive.
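
Roughly, that staging pattern would look something like this (just a sketch 
with made-up table names; whether subqueries are allowed in the delete 
predicate depends on the Hive version):

    -- collect the incoming changes, tagged with an operation code
    INSERT INTO changes_staging SELECT * FROM incoming_batch;

    -- apply them against the target in one pass, keyed on id
    DELETE FROM target WHERE id IN (SELECT id FROM changes_staging WHERE op IN ('U','D'));
    INSERT INTO target SELECT id, col1, col2 FROM changes_staging WHERE op IN ('U','I');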

Alan.






Elliot West <tea...@gmail.com>

March 26, 2015 at 14:08

Hi,

 

I'd like to ascertain if it might be possible to add 'update' and 'delete' 
operations to the hive-hcatalog-streaming API. I've been looking at the API 
with interest for the last week as it appears to have the potential to help 
with some general data processing patterns that are prevalent where I work. 
Ultimately, we continuously load large amounts of data into Hadoop which is 
partitioned by some time interval - usually hour, day, or month depending on 
the data size. However, the records that reside in this data can change. We 
often receive some new information that mutates part of an existing record 
already stored in a partition in HDFS. Typically the number of mutations is 
very small compared to the number of records in each partition.

 

To handle this currently we re-read and re-write all partitions that could 
potentially be affected by new data. In practice a single hour's worth of new 
data can require the reading and writing of 1 month's worth of partitions. By 
storing the data in a transactional Hive table I believe that we can instead 
issue updates and deletes for only the affected rows. Although we do use Hive 
for analytics on this data, much of the processing that generates and consumes 
the data is performed using Cascading. Therefore I'd like to be able to read 
and write the data via an API which we'd aim to integrate into a Cascading Tap 
of some description. Our Cascading processes could determine the new, updated, 
and deleted records and then use the API to stream these changes to the 
transactional Hive table.
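
For reference, the kind of transactional table we have in mind would look 
roughly like this (illustrative names only; the bucket count and partitioning 
would depend on the data, and the ACID transaction manager would need to be 
enabled):

    CREATE TABLE events (
      id      BIGINT,
      payload STRING
    )
    PARTITIONED BY (event_date STRING)
    CLUSTERED BY (id) INTO 16 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');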

 

We have most of this working in a proof of concept, but as 
hive-hcatalog-streaming does not expose the delete/update methods of the 
OrcRecordUpdater we've had to hack together something unpleasant based on the 
original API.

 

As a first step I'd like to check whether there is any appetite for adding such 
functionality to the API, or whether this goes against the original motivations 
of the project. If this suggestion sounds reasonable then I'd be keen to help 
move it forward.

 

Thanks - Elliot.

 
