[ 
https://issues.apache.org/jira/browse/HIVE-21197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

mahesh kumar behera reassigned HIVE-21197:
------------------------------------------


> Hive Replication can add duplicate data during migration from 3.0 to 4
> ----------------------------------------------------------------------
>
>                 Key: HIVE-21197
>                 URL: https://issues.apache.org/jira/browse/HIVE-21197
>             Project: Hive
>          Issue Type: Task
>          Components: repl
>            Reporter: mahesh kumar behera
>            Assignee: mahesh kumar behera
>            Priority: Major
>
> During the bootstrap phase, the files copied to the target may include files 
> created by events that are not part of the bootstrap. This happens because 
> bootstrap first gets the last event id and then the file list, so if an 
> event occurs between those two steps, bootstrap also includes the files 
> created by that event. The same files are then copied again during the 
> first incremental replication just after the bootstrap. In the normal 
> scenario, the duplicate copy does not cause any issue, as Hive allows the use 
> of the target database only after the first incremental. But in the case of 
> migration, the files at source and target are copied to different locations 
> (based on the write id at target), and thus this may lead to duplicate data 
> at the target. This can be avoided by having a check at load time for 
> duplicate files. The check needs to be done only for the first incremental, 
> and the search can be limited to the bootstrap directory (with write id 1). 
> If the file is already present, the copy is simply skipped.
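
The proposed load-time check can be sketched roughly as follows. This is only
an illustration on a local filesystem, not Hive's actual replication code: the
function name, its parameters, and the choice to match files by name alone are
all assumptions made for the example, and the bootstrap directory is passed in
rather than derived from the write id.

```python
import shutil
from pathlib import Path


def copy_unless_in_bootstrap(src_file: Path, target_dir: Path,
                             bootstrap_dir: Path,
                             first_incremental: bool) -> bool:
    """Copy src_file into target_dir; return False if the copy was skipped.

    Hypothetical helper: bootstrap_dir stands in for the directory that
    the bootstrap load wrote (the one with write id 1 in the description).
    """
    # Only the first incremental load can overlap with the bootstrap dump,
    # so later incrementals skip the lookup entirely.
    if first_incremental and (bootstrap_dir / src_file.name).exists():
        # Assumption: a file with the same name in the bootstrap directory
        # is the same file, so copying it again would duplicate the data.
        return False
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src_file, target_dir / src_file.name)
    return True
```

Restricting the lookup to the first incremental keeps the extra existence
check off the path of every later load, which matches the scope suggested in
the description.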



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
