Jason Dere created HIVE-17963:
---------------------------------

             Summary: Fix for HIVE-17113 can be improved for non-blobstore filesystems
                 Key: HIVE-17963
                 URL: https://issues.apache.org/jira/browse/HIVE-17963
             Project: Hive
          Issue Type: Bug
            Reporter: Jason Dere
            Assignee: Jason Dere
            Priority: Major


HIVE-17113/HIVE-17813 fix the duplicate file issue by moving files one at a 
time. For non-blobstore filesystems this results in many more 
filesystem/namenode operations than the previous 
Utilities.mvFileToFinalPath() behavior (dedup files in the src dir, then rename 
the src dir to the final dir).
For non-blobstore filesystems, a better solution would be the one described 
[here|https://issues.apache.org/jira/browse/HIVE-17113?focusedCommentId=16100564&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16100564]:

1) Move the temp directory to a new directory name, to prevent additional files 
from being added by any runaway processes.
2) Run removeTempOrDuplicateFiles() on this renamed temp directory.
3) Run renameOrMoveFiles() to move the renamed temp directory to the final 
location.

This results in only one additional file operation in non-blobstore FSes 
compared to the original Utilities.mvFileToFinalPath() behavior.
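
As a rough illustration of steps 1-3, here is a minimal sketch that uses 
java.nio.file on a local filesystem as a stand-in for HDFS. It is not the Hive 
implementation: the class name TempDirMoveSketch, the ".moved" suffix, and the 
dedup rule (for files named <taskId>_<attempt>, keep the largest file per 
taskId) are all assumptions made for the example, and removeDuplicates() only 
loosely mimics what removeTempOrDuplicateFiles() is described as doing.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class TempDirMoveSketch {

    // Hypothetical stand-in for removeTempOrDuplicateFiles(): assumes files are
    // named "<taskId>_<attempt>" and keeps only the largest file per taskId.
    static void removeDuplicates(Path dir) throws IOException {
        List<Path> files;
        try (Stream<Path> listing = Files.list(dir)) {
            files = listing.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<String, Path> keptByTask = new HashMap<>();
        for (Path file : files) {
            String name = file.getFileName().toString();
            int idx = name.lastIndexOf('_');
            String taskId = idx > 0 ? name.substring(0, idx) : name;
            Path kept = keptByTask.get(taskId);
            if (kept == null) {
                keptByTask.put(taskId, file);
            } else if (Files.size(file) > Files.size(kept)) {
                Files.delete(kept);          // the new file wins, drop the old one
                keptByTask.put(taskId, file);
            } else {
                Files.delete(file);          // the kept file wins, drop this one
            }
        }
    }

    static void moveToFinalPath(Path tmpDir, Path finalDir) throws IOException {
        // Step 1: rename the temp dir so runaway writers can no longer add to it.
        Path renamed = tmpDir.resolveSibling(tmpDir.getFileName() + ".moved");
        Files.move(tmpDir, renamed);
        // Step 2: dedup inside the renamed directory.
        removeDuplicates(renamed);
        // Step 3: a single directory rename to the final location
        // (finalDir must not exist yet and must be on the same filesystem).
        Files.move(renamed, finalDir);
    }
}
```

In this sketch the only extra operation relative to "dedup, then rename" is 
the rename in step 1, which matches the operation count claimed above.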

The proposal is to remove the config setting 
hive.exec.move.files.from.source.dir and always use behavior that handles the 
duplicate file issue described in HIVE-17113. For non-blobstore filesystems we 
will do steps 1-3 described above. For blobstore filesystems we will keep the 
HIVE-17113/HIVE-17813 approach of copying file by file. This should take the 
same number of file operations as a directory rename on a blobstore, since a 
directory rename there is effectively carried out as file-by-file moves.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
