Aditya Shah created HIVE-22004:
----------------------------------

             Summary: Non-acid to acid conversion doesn't handle random 
filenames
                 Key: HIVE-22004
                 URL: https://issues.apache.org/jira/browse/HIVE-22004
             Project: Hive
          Issue Type: Bug
            Reporter: Aditya Shah


Right now, the only filename patterns supported for the files of a non-acid 
table being converted to acid (original files) are the ones created by Hive 
itself (e.g. 000000, 000000_COPY_1, bucket_00000, etc.). At the same time, Hive 
supports reading non-acid tables whose files have random filenames. We should 
support the same for acid tables.

One way to handle this would be to rename such files. Rename is not a costly 
operation on HDFS, but for non-acid tables located on a blobstore like S3, 
renaming files with random filenames would add costly steps to the conversion 
to acid.

Current scenario: For original files, we assign each file a logical bucket id. 
For unrecognized filename patterns, we assign -1 and ignore those files.
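
The current behaviour can be illustrated with a minimal sketch. This is not 
Hive's actual code: the function name logical_bucket_id and both regular 
expressions are assumptions, simplified so that they accept the example 
filenames mentioned above.

```python
import re

# Hypothetical, simplified patterns; they are NOT copied from Hive's source.
# Matches names such as 000000, 000000_0, 000000_COPY_1, 000001_0_copy_2.
_ORIGINAL = re.compile(r"^(\d+)(?:_\d+)?(?:_copy_\d+)?$", re.IGNORECASE)
# Matches names such as bucket_00000.
_BUCKET = re.compile(r"^bucket_(\d+)$")


def logical_bucket_id(filename: str) -> int:
    """Return the bucket id encoded in a filename, or -1 if unrecognized."""
    for pattern in (_ORIGINAL, _BUCKET):
        match = pattern.match(filename)
        if match:
            return int(match.group(1))
    # Unrecognized pattern: the file gets -1 and is ignored by the caller.
    return -1


def files_to_read(filenames):
    """Keep only the files whose names map to a valid bucket id."""
    return [name for name in filenames if logical_bucket_id(name) != -1]
```

With this sketch, a file such as part-00000-abc.orc maps to -1 and is silently 
dropped, which is the problem this issue describes.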

Proposed alternatives:

1) For all the random files, assume the logical bucket id is 0 and let the 
files belong to the same bucket, similar to what we do for multiple files with 
the same bucket id (_copy_N). 
2) Lexicographically sort all the random files and sequentially assign them 
bucket ids, similar to the handling of multiple files for a non-bucketed table, 
where we extract the bucket id simply from filenames.
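
The two alternatives can be compared with a small sketch. The function names 
are hypothetical and the input is assumed to be the list of filenames that did 
not match any recognized pattern; this only illustrates the assignment rule, 
not how Hive would implement it.

```python
def assign_all_to_bucket_zero(random_files):
    """Alternative 1: every unrecognized file shares logical bucket id 0.

    Sorting is only used here to give the files a stable order inside the
    bucket, standing in for the ordering that _copy_N suffixes provide.
    """
    return {name: 0 for name in sorted(random_files)}


def assign_sequential_bucket_ids(random_files):
    """Alternative 2: sort lexicographically, then number the files 0, 1, 2, ..."""
    return {name: bucket for bucket, name in enumerate(sorted(random_files))}


if __name__ == "__main__":
    files = ["part-b.orc", "data.orc", "part-a.orc"]
    print(assign_all_to_bucket_zero(files))
    print(assign_sequential_bucket_ids(files))
```

In both cases the assignment depends only on the set of filenames, so it can 
be recomputed on every read without renaming anything on the filesystem.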



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
