[ https://issues.apache.org/jira/browse/IMPALA-8663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16881431#comment-16881431 ]
ASF subversion and git services commented on IMPALA-8663:
---------------------------------------------------------

Commit fc974f944a9266e68e6f1694eecdc2160fd52582 in impala's branch refs/heads/master from Vihang Karajgaonkar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=fc974f9 ]

IMPALA-8663 : FileMetadataLoader should skip hidden and tmp directories

The FileMetadataLoader is used to load file information when a table is
loaded. By default, it lists all the files in the table/partition directory.
Currently, it only skips invalid filenames (hidden files, names starting with
"_", etc.). However, it does not skip directories which are temporary or
hidden.

When Hive inserts data into a table, it creates a temporary staging
directory, which is a hidden directory under the table location. When the
insert in Hive completes, the staging directory is removed. But if a refresh
is issued during that window, FileMetadataLoader will also add the files in
the staging directory. Not only could this cause temporarily invalid results,
it also puts the table in a bad state once these temporary directories are
removed. The only workaround in such a case is to issue a refresh on the
table again.

This patch adds logic to the FileMetadataLoader to ignore such temporary
staging directories. Unfortunately, Hadoop does not provide an API which can
recursively list files in a directory and skip certain directories. This
patch adds a new FilterIterator which wraps around the existing listFiles,
listStatus and RecursingIterator to skip the hidden directories from the
listing result.

Also, the existing code to recover partitions implements its own recursion
logic, which includes path validation. This already skips such hidden
directories since they do not conform to the partition spec.
The patch makes a minor modification to this method by calling
listStatusIterator directly instead of going through
FileSystemUtil#listStatus, which now uses the filtering remote iterator.

Testing:
1. Added new tests and modified related existing ones to cover interesting
cases.
2. Ran concurrent inserts from Hive while issuing refresh in a loop on the
Impala side. Earlier this would cause the table to go into a bad state. Now
it works fine for the staging directories. It still runs into a
FileNotFoundException from the impalad when there are insert overwrite
statements in Hive.

Change-Id: I2c4a22908304fe9e377d77d6c18d401c3f3294aa
Reviewed-on: http://gerrit.cloudera.org:8080/13665
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Reviewed-by: Vihang Karajgaonkar <vih...@cloudera.com>

> FileMetadataLoader should skip listing files in hidden and tmp directories
> --------------------------------------------------------------------------
>
>                 Key: IMPALA-8663
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8663
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Vihang Karajgaonkar
>            Assignee: Vihang Karajgaonkar
>            Priority: Critical
>              Labels: catalog-v2, impala-acid
>
> Currently, the file metadata loader recursively lists the table and partition
> directories to get the file statuses. For each file status we ignore the
> hidden files in {{FileSystemUtil.isValidDataFile()}}. However, that is not
> sufficient. For instance, if Hive is inserting data into a table when the
> refresh is called, it is possible that the staging directory is present
> within the table directory. This staging directory is a hidden directory
> named {{.hive-staging_*}}. It is possible that this directory contains files
> which are not themselves hidden (that is, whose names do not start with
> a . or _). Such files should be considered temporary files and should not
> be considered as valid data files.
>
> Another instance where we see this happen is in transactional tables, which
> have a {{.manifest}} file located in a {{_tmp}} directory within the table
> directory. This file should also be skipped and not considered as a valid
> data file.
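
The directory-skipping idea described in the commit message can be sketched
roughly as below. This is only an illustration and not the code from the
patch: it works on plain path strings and java.util.Iterator rather than
Hadoop's FileStatus and RemoteIterator, and the names HiddenDirFilter,
isIgnoredDirName, isInIgnoredDir and listVisible are hypothetical. The rule
that a directory is ignored when its name starts with "." or "_tmp" is an
assumption based on the {{.hive-staging_*}} and {{_tmp}} examples in the
issue description.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical helper; not part of Impala or Hadoop.
final class HiddenDirFilter {
  private HiddenDirFilter() {}

  // Assumed rule: a directory is ignored if its name starts with "." or "_tmp".
  static boolean isIgnoredDirName(String name) {
    return name.startsWith(".") || name.startsWith("_tmp");
  }

  // True if any directory between baseDir and the file is ignored.
  // The file name itself (the last component) is not inspected here.
  static boolean isInIgnoredDir(String baseDir, String filePath) {
    String base = baseDir.endsWith("/") ? baseDir : baseDir + "/";
    if (!filePath.startsWith(base)) return false;
    String[] parts = filePath.substring(base.length()).split("/");
    for (int i = 0; i < parts.length - 1; i++) {
      if (isIgnoredDirName(parts[i])) return true;
    }
    return false;
  }

  // Wraps another iterator and yields only the elements accepted by 'keep'.
  static final class FilterIterator<T> implements Iterator<T> {
    private final Iterator<T> base;
    private final Predicate<T> keep;
    private T buffered;
    private boolean hasBuffered;

    FilterIterator(Iterator<T> base, Predicate<T> keep) {
      this.base = base;
      this.keep = keep;
    }

    @Override
    public boolean hasNext() {
      // Advance the wrapped iterator until an accepted element is buffered.
      while (!hasBuffered && base.hasNext()) {
        T candidate = base.next();
        if (keep.test(candidate)) {
          buffered = candidate;
          hasBuffered = true;
        }
      }
      return hasBuffered;
    }

    @Override
    public T next() {
      if (!hasNext()) throw new NoSuchElementException();
      T result = buffered;
      buffered = null;
      hasBuffered = false;
      return result;
    }
  }

  // Returns the entries of a recursive listing that are not under an ignored directory.
  static List<String> listVisible(String baseDir, List<String> listing) {
    Iterator<String> it = new FilterIterator<String>(
        listing.iterator(), p -> !isInIgnoredDir(baseDir, p));
    List<String> out = new ArrayList<>();
    while (it.hasNext()) out.add(it.next());
    return out;
  }
}
```

With a table directory /t, a listing containing /t/p=1/a.parq,
/t/.hive-staging_hive_1/-ext-10000/000000_0 and /t/_tmp/data.manifest would
be reduced to /t/p=1/a.parq by listVisible. Hidden file names would still
need the separate per-file check mentioned in the issue description.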