[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264026#comment-15264026 ]

Chaoyu Tang commented on HIVE-13509:

Thanks, [~leftylev]

> HCatalog getSplits should ignore the partition with invalid path
>
> Key: HIVE-13509
> URL: https://issues.apache.org/jira/browse/HIVE-13509
> Project: Hive
> Issue Type: Improvement
> Components: HCatalog
> Reporter: Chaoyu Tang
> Assignee: Chaoyu Tang
> Fix For: 2.1.0
>
> Attachments: HIVE-13509.1.patch, HIVE-13509.2.patch, HIVE-13509.patch
>
> It is quite common for there to be a discrepancy between a partition directory
> and its HMS metadata, simply because the directory can be added or deleted
> externally using an hdfs shell command. Technically it should be fixed by MSCK
> and alter table .. add/drop commands etc., but sometimes that might not be
> practical, especially in a multi-tenant environment. This discrepancy does not
> cause any problem for Hive, which returns no rows for a partition with an invalid
> (e.g. non-existing) path, but it fails the Pig load with HCatLoader, because
> HCatBaseInputFormat getSplits throws an error when getting a split for a
> non-existing path. The error message might look like:
> {code}
> Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://xyz.com:8020/user/hive/warehouse/xyz/date=2016-01-01/country=BR
>     at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287)
>     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315)
>     at org.apache.hive.hcatalog.mapreduce.HCatBaseInputFormat.getSplits(HCatBaseInputFormat.java:162)
>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:274)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15263585#comment-15263585 ]

Lefty Leverenz commented on HIVE-13509:

Documented here:
* [HCatalog Configuration Properties -- Storage Directives | https://cwiki.apache.org/confluence/display/Hive/HCatalog+Configuration+Properties#HCatalogConfigurationProperties-StorageDirectives]

Thanks, [~ctang.ma]. I added version information and a link to this JIRA issue.
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15263418#comment-15263418 ]

Chaoyu Tang commented on HIVE-13509:

Committed to 2.1.0. Thanks [~szehon] and [~mithun] for reviewing the patch.
The new configuration hcat.input.ignore.invalid.path needs to be documented.
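To illustrate what the hcat.input.ignore.invalid.path property is meant to control, here is a minimal Java sketch. It is not the code from the HIVE-13509 patch: it uses java.nio.file on the local filesystem in place of the Hadoop FileSystem API, and the class and method names (InvalidPathFilterSketch, selectInputPaths) are hypothetical. Only the property name and the false default come from this thread.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class InvalidPathFilterSketch {
    // Property name taken from the thread; everything else here is hypothetical.
    static final String IGNORE_INVALID_PATH_KEY = "hcat.input.ignore.invalid.path";

    // Returns the partition locations to hand to split computation.
    // With ignoreInvalidPath == false (the default discussed in the thread) the
    // list is returned unchanged, so a missing directory would still fail later.
    static List<Path> selectInputPaths(List<Path> partitionLocations, boolean ignoreInvalidPath) {
        if (!ignoreInvalidPath) {
            return partitionLocations;
        }
        List<Path> existing = new ArrayList<>();
        for (Path location : partitionLocations) {
            // Local-filesystem stand-in for an HDFS existence check (assumption).
            if (Files.exists(location)) {
                existing.add(location);
            }
        }
        return existing;
    }

    public static void main(String[] args) {
        Path present = Paths.get(System.getProperty("java.io.tmpdir"));
        Path missing = present.resolve("date=2016-01-01").resolve("country=BR-missing");
        // Read the flag from a system property, defaulting to false.
        boolean ignore = Boolean.parseBoolean(System.getProperty(IGNORE_INVALID_PATH_KEY, "false"));
        System.out.println(selectInputPaths(Arrays.asList(present, missing), ignore));
    }
}
```

Running it with -Dhcat.input.ignore.invalid.path=true drops the missing directory from the list; without the flag both paths are kept, which mirrors the backwards-compatible default the reviewers asked for.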
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258780#comment-15258780 ]

Mithun Radhakrishnan commented on HIVE-13509:

Yes, sir. +1.
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258527#comment-15258527 ]

Mithun Radhakrishnan commented on HIVE-13509:

bq. ... with Google Guava's {{Iterators.filter()}}.

Actually, please ignore comment#3, above. I was trying to avoid checking {{ignoreInvalidPath}} multiple times. I tried writing it out myself (to illustrate), and saw that the call to {{fs.makeQualified()}} implies that we'll need to use both {{Iterators.filter()}} and {{Iterators.transform()}}, at which point, it's no longer short and sweet.

Please fix #2 above, and I will +1. Also, thanks for adding tests.
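The point about needing both a filter and a transform can be sketched roughly as follows. This is an illustration only: it uses java.util.stream instead of Guava's Iterators, the qualify() helper is a hypothetical stand-in for {{fs.makeQualified()}}, the namenode authority is made up, and the existence check is passed in as a predicate so the example runs without Hadoop.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class FilterTransformSketch {
    // Hypothetical stand-in for fs.makeQualified(): prefixes a made-up scheme and authority.
    static String qualify(String rawPath) {
        return "hdfs://namenode:8020" + rawPath.trim();
    }

    // The existence check can only run on the qualified path, so the pipeline
    // needs a transform step between two filter steps.
    static List<String> qualifiedExistingPaths(List<String> rawPaths, Predicate<String> exists) {
        return rawPaths.stream()
                .filter(p -> p != null && !p.trim().isEmpty()) // drop blank entries
                .map(FilterTransformSketch::qualify)            // transform to a qualified path
                .filter(exists)                                 // drop paths that do not exist
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("/warehouse/xyz/country=US", " ", "/warehouse/xyz/country=BR");
        // Pretend only the US partition directory exists.
        Predicate<String> exists = p -> p.endsWith("country=US");
        System.out.println(qualifiedExistingPaths(raw, exists));
    }
}
```

Even in this form the chain has three stages, which is consistent with the observation that the Guava version would no longer be shorter than a plain loop.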
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15258454#comment-15258454 ]

Mithun Radhakrishnan commented on HIVE-13509:

Reviewing your patch now. On the face of it, it looks good. Looking at it a little more closely...

A couple of observations:
# {{hcat.input.ignore.invalid.path}} is well-named, and would make sense to anyone who'd want to override the default. (I thought we'd go with {{hcat.input.allow.invalid.path=true}}, but your version is better.)
# Consider replacing {{(pathString == null || pathString.trim().isEmpty())}} with {{StringUtils.isBlank(pathString)}}.
# Nitpick: Consider replacing the loop at {{HCatBaseInputFormat.java:Line#335}} with Google Guava's {{Iterators.filter()}}. Then, depending on whether {{ignoreInvalidPath}} is set, the erstwhile loop at Line#329 will either loop on {{paths}} or on {{filteredPaths}}. This will be more readable.
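For observation #2, a small self-contained comparison of the two checks may help. The isBlank() method below is a local stand-in written to match the documented behavior of Commons Lang {{StringUtils.isBlank}} (true for null, empty, or whitespace-only input), so the example runs without the library; the class name and sample values are made up.

```java
class BlankPathCheckSketch {
    // The check quoted in the review comment.
    static boolean originalCheck(String pathString) {
        return pathString == null || pathString.trim().isEmpty();
    }

    // Local stand-in mirroring the documented behavior of StringUtils.isBlank:
    // true for null, empty, or whitespace-only input.
    static boolean isBlank(CharSequence cs) {
        if (cs == null) {
            return true;
        }
        for (int i = 0; i < cs.length(); i++) {
            if (!Character.isWhitespace(cs.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {null, "", "  \t", "/user/hive/warehouse/xyz"};
        for (String s : samples) {
            // Print both results side by side for each sample.
            System.out.println(originalCheck(s) + " " + isBlank(s));
        }
    }
}
```

The two checks agree on ordinary path strings. They are not strictly identical for unusual characters, since trim() strips every character up to U+0020 while Character.isWhitespace follows the Unicode whitespace definition, but that difference is unlikely to matter for partition locations.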
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256936#comment-15256936 ]

Mithun Radhakrishnan commented on HIVE-13509:

Sorry for delaying you on this. If I don't have feedback for you tomorrow, please go ahead and check in as is. I'll trust [~szehon]'s review. :]

Thanks for keeping the default behavior.
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256720#comment-15256720 ]

Chaoyu Tang commented on HIVE-13509:

[~mithun] Have you had a chance to review the new patch, revised based on your requests? Otherwise, I will go ahead and commit the patch, since [~szehon] has already given a +1 on the fix.
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248591#comment-15248591 ]

Chaoyu Tang commented on HIVE-13509:

[~szehon] [~mithun], could you review the patch? It has a new property, hcat.input.ignore.invalid.path, for backwards compatibility. Thanks.
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244226#comment-15244226 ]

Chaoyu Tang commented on HIVE-13509:

No problem, [~mithun]
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15244227#comment-15244227 ]

Chaoyu Tang commented on HIVE-13509:

Two failed tests are aged and not related to this patch.
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243964#comment-15243964 ]

Hive QA commented on HIVE-13509:

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12798886/HIVE-13509.1.patch

{color:green}SUCCESS:{color} +1 due to 2 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 9977 tests executed
*Failed tests:*
{noformat}
TestMiniTezCliDriver-cte_4.q-orc_merge5.q-vectorization_limit.q-and-12-more - did not produce a TEST-*.xml file
org.apache.hadoop.hive.cli.TestMiniSparkOnYarnCliDriver.testCliDriver_index_bitmap3
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/7611/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/7611/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-7611/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12798886 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243789#comment-15243789 ]

Mithun Radhakrishnan commented on HIVE-13509:

I'm stuck on production support at the moment. I'll review this on Monday. Sorry for the delay.
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243410#comment-15243410 ]

Chaoyu Tang commented on HIVE-13509:

[~mithun] Could you take a look at the patch to see if it looks good to you?
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242138#comment-15242138 ]

Chaoyu Tang commented on HIVE-13509:

I can do that, though I do not think it is proper, because the default setting should favor the right behavior :-)
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242128#comment-15242128 ]

Rohini Palaniswamy commented on HIVE-13509:

bq. hcat.input.ignore.invalid.path but set it to default true which returns nothing for invalid (or empty) partition?

The default will have to be false, similar to the other failure.percent settings I gave as examples. Only someone who is messing with their data and is OK with that should have to turn it on.
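The opt-in behaviour argued for here can be sketched in a few lines. This is only an illustration under stated assumptions: the property name hcat.input.ignore.invalid.path comes from this thread, while the class IgnoreInvalidPathFlag, the helper ignoreInvalidPath, and the use of java.util.Properties as a stand-in for the job configuration are hypothetical and are not taken from the HCatalog code or the attached patch.

```java
import java.util.Properties;

public class IgnoreInvalidPathFlag {
    // Property name proposed in this thread.
    static final String KEY = "hcat.input.ignore.invalid.path";

    // Hypothetical helper: lenient handling is enabled only when the client
    // explicitly sets the property to "true"; an unset property means strict.
    static boolean ignoreInvalidPath(Properties conf) {
        return Boolean.parseBoolean(conf.getProperty(KEY, "false"));
    }

    public static void main(String[] args) {
        Properties untouched = new Properties();   // client set nothing
        Properties optedIn = new Properties();
        optedIn.setProperty(KEY, "true");          // client explicitly opts in

        System.out.println(ignoreInvalidPath(untouched)); // prints false
        System.out.println(ignoreInvalidPath(optedIn));   // prints true
    }
}
```

With a default of "false", a job that never mentions the property keeps failing on a missing path, which is the behaviour the Pig side asks to preserve.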
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242117#comment-15242117 ]

Chaoyu Tang commented on HIVE-13509:

OK, how about I add an HCat property, hcat.input.ignore.invalid.path, but set its default to true, so that nothing is returned for an invalid (or empty) partition?
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15242058#comment-15242058 ]

Mithun Radhakrishnan commented on HIVE-13509:

I knew this would be a sticking point with the Pig folks ([~rohini], et al.). I'm afraid I agree with their assessment as well. Changing the default behaviour of {{HCatLoader}} to break Pig semantics would be incorrect, and would hide problems with missing data. We've run into failures/bugs in the {{FileOutputCommitterContainer}} that thankfully didn't propagate downstream, thanks to the current behaviour.

Can we keep the default behaviour, with a client-side option to ignore missing data?
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15241989#comment-15241989 ]

Rohini Palaniswamy commented on HIVE-13509:

IMHO, Hive should also throw an error if the data does not exist, because the results returned are incomplete and wrong. Data integrity is important. If some users are OK with it, then it can be a configurable option for them, but it cannot be the default (at least with Pig). For example, mapred.max.map.failures.percent and mapred.max.reduce.failures.percent are useful for users who are OK with tolerating some amount of failure, but the default is 0. The same goes for pig.error.threshold.percent.
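The "tolerance is opt-in, default is 0" pattern referred to above can be illustrated with a small check. This is a sketch under assumptions, not how Hadoop or Pig evaluate mapred.max.map.failures.percent or pig.error.threshold.percent internally; the class FailureThreshold and the helper withinTolerance are hypothetical names used only for illustration.

```java
public class FailureThreshold {
    // Hypothetical helper: returns true when the share of failed units stays
    // within the tolerated percentage. With a tolerance of 0, a single
    // failure is already too many.
    static boolean withinTolerance(int failed, int total, int maxFailuresPercent) {
        if (total <= 0) {
            return failed == 0;
        }
        // Compare failed/total <= percent/100 using integer arithmetic
        // to avoid rounding surprises.
        return (long) failed * 100 <= (long) maxFailuresPercent * total;
    }

    public static void main(String[] args) {
        System.out.println(withinTolerance(0, 10, 0));   // prints true
        System.out.println(withinTolerance(1, 10, 0));   // prints false
        System.out.println(withinTolerance(1, 10, 10));  // prints true
    }
}
```

The point of the analogy is that the strict outcome needs no configuration at all, while any tolerance has to be requested explicitly by the user who accepts it.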
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15241955#comment-15241955 ]

Szehon Ho commented on HIVE-13509:

Chatting with Chaoyu, it seems we do have a Pig/HCat user who does not want an exception if the input directory is non-existent, so that the behavior is the same as Hive's. I'm +1 on the change, though admittedly I am not familiar enough with the Pig side to understand why failure is preferred.
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240484#comment-15240484 ]

Chaoyu Tang commented on HIVE-13509:

Yes, this patch is actually trying to close the gap you mentioned between Hive and Pig via HCatalog. It seems to make more sense to return nothing for a non-existing (or empty) partition instead of throwing an error. We should not expect every query to return some data in its output. When a query returns nothing because there is indeed no data in the specified partition, the data is not missing. What do you think?
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240242#comment-15240242 ]

Rohini Palaniswamy commented on HIVE-13509:

bq. In ETL jobs using Pig, we might actually prefer a failure when the input data isn't available. Wouldn't this fix break those semantics for Pig?

Yes. Missing data in output is not acceptable.
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240190#comment-15240190 ]

Mithun Radhakrishnan commented on HIVE-13509:

+[~daijy], [~rohini].

One possible concern is the disconnect between Hive and Pig:
# When one attempts to consume a *non-existent* directory (i.e. not just an empty directory) through Pig, one gets a failure.
# When one attempts to consume a non-existent partition (e.g. {{dt='3016-04-13'}}) in Hive, via an unsatisfied partition-predicate, the query runs successfully (and returns nothing).

In ETL jobs using Pig, we might actually prefer a failure when the input data isn't available. Wouldn't this fix break those semantics for Pig?
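The two behaviours contrasted above can be sketched side by side. This is a minimal illustration, not HCatalog or Pig code: the class PartitionPathCheck and the helper usablePaths are hypothetical, and the local filesystem (java.nio.file) stands in for HDFS so that the example is self-contained.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PartitionPathCheck {
    // Hypothetical helper: returns the partition paths that exist.
    // ignoreMissing == false mirrors the failing behaviour (case 1 above),
    // ignoreMissing == true mirrors the "returns nothing" behaviour (case 2).
    static List<Path> usablePaths(List<Path> partitionPaths, boolean ignoreMissing) {
        List<Path> usable = new ArrayList<>();
        for (Path p : partitionPaths) {
            if (Files.exists(p)) {
                usable.add(p);
            } else if (!ignoreMissing) {
                throw new IllegalStateException("Input path does not exist: " + p);
            }
            // Otherwise the missing path is skipped and contributes no splits.
        }
        return usable;
    }

    public static void main(String[] args) {
        Path present = Paths.get("").toAbsolutePath();            // current directory
        Path missing = present.resolve("no-such-partition-dir");  // assumed absent
        List<Path> paths = Arrays.asList(present, missing);

        // Lenient mode silently drops the missing partition directory.
        System.out.println(usablePaths(paths, true).size());

        // Strict mode surfaces the missing input as a failure.
        try {
            usablePaths(paths, false);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In lenient mode an ETL job would continue with partial input and give no signal that a partition was skipped, which is the risk raised in this comment.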
[jira] [Commented] (HIVE-13509) HCatalog getSplits should ignore the partition with invalid path
[ https://issues.apache.org/jira/browse/HIVE-13509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15240156#comment-15240156 ]

Chaoyu Tang commented on HIVE-13509:

The patch has been uploaded to https://reviews.apache.org/r/46174/ for review. [~alangates], [~mithun], [~szehon], could you help review the patch? Thanks in advance.