[jira] [Updated] (HIVE-27985) Avoid duplicate files.
[ https://issues.apache.org/jira/browse/HIVE-27985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-27985: -- Affects Version/s: 4.0.0
> Avoid duplicate files.
> --
> Key: HIVE-27985
> URL: https://issues.apache.org/jira/browse/HIVE-27985
> Project: Hive
> Issue Type: Bug
> Components: Tez
> Affects Versions: 4.0.0
> Reporter: Chenyu Zheng
> Assignee: Chenyu Zheng
> Priority: Major
> Attachments: how tez examples commit.png
>
> *1 Introduction*
> Hive on Tez occasionally produces duplicate files, especially when speculative
> execution is enabled. Hive identifies and removes duplicate files through
> removeTempOrDuplicateFiles, but this logic often does not take effect. For
> example, a killed task attempt may commit files while this method is running,
> or files under HIVE_UNION_SUBDIR_X are not recognized during UNION ALL. There
> are many issues addressing these problems, mostly focused on how to identify
> duplicate files. *This issue instead solves the problem by avoiding the
> generation of duplicate files.*
> *2 How does Tez avoid duplicate files?*
> After testing, I found that the Hadoop MapReduce examples and the Tez examples
> do not have this problem: with a properly designed OutputCommitter, duplicate
> files can be avoided. Let's analyze how Tez avoids duplicate files.
> {color:#172b4d}_Note: Compared with Tez, Hadoop MapReduce has one extra
> commitPending step, which is not critical, so only Tez is analyzed here._{color}
> !how tez examples commit.png|width=778,height=483!
>
> Let's walk through the steps:
> * (1) {*}process records{*}: Process the records.
> * (2) {*}send canCommit request{*}: After all records are processed, call
> canCommit remotely on the AM.
> * (3) {*}update commitAttempt{*}: When the AM receives the canCommit request,
> it checks whether another task attempt of the same task has already called
> canCommit. If not, it returns true; otherwise it returns false. This ensures
> that only one task attempt per task commits.
> * (4) {*}return canCommit response{*}: The task receives the AM's response.
> If true, it may commit. If false, another task attempt has already started
> committing and this attempt must not commit; it loops back to (2) and keeps
> calling canCommit until it is killed or the other attempt fails.
> * (5) {*}output.commit{*}: Execute the commit, i.e. rename the generated
> temporary file to the final file.
> * (6) {*}notify succeeded{*}: Although the task has produced the final file,
> the AM still needs to be told that the work is done, so the task reports via
> heartbeat that the current task attempt has completed.
> There is a gap in these steps: if the task hits an exception after (5) but
> before (6), the AM does not know that the task attempt completed, so it starts
> a new task attempt, which generates a new file and causes duplication. I added
> code that randomly throws exceptions between (5) and (6), and found that the
> Tez examples still did not produce duplicate data. Why? Because every task
> attempt of the same task generates the same final file name. When a new task
> attempt commits and finds that the final file already exists (written by a
> previous attempt), it deletes the file first and then renames its own.
> Regardless of whether the previous task attempt committed normally, the last
> successful task attempt clears the earlier results.
> To summarize, tez-examples uses two methods to avoid duplicate files:
> * (1) Avoid repeated commits through canCommit. This is particularly
> effective for tasks with speculative execution turned on.
> * (2) The final file names generated by different task attempts are the same.
> Combined with canCommit, this guarantees that only one file is generated in
> the end, and only by a successful task attempt.
> *3 Why can't Hive on Tez avoid duplicate files?*
> Hive on Tez has neither of the two mechanisms used in the Tez examples.
> First, Hive on Tez does not call canCommit: TezProcessor inherits from
> AbstractLogicalIOProcessor, while the canCommit logic of the Tez examples
> lives mainly in SimpleMRProcessor.
> Second, the file names generated by different task attempts in Hive on Tez
> are not the same: the file generated by the first attempt of a task is 00_0,
> and the file generated by the second attempt is 00_1.
> *4 How to improve?*
> Use canCommit to ensure that speculative tasks will not commit at the same
> time. (HIVE-27899)
> Let different task attempts for each task
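The two mechanisms described above can be sketched with a toy model. This is a hypothetical illustration, not the actual Tez API: `CommitGate` stands in for the AM-side canCommit bookkeeping, and `finalFileName` for a deterministic, attempt-independent output name.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the two duplicate-avoidance mechanisms (names are hypothetical):
// (1) an AM-side canCommit gate that lets only one attempt per task commit, and
// (2) deterministic final file names, so a retried attempt overwrites the same
//     target instead of producing a second file.
class CommitGate {
    // taskId -> attemptId that has been granted the right to commit
    private final Map<Integer, Integer> committingAttempt = new ConcurrentHashMap<>();

    // Returns true only for the first attempt of a task that asks to commit
    // (and remains true for that same attempt if it asks again).
    boolean canCommit(int taskId, int attemptId) {
        Integer winner = committingAttempt.putIfAbsent(taskId, attemptId);
        return winner == null || winner == attemptId;
    }

    // Deterministic final name: depends on the task, not the attempt, so every
    // attempt of the same task targets the same file.
    static String finalFileName(int taskId) {
        return String.format("%06d_0", taskId);
    }
}
```

With this gate, a speculative attempt that loses the canCommit race keeps polling instead of committing, and even an attempt that dies between commit and the success heartbeat leaves behind a file that the next attempt simply replaces.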
[jira] [Commented] (HIVE-28110) MetastoreConf - String casting of default values breaks Hive
[ https://issues.apache.org/jira/browse/HIVE-28110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834519#comment-17834519 ] Ayush Saxena commented on HIVE-28110: - I don't think we can do {{toString()}} here. The correct way would be for the client to call {{getAsString()}} or {{getBoolVar()}}. There may be cases where the defaultVal object doesn't have its own {{toString()}}; in that case it would return the value from the Object class. I don't think it is a bug, but a wrong way of using the method. > MetastoreConf - String casting of default values breaks Hive > > > Key: HIVE-28110 > URL: https://issues.apache.org/jira/browse/HIVE-28110 > Project: Hive > Issue Type: Bug > Components: Configuration >Affects Versions: All Versions > Environment: Ubuntu 22.04 > VSCode with Extension Pack for Java > CommitHash: bee33d2018 on Apache Hive master branch > (https://github.com/apache/hive) >Reporter: Dominik Diedrich >Assignee: tanishqchugh >Priority: Minor > Labels: easyfix, pull-request-available > > When using the *getVar(Configuration conf, ConfVars var)* method of the > MetastoreConf class, Hive breaks when e.g. trying to retrieve the > environment variable "USE_SSL" and it isn't set in the system. The method > then tries to cast the default value, which is the boolean false for USE_SSL, > to a String, which can't work. > > {quote}{{return val == null ? conf.get(var.hiveName, > {color:#FF}*(String)var.defaultVal*{color}) : val;}}{quote} > > Also in the *getStringCollection(Configuration conf, ConfVars var)* method it > tries to cast any default values to a String. > > Strangely, e.g. in the method *get(Configuration conf, String key)* the > default value isn't cast; instead the .toString() method is called, which should > also be done for the 2 methods I mentioned above. > > If nobody has time for that fix, I could open a PR for it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
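The failure mode under discussion is plain Java semantics: the stored default for a boolean conf is a Boolean object, so casting it to String throws a ClassCastException at runtime, while calling toString() (or a typed getter) works. A minimal standalone sketch (the helper names are hypothetical, not MetastoreConf API):

```java
// Minimal illustration of the reported bug: defaultVal is stored as Object.
// Casting a Boolean default to String fails at runtime; toString() does not.
class DefaultValCast {
    static String asStringByCast(Object defaultVal) {
        return (String) defaultVal;           // ClassCastException for Boolean defaults
    }

    static String asStringByToString(Object defaultVal) {
        return defaultVal == null ? null : defaultVal.toString();
    }
}
```

This is why the comment above argues for typed accessors such as getBoolVar(): they avoid both the unchecked cast and any reliance on a sensible toString() implementation.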
[jira] [Updated] (HIVE-28173) Issues with staging dirs with materialized views on HDFS encrypted table
[ https://issues.apache.org/jira/browse/HIVE-28173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-28173: -- Affects Version/s: 4.0.0 > Issues with staging dirs with materialized views on HDFS encrypted table > > > Key: HIVE-28173 > URL: https://issues.apache.org/jira/browse/HIVE-28173 > Project: Hive > Issue Type: Bug > Components: HiveServer2 >Affects Versions: 4.0.0 >Reporter: Steve Carlin >Priority: Major > Labels: pull-request-available > > In the materialized view registry thread, which runs in the background, there > are 2 issues involving staging directories on HDFS-encrypted tables: > 1) The staging directory is created at compile time. For non-HDFS-encrypted > tables, the "mkdir" flag is set to false; there is no such flag for > HDFS-encrypted tables. > 2) The "FileSystem.deleteOnFileExit()" method is not called from the > HiveMaterializedViewRegistry thread. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-24291) Compaction Cleaner prematurely cleans up deltas
[ https://issues.apache.org/jira/browse/HIVE-24291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-24291: -- Affects Version/s: 4.0.0 > Compaction Cleaner prematurely cleans up deltas > --- > > Key: HIVE-24291 > URL: https://issues.apache.org/jira/browse/HIVE-24291 > Project: Hive > Issue Type: Bug >Affects Versions: 4.0.0 >Reporter: Peter Varga >Assignee: Peter Varga >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0-alpha-1 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Since HIVE-23107 the cleaner can clean up deltas that are still used by > running queries. > Example: > * TxnId 1-5 write to a partition, all commit > * Compactor starts with txnId=6 > * Long running query starts with txnId=7; it sees txnId=6 as open in its > snapshot > * Compaction commits > * Cleaner runs > Previously the min_history_level table would have prevented the Cleaner from > deleting deltas 1-5 while txnId=7 is open, but now they will be deleted and the > long running query may fail if it tries to access the files. > A solution could be to not run the cleaner while any txn is open that was > opened before the compaction was committed (CQ_NEXT_TXN_ID) -- This message was sent by Atlassian Jira (v8.20.10#820010)
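The proposed guard reduces to a single predicate. This is a hypothetical sketch of the idea, not actual Hive code: `cqNextTxnId` stands for the compaction record's CQ_NEXT_TXN_ID, and `minOpenTxnId` for the lowest currently open transaction id (with a sentinel of Long.MAX_VALUE when nothing is open).

```java
// Sketch of the suggested fix: the Cleaner skips a compaction's cleanup while
// any transaction that opened before the compaction committed is still open.
class CleanerGuard {
    // minOpenTxnId: smallest currently open txn id, or Long.MAX_VALUE if none
    static boolean mayClean(long cqNextTxnId, long minOpenTxnId) {
        // Every open txn with id < cqNextTxnId may still have the old deltas
        // in its snapshot, so cleanup must wait for them to close.
        return minOpenTxnId >= cqNextTxnId;
    }
}
```

In the example above, with the long-running query's txnId=7 still open and the compaction committed (assuming CQ_NEXT_TXN_ID=8 for illustration), the guard blocks cleanup; once txnId=7 closes, the deltas can be removed safely.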
[jira] [Updated] (HIVE-28143) After HIVE-27492 fix, some HPLSQL built-in functions like trim, lower are not working when used in insert statement.
[ https://issues.apache.org/jira/browse/HIVE-28143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-28143: -- Affects Version/s: 4.0.0 > After HIVE-27492 fix, some HPLSQL built-in functions like trim, lower are not > working when used in insert statement. > > > Key: HIVE-28143 > URL: https://issues.apache.org/jira/browse/HIVE-28143 > Project: Hive > Issue Type: Bug > Components: hpl/sql >Affects Versions: 4.0.0 >Reporter: Dayakar M >Assignee: Dayakar M >Priority: Major > Labels: pull-request-available > > After HIVE-27492 fix, some HPLSQL built-in functions like trim, lower are not > working when used in insert statement. > Steps to reproduce: > {noformat} > CREATE TABLE result (name String); > CREATE PROCEDURE p1(s1 string) > BEGIN > INSERT INTO result VALUES(lower(s1)); > END; > call p1('abcd'); > SELECT * FROM result;{noformat} > Error reported: > {noformat} > ERROR : Ln:3 identifier 'LOWER' must be declared.{noformat} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-24291) Compaction Cleaner prematurely cleans up deltas
[ https://issues.apache.org/jira/browse/HIVE-24291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-24291: -- Affects Version/s: (was: 4.0.0) > Compaction Cleaner prematurely cleans up deltas > --- > > Key: HIVE-24291 > URL: https://issues.apache.org/jira/browse/HIVE-24291 > Project: Hive > Issue Type: Bug >Reporter: Peter Varga >Assignee: Peter Varga >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0-alpha-1 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > Since HIVE-23107 the cleaner can clean up deltas that are still used by > running queries. > Example: > * TxnId 1-5 write to a partition, all commit > * Compactor starts with txnId=6 > * Long running query starts with txnId=7; it sees txnId=6 as open in its > snapshot > * Compaction commits > * Cleaner runs > Previously the min_history_level table would have prevented the Cleaner from > deleting deltas 1-5 while txnId=7 is open, but now they will be deleted and the > long running query may fail if it tries to access the files. > A solution could be to not run the cleaner while any txn is open that was > opened before the compaction was committed (CQ_NEXT_TXN_ID) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-26339) HIVE-26047 Related LIKE pattern issues
[ https://issues.apache.org/jira/browse/HIVE-26339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-26339: -- Target Version/s: 4.1.0 > HIVE-26047 Related LIKE pattern issues > -- > > Key: HIVE-26339 > URL: https://issues.apache.org/jira/browse/HIVE-26339 > Project: Hive > Issue Type: Bug > Components: Vectorization >Affects Versions: 4.0.0 >Reporter: Ryu Kobayashi >Assignee: Ryu Kobayashi >Priority: Major > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > Fixed https://issues.apache.org/jira/browse/HIVE-26047 without using regular > expressions. I also confirmed that the current regular-expression-based code > cannot support the following LIKE patterns. > End pattern > {code:java} > %abc\%def {code} > Start pattern > {code:java} > abc\%def% {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-26339) HIVE-26047 Related LIKE pattern issues
[ https://issues.apache.org/jira/browse/HIVE-26339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-26339: -- Component/s: Vectorization > HIVE-26047 Related LIKE pattern issues > -- > > Key: HIVE-26339 > URL: https://issues.apache.org/jira/browse/HIVE-26339 > Project: Hive > Issue Type: Bug > Components: Vectorization >Affects Versions: 4.0.0 >Reporter: Ryu Kobayashi >Assignee: Ryu Kobayashi >Priority: Major > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > Fixed https://issues.apache.org/jira/browse/HIVE-26047 without using regular > expressions. I also confirmed that the current regular-expression-based code > cannot support the following LIKE patterns. > End pattern > {code:java} > %abc\%def {code} > Start pattern > {code:java} > abc\%def% {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-26339) HIVE-26047 Related LIKE pattern issues
[ https://issues.apache.org/jira/browse/HIVE-26339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-26339: -- Affects Version/s: 4.0.0 > HIVE-26047 Related LIKE pattern issues > -- > > Key: HIVE-26339 > URL: https://issues.apache.org/jira/browse/HIVE-26339 > Project: Hive > Issue Type: Bug >Affects Versions: 4.0.0 >Reporter: Ryu Kobayashi >Assignee: Ryu Kobayashi >Priority: Major > Labels: pull-request-available > Time Spent: 1h 10m > Remaining Estimate: 0h > > Fixed https://issues.apache.org/jira/browse/HIVE-26047 without using regular > expressions. I also confirmed that the current regular-expression-based code > cannot support the following LIKE patterns. > End pattern > {code:java} > %abc\%def {code} > Start pattern > {code:java} > abc\%def% {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
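In a LIKE pattern, `\%` is an escaped literal percent sign, so the end pattern `%abc\%def` means "ends with abc%def". A non-regex check for such end patterns can be sketched as follows (illustrative only, not the actual Hive vectorized matcher):

```java
// Hypothetical sketch of a regex-free matcher for "end" LIKE patterns of the
// form %<suffix>, where \% inside the suffix is an escaped literal '%'.
class LikeSuffix {
    static boolean matchesEndPattern(String value, String pattern) {
        // assumes pattern starts with the % wildcard; unescape \% to literal %
        String suffix = pattern.substring(1).replace("\\%", "%");
        return value.endsWith(suffix);
    }
}
```

The "start" pattern `abc\%def%` is symmetric: unescape the prefix before the trailing wildcard and use startsWith instead.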
[jira] [Updated] (HIVE-28185) Hive 4.1.0 backlog
[ https://issues.apache.org/jira/browse/HIVE-28185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-28185: -- Description: https://issues.apache.org/jira/issues/?jql=project%20%3D%20HIVE%20AND%20(priority%20in%20(Critical%2C%20Blocker)%20or%20priority%20%3D%20Major%20AND%20cf%5B12310320%5D%20%3D%204.1.0%20)%20AND%20resolution%20%3D%20Unresolved%20AND%20(affectedVersion%20in%20(4.0.0-alpha-1%2C%204.0.0-alpha-2%2C%204.0.0-beta-1%2C%204.0.0%2C%204.1.0)%20%20or%20affectedVersion%20%3D%20EMPTY)%20and%20NOT%20(cf%5B12310320%5D%20in%20(3.0.0%2C%203.1.0%2C%203.2.0))%20and%20updated%20%3E%3D%20-104w%20ORDER%20BY%20created%20DESC > Hive 4.1.0 backlog > -- > > Key: HIVE-28185 > URL: https://issues.apache.org/jira/browse/HIVE-28185 > Project: Hive > Issue Type: Task >Reporter: Denys Kuzmenko >Priority: Major > > https://issues.apache.org/jira/issues/?jql=project%20%3D%20HIVE%20AND%20(priority%20in%20(Critical%2C%20Blocker)%20or%20priority%20%3D%20Major%20AND%20cf%5B12310320%5D%20%3D%204.1.0%20)%20AND%20resolution%20%3D%20Unresolved%20AND%20(affectedVersion%20in%20(4.0.0-alpha-1%2C%204.0.0-alpha-2%2C%204.0.0-beta-1%2C%204.0.0%2C%204.1.0)%20%20or%20affectedVersion%20%3D%20EMPTY)%20and%20NOT%20(cf%5B12310320%5D%20in%20(3.0.0%2C%203.1.0%2C%203.2.0))%20and%20updated%20%3E%3D%20-104w%20ORDER%20BY%20created%20DESC -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HIVE-28185) Hive 4.1.0 backlog
Denys Kuzmenko created HIVE-28185: - Summary: Hive 4.1.0 backlog Key: HIVE-28185 URL: https://issues.apache.org/jira/browse/HIVE-28185 Project: Hive Issue Type: Task Reporter: Denys Kuzmenko -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-19566) Vectorization: Fix NULL / Wrong Results issues in Complex Type Functions
[ https://issues.apache.org/jira/browse/HIVE-19566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-19566: -- Issue Type: Test (was: Bug) > Vectorization: Fix NULL / Wrong Results issues in Complex Type Functions > > > Key: HIVE-19566 > URL: https://issues.apache.org/jira/browse/HIVE-19566 > Project: Hive > Issue Type: Test >Reporter: Matt McCline >Priority: Major > > Write new UT tests that use random data and intentional isRepeating batches > to check for NULL and Wrong Results for vectorized Complex Type functions: > * index > * (StructField) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-19566) Vectorization: Fix NULL / Wrong Results issues in Complex Type Functions
[ https://issues.apache.org/jira/browse/HIVE-19566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-19566: -- Priority: Major (was: Critical) > Vectorization: Fix NULL / Wrong Results issues in Complex Type Functions > > > Key: HIVE-19566 > URL: https://issues.apache.org/jira/browse/HIVE-19566 > Project: Hive > Issue Type: Bug >Reporter: Matt McCline >Priority: Major > > Write new UT tests that use random data and intentional isRepeating batches > to check for NULL and Wrong Results for vectorized Complex Type functions: > * index > * (StructField) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HIVE-20282) HiveServer2 incorrect queue name when using Tez instead of MR
[ https://issues.apache.org/jira/browse/HIVE-20282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834516#comment-17834516 ] Denys Kuzmenko commented on HIVE-20282: --- hi [~steveyeom2017], would you mind raising a PR for that fix? > HiveServer2 incorrect queue name when using Tez instead of MR > - > > Key: HIVE-20282 > URL: https://issues.apache.org/jira/browse/HIVE-20282 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 4.0.0 >Reporter: Steve Yeom >Assignee: Steve Yeom >Priority: Critical > Attachments: HIVE-20282.01.patch > > > Ambari -> Tez view has > "Hive Queries" and "All DAGs" view pages. > The queue names from a query id and from its DAG id do not match in the Tez > engine context. > The one from the query is not correct. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-20282) HiveServer2 incorrect queue name when using Tez instead of MR
[ https://issues.apache.org/jira/browse/HIVE-20282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-20282: -- Target Version/s: 4.1.0 > HiveServer2 incorrect queue name when using Tez instead of MR > - > > Key: HIVE-20282 > URL: https://issues.apache.org/jira/browse/HIVE-20282 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 4.0.0 >Reporter: Steve Yeom >Assignee: Steve Yeom >Priority: Critical > Attachments: HIVE-20282.01.patch > > > Ambari -> Tez view has > "Hive Queries" and "All DAGs" view pages. > The queue names from a query id and from its DAG id do not match in the Tez > engine context. > The one from the query is not correct. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-23721) MetaStoreDirectSql.ensureDbInit() need to optimize QuerySQL
[ https://issues.apache.org/jira/browse/HIVE-23721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-23721: -- Priority: Major (was: Critical) > MetaStoreDirectSql.ensureDbInit() need to optimize QuerySQL > --- > > Key: HIVE-23721 > URL: https://issues.apache.org/jira/browse/HIVE-23721 > Project: Hive > Issue Type: Bug > Components: Standalone Metastore >Affects Versions: 3.1.2, 4.0.0 > Environment: Hadoop 3.1 (1700+ nodes) > YARN 3.1 (with timelineserver enabled, https enabled) > Hive 3.1 (15 HS2 instances) > 6+ YARN Applications every day >Reporter: YulongZ >Assignee: Butao Zhang >Priority: Major > Labels: pull-request-available > Attachments: HIVE-23721.01.patch > > Time Spent: 40m > Remaining Estimate: 0h > > Since Hive 3.0, catalogs were added to the Hive metastore: many metastore > schemas added a "catName" column, and the table index added "catName" as well. > In MetaStoreDirectSql.ensureDbInit(), the two queries below > " > initQueries.add(pm.newQuery(MTableColumnStatistics.class, "dbName == ''")); > initQueries.add(pm.newQuery(MPartitionColumnStatistics.class, "dbName == ''")); > " > should use "catName == ''" instead of "dbName == ''", because "catName" is the > first index column. > When the metastore data becomes large, for example when the > MPartitionColumnStatistics table has millions of rows, the > newQuery(MPartitionColumnStatistics.class, "dbName == ''") call against the > metastore executes very slowly, and the "show tables" query in HiveServer2 > executes very slowly too. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-23721) MetaStoreDirectSql.ensureDbInit() need to optimize QuerySQL
[ https://issues.apache.org/jira/browse/HIVE-23721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-23721: -- Target Version/s: 4.1.0 > MetaStoreDirectSql.ensureDbInit() need to optimize QuerySQL > --- > > Key: HIVE-23721 > URL: https://issues.apache.org/jira/browse/HIVE-23721 > Project: Hive > Issue Type: Bug > Components: Standalone Metastore >Affects Versions: 3.1.2, 4.0.0 > Environment: Hadoop 3.1 (1700+ nodes) > YARN 3.1 (with timelineserver enabled, https enabled) > Hive 3.1 (15 HS2 instances) > 6+ YARN Applications every day >Reporter: YulongZ >Assignee: Butao Zhang >Priority: Critical > Labels: pull-request-available > Attachments: HIVE-23721.01.patch > > Time Spent: 40m > Remaining Estimate: 0h > > Since Hive 3.0, catalogs were added to the Hive metastore: many metastore > schemas added a "catName" column, and the table index added "catName" as well. > In MetaStoreDirectSql.ensureDbInit(), the two queries below > " > initQueries.add(pm.newQuery(MTableColumnStatistics.class, "dbName == ''")); > initQueries.add(pm.newQuery(MPartitionColumnStatistics.class, "dbName == ''")); > " > should use "catName == ''" instead of "dbName == ''", because "catName" is the > first index column. > When the metastore data becomes large, for example when the > MPartitionColumnStatistics table has millions of rows, the > newQuery(MPartitionColumnStatistics.class, "dbName == ''") call against the > metastore executes very slowly, and the "show tables" query in HiveServer2 > executes very slowly too. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-23721) MetaStoreDirectSql.ensureDbInit() need to optimize QuerySQL
[ https://issues.apache.org/jira/browse/HIVE-23721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-23721: -- Component/s: Standalone Metastore > MetaStoreDirectSql.ensureDbInit() need to optimize QuerySQL > --- > > Key: HIVE-23721 > URL: https://issues.apache.org/jira/browse/HIVE-23721 > Project: Hive > Issue Type: Bug > Components: Standalone Metastore >Affects Versions: 3.1.2, 4.0.0 > Environment: Hadoop 3.1 (1700+ nodes) > YARN 3.1 (with timelineserver enabled, https enabled) > Hive 3.1 (15 HS2 instances) > 6+ YARN Applications every day >Reporter: YulongZ >Assignee: Butao Zhang >Priority: Critical > Labels: pull-request-available > Attachments: HIVE-23721.01.patch > > Time Spent: 40m > Remaining Estimate: 0h > > Since Hive 3.0, catalogs were added to the Hive metastore: many metastore > schemas added a "catName" column, and the table index added "catName" as well. > In MetaStoreDirectSql.ensureDbInit(), the two queries below > " > initQueries.add(pm.newQuery(MTableColumnStatistics.class, "dbName == ''")); > initQueries.add(pm.newQuery(MPartitionColumnStatistics.class, "dbName == ''")); > " > should use "catName == ''" instead of "dbName == ''", because "catName" is the > first index column. > When the metastore data becomes large, for example when the > MPartitionColumnStatistics table has millions of rows, the > newQuery(MPartitionColumnStatistics.class, "dbName == ''") call against the > metastore executes very slowly, and the "show tables" query in HiveServer2 > executes very slowly too. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-27274) Insecure cluster does not need to set delegation token when building HMS client
[ https://issues.apache.org/jira/browse/HIVE-27274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-27274: -- Priority: Major (was: Critical) > Insecure cluster does not need to set delegation token when building HMS client > - > > Key: HIVE-27274 > URL: https://issues.apache.org/jira/browse/HIVE-27274 > Project: Hive > Issue Type: Bug > Components: Metastore >Affects Versions: 3.1.0, 4.0.0-alpha-2 >Reporter: zhaolong >Priority: Major > Attachments: image-2023-04-20-15-02-21-917.png > > > In an insecure cluster, if HADOOP_PROXY_USER is set, fetching the delegation > token fails and HMSClient initialization fails. The delegation token is only > used when SASL is enabled. > !image-2023-04-20-15-02-21-917.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-26877) Parquet CTAS with JOIN on decimals with different precision/scale fail
[ https://issues.apache.org/jira/browse/HIVE-26877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-26877: -- Labels: hive-4.1.0-must (was: ) > Parquet CTAS with JOIN on decimals with different precision/scale fail > -- > > Key: HIVE-26877 > URL: https://issues.apache.org/jira/browse/HIVE-26877 > Project: Hive > Issue Type: Bug > Components: HiveServer2 >Affects Versions: 3.1.0, 4.0.0-alpha-2 >Reporter: Stamatis Zampetakis >Assignee: Stamatis Zampetakis >Priority: Critical > Labels: hive-4.1.0-must > Attachments: ctas_parquet_join.q > > > Creating a Parquet table using CREATE TABLE AS SELECT syntax (CTAS) leads to > runtime error when the SELECT statement joins columns with different > precision/scale. > Steps to reproduce: > {code:sql} > CREATE TABLE table_a (col_dec decimal(5,0)); > CREATE TABLE table_b(col_dec decimal(38,10)); > INSERT INTO table_a VALUES (1); > INSERT INTO table_b VALUES (1.00); > set hive.default.fileformat=parquet; > create table target as > select table_a.col_dec > from table_a > left outer join table_b on > table_a.col_dec = table_b.col_dec; > {code} > Stacktrace: > {noformat} > 2022-12-20T07:02:52,237 INFO [2dfbd95a-7553-467b-b9d0-629100785502 Listener > at 0.0.0.0/46609] reexec.ReExecuteLostAMQueryPlugin: Got exception message: > Vertex failed, vertexName=Reducer 2, vertexId=vertex_1671548565336_0001_3_02, > diagnostics=[Task failed, taskId=task_1671548565336_0001_3_02_00, > diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( > failure ) : > attempt_1671548565336_0001_3_02_00_0:java.lang.RuntimeException: > java.lang.RuntimeException: Hive Runtime Error while closing operators: Fixed > Binary size 16 does not match field type length 3 > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:348) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:276) > at > 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:381) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:82) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:69) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:69) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:39) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at > org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.RuntimeException: Hive Runtime Error while closing > operators: Fixed Binary size 16 does not match field type length 3 > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:379) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:310) > ... 
15 more > Caused by: java.lang.IllegalArgumentException: Fixed Binary size 16 does not > match field type length 3 > at > org.apache.parquet.column.values.plain.FixedLenByteArrayPlainValuesWriter.writeBytes(FixedLenByteArrayPlainValuesWriter.java:56) > at > org.apache.parquet.column.impl.ColumnWriterBase.write(ColumnWriterBase.java:174) > at > org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addBinary(MessageColumnIO.java:476) > at > org.apache.parquet.io.RecordConsumerLoggingWrapper.addBinary(RecordConsumerLoggingWrapper.java:116) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter$DecimalDataWriter.write(DataWritableWriter.java:571) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter$GroupDataWriter.write(DataWritableWriter.java:228) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter$MessageDataWriter.write(DataWritableWriter.java:251) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:115) > at >
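The "Fixed Binary size 16 does not match field type length 3" error above is consistent with how Parquet stores decimals: as FIXED_LEN_BYTE_ARRAY values whose byte length is derived from the declared precision, so decimal(5,0) and decimal(38,10) columns are sized at 3 and 16 bytes respectively. A sketch of that width calculation (illustrative, not Hive or Parquet code):

```java
// Minimum bytes needed for a signed decimal of a given precision when stored
// as a Parquet FIXED_LEN_BYTE_ARRAY: smallest n with 10^precision <= 2^(8n-1).
class ParquetDecimalWidth {
    static int bytesForPrecision(int precision) {
        // precision digits need precision*log2(10) bits, plus one sign bit
        double bits = precision * (Math.log(10) / Math.log(2)) + 1;
        return (int) Math.ceil(bits / 8);
    }
}
```

So a writer set up for the decimal(38,10) side (16 bytes) receiving values typed as decimal(5,0) (3 bytes), or vice versa, produces exactly the mismatch in the stack trace.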
[jira] [Updated] (HIVE-26877) Parquet CTAS with JOIN on decimals with different precision/scale fail
[ https://issues.apache.org/jira/browse/HIVE-26877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-26877: -- Target Version/s: 4.1.0 > Parquet CTAS with JOIN on decimals with different precision/scale fail > -- > > Key: HIVE-26877 > URL: https://issues.apache.org/jira/browse/HIVE-26877 > Project: Hive > Issue Type: Bug > Components: HiveServer2 >Affects Versions: 3.1.0, 4.0.0-alpha-2 >Reporter: Stamatis Zampetakis >Assignee: Stamatis Zampetakis >Priority: Critical > Attachments: ctas_parquet_join.q > > > Creating a Parquet table using CREATE TABLE AS SELECT syntax (CTAS) leads to > runtime error when the SELECT statement joins columns with different > precision/scale. > Steps to reproduce: > {code:sql} > CREATE TABLE table_a (col_dec decimal(5,0)); > CREATE TABLE table_b(col_dec decimal(38,10)); > INSERT INTO table_a VALUES (1); > INSERT INTO table_b VALUES (1.00); > set hive.default.fileformat=parquet; > create table target as > select table_a.col_dec > from table_a > left outer join table_b on > table_a.col_dec = table_b.col_dec; > {code} > Stacktrace: > {noformat} > 2022-12-20T07:02:52,237 INFO [2dfbd95a-7553-467b-b9d0-629100785502 Listener > at 0.0.0.0/46609] reexec.ReExecuteLostAMQueryPlugin: Got exception message: > Vertex failed, vertexName=Reducer 2, vertexId=vertex_1671548565336_0001_3_02, > diagnostics=[Task failed, taskId=task_1671548565336_0001_3_02_00, > diagnostics=[TaskAttempt 0 failed, info=[Error: Error while running task ( > failure ) : > attempt_1671548565336_0001_3_02_00_0:java.lang.RuntimeException: > java.lang.RuntimeException: Hive Runtime Error while closing operators: Fixed > Binary size 16 does not match field type length 3 > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:348) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:276) > at > 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:381) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:82) > at > org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:69) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:69) > at > org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:39) > at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) > at > org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.RuntimeException: Hive Runtime Error while closing > operators: Fixed Binary size 16 does not match field type length 3 > at > org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.close(ReduceRecordProcessor.java:379) > at > org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:310) > ... 
15 more > Caused by: java.lang.IllegalArgumentException: Fixed Binary size 16 does not > match field type length 3 > at > org.apache.parquet.column.values.plain.FixedLenByteArrayPlainValuesWriter.writeBytes(FixedLenByteArrayPlainValuesWriter.java:56) > at > org.apache.parquet.column.impl.ColumnWriterBase.write(ColumnWriterBase.java:174) > at > org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.addBinary(MessageColumnIO.java:476) > at > org.apache.parquet.io.RecordConsumerLoggingWrapper.addBinary(RecordConsumerLoggingWrapper.java:116) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter$DecimalDataWriter.write(DataWritableWriter.java:571) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter$GroupDataWriter.write(DataWritableWriter.java:228) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter$MessageDataWriter.write(DataWritableWriter.java:251) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:115) > at >
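The two lengths in the error message follow directly from how Parquet sizes decimals: a decimal of precision p is stored as a FIXED_LEN_BYTE_ARRAY just wide enough to hold any p-digit unscaled value in two's complement. A quick sketch of that sizing rule (derived from the Parquet format spec, not from Hive's writer code):

```python
import math

def parquet_decimal_byte_width(precision: int) -> int:
    # Smallest n with 10**precision - 1 <= 2**(8*n - 1) - 1, i.e. enough
    # bytes for any p-digit signed unscaled value in two's complement.
    return math.ceil((precision * math.log2(10) + 1) / 8)

print(parquet_decimal_byte_width(5))   # 3  -> decimal(5,0), the declared field length
print(parquet_decimal_byte_width(38))  # 16 -> decimal(38,10), the value actually written
```

This matches the stack trace: the writer declared a 3-byte field for table_a's decimal(5,0) column, but after the join the value arrived as a 16-byte decimal(38,10) binary, hence "Fixed Binary size 16 does not match field type length 3".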
[jira] [Updated] (HIVE-26505) Case When Some result data is lost when there are common column conditions and partitioned column conditions
[ https://issues.apache.org/jira/browse/HIVE-26505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-26505: -- Target Version/s: 4.1.0 > Case When Some result data is lost when there are common column conditions > and partitioned column conditions > - > > Key: HIVE-26505 > URL: https://issues.apache.org/jira/browse/HIVE-26505 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 3.1.0, 4.0.0 >Reporter: GuangMing Lu >Assignee: Krisztian Kasa >Priority: Critical > Labels: check, pull-request-available > > {code:java}https://issues.apache.org/jira/browse/HIVE-26505# > create table test0831 (id string) partitioned by (cp string); > insert into test0831 values ('a', '2022-08-23'),('c', '2022-08-23'),('d', > '2022-08-23'); > insert into test0831 values ('a', '2022-08-24'),('b', '2022-08-24'); > select * from test0831; > +-+--+ > | test0831.id | test0831.cp | > +-+--+ > | a | 2022-08-23 | > | b | 2022-08-23 | > | a | 2022-08-23 | > | c | 2022-08-24 | > | d | 2022-08-24 | > +-+--+ > select * from test0831 where (case when id='a' and cp='2022-08-23' then 1 > else 0 end)=0; > +--+--+ > | test0830.id | test0830.cp | > +--+--+ > | a | 2022-08-24 | > | b | 2022-08-24 | > +--+--+ > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-26505) Case When Some result data is lost when there are common column conditions and partitioned column conditions
[ https://issues.apache.org/jira/browse/HIVE-26505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-26505: -- Labels: check hive-4.1.0-must pull-request-available (was: check pull-request-available) > Case When Some result data is lost when there are common column conditions > and partitioned column conditions > - > > Key: HIVE-26505 > URL: https://issues.apache.org/jira/browse/HIVE-26505 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 3.1.0, 4.0.0 >Reporter: GuangMing Lu >Assignee: Krisztian Kasa >Priority: Critical > Labels: check, hive-4.1.0-must, pull-request-available > > {code:java}https://issues.apache.org/jira/browse/HIVE-26505# > create table test0831 (id string) partitioned by (cp string); > insert into test0831 values ('a', '2022-08-23'),('c', '2022-08-23'),('d', > '2022-08-23'); > insert into test0831 values ('a', '2022-08-24'),('b', '2022-08-24'); > select * from test0831; > +-+--+ > | test0831.id | test0831.cp | > +-+--+ > | a | 2022-08-23 | > | b | 2022-08-23 | > | a | 2022-08-23 | > | c | 2022-08-24 | > | d | 2022-08-24 | > +-+--+ > select * from test0831 where (case when id='a' and cp='2022-08-23' then 1 > else 0 end)=0; > +--+--+ > | test0830.id | test0830.cp | > +--+--+ > | a | 2022-08-24 | > | b | 2022-08-24 | > +--+--+ > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010)
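Evaluating the CASE WHEN predicate by hand over the rows from the two INSERT statements shows what the query should return (a plain Python sketch, with the filter condition transcribed from the report):

```python
# (id, cp) rows as inserted by the two INSERT statements above
rows = [('a', '2022-08-23'), ('c', '2022-08-23'), ('d', '2022-08-23'),
        ('a', '2022-08-24'), ('b', '2022-08-24')]

def case_when(id_, cp):
    # case when id='a' and cp='2022-08-23' then 1 else 0 end
    return 1 if id_ == 'a' and cp == '2022-08-23' else 0

survivors = [r for r in rows if case_when(*r) == 0]
print(survivors)  # four rows: only ('a', '2022-08-23') is filtered out
```

Hive instead returned only the two '2022-08-24' rows, i.e. the whole '2022-08-23' partition was dropped, which is consistent with the partition-column condition inside the CASE WHEN being (incorrectly) treated as a prunable partition filter.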
[jira] [Commented] (HIVE-25351) stddev(), stddev_pop() with CBO enable returning null
[ https://issues.apache.org/jira/browse/HIVE-25351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834515#comment-17834515 ] Denys Kuzmenko commented on HIVE-25351: --- hi [~Dayakar], does it affect Hive-4.0 release? > stddev(), stddev_pop() with CBO enable returning null > - > > Key: HIVE-25351 > URL: https://issues.apache.org/jira/browse/HIVE-25351 > Project: Hive > Issue Type: Bug >Reporter: Ashish Sharma >Assignee: Dayakar M >Priority: Blocker > Labels: pull-request-available > > *script used to repro* > create table cbo_test (key string, v1 double, v2 decimal(30,2), v3 > decimal(30,2)); > insert into cbo_test values ("00140006375905", 10230.72, > 10230.72, 10230.69), ("00140006375905", 10230.72, 10230.72, > 10230.69), ("00140006375905", 10230.72, 10230.72, 10230.69), > ("00140006375905", 10230.72, 10230.72, 10230.69), > ("00140006375905", 10230.72, 10230.72, 10230.69), > ("00140006375905", 10230.72, 10230.72, 10230.69); > select stddev(v1), stddev(v2), stddev(v3) from cbo_test; > *Enable CBO* > ++ > | Explain | > ++ > | Plan optimized by CBO. 
| > || > | Vertex dependency in root stage| > | Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)| > || > | Stage-0| > | Fetch Operator | > | limit:-1 | > | Stage-1| > | Reducer 2 vectorized | > | File Output Operator [FS_13] | > | Select Operator [SEL_12] (rows=1 width=24) | > | Output:["_col0","_col1","_col2"] | > | Group By Operator [GBY_11] (rows=1 width=72) | > | > Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8"],aggregations:["sum(VALUE._col0)","sum(VALUE._col1)","count(VALUE._col2)","sum(VALUE._col3)","sum(VALUE._col4)","count(VALUE._col5)","sum(VALUE._col6)","sum(VALUE._col7)","count(VALUE._col8)"] > | > | <-Map 1 [CUSTOM_SIMPLE_EDGE] vectorized | > | PARTITION_ONLY_SHUFFLE [RS_10] | > | Group By Operator [GBY_9] (rows=1 width=72) | > | > Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7","_col8"],aggregations:["sum(_col3)","sum(_col0)","count(_col0)","sum(_col5)","sum(_col4)","count(_col1)","sum(_col7)","sum(_col6)","count(_col2)"] > | > | Select Operator [SEL_8] (rows=6 width=232) | > | > Output:["_col0","_col1","_col2","_col3","_col4","_col5","_col6","_col7"] | > | TableScan [TS_0] (rows=6 width=232) | > | default@cbo_test,cbo_test, ACID > table,Tbl:COMPLETE,Col:COMPLETE,Output:["v1","v2","v3"] | > || > ++ > *Query Result* > _c0 _c1 _c2 > 0.0 NaN NaN > *Disable CBO* > ++ > | Explain | > ++ > | Vertex dependency in root stage| > | Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)| > || > | Stage-0| > | Fetch Operator | > | limit:-1 | > | Stage-1| > | Reducer 2 vectorized | > | File Output Operator [FS_11] | > | Group By Operator [GBY_10] (rows=1 width=24) | > | > Output:["_col0","_col1","_col2"],aggregations:["stddev(VALUE._col0)","stddev(VALUE._col1)","stddev(VALUE._col2)"] > | > | <-Map 1 [CUSTOM_SIMPLE_EDGE] vectorized| > | PARTITION_ONLY_SHUFFLE [RS_9]| > | Group By Operator [GBY_8] (rows=1 width=240) | > | > Output:["_col0","_col1","_col2"],aggregations:["stddev(v1)","stddev(v2)","stddev(v3)"] > | > | Select Operator 
[SEL_7] (rows=6 width=232) | > | Output:["v1","v2","v3"]| > | TableScan [TS_0] (rows=6 width=232) | > | default@cbo_test,cbo_test, ACID >
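The two plans explain the NaN: with CBO enabled, stddev is rewritten into sum / sum-of-squares / count partials (the aggregations visible in GBY_9), while without CBO the stddev aggregator runs directly. The one-pass sum-of-squares formula cancels catastrophically on large, nearly equal values like those in the repro and can come out slightly negative, so the final sqrt yields NaN. A small illustration of the effect (different data, and not Hive's actual aggregator code):

```python
import math

# Large, nearly identical values; true population variance is 22.5.
xs = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
n = len(xs)

# One-pass shape used by the CBO rewrite: sum(x), sum(x*x), count.
sum_x = sum(xs)
sum_sq = sum(x * x for x in xs)
naive_var = sum_sq / n - (sum_x / n) ** 2      # cancels catastrophically

# Two-pass (stable) computation for comparison.
mean = sum_x / n
stable_var = sum((x - mean) ** 2 for x in xs) / n

print(naive_var, stable_var)                    # naive_var comes out negative
stddev = math.sqrt(naive_var) if naive_var >= 0 else float('nan')
print(stddev)                                   # nan, as in the CBO query result
```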
[jira] [Updated] (HIVE-23586) load data overwrite into bucket table failed
[ https://issues.apache.org/jira/browse/HIVE-23586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-23586: -- Target Version/s: 4.1.0 > load data overwrite into bucket table failed > > > Key: HIVE-23586 > URL: https://issues.apache.org/jira/browse/HIVE-23586 > Project: Hive > Issue Type: Bug > Components: HiveServer2 >Affects Versions: 3.1.0, 3.1.2, 4.0.0 >Reporter: zhaolong >Assignee: zhaolong >Priority: Critical > Labels: pull-request-available > Attachments: HIVE-23586.01.patch, image-2020-06-01-21-40-21-726.png, > image-2020-06-01-21-41-28-732.png > > Time Spent: 40m > Remaining Estimate: 0h > > load data overwrite into bucket table is failed if filename is not like > 00_0, but insert new data in the table. > > for example: > CREATE EXTERNAL TABLE IF NOT EXISTS test_hive2 (name string,account string) > PARTITIONED BY (logdate string) CLUSTERED BY (account) INTO 4 BUCKETS row > format delimited fields terminated by '|' STORED AS textfile; > load data inpath 'hdfs://hacluster/tmp/zltest' overwrite into table > default.test_hive2 partition (logdate='20200508'); > !image-2020-06-01-21-40-21-726.png! > load data inpath 'hdfs://hacluster/tmp/zltest' overwrite into table > default.test_hive2 partition (logdate='20200508');// should overwrite but > insert new data > !image-2020-06-01-21-41-28-732.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-23586) load data overwrite into bucket table failed
[ https://issues.apache.org/jira/browse/HIVE-23586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-23586: -- Component/s: HiveServer2 > load data overwrite into bucket table failed > > > Key: HIVE-23586 > URL: https://issues.apache.org/jira/browse/HIVE-23586 > Project: Hive > Issue Type: Bug > Components: HiveServer2 >Affects Versions: 3.1.0, 3.1.2, 4.0.0 >Reporter: zhaolong >Assignee: zhaolong >Priority: Critical > Labels: pull-request-available > Attachments: HIVE-23586.01.patch, image-2020-06-01-21-40-21-726.png, > image-2020-06-01-21-41-28-732.png > > Time Spent: 40m > Remaining Estimate: 0h > > load data overwrite into bucket table is failed if filename is not like > 00_0, but insert new data in the table. > > for example: > CREATE EXTERNAL TABLE IF NOT EXISTS test_hive2 (name string,account string) > PARTITIONED BY (logdate string) CLUSTERED BY (account) INTO 4 BUCKETS row > format delimited fields terminated by '|' STORED AS textfile; > load data inpath 'hdfs://hacluster/tmp/zltest' overwrite into table > default.test_hive2 partition (logdate='20200508'); > !image-2020-06-01-21-40-21-726.png! > load data inpath 'hdfs://hacluster/tmp/zltest' overwrite into table > default.test_hive2 partition (logdate='20200508');// should overwrite but > insert new data > !image-2020-06-01-21-41-28-732.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
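A plausible model of the failure (an illustration of the symptom, not Hive's actual load path): if "overwrite" only replaces destination files whose names collide with the incoming ones, then loading files that are not named like 000000_0 removes nothing, and the old rows survive next to the new ones:

```python
# Toy partition directory: file name -> rows.
existing = {'000000_0': ['old row']}

def overwrite_by_name(partition, incoming):
    # Buggy shape: "overwrite" only replaces colliding file names instead
    # of clearing the directory first, so differently named files survive.
    out = dict(partition)
    out.update(incoming)
    return out

def overwrite_clearing(partition, incoming):
    # Correct shape: an overwrite load empties the target first.
    return dict(incoming)

# The loaded file keeps its source name ('zltest.txt' is hypothetical),
# which does not look like a bucket file name such as 000000_0.
buggy = overwrite_by_name(existing, {'zltest.txt': ['new row']})
fixed = overwrite_clearing(existing, {'zltest.txt': ['new row']})
print(sorted(buggy))  # ['000000_0', 'zltest.txt'] -> old data still there
print(sorted(fixed))  # ['zltest.txt']
```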
[jira] [Updated] (HIVE-13157) MetaStoreEventListener.onAlter triggered for INSERT and SELECT
[ https://issues.apache.org/jira/browse/HIVE-13157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-13157: -- Priority: Major (was: Critical) > MetaStoreEventListener.onAlter triggered for INSERT and SELECT > -- > > Key: HIVE-13157 > URL: https://issues.apache.org/jira/browse/HIVE-13157 > Project: Hive > Issue Type: Bug > Components: Metastore >Affects Versions: 1.2.1, 3.1.3, 4.0.0 >Reporter: Eugen Stoianovici >Priority: Major > Labels: obsolete? > > The event onAlter from > org.apache.hadoop.hive.metastore.MetaStoreEventListener is triggered when > INSERT or SELECT statements are executed on the target table. > Furthermore, the value of transient_lastDdl is updated in table properties > for INSERT statements. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-13484) LLAPTaskScheduler should handle situations where the node may be blacklisted by Tez
[ https://issues.apache.org/jira/browse/HIVE-13484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-13484: -- Priority: Major (was: Critical) > LLAPTaskScheduler should handle situations where the node may be blacklisted > by Tez > --- > > Key: HIVE-13484 > URL: https://issues.apache.org/jira/browse/HIVE-13484 > Project: Hive > Issue Type: Bug > Components: llap >Reporter: Siddharth Seth >Assignee: Siddharth Seth >Priority: Major > Labels: obsolete? > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-13484) LLAPTaskScheduler should handle situations where the node may be blacklisted by Tez
[ https://issues.apache.org/jira/browse/HIVE-13484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-13484: -- Labels: obsolete? (was: ) > LLAPTaskScheduler should handle situations where the node may be blacklisted > by Tez > --- > > Key: HIVE-13484 > URL: https://issues.apache.org/jira/browse/HIVE-13484 > Project: Hive > Issue Type: Bug > Components: llap >Reporter: Siddharth Seth >Assignee: Siddharth Seth >Priority: Critical > Labels: obsolete? > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HIVE-11117) Hive external table - skip header and trailer property issue
[ https://issues.apache.org/jira/browse/HIVE-11117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834513#comment-17834513 ] Denys Kuzmenko commented on HIVE-11117: --- MR engine is deprecated in Hive-4.0. [~jana_chander], please set the relevant affected versions. > Hive external table - skip header and trailer property issue > > > Key: HIVE-11117 > URL: https://issues.apache.org/jira/browse/HIVE-11117 > Project: Hive > Issue Type: Bug > Environment: Production >Reporter: Janarthanan >Priority: Critical > Labels: check, mapreduce > > I am using an external hive table pointing to an HDFS location. The external > table is partitioned on year/mm/dd folders. When there is more than one > partition folder (ex: /2015/01/02/file.txt & /2015/01/03/file2.txt), the > select on the external table skips a DATA RECORD instead of skipping the > header/trailer record from one of the files. > tblproperties ("skip.header.line.count"="1"); > Resolution: enabling hive.input format instead of text input format and > executing with the TEZ engine instead of MapReduce resolved the issue. > How can the problem be resolved without setting these parameters? I don't want to > run the hive query using TEZ. -- This message was sent by Atlassian Jira (v8.20.10#820010)
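A hedged sketch of the suspected mechanism (an illustration of the symptom, not Hive's TextInputFormat code): skip.header.line.count is only safe when applied once per file; a reader that spans several files, or that starts mid-file, and applies the skip per reader gets it wrong:

```python
def read_per_file(files, skip=1):
    # Correct: drop `skip` header lines from each file independently.
    out = []
    for lines in files:
        out.extend(lines[skip:])
    return out

def read_per_reader(files, skip=1):
    # Buggy shape: one reader consumes several files back to back and
    # applies the skip only once, so a later file's header leaks in as
    # data -- and, symmetrically, a skip applied to a reader that starts
    # mid-file throws away a real DATA record instead of a header.
    flat = [line for lines in files for line in lines]
    return flat[skip:]

f1 = ['header', 'row 2015-01-02']
f2 = ['header', 'row 2015-01-03']
print(read_per_file([f1, f2]))    # both data rows, no headers
print(read_per_reader([f1, f2]))  # file boundary mishandled
```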
[jira] [Commented] (HIVE-27744) privileges check is skipped when using partly dynamic partition write.
[ https://issues.apache.org/jira/browse/HIVE-27744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834512#comment-17834512 ] Denys Kuzmenko commented on HIVE-27744: --- hi [~shuaiqi.guo], is that relevant to Hive-4.0? > privileges check is skipped when using partly dynamic partition write. > -- > > Key: HIVE-27744 > URL: https://issues.apache.org/jira/browse/HIVE-27744 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: All Versions >Reporter: shuaiqi.guo >Assignee: shuaiqi.guo >Priority: Blocker > Fix For: 2.3.5 > > Attachments: HIVE-27744.patch > > > the privileges check will be skipped when using dynamic partition write with > part of the partition specified, just like the following example: > {code:java} > insert overwrite table test_privilege partition (`date` = '2023-09-27', hour) > ... {code} > hive will execute it directly without checking write privileges. > > use the following patch to fix this bug. -- This message was sent by Atlassian Jira (v8.20.10#820010)
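A minimal sketch of the reported gap (illustrative only; the function name and shapes are hypothetical, not Hive's authorizer API): if the set of objects to authorize is derived from the static part of the partition spec, a partly dynamic spec can produce an empty set, so the check never fires. The usual fix shape is to fall back to a table-level write check whenever any partition column is dynamic.

```python
def partitions_to_authorize(static_spec, partition_cols):
    # Buggy shape: only a fully static partition spec yields a concrete
    # partition object to authorize; a partly dynamic spec yields nothing,
    # so the write-privilege check is silently skipped.
    if len(static_spec) == len(partition_cols):
        return [static_spec]
    return []

# insert overwrite table test_privilege partition (`date`='2023-09-27', hour)
# -> `hour` is dynamic, so nothing gets checked:
print(partitions_to_authorize({'date': '2023-09-27'}, ['date', 'hour']))  # []
print(partitions_to_authorize({'date': '2023-09-27', 'hour': '12'}, ['date', 'hour']))
```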
[jira] [Updated] (HIVE-27944) When HIVE-LLAP reads the ICEBERG table, a deadlock may occur.
[ https://issues.apache.org/jira/browse/HIVE-27944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-27944: -- Target Version/s: 4.1.0 > When HIVE-LLAP reads the ICEBERG table, a deadlock may occur. > - > > Key: HIVE-27944 > URL: https://issues.apache.org/jira/browse/HIVE-27944 > Project: Hive > Issue Type: Bug >Affects Versions: 3.1.3, 4.0.0, 4.0.0-beta-1 >Reporter: yongzhi.shao >Assignee: yongzhi.shao >Priority: Critical > Labels: hive-4.1.0-must, pull-request-available > Attachments: image-2023-12-08-14-17-53-822.png, > image-2023-12-08-14-22-18-998.png, image-2023-12-10-16-24-34-351.png > > > We found that org.apache.hadoop.hive.ql.plan.PartitionDesc.equals() may > deadlock in a multithreaded environment. > Here's the deadlock information we've gathered: > {code:java} > "DAG44-Input-4-16" Id=161 BLOCKED on > org.apache.hadoop.hive.common.CopyOnFirstWriteProperties@44196d35 owned by > "DAG44-Input-4-15" Id=160 > at > org.apache.hadoop.hive.common.CopyOnFirstWriteProperties.size(CopyOnFirstWriteProperties.java:315) > - blocked on > org.apache.hadoop.hive.common.CopyOnFirstWriteProperties@44196d35 > at java.util.Hashtable.equals(Hashtable.java:801) > - locked java.util.Properties@77a541be < but blocks 3 other threads! 
> at > org.apache.hadoop.hive.common.CopyOnFirstWriteProperties.equals(CopyOnFirstWriteProperties.java:213) > - locked > org.apache.hadoop.hive.common.CopyOnFirstWriteProperties@2d973aa3 > at > org.apache.hadoop.hive.ql.plan.PartitionDesc.equals(PartitionDesc.java:327) > at java.util.AbstractMap.equals(AbstractMap.java:495) > at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:940) > at > org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getFromPathRecursively(HiveFileFormatUtils.java:374) > at > org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getFromPathRecursively(HiveFileFormatUtils.java:359) > at > org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getFromPathRecursively(HiveFileFormatUtils.java:354) > at > org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.schemaEvolved(SplitGrouper.java:278) > at > org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:183) > at > org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:160) > at > org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:287) > {code} > Since Properties extends Hashtable, all of > Hashtable's methods are synchronized. > In a multi-threaded environment, a deadlock will occur when > propA.equals(propB) and propB.equals(propA) run at the same time. > > I have a fix idea for this: when we call CopyOnFirstWriteProperties.equals(), > we can make a copy of the object within this method and compare against the copied > object. If there are no problems with this solution, I will submit a PR. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-27944) When HIVE-LLAP reads the ICEBERG table, a deadlock may occur.
[ https://issues.apache.org/jira/browse/HIVE-27944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-27944: -- Labels: hive-4.1.0-must pull-request-available (was: pull-request-available) > When HIVE-LLAP reads the ICEBERG table, a deadlock may occur. > - > > Key: HIVE-27944 > URL: https://issues.apache.org/jira/browse/HIVE-27944 > Project: Hive > Issue Type: Bug >Affects Versions: 3.1.3, 4.0.0, 4.0.0-beta-1 >Reporter: yongzhi.shao >Assignee: yongzhi.shao >Priority: Critical > Labels: hive-4.1.0-must, pull-request-available > Attachments: image-2023-12-08-14-17-53-822.png, > image-2023-12-08-14-22-18-998.png, image-2023-12-10-16-24-34-351.png > > > We found that org.apache.hadoop.hive.ql.plan.PartitionDesc.equals() may > deadlock in a multithreaded environment. > Here's the deadlock information we've gathered: > {code:java} > "DAG44-Input-4-16" Id=161 BLOCKED on > org.apache.hadoop.hive.common.CopyOnFirstWriteProperties@44196d35 owned by > "DAG44-Input-4-15" Id=160 > at > org.apache.hadoop.hive.common.CopyOnFirstWriteProperties.size(CopyOnFirstWriteProperties.java:315) > - blocked on > org.apache.hadoop.hive.common.CopyOnFirstWriteProperties@44196d35 > at java.util.Hashtable.equals(Hashtable.java:801) > - locked java.util.Properties@77a541be < but blocks 3 other threads! 
> at > org.apache.hadoop.hive.common.CopyOnFirstWriteProperties.equals(CopyOnFirstWriteProperties.java:213) > - locked > org.apache.hadoop.hive.common.CopyOnFirstWriteProperties@2d973aa3 > at > org.apache.hadoop.hive.ql.plan.PartitionDesc.equals(PartitionDesc.java:327) > at java.util.AbstractMap.equals(AbstractMap.java:495) > at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:940) > at > org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getFromPathRecursively(HiveFileFormatUtils.java:374) > at > org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getFromPathRecursively(HiveFileFormatUtils.java:359) > at > org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getFromPathRecursively(HiveFileFormatUtils.java:354) > at > org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.schemaEvolved(SplitGrouper.java:278) > at > org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:183) > at > org.apache.hadoop.hive.ql.exec.tez.SplitGrouper.generateGroupedSplits(SplitGrouper.java:160) > at > org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:287) > {code} > Since Properties extends Hashtable, all of > Hashtable's methods are synchronized. > In a multi-threaded environment, a deadlock will occur when > propA.equals(propB) and propB.equals(propA) run at the same time. > > I have a fix idea for this: when we call CopyOnFirstWriteProperties.equals(), > we can make a copy of the object within this method and compare against the copied > object. If there are no problems with this solution, I will submit a PR. -- This message was sent by Atlassian Jira (v8.20.10#820010)
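java.util.Properties extends Hashtable, whose methods are synchronized on the receiver, so propA.equals(propB) racing with propB.equals(propA) acquires the two monitors in opposite orders: the classic lock-ordering deadlock. The fix idea from the ticket (compare a copy instead of the live object) avoids the nested acquisition entirely; a Python model of that deadlock-free shape (a sketch, not the actual CopyOnFirstWriteProperties code):

```python
import threading

class SyncProps:
    """Toy stand-in for a Properties-like object whose public methods
    all take the instance's own lock."""
    def __init__(self, data):
        self._lock = threading.RLock()
        self._data = dict(data)

    def snapshot(self):
        with self._lock:              # hold only our own lock, briefly
            return dict(self._data)

    def equals(self, other):
        # Deadlock-free: copy each side under its own lock, one at a
        # time, then compare the plain copies with no lock held. The
        # deadlock-prone original instead held self's monitor while
        # calling into other's synchronized methods.
        return self.snapshot() == other.snapshot()

a = SyncProps({'k': 'v'})
b = SyncProps({'k': 'v'})
t1 = threading.Thread(target=lambda: [a.equals(b) for _ in range(10000)])
t2 = threading.Thread(target=lambda: [b.equals(a) for _ in range(10000)])
t1.start(); t2.start()
t1.join(timeout=30); t2.join(timeout=30)
print(not t1.is_alive() and not t2.is_alive(), a.equals(b))  # True True
```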
[jira] [Updated] (HIVE-26986) A DAG created by OperatorGraph is not equal to the Tez DAG.
[ https://issues.apache.org/jira/browse/HIVE-26986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-26986: -- Labels: hive-4.1.0-must pull-request-available (was: pull-request-available) > A DAG created by OperatorGraph is not equal to the Tez DAG. > --- > > Key: HIVE-26986 > URL: https://issues.apache.org/jira/browse/HIVE-26986 > Project: Hive > Issue Type: Sub-task >Affects Versions: 4.0.0-alpha-2 >Reporter: Seonggon Namgung >Assignee: Seonggon Namgung >Priority: Major > Labels: hive-4.1.0-must, pull-request-available > Attachments: Query71 OperatorGraph.png, Query71 TezDAG.png > > Time Spent: 50m > Remaining Estimate: 0h > > A DAG created by OperatorGraph is not equal to the corresponding DAG that is > submitted to Tez. > Because of this problem, ParallelEdgeFixer reports a pair of normal edges to > a parallel edge. > We observe this problem by comparing OperatorGraph and Tez DAG when running > TPC-DS query 71 on 1TB ORC format managed table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-26986) A DAG created by OperatorGraph is not equal to the Tez DAG.
[ https://issues.apache.org/jira/browse/HIVE-26986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-26986: -- Target Version/s: 4.1.0 > A DAG created by OperatorGraph is not equal to the Tez DAG. > --- > > Key: HIVE-26986 > URL: https://issues.apache.org/jira/browse/HIVE-26986 > Project: Hive > Issue Type: Sub-task >Affects Versions: 4.0.0-alpha-2 >Reporter: Seonggon Namgung >Assignee: Seonggon Namgung >Priority: Major > Labels: pull-request-available > Attachments: Query71 OperatorGraph.png, Query71 TezDAG.png > > Time Spent: 50m > Remaining Estimate: 0h > > A DAG created by OperatorGraph is not equal to the corresponding DAG that is > submitted to Tez. > Because of this problem, ParallelEdgeFixer reports a pair of normal edges to > a parallel edge. > We observe this problem by comparing OperatorGraph and Tez DAG when running > TPC-DS query 71 on 1TB ORC format managed table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-26654) Test with the TPC-DS benchmark
[ https://issues.apache.org/jira/browse/HIVE-26654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-26654: -- Target Version/s: (was: 4.0.0) > Test with the TPC-DS benchmark > --- > > Key: HIVE-26654 > URL: https://issues.apache.org/jira/browse/HIVE-26654 > Project: Hive > Issue Type: Bug >Affects Versions: 4.0.0-alpha-2 >Reporter: Sungwoo Park >Priority: Major > Labels: hive-4.0.0-must > > This Jira reports the result of running system tests using the TPC-DS > benchmark. The test scenario is: > 1) create a database consisting of external tables from a 100GB or 1TB TPC-DS > text dataset > 2) load a database consisting of ORC tables > 3) compute column statistics > 4) run TPC-DS queries > 5) check the results for correctness > For step 5), we will compare the results against Hive 3 (which has been > tested against SparkSQL and Presto). We use Hive on Tez. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-26654) Test with the TPC-DS benchmark
[ https://issues.apache.org/jira/browse/HIVE-26654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-26654: -- Priority: Major (was: Blocker) > Test with the TPC-DS benchmark > --- > > Key: HIVE-26654 > URL: https://issues.apache.org/jira/browse/HIVE-26654 > Project: Hive > Issue Type: Bug >Affects Versions: 4.0.0-alpha-2 >Reporter: Sungwoo Park >Priority: Major > Labels: hive-4.0.0-must > > This Jira reports the result of running system tests using the TPC-DS > benchmark. The test scenario is: > 1) create a database consisting of external tables from a 100GB or 1TB TPC-DS > text dataset > 2) load a database consisting of ORC tables > 3) compute column statistics > 4) run TPC-DS queries > 5) check the results for correctness > For step 5), we will compare the results against Hive 3 (which has been > tested against SparkSQL and Presto). We use Hive on Tez. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HIVE-28120) When insert overwrite the iceberg table, data will loss if the sql contains union all
[ https://issues.apache.org/jira/browse/HIVE-28120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834511#comment-17834511 ] Denys Kuzmenko commented on HIVE-28120: --- [~xinmingchang], is this ticket still valid, or could be closed? > When insert overwrite the iceberg table, data will loss if the sql contains > union all > - > > Key: HIVE-28120 > URL: https://issues.apache.org/jira/browse/HIVE-28120 > Project: Hive > Issue Type: Bug > Components: Iceberg integration >Affects Versions: 4.0.0-beta-1 > Environment: hadoop version: 3.3.1 > hive version: 4.0.0-beta-1 > iceberg version: 1.3.0 >Reporter: xinmingchang >Priority: Critical > > {{(1)}} > create table tmp.test_iceberg_overwrite_union_all( > a string > ) > stored by iceberg > ; > {{(2)}} > insert overwrite table tmp.test_iceberg_overwrite_union_all > select distinct 'a' union all select distinct 'b'; > {{(3)}} > select * from tmp.test_iceberg_overwrite_union_all; > > the result only has one record: > +-+ > | test_iceberg_overwrite_union_all.a | > +-+ > | a | > +-+ > According to the hiveserver log, this query will start two jobs, and each job > will be committed. The problem is that the job that is committed later is > also an overwrite, causing the result of the first commit to be overwritten. 
> like this: > 2024-03-05T22:10:12,995 INFO [iceberg-commit-table-pool-0]: > hive.HiveIcebergOutputCommitter () - Committing job has started for table: > default_iceberg.tmp.test_iceberg_overwrite_union_all > 2024-03-05T22:10:13,081 INFO [iceberg-commit-table-pool-1]: > hive.HiveIcebergOutputCommitter () - Committing job has started for table: > default_iceberg.tmp.test_iceberg_overwrite_union_all > 2024-03-05T22:10:15,152 INFO [iceberg-commit-table-pool-0]: > hive.HiveIcebergOutputCommitter () - Overwrite commit took 2157 ms for table: > default_iceberg.tmp.test_iceberg_overwrite_union_all with 1 file(s) > 2024-03-05T22:10:16,980 INFO [iceberg-commit-table-pool-1]: > hive.HiveIcebergOutputCommitter () - Overwrite commit took 3899 ms for table: > default_iceberg.tmp.test_iceberg_overwrite_union_all with 1 file(s) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HIVE-28120) When insert overwrite the iceberg table, data will loss if the sql contains union all
[ https://issues.apache.org/jira/browse/HIVE-28120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-28120: -- Priority: Major (was: Critical) > When insert overwrite the iceberg table, data will loss if the sql contains > union all > - > > Key: HIVE-28120 > URL: https://issues.apache.org/jira/browse/HIVE-28120 > Project: Hive > Issue Type: Bug > Components: Iceberg integration >Affects Versions: 4.0.0-beta-1 > Environment: hadoop version: 3.3.1 > hive version: 4.0.0-beta-1 > iceberg version: 1.3.0 >Reporter: xinmingchang >Priority: Major > > {{(1)}} > create table tmp.test_iceberg_overwrite_union_all( > a string > ) > stored by iceberg > ; > {{(2)}} > insert overwrite table tmp.test_iceberg_overwrite_union_all > select distinct 'a' union all select distinct 'b'; > {{(3)}} > select * from tmp.test_iceberg_overwrite_union_all; > > the result only has one record: > +-+ > | test_iceberg_overwrite_union_all.a | > +-+ > | a | > +-+ > According to the hiveserver log, this query will start two jobs, and each job > will be committed. The problem is that the job that is committed later is > also an overwrite, causing the result of the first commit to be overwritten. 
> like this: > 2024-03-05T22:10:12,995 INFO [iceberg-commit-table-pool-0]: > hive.HiveIcebergOutputCommitter () - Committing job has started for table: > default_iceberg.tmp.test_iceberg_overwrite_union_all > 2024-03-05T22:10:13,081 INFO [iceberg-commit-table-pool-1]: > hive.HiveIcebergOutputCommitter () - Committing job has started for table: > default_iceberg.tmp.test_iceberg_overwrite_union_all > 2024-03-05T22:10:15,152 INFO [iceberg-commit-table-pool-0]: > hive.HiveIcebergOutputCommitter () - Overwrite commit took 2157 ms for table: > default_iceberg.tmp.test_iceberg_overwrite_union_all with 1 file(s) > 2024-03-05T22:10:16,980 INFO [iceberg-commit-table-pool-1]: > hive.HiveIcebergOutputCommitter () - Overwrite commit took 3899 ms for table: > default_iceberg.tmp.test_iceberg_overwrite_union_all with 1 file(s) -- This message was sent by Atlassian Jira (v8.20.10#820010)
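The two "Overwrite commit" log lines point at the mechanism: each branch of the UNION ALL commits independently, and each commit replaces the table contents, so the later commit erases the earlier one. A toy model of the two commit semantics (hedged; Iceberg's real commit API is snapshot-based, not a dict):

```python
def commit_overwrite(table_files, job_files):
    # Toy "INSERT OVERWRITE" commit: replace the table's current file set.
    table_files.clear()
    table_files.update(job_files)

def commit_append(table_files, job_files):
    # Toy append commit: add files, keep what is already there.
    table_files.update(job_files)

# UNION ALL compiles into two jobs, each producing one file and committing.
lost = {}
commit_overwrite(lost, {'file-a': ['a']})
commit_overwrite(lost, {'file-b': ['b']})   # second overwrite wipes file-a
print(sorted(lost))                          # ['file-b'] -> row 'a' is gone

kept = {}
commit_overwrite(kept, {'file-a': ['a']})   # only the first commit may overwrite
commit_append(kept, {'file-b': ['b']})      # later branches must append
print(sorted(kept))                          # ['file-a', 'file-b']
```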
[jira] [Updated] (HIVE-24167) TPC-DS query 14 fails while generating plan for the filter
[ https://issues.apache.org/jira/browse/HIVE-24167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-24167: -- Target Version/s: 4.1.0
> TPC-DS query 14 fails while generating plan for the filter
> --
>
> Key: HIVE-24167
> URL: https://issues.apache.org/jira/browse/HIVE-24167
> Project: Hive
> Issue Type: Sub-task
> Components: CBO
> Reporter: Stamatis Zampetakis
> Assignee: Shohei Okumiya
> Priority: Major
> Labels: hive-4.1.0-must, pull-request-available
>
> TPC-DS query 14 (cbo_query14.q and query4.q) fails with an NPE on the
> metastore with the partitioned TPC-DS 30TB dataset while generating the plan
> for the filter.
> The problem can be reproduced using the PR in HIVE-23965.
> The current stacktrace shows that the NPE appears while trying to display the
> debug message, but even if this line didn't exist it would fail again later on.
> {noformat}
> java.lang.NullPointerException
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10867)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlanForSubQueryPredicate(SemanticAnalyzer.java:3375)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:3473)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10819)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:12417)
> at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:718)
> at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12519)
> at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:443)
> at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
> at org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:171)
> at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
> at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:220)
> at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:104)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:173)
> at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:414)
> at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:363)
> at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:357)
> at org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:129)
> at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:231)
> at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:203)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:129)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:424)
> at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:355)
> at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:740)
> at
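The report notes that the NPE fires while composing a debug message in genBodyPlan. A minimal, hypothetical illustration (not Hive's actual code; the names are invented) of how building a log string can itself throw, and the null-safe alternative:

```java
// Hypothetical sketch: an NPE raised while formatting a debug message.
// Calling a method on a possibly-null value inside the message throws;
// String.valueOf (or plain string concatenation) handles null gracefully.
public class DebugLogNpe {
    static String describe(Object qbExpr) {
        // Unsafe variant (would throw NullPointerException when qbExpr is null):
        //   return "Generated plan for " + qbExpr.toString();
        // Null-safe variant:
        return "Generated plan for " + String.valueOf(qbExpr);
    }

    public static void main(String[] args) {
        System.out.println(describe(null));   // prints "Generated plan for null"
        System.out.println(describe("QB-1")); // prints "Generated plan for QB-1"
    }
}
```

As the report says, guarding the log line alone would not fix the bug; the null value would still surface later in plan generation.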
[jira] [Updated] (HIVE-27226) FullOuterJoin with filter expressions is not computed correctly
[ https://issues.apache.org/jira/browse/HIVE-27226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Denys Kuzmenko updated HIVE-27226: -- Target Version/s: 4.1.0
> FullOuterJoin with filter expressions is not computed correctly
> ---
>
> Key: HIVE-27226
> URL: https://issues.apache.org/jira/browse/HIVE-27226
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 4.0.0-beta-1
> Reporter: Seonggon Namgung
> Priority: Major
> Labels: hive-4.1.0-must, known_issue
>
> I tested many OuterJoin queries as an extension of HIVE-27138 and found that
> Hive returns an incorrect result for a query containing a FullOuterJoin with
> filter expressions. In a nutshell, all JoinOperators that run on the Tez
> engine return incorrect results for OuterJoin queries, and one of the reasons
> for the incorrect computation lies in CommonJoinOperator, which is the base of
> all JoinOperators. I attached the queries and configuration that I used at the
> bottom of the document. I am still inspecting these problems and will share an
> update once I find out more. Any comments and opinions would be appreciated.
> First of all, I observed that current Hive ignores filter expressions
> contained in MapJoinOperator. For example, the attached result of query1 shows
> that MapJoinOperator performs an inner join, not a full outer join. This
> problem stems from the removal of the filterMap: when converting a
> JoinOperator to a MapJoinOperator,
> ConvertJoinMapJoin#convertJoinDynamicPartitionedHashJoin() removes the
> filterMap of the MapJoinOperator. Because MapJoinOperator does not evaluate
> filter expressions when filterMap is null, this change makes MapJoinOperator
> ignore filter expressions, so it always joins rows regardless of whether they
> satisfy the filter expressions. To solve this problem, I disabled
> FullOuterMapJoinOptimization and applied the patch for HIVE-27138, which
> prevents an NPE. (The patch is available at the following link: LINK.)
> The rest of this document uses this modified Hive, but most of the problems
> happen with current Hive, too.
> The second problem I found is that Hive returns the same left-null or
> right-null rows multiple times when it uses MapJoinOperator or
> CommonMergeJoinOperator. This is caused by the logic of the current
> CommonJoinOperator. Both JoinOperators join tables in two steps. First, they
> create RowContainers, each of which is a group of rows from one table sharing
> the same key. Second, they call CommonJoinOperator#checkAndGenObject() with
> the created RowContainers. This method checks the filterTag of each row in the
> RowContainers and forwards a joined row if the rows meet all filter
> conditions. For OuterJoin, checkAndGenObject() forwards non-matching rows if
> there is no matching row in the RowContainer. The problem happens when there
> are multiple RowContainers for the same key and table. For example, suppose
> there are two left RowContainers and one right RowContainer. If no row in the
> two left RowContainers satisfies the filter condition, checkAndGenObject()
> forwards a left-null row for each right row. Because checkAndGenObject() is
> called once per left RowContainer, every right row ends up with two duplicated
> left-null rows.
> In the case of MapJoinOperator, it always creates a singleton RowContainer for
> the big table, so it always produces duplicated non-matching rows.
> CommonMergeJoinOperator also creates multiple RowContainers for the big table,
> each of size hive.join.emit.interval. In the experiment below, I also set
> hive.join.shortcut.unmatched.rows=false and hive.exec.reducers.max=1 to
> disable the specialized algorithm for an OuterJoin of two tables and to force
> calling checkAndGenObject() before all rows with the same key are gathered. I
> did not observe this problem when using VectorMapJoinOperator, and I will
> inspect whether the problem can be reproduced with VectorMapJoinOperator.
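The duplication mechanism can be modeled with a toy example (this is not Hive's CommonJoinOperator; it only mimics the per-container generation described above, under the assumption that the filter rejects every left row for the key):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the duplication: when the left side of one key is split across
// several RowContainers and no left row passes the filter, the non-matching
// (left-null) rows are emitted once PER CONTAINER, not once per key.
public class NullPaddingDuplication {
    // Emulates one checkAndGenObject() call for a single left container
    // in which the filter rejects every left row.
    static List<String> genForContainer(List<String> leftContainer,
                                        List<String> rightRows) {
        List<String> out = new ArrayList<>();
        boolean anyLeftMatches = false; // assumption: filter rejects all left rows
        if (!anyLeftMatches) {
            for (String r : rightRows) out.add("(NULL, " + r + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // Two left containers for the same key; one right container.
        List<List<String>> leftContainers = List.of(List.of("l1"), List.of("l2"));
        List<String> rightRows = List.of("r1", "r2");

        List<String> joined = new ArrayList<>();
        for (List<String> c : leftContainers) {
            joined.addAll(genForContainer(c, rightRows));
        }
        // Each right row gets a left-null pad twice, once per left container:
        // [(NULL, r1), (NULL, r2), (NULL, r1), (NULL, r2)]
        System.out.println(joined);
    }
}
```

With a single container per key (all left rows gathered before generation), each right row would be padded exactly once, which is the correct outer-join output.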
>
> I think the second problem is not limited to FullOuterJoin, but I could not
> find such a query as of now. I will add it to this issue if I can write a
> query that reproduces the second problem without FullOuterJoin.
> I also found that Hive returns a wrong result for query2 even when I used
> VectorMapJoinOperator. I am still inspecting this problem and will add an
> update once I find out the reason.
>
> Experiment:
>
> {code:java}
> Configuration
> set hive.optimize.shared.work=false;
> -- Std MapJoin
> set hive.auto.convert.join=true;
> set hive.vectorized.execution.enabled=false;
> -- Vec MapJoin
> set hive.auto.convert.join=true;
> set hive.vectorized.execution.enabled=true;
> -- MergeJoin
> set hive.auto.convert.join=false;
> set hive.vectorized.execution.enabled=false;
>