[jira] [Commented] (HIVE-24167) TPC-DS query 14 fails while generating plan for the filter

2024-03-20 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829237#comment-17829237
 ] 

okumin commented on HIVE-24167:
---

Yes. Materialized CTEs in a single query have their own independent Tez DAGs, 
but all of them are still part of a single Hive query at the Hive level.

> TPC-DS query 14 fails while generating plan for the filter
> --
>
> Key: HIVE-24167
> URL: https://issues.apache.org/jira/browse/HIVE-24167
> Project: Hive
>  Issue Type: Sub-task
>  Components: CBO
>Reporter: Stamatis Zampetakis
>Assignee: okumin
>Priority: Major
>  Labels: hive-4.1.0-must, pull-request-available
>
> TPC-DS query 14 (cbo_query14.q and query4.q) fail with NPE on the metastore 
> with the partitioned TPC-DS 30TB dataset while generating the plan for the 
> filter.
> The problem can be reproduced using the PR in HIVE-23965.
> The current stacktrace shows that the NPE appears while trying to display the 
> debug message but even if this line didn't exist it would fail again later on.
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10867)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlanForSubQueryPredicate(SemanticAnalyzer.java:3375)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:3473)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10819)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:12417)
> at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:718)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12519)
> at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:443)
> at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
> at 
> org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:171)
> at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
> at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:220)
> at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:104)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:173)
> at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:414)
> at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:363)
> at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:357)
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:129)
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:231)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:203)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:129)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:424)
> at 
> 

[jira] [Commented] (HIVE-28088) CalcitePlanner fails to resolve column names on materialized CTEs

2024-03-20 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17828949#comment-17828949
 ] 

okumin commented on HIVE-28088:
---

I read an unofficial document saying that Oracle also appends an ordinal postfix 
when a column name is duplicated, so I am starting to think that automatic 
renaming is a well-known approach. It would be more helpful to rename the column 
to `key_1` instead of `_col1`, though.
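For illustration, the ordinal-postfix scheme mentioned above could look like the following sketch. This is my own toy model, not Hive's or Oracle's actual renaming logic; the function name and the exact `name_N` format are invented here.

```python
def dedupe_columns(names):
    """Rename duplicated column names by appending an ordinal postfix,
    e.g. ["key", "key"] -> ["key", "key_1"].

    Toy sketch only: real engines resolve names inside their analyzers,
    and the exact postfix format differs per engine.
    """
    seen = {}
    result = []
    for name in names:
        if name not in seen:
            seen[name] = 0
            result.append(name)
        else:
            seen[name] += 1
            candidate = f"{name}_{seen[name]}"
            # Bump the ordinal until the candidate itself is unique,
            # in case the user already declared e.g. "key_1" explicitly.
            while candidate in seen:
                seen[name] += 1
                candidate = f"{name}_{seen[name]}"
            seen[candidate] = 0
            result.append(candidate)
    return result
```

Under this scheme, `SELECT a.key, b.key` would expose `key` and `key_1` instead of `key` and `_col1`.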

> CalcitePlanner fails to resolve column names on materialized CTEs
> -
>
> Key: HIVE-28088
> URL: https://issues.apache.org/jira/browse/HIVE-28088
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Affects Versions: 4.0.0-beta-1
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> We found a case where CBO fails when it refers to a materialized CTE.
> These are queries to reproduce the issue.
> {code:java}
> DROP TABLE IF EXISTS test;
> CREATE TABLE test (key STRING);
> set hive.optimize.cte.materialize.threshold=1;
> set hive.optimize.cte.materialize.full.aggregate.only=false;
> EXPLAIN CBO WITH materialized_cte AS (
>   SELECT a.key, b.key
>   FROM test a
>   JOIN test b ON (a.key = b.key)
> )
> SELECT *
> FROM materialized_cte c
> JOIN materialized_cte d ON (c.key = d.key); {code}
> CBO fails.
> {code:java}
> +--+
> | Explain  |
> +--+
> +--+ {code}
> Error log on HiveServer2.
> {code:java}
> 2024-02-23T04:43:19,065 ERROR [d145a95b-8840-4d2b-9869-a186b4bd82ef 
> HiveServer2-Handler-Pool: Thread-555] parse.CalcitePlanner: CBO failed, 
> skipping CBO. 
> org.apache.hadoop.hive.ql.optimizer.calcite.CalciteSemanticException: Could 
> not resolve column name
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.getPosition(RexNodeExprFactory.java:184)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.createColumnRefExpr(RexNodeExprFactory.java:161)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.createColumnRefExpr(RexNodeExprFactory.java:97)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.JoinCondTypeCheckProcFactory$JoinCondDefaultExprProcessor.processQualifiedColRef(JoinCondTypeCheckProcFactory.java:188)
>  ~[hive-exec-4.0.0-beta-2-
> SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1414)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.CostLessRuleDispatcher.dispatch(CostLessRuleDispatcher.java:66)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.ExpressionWalker.walk(ExpressionWalker.java:101)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:231)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeTypeCheck.genExprNodeJoinCond(RexNodeTypeCheck.java:60)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.genJoinRelNode(CalcitePlanner.java:2656)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.genJoinLogicalPlan(CalcitePlanner.java:2888)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.genLogicalPlan(CalcitePlanner.java:5048)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1625)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1569)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> 

[jira] [Commented] (HIVE-24167) TPC-DS query 14 fails while generating plan for the filter

2024-03-20 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17828939#comment-17828939
 ] 

okumin commented on HIVE-24167:
---

[~zabetak]

No, I haven't. I suspect the problem is a little more difficult than that. 
The ultimate purpose of PlanMapper is to propagate runtime stats to all 
equivalent Calcite RelNodes and Hive Operators across application attempts. 
[The propagation can even happen across queries through HS2 or 
HMS|https://github.com/apache/hive/blob/rel/release-4.0.0-beta-1/ql/src/java/org/apache/hadoop/hive/ql/plan/mapper/StatsSources.java#L52-L61].

We bind runtime stats to RelNodes or Operators in [a Hive 
hook|https://github.com/apache/hive/blob/rel/release-4.0.0-beta-1/ql/src/java/org/apache/hadoop/hive/ql/stats/OperatorStatsReaderHook.java#L77].
 So the lifetime of Context/PlanMapper must equal that of a Hive query, which 
is longer than the lifetime of a materialized CTE (= one Tez DAG).

So we might need small modifications if we apply that approach. Some of my 
ideas are here.
 * To make Context : PlanMapper = 1 : N. We create a new PlanMapper per 
materialized CTE and retain all mappers for the entire query. When 
OperatorStatsReaderHook links the runtime stats, it will try to propagate them 
to all PlanMappers (maybe via signatures, or Operator ids if that is difficult)
 * [To tag each entry in PlanMapper with the name of its materialized 
CTE|https://gist.github.com/okumin/b111fe0a911507bdf6a7204f49b9cb72#give-separate-namespaces-to-each-cte],
 keeping Context : PlanMapper = 1 : 1. The basic idea is the same as the first 
one
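The second idea, tagging entries with a CTE namespace, could be modeled with a toy sketch like the one below. All class and method names here are invented for illustration; the real PlanMapper maintains equivalence groups keyed by plan signatures, which this simplifies to a plain dictionary.

```python
class TaggedPlanMapper:
    """Toy model: one mapper per query, with each entry tagged by the
    materialized CTE it belongs to (None = the main query body).

    Entries with the same signature are grouped together only when they
    also come from the same CTE namespace, which is the point of the idea:
    no over-linking of RelNodes/Operators across materialized CTEs.
    """

    def __init__(self):
        # (cte_name, signature) -> list of linked RelNodes/Operators
        self._groups = {}

    def link(self, cte_name, signature, obj):
        self._groups.setdefault((cte_name, signature), []).append(obj)

    def lookup(self, cte_name, signature):
        return self._groups.get((cte_name, signature), [])
```

For example, an Operator from `materialized_cte1` and an Operator from the main query can share a signature without being merged into the same group, while a stats-reader hook can still iterate all groups at the end of the query.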

I expect that either approach will prevent SemanticAnalyzer from over-linking 
RelNodes or Operators across materialized CTEs at compile time, while allowing 
OperatorStatsReaderHook to associate stats with all Operators at the end of a 
query.

To be honest, I can't offer 100% confidence or hard evidence, as the related 
code is complex. I will try it once one of the above approaches looks like the 
most sensible choice for us.

> TPC-DS query 14 fails while generating plan for the filter
> --
>
> Key: HIVE-24167
> URL: https://issues.apache.org/jira/browse/HIVE-24167
> Project: Hive
>  Issue Type: Sub-task
>  Components: CBO
>Reporter: Stamatis Zampetakis
>Assignee: okumin
>Priority: Major
>  Labels: hive-4.1.0-must, pull-request-available
>
> TPC-DS query 14 (cbo_query14.q and query4.q) fail with NPE on the metastore 
> with the partitioned TPC-DS 30TB dataset while generating the plan for the 
> filter.
> The problem can be reproduced using the PR in HIVE-23965.
> The current stacktrace shows that the NPE appears while trying to display the 
> debug message but even if this line didn't exist it would fail again later on.
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10867)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlanForSubQueryPredicate(SemanticAnalyzer.java:3375)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:3473)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10819)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:12417)
> at 

[jira] [Commented] (HIVE-28088) CalcitePlanner fails to resolve column names on materialized CTEs

2024-03-20 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17828818#comment-17828818
 ] 

okumin commented on HIVE-28088:
---

I'm checking "Part 2: Foundation (SQL/Foundation)" of SQL:2023.
 * From "4.15 Columns, fields, and attributes" or "6.16 " -> 
"Syntax Rules" -> "(4)", references to ambiguous columns should raise an error. 
It means rules (2) and (4), i.e., ambiguous references in WHERE, GROUP BY, or 
other similar clauses, should fail. I guess this makes sense to everyone.
 * From "7.17 " -> "Syntax Rules" -> "(3) (d)", my reading is that 
a CTE shall not directly retain ambiguous columns; we need to rename columns 
explicitly, like "WITH cte (a, b, c) AS ()". This is not 
consistent with major query engines. Section 7.17 is quite long, so I might 
have missed some statements.
 * As for duplicated column names at the top level or in sub-queries, I have 
not found any rules that reject them. They may be legal.
 * I have not found any description that justifies Hive's renaming from "key" 
to "_col1". I guess this is Hive-specific behavior.
 ** [BigQuery does have the 
behavior|https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#duplicate_aliases]

We may implement the following.
 * To disallow references to ambiguous columns
 * To disallow duplicated column names in a CTE?
 ** We might postpone this until we are confident about the standard. Or it 
could be reasonable to keep the current behavior to align Hive with other query 
engines.
 * To stop automatically renaming a duplicated column name to "_colN"
 ** We might postpone this because it looks like [it is not a bug but 
intentional 
behavior|https://github.com/apache/hive/blob/rel/release-4.0.0-beta-1/ql/src/java/org/apache/hadoop/hive/ql/parse/RowResolver.java#L445-L449],
 and BigQuery does it as well.
 * To make CTEs and materialized CTEs as consistent as possible
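The first item, rejecting ambiguous column references, can be sketched roughly as follows. This is a simplified model of name resolution, not Hive's actual RowResolver logic; the function name and the `schema` shape (alias mapped to a column list) are assumptions of mine.

```python
def resolve_column(schema, name):
    """Resolve an unqualified column reference, rejecting ambiguity.

    `schema` maps each table/CTE alias to its list of column names.
    An unqualified name matching more than one column (whether across
    aliases or duplicated within one alias) raises an error, in line
    with the SQL:2023 syntax rules discussed above.
    """
    matches = [(alias, col)
               for alias, cols in schema.items()
               for col in cols
               if col == name]
    if not matches:
        raise KeyError(f"Invalid column reference: {name}")
    if len(matches) > 1:
        raise ValueError(f"Ambiguous column reference: {name}")
    return matches[0]
```

With this rule, `SELECT key FROM cte` over a CTE that exposes two `key` columns would fail at resolution time instead of silently binding to one of them.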

Then, I am thinking of addressing the first and last items. [~kkasa] Does that 
make sense?

> CalcitePlanner fails to resolve column names on materialized CTEs
> -
>
> Key: HIVE-28088
> URL: https://issues.apache.org/jira/browse/HIVE-28088
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Affects Versions: 4.0.0-beta-1
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> We found a case where CBO fails when it refers to a materialized CTE.
> These are queries to reproduce the issue.
> {code:java}
> DROP TABLE IF EXISTS test;
> CREATE TABLE test (key STRING);
> set hive.optimize.cte.materialize.threshold=1;
> set hive.optimize.cte.materialize.full.aggregate.only=false;
> EXPLAIN CBO WITH materialized_cte AS (
>   SELECT a.key, b.key
>   FROM test a
>   JOIN test b ON (a.key = b.key)
> )
> SELECT *
> FROM materialized_cte c
> JOIN materialized_cte d ON (c.key = d.key); {code}
> CBO fails.
> {code:java}
> +--+
> | Explain  |
> +--+
> +--+ {code}
> Error log on HiveServer2.
> {code:java}
> 2024-02-23T04:43:19,065 ERROR [d145a95b-8840-4d2b-9869-a186b4bd82ef 
> HiveServer2-Handler-Pool: Thread-555] parse.CalcitePlanner: CBO failed, 
> skipping CBO. 
> org.apache.hadoop.hive.ql.optimizer.calcite.CalciteSemanticException: Could 
> not resolve column name
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.getPosition(RexNodeExprFactory.java:184)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.createColumnRefExpr(RexNodeExprFactory.java:161)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.createColumnRefExpr(RexNodeExprFactory.java:97)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.JoinCondTypeCheckProcFactory$JoinCondDefaultExprProcessor.processQualifiedColRef(JoinCondTypeCheckProcFactory.java:188)
>  ~[hive-exec-4.0.0-beta-2-
> SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1414)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.CostLessRuleDispatcher.dispatch(CostLessRuleDispatcher.java:66)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.ExpressionWalker.walk(ExpressionWalker.java:101)

[jira] [Commented] (HIVE-24167) TPC-DS query 14 fails while generating plan for the filter

2024-03-20 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17828635#comment-17828635
 ] 

okumin commented on HIVE-24167:
---

[~zabetak] Thanks. Let me clarify one thing. Can I still assume that we will 
release Hive 4.0 with `hive.optimize.cte.materialize.threshold=-1` and 
`hive.optimize.cte.materialize.full.aggregate.only=true`, and that HIVE-24167 
is therefore not a blocker? Regardless of the answer, I will still prioritize 
this ticket, as we strongly depend on this feature. I just want to confirm 
whether the priority is extremely high or merely high.

 

> I am not sure if there have been offline discussions about the best path 
> forward but it would be nice to converge on the topic because the way it is 
> right now the CTE materialization feature appears quite broken. If you lean 
> towards #5037, #5077, or another approach let me know and I will try to help 
> review, brainstorm, etc.

 

No, we haven't had any further discussions since #5037. I'd be glad if you 
helped with the brainstorming and decision-making so that we can bet on a 
primary approach. I have not yet concluded which approach is the most 
reasonable.

> TPC-DS query 14 fails while generating plan for the filter
> --
>
> Key: HIVE-24167
> URL: https://issues.apache.org/jira/browse/HIVE-24167
> Project: Hive
>  Issue Type: Sub-task
>  Components: CBO
>Reporter: Stamatis Zampetakis
>Assignee: okumin
>Priority: Major
>  Labels: hive-4.1.0-must, pull-request-available
>
> TPC-DS query 14 (cbo_query14.q and query4.q) fail with NPE on the metastore 
> with the partitioned TPC-DS 30TB dataset while generating the plan for the 
> filter.
> The problem can be reproduced using the PR in HIVE-23965.
> The current stacktrace shows that the NPE appears while trying to display the 
> debug message but even if this line didn't exist it would fail again later on.
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10867)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlanForSubQueryPredicate(SemanticAnalyzer.java:3375)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:3473)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10819)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:12417)
> at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:718)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12519)
> at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:443)
> at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
> at 
> org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:171)
> at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
> at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:220)
> at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:104)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:173)
> 

[jira] [Commented] (HIVE-28088) CalcitePlanner fails to resolve column names on materialized CTEs

2024-03-10 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17825035#comment-17825035
 ] 

okumin commented on HIVE-28088:
---

I observe multiple problems.
 * `*` resolves a duplicated column name to its internal name
 * Some clauses, such as WHERE, GROUP BY, and JOIN, don't fail when an 
ambiguous column is accessed
 * Materialized CTEs don't retain duplicated column names

{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> WITH cte AS (
. . . . . . . . . . . . . . . . . . . . . . .>   SELECT 'k1' AS key, 'k2' AS key
. . . . . . . . . . . . . . . . . . . . . . .> )
. . . . . . . . . . . . . . . . . . . . . . .> SELECT * FROM cte;
+--++
| cte.key  | cte._col1  |
+--++
| k1       | k2         |
+--++
0: jdbc:hive2://hive-hiveserver2:1/defaul> WITH cte AS (
. . . . . . . . . . . . . . . . . . . . . . .>   SELECT 'k1' AS key, 'k2' AS key
. . . . . . . . . . . . . . . . . . . . . . .> )
. . . . . . . . . . . . . . . . . . . . . . .> SELECT * FROM cte WHERE key = 
'k2';
+--++
| cte.key  | cte._col1  |
+--++
+--++
0: jdbc:hive2://hive-hiveserver2:1/defaul> WITH cte AS (
. . . . . . . . . . . . . . . . . . . . . . .>   SELECT 'k1' AS key, 'k2' AS key
. . . . . . . . . . . . . . . . . . . . . . .> )
. . . . . . . . . . . . . . . . . . . . . . .> SELECT key FROM cte;
Error: Error while compiling statement: FAILED: SemanticException Ambiguous 
column reference: .key (state=42000,code=4)
0: jdbc:hive2://hive-hiveserver2:1/defaul> WITH cte AS (
. . . . . . . . . . . . . . . . . . . . . . .>   SELECT 'k1' AS key, 'k2' AS key
. . . . . . . . . . . . . . . . . . . . . . .> )
. . . . . . . . . . . . . . . . . . . . . . .> SELECT key, count(*) FROM cte 
GROUP BY key;
+--+--+
| key  | _c1  |
+--+--+
| k1   | 1    |
+--+--+
0: jdbc:hive2://hive-hiveserver2:1/defaul> WITH cte AS (
. . . . . . . . . . . . . . . . . . . . . . .>   SELECT 'k1' AS key, 'k2' AS key
. . . . . . . . . . . . . . . . . . . . . . .> )
. . . . . . . . . . . . . . . . . . . . . . .> SELECT * FROM cte a JOIN cte b 
ON a.key = b.key;
++--++--+
| a.key  | a._col1  | b.key  | b._col1  |
++--++--+
| k1     | k2       | k1     | k2       |
++--++--+{code}

> CalcitePlanner fails to resolve column names on materialized CTEs
> -
>
> Key: HIVE-28088
> URL: https://issues.apache.org/jira/browse/HIVE-28088
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Affects Versions: 4.0.0-beta-1
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> We found a case where CBO fails when it refers to a materialized CTE.
> These are queries to reproduce the issue.
> {code:java}
> DROP TABLE IF EXISTS test;
> CREATE TABLE test (key STRING);
> set hive.optimize.cte.materialize.threshold=1;
> set hive.optimize.cte.materialize.full.aggregate.only=false;
> EXPLAIN CBO WITH materialized_cte AS (
>   SELECT a.key, b.key
>   FROM test a
>   JOIN test b ON (a.key = b.key)
> )
> SELECT *
> FROM materialized_cte c
> JOIN materialized_cte d ON (c.key = d.key); {code}
> CBO fails.
> {code:java}
> +--+
> | Explain  |
> +--+
> +--+ {code}
> Error log on HiveServer2.
> {code:java}
> 2024-02-23T04:43:19,065 ERROR [d145a95b-8840-4d2b-9869-a186b4bd82ef 
> HiveServer2-Handler-Pool: Thread-555] parse.CalcitePlanner: CBO failed, 
> skipping CBO. 
> org.apache.hadoop.hive.ql.optimizer.calcite.CalciteSemanticException: Could 
> not resolve column name
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.getPosition(RexNodeExprFactory.java:184)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.createColumnRefExpr(RexNodeExprFactory.java:161)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.createColumnRefExpr(RexNodeExprFactory.java:97)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.JoinCondTypeCheckProcFactory$JoinCondDefaultExprProcessor.processQualifiedColRef(JoinCondTypeCheckProcFactory.java:188)
>  ~[hive-exec-4.0.0-beta-2-
> SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1414)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.CostLessRuleDispatcher.dispatch(CostLessRuleDispatcher.java:66)
>  

[jira] [Resolved] (HIVE-27858) OOM happens when selecting many columns and JOIN.

2024-03-08 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin resolved HIVE-27858.
---
Resolution: Won't Fix

Let me close this one at this point.

> OOM happens when selecting many columns and  JOIN.
> --
>
> Key: HIVE-27858
> URL: https://issues.apache.org/jira/browse/HIVE-27858
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Affects Versions: 4.0.0-beta-1
>Reporter: Ryu Kobayashi
>Assignee: okumin
>Priority: Major
> Fix For: Not Applicable
>
> Attachments: HIVE-27858.full.q, HIVE-27858.less.columns.q, ddl.sql, 
> query.sql
>
>
> OOM happens when executing [^query.sql] using a table in [^ddl.sql]. These 
> did not happen in Hive 2 previously.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-28098) Fails to copy empty column statistics of materialized CTE

2024-03-04 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-28098:
--
Status: Patch Available  (was: Open)

I created a pull request.

https://github.com/apache/hive/pull/5106

> Fails to copy empty column statistics of materialized CTE
> -
>
> Key: HIVE-28098
> URL: https://issues.apache.org/jira/browse/HIVE-28098
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>
> HIVE-28080 introduced the optimization of materialized CTEs, but it turned 
> out that it failed when statistics were empty.
> This query reproduces the issue.
> {code:java}
> set hive.stats.autogather=false;
> CREATE TABLE src_no_stats AS SELECT '123' as key, 'val123' as value UNION ALL 
> SELECT '9' as key, 'val9' as value;
> set hive.optimize.cte.materialize.threshold=2;
> set hive.optimize.cte.materialize.full.aggregate.only=false;
> EXPLAIN WITH materialized_cte1 AS (
>   SELECT * FROM src_no_stats
> ),
> materialized_cte2 AS (
>   SELECT a.key
>   FROM materialized_cte1 a
>   JOIN materialized_cte1 b ON (a.key = b.key)
> )
> SELECT a.key
> FROM materialized_cte2 a
> JOIN materialized_cte2 b ON (a.key = b.key); {code}
> It throws an error.
> {code:java}
> Error: Error while compiling statement: FAILED: IllegalStateException The 
> size of col stats must be equal to that of schema. Stats = [], Schema = [key] 
> (state=42000,code=4) {code}
> Attaching a debugger, FSO of materialized_cte2 has empty stats as 
> JoinOperator loses stats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-28098) Fails to copy empty column statistics of materialized CTE

2024-02-29 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-28098:
--
Summary: Fails to copy empty column statistics of materialized CTE  (was: 
Fails to generate statistics of materialized CTE)

> Fails to copy empty column statistics of materialized CTE
> -
>
> Key: HIVE-28098
> URL: https://issues.apache.org/jira/browse/HIVE-28098
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> HIVE-28080 introduced the optimization of materialized CTEs, but it turned 
> out that it failed when statistics were empty.
> This query reproduces the issue.
> {code:java}
> set hive.stats.autogather=false;
> CREATE TABLE src_no_stats AS SELECT '123' as key, 'val123' as value UNION ALL 
> SELECT '9' as key, 'val9' as value;
> set hive.optimize.cte.materialize.threshold=2;
> set hive.optimize.cte.materialize.full.aggregate.only=false;
> EXPLAIN WITH materialized_cte1 AS (
>   SELECT * FROM src_no_stats
> ),
> materialized_cte2 AS (
>   SELECT a.key
>   FROM materialized_cte1 a
>   JOIN materialized_cte1 b ON (a.key = b.key)
> )
> SELECT a.key
> FROM materialized_cte2 a
> JOIN materialized_cte2 b ON (a.key = b.key); {code}
> It throws an error.
> {code:java}
> Error: Error while compiling statement: FAILED: IllegalStateException The 
> size of col stats must be equal to that of schema. Stats = [], Schema = [key] 
> (state=42000,code=4) {code}
> Attaching a debugger shows that the FileSinkOperator (FSO) of 
> materialized_cte2 has empty column stats because JoinOperator drops them.





[jira] [Created] (HIVE-28098) Fails to generate statistics of materialized CTE

2024-02-29 Thread okumin (Jira)
okumin created HIVE-28098:
-

 Summary: Fails to generate statistics of materialized CTE
 Key: HIVE-28098
 URL: https://issues.apache.org/jira/browse/HIVE-28098
 Project: Hive
  Issue Type: Bug
  Components: Query Planning
Reporter: okumin
Assignee: okumin


HIVE-28080 introduced the optimization of materialized CTEs, but it turned out 
that it failed when statistics were empty.

This query reproduces the issue.
{code:java}
set hive.stats.autogather=false;
CREATE TABLE src_no_stats AS SELECT '123' as key, 'val123' as value UNION ALL 
SELECT '9' as key, 'val9' as value;
set hive.optimize.cte.materialize.threshold=2;
set hive.optimize.cte.materialize.full.aggregate.only=false;

EXPLAIN WITH materialized_cte1 AS (
  SELECT * FROM src_no_stats
),
materialized_cte2 AS (
  SELECT a.key
  FROM materialized_cte1 a
  JOIN materialized_cte1 b ON (a.key = b.key)
)
SELECT a.key
FROM materialized_cte2 a
JOIN materialized_cte2 b ON (a.key = b.key); {code}
It throws an error.
{code:java}
Error: Error while compiling statement: FAILED: IllegalStateException The size 
of col stats must be equal to that of schema. Stats = [], Schema = [key] 
(state=42000,code=4) {code}
Attaching a debugger shows that the FileSinkOperator (FSO) of materialized_cte2 
has empty column stats because JoinOperator drops them.
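For context, the decision driven by the two settings used above can be approximated as follows. This is a sketch under stated assumptions (a CTE is materialized when its reference count reaches `hive.optimize.cte.materialize.threshold`, a non-positive threshold disables the feature, and `hive.optimize.cte.materialize.full.aggregate.only` restricts materialization to full aggregates); it is not Hive's actual code:

```python
# Toy approximation of Hive's CTE materialization decision.

def should_materialize(ref_count, threshold, full_aggregate_only,
                       is_full_aggregate):
    """Decide whether a CTE would be materialized as a temp table."""
    if threshold <= 0:
        # e.g. threshold = -1 disables CTE materialization entirely.
        return False
    if full_aggregate_only and not is_full_aggregate:
        # Only fully aggregated CTEs qualify unless the restriction
        # is lifted with ...full.aggregate.only=false.
        return False
    return ref_count >= threshold

# The repro above: threshold=2, full.aggregate.only=false, and each CTE
# is referenced twice, so both CTEs are materialized.
print(should_materialize(2, 2, False, False))
```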





[jira] [Commented] (HIVE-27858) OOM happens when selecting many columns and JOIN.

2024-02-26 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820717#comment-17820717
 ] 

okumin commented on HIVE-27858:
---

[~dkuzmenko]

I would like to simply close this ticket, as there is nothing actionable here. 
No version can successfully finish the query without CTE materialization, so 
this is not a regression. Also, I don't think we are obligated to process a 
query with 10,000 vertices without any tuning.
 * Hive 2 with hive.optimize.cte.materialize.threshold=-1 => OOM
 * Hive 3 with hive.optimize.cte.materialize.threshold=-1 => OOM
 * Hive 4 or the master with hive.optimize.cte.materialize.threshold=-1 => OOM

Though Hive 3 fails even with CTE materialization, that has been fixed on the 
master branch, so no further action is needed there.
 * Hive 2 with hive.optimize.cte.materialize.threshold=2 => OK
 * Hive 3 with hive.optimize.cte.materialize.threshold=2 => Failed with 
"Grouping sets size cannot be greater than 64"
 * Hive 4 or the master branch with hive.optimize.cte.materialize.threshold=2 
and hive.optimize.cte.materialize.full.aggregate.only=false => OK

Does that make sense? We could keep this ticket open if we want to support such 
queries without CTE materialization, but that sounds like a new capability 
rather than a bug fix.

> OOM happens when selecting many columns and  JOIN.
> --
>
> Key: HIVE-27858
> URL: https://issues.apache.org/jira/browse/HIVE-27858
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Affects Versions: 4.0.0-beta-1
>Reporter: Ryu Kobayashi
>Assignee: okumin
>Priority: Major
> Fix For: Not Applicable
>
> Attachments: HIVE-27858.full.q, HIVE-27858.less.columns.q, ddl.sql, 
> query.sql
>
>
> OOM happens when executing [^query.sql] using a table in [^ddl.sql]. This 
> did not happen in Hive 2 previously.





[jira] [Comment Edited] (HIVE-28088) CalcitePlanner fails to resolve column names on materialized CTEs

2024-02-23 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820058#comment-17820058
 ] 

okumin edited comment on HIVE-28088 at 2/23/24 1:18 PM:


|| ||Hive w/ CBO||Hive w/o CBO||Spark SQL||Trino||PostgreSQL||MySQL||
|(1) Sub-query|(key, col_1)|(key, col_1)|(key, key)|(key, key)|(key, key)|Error|
|(2) Sub-query + top-level join|(key, col_1, key, 
col_1)|Error|Error|Error|Error|Error|
|(3) CTE|(key, col_1)|(key, col_1)|(key, key)|(key, key)|(key, key)|Error|
|(4) CTE + top-level join|(key, col_1, key, 
col_1)|Error|Error|Error|Error|Error|
|(5) Materialized CTE|(key)|(key)|N/A|N/A|N/A|N/A|
|(6) Materialized CTE + top-level join|Error|(key, key)|N/A|N/A|N/A|N/A|

If we assume Spark SQL, Trino, or PostgreSQL are correct, (1), (3), and (5) 
should return two keys. (2), (4), and (6) should fail.


was (Author: okumin):
|| ||Hive w/ CBO||Hive w/o CBO||Spark SQL||Trino||PostgreSQL||MySQL||
|(1) Sub-query|(key, col_1)|(key, col1)|(key, key)|(key, key)|(key, key)|Error|
|(2) Sub-query + top-level join|(key, col_1, key, 
col_1)|Error|Error|Error|Error|Error|
|(3) CTE|(key, col_1)|(key, col_1)|(key, key)|(key, key)|(key, key)|Error|
|(4) CTE + top-level join|(key, col_1, key, 
col_1)|Error|Error|Error|Error|Error|
|(5) Materialized CTE|(key)|(key)|N/A|N/A|N/A|N/A|
|(6) Materialized CTE + top-level join|Error|(key, key)|N/A|N/A|N/A|N/A|

If we assume Spark SQL, Trino, or PostgreSQL, (1), (3), and (5) should return 
two keys. (2), (4), and (6) should fail.

> CalcitePlanner fails to resolve column names on materialized CTEs
> -
>
> Key: HIVE-28088
> URL: https://issues.apache.org/jira/browse/HIVE-28088
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Affects Versions: 4.0.0-beta-1
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> We found a case where CBO fails when it refers to a materialized CTE.
> These are queries to reproduce the issue.
> {code:java}
> DROP TABLE IF EXISTS test;
> CREATE TABLE test (key STRING);
> set hive.optimize.cte.materialize.threshold=1;
> set hive.optimize.cte.materialize.full.aggregate.only=false;
> EXPLAIN CBO WITH materialized_cte AS (
>   SELECT a.key, b.key
>   FROM test a
>   JOIN test b ON (a.key = b.key)
> )
> SELECT *
> FROM materialized_cte c
> JOIN materialized_cte d ON (c.key = d.key); {code}
> CBO fails.
> {code:java}
> +--+
> | Explain  |
> +--+
> +--+ {code}
> Error log on HiveServer2.
> {code:java}
> 2024-02-23T04:43:19,065 ERROR [d145a95b-8840-4d2b-9869-a186b4bd82ef 
> HiveServer2-Handler-Pool: Thread-555] parse.CalcitePlanner: CBO failed, 
> skipping CBO. 
> org.apache.hadoop.hive.ql.optimizer.calcite.CalciteSemanticException: Could 
> not resolve column name
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.getPosition(RexNodeExprFactory.java:184)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.createColumnRefExpr(RexNodeExprFactory.java:161)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.createColumnRefExpr(RexNodeExprFactory.java:97)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.JoinCondTypeCheckProcFactory$JoinCondDefaultExprProcessor.processQualifiedColRef(JoinCondTypeCheckProcFactory.java:188)
>  ~[hive-exec-4.0.0-beta-2-
> SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1414)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.CostLessRuleDispatcher.dispatch(CostLessRuleDispatcher.java:66)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.ExpressionWalker.walk(ExpressionWalker.java:101)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:231)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> 

[jira] [Commented] (HIVE-28088) CalcitePlanner fails to resolve column names on materialized CTEs

2024-02-23 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820058#comment-17820058
 ] 

okumin commented on HIVE-28088:
---

|| ||Hive w/ CBO||Hive w/o CBO||Spark SQL||Trino||PostgreSQL||MySQL||
|(1) Sub-query|(key, col_1)|(key, col_1)|(key, key)|(key, key)|(key, key)|Error|
|(2) Sub-query + top-level join|(key, col_1, key, 
col_1)|Error|Error|Error|Error|Error|
|(3) CTE|(key, col_1)|(key, col_1)|(key, key)|(key, key)|(key, key)|Error|
|(4) CTE + top-level join|(key, col_1, key, 
col_1)|Error|Error|Error|Error|Error|
|(5) Materialized CTE|(key)|(key)|N/A|N/A|N/A|N/A|
|(6) Materialized CTE + top-level join|Error|(key, key)|N/A|N/A|N/A|N/A|

If we assume Spark SQL, Trino, or PostgreSQL are correct, (1), (3), and (5) 
should return two keys, and (2), (4), and (6) should fail.

> CalcitePlanner fails to resolve column names on materialized CTEs
> -
>
> Key: HIVE-28088
> URL: https://issues.apache.org/jira/browse/HIVE-28088
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Affects Versions: 4.0.0-beta-1
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> We found a case where CBO fails when it refers to a materialized CTE.
> These are queries to reproduce the issue.
> {code:java}
> DROP TABLE IF EXISTS test;
> CREATE TABLE test (key STRING);
> set hive.optimize.cte.materialize.threshold=1;
> set hive.optimize.cte.materialize.full.aggregate.only=false;
> EXPLAIN CBO WITH materialized_cte AS (
>   SELECT a.key, b.key
>   FROM test a
>   JOIN test b ON (a.key = b.key)
> )
> SELECT *
> FROM materialized_cte c
> JOIN materialized_cte d ON (c.key = d.key); {code}
> CBO fails.
> {code:java}
> +--+
> | Explain  |
> +--+
> +--+ {code}
> Error log on HiveServer2.
> {code:java}
> 2024-02-23T04:43:19,065 ERROR [d145a95b-8840-4d2b-9869-a186b4bd82ef 
> HiveServer2-Handler-Pool: Thread-555] parse.CalcitePlanner: CBO failed, 
> skipping CBO. 
> org.apache.hadoop.hive.ql.optimizer.calcite.CalciteSemanticException: Could 
> not resolve column name
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.getPosition(RexNodeExprFactory.java:184)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.createColumnRefExpr(RexNodeExprFactory.java:161)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.createColumnRefExpr(RexNodeExprFactory.java:97)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.JoinCondTypeCheckProcFactory$JoinCondDefaultExprProcessor.processQualifiedColRef(JoinCondTypeCheckProcFactory.java:188)
>  ~[hive-exec-4.0.0-beta-2-
> SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1414)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.CostLessRuleDispatcher.dispatch(CostLessRuleDispatcher.java:66)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.ExpressionWalker.walk(ExpressionWalker.java:101)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:231)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeTypeCheck.genExprNodeJoinCond(RexNodeTypeCheck.java:60)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.genJoinRelNode(CalcitePlanner.java:2656)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.genJoinLogicalPlan(CalcitePlanner.java:2888)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.genLogicalPlan(CalcitePlanner.java:5048)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> 

[jira] [Comment Edited] (HIVE-28088) CalcitePlanner fails to resolve column names on materialized CTEs

2024-02-23 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819917#comment-17819917
 ] 

okumin edited comment on HIVE-28088 at 2/23/24 1:11 PM:


I found that (2), (4), and (6) generate different results with and without CBO.

Also, (3) vs (5) and (4) vs (6) are not consistent, which means CTE 
materialization can change the result when there are duplicate columns.
|| ||CBO||Non-CBO||
|(1) Sub-query|(key, col_1)|(key, col_1)|
|(2) Sub-query + top-level join|(key, col_1, key, col_1)|Error|
|(3) CTE|(key, col_1)|(key, col_1)|
|(4) CTE + top-level join|(key, col_1, key, col_1)|Error|
|(5) Materialized CTE|(key)|(key)|
|(6) Materialized CTE + top-level join|Error|(key, key)|

Test queries.
{code:java}
DROP TABLE IF EXISTS test1;
DROP TABLE IF EXISTS test2;
CREATE TABLE test1 (key STRING);
CREATE TABLE test2 (key STRING);

-- (1)
SELECT *
FROM (SELECT a.key, b.key FROM test1 a JOIN test2 b ON (a.key = b.key)) c;

-- (2)
SELECT *
FROM (SELECT a.key, b.key FROM test1 a JOIN test2 b ON (a.key = b.key)) c
JOIN (SELECT a.key, b.key FROM test2 a JOIN test2 b ON (a.key = b.key)) d
ON c.key = d.key;

-- (3)
WITH cte AS (
  SELECT a.key, b.key FROM test1 a JOIN test2 b ON (a.key = b.key)
)
SELECT * FROM cte;

-- (4)
WITH cte AS (
  SELECT a.key, b.key FROM test1 a JOIN test2 b ON (a.key = b.key)
)
SELECT *
FROM cte c
JOIN cte d ON (c.key = d.key);

set hive.optimize.cte.materialize.threshold=1;
set hive.optimize.cte.materialize.full.aggregate.only=false;

-- (5)
WITH materialized_cte AS (
  SELECT a.key, b.key FROM test1 a JOIN test2 b ON (a.key = b.key)
)
SELECT * FROM materialized_cte;

-- (6)
WITH materialized_cte AS (
  SELECT a.key, b.key FROM test1 a JOIN test2 b ON (a.key = b.key)
)
SELECT *
FROM materialized_cte c
JOIN materialized_cte d ON (c.key = d.key);{code}
With CBO + `hive.cbo.fallback.strategy=NEVER;`.
{code:java}
(1)
++--+
| c.key  | c._col1  |
++--+
++--+

(2)
++--++--+
| c.key  | c._col1  | d.key  | d._col1  |
++--++--+
++--++--+

(3)
+--++
| cte.key  | cte._col1  |
+--++
+--++

(4)
++--++--+
| c.key  | c._col1  | d.key  | d._col1  |
++--++--+
++--++--+

(5)
+---+
| materialized_cte.key  |
+---+
+---+

(6)
Error: Error while compiling statement: FAILED: CalciteSemanticException Could 
not resolve column name (state=42000,code=4){code}
Without CBO.
{code:java}
(1)
++--+
| c.key  | c._col1  |
++--+

(2)
Error: Error while compiling statement: FAILED: SemanticException [Error 
10007]: Ambiguous column reference key in c (state=42000,code=10007)

(3)
+--++
| cte.key  | cte._col1  |
+--++
+--++

(4)
Error: Error while compiling statement: FAILED: SemanticException [Error 
10007]: Ambiguous column reference key in c (state=42000,code=10007)

(5)
+---+
| materialized_cte.key  |
+---+
+---+

(6)
+++
| c.key  | d.key  |
+++
+++{code}
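The divergence between the two planners comes down to how a qualified reference such as `c.key` is resolved when relation `c` exposes two output columns named `key`. A toy resolver (hypothetical, not any engine's implementation) showing both behaviors:

```python
# Toy model of column-name resolution over a relation's output schema.
# Most engines reject a duplicated name as ambiguous; a resolver that
# instead binds the first match silently drops the second column, which
# mirrors the (key) vs (key, key) discrepancies in the table above.

def resolve(columns, name, on_ambiguous="error"):
    """Return the ordinal of `name` in `columns`."""
    matches = [i for i, c in enumerate(columns) if c == name]
    if not matches:
        raise KeyError(f"Could not resolve column name: {name}")
    if len(matches) > 1 and on_ambiguous == "error":
        raise ValueError(f"Ambiguous column reference {name}")
    return matches[0]

# Strict mode (Spark SQL / Trino / PostgreSQL style): duplicate names
# make any later qualified reference an error.
# Lenient mode: the reference silently binds to the first occurrence.
print(resolve(["key", "key"], "key", on_ambiguous="first"))
```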


was (Author: okumin):
I found (2), (4), and (6) generated different results with CBO or without CBO.

Also, (3) vs (5) and (4) vs (6) are not consistent, which means CTE 
materialization could change a result when there are duplicated columns.
|| ||CBO||Non-CBO||
|(1) Sub-query|(key, col_1)|(key, col1)|
|(2) Sub-query + top-level join|(key, col_1, key, col_1)|Error|
|(3) CTE|(key, col_1)|(key, col_1)|
|(4) CTE + top-level join|(key, col_1, key, col_1)|Error|
|(5) Materialized CTE|(key)|(key)|
|(6) Materialized CTE + top-level join|Error|(key, key)|

Test queries.

 
{code:java}
DROP TABLE IF EXISTS test1;
DROP TABLE IF EXISTS test2;
CREATE TABLE test1 (key STRING);
CREATE TABLE test2 (key STRING);

-- (1)
SELECT *
FROM (SELECT a.key, b.key FROM test1 a JOIN test2 b ON (a.key = b.key)) c;

-- (2)
SELECT *
FROM (SELECT a.key, b.key FROM test1 a JOIN test2 b ON (a.key = b.key)) c
JOIN (SELECT a.key, b.key FROM test2 a JOIN test2 b ON (a.key = b.key)) d
ON c.key = d.key;

-- (3)
WITH cte AS (
  SELECT a.key, b.key FROM test1 a JOIN test2 b ON (a.key = b.key)
)
SELECT * FROM cte;

-- (4)
WITH cte AS (
  SELECT a.key, b.key FROM test1 a JOIN test2 b ON (a.key = b.key)
)
SELECT *
FROM cte c
JOIN cte d ON (c.key = d.key);

set hive.optimize.cte.materialize.threshold=1;
set hive.optimize.cte.materialize.full.aggregate.only=false;

-- (5)
WITH materialized_cte AS (
  SELECT a.key, b.key FROM test1 a JOIN test2 b ON (a.key = b.key)
)
SELECT * FROM materialized_cte;

-- (6)
WITH materialized_cte AS (
  SELECT 

[jira] [Comment Edited] (HIVE-28088) CalcitePlanner fails to resolve column names on materialized CTEs

2024-02-23 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820054#comment-17820054
 ] 

okumin edited comment on HIVE-28088 at 2/23/24 1:11 PM:


I tested them with various databases.

Spark SQL 3.4.1.
{code:java}
(1)
key    key

(2)
[AMBIGUOUS_REFERENCE] Reference `c`.`key` is ambiguous, could be: [`c`.`key`, 
`c`.`key`].; line 4 pos 3

(3)
key key

(4)
[AMBIGUOUS_REFERENCE] Reference `c`.`key` is ambiguous, could be: [`c`.`key`, 
`c`.`key`].; line 6 pos 15{code}
Trino 413.
{code:java}
(1)
 key | key 
-+-
(0 rows)

(2)
Query 20240223_125426_00032_p9au2 failed: line 5:4: Column 'c.key' is ambiguous

(3)
 key | key 
-+-
(0 rows)

(4)
Query 20240223_125426_00034_p9au2 failed: line 7:16: Column 'c.key' is 
ambiguous {code}
PostgreSQL 16.2.
{code:java}
(1)
 key | key 
-+-
(0 rows)

(2)
ERROR:  column reference "key" is ambiguous
LINE 4: ON c.key = d.key;

(3)
 key | key 
-+-
(0 rows)

(4)
ERROR:  column reference "key" is ambiguous
LINE 6: JOIN cte d ON (c.key = d.key);{code}
MySQL 8.3.
{code:java}
(1)
ERROR 1060 (42S21): Duplicate column name 'key'

(2)
ERROR 1060 (42S21): Duplicate column name 'key'

(3)
ERROR 1060 (42S21): Duplicate column name 'key'

(4)
ERROR 1060 (42S21): Duplicate column name 'key'{code}


was (Author: okumin):
I tested them with various databases.

Spark SQL 3.4.1.

 
{code:java}
(1)
key    key

(2)
[AMBIGUOUS_REFERENCE] Reference `c`.`key` is ambiguous, could be: [`c`.`key`, 
`c`.`key`].; line 4 pos 3

(3)
key key

(4)
[AMBIGUOUS_REFERENCE] Reference `c`.`key` is ambiguous, could be: [`c`.`key`, 
`c`.`key`].; line 6 pos 15{code}
Trino 413.

 

 
{code:java}
(1)
 key | key 
-+-
(0 rows)

(2)
Query 20240223_125426_00032_p9au2 failed: line 5:4: Column 'c.key' is ambiguous

(3)
 key | key 
-+-
(0 rows)

(4)
Query 20240223_125426_00034_p9au2 failed: line 7:16: Column 'c.key' is 
ambiguous {code}
PostgreSQL 16.2.
{code:java}
(1)
 key | key 
-+-
(0 rows)

(2)
ERROR:  column reference "key" is ambiguous
LINE 4: ON c.key = d.key;

(3)
 key | key 
-+-
(0 rows)

(4)
ERROR:  column reference "key" is ambiguous
LINE 6: JOIN cte d ON (c.key = d.key);{code}
MySQL 8.3.
{code:java}
(1)
ERROR 1060 (42S21): Duplicate column name 'key'

(2)
ERROR 1060 (42S21): Duplicate column name 'key'

(3)
ERROR 1060 (42S21): Duplicate column name 'key'

(4)
ERROR 1060 (42S21): Duplicate column name 'key'{code}
 

> CalcitePlanner fails to resolve column names on materialized CTEs
> -
>
> Key: HIVE-28088
> URL: https://issues.apache.org/jira/browse/HIVE-28088
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Affects Versions: 4.0.0-beta-1
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> We found a case where CBO fails when it refers to a materialized CTE.
> These are queries to reproduce the issue.
> {code:java}
> DROP TABLE IF EXISTS test;
> CREATE TABLE test (key STRING);
> set hive.optimize.cte.materialize.threshold=1;
> set hive.optimize.cte.materialize.full.aggregate.only=false;
> EXPLAIN CBO WITH materialized_cte AS (
>   SELECT a.key, b.key
>   FROM test a
>   JOIN test b ON (a.key = b.key)
> )
> SELECT *
> FROM materialized_cte c
> JOIN materialized_cte d ON (c.key = d.key); {code}
> CBO fails.
> {code:java}
> +--+
> | Explain  |
> +--+
> +--+ {code}
> Error log on HiveServer2.
> {code:java}
> 2024-02-23T04:43:19,065 ERROR [d145a95b-8840-4d2b-9869-a186b4bd82ef 
> HiveServer2-Handler-Pool: Thread-555] parse.CalcitePlanner: CBO failed, 
> skipping CBO. 
> org.apache.hadoop.hive.ql.optimizer.calcite.CalciteSemanticException: Could 
> not resolve column name
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.getPosition(RexNodeExprFactory.java:184)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.createColumnRefExpr(RexNodeExprFactory.java:161)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.createColumnRefExpr(RexNodeExprFactory.java:97)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.JoinCondTypeCheckProcFactory$JoinCondDefaultExprProcessor.processQualifiedColRef(JoinCondTypeCheckProcFactory.java:188)
>  ~[hive-exec-4.0.0-beta-2-
> SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1414)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.CostLessRuleDispatcher.dispatch(CostLessRuleDispatcher.java:66)
>  

[jira] [Commented] (HIVE-28088) CalcitePlanner fails to resolve column names on materialized CTEs

2024-02-23 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17820054#comment-17820054
 ] 

okumin commented on HIVE-28088:
---

I tested them with various databases.

Spark SQL 3.4.1.
{code:java}
(1)
key    key

(2)
[AMBIGUOUS_REFERENCE] Reference `c`.`key` is ambiguous, could be: [`c`.`key`, 
`c`.`key`].; line 4 pos 3

(3)
key key

(4)
[AMBIGUOUS_REFERENCE] Reference `c`.`key` is ambiguous, could be: [`c`.`key`, 
`c`.`key`].; line 6 pos 15{code}
Trino 413.
{code:java}
(1)
 key | key 
-+-
(0 rows)

(2)
Query 20240223_125426_00032_p9au2 failed: line 5:4: Column 'c.key' is ambiguous

(3)
 key | key 
-+-
(0 rows)

(4)
Query 20240223_125426_00034_p9au2 failed: line 7:16: Column 'c.key' is 
ambiguous {code}
PostgreSQL 16.2.
{code:java}
(1)
 key | key 
-+-
(0 rows)

(2)
ERROR:  column reference "key" is ambiguous
LINE 4: ON c.key = d.key;

(3)
 key | key 
-+-
(0 rows)

(4)
ERROR:  column reference "key" is ambiguous
LINE 6: JOIN cte d ON (c.key = d.key);{code}
MySQL 8.3.
{code:java}
(1)
ERROR 1060 (42S21): Duplicate column name 'key'

(2)
ERROR 1060 (42S21): Duplicate column name 'key'

(3)
ERROR 1060 (42S21): Duplicate column name 'key'

(4)
ERROR 1060 (42S21): Duplicate column name 'key'{code}

> CalcitePlanner fails to resolve column names on materialized CTEs
> -
>
> Key: HIVE-28088
> URL: https://issues.apache.org/jira/browse/HIVE-28088
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Affects Versions: 4.0.0-beta-1
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> We found a case where CBO fails when it refers to a materialized CTE.
> These are queries to reproduce the issue.
> {code:java}
> DROP TABLE IF EXISTS test;
> CREATE TABLE test (key STRING);
> set hive.optimize.cte.materialize.threshold=1;
> set hive.optimize.cte.materialize.full.aggregate.only=false;
> EXPLAIN CBO WITH materialized_cte AS (
>   SELECT a.key, b.key
>   FROM test a
>   JOIN test b ON (a.key = b.key)
> )
> SELECT *
> FROM materialized_cte c
> JOIN materialized_cte d ON (c.key = d.key); {code}
> CBO fails.
> {code:java}
> +--+
> | Explain  |
> +--+
> +--+ {code}
> Error log on HiveServer2.
> {code:java}
> 2024-02-23T04:43:19,065 ERROR [d145a95b-8840-4d2b-9869-a186b4bd82ef 
> HiveServer2-Handler-Pool: Thread-555] parse.CalcitePlanner: CBO failed, 
> skipping CBO. 
> org.apache.hadoop.hive.ql.optimizer.calcite.CalciteSemanticException: Could 
> not resolve column name
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.getPosition(RexNodeExprFactory.java:184)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.createColumnRefExpr(RexNodeExprFactory.java:161)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.createColumnRefExpr(RexNodeExprFactory.java:97)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.JoinCondTypeCheckProcFactory$JoinCondDefaultExprProcessor.processQualifiedColRef(JoinCondTypeCheckProcFactory.java:188)
>  ~[hive-exec-4.0.0-beta-2-
> SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1414)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.CostLessRuleDispatcher.dispatch(CostLessRuleDispatcher.java:66)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.ExpressionWalker.walk(ExpressionWalker.java:101)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:231)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeTypeCheck.genExprNodeJoinCond(RexNodeTypeCheck.java:60)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> 

[jira] [Commented] (HIVE-28088) CalcitePlanner fails to resolve column names on materialized CTEs

2024-02-22 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17819917#comment-17819917
 ] 

okumin commented on HIVE-28088:
---

I found that (2), (4), and (6) generate different results with and without CBO.

Also, (3) vs (5) and (4) vs (6) are not consistent, which means CTE 
materialization can change the result when there are duplicate columns.
|| ||CBO||Non-CBO||
|(1) Sub-query|(key, col_1)|(key, col_1)|Error|
|(2) Sub-query + top-level join|(key, col_1, key, col_1)|Error|
|(3) CTE|(key, col_1)|(key, col_1)|
|(4) CTE + top-level join|(key, col_1, key, col_1)|Error|
|(5) Materialized CTE|(key)|(key)|
|(6) Materialized CTE + top-level join|Error|(key, key)|

Test queries.
{code:java}
DROP TABLE IF EXISTS test1;
DROP TABLE IF EXISTS test2;
CREATE TABLE test1 (key STRING);
CREATE TABLE test2 (key STRING);

-- (1)
SELECT *
FROM (SELECT a.key, b.key FROM test1 a JOIN test2 b ON (a.key = b.key)) c;

-- (2)
SELECT *
FROM (SELECT a.key, b.key FROM test1 a JOIN test2 b ON (a.key = b.key)) c
JOIN (SELECT a.key, b.key FROM test2 a JOIN test2 b ON (a.key = b.key)) d
ON c.key = d.key;

-- (3)
WITH cte AS (
  SELECT a.key, b.key FROM test1 a JOIN test2 b ON (a.key = b.key)
)
SELECT * FROM cte;

-- (4)
WITH cte AS (
  SELECT a.key, b.key FROM test1 a JOIN test2 b ON (a.key = b.key)
)
SELECT *
FROM cte c
JOIN cte d ON (c.key = d.key);

set hive.optimize.cte.materialize.threshold=1;
set hive.optimize.cte.materialize.full.aggregate.only=false;

-- (5)
WITH materialized_cte AS (
  SELECT a.key, b.key FROM test1 a JOIN test2 b ON (a.key = b.key)
)
SELECT * FROM materialized_cte;

-- (6)
WITH materialized_cte AS (
  SELECT a.key, b.key FROM test1 a JOIN test2 b ON (a.key = b.key)
)
SELECT *
FROM materialized_cte c
JOIN materialized_cte d ON (c.key = d.key);{code}
With CBO + `hive.cbo.fallback.strategy=NEVER;`.
{code:java}
(1)
++--+
| c.key  | c._col1  |
++--+
++--+

(2)
++--++--+
| c.key  | c._col1  | d.key  | d._col1  |
++--++--+
++--++--+

(3)
+--++
| cte.key  | cte._col1  |
+--++
+--++

(4)
++--++--+
| c.key  | c._col1  | d.key  | d._col1  |
++--++--+
++--++--+

(5)
+---+
| materialized_cte.key  |
+---+
+---+

(6)
Error: Error while compiling statement: FAILED: CalciteSemanticException Could 
not resolve column name (state=42000,code=4){code}
Without CBO.
{code:java}
(1)
++--+
| c.key  | c._col1  |
++--+

(2)
Error: Error while compiling statement: FAILED: SemanticException [Error 
10007]: Ambiguous column reference key in c (state=42000,code=10007)

(3)
+--++
| cte.key  | cte._col1  |
+--++
+--++

(4)
Error: Error while compiling statement: FAILED: SemanticException [Error 
10007]: Ambiguous column reference key in c (state=42000,code=10007)

(5)
+---+
| materialized_cte.key  |
+---+
+---+

(6)
+++
| c.key  | d.key  |
+++
+++{code}

> CalcitePlanner fails to resolve column names on materialized CTEs
> -
>
> Key: HIVE-28088
> URL: https://issues.apache.org/jira/browse/HIVE-28088
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Affects Versions: 4.0.0-beta-1
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> We found a case where CBO fails when it refers to a materialized CTE.
> These are queries to reproduce the issue.
> {code:java}
> DROP TABLE IF EXISTS test;
> CREATE TABLE test (key STRING);
> set hive.optimize.cte.materialize.threshold=1;
> set hive.optimize.cte.materialize.full.aggregate.only=false;
> EXPLAIN CBO WITH materialized_cte AS (
>   SELECT a.key, b.key
>   FROM test a
>   JOIN test b ON (a.key = b.key)
> )
> SELECT *
> FROM materialized_cte c
> JOIN materialized_cte d ON (c.key = d.key); {code}
> CBO fails.
> {code:java}
> +--+
> | Explain  |
> +--+
> +--+ {code}
> Error log on HiveServer2.
> {code:java}
> 2024-02-23T04:43:19,065 ERROR [d145a95b-8840-4d2b-9869-a186b4bd82ef 
> HiveServer2-Handler-Pool: Thread-555] parse.CalcitePlanner: CBO failed, 
> skipping CBO. 
> org.apache.hadoop.hive.ql.optimizer.calcite.CalciteSemanticException: Could 
> not resolve column name
>      at 
> org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.getPosition(RexNodeExprFactory.java:184)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> 

[jira] [Created] (HIVE-28088) CalcitePlanner fails to resolve column names on materialized CTEs

2024-02-22 Thread okumin (Jira)
okumin created HIVE-28088:
-

 Summary: CalcitePlanner fails to resolve column names on 
materialized CTEs
 Key: HIVE-28088
 URL: https://issues.apache.org/jira/browse/HIVE-28088
 Project: Hive
  Issue Type: Bug
  Components: CBO
Affects Versions: 4.0.0-beta-1
Reporter: okumin
Assignee: okumin


We found a case where CBO fails when it refers to a materialized CTE.

These are queries to reproduce the issue.
{code:java}
DROP TABLE IF EXISTS test;
CREATE TABLE test (key STRING);

set hive.optimize.cte.materialize.threshold=1;
set hive.optimize.cte.materialize.full.aggregate.only=false;

EXPLAIN CBO WITH materialized_cte AS (
  SELECT a.key, b.key
  FROM test a
  JOIN test b ON (a.key = b.key)
)
SELECT *
FROM materialized_cte c
JOIN materialized_cte d ON (c.key = d.key); {code}
CBO fails.
{code:java}
+--+
| Explain  |
+--+
+--+ {code}
Error log on HiveServer2.
{code:java}
2024-02-23T04:43:19,065 ERROR [d145a95b-8840-4d2b-9869-a186b4bd82ef 
HiveServer2-Handler-Pool: Thread-555] parse.CalcitePlanner: CBO failed, 
skipping CBO. 
org.apache.hadoop.hive.ql.optimizer.calcite.CalciteSemanticException: Could not 
resolve column name
     at 
org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.getPosition(RexNodeExprFactory.java:184)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.createColumnRefExpr(RexNodeExprFactory.java:161)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.hadoop.hive.ql.parse.type.RexNodeExprFactory.createColumnRefExpr(RexNodeExprFactory.java:97)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.hadoop.hive.ql.parse.type.JoinCondTypeCheckProcFactory$JoinCondDefaultExprProcessor.processQualifiedColRef(JoinCondTypeCheckProcFactory.java:188)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1414)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.hadoop.hive.ql.lib.CostLessRuleDispatcher.dispatch(CostLessRuleDispatcher.java:66)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.hadoop.hive.ql.lib.ExpressionWalker.walk(ExpressionWalker.java:101) 
~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:231)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.hadoop.hive.ql.parse.type.RexNodeTypeCheck.genExprNodeJoinCond(RexNodeTypeCheck.java:60)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.genJoinRelNode(CalcitePlanner.java:2656)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.genJoinLogicalPlan(CalcitePlanner.java:2888)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.genLogicalPlan(CalcitePlanner.java:5048)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1625)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner$CalcitePlannerAction.apply(CalcitePlanner.java:1569)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.calcite.tools.Frameworks.lambda$withPlanner$0(Frameworks.java:131) 
~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.calcite.prepare.CalcitePrepareImpl.perform(CalcitePrepareImpl.java:914)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at org.apache.calcite.tools.Frameworks.withPrepare(Frameworks.java:180) 
~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at org.apache.calcite.tools.Frameworks.withPlanner(Frameworks.java:126) 
~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.hadoop.hive.ql.parse.CalcitePlanner.logicalPlan(CalcitePlanner.java:1321)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]

[jira] [Updated] (HIVE-28082) HiveAggregateReduceFunctionsRule could generate an inconsistent result

2024-02-21 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-28082:
--
Status: Patch Available  (was: Open)

I created a PR.

https://github.com/apache/hive/pull/5091

> HiveAggregateReduceFunctionsRule could generate an inconsistent result
> --
>
> Key: HIVE-28082
> URL: https://issues.apache.org/jira/browse/HIVE-28082
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Affects Versions: 4.0.0-beta-1
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> HiveAggregateReduceFunctionsRule translates AVG, STDDEV_POP, STDDEV_SAMP, 
> VAR_POP, and VAR_SAMP. Those UDFs accept string types and try to decode them 
> as floating point values. It is possible that undecodable values exist.
> We found that it could cause inconsistent behaviors with or without CBO.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:1/defaul> SELECT AVG('text');
> ...
> +--+
> | _c0  |
> +--+
> | 0.0  |
> +--+
> 1 row selected (18.229 seconds)
> 0: jdbc:hive2://hive-hiveserver2:1/defaul> set hive.cbo.enable=false;
> No rows affected (0.013 seconds)
> 0: jdbc:hive2://hive-hiveserver2:1/defaul> SELECT AVG('text');
> ...
> +---+
> |  _c0  |
> +---+
> | NULL  |
> +---+ {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-28082) HiveAggregateReduceFunctionsRule could generate an inconsistent result

2024-02-19 Thread okumin (Jira)
okumin created HIVE-28082:
-

 Summary: HiveAggregateReduceFunctionsRule could generate an 
inconsistent result
 Key: HIVE-28082
 URL: https://issues.apache.org/jira/browse/HIVE-28082
 Project: Hive
  Issue Type: Bug
  Components: CBO
Affects Versions: 4.0.0-beta-1
Reporter: okumin
Assignee: okumin


HiveAggregateReduceFunctionsRule translates AVG, STDDEV_POP, STDDEV_SAMP, 
VAR_POP, and VAR_SAMP. Those UDFs accept string types and try to decode them as 
floating point values. It is possible that undecodable values exist.

We found that it could cause inconsistent behaviors with or without CBO.
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> SELECT AVG('text');
...
+--+
| _c0  |
+--+
| 0.0  |
+--+
1 row selected (18.229 seconds)
0: jdbc:hive2://hive-hiveserver2:1/defaul> set hive.cbo.enable=false;
No rows affected (0.013 seconds)
0: jdbc:hive2://hive-hiveserver2:1/defaul> SELECT AVG('text');
...
+---+
|  _c0  |
+---+
| NULL  |
+---+ {code}
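One way such a divergence can arise (a toy model in Python, hedged: this is not Hive's actual rewrite code): a direct AVG skips values that fail to decode, so an all-undecodable input yields NULL, whereas a rewrite into SUM/COUNT can end up dividing a coerced sum by the raw row count, producing 0.0.

```python
def try_parse(s):
    """Mimic casting a string to a double: undecodable values become None (NULL)."""
    try:
        return float(s)
    except ValueError:
        return None

rows = ["text"]
vals = [try_parse(r) for r in rows]

# Direct AVG: NULL values are skipped; an all-NULL input yields NULL.
non_null = [v for v in vals if v is not None]
avg_direct = sum(non_null) / len(non_null) if non_null else None

# Hypothetical rewritten AVG = SUM/COUNT, where the sum coerces
# undecodable values to 0 and the count still counts the original rows:
sum_coerced = sum(v if v is not None else 0.0 for v in vals)
avg_rewritten = sum_coerced / len(rows)

assert avg_direct is None      # matches the non-CBO result above
assert avg_rewritten == 0.0    # matches the CBO result above
```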



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-28080) Propagate statistics from a source table to the materialized CTE

2024-02-19 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-28080:
--
Status: Patch Available  (was: Open)

Pull Request

https://github.com/apache/hive/pull/5089

> Propagate statistics from a source table to the materialized CTE
> 
>
> Key: HIVE-28080
> URL: https://issues.apache.org/jira/browse/HIVE-28080
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Planning
>Affects Versions: 4.0.0-beta-1, 3.1.3, 2.3.8
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> Hive doesn't fill in the statistics of materialized CTEs, and the size of 
> those TableScans is underestimated. That causes Tez to run with fewer tasks 
> or to fail with OOM because MapJoin could be wrongly applied.
>  
> The following example shows Map 1 reading `src` generates 493 rows, but Map 3 
> reading `cte` is expected to scan only 1 row.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:1/defaul> EXPLAIN WITH cte AS (
> . . . . . . . . . . . . . . . . . . . . . . .>   SELECT * FROM src
> . . . . . . . . . . . . . . . . . . . . . . .> )
> . . . . . . . . . . . . . . . . . . . . . . .> SELECT *
> . . . . . . . . . . . . . . . . . . . . . . .> FROM cte a
> . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte b ON (a.key = b.key)
> . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte c ON (a.key = c.key);
> ...
> ++
> |                      Explain                       |
> ++
> | Plan optimized by CBO.                             |
> |                                                    |
> | Vertex dependency in Stage-4                       |
> | Map 2 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE) |
> | Reducer 4 <- Map 3 (SIMPLE_EDGE)                   |
> |                                                    |
> | Stage-3                                            |
> |   Fetch Operator                                   |
> |     limit:-1                                       |
> |     Stage-4                                        |
> |       Map 2 vectorized                             |
> |       File Output Operator [FS_69]                 |
> |         Map Join Operator [MAPJOIN_68] (rows=1 width=444) |
> |           
> Conds:MAPJOIN_67._col0=RS_61._col0(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5"]
>  |
> |         <-Map 3 [BROADCAST_EDGE] vectorized        |
> |           SHUFFLE [RS_61]                          |
> |             PartitionCols:_col0                    |
> |             Select Operator [SEL_60] (rows=1 width=368) |
> |               Output:["_col0","_col1"]             |
> |               Filter Operator [FIL_59] (rows=1 width=368) |
> |                 predicate:key is not null          |
> |                 TableScan [TS_11] (rows=1 width=368) |
> |                   
> default@cte,c,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
> |         <-Map Join Operator [MAPJOIN_67] (rows=1 width=404) |
> |             
> Conds:SEL_66._col0=RS_64._col0(Inner),Output:["_col0","_col1","_col2","_col3"]
>  |
> |           <-Reducer 4 [BROADCAST_EDGE] vectorized  |
> |             BROADCAST [RS_64]                      |
> |               PartitionCols:_col0                  |
> |               Select Operator [SEL_63]             |
> |                 Output:["_col0","_col1"]           |
> |           <-Select Operator [SEL_66] (rows=1 width=368) |
> |               Output:["_col0","_col1"]             |
> |               Filter Operator [FIL_65] (rows=1 width=368) |
> |                 predicate:key is not null          |
> |                 TableScan [TS_5] (rows=1 width=368) |
> |                   
> default@cte,a,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
> |         Stage-2                                    |
> |           Dependency Collection{}                  |
> |             Stage-1                                |
> |               Map 1 vectorized                     |
> |               File Output Operator [FS_4]          |
> |                 table:{"name:":"default.cte"}      |
> |                 Select Operator [SEL_3] (rows=493 width=350) |
> |                   Output:["_col0","_col1"]         |
> |                   TableScan [TS_0] (rows=493 width=350) |
> |                     
> default@src,src,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
> |         Stage-0                                    |
> |           Move Operator                            |
> |              Please refer to the previous Stage-1  |
> |                                                    |
> ++ {code}



--
This message was sent by 

[jira] [Created] (HIVE-28080) Propagate statistics from a source table to the materialized CTE

2024-02-18 Thread okumin (Jira)
okumin created HIVE-28080:
-

 Summary: Propagate statistics from a source table to the 
materialized CTE
 Key: HIVE-28080
 URL: https://issues.apache.org/jira/browse/HIVE-28080
 Project: Hive
  Issue Type: Improvement
  Components: Query Planning
Affects Versions: 4.0.0-beta-1, 3.1.3, 2.3.8
Reporter: okumin
Assignee: okumin


Hive doesn't fill in the statistics of materialized CTEs, and the size of those 
TableScans is underestimated. That causes Tez to run with fewer tasks or to 
fail with OOM because MapJoin could be wrongly applied.

 

The following example shows Map 1 reading `src` generates 493 rows, but Map 3 
reading `cte` is expected to scan only 1 row.
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> EXPLAIN WITH cte AS (
. . . . . . . . . . . . . . . . . . . . . . .>   SELECT * FROM src
. . . . . . . . . . . . . . . . . . . . . . .> )
. . . . . . . . . . . . . . . . . . . . . . .> SELECT *
. . . . . . . . . . . . . . . . . . . . . . .> FROM cte a
. . . . . . . . . . . . . . . . . . . . . . .> JOIN cte b ON (a.key = b.key)
. . . . . . . . . . . . . . . . . . . . . . .> JOIN cte c ON (a.key = c.key);
...
++
|                      Explain                       |
++
| Plan optimized by CBO.                             |
|                                                    |
| Vertex dependency in Stage-4                       |
| Map 2 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE) |
| Reducer 4 <- Map 3 (SIMPLE_EDGE)                   |
|                                                    |
| Stage-3                                            |
|   Fetch Operator                                   |
|     limit:-1                                       |
|     Stage-4                                        |
|       Map 2 vectorized                             |
|       File Output Operator [FS_69]                 |
|         Map Join Operator [MAPJOIN_68] (rows=1 width=444) |
|           
Conds:MAPJOIN_67._col0=RS_61._col0(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5"]
 |
|         <-Map 3 [BROADCAST_EDGE] vectorized        |
|           SHUFFLE [RS_61]                          |
|             PartitionCols:_col0                    |
|             Select Operator [SEL_60] (rows=1 width=368) |
|               Output:["_col0","_col1"]             |
|               Filter Operator [FIL_59] (rows=1 width=368) |
|                 predicate:key is not null          |
|                 TableScan [TS_11] (rows=1 width=368) |
|                   default@cte,c,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
|         <-Map Join Operator [MAPJOIN_67] (rows=1 width=404) |
|             
Conds:SEL_66._col0=RS_64._col0(Inner),Output:["_col0","_col1","_col2","_col3"] |
|           <-Reducer 4 [BROADCAST_EDGE] vectorized  |
|             BROADCAST [RS_64]                      |
|               PartitionCols:_col0                  |
|               Select Operator [SEL_63]             |
|                 Output:["_col0","_col1"]           |
|           <-Select Operator [SEL_66] (rows=1 width=368) |
|               Output:["_col0","_col1"]             |
|               Filter Operator [FIL_65] (rows=1 width=368) |
|                 predicate:key is not null          |
|                 TableScan [TS_5] (rows=1 width=368) |
|                   default@cte,a,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
|         Stage-2                                    |
|           Dependency Collection{}                  |
|             Stage-1                                |
|               Map 1 vectorized                     |
|               File Output Operator [FS_4]          |
|                 table:{"name:":"default.cte"}      |
|                 Select Operator [SEL_3] (rows=493 width=350) |
|                   Output:["_col0","_col1"]         |
|                   TableScan [TS_0] (rows=493 width=350) |
|                     
default@src,src,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
|         Stage-0                                    |
|           Move Operator                            |
|              Please refer to the previous Stage-1  |
|                                                    |
++ {code}
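The risk described above can be sketched as a size-based join selection: when stats are not propagated, the materialized CTE scan is estimated at 1 row, so the planner happily broadcasts it as a MapJoin even if the real input is large. This is a minimal illustrative model in Python; the threshold name and value are assumptions (loosely modeled on a setting like `hive.auto.convert.join.noconditionaltask.size`), not Hive's actual planner logic.

```python
# Assumed broadcast-size threshold for this sketch (hypothetical value).
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024

def choose_join(estimated_rows, avg_row_width):
    """Pick a join strategy from the estimated size of the small side."""
    estimated_bytes = estimated_rows * avg_row_width
    return "mapjoin" if estimated_bytes <= BROADCAST_THRESHOLD_BYTES else "shuffle_join"

# Without propagated stats, the CTE scan looks like 1 row of width 368
# (as in the plan above), so a MapJoin is chosen:
assert choose_join(1, 368) == "mapjoin"

# With realistic stats, a large CTE would exceed the threshold and the
# planner would fall back to a shuffle join instead of risking OOM:
assert choose_join(1_000_000, 368) == "shuffle_join"
```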



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27858) OOM happens when selecting many columns and JOIN.

2024-02-14 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817351#comment-17817351
 ] 

okumin commented on HIVE-27858:
---

I assume [~jfs] tested it with `hive.cbo.fallback.strategy=NEVER`. As far as I 
have investigated, HIVE-24167 makes CBO fail, but the non-CBO path is unlikely 
to fail. Please correct us if our assumption is wrong.

> OOM happens when selecting many columns and  JOIN.
> --
>
> Key: HIVE-27858
> URL: https://issues.apache.org/jira/browse/HIVE-27858
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Affects Versions: 4.0.0-beta-1
>Reporter: Ryu Kobayashi
>Assignee: okumin
>Priority: Critical
>  Labels: hive-4.0.0-must
> Fix For: Not Applicable
>
> Attachments: HIVE-27858.full.q, HIVE-27858.less.columns.q, ddl.sql, 
> query.sql
>
>
> OOM happens when executing [^query.sql] using a table in [^ddl.sql]. This 
> did not happen in Hive 2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-24167) TPC-DS query 14 fails while generating plan for the filter

2024-02-14 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817258#comment-17817258
 ] 

okumin commented on HIVE-24167:
---

I'm updating the current summary in case another member joins the 
discussion.
 * PlanMapper is not easy to understand. I recommend reading [my 
note|https://gist.github.com/okumin/b111fe0a911507bdf6a7204f49b9cb72] if you 
are not familiar with the feature
 * [#5037|https://github.com/apache/hive/pull/5037] is the initial pull 
request. I implemented one of the options mentioned in the above gist
 * I am now working on [#5077|https://github.com/apache/hive/pull/5077] to 
implement another option based on advice from [~kgyrtkirk] 

> TPC-DS query 14 fails while generating plan for the filter
> --
>
> Key: HIVE-24167
> URL: https://issues.apache.org/jira/browse/HIVE-24167
> Project: Hive
>  Issue Type: Sub-task
>  Components: CBO
>Reporter: Stamatis Zampetakis
>Assignee: okumin
>Priority: Major
>  Labels: hive-4.1.0-must, pull-request-available
>
> TPC-DS query 14 (cbo_query14.q and query4.q) fail with NPE on the metastore 
> with the partitioned TPC-DS 30TB dataset while generating the plan for the 
> filter.
> The problem can be reproduced using the PR in HIVE-23965.
> The current stacktrace shows that the NPE appears while trying to display the 
> debug message but even if this line didn't exist it would fail again later on.
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10867)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlanForSubQueryPredicate(SemanticAnalyzer.java:3375)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:3473)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10819)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:12417)
> at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:718)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12519)
> at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:443)
> at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
> at 
> org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:171)
> at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
> at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:220)
> at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:104)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:173)
> at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:414)
> at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:363)
> at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:357)
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:129)
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:231)
> at 
> 

[jira] [Updated] (HIVE-27957) Better error message for STORED BY

2024-02-10 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-27957:
--
Status: Patch Available  (was: Open)

> Better error message for STORED BY
> --
>
> Key: HIVE-27957
> URL: https://issues.apache.org/jira/browse/HIVE-27957
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0-beta-1
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>
> We'd like to see a slightly kinder message when an invalid identifier or 
> literal is given. It is especially confusing when one mistakenly types 
> STORED BY instead of STORED AS.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:1/defaul> CREATE TABLE test (a STRING) 
> STORED BY ORC TBLPROPERTIES ("orc.compress"="ZSTD");
> Error: Error while compiling statement: FAILED: SemanticException Cannot find 
> class '' (state=42000,code=4){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27984) Support backward compatibility of hms thrift struct about column stats

2024-01-31 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813092#comment-17813092
 ] 

okumin commented on HIVE-27984:
---

[~zhangbutao] Thank you! That is great. Can we also backport the change to 
branch-4.0?

https://github.com/apache/hive/tree/branch-4.0

> Support backward compatibility of hms thrift struct about column stats
> --
>
> Key: HIVE-27984
> URL: https://issues.apache.org/jira/browse/HIVE-27984
> Project: Hive
>  Issue Type: Improvement
>  Components: Standalone Metastore
>Reporter: Butao Zhang
>Assignee: Butao Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> We had some discussions about the column stats thrift API issue before: 
> [https://www.mail-archive.com/dev@hive.apache.org/msg144294.html] ,
> and I also attempted to fix this issue on the TrinoDB side while playing 
> with TrinoDB: [https://github.com/trinodb/trino-hive-apache/pull/43] .
> Recently, I tried to use StarRocks ([https://github.com/StarRocks/starrocks]), 
> which uses an HMS client compatible with Hive 3, to query Hive 4, and I also 
> encountered this 
> issue ([https://github.com/StarRocks/starrocks/issues/38620]), 
>  
> {code:java}
> Caused by: org.apache.thrift.TApplicationException: Required field 'engine' 
> is unset! Struct:TableStatsRequest(dbName:testdb, tblName:testsr, 
> colNames:[id], engine:null)
>         at 
> org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) 
> ~[libthrift-0.13.0.jar:0.13.0]
>         at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table_statistics_req(ThriftHiveMetastore.java:4161)
>  ~[hive-apache-3.1.2-13.jar:?]
>         at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table_statistics_req(ThriftHiveMetastore.java:4148)
>  ~[hive-apache-3.1.2-13.jar:?]
>         at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTableColumnStatistics(HiveMetaStoreClient.java:735)
>  ~[starrocks-fe.jar:?]
>         at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTableColumnStatistics(HiveMetaStoreClient.java:727)
>  ~[starrocks-fe.jar:?]
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:1.8.0_221]
>         at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_221]
>         at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:1.8.0_221]
>         at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_221]
>         at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:208)
>  ~[hive-apache-3.1.2-13.jar:?]
>         at com.sun.proxy.$Proxy32.getTableColumnStatistics(Unknown Source) 
> ~[?:?]
>  {code}
>  
> In addition, I also found a similar ticket in Impala: 
> https://issues.apache.org/jira/browse/IMPALA-12539 .
> That is to say, if other components (e.g. Impala) want to use a 
> Hive 3 HMS client to get column stats for CBO, they will get this 
> exception from Hive 4. I think it is necessary to support backward 
> compatibility so that other components work well with Hive 4 (HMS 4).
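The compatibility problem above (`Required field 'engine' is unset!`) can be modeled with a tiny sketch: when a new field is treated as required, requests from older clients that never set it are rejected, whereas a backward-compatible server defaults it instead. This is a toy Python model, not Thrift's generated code; the default value `"hive"` is an assumption for illustration.

```python
def parse_table_stats_request(fields, engine_required=True):
    """Toy model of server-side validation of a TableStatsRequest struct."""
    if engine_required and fields.get("engine") is None:
        # Mirrors the TApplicationException an old client sees from Hive 4.
        raise ValueError("Required field 'engine' is unset!")
    # A lenient server fills in a default for the missing field instead.
    return {"engine": "hive", **fields} if fields.get("engine") is None else dict(fields)

# A Hive 3 era client builds the request without an 'engine' field:
old_client_request = {"dbName": "testdb", "tblName": "testsr", "colNames": ["id"]}

try:
    parse_table_stats_request(old_client_request)
    rejected = False
except ValueError:
    rejected = True
assert rejected  # strict server breaks the old client

# A backward-compatible server accepts the same request with a default:
assert parse_table_stats_request(old_client_request, engine_required=False)["engine"] == "hive"
```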



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27858) OOM happens when selecting many columns and JOIN.

2024-01-31 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17813088#comment-17813088
 ] 

okumin commented on HIVE-27858:
---

[~ngangam] Can I ask you about the current status?

Ryu, the reporter, and I have also started digging into this issue and may 
find some clues. Thanks.

> OOM happens when selecting many columns and  JOIN.
> --
>
> Key: HIVE-27858
> URL: https://issues.apache.org/jira/browse/HIVE-27858
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Affects Versions: 4.0.0-beta-1
>Reporter: Ryu Kobayashi
>Assignee: Naveen Gangam
>Priority: Critical
>  Labels: hive-4.0.0-must
> Fix For: Not Applicable
>
> Attachments: HIVE-27858.full.q, HIVE-27858.less.columns.q, ddl.sql, 
> query.sql
>
>
> OOM happens when executing [^query.sql] using a table in [^ddl.sql]. This 
> did not happen in Hive 2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-24167) TPC-DS query 14 fails while generating plan for the filter

2024-01-29 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-24167:
--
Status: Patch Available  (was: Open)

I implemented one of the approaches in my mind.

[https://github.com/apache/hive/pull/5037]

 

Also, I wrote down what I've investigated to resolve this issue.

https://gist.github.com/okumin/b111fe0a911507bdf6a7204f49b9cb72

> TPC-DS query 14 fails while generating plan for the filter
> --
>
> Key: HIVE-24167
> URL: https://issues.apache.org/jira/browse/HIVE-24167
> Project: Hive
>  Issue Type: Sub-task
>  Components: CBO
>Reporter: Stamatis Zampetakis
>Assignee: okumin
>Priority: Major
>  Labels: hive-4.1.0-must, pull-request-available
>
> TPC-DS query 14 (cbo_query14.q and query4.q) fail with NPE on the metastore 
> with the partitioned TPC-DS 30TB dataset while generating the plan for the 
> filter.
> The problem can be reproduced using the PR in HIVE-23965.
> The current stacktrace shows that the NPE appears while trying to display the 
> debug message but even if this line didn't exist it would fail again later on.
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10867)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlanForSubQueryPredicate(SemanticAnalyzer.java:3375)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:3473)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10819)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:12417)
> at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:718)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12519)
> at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:443)
> at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
> at 
> org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:171)
> at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
> at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:220)
> at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:104)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:173)
> at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:414)
> at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:363)
> at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:357)
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:129)
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:231)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:203)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:129)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:424)
> 

[jira] [Commented] (HIVE-24167) TPC-DS query 14 fails while generating plan for the filter

2024-01-23 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17810178#comment-17810178
 ] 

okumin commented on HIVE-24167:
---

Let me take this over since we plan to use CTE materialization with Hive 4. I 
know the root cause and am close to a fix.

> TPC-DS query 14 fails while generating plan for the filter
> --
>
> Key: HIVE-24167
> URL: https://issues.apache.org/jira/browse/HIVE-24167
> Project: Hive
>  Issue Type: Sub-task
>  Components: CBO
>Reporter: Stamatis Zampetakis
>Assignee: okumin
>Priority: Major
>  Labels: hive-4.1.0-must
>
> TPC-DS query 14 (cbo_query14.q and query4.q) fail with NPE on the metastore 
> with the partitioned TPC-DS 30TB dataset while generating the plan for the 
> filter.
> The problem can be reproduced using the PR in HIVE-23965.
> The current stacktrace shows that the NPE appears while trying to display the 
> debug message but even if this line didn't exist it would fail again later on.
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10867)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlanForSubQueryPredicate(SemanticAnalyzer.java:3375)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:3473)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10819)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:12417)
> at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:718)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12519)
> at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:443)
> at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
> at 
> org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:171)
> at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
> at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:220)
> at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:104)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:173)
> at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:414)
> at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:363)
> at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:357)
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:129)
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:231)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:203)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:129)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:424)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:355)
> at 
> 

[jira] [Assigned] (HIVE-24167) TPC-DS query 14 fails while generating plan for the filter

2024-01-23 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin reassigned HIVE-24167:
-

Assignee: okumin  (was: Zoltan Haindrich)

> TPC-DS query 14 fails while generating plan for the filter
> --
>
> Key: HIVE-24167
> URL: https://issues.apache.org/jira/browse/HIVE-24167
> Project: Hive
>  Issue Type: Sub-task
>  Components: CBO
>Reporter: Stamatis Zampetakis
>Assignee: okumin
>Priority: Major
>  Labels: hive-4.1.0-must
>
> TPC-DS query 14 (cbo_query14.q and query4.q) fail with NPE on the metastore 
> with the partitioned TPC-DS 30TB dataset while generating the plan for the 
> filter.
> The problem can be reproduced using the PR in HIVE-23965.
> The current stacktrace shows that the NPE appears while trying to display the 
> debug message but even if this line didn't exist it would fail again later on.
> {noformat}
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10867)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlanForSubQueryPredicate(SemanticAnalyzer.java:3375)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:3473)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10819)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11765)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11625)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11622)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11649)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11635)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:12417)
> at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:718)
> at 
> org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12519)
> at 
> org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:443)
> at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
> at 
> org.apache.hadoop.hive.ql.parse.ExplainSemanticAnalyzer.analyzeInternal(ExplainSemanticAnalyzer.java:171)
> at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
> at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:220)
> at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:104)
> at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:173)
> at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:414)
> at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:363)
> at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:357)
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:129)
> at 
> org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:231)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:258)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:203)
> at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:129)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:424)
> at 
> org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:355)
> at 
> org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:740)
> at 
> 

[jira] [Created] (HIVE-27957) Better error message for STORED BY

2023-12-15 Thread okumin (Jira)
okumin created HIVE-27957:
-

 Summary: Better error message for STORED BY
 Key: HIVE-27957
 URL: https://issues.apache.org/jira/browse/HIVE-27957
 Project: Hive
  Issue Type: Improvement
Affects Versions: 4.0.0-beta-1
Reporter: okumin
Assignee: okumin


We'd like to see a friendlier message when an invalid identifier or literal is 
given. It is especially confusing when STORED BY is mistakenly typed instead 
of STORED AS.
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> CREATE TABLE test (a STRING) 
STORED BY ORC TBLPROPERTIES ("orc.compress"="ZSTD");
Error: Error while compiling statement: FAILED: SemanticException Cannot find 
class '' (state=42000,code=4){code}
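One possible direction for the improvement above, sketched with hypothetical names in Python (this is not the actual Hive parser code): detect when STORED BY receives a well-known file-format keyword and suggest STORED AS instead of reporting an empty class name.

```python
# Hypothetical sketch: friendlier diagnostics when a file format name
# is passed to STORED BY instead of a storage handler class.
# KNOWN_FILE_FORMATS and resolve_storage_handler are assumed names.
KNOWN_FILE_FORMATS = {"ORC", "PARQUET", "TEXTFILE", "AVRO", "RCFILE"}

def resolve_storage_handler(name):
    if not name:
        raise ValueError("STORED BY requires a storage handler class name")
    if name.upper() in KNOWN_FILE_FORMATS:
        raise ValueError(
            f"'{name}' looks like a file format; did you mean STORED AS {name.upper()}?")
    return name
```

With such a check, `STORED BY ORC` would produce a hint pointing at STORED AS rather than "Cannot find class ''".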



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (HIVE-27950) STACK UDTF returns wrong results when # of argument is not a multiple of N

2023-12-12 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-27950:
--
Status: Patch Available  (was: Open)

> STACK UDTF returns wrong results when # of argument is not a multiple of N
> --
>
> Key: HIVE-27950
> URL: https://issues.apache.org/jira/browse/HIVE-27950
> Project: Hive
>  Issue Type: Bug
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>
> GenericUDTFStack nullifies the wrong cell when the number of values is not a 
> multiple of N. In the following case, the `col2` column of the last row 
> should be `NULL`, but `col1` is `NULL` instead.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:1/defaul> select stack(2, 'a', 'b', 'c', 
> 'd', 'e');
> +---+---+---+
> | col0  | col1  | col2  |
> +---+---+---+
> | a     | b     | c     |
> | d     | NULL  | c     |
> +---+---+---+{code}





[jira] [Updated] (HIVE-27950) STACK UDTF returns wrong results when # of argument is not a multiple of N

2023-12-11 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-27950:
--
Description: 
GenericUDTFStack nullifies a wrong cell when the number of values is 
indivisible. In the following case, the `col2` column of the last row should be 
`NULL`. But, `col1` is NULL somehow. 
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> select stack(2, 'a', 'b', 'c', 
'd', 'e');
+---+---+---+
| col0  | col1  | col2  |
+---+---+---+
| a     | b     | c     |
| d     | NULL  | c     |
+---+---+---+{code}

  was:
`STACK` nullifies the wrong cell when the number of values is not a multiple 
of N. In the following case, the `col2` column of the last row should be 
`NULL`, but `col1` is `NULL` instead.
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> select stack(2, 'a', 'b', 'c', 
'd', 'e');
+---+---+---+
| col0  | col1  | col2  |
+---+---+---+
| a     | b     | c     |
| d     | NULL  | c     |
+---+---+---+{code}


> STACK UDTF returns wrong results when # of argument is not a multiple of N
> --
>
> Key: HIVE-27950
> URL: https://issues.apache.org/jira/browse/HIVE-27950
> Project: Hive
>  Issue Type: Bug
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> GenericUDTFStack nullifies the wrong cell when the number of values is not 
> a multiple of N. In the following case, the `col2` column of the last row 
> should be `NULL`, but `col1` is `NULL` instead.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:1/defaul> select stack(2, 'a', 'b', 'c', 
> 'd', 'e');
> +---+---+---+
> | col0  | col1  | col2  |
> +---+---+---+
> | a     | b     | c     |
> | d     | NULL  | c     |
> +---+---+---+{code}





[jira] [Created] (HIVE-27950) STACK UDTF returns wrong results when # of argument is not a multiple of N

2023-12-11 Thread okumin (Jira)
okumin created HIVE-27950:
-

 Summary: STACK UDTF returns wrong results when # of argument is 
not a multiple of N
 Key: HIVE-27950
 URL: https://issues.apache.org/jira/browse/HIVE-27950
 Project: Hive
  Issue Type: Bug
Reporter: okumin
Assignee: okumin


`STACK` nullifies the wrong cell when the number of values is not a multiple 
of N. In the following case, the `col2` column of the last row should be 
`NULL`, but `col1` is `NULL` instead.
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> select stack(2, 'a', 'b', 'c', 
'd', 'e');
+---+---+---+
| col0  | col1  | col2  |
+---+---+---+
| a     | b     | c     |
| d     | NULL  | c     |
+---+---+---+{code}
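The intended semantics can be sketched as follows. This is a hypothetical Python model of what stack(n, v1..vk) should produce, not the GenericUDTFStack implementation: values fill the n rows left to right, and only the trailing cells of the last row are padded with NULL.

```python
# Sketch of the intended stack(n, v1..vk) behavior: values are laid out
# row by row, and only missing trailing cells become None (NULL).
def stack(n, *values):
    cols = -(-len(values) // n)  # ceil(len(values) / n) output columns
    return [
        [values[r * cols + c] if r * cols + c < len(values) else None
         for c in range(cols)]
        for r in range(n)
    ]
```

For stack(2, 'a', 'b', 'c', 'd', 'e') this yields [['a', 'b', 'c'], ['d', 'e', None]], i.e. `col2` of the last row is NULL, which is the expected output in the report above.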





[jira] [Commented] (HIVE-3580) STACK() crashes if an argument is a literal NULL

2023-12-11 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-3580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17795526#comment-17795526
 ] 

okumin commented on HIVE-3580:
--

I guess this has been resolved by HIVE-8642 or another ticket, so I will 
resolve it.
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> SELECT STACK(3, 1.0, 'a', NULL, 
'b', 2.0, 'c');
INFO  : Compiling 
command(queryId=hive_20231211223258_40410dc5-a0e3-4635-8d19-8bb010417b63): 
SELECT STACK(3, 1.0, 'a', NULL, 'b', 2.0, 'c')
INFO  : Semantic Analysis Completed (retrial = false)
INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:col0, 
type:decimal(1,0), comment:null), FieldSchema(name:col1, type:string, 
comment:null)], properties:null)
INFO  : Completed compiling 
command(queryId=hive_20231211223258_40410dc5-a0e3-4635-8d19-8bb010417b63); Time 
taken: 0.627 seconds
INFO  : Concurrency mode is disabled, not creating a lock manager
INFO  : Executing 
command(queryId=hive_20231211223258_40410dc5-a0e3-4635-8d19-8bb010417b63): 
SELECT STACK(3, 1.0, 'a', NULL, 'b', 2.0, 'c')
INFO  : Completed executing 
command(queryId=hive_20231211223258_40410dc5-a0e3-4635-8d19-8bb010417b63); Time 
taken: 0.001 seconds
+---+---+
| col0  | col1  |
+---+---+
| 1     | a     |
| NULL  | b     |
| 2     | c     |
+---+---+ {code}

> STACK() crashes if an argument is a literal NULL
> ---
>
> Key: HIVE-3580
> URL: https://issues.apache.org/jira/browse/HIVE-3580
> Project: Hive
>  Issue Type: Bug
>Reporter: Jonathan Chang
>Priority: Minor
>
> SELECT foo, bar FROM table LATERAL VIEW STACK(3, 1.0, 'a', NULL, 'b', 2.0, 
> 'c') z AS foo, bar 
> fails.  Type conversion between the NULL and the doubles is probably broken 
> in the UDF.





[jira] [Updated] (HIVE-27905) Some GenericUDFs wrongly cast ObjectInspectors

2023-12-04 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-27905:
--
Description: 
For example, GenericUDFSplit throws ClassCastException when a non-primitive 
type is given.
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> select split(array('a,b,c'), 
',');
Error: Error while compiling statement: FAILED: ClassCastException 
org.apache.hadoop.hive.serde2.objectinspector.StandardConstantListObjectInspector
 cannot be cast to 
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector 
(state=42000,code=4) {code}
 

  was:
GenericUDFSplit throws ClassCastException when a non-primitive type is given.
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> select split(array('a,b,c'), 
',');
Error: Error while compiling statement: FAILED: ClassCastException 
org.apache.hadoop.hive.serde2.objectinspector.StandardConstantListObjectInspector
 cannot be cast to 
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector 
(state=42000,code=4) {code}

Summary: Some GenericUDFs wrongly cast ObjectInspectors  (was: SPLIT 
throws ClassCastException)

> Some GenericUDFs wrongly cast ObjectInspectors
> --
>
> Key: HIVE-27905
> URL: https://issues.apache.org/jira/browse/HIVE-27905
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0-beta-1
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>
> For example, GenericUDFSplit throws ClassCastException when a non-primitive 
> type is given.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:1/defaul> select split(array('a,b,c'), 
> ',');
> Error: Error while compiling statement: FAILED: ClassCastException 
> org.apache.hadoop.hive.serde2.objectinspector.StandardConstantListObjectInspector
>  cannot be cast to 
> org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector 
> (state=42000,code=4) {code}
>  





[jira] [Updated] (HIVE-27905) SPLIT throws ClassCastException

2023-11-24 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-27905:
--
Status: Patch Available  (was: Open)

> SPLIT throws ClassCastException
> ---
>
> Key: HIVE-27905
> URL: https://issues.apache.org/jira/browse/HIVE-27905
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0-beta-1
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>
> GenericUDFSplit throws ClassCastException when a non-primitive type is given.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:1/defaul> select split(array('a,b,c'), 
> ',');
> Error: Error while compiling statement: FAILED: ClassCastException 
> org.apache.hadoop.hive.serde2.objectinspector.StandardConstantListObjectInspector
>  cannot be cast to 
> org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector 
> (state=42000,code=4) {code}





[jira] [Created] (HIVE-27905) SPLIT throws ClassCastException

2023-11-22 Thread okumin (Jira)
okumin created HIVE-27905:
-

 Summary: SPLIT throws ClassCastException
 Key: HIVE-27905
 URL: https://issues.apache.org/jira/browse/HIVE-27905
 Project: Hive
  Issue Type: Improvement
Affects Versions: 4.0.0-beta-1
Reporter: okumin
Assignee: okumin


GenericUDFSplit throws ClassCastException when a non-primitive type is given.
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> select split(array('a,b,c'), 
',');
Error: Error while compiling statement: FAILED: ClassCastException 
org.apache.hadoop.hive.serde2.objectinspector.StandardConstantListObjectInspector
 cannot be cast to 
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector 
(state=42000,code=4) {code}
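The direction of the fix can be illustrated with a hedged sketch (hypothetical names, not the actual Hive ObjectInspector API): a GenericUDF's initialize step should verify each inspector's category before downcasting, so a misuse like split(array(...), ',') yields a descriptive argument-type error instead of a ClassCastException.

```python
# Hypothetical sketch: validate an inspector's category before casting.
# UDFArgumentTypeException and require_primitive are assumed names; in
# Hive the category would come from ObjectInspector.getCategory().
class UDFArgumentTypeException(Exception):
    pass

def require_primitive(arg_index, category):
    if category != "PRIMITIVE":
        raise UDFArgumentTypeException(
            f"Argument {arg_index}: primitive type expected, got {category}")
    return category
```

With such a guard, the LIST-typed first argument in the example above would be rejected at compile time with a message naming the offending argument.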





[jira] [Updated] (HIVE-27825) Better error message for an empty quoted identifier

2023-10-25 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-27825:
--
Status: Patch Available  (was: Open)

> Better error message for an empty quoted identifier
> ---
>
> Key: HIVE-27825
> URL: https://issues.apache.org/jira/browse/HIVE-27825
> Project: Hive
>  Issue Type: Improvement
>  Components: Parser
>Affects Versions: 4.0.0-beta-1
>Reporter: okumin
>Assignee: okumin
>Priority: Minor
>  Labels: pull-request-available
>
> SemanticAnalyzer throws a StringIndexOutOfBoundsException when a quoted 
> identifier is empty. This is likely to happen when a template system fails 
> to fill in a database or table name. We should return a better message so 
> that a user can easily identify what is wrong.
>  
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:1/defaul> SELECT * FROM ``;
> Error: Error while compiling statement: FAILED: 
> StringIndexOutOfBoundsException String index out of range: 0 
> (state=42000,code=4) {code}
> This is the stacktrace on HiveServer2.
> {code:java}
> org.apache.hive.service.cli.HiveSQLException: Error while compiling 
> statement: FAILED: StringIndexOutOfBoundsException String index out of range: > 0
>      at 
> org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:376)
>  ~[hive-service-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:214)
>  ~[hive-service-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
> ...
>  Caused by: java.lang.StringIndexOutOfBoundsException: String index out of 
> range: 0
>      at java.lang.String.charAt(String.java:658) ~[?:1.8.0_275]
>      at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.unescapeIdentifier(BaseSemanticAnalyzer.java:608)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
>      at 
> org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.getDbTableNamePair(BaseSemanticAnalyzer.java:424)
>  ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]{code}





[jira] [Created] (HIVE-27825) Better error message for an empty quoted identifier

2023-10-25 Thread okumin (Jira)
okumin created HIVE-27825:
-

 Summary: Better error message for an empty quoted identifier
 Key: HIVE-27825
 URL: https://issues.apache.org/jira/browse/HIVE-27825
 Project: Hive
  Issue Type: Improvement
  Components: Parser
Affects Versions: 4.0.0-beta-1
Reporter: okumin
Assignee: okumin


SemanticAnalyzer throws a StringIndexOutOfBoundsException when a quoted 
identifier is empty. This is likely to happen when a template system fails to 
fill in a database or table name. We should return a better message so that a 
user can easily identify what is wrong.

 
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> SELECT * FROM ``;
Error: Error while compiling statement: FAILED: StringIndexOutOfBoundsException 
String index out of range: 0 (state=42000,code=4) {code}
This is the stacktrace on HiveServer2.
{code:java}
org.apache.hive.service.cli.HiveSQLException: Error while compiling statement: 
FAILED: StringIndexOutOfBoundsException String index out of range: 0
     at 
org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:376)
 ~[hive-service-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.hive.service.cli.operation.SQLOperation.prepare(SQLOperation.java:214)
 ~[hive-service-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
...
 Caused by: java.lang.StringIndexOutOfBoundsException: String index out of 
range: 0
     at java.lang.String.charAt(String.java:658) ~[?:1.8.0_275]
     at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.unescapeIdentifier(BaseSemanticAnalyzer.java:608)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]
     at 
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.getDbTableNamePair(BaseSemanticAnalyzer.java:424)
 ~[hive-exec-4.0.0-beta-2-SNAPSHOT.jar:4.0.0-beta-2-SNAPSHOT]{code}
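The guard implied by the stacktrace above can be sketched as follows. This is hypothetical Python loosely mirroring BaseSemanticAnalyzer.unescapeIdentifier (names and exact unescaping rules assumed, not copied from Hive source): check for an empty identifier before indexing into it, so the analyzer can raise a semantic error instead of a raw StringIndexOutOfBoundsException.

```python
# Hypothetical sketch: reject an empty quoted identifier up front.
# Today, charAt(0) on the empty string is what blows up.
def unescape_identifier(ident):
    if not ident:
        raise ValueError("Empty identifier `` is not allowed")
    # Strip surrounding backquotes and collapse doubled ones: `a``b` -> a`b
    if ident[0] == '`' and ident[-1] == '`':
        return ident[1:-1].replace('``', '`')
    return ident
```

With the guard in place, SELECT * FROM `` could report something like "Empty identifier is not allowed" instead of "String index out of range: 0".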
 

 

 

 





[jira] [Comment Edited] (HIVE-27734) Add Icenerg's storage-partitioned join capabilities to Hive's [sorted-]bucket-map-join

2023-10-23 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778749#comment-17778749
 ] 

okumin edited comment on HIVE-27734 at 10/23/23 5:05 PM:
-

This is my first attempt at implementing Bucket Map Join, in a very simple case.

[https://github.com/okumin/hive/commits/POC-BMJ-ICEBERG2]

Test data.
{code:java}
SET hive.disable.unsafe.external.table.operations=false;
SET hive.fetch.task.conversion = none;DROP TABLE IF EXISTS 
default.iceberg_bucketed PURGE;
DROP TABLE IF EXISTS default.iceberg_non_bucketed PURGE;
CREATE TABLE default.iceberg_bucketed (id int, txt string) PARTITIONED BY SPEC 
(bucket(8, id)) STORED BY ICEBERG;
CREATE TABLE default.iceberg_non_bucketed (id int, txt string) STORED BY 
ICEBERG;
INSERT INTO default.iceberg_bucketed VALUES 
(1,'1'),(2,'2'),(3,'3'),(4,'4'),(5,'5'),(6,'6'),(7,'7'),(8,'8');
INSERT INTO default.iceberg_non_bucketed VALUES 
(1,'1'),(2,'2'),(3,'3'),(4,'4'); {code}
EXPLAIN
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> EXPLAIN EXTENDED SELECT * FROM 
default.iceberg_bucketed f INNER JOIN default.iceberg_non_bucketed d ON f.id = 
d.id;
...
|                       Map Join Operator            |
|                         condition map:             |
|                              Inner Join 0 to 1     |
|                         Estimated key counts: Map 2 => 1 |
|                         keys:                      |
|                           0 _col0 (type: int)      |
|                           1 _col0 (type: int)      |
|                         outputColumnNames: _col0, _col1, _col2, _col3 |
|                         input vertices:            |
|                           1 Map 2                  |
|                         Position of Big Table: 0   |
|                         Statistics: Num rows: 4 Data size: 712 Basic stats: 
COMPLETE Column stats: COMPLETE |
|                         BucketMapJoin: true        |
|                         File Output Operator       |{code}
It succeeded in joining records.
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> SELECT * FROM 
default.iceberg_bucketed f INNER JOIN default.iceberg_non_bucketed d ON f.id = 
d.id;
...
--
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  
FAILED  KILLED  
--
Map 2 .. container     SUCCEEDED      1          1        0        0    
   0       0  
Map 1 .. container     SUCCEEDED      5          5        0        0    
   0       0  
--
VERTICES: 02/02  [==>>] 100%  ELAPSED TIME: 12.78 s    
--
INFO  : Completed executing 
command(queryId=hive_20231023164515_ec38118a-9ba7-471a-914c-f53866e09ab9); Time 
taken: 12.979 seconds
+---++---++
| f.id  | f.txt  | d.id  | d.txt  |
+---++---++
| 3     | 3      | 3     | 3      |
| 1     | 1      | 1     | 1      |
| 2     | 2      | 2     | 2      |
| 4     | 4      | 4     | 4      |
+---++---++ {code}


was (Author: okumin):
This is my first attempt at implementing Bucket Map Join.

[https://github.com/okumin/hive/commits/POC-BMJ-ICEBERG2]

Test data.
{code:java}
SET hive.disable.unsafe.external.table.operations=false;
SET hive.fetch.task.conversion = none;DROP TABLE IF EXISTS 
default.iceberg_bucketed PURGE;
DROP TABLE IF EXISTS default.iceberg_non_bucketed PURGE;
CREATE TABLE default.iceberg_bucketed (id int, txt string) PARTITIONED BY SPEC 
(bucket(8, id)) STORED BY ICEBERG;
CREATE TABLE default.iceberg_non_bucketed (id int, txt string) STORED BY 
ICEBERG;
INSERT INTO default.iceberg_bucketed VALUES 
(1,'1'),(2,'2'),(3,'3'),(4,'4'),(5,'5'),(6,'6'),(7,'7'),(8,'8');
INSERT INTO default.iceberg_non_bucketed VALUES 
(1,'1'),(2,'2'),(3,'3'),(4,'4'); {code}
EXPLAIN
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> EXPLAIN EXTENDED SELECT * FROM 
default.iceberg_bucketed f INNER JOIN default.iceberg_non_bucketed d ON f.id = 
d.id;
...
|                       Map Join Operator            |
|                         condition map:             |
|                              Inner Join 0 to 1     |
|                         Estimated key counts: Map 2 => 1 |
|                         keys:                      |
|                           0 _col0 (type: int)      |
|                           1 _col0 (type: int)      |
|                         outputColumnNames: _col0, _col1, _col2, _col3 |
|                         input vertices:            |
|                           1 Map 2                  |
|                 

[jira] [Commented] (HIVE-27734) Add Icenerg's storage-partitioned join capabilities to Hive's [sorted-]bucket-map-join

2023-10-23 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778757#comment-17778757
 ] 

okumin commented on HIVE-27734:
---

What I confirmed:
 * We need a way to extract a bucket ID from metadata. In Hive tables, it is 
hardcoded in the path name.
 * We need a way to customize the hash function used for bucketing.

I assume we don't need to implement the following:
 * Pruning records by bucket ID, since it is handled as partition pruning.
 * Colocating records with the same bucket ID on INSERT, since it is handled 
as a partition insertion.

Points needing further investigation or discussion:
 * What happens if the partition spec evolves?
 * How to handle the case where there are multiple bucketing columns.
 * How to implement sort-merge (SMJ) joins.
 * Should we extend Bucket Merge Join to partition transform types other than 
`bucket[N]`?
 * Can aggregation optimization be implemented?
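The two prerequisites above (extracting a bucket ID and agreeing on the hash function) can be sketched conceptually. This is not Hive or Iceberg code: Iceberg derives bucket IDs from a 32-bit Murmur3 hash modulo N, and Python's built-in hash() below is only a stand-in for that transform; all function names are hypothetical.

```python
from collections import defaultdict

# Stand-in bucket transform: Iceberg uses (murmur3_32(key) & MAX_INT) % n;
# hash() here is NOT that function, just a placeholder for the sketch.
def bucket(n, key):
    return (hash(key) & 0x7FFFFFFF) % n

def bucket_map_join(n, big, small):
    # Hash only the small side, grouped by bucket ID.
    small_by_bucket = defaultdict(dict)
    for k, v in small:
        small_by_bucket[bucket(n, k)][k] = v
    # Each big-side task probes only its own bucket's hash table,
    # which is why both sides must agree on bucket(n, key).
    out = []
    for k, v in big:
        match = small_by_bucket[bucket(n, k)].get(k)
        if match is not None:
            out.append((k, v, match))
    return out
```

The design point is that the join is correct only if both tables compute bucket(n, key) identically, which is why Hive would need to plug in Iceberg's Murmur3-based transform rather than its own bucketing hash.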

> Add Icenerg's storage-partitioned join capabilities to Hive's 
> [sorted-]bucket-map-join
> --
>
> Key: HIVE-27734
> URL: https://issues.apache.org/jira/browse/HIVE-27734
> Project: Hive
>  Issue Type: Improvement
>  Components: Iceberg integration
>Affects Versions: 4.0.0-alpha-2
>Reporter: Janos Kovacs
>Priority: Major
>
> Iceberg's 'data bucketing' is implemented through its rich (function-based) 
> partitioning feature, which helps optimize join operations via so-called 
> storage-partitioned joins.
> doc: 
> [https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE/edit#heading=h.82w8qxfl2uwl]
> spark impl.: https://issues.apache.org/jira/browse/SPARK-37375
> This feature is not yet leveraged in Hive into its bucket-map-join 
> optimization, neither alone nor with Iceberg's SortOrder to 
> sorted-bucket-map-join.
> Customers migrating from the Hive table format to Iceberg with a 
> storage-optimized schema will experience performance degradation on large 
> tables, where Iceberg's gain from avoiding file listing is significantly 
> smaller than the join-performance gain from bucket-map-join or even 
> sorted-bucket-map-join.
>  
> {noformat}
> SET hive.query.results.cache.enabled=false;
> SET hive.fetch.task.conversion = none;
> SET hive.optimize.bucketmapjoin=true;
> SET hive.convert.join.bucket.mapjoin.tez=true;
> SET hive.auto.convert.join.noconditionaltask.size=1000;
> --if you are working with external table, you need this for bmj:
> SET hive.disable.unsafe.external.table.operations=false;
> -- HIVE BUCKET-MAP-JOIN
> DROP TABLE IF EXISTS default.hivebmjt1 PURGE;
> DROP TABLE IF EXISTS default.hivebmjt2 PURGE;
> CREATE TABLE default.hivebmjt1 (id int, txt string) CLUSTERED BY (id) INTO 8 
> BUCKETS;
> CREATE TABLE default.hivebmjt2 (id int, txt string);
> INSERT INTO default.hivebmjt1 VALUES 
> (1,'1'),(2,'2'),(3,'3'),(4,'4'),(5,'5'),(6,'6'),(7,'7'),(8,'8');
> INSERT INTO default.hivebmjt2 VALUES (1,'1'),(2,'2'),(3,'3'),(4,'4');
> EXPLAIN
> SELECT * FROM default.hivebmjt1 f INNER  JOIN default.hivebmjt2 d ON f.id 
> = d.id;
> EXPLAIN
> SELECT * FROM default.hivebmjt1 f LEFT OUTER JOIN default.hivebmjt2 d ON f.id 
> = d.id;
> -- Both are optimized into BMJ
> -- ICEBERG BUCKET-MAP-JOIN via Iceberg's storage-partitioned join
> DROP TABLE IF EXISTS default.icespbmjt1 PURGE;
> DROP TABLE IF EXISTS default.icespbmjt2 PURGE;
> CREATE TABLE default.icespbmjt1 (txt string) PARTITIONED BY (id int) STORED 
> BY ICEBERG ;
> CREATE TABLE default.icespbmjt2 (txt string) PARTITIONED BY (id int) STORED 
> BY ICEBERG ;
> INSERT INTO default.icespbmjt1 VALUES ('1',1),('2',2),('3',3),('4',4);
> INSERT INTO default.icespbmjt2 VALUES ('1',1),('2',2),('3',3),('4',4);
> EXPLAIN
> SELECT * FROM default.icespbmjt1 f INNER  JOIN default.icespbmjt2 d ON 
> f.id = d.id;
> EXPLAIN
> SELECT * FROM default.icespbmjt1 f LEFT OUTER JOIN default.icespbmjt2 d ON 
> f.id = d.id;
> -- Only Map-Join optimised
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HIVE-27734) Add Iceberg's storage-partitioned join capabilities to Hive's [sorted-]bucket-map-join

2023-10-23 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778749#comment-17778749
 ] 

okumin commented on HIVE-27734:
---

Here is my first attempt at implementing Bucket Map Join.

[https://github.com/okumin/hive/commits/POC-BMJ-ICEBERG2]

Test data:
{code:java}
SET hive.disable.unsafe.external.table.operations=false;
SET hive.fetch.task.conversion=none;

DROP TABLE IF EXISTS default.iceberg_bucketed PURGE;
DROP TABLE IF EXISTS default.iceberg_non_bucketed PURGE;
CREATE TABLE default.iceberg_bucketed (id int, txt string) PARTITIONED BY SPEC (bucket(8, id)) STORED BY ICEBERG;
CREATE TABLE default.iceberg_non_bucketed (id int, txt string) STORED BY ICEBERG;
INSERT INTO default.iceberg_bucketed VALUES (1,'1'),(2,'2'),(3,'3'),(4,'4'),(5,'5'),(6,'6'),(7,'7'),(8,'8');
INSERT INTO default.iceberg_non_bucketed VALUES (1,'1'),(2,'2'),(3,'3'),(4,'4');
{code}
The EXPLAIN EXTENDED output shows the join is converted into a Bucket Map Join:
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> EXPLAIN EXTENDED SELECT * FROM 
default.iceberg_bucketed f INNER JOIN default.iceberg_non_bucketed d ON f.id = 
d.id;
...
|                       Map Join Operator            |
|                         condition map:             |
|                              Inner Join 0 to 1     |
|                         Estimated key counts: Map 2 => 1 |
|                         keys:                      |
|                           0 _col0 (type: int)      |
|                           1 _col0 (type: int)      |
|                         outputColumnNames: _col0, _col1, _col2, _col3 |
|                         input vertices:            |
|                           1 Map 2                  |
|                         Position of Big Table: 0   |
|                         Statistics: Num rows: 4 Data size: 712 Basic stats: 
COMPLETE Column stats: COMPLETE |
|                         BucketMapJoin: true        |
|                         File Output Operator       |{code}
The query succeeded in joining records:
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> SELECT * FROM 
default.iceberg_bucketed f INNER JOIN default.iceberg_non_bucketed d ON f.id = 
d.id;
...
----------------------------------------------------------------------------------------------
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED
----------------------------------------------------------------------------------------------
Map 2 .......... container     SUCCEEDED      1          1        0        0       0       0
Map 1 .......... container     SUCCEEDED      5          5        0        0       0       0
----------------------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 12.78 s
----------------------------------------------------------------------------------------------
INFO  : Completed executing command(queryId=hive_20231023164515_ec38118a-9ba7-471a-914c-f53866e09ab9); Time taken: 12.979 seconds
+-------+--------+-------+--------+
| f.id  | f.txt  | d.id  | d.txt  |
+-------+--------+-------+--------+
| 3     | 3      | 3     | 3      |
| 1     | 1      | 1     | 1      |
| 2     | 2      | 2     | 2      |
| 4     | 4      | 4     | 4      |
+-------+--------+-------+--------+ {code}

[jira] [Commented] (HIVE-27473) Make SessionHiveMetaStoreClient and HiveMetaStoreClientWithLocalCache composable

2023-10-20 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1689#comment-1689
 ] 

okumin commented on HIVE-27473:
---

This might not be as trivial as I originally expected. 
SessionHiveMetaStoreClient and HiveMetaStoreClientWithLocalCache are coupled 
not with IMetaStoreClient but with HiveMetaStoreClient.

Let's look at the case of `getTable`. The public endpoint is defined as 
[IMetaStoreClient#getTable|https://github.com/apache/hive/blob/b02cef4fe943b9aba597dcdfd3b8f3d3a5efca3e/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/IMetaStoreClient.java#L735].
 However, SessionHiveMetaStoreClient, HiveMetaStoreClientWithLocalCache, and 
HiveMetaStoreClient are coupled not only through the public API, `getTable`, 
but also through another non-public API, `getTableInternal`.

1. A client calls 
[SessionHiveMetaStoreClient#getTable|https://github.com/apache/hive/blob/b02cef4fe943b9aba597dcdfd3b8f3d3a5efca3e/ql/src/java/org/apache/hadoop/hive/ql/metadata/SessionHiveMetaStoreClient.java#L277]
2. The method checks if a tmp table exists. If it doesn't exist, the method 
calls 
[super.getTable|https://github.com/apache/hive/blob/b02cef4fe943b9aba597dcdfd3b8f3d3a5efca3e/ql/src/java/org/apache/hadoop/hive/ql/metadata/SessionHiveMetaStoreClient.java#L284]
3. 
[HiveMetaStoreClient#getTable|https://github.com/apache/hive/blob/b02cef4fe943b9aba597dcdfd3b8f3d3a5efca3e/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java#L2780]
 is invoked
4. The method calls 
[getTableInternal|https://github.com/apache/hive/blob/b02cef4fe943b9aba597dcdfd3b8f3d3a5efca3e/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java#L2790]
5. 
[SessionHiveMetaStoreClient#getTableInternal|https://github.com/apache/hive/blob/b02cef4fe943b9aba597dcdfd3b8f3d3a5efca3e/ql/src/java/org/apache/hadoop/hive/ql/metadata/SessionHiveMetaStoreClient.java#L2094]
 is invoked
6. The method checks the query-level cache. If the table is not cached, 
[super.getTableInternal|https://github.com/apache/hive/blob/b02cef4fe943b9aba597dcdfd3b8f3d3a5efca3e/ql/src/java/org/apache/hadoop/hive/ql/metadata/SessionHiveMetaStoreClient.java#L2108]
 is called
7. 
[HiveMetaStoreClientWithLocalCache#getTableInternal|https://github.com/apache/hive/blob/b02cef4fe943b9aba597dcdfd3b8f3d3a5efca3e/ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreClientWithLocalCache.java#L273]
 is invoked
8. The method checks the metastore client cache. If the table is not cached, 
[super.getTableInternal|https://github.com/apache/hive/blob/b02cef4fe943b9aba597dcdfd3b8f3d3a5efca3e/ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveMetaStoreClientWithLocalCache.java#L282]
 is called
9. 
[HiveMetaStoreClient#getTableInternal|https://github.com/apache/hive/blob/b02cef4fe943b9aba597dcdfd3b8f3d3a5efca3e/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStoreClient.java#L2731]
 is invoked and the actual access to HMS happens

 

These classes have mutual dependencies: SessionHiveMetaStoreClient invokes 
steps implemented in HiveMetaStoreClient, and HiveMetaStoreClient in turn 
invokes steps implemented in SessionHiveMetaStoreClient. I guess we need to 
restructure this dependency in order to make them composable.
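As a hedged sketch of the composition-over-inheritance direction (hypothetical names in plain Python, deliberately not matching Hive's actual classes), each caching or session layer could wrap an inner client behind the shared interface instead of extending a concrete class, so the layers can be stacked in any combination without the mutual super-calls described above:

```python
# Hypothetical illustration of composition over inheritance for metastore
# clients. Names are illustrative and do NOT correspond to Hive's classes.

class MetaStoreClient:
    """Shared interface: the only contract the layers depend on."""
    def get_table(self, db: str, name: str) -> dict:
        raise NotImplementedError

class ThriftClient(MetaStoreClient):
    """Innermost layer: stand-in for the real RPC to the metastore."""
    def get_table(self, db, name):
        return {"db": db, "name": name, "source": "hms"}

class CachingClient(MetaStoreClient):
    """Decorator: adds a cache without knowing the inner implementation."""
    def __init__(self, inner: MetaStoreClient):
        self.inner = inner
        self.cache = {}
    def get_table(self, db, name):
        key = (db, name)
        if key not in self.cache:
            self.cache[key] = self.inner.get_table(db, name)
        return self.cache[key]

class TempTableClient(MetaStoreClient):
    """Decorator: resolves session-local temp tables before delegating."""
    def __init__(self, inner: MetaStoreClient, temp_tables=None):
        self.inner = inner
        self.temp_tables = temp_tables or {}
    def get_table(self, db, name):
        return self.temp_tables.get((db, name)) or self.inner.get_table(db, name)

# Layers compose in any order, with no upward calls between them:
client = TempTableClient(CachingClient(ThriftClient()))
print(client.get_table("default", "t1")["source"])  # hms
```

In this shape there is no `getTableInternal`-style back channel: each layer only calls downward through the interface, which is what would make the pluggable-client proposal in HIVE-12679 feasible.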

> Make SessionHiveMetaStoreClient and HiveMetaStoreClientWithLocalCache 
> composable
> 
>
> Key: HIVE-27473
> URL: https://issues.apache.org/jira/browse/HIVE-27473
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive
>Affects Versions: 4.0.0-alpha-2
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> Currently, we implement the two classes using inheritance. It assumes we 
> always use a single implementation of IMetaStoreClient.
> Some community members have been willing to make IMetaStoreClient pluggable 
> as proposed in HIVE-12679. Considering the use case, we should provide the 
> additional traits with composition over inheritance.





[jira] [Commented] (HIVE-27734) Add Iceberg's storage-partitioned join capabilities to Hive's [sorted-]bucket-map-join

2023-10-11 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17774036#comment-17774036
 ] 

okumin commented on HIVE-27734:
---

Thanks. Let me briefly write down potential sub-tasks. As I recall, the 
current implementation is tightly coupled with traditional Hive tables.






[jira] [Commented] (HIVE-27734) Add Iceberg's storage-partitioned join capabilities to Hive's [sorted-]bucket-map-join

2023-10-06 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17772576#comment-17772576
 ] 

okumin commented on HIVE-27734:
---

I'm taking a look since I have experience implementing Bucket Map Join for 
non-native tables.






[jira] [Commented] (HIVE-27084) Iceberg: Stats are not populated correctly during query compilation

2023-10-06 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17772464#comment-17772464
 ] 

okumin commented on HIVE-27084:
---

[~rajesh.balamohan] This might not be a problem with Iceberg; it could be 
expected behavior.

First, the first query scans two columns, `ss_sold_date_sk` and 
`ss_wholesale_cost`, while the second query scans a single column, 
`ss_sold_date_sk`. I guess `ss_wholesale_cost` is not a partition column, so 
the first query has to scan more data than the second one.

{code:sql}
select count(*) from store_sales where ss_sold_date_sk=2450822 and 
ss_wholesale_cost > 0.0;
select count(*) from store_sales where ss_sold_date_sk=2450822;
{code}

Then, I guess `ss_sold_date_sk` is int or bigint and `ss_wholesale_cost` is 
decimal. On my machine, the derived size of a bigint value is 8 bytes while 
that of a decimal value is 112 bytes, which can make a big difference in the 
estimated data size.
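The size difference can be checked with back-of-the-envelope arithmetic. The per-value sizes below are the ones quoted in the comment above (8 bytes for bigint, 112 for decimal, as derived on one machine); Hive's actual stats estimation may differ:

```python
# Rough per-row size estimate using the sizes quoted in the comment above.
# Illustrative only; actual sizes depend on Hive's statistics estimation.
BIGINT_BYTES = 8     # e.g. ss_sold_date_sk
DECIMAL_BYTES = 112  # e.g. ss_wholesale_cost, as derived on one machine

rows = 5  # estimated rows surviving the filter, per the EXPLAIN output
one_column = rows * BIGINT_BYTES                      # second query
two_columns = rows * (BIGINT_BYTES + DECIMAL_BYTES)   # first query

print(one_column, two_columns)  # 40 600
```

So even with identical row estimates, the query that projects the decimal column is estimated roughly 15x larger, which is consistent with the two queries reporting very different data sizes.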

> Iceberg: Stats are not populated correctly during query compilation
> ---
>
> Key: HIVE-27084
> URL: https://issues.apache.org/jira/browse/HIVE-27084
> Project: Hive
>  Issue Type: Improvement
>  Components: Iceberg integration
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: performance
>
> - Table stats are not properly used/computed during the query compilation 
> phase.
>  - Here is an example. Check out the query with the filter, which reports a 
> larger estimated data size than the regular query.
> This is just an example; real-world queries can get bad query plans due to 
> this.
> {{10470974584 with filter, vs 303658262936 without filter}}
> {noformat}
> explain select count(*) from store_sales where ss_sold_date_sk=2450822 and 
> ss_wholesale_cost > 0.0
> Explain
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
> Tez
>   DagId: hive_20230216065808_80d68e3f-3a6b-422b-9265-50bc707ae3c6:48
>   Edges:
> Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
>   DagName: hive_20230216065808_80d68e3f-3a6b-422b-9265-50bc707ae3c6:48
>   Vertices:
> Map 1
> Map Operator Tree:
> TableScan
>   alias: store_sales
>   filterExpr: ((ss_sold_date_sk = 2450822) and 
> (ss_wholesale_cost > 0)) (type: boolean)
>   Statistics: Num rows: 2755519629 Data size: 303658262936 
> Basic stats: COMPLETE Column stats: NONE
>   Filter Operator
> predicate: ((ss_sold_date_sk = 2450822) and 
> (ss_wholesale_cost > 0)) (type: boolean)
> Statistics: Num rows: 5 Data size: 550 Basic stats: 
> COMPLETE Column stats: NONE
> Select Operator
>   Statistics: Num rows: 5 Data size: 550 Basic stats: 
> COMPLETE Column stats: NONE
>   Group By Operator
> aggregations: count()
> minReductionHashAggr: 0.99
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 124 Basic stats: 
> COMPLETE Column stats: NONE
> Reduce Output Operator
>   null sort order:
>   sort order:
>   Statistics: Num rows: 1 Data size: 124 Basic stats: 
> COMPLETE Column stats: NONE
>   value expressions: _col0 (type: bigint)
> Execution mode: vectorized, llap
> LLAP IO: all inputs (cache only)
> Reducer 2
> Execution mode: vectorized, llap
> Reduce Operator Tree:
>   Group By Operator
> aggregations: count(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 124 Basic stats: COMPLETE 
> Column stats: NONE
> File Output Operator
>   compressed: false
>   Statistics: Num rows: 1 Data size: 124 Basic stats: 
> COMPLETE Column stats: NONE
>   table:
>   input format: 
> org.apache.hadoop.mapred.SequenceFileInputFormat
>   output format: 
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>   serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
> Fetch Operator
>   limit: -1
>   Processor Tree:
> ListSink
> 58 rows selected (0.73 seconds)
> explain select count(*) from store_sales where ss_sold_date_sk=2450822
> INFO  : Starting task [Stage-3:EXPLAIN] in serial mode
> INFO  : Completed executing 
> command(queryId=hive_20230216065813_e51482a2-1c9a-41a7-b1b3-9aec2fba9ba7); 
> Time taken: 0.061 

[jira] [Commented] (HIVE-27119) Iceberg: Delete from table generates lot of files

2023-10-06 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17772444#comment-17772444
 ] 

okumin commented on HIVE-27119:
---

[~rajesh.balamohan] I think Auto Reduce Parallelism would help you.


{code:java}
zookage@client-node-0:~$ beeline -e 'delete from store_Sales where 
ss_customer_sk % 10 = 0' --hiveconf hive.tez.auto.reducer.parallelism=true 
--hiveconf hive.server2.in.place.progress=false
...
INFO  : 2023-10-06 05:58:56,843 Map 1: -/-  Reducer 2: 0/8  
INFO  : 2023-10-06 05:58:58,367 Map 1: 0/1  Reducer 2: 0/8  
INFO  : 2023-10-06 05:59:01,396 Map 1: 0(+1)/1  Reducer 2: 0/8  
INFO  : 2023-10-06 05:59:12,491 Map 1: 1/1  Reducer 2: 0/8  
INFO  : 2023-10-06 05:59:13,005 Map 1: 1/1  Reducer 2: 2/2  
{code}


> Iceberg: Delete from table generates lot of files
> -
>
> Key: HIVE-27119
> URL: https://issues.apache.org/jira/browse/HIVE-27119
> Project: Hive
>  Issue Type: Improvement
>  Components: Iceberg integration
>Reporter: Rajesh Balamohan
>Priority: Major
>  Labels: performance
>
> With "delete" it generates lot of files due to the way data is sent to the 
> reducers. Files per partition is impacted by the number of reduce tasks.
> One way could be to explicitly control the number of reducers; Creating this 
> ticket to have a long term fix.
>  
> {noformat}
>  explain delete from store_Sales where ss_customer_sk % 10 = 0;
> INFO  : Compiling 
> command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b): 
> explain delete from store_Sales where ss_customer_sk % 10 = 0
> INFO  : No Stats for tpcds_1000_iceberg_mor_v4@store_sales, Columns: 
> ss_sold_time_sk, ss_cdemo_sk, ss_promo_sk, ss_ext_discount_amt, 
> ss_ext_sales_price, ss_net_profit, ss_addr_sk, ss_ticket_number, 
> ss_wholesale_cost, ss_item_sk, ss_ext_list_price, ss_sold_date_sk, 
> ss_store_sk, ss_coupon_amt, ss_quantity, ss_list_price, ss_sales_price, 
> ss_customer_sk, ss_ext_wholesale_cost, ss_net_paid, ss_ext_tax, ss_hdemo_sk, 
> ss_net_paid_inc_tax
> INFO  : Semantic Analysis Completed (retrial = false)
> INFO  : Created Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, 
> type:string, comment:null)], properties:null)
> INFO  : Completed compiling 
> command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b); 
> Time taken: 0.704 seconds
> INFO  : Executing 
> command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b): 
> explain delete from store_Sales where ss_customer_sk % 10 = 0
> INFO  : Starting task [Stage-4:EXPLAIN] in serial mode
> INFO  : Completed executing 
> command(queryId=hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b); 
> Time taken: 0.005 seconds
> INFO  : OK
> Explain
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-2 depends on stages: Stage-1
>   Stage-0 depends on stages: Stage-2
>   Stage-3 depends on stages: Stage-0
> STAGE PLANS:
>   Stage: Stage-1
> Tez
>   DagId: hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b:377
>   Edges:
> Reducer 2 <- Map 1 (SIMPLE_EDGE)
>   DagName: hive_20230303021031_855dd644-8f67-482d-98d7-e9f70b56ae0b:377
>   Vertices:
> Map 1
> Map Operator Tree:
> TableScan
>   alias: store_sales
>   filterExpr: ((ss_customer_sk % 10) = 0) (type: boolean)
>   Statistics: Num rows: 2755519629 Data size: 3643899155232 
> Basic stats: COMPLETE Column stats: NONE
>   Filter Operator
> predicate: ((ss_customer_sk % 10) = 0) (type: boolean)
> Statistics: Num rows: 1377759814 Data size: 1821949576954 
> Basic stats: COMPLETE Column stats: NONE
> Select Operator
>   expressions: PARTITION__SPEC__ID (type: int), 
> PARTITION__HASH (type: bigint), FILE__PATH (type: string), ROW__POSITION 
> (type: bigint), ss_sold_time_sk (type: int), ss_item_sk (type: int), 
> ss_customer_sk (type: int), ss_cdemo_sk (type: int), ss_hdemo_sk (type: int), 
> ss_addr_sk (type: int), ss_store_sk (type: int), ss_promo_sk (type: int), 
> ss_ticket_number (type: bigint), ss_quantity (type: int), ss_wholesale_cost 
> (type: decimal(7,2)), ss_list_price (type: decimal(7,2)), ss_sales_price 
> (type: decimal(7,2)), ss_ext_discount_amt (type: decimal(7,2)), 
> ss_ext_sales_price (type: decimal(7,2)), ss_ext_wholesale_cost (type: 
> decimal(7,2)), ss_ext_list_price (type: decimal(7,2)), ss_ext_tax (type: 
> decimal(7,2)), ss_coupon_amt (type: decimal(7,2)), ss_net_paid (type: 
> decimal(7,2)), ss_net_paid_inc_tax (type: decimal(7,2)), ss_net_profit (type: 
> decimal(7,2)), ss_sold_date_sk (type: int)
>   outputColumnNames: _col0, _col1, _col2, _col3, _col4, 
> _col5, _col6, 

[jira] [Commented] (HIVE-4966) Introduce Collect_Map UDAF

2023-09-10 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-4966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763516#comment-17763516
 ] 

okumin commented on HIVE-4966:
--

We'd like to support a UDF similar to Trino's 
[map_agg|https://trino.io/docs/current/functions/aggregate.html]. It sounds 
reasonable to implement it as part of the `COLLECT_*` family in Hive. Let me 
take this over since there has been no update for 10 years.
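To illustrate the intended semantics, here is a hedged sketch in plain Python of what a collect_map-style aggregation does (not the UDAF implementation itself): it groups (key, value) pairs per group key into a map, like Trino's map_agg.

```python
# Illustrative semantics of a collect_map-style aggregation, in plain Python.
# This is a sketch of the intended behavior, not the Hive UDAF implementation.
from collections import defaultdict

def collect_map(rows):
    """rows: iterable of (group_key, map_key, value).
    Returns {group_key: {map_key: value}}; the last value wins on duplicates.
    """
    result = defaultdict(dict)
    for customer, product, amt in rows:
        result[customer][product] = amt
    return dict(result)

# Mirrors the Txn(customer, product, amt) example in the issue description.
txn = [("alice", "book", 10), ("alice", "pen", 2), ("bob", "book", 7)]
print(collect_map(txn))
# {'alice': {'book': 10, 'pen': 2}, 'bob': {'book': 7}}
```

One design question a real UDAF has to settle is duplicate-key handling (last-wins, first-wins, or error); the sketch above simply takes last-wins.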

> Introduce Collect_Map UDAF
> --
>
> Key: HIVE-4966
> URL: https://issues.apache.org/jira/browse/HIVE-4966
> Project: Hive
>  Issue Type: Bug
>Reporter: Harish Butani
>Assignee: okumin
>Priority: Major
> Attachments: Collect_map.patch
>
>
> Similar to Collect_Set. For e.g. on a Txn table
> {noformat}
> Txn(customer, product, amt)
> select customer, collect_map(product, amt)
> from txn
> group by customer
> {noformat}
> Would give you an activity map for each customer.
> Other thoughts:
> - have explode do the inverse on maps just as it does for sets today.
> - introduce a table function that outputs each value as a column. So in the 
> e.g. above you get an activity matrix instead of a map. 





[jira] [Assigned] (HIVE-4966) Introduce Collect_Map UDAF

2023-09-10 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-4966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin reassigned HIVE-4966:


Assignee: okumin  (was: Harish Butani)






[jira] [Created] (HIVE-27587) Add TIME data type

2023-08-09 Thread okumin (Jira)
okumin created HIVE-27587:
-

 Summary: Add TIME data type
 Key: HIVE-27587
 URL: https://issues.apache.org/jira/browse/HIVE-27587
 Project: Hive
  Issue Type: New Feature
  Components: Metastore, Query Processor
Affects Versions: 4.0.0-alpha-2
Reporter: okumin
Assignee: okumin


Support the TIME type defined in the SQL standard





[jira] [Created] (HIVE-27575) MASK_HASH UDF generates a log per row

2023-08-07 Thread okumin (Jira)
okumin created HIVE-27575:
-

 Summary: MASK_HASH UDF generates a log per row
 Key: HIVE-27575
 URL: https://issues.apache.org/jira/browse/HIVE-27575
 Project: Hive
  Issue Type: Improvement
  Components: UDF
Affects Versions: 4.0.0-alpha-2
Reporter: okumin


Hive 4 started generating too many logs since we merged the following patch.
https://issues.apache.org/jira/browse/HIVE-26724
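A common fix for this class of problem is the "log once" pattern: move the message out of the per-row path into initialization. The sketch below is a hedged illustration in plain Python, not the actual Hive patch, and the class and method names only mirror the log output above:

```python
# Sketch of the "log once" pattern for per-row logging hot paths.
# Illustrative only; this is not the patch applied in Hive.
import hashlib
import logging

log = logging.getLogger("MaskHashTransformer")

class MaskHashTransformer:
    def __init__(self, algorithm: str = "SHA256"):
        self.algorithm = algorithm
        # Log once, at initialization, instead of inside the per-row path.
        log.info("Using %s for masking", self.algorithm)
        self._calls = 0

    def transform(self, value: str) -> str:
        self._calls += 1  # hot path: no logging here
        return hashlib.sha256(value.encode()).hexdigest()

t = MaskHashTransformer()
hashes = [t.transform(v) for v in ("a", "b", "c")]
print(t._calls)  # 3
```

With the message emitted in the constructor, a million-row query logs "Using SHA256 for masking" once per operator instance rather than once per row.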





[jira] [Assigned] (HIVE-27575) MASK_HASH UDF generates a log per row

2023-08-07 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin reassigned HIVE-27575:
-

Assignee: okumin






[jira] [Updated] (HIVE-27575) MASK_HASH UDF generates a log per row

2023-08-07 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-27575:
--
Description: 
Hive 4 started generating too many logs since we merged the following patch.
https://issues.apache.org/jira/browse/HIVE-26724


{code:txt}
$ ./bin/logs hive-hiveserver2-859d7686f7-8tkj8 | grep MaskHashTransformer
hive-hiveserver2-859d7686f7-8tkj8: 2023-08-07T21:55:50,360  INFO 
[HiveServer2-Background-Pool: Thread-90] generic.MaskHashTransformer: Using 
SHA256 for masking
hive-hiveserver2-859d7686f7-8tkj8: 2023-08-07T21:55:50,361  INFO 
[HiveServer2-Background-Pool: Thread-90] generic.MaskHashTransformer: Using 
SHA256 for masking
hive-hiveserver2-859d7686f7-8tkj8: 2023-08-07T21:55:50,361  INFO 
[HiveServer2-Background-Pool: Thread-90] generic.MaskHashTransformer: Using 
SHA256 for masking
hive-hiveserver2-859d7686f7-8tkj8: 2023-08-07T21:55:50,361  INFO 
[HiveServer2-Background-Pool: Thread-90] generic.MaskHashTransformer: Using 
SHA256 for masking
hive-hiveserver2-859d7686f7-8tkj8: 2023-08-07T21:55:50,361  INFO 
[HiveServer2-Background-Pool: Thread-90] generic.MaskHashTransformer: Using 
SHA256 for masking
{code}


  was:
Hive 4 started generating too many logs since we merged the following patch.
https://issues.apache.org/jira/browse/HIVE-26724


> MASK_HASH UDF generates a log per row
> 
>
> Key: HIVE-27575
> URL: https://issues.apache.org/jira/browse/HIVE-27575
> Project: Hive
>  Issue Type: Improvement
>  Components: UDF
>Affects Versions: 4.0.0-alpha-2
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> Hive 4 started generating too many logs since we merged the following patch.
> https://issues.apache.org/jira/browse/HIVE-26724
> {code:txt}
> $ ./bin/logs hive-hiveserver2-859d7686f7-8tkj8 | grep MaskHashTransformer
> hive-hiveserver2-859d7686f7-8tkj8: 2023-08-07T21:55:50,360  INFO 
> [HiveServer2-Background-Pool: Thread-90] generic.MaskHashTransformer: Using 
> SHA256 for masking
> hive-hiveserver2-859d7686f7-8tkj8: 2023-08-07T21:55:50,361  INFO 
> [HiveServer2-Background-Pool: Thread-90] generic.MaskHashTransformer: Using 
> SHA256 for masking
> hive-hiveserver2-859d7686f7-8tkj8: 2023-08-07T21:55:50,361  INFO 
> [HiveServer2-Background-Pool: Thread-90] generic.MaskHashTransformer: Using 
> SHA256 for masking
> hive-hiveserver2-859d7686f7-8tkj8: 2023-08-07T21:55:50,361  INFO 
> [HiveServer2-Background-Pool: Thread-90] generic.MaskHashTransformer: Using 
> SHA256 for masking
> hive-hiveserver2-859d7686f7-8tkj8: 2023-08-07T21:55:50,361  INFO 
> [HiveServer2-Background-Pool: Thread-90] generic.MaskHashTransformer: Using 
> SHA256 for masking
> {code}





[jira] [Updated] (HIVE-27484) Limit pushdown with offset generates wrong results

2023-07-07 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-27484:
--
Status: Patch Available  (was: In Progress)

> Limit pushdown with offset generates wrong results
> -
>
> Key: HIVE-27484
> URL: https://issues.apache.org/jira/browse/HIVE-27484
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Affects Versions: 4.0.0-alpha-2
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>
> With `hive.optimize.limittranspose` in CBO, Hive can generate incorrect 
> results.
> For example, I'd say this case should generate 1 row.
> https://github.com/apache/hive/blob/rel/release-4.0.0-alpha-2/ql/src/test/results/clientpositive/llap/limit_join_transpose.q.out#L1328-L1341





[jira] [Work started] (HIVE-27484) Limit pushdown with offset generates wrong results

2023-07-06 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-27484 started by okumin.
-
> Limit pushdown with offset generates wrong results
> -
>
> Key: HIVE-27484
> URL: https://issues.apache.org/jira/browse/HIVE-27484
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Affects Versions: 4.0.0-alpha-2
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> With `hive.optimize.limittranspose` in CBO, Hive can generate incorrect 
> results.
> For example, I'd say this case should generate 1 row.
> https://github.com/apache/hive/blob/rel/release-4.0.0-alpha-2/ql/src/test/results/clientpositive/llap/limit_join_transpose.q.out#L1328-L1341





[jira] [Created] (HIVE-27484) Limit pushdown with offset generates wrong results

2023-07-06 Thread okumin (Jira)
okumin created HIVE-27484:
-

 Summary: Limit pushdown with offset generates wrong results
 Key: HIVE-27484
 URL: https://issues.apache.org/jira/browse/HIVE-27484
 Project: Hive
  Issue Type: Bug
  Components: CBO
Affects Versions: 4.0.0-alpha-2
Reporter: okumin
Assignee: okumin


With `hive.optimize.limittranspose`, Hive can generate incorrect results.

For example, I'd say this case should generate 1 row.
https://github.com/apache/hive/blob/rel/release-4.0.0-alpha-2/ql/src/test/results/clientpositive/llap/limit_join_transpose.q.out#L1328-L1341





[jira] [Updated] (HIVE-27484) Limit pushdown with offset generates wrong results

2023-07-06 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-27484:
--
Description: 
With `hive.optimize.limittranspose` in CBO, Hive can generate incorrect results.

For example, I'd say this case should generate 1 row.
https://github.com/apache/hive/blob/rel/release-4.0.0-alpha-2/ql/src/test/results/clientpositive/llap/limit_join_transpose.q.out#L1328-L1341

  was:
With `hive.optimize.limittranspose`, Hive can generate incorrect results.

For example, I'd say this case should generate 1 row.
https://github.com/apache/hive/blob/rel/release-4.0.0-alpha-2/ql/src/test/results/clientpositive/llap/limit_join_transpose.q.out#L1328-L1341


> Limit pushdown with offset generates wrong results
> -
>
> Key: HIVE-27484
> URL: https://issues.apache.org/jira/browse/HIVE-27484
> Project: Hive
>  Issue Type: Bug
>  Components: CBO
>Affects Versions: 4.0.0-alpha-2
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> With `hive.optimize.limittranspose` in CBO, Hive can generate incorrect 
> results.
> For example, I'd say this case should generate 1 row.
> https://github.com/apache/hive/blob/rel/release-4.0.0-alpha-2/ql/src/test/results/clientpositive/llap/limit_join_transpose.q.out#L1328-L1341





[jira] [Work started] (HIVE-27480) OFFSET without ORDER BY generates wrong results

2023-07-02 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-27480 started by okumin.
-
> OFFSET without ORDER BY generates wrong results
> ---
>
> Key: HIVE-27480
> URL: https://issues.apache.org/jira/browse/HIVE-27480
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Affects Versions: 2.4.0, 3.1.3, 4.0.0-alpha-2
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> Without ORDER BY, Hive doesn't add an additional single reducer; it pushes 
> OFFSET and LIMIT down to the original vertex. This can generate incorrect 
> results because OFFSET, unlike LIMIT, must be counted globally.
> We will make the following changes:
> - Fix the incorrect behavior
> - Add a new `hive.strict.checks.*` option to prevent this usage, since OFFSET 
> without ORDER BY is mostly meaningless
> We can reproduce the issue in the following steps.
> *Prepare test data*
> The following SQLs generate a test table with multiple files.
> {code:sql}
> $ beeline -e "
>   drop table test;
>   create table test(id int);
>   insert into test values (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
>   insert into test values (11), (12), (13), (14), (15), (16), (17), (18), 
> (19), (20);
>   insert into test values (21), (22), (23), (24), (25), (26), (27), (28), 
> (29), (30);
> "
> {code}
> *Reproduce the issue*
> No rows are returned with `limit 10 offset 10`.
> {code:sql}
> $ beeline -e "select * from test limit 10, 10" --hiveconf 
> hive.fetch.task.conversion=none --hiveconf tez.grouping.min-size=1 --hiveconf 
> tez.grouping.max-size=1
> ...
> +--+
> | test.id  |
> +--+
> +--+
> No rows selected (13.595 seconds)
> {code}
> *Expected behavior*
> It should return any 10 rows like this.
> {code:sql}
> $ beeline -e "select * from test limit 10, 10"
> ...
> +--+
> | test.id  |
> +--+
> | 11   |
> | 12   |
> | 13   |
> | 14   |
> | 15   |
> | 16   |
> | 17   |
> | 18   |
> | 19   |
> | 20   |
> +--+
> 10 rows selected (0.175 seconds)
> {code}





[jira] [Created] (HIVE-27480) OFFSET without ORDER BY generates wrong results

2023-07-02 Thread okumin (Jira)
okumin created HIVE-27480:
-

 Summary: OFFSET without ORDER BY generates wrong results
 Key: HIVE-27480
 URL: https://issues.apache.org/jira/browse/HIVE-27480
 Project: Hive
  Issue Type: Bug
  Components: Query Planning
Affects Versions: 4.0.0-alpha-2, 3.1.3, 2.4.0
Reporter: okumin
Assignee: okumin


Without ORDER BY, Hive doesn't add an additional single reducer; it pushes 
OFFSET and LIMIT down to the original vertex. This can generate incorrect 
results because OFFSET, unlike LIMIT, must be counted globally.

We will make the following changes:
- Fix the incorrect behavior
- Add a new `hive.strict.checks.*` option to prevent this usage, since OFFSET 
without ORDER BY is mostly meaningless

We can reproduce the issue in the following steps.

*Prepare test data*

The following SQLs generate a test table with multiple files.

{code:sql}
$ beeline -e "
  drop table test;
  create table test(id int);
  insert into test values (1), (2), (3), (4), (5), (6), (7), (8), (9), (10);
  insert into test values (11), (12), (13), (14), (15), (16), (17), (18), (19), 
(20);
  insert into test values (21), (22), (23), (24), (25), (26), (27), (28), (29), 
(30);
"
{code}

*Reproduce the issue*

No rows are returned with `limit 10 offset 10`.

{code:sql}
$ beeline -e "select * from test limit 10, 10" --hiveconf 
hive.fetch.task.conversion=none --hiveconf tez.grouping.min-size=1 --hiveconf 
tez.grouping.max-size=1
...
+--+
| test.id  |
+--+
+--+
No rows selected (13.595 seconds)
{code}

*Expected behavior*

It should return any 10 rows like this.

{code:sql}
$ beeline -e "select * from test limit 10, 10"
...
+--+
| test.id  |
+--+
| 11   |
| 12   |
| 13   |
| 14   |
| 15   |
| 16   |
| 17   |
| 18   |
| 19   |
| 20   |
+--+
10 rows selected (0.175 seconds)
{code}
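A toy model makes the failure mode concrete: with three splits of ten rows each, applying `LIMIT 10 OFFSET 10` independently inside every task returns nothing, while the global semantics return rows 11-20. The classes below are illustrative, not Hive's execution code.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the bug: applying OFFSET/LIMIT per task versus globally.
final class OffsetModel {
    // Broken plan: each task independently skips `offset` rows from its own split.
    static List<Integer> perTask(List<List<Integer>> splits, int offset, int limit) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> split : splits) {
            int from = Math.min(offset, split.size());
            int to = Math.min(from + limit, split.size());
            out.addAll(split.subList(from, to));
        }
        return out;
    }

    // Correct semantics: skip `offset` rows once over the whole result set.
    static List<Integer> globally(List<Integer> rows, int offset, int limit) {
        int from = Math.min(offset, rows.size());
        int to = Math.min(from + limit, rows.size());
        return rows.subList(from, to);
    }
}
```

With splits {1..10}, {11..20}, {21..30}, `perTask(splits, 10, 10)` is empty, which matches the "No rows selected" output above, while `globally(all, 10, 10)` yields 11 through 20.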






[jira] [Created] (HIVE-27473) Make SessionHiveMetaStoreClient and HiveMetaStoreClientWithLocalCache composable

2023-06-29 Thread okumin (Jira)
okumin created HIVE-27473:
-

 Summary: Make SessionHiveMetaStoreClient and 
HiveMetaStoreClientWithLocalCache composable
 Key: HIVE-27473
 URL: https://issues.apache.org/jira/browse/HIVE-27473
 Project: Hive
  Issue Type: Improvement
  Components: Hive
Affects Versions: 4.0.0-alpha-2
Reporter: okumin
Assignee: okumin


Currently, the two classes are implemented via inheritance, which assumes we 
always use a single implementation of IMetaStoreClient.
Some community members want to make IMetaStoreClient pluggable, as proposed in 
HIVE-12679. Given that use case, we should provide these additional traits 
through composition rather than inheritance.
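The composition idea can be sketched as a caching trait that wraps any client implementation rather than extending one concrete class. The interface below is a simplified stand-in for IMetaStoreClient, not Hive's real API:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for IMetaStoreClient (illustrative only).
interface MetaStoreClient {
    String getTable(String db, String table);
}

// The caching trait wraps *any* MetaStoreClient via delegation, so it also
// composes with a pluggable implementation chosen at runtime.
final class CachingMetaStoreClient implements MetaStoreClient {
    private final MetaStoreClient delegate;            // any implementation
    private final Map<String, String> cache = new HashMap<>();
    int delegateCalls = 0;                             // for illustration

    CachingMetaStoreClient(MetaStoreClient delegate) {
        this.delegate = delegate;
    }

    @Override
    public String getTable(String db, String table) {
        // Consult the local cache first; fall through to the wrapped client.
        return cache.computeIfAbsent(db + "." + table, key -> {
            delegateCalls++;
            return delegate.getTable(db, table);
        });
    }
}
```

With inheritance, the caching behavior is welded to one base class; with this shape, it layers over whichever client a deployment plugs in.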





[jira] [Updated] (HIVE-22618) Fix checkstyle violations for ParseUtils

2023-06-25 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-22618:
--
Labels: pull-request-available  (was: )
Status: Patch Available  (was: Open)

> Fix checkstyle violations for ParseUtils
> 
>
> Key: HIVE-22618
> URL: https://issues.apache.org/jira/browse/HIVE-22618
> Project: Hive
>  Issue Type: Task
>  Components: Query Processor
>Reporter: okumin
>Assignee: okumin
>Priority: Minor
>  Labels: pull-request-available
>
> ParseUtils.java has incorrect indents and it causes any changes to violate 
> checkstyle.
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseUtils.java#L274]





[jira] [Commented] (HIVE-12679) Allow users to be able to specify an implementation of IMetaStoreClient via HiveConf

2023-06-21 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-12679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17735793#comment-17735793
 ] 

okumin commented on HIVE-12679:
---

[~moomindani] Thanks! I created a patch with some cosmetic changes aligned with 
the latest Hive.
https://github.com/apache/hive/pull/

cc: [~austintlee]

> Allow users to be able to specify an implementation of IMetaStoreClient via 
> HiveConf
> 
>
> Key: HIVE-12679
> URL: https://issues.apache.org/jira/browse/HIVE-12679
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration, Metastore, Query Planning
>Reporter: Austin Lee
>Assignee: okumin
>Priority: Minor
>  Labels: metastore, pull-request-available
> Attachments: HIVE-12679.1.patch, HIVE-12679.2.patch, 
> HIVE-12679.branch-1.2.patch, HIVE-12679.branch-2.3.patch, HIVE-12679.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Hi,
> I would like to propose a change that would make it possible for users to 
> choose an implementation of IMetaStoreClient via HiveConf, i.e. 
> hive-site.xml.  Currently, in Hive the choice is hard coded to be 
> SessionHiveMetaStoreClient in org.apache.hadoop.hive.ql.metadata.Hive.  There 
> is no other direct reference to SessionHiveMetaStoreClient other than the 
> hard coded class name in Hive.java and the QL component operates only on the 
> IMetaStoreClient interface so the change would be minimal and it would be 
> quite similar to how an implementation of RawStore is specified and loaded in 
> hive-metastore.  One use case this change would serve would be one where a 
> user wishes to use an implementation of this interface without the dependency 
> on the Thrift server.
>   
> Thank you,
> Austin
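The loading mechanism the proposal compares to RawStore can be sketched as reading a class name from configuration and constructing it reflectively. The class and method names below are illustrative, not Hive's actual API:

```java
// Sketch of config-driven client loading: the configured class name would
// come from hive-site.xml; here it is just a string parameter. A real
// loader would pass HiveConf into a richer factory method.
final class MetaStoreClientLoader {
    static Object load(String className) {
        try {
            // Assumes the configured class has a no-arg constructor.
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load client: " + className, e);
        }
    }
}
```

Because QL already programs against the IMetaStoreClient interface, swapping the hard-coded SessionHiveMetaStoreClient for a reflectively loaded class is the whole extent of the change.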





[jira] [Assigned] (HIVE-12679) Allow users to be able to specify an implementation of IMetaStoreClient via HiveConf

2023-06-21 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-12679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin reassigned HIVE-12679:
-

Assignee: okumin  (was: Noritaka Sekiyama)

> Allow users to be able to specify an implementation of IMetaStoreClient via 
> HiveConf
> 
>
> Key: HIVE-12679
> URL: https://issues.apache.org/jira/browse/HIVE-12679
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration, Metastore, Query Planning
>Reporter: Austin Lee
>Assignee: okumin
>Priority: Minor
>  Labels: metastore, pull-request-available
> Attachments: HIVE-12679.1.patch, HIVE-12679.2.patch, 
> HIVE-12679.branch-1.2.patch, HIVE-12679.branch-2.3.patch, HIVE-12679.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Hi,
> I would like to propose a change that would make it possible for users to 
> choose an implementation of IMetaStoreClient via HiveConf, i.e. 
> hive-site.xml.  Currently, in Hive the choice is hard coded to be 
> SessionHiveMetaStoreClient in org.apache.hadoop.hive.ql.metadata.Hive.  There 
> is no other direct reference to SessionHiveMetaStoreClient other than the 
> hard coded class name in Hive.java and the QL component operates only on the 
> IMetaStoreClient interface so the change would be minimal and it would be 
> quite similar to how an implementation of RawStore is specified and loaded in 
> hive-metastore.  One use case this change would serve would be one where a 
> user wishes to use an implementation of this interface without the dependency 
> on the Thrift server.
>   
> Thank you,
> Austin





[jira] [Commented] (HIVE-26571) Thrift over HTTP config warnings in logs

2023-06-18 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733906#comment-17733906
 ] 

okumin commented on HIVE-26571:
---

[~charanh] I guess you can use 
[metastore.server.thrift.transport.mode|https://github.com/apache/hive/blob/rel/release-4.0.0-alpha-2/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java#L1378]
 and 
[metastore.client.thrift.transport.mode|https://github.com/apache/hive/blob/rel/release-4.0.0-alpha-2/standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java#L1531]
 instead.

> Thrift over HTTP config warnings in logs
> 
>
> Key: HIVE-26571
> URL: https://issues.apache.org/jira/browse/HIVE-26571
> Project: Hive
>  Issue Type: Bug
>  Components: Hive, Metastore
>Reporter: Charan Hebri
>Priority: Major
>
> HIVE-21456 added support for using Thrift protocol with HTTP as the mode. 
> After setting the configs,
> {noformat}
> hive.metastore.server.thrift.transport.mode=http
> hive.metastore.client.thrift.transport.mode=http{noformat}
> in HMS/HS2 warnings are seen in the logs that these aren't recognised.
> {noformat}
> [mdc@18060 class="conf.HiveConf" level="WARN" thread="main"] HiveConf of name 
> hive.metastore.client.thrift.transport.mode 
> does not exist{noformat}
> {noformat}
> [mdc@18060 class="conf.HiveConf" level="WARN" thread=""] HiveConf of name 
> hive.metastore.server.thrift.transport.mode 
> does not exist
> {noformat}
> But I see in the logs the mode used is indeed HTTP. So these warnings are 
> just noise(not actual issues) and need to be taken care of.





[jira] [Commented] (HIVE-27342) Duplicate row returned using Order by, Limit and Offset

2023-06-18 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733864#comment-17733864
 ] 

okumin commented on HIVE-27342:
---

[~jimmydeng] I could not reproduce the issue on 4.0.0-alpha-2.
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> create table t1(f1 int);
...
No rows affected (0.057 seconds)
0: jdbc:hive2://hive-hiveserver2:1/defaul> insert into t1 
values(111),(222),(333),(444),(555),(666),(777),(888),(999);
...
9 rows affected (11.162 seconds)
...
0: jdbc:hive2://hive-hiveserver2:1/defaul> select * from t1 order by f1 
limit 0,3;
...
++
| t1.f1  |
++
| 111    |
| 222    |
| 333    |
++
...
0: jdbc:hive2://hive-hiveserver2:1/defaul> select * from t1 order by f1 
limit 3,3;
...
++
| t1.f1  |
++
| 444    |
| 555    |
| 666    |
++{code}
I remember we applied [several patches to 
VectorLimitOperator|https://github.com/apache/hive/commits/master/ql/src/java/org/apache/hadoop/hive/ql/exec/vector/VectorLimitOperator.java]
 to fix issues related to OFFSET.

Actually, we backported 
[HIVE-22120|https://issues.apache.org/jira/browse/HIVE-22120], 
[HIVE-22164|https://issues.apache.org/jira/browse/HIVE-22164], and 
[HIVE-23265|https://issues.apache.org/jira/browse/HIVE-23265]. I guess 
HIVE-22164 would fix your problem though my memory could be wrong.

> Duplicate row returned using Order by, Limit and Offset
> --
>
> Key: HIVE-27342
> URL: https://issues.apache.org/jira/browse/HIVE-27342
> Project: Hive
>  Issue Type: Bug
>Affects Versions: 3.1.1
>Reporter: jimmydeng
>Priority: Major
>
> Create an example table: 
> {code:java}
> create table t1(f1 int);
> insert into t1 values(111),(222),(333),(444),(555),(666),(777),(888),(999); 
> {code}
>  
> Query using order by, limit, offset. Page 1 is correct: 
> {code:java}
> select * from t1 order by f1 limit 0,3;
> +-+
> | t1.f1   |
> +-+
> | 111     |
> | 222     |
> | 333     |
> +-+{code}
>  
> But there is a duplicate row `333` on page 2: 
> {code:java}
> select * from t1 order by f1 limit 3,3;
> +-+
> | t1.f1   |
> +-+
> | 333     |
> | 444     |
> | 555     |
> +-+  
> {code}
> Setting hive.vectorized.execution.reduce.enabled=false makes the problem 
> disappear.
>  
>  
>  
>  
>  





[jira] [Commented] (HIVE-27241) insert queries failing for iceberg table stored with orc using zstd compression codec.

2023-06-18 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733862#comment-17733862
 ] 

okumin commented on HIVE-27241:
---

[~dharmikt] Which version and environment did you test? I could not reproduce 
the issue with `4.0.0-alpha-2` or the latest master branch.

 
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> create table test_dt (id int) 
stored by iceberg stored as orc 
tblproperties('write.orc.compression-codec'='zstd');
...
No rows affected (0.717 seconds)
0: jdbc:hive2://hive-hiveserver2:1/defaul> insert into test_dt values (1); 
...
--
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  
FAILED  KILLED  
--
        VERTICES      MODE        STATUS  TOTAL  COMPLETED  RUNNING  PENDING  
FAILED  KILLED  
--
Map 1 .. container     SUCCEEDED      1          1        0        0    
   0       0  
Reducer 2 .. container     SUCCEEDED      1          1        0        0    
   0       0  
--
VERTICES: 02/02  [==>>] 100%  ELAPSED TIME: 7.54 s     
--
INFO  : Starting task [Stage-3:STATS] in serial mode
INFO  : Executing stats task
INFO  : Table default.test_dt stats: [numFiles=1, numRows=1, totalSize=246, 
rawDataSize=0, numFilesErasureCoded=0]
INFO  : Completed executing 
command(queryId=hive_20230618063054_cc11b0af-75ec-464e-a7c4-10a90622c047); Time 
taken: 11.334 seconds
1 row affected (11.52 seconds)
0: jdbc:hive2://hive-hiveserver2:1/defaul> select * from test_dt;
...
+-+
| test_dt.id  |
+-+
| 1           |
+-+
1 row selected (0.465 seconds){code}
 

> insert queries failing for iceberg table stored with orc using zstd 
> compression codec.
> --
>
> Key: HIVE-27241
> URL: https://issues.apache.org/jira/browse/HIVE-27241
> Project: Hive
>  Issue Type: Bug
>  Components: Iceberg integration
>Reporter: Dharmik Thakkar
>Priority: Major
>
> insert queries failing for iceberg table stored with orc using zstd 
> compression codec.
> {code:java}
> create table test_dt (id int) stored by iceberg stored as orc 
> tblproperties('write.orc.compression-codec'='zstd');
> insert into test_dt values (1); {code}
> {code:java}
> Error while compiling statement: FAILED: Execution Error, return code 2 from 
> org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, 
> vertexId=vertex_1681195782720_0001_2_00, diagnostics=[Task failed, 
> taskId=task_1681195782720_0001_2_00_00, diagnostics=[TaskAttempt 0 
> failed, info=[Error: Error while running task ( failure ) : 
> attempt_1681195782720_0001_2_00_00_0:java.lang.RuntimeException: 
> java.lang.RuntimeException: java.lang.NoClassDefFoundError: 
> io/airlift/compress/zstd/ZstdCompressor at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:351)
>  at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:280) at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)
>  at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:84)
>  at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:70)
>  at java.base/java.security.AccessController.doPrivileged(Native Method) at 
> java.base/javax.security.auth.Subject.doAs(Subject.java:423) at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>  at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:70)
>  at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:40)
>  at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36) at 
> org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118)
>  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>  at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>  at java.base/java.lang.Thread.run(Thread.java:829) Caused by: 
> java.lang.RuntimeException: java.lang.NoClassDefFoundError: 
> io/airlift/compress/zstd/ZstdCompressor at 
> 

[jira] [Updated] (HIVE-24286) Render date and time with progress of Hive on Tez

2023-06-18 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-24286:
--
Status: Patch Available  (was: In Progress)

> Render date and time with progress of Hive on Tez
> -
>
> Key: HIVE-24286
> URL: https://issues.apache.org/jira/browse/HIVE-24286
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Add date/time to each line written by RenderStrategy like MapReduce and Spark.
>  
>  * 
> [https://github.com/apache/hive/blob/31c1658d9884eb4f31b06eaa718dfef8b1d92d22/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/HadoopJobExecHelper.java#L350]
>  * 
> [https://github.com/apache/hive/blob/31c1658d9884eb4f31b06eaa718dfef8b1d92d22/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/RenderStrategy.java#L64-L67]
>  
> This ticket would add the current time to the head of each line.
>  
> {code:java}
> 2020-10-19 13:32:41,162   Map 1: 0/1  Reducer 2: 0/1  
> 2020-10-19 13:32:44,231   Map 1: 0/1  Reducer 2: 0/1  
> 2020-10-19 13:32:46,813   Map 1: 0(+1)/1  Reducer 2: 0/1  
> 2020-10-19 13:32:49,878   Map 1: 0(+1)/1  Reducer 2: 0/1  
> 2020-10-19 13:32:51,416   Map 1: 1/1  Reducer 2: 0/1  
> 2020-10-19 13:32:51,936   Map 1: 1/1  Reducer 2: 0(+1)/1  
> 2020-10-19 13:32:52,877   Map 1: 1/1  Reducer 2: 1/1  
> {code}
>  
>  





[jira] [Work started] (HIVE-23831) Introduce a threshold to turn on or off auto-parallelism of Tez

2023-06-18 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-23831 started by okumin.
-
> Introduce a threshold to turn on or off auto-parallelism of Tez
> ---
>
> Key: HIVE-23831
> URL: https://issues.apache.org/jira/browse/HIVE-23831
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>
> Hive on Tez disables auto-reduce and uses the maximum number of partitions 
> when the minimum number of reducers, `estimated number * 
> min.partition.factor`, is less than or equal to 1.
> [https://github.com/apache/hive/blob/f98e136bdd5642e3de10d2fd1a4c14d1d6762113/ql/src/java/org/apache/hadoop/hive/ql/parse/GenTezUtils.java#L121-L131]
> https://issues.apache.org/jira/browse/HIVE-14200
>  
> This can make Tez lose the chance to optimize the number of reducers, 
> especially in an environment with a small `hive.tez.min.partition.factor` or 
> a big `hive.tez.max.partition.factor`.
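The heuristic quoted above reduces to a single predicate; a toy version with illustrative names (the real check lives in GenTezUtils):

```java
// Toy version of the auto-parallelism heuristic: auto-reduce stays on only
// while estimatedReducers * minPartitionFactor exceeds 1; otherwise Hive
// falls back to the maximum number of partitions.
final class AutoReduceCheck {
    static boolean autoReduceEnabled(int estimatedReducers, double minPartitionFactor) {
        return estimatedReducers * minPartitionFactor > 1.0;
    }
}
```

This shows the failure mode the ticket targets: even a large reducer estimate gets auto-parallelism disabled when `hive.tez.min.partition.factor` is small enough to pull the product to 1 or below.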





[jira] [Updated] (HIVE-23831) Introduce a threshold to turn on or off auto-parallelism of Tez

2023-06-18 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-23831:
--
Status: Patch Available  (was: In Progress)

> Introduce a threshold to turn on or off auto-parallelism of Tez
> ---
>
> Key: HIVE-23831
> URL: https://issues.apache.org/jira/browse/HIVE-23831
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>
> Hive on Tez disables auto-reduce and uses the maximum number of partitions 
> when the minimum number of reducers, `estimated number * 
> min.partition.factor`, is less than or equal to 1.
> [https://github.com/apache/hive/blob/f98e136bdd5642e3de10d2fd1a4c14d1d6762113/ql/src/java/org/apache/hadoop/hive/ql/parse/GenTezUtils.java#L121-L131]
> https://issues.apache.org/jira/browse/HIVE-14200
>  
> This can make Tez lose the chance to optimize the number of reducers, 
> especially in an environment with a small `hive.tez.min.partition.factor` or 
> a big `hive.tez.max.partition.factor`.





[jira] [Updated] (HIVE-23831) Introduce a threshold to turn on or off auto-parallelism of Tez

2023-06-16 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-23831:
--
Summary: Introduce a threshold to turn on or off auto-parallelism of Tez  
(was: Enforce auto-parallelism of Tez even if the number of reducers can be 1)

> Introduce a threshold to turn on or off auto-parallelism of Tez
> ---
>
> Key: HIVE-23831
> URL: https://issues.apache.org/jira/browse/HIVE-23831
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> Hive on Tez disables auto-reduce and uses the maximum number of partitions 
> when the minimum number of reducers, `estimated number * 
> min.partition.factor`, is less than or equal to 1.
> [https://github.com/apache/hive/blob/f98e136bdd5642e3de10d2fd1a4c14d1d6762113/ql/src/java/org/apache/hadoop/hive/ql/parse/GenTezUtils.java#L121-L131]
> https://issues.apache.org/jira/browse/HIVE-14200
>  
> This can make Tez lose the chance to optimize the number of reducers, 
> especially in an environment with a small `hive.tez.min.partition.factor`.
> I'd say it would be fine to introduce an option to enforce auto-reduce.





[jira] [Updated] (HIVE-23831) Introduce a threshold to turn on or off auto-parallelism of Tez

2023-06-16 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-23831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-23831:
--
Description: 
Hive on Tez disables auto-reduce and uses the maximum number of partitions when 
the minimum number of reducers, `estimated number * min.partition.factor`, is 
less than or equal to 1.

[https://github.com/apache/hive/blob/f98e136bdd5642e3de10d2fd1a4c14d1d6762113/ql/src/java/org/apache/hadoop/hive/ql/parse/GenTezUtils.java#L121-L131]

https://issues.apache.org/jira/browse/HIVE-14200

 

This can make Tez lose the chance to optimize the number of reducers, 
especially in an environment with a small `hive.tez.min.partition.factor` or a 
big `hive.tez.max.partition.factor`.

  was:
Hive on Tez disables auto-reduce and uses the maximum number of partitions when 
the minimum number of reducers, `estimated number * min.partition.factor`, is 
less than or equal to 1.

[https://github.com/apache/hive/blob/f98e136bdd5642e3de10d2fd1a4c14d1d6762113/ql/src/java/org/apache/hadoop/hive/ql/parse/GenTezUtils.java#L121-L131]

https://issues.apache.org/jira/browse/HIVE-14200

 

This can make Tez lose the chance to optimize the number of reducers, 
especially in an environment with a small `hive.tez.min.partition.factor`.

I'd say it would be fine to introduce an option to enforce auto-reduce.


> Introduce a threshold to turn on or off auto-parallelism of Tez
> ---
>
> Key: HIVE-23831
> URL: https://issues.apache.org/jira/browse/HIVE-23831
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> Hive on Tez disables auto-reduce and uses the maximum number of partitions 
> when the minimum number of reducers, `estimated number * 
> min.partition.factor`, is less than or equal to 1.
> [https://github.com/apache/hive/blob/f98e136bdd5642e3de10d2fd1a4c14d1d6762113/ql/src/java/org/apache/hadoop/hive/ql/parse/GenTezUtils.java#L121-L131]
> https://issues.apache.org/jira/browse/HIVE-14200
>  
> This can make Tez lose the chance to optimize the number of reducers, 
> especially in an environment with a small `hive.tez.min.partition.factor` or 
> a big `hive.tez.max.partition.factor`.





[jira] [Work started] (HIVE-24286) Render date and time with progress of Hive on Tez

2023-06-16 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-24286 started by okumin.
-
> Render date and time with progress of Hive on Tez
> -
>
> Key: HIVE-24286
> URL: https://issues.apache.org/jira/browse/HIVE-24286
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Add the date/time to each line written by RenderStrategy, as MapReduce and Spark do.
>  
>  * 
> [https://github.com/apache/hive/blob/31c1658d9884eb4f31b06eaa718dfef8b1d92d22/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/HadoopJobExecHelper.java#L350]
>  * 
> [https://github.com/apache/hive/blob/31c1658d9884eb4f31b06eaa718dfef8b1d92d22/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/RenderStrategy.java#L64-L67]
>  
> This ticket would add the current time to the head of each line.
>  
> {code:java}
> 2020-10-19 13:32:41,162   Map 1: 0/1  Reducer 2: 0/1  
> 2020-10-19 13:32:44,231   Map 1: 0/1  Reducer 2: 0/1  
> 2020-10-19 13:32:46,813   Map 1: 0(+1)/1  Reducer 2: 0/1  
> 2020-10-19 13:32:49,878   Map 1: 0(+1)/1  Reducer 2: 0/1  
> 2020-10-19 13:32:51,416   Map 1: 1/1  Reducer 2: 0/1  
> 2020-10-19 13:32:51,936   Map 1: 1/1  Reducer 2: 0(+1)/1  
> 2020-10-19 13:32:52,877   Map 1: 1/1  Reducer 2: 1/1  
> {code}
>  
>  
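A minimal sketch of producing such timestamp-prefixed progress lines. The function and the input shape are illustrative, not the actual RenderStrategy code:

```python
from datetime import datetime

def render_progress(vertex_progress, now=None):
    """Prefix a Tez progress line with a timestamp, as the ticket proposes.
    vertex_progress is a list of (vertex_name, completed, total) tuples."""
    # %f gives microseconds; drop the last three digits to get milliseconds
    ts = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S,%f")[:-3]
    body = "\t".join(f"{name}: {done}/{total}"
                     for name, done, total in vertex_progress)
    return f"{ts}\t{body}"
```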





[jira] [Commented] (HIVE-12679) Allow users to be able to specify an implementation of IMetaStoreClient via HiveConf

2023-06-15 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-12679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733324#comment-17733324
 ] 

okumin commented on HIVE-12679:
---

[~moomindani] Hi, we definitely need this patch and have actually backported 
it. I am sure it should be merged upstream. Could you please try to send the 
patch, or can I take it over?

The Hive community pays more attention to pull requests, and I hope someone 
will review it. Thanks.

> Allow users to be able to specify an implementation of IMetaStoreClient via 
> HiveConf
> 
>
> Key: HIVE-12679
> URL: https://issues.apache.org/jira/browse/HIVE-12679
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration, Metastore, Query Planning
>Reporter: Austin Lee
>Assignee: Noritaka Sekiyama
>Priority: Minor
>  Labels: metastore, pull-request-available
> Attachments: HIVE-12679.1.patch, HIVE-12679.2.patch, 
> HIVE-12679.branch-1.2.patch, HIVE-12679.branch-2.3.patch, HIVE-12679.patch
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> Hi,
> I would like to propose a change that would make it possible for users to 
> choose an implementation of IMetaStoreClient via HiveConf, i.e. 
> hive-site.xml.  Currently, in Hive the choice is hard coded to be 
> SessionHiveMetaStoreClient in org.apache.hadoop.hive.ql.metadata.Hive.  There 
> is no other direct reference to SessionHiveMetaStoreClient other than the 
> hard coded class name in Hive.java and the QL component operates only on the 
> IMetaStoreClient interface so the change would be minimal and it would be 
> quite similar to how an implementation of RawStore is specified and loaded in 
> hive-metastore.  One use case this change would serve would be one where a 
> user wishes to use an implementation of this interface without the dependency 
> on the Thrift server.
>   
> Thank you,
> Austin
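The proposal above, resolving the client implementation by fully qualified class name from configuration the way RawStore implementations are loaded, can be sketched like this. The config key and function name are hypothetical, not actual Hive identifiers:

```python
import importlib

def load_client_class(conf, default="collections.OrderedDict"):
    """Resolve a class from configuration by fully qualified name.
    The key 'metastore.client.impl' is illustrative; the default here is a
    stdlib class only so the sketch is runnable."""
    fqcn = conf.get("metastore.client.impl", default)
    module_name, _, class_name = fqcn.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

In Hive the same pattern would use `Class.forName` plus `ReflectionUtils`, keyed off a HiveConf entry, so QL keeps operating only on the IMetaStoreClient interface.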





[jira] [Commented] (HIVE-24485) Make the slow-start behavior tunable

2023-06-12 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17731520#comment-17731520
 ] 

okumin commented on HIVE-24485:
---

Thank you!

> Make the slow-start behavior tunable
> 
>
> Key: HIVE-24485
> URL: https://issues.apache.org/jira/browse/HIVE-24485
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive, Tez
>Affects Versions: 3.1.2, 4.0.0
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> This ticket would enable users to configure the timing of slow-start with 
> `tez.shuffle-vertex-manager.min-src-fraction` and 
> `tez.shuffle-vertex-manager.max-src-fraction`.
> Hive on Tez currently doesn't honor these parameters and ShuffleVertexManager 
> always uses the default value.
> We can control the timing of starting vertices and the accuracy of the 
> estimated input size if we can tweak these parameters. This is useful when a 
> vertex has tasks that process different amounts of data.
>  
> We can reproduce the issue with this query.
> {code:java}
> SET hive.tez.auto.reducer.parallelism=true;
> SET hive.tez.min.partition.factor=1.0; -- enforce auto-parallelism
> SET tez.shuffle-vertex-manager.min-src-fraction=0.55;
> SET tez.shuffle-vertex-manager.max-src-fraction=0.95;
> CREATE TABLE mofu (name string);
> INSERT INTO mofu (name) VALUES ('12345');
> SELECT name, count(*) FROM mofu GROUP BY name;{code}
> The fractions are ignored.
> {code:java}
> 2020-12-04 11:41:42,484 [INFO] [Dispatcher thread {Central}] 
> |vertexmanager.ShuffleVertexManagerBase|: Settings minFrac: 0.25 maxFrac: 
> 0.75 auto: true desiredTaskIput: 25600
> {code}
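How the min/max source fractions govern slow-start can be sketched as follows. This is a simplified linear model of ShuffleVertexManager's scheduling, not its actual code:

```python
def tasks_to_schedule(total_consumers, src_completed_fraction,
                      min_frac, max_frac):
    """Slow-start model: no consumer tasks start before min_frac of the
    source tasks have finished; all start once max_frac is reached; in
    between, the allowed count grows linearly."""
    if src_completed_fraction < min_frac:
        return 0
    if src_completed_fraction >= max_frac:
        return total_consumers
    ratio = (src_completed_fraction - min_frac) / (max_frac - min_frac)
    return int(total_consumers * ratio)
```

Raising `min_frac` (e.g. to 0.55 as in the repro) delays consumer launch until more source output exists, which improves the input-size estimate at the cost of later startup.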





[jira] [Commented] (HIVE-27120) Warn when Authorizer V2 is configured

2023-05-24 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17726015#comment-17726015
 ] 

okumin commented on HIVE-27120:
---

Thanks!

> Warn when Authorizer V2 is configured
> -
>
> Key: HIVE-27120
> URL: https://issues.apache.org/jira/browse/HIVE-27120
> Project: Hive
>  Issue Type: Improvement
>Reporter: okumin
>Assignee: okumin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> SessionState configures an internal param that is not listed in HiveConf. 
> This causes a WARN message.
> {code:java}
> pod/hive-hiveserver2-7fc4df88b6-dmn8v: 2023-03-01T13:50:38,959  WARN [main] 
> conf.HiveConf: HiveConf of name 
> hive.internal.ss.authz.settings.applied.marker does not exist {code}





[jira] [Resolved] (HIVE-22601) Some columns will be lost when a UDTF has multiple aliases in some cases

2023-05-23 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin resolved HIVE-22601.
---
Resolution: Duplicate

I verified this issue doesn't happen with Hive 4.

 
{code:java}
0: jdbc:hive2://hive-hiveserver2:1/defaul> SELECT stack(1, 'a', 'b', 'c') 
AS (c1, c2, c3)
. . . . . . . . . . . . . . . . . . . . . . .> UNION ALL
. . . . . . . . . . . . . . . . . . . . . . .> SELECT stack(1, 'd', 'e', 'f') 
AS (c1, c2, c3);
...
+-+-+-+
| c1  | c2  | c3  |
+-+-+-+
| a   | b   | c   |
| d   | e   | f   |
+-+-+-+ {code}
It looks like HIVE-25754 resolved this issue. Thanks!

 

> Some columns will be lost when a UDTF has multiple aliases in some cases
> 
>
> Key: HIVE-22601
> URL: https://issues.apache.org/jira/browse/HIVE-22601
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 2.1.1, 2.2.0, 2.3.6, 3.1.2
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-22601.1.patch, HIVE-22601.2.patch, 
> HIVE-22601.3.patch, HIVE-22601.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Only one column will be retained when putting UDTFs with multiple aliases and 
> a top-level UNION together.
> For example, the result of the following SQL should have three columns, c1, 
> c2 and c3.
> {code:java}
> SELECT stack(1, 'a', 'b', 'c') AS (c1, c2, c3)
> UNION ALL
> SELECT stack(1, 'd', 'e', 'f') AS (c1, c2, c3);
> {code}
> However, the c3 column is the only one I get.
> {code:java}
> +-+
> | _u1.c3  |
> +-+
> | c   |
> | f   |
> +-+
> {code}





[jira] [Updated] (HIVE-22601) Some columns will be lost when a UDTF has multiple aliases in some cases

2023-05-23 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-22601:
--
Status: Open  (was: Patch Available)

> Some columns will be lost when a UDTF has multiple aliases in some cases
> 
>
> Key: HIVE-22601
> URL: https://issues.apache.org/jira/browse/HIVE-22601
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 3.1.2, 2.3.6, 2.2.0, 2.1.1
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-22601.1.patch, HIVE-22601.2.patch, 
> HIVE-22601.3.patch, HIVE-22601.patch
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Only one column will be retained when putting UDTFs with multiple aliases and 
> a top-level UNION together.
> For example, the result of the following SQL should have three columns, c1, 
> c2 and c3.
> {code:java}
> SELECT stack(1, 'a', 'b', 'c') AS (c1, c2, c3)
> UNION ALL
> SELECT stack(1, 'd', 'e', 'f') AS (c1, c2, c3);
> {code}
> However, the c3 column is the only one I get.
> {code:java}
> +-+
> | _u1.c3  |
> +-+
> | c   |
> | f   |
> +-+
> {code}





[jira] [Commented] (HIVE-27363) Typo in the description of hive.tez.bucket.pruning

2023-05-22 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725141#comment-17725141
 ] 

okumin commented on HIVE-27363:
---

Thanks!

> Typo in the description of hive.tez.bucket.pruning
> --
>
> Key: HIVE-27363
> URL: https://issues.apache.org/jira/browse/HIVE-27363
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration
>Affects Versions: 4.0.0-alpha-2
>Reporter: okumin
>Assignee: okumin
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> `hive.optimize.index.filters` seems to be a typo of 
> `hive.optimize.index.filter`.
> [https://github.com/apache/hive/blob/8f2ee8baae6fbc0a60128fbeae3f51c7b9ee62d6/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L4782]
> {code:java}
>     TEZ_OPTIMIZE_BUCKET_PRUNING(
>         "hive.tez.bucket.pruning", true,
>          "When pruning is enabled, filters on bucket columns will be 
> processed by \n" +
>          "filtering the splits against a bitset of included buckets. This 
> needs predicates \n"+
>             "produced by hive.optimize.ppd and 
> hive.optimize.index.filters."), {code}





[jira] [Assigned] (HIVE-27363) Typo in the description of hive.tez.bucket.pruning

2023-05-20 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin reassigned HIVE-27363:
-

Assignee: okumin

> Typo in the description of hive.tez.bucket.pruning
> --
>
> Key: HIVE-27363
> URL: https://issues.apache.org/jira/browse/HIVE-27363
> Project: Hive
>  Issue Type: Improvement
>  Components: Configuration
>Affects Versions: 4.0.0-alpha-2
>Reporter: okumin
>Assignee: okumin
>Priority: Minor
>
> `hive.optimize.index.filters` seems to be a typo of 
> `hive.optimize.index.filter`.
> [https://github.com/apache/hive/blob/8f2ee8baae6fbc0a60128fbeae3f51c7b9ee62d6/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L4782]
> {code:java}
>     TEZ_OPTIMIZE_BUCKET_PRUNING(
>         "hive.tez.bucket.pruning", true,
>          "When pruning is enabled, filters on bucket columns will be 
> processed by \n" +
>          "filtering the splits against a bitset of included buckets. This 
> needs predicates \n"+
>             "produced by hive.optimize.ppd and 
> hive.optimize.index.filters."), {code}





[jira] [Created] (HIVE-27363) Typo in the description of hive.tez.bucket.pruning

2023-05-20 Thread okumin (Jira)
okumin created HIVE-27363:
-

 Summary: Typo in the description of hive.tez.bucket.pruning
 Key: HIVE-27363
 URL: https://issues.apache.org/jira/browse/HIVE-27363
 Project: Hive
  Issue Type: Improvement
  Components: Configuration
Affects Versions: 4.0.0-alpha-2
Reporter: okumin


`hive.optimize.index.filters` seems to be a typo of `hive.optimize.index.filter`.

[https://github.com/apache/hive/blob/8f2ee8baae6fbc0a60128fbeae3f51c7b9ee62d6/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L4782]
{code:java}
    TEZ_OPTIMIZE_BUCKET_PRUNING(
        "hive.tez.bucket.pruning", true,
         "When pruning is enabled, filters on bucket columns will be processed 
by \n" +
         "filtering the splits against a bitset of included buckets. This needs 
predicates \n"+
            "produced by hive.optimize.ppd and hive.optimize.index.filters."), 
{code}





[jira] [Assigned] (HIVE-27120) Warn when Authorizer V2 is configured

2023-03-03 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin reassigned HIVE-27120:
-

Assignee: okumin

> Warn when Authorizer V2 is configured
> -
>
> Key: HIVE-27120
> URL: https://issues.apache.org/jira/browse/HIVE-27120
> Project: Hive
>  Issue Type: Improvement
>Reporter: okumin
>Assignee: okumin
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> SessionState configures an internal param that is not listed in HiveConf. 
> This causes a WARN message.
> {code:java}
> pod/hive-hiveserver2-7fc4df88b6-dmn8v: 2023-03-01T13:50:38,959  WARN [main] 
> conf.HiveConf: HiveConf of name 
> hive.internal.ss.authz.settings.applied.marker does not exist {code}





[jira] [Commented] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed

2022-06-13 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553854#comment-17553854
 ] 

okumin commented on HIVE-26184:
---

Thanks a lot!

> COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
> ---
>
> Key: HIVE-26184
> URL: https://issues.apache.org/jira/browse/HIVE-26184
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.3.8, 3.1.3
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0-alpha-2
>
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> I observed some reducers spending 98% of their CPU time invoking 
> `java.util.HashMap#clear`.
> Looking into the details, I found that COLLECT_SET reuses a LinkedHashSet, 
> and its `clear` can be quite heavy when a relation has a small number of 
> highly skewed keys.
>  
> To reproduce the issue, first, we will create rows with a skewed key.
> {code:java}
> INSERT INTO test_collect_set
> SELECT '----' AS key, CAST(UUID() AS VARCHAR) 
> AS value
> FROM table_with_many_rows
> LIMIT 10;{code}
> Then, we will create many non-skewed rows.
> {code:java}
> INSERT INTO test_collect_set
> SELECT UUID() AS key, UUID() AS value
> FROM table_with_many_rows
> LIMIT 500;{code}
> We can observe the issue when we aggregate values by `key`.
> {code:java}
> SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}





[jira] [Commented] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed

2022-04-28 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17529389#comment-17529389
 ] 

okumin commented on HIVE-26184:
---

[~kgyrtkirk] 

Thanks. I meant that something goes wrong when a highly skewed key (100,000 
rows of `----` in the example) coexists with 
non-skewed keys (5,000,000 unique UUIDs). I also put a comment in the PR; 
please feel free to ask me if that doesn't make sense.
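The cost asymmetry behind this report, Java's HashMap#clear walking the entire backing table whose capacity never shrinks after the skewed group passes through, can be modeled roughly as follows. This is an illustrative simulation, not Hive's UDAF code:

```python
def simulated_clear_cost(group_sizes):
    """Compare total 'clear' work for a reused buffer vs. a fresh set per
    group. In Java, HashMap#clear touches every slot of the backing table,
    and the table's capacity only grows, so after one huge group every
    subsequent tiny group still pays for the peak capacity."""
    capacity = 0
    reused_cost = 0
    for size in group_sizes:
        capacity = max(capacity, size)  # backing table never shrinks
        reused_cost += capacity         # clear() walks the full table
    fresh_cost = sum(group_sizes)       # a fresh set pays only its own size
    return reused_cost, fresh_cost
```

One skewed group of 100,000 followed by tiny groups makes the reused buffer pay ~100,000 units of clear work per group, which matches the "98% of CPU time in clear" observation.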

> COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
> ---
>
> Key: HIVE-26184
> URL: https://issues.apache.org/jira/browse/HIVE-26184
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.3.8, 3.1.3
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> I observed some reducers spending 98% of their CPU time invoking 
> `java.util.HashMap#clear`.
> Looking into the details, I found that COLLECT_SET reuses a LinkedHashSet, 
> and its `clear` can be quite heavy when a relation has a small number of 
> highly skewed keys.
>  
> To reproduce the issue, first, we will create rows with a skewed key.
> {code:java}
> INSERT INTO test_collect_set
> SELECT '----' AS key, CAST(UUID() AS VARCHAR) 
> AS value
> FROM table_with_many_rows
> LIMIT 10;{code}
> Then, we will create many non-skewed rows.
> {code:java}
> INSERT INTO test_collect_set
> SELECT UUID() AS key, UUID() AS value
> FROM table_with_many_rows
> LIMIT 500;{code}
> We can observe the issue when we aggregate values by `key`.
> {code:java}
> SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}





[jira] [Updated] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed

2022-04-28 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-26184:
--
Description: 
I observed some reducers spending 98% of their CPU time invoking 
`java.util.HashMap#clear`.

Looking into the details, I found that COLLECT_SET reuses a LinkedHashSet, and 
its `clear` can be quite heavy when a relation has a small number of highly 
skewed keys.

 

To reproduce the issue, first, we will create rows with a skewed key.
{code:java}
INSERT INTO test_collect_set
SELECT '----' AS key, CAST(UUID() AS VARCHAR) 
AS value
FROM table_with_many_rows
LIMIT 10;{code}
Then, we will create many non-skewed rows.
{code:java}
INSERT INTO test_collect_set
SELECT UUID() AS key, UUID() AS value
FROM table_with_many_rows
LIMIT 500;{code}
We can observe the issue when we aggregate values by `key`.
{code:java}
SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}

  was:
I observed some reducers spending 98% of their CPU time invoking 
`java.util.HashMap#clear`.

Looking into the details, I found that COLLECT_SET reuses a LinkedHashSet, and 
its `clear` can be quite heavy when a relation has a small number of highly 
skewed keys.

 

To reproduce the issue, first, we will create rows with a skewed key.
{code:java}
INSERT INTO test_collect_set
SELECT '----' AS key, CAST(UUID() AS VARCHAR) 
AS value
FROM table_with_many_rows
LIMIT 10;{code}
Then, we will create many non-skewed rows.
{code:java}
INSERT INTO test_collect_set
SELECT UUID() AS key, UUID() AS value
FROM sample_datasets.nasdaq
LIMIT 500;{code}
We can observe the issue when we aggregate values by `key`.
{code:java}
SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}


> COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
> ---
>
> Key: HIVE-26184
> URL: https://issues.apache.org/jira/browse/HIVE-26184
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 2.3.8, 3.1.3
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> I observed some reducers spending 98% of their CPU time invoking 
> `java.util.HashMap#clear`.
> Looking into the details, I found that COLLECT_SET reuses a LinkedHashSet, 
> and its `clear` can be quite heavy when a relation has a small number of 
> highly skewed keys.
>  
> To reproduce the issue, first, we will create rows with a skewed key.
> {code:java}
> INSERT INTO test_collect_set
> SELECT '----' AS key, CAST(UUID() AS VARCHAR) 
> AS value
> FROM table_with_many_rows
> LIMIT 10;{code}
> Then, we will create many non-skewed rows.
> {code:java}
> INSERT INTO test_collect_set
> SELECT UUID() AS key, UUID() AS value
> FROM table_with_many_rows
> LIMIT 500;{code}
> We can observe the issue when we aggregate values by `key`.
> {code:java}
> SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}





[jira] [Assigned] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed

2022-04-28 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-26184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin reassigned HIVE-26184:
-


> COLLECT_SET with GROUP BY is very slow when some keys are highly skewed
> ---
>
> Key: HIVE-26184
> URL: https://issues.apache.org/jira/browse/HIVE-26184
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 3.1.3, 2.3.8
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> I observed some reducers spending 98% of their CPU time invoking 
> `java.util.HashMap#clear`.
> Looking into the details, I found that COLLECT_SET reuses a LinkedHashSet, 
> and its `clear` can be quite heavy when a relation has a small number of 
> highly skewed keys.
>  
> To reproduce the issue, first, we will create rows with a skewed key.
> {code:java}
> INSERT INTO test_collect_set
> SELECT '----' AS key, CAST(UUID() AS VARCHAR) 
> AS value
> FROM table_with_many_rows
> LIMIT 10;{code}
> Then, we will create many non-skewed rows.
> {code:java}
> INSERT INTO test_collect_set
> SELECT UUID() AS key, UUID() AS value
> FROM sample_datasets.nasdaq
> LIMIT 500;{code}
> We can observe the issue when we aggregate values by `key`.
> {code:java}
> SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}





[jira] [Work started] (HIVE-25410) CommonMergeJoinOperator fails when a join key is ARRAY with arbitrary size

2021-07-31 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HIVE-25410 started by okumin.
-
> CommonMergeJoinOperator fails when a join key is ARRAY with arbitrary size
> --
>
> Key: HIVE-25410
> URL: https://issues.apache.org/jira/browse/HIVE-25410
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Thanks to HIVE-24883, CommonMergeJoinOperator can handle ARRAY or STRUCT 
> types as a JOIN key.
> There are corner cases where CommonMergeJoinOperator fails with 
> `ArrayIndexOutOfBoundsException`.
>  
> This is a simple case.
> {code:java}
> SET hive.auto.convert.join=false;
> CREATE TABLE table_list_types (id int, key array<int>);
> INSERT INTO table_list_types VALUES (1, array(1, 2)), (2, array(1, 2)), (3, 
> array(1, 2, 3)), (4, array(1, 2, 3));
> SELECT * FROM table_list_types t1 INNER JOIN table_list_types t2 ON t1.key = 
> t2.key; {code}
> With 69c97c26ac68a245f4d327cc2f7b3a2333f8fa84, the following error happened.
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
>   at 
> org.apache.hadoop.hive.ql.exec.HiveStructComparator.compare(HiveStructComparator.java:57)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.compareKey(CommonMergeJoinOperator.java:629)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.compareKeys(CommonMergeJoinOperator.java:597)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.processKey(CommonMergeJoinOperator.java:566)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:249)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
>   ... 26 more {code}





[jira] [Commented] (HIVE-25410) CommonMergeJoinOperator fails when a join key is ARRAY with arbitrary size

2021-07-30 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390779#comment-17390779
 ] 

okumin commented on HIVE-25410:
---

[~zabetak] [~maheshk114]

Hi, we've found a corner case while backporting HIVE-24883. Could you please 
review this PR?

[https://github.com/apache/hive/pull/2551]

 

Thanks.
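One safe way to compare variable-length array keys, the case where HiveStructComparator threw ArrayIndexOutOfBoundsException, is to compare element-wise up to the shorter length and then break ties by length. The sketch below is illustrative and not necessarily the exact approach taken in the PR:

```python
def compare_arrays(a, b):
    """Three-way comparison of two join-key arrays that may differ in
    length: compare common elements first, then compare lengths, instead
    of indexing past the end of the shorter array."""
    for x, y in zip(a, b):          # zip stops at the shorter array
        if x != y:
            return -1 if x < y else 1
    if len(a) == len(b):
        return 0
    return -1 if len(a) < len(b) else 1
```

With this ordering, `array(1, 2)` sorts before `array(1, 2, 3)`, so a merge join can advance its sides consistently.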

> CommonMergeJoinOperator fails when a join key is ARRAY with arbitrary size
> --
>
> Key: HIVE-25410
> URL: https://issues.apache.org/jira/browse/HIVE-25410
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Thanks to HIVE-24883, CommonMergeJoinOperator can handle ARRAY or STRUCT 
> types as a JOIN key.
> There are corner cases where CommonMergeJoinOperator fails with 
> `ArrayIndexOutOfBoundsException`.
>  
> This is a simple case.
> {code:java}
> SET hive.auto.convert.join=false;
> CREATE TABLE table_list_types (id int, key array<int>);
> INSERT INTO table_list_types VALUES (1, array(1, 2)), (2, array(1, 2)), (3, 
> array(1, 2, 3)), (4, array(1, 2, 3));
> SELECT * FROM table_list_types t1 INNER JOIN table_list_types t2 ON t1.key = 
> t2.key; {code}
> With 69c97c26ac68a245f4d327cc2f7b3a2333f8fa84, the following error happened.
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
>   at 
> org.apache.hadoop.hive.ql.exec.HiveStructComparator.compare(HiveStructComparator.java:57)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.compareKey(CommonMergeJoinOperator.java:629)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.compareKeys(CommonMergeJoinOperator.java:597)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.processKey(CommonMergeJoinOperator.java:566)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:249)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
>   ... 26 more {code}





[jira] [Assigned] (HIVE-25410) CommonMergeJoinOperator fails when a join key is ARRAY with arbitrary size

2021-07-30 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin reassigned HIVE-25410:
-


> CommonMergeJoinOperator fails when a join key is ARRAY with arbitrary size
> --
>
> Key: HIVE-25410
> URL: https://issues.apache.org/jira/browse/HIVE-25410
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Reporter: okumin
>Assignee: okumin
>Priority: Major
> Fix For: 4.0.0
>
>
> Thanks to HIVE-24883, CommonMergeJoinOperator can handle ARRAY or STRUCT 
> types as a JOIN key.
> There are corner cases where CommonMergeJoinOperator fails with 
> `ArrayIndexOutOfBoundsException`.
>  
> This is a simple case.
> {code:java}
> SET hive.auto.convert.join=false;
> CREATE TABLE table_list_types (id int, key array<int>);
> INSERT INTO table_list_types VALUES (1, array(1, 2)), (2, array(1, 2)), (3, 
> array(1, 2, 3)), (4, array(1, 2, 3));
> SELECT * FROM table_list_types t1 INNER JOIN table_list_types t2 ON t1.key = 
> t2.key; {code}
> With 69c97c26ac68a245f4d327cc2f7b3a2333f8fa84, the following error happened.
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
>   at 
> org.apache.hadoop.hive.ql.exec.HiveStructComparator.compare(HiveStructComparator.java:57)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.compareKey(CommonMergeJoinOperator.java:629)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.compareKeys(CommonMergeJoinOperator.java:597)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.processKey(CommonMergeJoinOperator.java:566)
>   at 
> org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:249)
>   at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
>   ... 26 more {code}





[jira] [Assigned] (HIVE-22601) Some columns will be lost when a UDTF has multiple aliases in some cases

2021-03-07 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-22601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin reassigned HIVE-22601:
-

Assignee: okumin  (was: Owen O'Malley)

> Some columns will be lost when a UDTF has multiple aliases in some cases
> 
>
> Key: HIVE-22601
> URL: https://issues.apache.org/jira/browse/HIVE-22601
> Project: Hive
>  Issue Type: Bug
>  Components: Query Processor
>Affects Versions: 2.1.1, 2.2.0, 2.3.6, 3.1.2
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
> Attachments: HIVE-22601.1.patch, HIVE-22601.2.patch, 
> HIVE-22601.3.patch, HIVE-22601.patch
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Only one column will be retained when putting UDTFs with multiple aliases and 
> a top-level UNION together.
> For example, the result of the following SQL should have three columns, c1, 
> c2 and c3.
> {code:java}
> SELECT stack(1, 'a', 'b', 'c') AS (c1, c2, c3)
> UNION ALL
> SELECT stack(1, 'd', 'e', 'f') AS (c1, c2, c3);
> {code}
> However, It's only the c3 column which I can get.
> {code:java}
> +-+
> | _u1.c3  |
> +-+
> | c   |
> | f   |
> +-+
> {code}





[jira] [Commented] (HIVE-24819) CombineHiveInputFormat format seems to be returning row count in the multiple of Maps

2021-02-25 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290989#comment-17290989
 ] 

okumin commented on HIVE-24819:
---

[~jitender], I believe the user mailing list below is the right place to ask 
questions.

[https://hive.apache.org/mailing_lists.html]

As for your issue, I don't think you have to use 
`org.apache.hadoop.hive.ql.io.CombineHiveInputFormat` with Tez.

CombineHiveInputFormat is capable of putting fragmented files/blocks together. 
I understand it is mainly used by engines other than Tez.

Tez has a similar and hopefully more powerful feature, Tez grouping, which 
groups blocks by itself.

[https://cwiki.apache.org/confluence/display/TEZ/How+initial+task+parallelism+works]

So I recommend using HiveInputFormat and tweaking parallelism with 
`tez.grouping.min-size` and `tez.grouping.max-size`.

To be honest, I don't know why TABLESAMPLE returns the wrong result.
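The Tez grouping idea, packing splits until a minimum group size is reached without exceeding a maximum, can be roughly modeled as below. This is a simplification of Tez's wave-based algorithm described at the link above, not its actual code:

```python
def group_splits(split_sizes, min_size, max_size):
    """Pack consecutive split sizes into groups: flush a group once it
    reaches min_size; never let adding a split push a group past max_size."""
    groups, current, current_size = [], [], 0
    for size in split_sizes:
        if current and current_size + size > max_size:
            groups.append(current)          # would overflow: flush first
            current, current_size = [], 0
        current.append(size)
        current_size += size
        if current_size >= min_size:
            groups.append(current)          # big enough: flush
            current, current_size = [], 0
    if current:
        groups.append(current)              # leftover undersized group
    return groups
```

Raising `tez.grouping.min-size` therefore means fewer, larger map tasks; lowering it means more, smaller ones, which is the knob recommended above instead of CombineHiveInputFormat.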

> CombineHiveInputFormat format seems to be returning row count in the multiple 
> of Maps 
> --
>
> Key: HIVE-24819
> URL: https://issues.apache.org/jira/browse/HIVE-24819
> Project: Hive
>  Issue Type: Bug
> Environment: Apache Hive (version 3.1.0.3.1.0.0-78)
> Driver: Hive JDBC (version 3.1.0.3.1.0.0-78)
> Transaction isolation: TRANSACTION_REPEATABLE_READ
> Beeline version 3.1.0.3.1.0.0-78 by Apache Hive
>Reporter: Jitender Kumar
>Priority: Critical
>
> Hi Team,
> This is the first time I am writing a bug using apache Jira, so pardon me if 
> I am unintentionally breaking any protocols. 
> I am facing the following issue (on a multi-node cluster) when I set 
> hive.tez.input.format to  
> org.apache.hadoop.hive.ql.io.CombineHiveInputFormat. 
> Just for demonstration purposes, I will be executing the following query for 
> multiple cases. 
> _select count(1) from dbname.personal_data_rc tablesample(1000 rows);_
> *Case1*
> mapred.map.tasks=2
> hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat
> *Output*
> 1000
> *Case 2*
> mapred.map.tasks=2
> hive.tez.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
> *Output*
> 2000
> *Case 3*
> mapred.map.tasks=3
> hive.tez.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
> *Output*
> 3000
> With 3 maps set as the default, the output remains the same, i.e. a multiple of 3. 
> Can you help me understand why, if TABLESAMPLE is set to 1000 rows, it is 
> returning more rows than requested? Is there any other property that must be 
> used with CombineHiveInputFormat, or is it an issue with CombineHiveInputFormat 
> itself? 
> I have tried to find a solution, but in the end I had to come here. Please 
> share your inputs as soon as possible, as one of our clients is looking for a 
> solution or an explanation. 
> For now, as a workaround, we have changed it to the following.  
> *hive.tez.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat*
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24749) Disable user's UDF use SystemExit

2021-02-08 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281262#comment-17281262
 ] 

okumin commented on HIVE-24749:
---

I wonder if it's possible to make the SecurityManager pluggable so that 
administrators can disallow users from running potentially dangerous UDFs.

I know this is too much for this ticket, though.
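As a rough sketch of the idea in this ticket (not the actual patch), a NoExitSecurityManager could turn a fatal exit into a catchable exception. Note that SecurityManager is deprecated for removal in recent JDKs, so this is only illustrative:

```java
import java.security.Permission;

// Hypothetical sketch, not the patch from this ticket: convert System.exit()
// in a UDF into a catchable SecurityException instead of killing HS2.
public class NoExitSecurityManager extends SecurityManager {
    @Override
    public void checkExit(int status) {
        throw new SecurityException(
            "System.exit(" + status + ") is not allowed in UDFs");
    }

    @Override
    public void checkPermission(Permission perm) {
        // Permit everything else; a real implementation might delegate to a
        // previously installed SecurityManager instead of allowing all.
    }
}
```

It would be installed with System.setSecurityManager(new NoExitSecurityManager()) before invoking user code; on JDK 18+ that call itself requires -Djava.security.manager=allow.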

> Disable user's UDF use SystemExit
> -
>
> Key: HIVE-24749
> URL: https://issues.apache.org/jira/browse/HIVE-24749
> Project: Hive
>  Issue Type: Bug
>  Components: HiveServer2
>Affects Versions: All Versions
>Reporter: xiepengjie
>Assignee: xiepengjie
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If System.exit() is executed in a user's UDF while the default 
> SecurityManager is in use, it will cause the HS2 service process to exit, 
> which is very bad.
> It is safer to use a NoExitSecurityManager, which can intercept System.exit().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24485) Make the slow-start behavior tunable

2021-02-05 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17280069#comment-17280069
 ] 

okumin commented on HIVE-24485:
---

[~gopalv] Thanks. I've moved the parameters and updated the PR.

I have one concern about which parameters we should expose to users for 
tweaking the slow-start behavior. I'd appreciate your opinions.

Thanks.

> Make the slow-start behavior tunable
> 
>
> Key: HIVE-24485
> URL: https://issues.apache.org/jira/browse/HIVE-24485
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive, Tez
>Affects Versions: 3.1.2, 4.0.0
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This ticket would enable users to configure the timing of slow-start with 
> `tez.shuffle-vertex-manager.min-src-fraction` and 
> `tez.shuffle-vertex-manager.max-src-fraction`.
> Hive on Tez currently doesn't honor these parameters and ShuffleVertexManager 
> always uses the default value.
> We can control the timing to start vertices and the accuracy of the estimated 
> input size if we can tweak these parameters. This is useful when a vertex has 
> tasks that process different amounts of data.
>  
> We can reproduce the issue with this query.
> {code:java}
> SET hive.tez.auto.reducer.parallelism=true;
> SET hive.tez.min.partition.factor=1.0; -- enforce auto-parallelism
> SET tez.shuffle-vertex-manager.min-src-fraction=0.55;
> SET tez.shuffle-vertex-manager.max-src-fraction=0.95;
> CREATE TABLE mofu (name string);
> INSERT INTO mofu (name) VALUES ('12345');
> SELECT name, count(*) FROM mofu GROUP BY name;{code}
> The fractions are ignored.
> {code:java}
> 2020-12-04 11:41:42,484 [INFO] [Dispatcher thread {Central}] 
> |vertexmanager.ShuffleVertexManagerBase|: Settings minFrac: 0.25 maxFrac: 
> 0.75 auto: true desiredTaskIput: 25600
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24485) Make the slow-start behavior tunable

2021-02-03 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17278519#comment-17278519
 ] 

okumin commented on HIVE-24485:
---

I got a notification that the PR would be marked as stale.

[~gopalv] or anyone familiar with Hive on Tez: Could you please take a look 
when you have a chance?

> Make the slow-start behavior tunable
> 
>
> Key: HIVE-24485
> URL: https://issues.apache.org/jira/browse/HIVE-24485
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive, Tez
>Affects Versions: 3.1.2, 4.0.0
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> This ticket would enable users to configure the timing of slow-start with 
> `tez.shuffle-vertex-manager.min-src-fraction` and 
> `tez.shuffle-vertex-manager.max-src-fraction`.
> Hive on Tez currently doesn't honor these parameters and ShuffleVertexManager 
> always uses the default value.
> We can control the timing to start vertices and the accuracy of the estimated 
> input size if we can tweak these parameters. This is useful when a vertex has 
> tasks that process different amounts of data.
>  
> We can reproduce the issue with this query.
> {code:java}
> SET hive.tez.auto.reducer.parallelism=true;
> SET hive.tez.min.partition.factor=1.0; -- enforce auto-parallelism
> SET tez.shuffle-vertex-manager.min-src-fraction=0.55;
> SET tez.shuffle-vertex-manager.max-src-fraction=0.95;
> CREATE TABLE mofu (name string);
> INSERT INTO mofu (name) VALUES ('12345');
> SELECT name, count(*) FROM mofu GROUP BY name;{code}
> The fractions are ignored.
> {code:java}
> 2020-12-04 11:41:42,484 [INFO] [Dispatcher thread {Central}] 
> |vertexmanager.ShuffleVertexManagerBase|: Settings minFrac: 0.25 maxFrac: 
> 0.75 auto: true desiredTaskIput: 25600
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24606) Multi-stage materialized CTEs can lose intermediate data

2021-01-15 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17266507#comment-17266507
 ] 

okumin commented on HIVE-24606:
---

CI has passed. This PR would require review by a SemanticAnalyzer expert.

https://github.com/apache/hive/pull/1873

> Multi-stage materialized CTEs can lose intermediate data
> 
>
> Key: HIVE-24606
> URL: https://issues.apache.org/jira/browse/HIVE-24606
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Affects Versions: 2.3.7, 3.1.2, 4.0.0
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> With complex multi-stage CTEs, Hive can start a later stage before its 
> previous stage finishes.
>  That's because `SemanticAnalyzer#toRealRootTasks` can fail to resolve 
> dependency between multistage materialized CTEs when a non-materialized CTE 
> cuts in.
>  
> [https://github.com/apache/hive/blob/425e1ff7c054f87c4db87e77d004282d529599ae/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L1414]
>  
> For example, when submitting this query,
> {code:sql}
> SET hive.optimize.cte.materialize.threshold=2;
> SET hive.optimize.cte.materialize.full.aggregate.only=false;
> WITH x AS ( SELECT 'x' AS id ), -- not materialized
> a1 AS ( SELECT 'a1' AS id ), -- materialized by a2 and the root
> a2 AS ( SELECT 'a2 <- ' || id AS id FROM a1) -- materialized by the root
> SELECT * FROM a1
> UNION ALL
> SELECT * FROM x
> UNION ALL
> SELECT * FROM a2
> UNION ALL
> SELECT * FROM a2;
> {code}
> `toRealRootTask` will traverse the CTEs in order of `a1`, `x`, and `a2`. It 
> means the dependency between `a1` and `a2` will be ignored and `a2` can start 
> without waiting for `a1`. As a result, the above query returns the following 
> result.
> {code:java}
> +-+
> | id  |
> +-+
> | a1  |
> | x   |
> +-+
> {code}
> For your information, I ran this test with revision = 
> 425e1ff7c054f87c4db87e77d004282d529599ae.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (HIVE-24606) Multi-stage materialized CTEs can lose intermediate data

2021-01-13 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-24606:
--
Summary: Multi-stage materialized CTEs can lose intermediate data  (was: 
Multi-stage materialized CTEs can lost intermediate data)

> Multi-stage materialized CTEs can lose intermediate data
> 
>
> Key: HIVE-24606
> URL: https://issues.apache.org/jira/browse/HIVE-24606
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Affects Versions: 2.3.7, 3.1.2, 4.0.0
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> With complex multi-stage CTEs, Hive can start a later stage before its 
> previous stage finishes.
>  That's because `SemanticAnalyzer#toRealRootTasks` can fail to resolve 
> dependency between multistage materialized CTEs when a non-materialized CTE 
> cuts in.
>  
> [https://github.com/apache/hive/blob/425e1ff7c054f87c4db87e77d004282d529599ae/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L1414]
>  
> For example, when submitting this query,
> {code:sql}
> SET hive.optimize.cte.materialize.threshold=2;
> SET hive.optimize.cte.materialize.full.aggregate.only=false;
> WITH x AS ( SELECT 'x' AS id ), -- not materialized
> a1 AS ( SELECT 'a1' AS id ), -- materialized by a2 and the root
> a2 AS ( SELECT 'a2 <- ' || id AS id FROM a1) -- materialized by the root
> SELECT * FROM a1
> UNION ALL
> SELECT * FROM x
> UNION ALL
> SELECT * FROM a2
> UNION ALL
> SELECT * FROM a2;
> {code}
> `toRealRootTask` will traverse the CTEs in order of `a1`, `x`, and `a2`. It 
> means the dependency between `a1` and `a2` will be ignored and `a2` can start 
> without waiting for `a1`. As a result, the above query returns the following 
> result.
> {code:java}
> +-+
> | id  |
> +-+
> | a1  |
> | x   |
> +-+
> {code}
> For your information, I ran this test with revision = 
> 425e1ff7c054f87c4db87e77d004282d529599ae.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-24606) Multi-stage materialized CTEs can lost intermediate data

2021-01-08 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin reassigned HIVE-24606:
-


> Multi-stage materialized CTEs can lost intermediate data
> 
>
> Key: HIVE-24606
> URL: https://issues.apache.org/jira/browse/HIVE-24606
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Affects Versions: 3.1.2, 2.3.7, 4.0.0
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> With complex multi-stage CTEs, Hive can start a later stage before its 
> previous stage finishes.
>  That's because `SemanticAnalyzer#toRealRootTasks` can fail to resolve 
> dependency between multistage materialized CTEs when a non-materialized CTE 
> cuts in.
>  
> [https://github.com/apache/hive/blob/425e1ff7c054f87c4db87e77d004282d529599ae/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L1414]
>  
> For example, when submitting this query,
> {code:sql}
> SET hive.optimize.cte.materialize.threshold=2;
> SET hive.optimize.cte.materialize.full.aggregate.only=false;
> WITH x AS ( SELECT 'x' AS id ), -- not materialized
> a1 AS ( SELECT 'a1' AS id ), -- materialized by a2 and the root
> a2 AS ( SELECT 'a2 <- ' || id AS id FROM a1) -- materialized by the root
> SELECT * FROM a1
> UNION ALL
> SELECT * FROM x
> UNION ALL
> SELECT * FROM a2
> UNION ALL
> SELECT * FROM a2;
> {code}
> `toRealRootTask` will traverse the CTEs in order of `a1`, `x`, and `a2`. It 
> means the dependency between `a1` and `a2` will be ignored and `a2` can start 
> without waiting for `a1`. As a result, the above query returns the following 
> result.
> {code:java}
> +-+
> | id  |
> +-+
> | a1  |
> | x   |
> +-+
> {code}
> For your information, I ran this test with revision = 
> 425e1ff7c054f87c4db87e77d004282d529599ae.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HIVE-24485) Make the slow-start behavior tunable

2020-12-07 Thread okumin (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-24485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17245072#comment-17245072
 ] 

okumin commented on HIVE-24485:
---

Tests for my patch have passed.

https://github.com/apache/hive/pull/1744

 

[~gopalv] Could you please give me your opinion on the above change?

I know this feature was introduced by HIVE-7158, but I'm not sure if we 
intended to use the default values whenever auto reducer parallelism is enabled.

 

Thanks.

> Make the slow-start behavior tunable
> 
>
> Key: HIVE-24485
> URL: https://issues.apache.org/jira/browse/HIVE-24485
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive, Tez
>Affects Versions: 3.1.2, 4.0.0
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This ticket would enable users to configure the timing of slow-start with 
> `tez.shuffle-vertex-manager.min-src-fraction` and 
> `tez.shuffle-vertex-manager.max-src-fraction`.
> Hive on Tez currently doesn't honor these parameters and ShuffleVertexManager 
> always uses the default value.
> We can control the timing to start vertices and the accuracy of the estimated 
> input size if we can tweak these parameters. This is useful when a vertex has 
> tasks that process different amounts of data.
>  
> We can reproduce the issue with this query.
> {code:java}
> SET hive.tez.auto.reducer.parallelism=true;
> SET hive.tez.min.partition.factor=1.0; -- enforce auto-parallelism
> SET tez.shuffle-vertex-manager.min-src-fraction=0.55;
> SET tez.shuffle-vertex-manager.max-src-fraction=0.95;
> CREATE TABLE mofu (name string);
> INSERT INTO mofu (name) VALUES ('12345');
> SELECT name, count(*) FROM mofu GROUP BY name;{code}
> The fractions are ignored.
> {code:java}
> 2020-12-04 11:41:42,484 [INFO] [Dispatcher thread {Central}] 
> |vertexmanager.ShuffleVertexManagerBase|: Settings minFrac: 0.25 maxFrac: 
> 0.75 auto: true desiredTaskIput: 25600
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-24485) Make the slow-start behavior tunable

2020-12-04 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin reassigned HIVE-24485:
-


> Make the slow-start behavior tunable
> 
>
> Key: HIVE-24485
> URL: https://issues.apache.org/jira/browse/HIVE-24485
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive, Tez
>Affects Versions: 3.1.2, 4.0.0
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> This ticket would enable users to configure the timing of slow-start with 
> `tez.shuffle-vertex-manager.min-src-fraction` and 
> `tez.shuffle-vertex-manager.max-src-fraction`.
> Hive on Tez currently doesn't honor these parameters and ShuffleVertexManager 
> always uses the default value.
> We can control the timing to start vertices and the accuracy of the estimated 
> input size if we can tweak these parameters. This is useful when a vertex has 
> tasks that process different amounts of data.
>  
> We can reproduce the issue with this query.
> {code:java}
> SET hive.tez.auto.reducer.parallelism=true;
> SET hive.tez.min.partition.factor=1.0; -- enforce auto-parallelism
> SET tez.shuffle-vertex-manager.min-src-fraction=0.55;
> SET tez.shuffle-vertex-manager.max-src-fraction=0.95;
> CREATE TABLE mofu (name string);
> INSERT INTO mofu (name) VALUES ('12345');
> SELECT name, count(*) FROM mofu GROUP BY name;{code}
> The fractions are ignored.
> {code:java}
> 2020-12-04 11:41:42,484 [INFO] [Dispatcher thread {Central}] 
> |vertexmanager.ShuffleVertexManagerBase|: Settings minFrac: 0.25 maxFrac: 
> 0.75 auto: true desiredTaskIput: 25600
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (HIVE-24286) Render date and time with progress of Hive on Tez

2020-10-19 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin reassigned HIVE-24286:
-


> Render date and time with progress of Hive on Tez
> -
>
> Key: HIVE-24286
> URL: https://issues.apache.org/jira/browse/HIVE-24286
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 4.0.0
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> Add date/time to each line written by RenderStrategy, as MapReduce and Spark do.
>  
>  * 
> [https://github.com/apache/hive/blob/31c1658d9884eb4f31b06eaa718dfef8b1d92d22/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/HadoopJobExecHelper.java#L350]
>  * 
> [https://github.com/apache/hive/blob/31c1658d9884eb4f31b06eaa718dfef8b1d92d22/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/RenderStrategy.java#L64-L67]
>  
> This ticket would add the current time to the head of each line.
>  
> {code:java}
> 2020-10-19 13:32:41,162   Map 1: 0/1  Reducer 2: 0/1  
> 2020-10-19 13:32:44,231   Map 1: 0/1  Reducer 2: 0/1  
> 2020-10-19 13:32:46,813   Map 1: 0(+1)/1  Reducer 2: 0/1  
> 2020-10-19 13:32:49,878   Map 1: 0(+1)/1  Reducer 2: 0/1  
> 2020-10-19 13:32:51,416   Map 1: 1/1  Reducer 2: 0/1  
> 2020-10-19 13:32:51,936   Map 1: 1/1  Reducer 2: 0(+1)/1  
> 2020-10-19 13:32:52,877   Map 1: 1/1  Reducer 2: 1/1  
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

