[jira] [Created] (HIVE-28102) Invoke validateDataFilesExist for RowDelta operations

2024-02-29 Thread Zoltán Borók-Nagy (Jira)
Zoltán Borók-Nagy created HIVE-28102:


 Summary: Invoke validateDataFilesExist for RowDelta operations
 Key: HIVE-28102
 URL: https://issues.apache.org/jira/browse/HIVE-28102
 Project: Hive
  Issue Type: Bug
Reporter: Zoltán Borók-Nagy


Hive must invoke validateDataFilesExist for RowDelta operations 
(DELETE/UPDATE/MERGE).

Without this validation, a concurrent RewriteFiles (compaction) and RowDelta 
can corrupt a table.
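
The race can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyTable`, `commit_rewrite`, `commit_row_delta`, and `run` are hypothetical names, not Iceberg or Hive APIs, and the flag `validate_data_files_exist` merely mimics the intent of the validateDataFilesExist check, namely rejecting a row delta whose deletes reference data files that a concurrent rewrite has already replaced.

```python
class CommitConflict(Exception):
    """Raised when a commit references data files that no longer exist."""


class ToyTable:
    """Toy stand-in for a table: a set of live data files plus
    position-delete entries keyed by the data file they point into."""

    def __init__(self, data_files):
        self.data_files = set(data_files)
        self.deletes = []  # list of (data_file, row_position)

    def commit_rewrite(self, old_files, new_file):
        # Compaction: replace several data files with one rewritten file.
        self.data_files -= set(old_files)
        self.data_files.add(new_file)

    def commit_row_delta(self, deletes, validate_data_files_exist):
        # Row-level change: add position deletes that reference data files.
        if validate_data_files_exist:
            missing = {f for f, _ in deletes} - self.data_files
            if missing:
                raise CommitConflict(f"referenced files are gone: {sorted(missing)}")
        self.deletes.extend(deletes)

    def dangling_deletes(self):
        # Deletes whose target file was rewritten away; in this toy model
        # they no longer apply to any live file.
        return [d for d in self.deletes if d[0] not in self.data_files]


def run(validate):
    """Plan a delete, let a compaction commit first, then commit the delete."""
    table = ToyTable(["a.parquet", "b.parquet"])
    planned = [("a.parquet", 3)]  # planned against the pre-compaction files
    table.commit_rewrite(["a.parquet", "b.parquet"], "ab.parquet")
    try:
        table.commit_row_delta(planned, validate_data_files_exist=validate)
        return "committed", table.dangling_deletes()
    except CommitConflict:
        return "rejected", table.dangling_deletes()
```

In this model, `run(False)` commits a delete that points at a file that no longer exists, while `run(True)` rejects the commit so the operation can be retried against the current table state.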



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HIVE-27490) HPL/SQL says it support default value for parameters but not considering them when no value is passed

2024-02-29 Thread Krisztian Kasa (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-27490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Kasa resolved HIVE-27490.
---
Fix Version/s: 4.1.0
   Resolution: Fixed

Merged to master. Thanks [~Dayakar] for the patch.

> HPL/SQL says it support default value for parameters but not considering them 
> when no value is passed
> -
>
> Key: HIVE-27490
> URL: https://issues.apache.org/jira/browse/HIVE-27490
> Project: Hive
>  Issue Type: Bug
>  Components: hpl/sql
>Reporter: Dayakar M
>Assignee: Dayakar M
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> HPL/SQL claims to support default values for parameters but does not apply 
> them when no value is passed.
> {noformat}
> CREATE OR replace PROCEDURE test123(a NUMBER DEFAULT -110)
> AS
> BEGIN
> dbms_output.put_line (a);
> end;{noformat}
> Oracle shows the default value:
> {noformat}
> SQL> call test123();
> -110{noformat}
> Hive shows the variable name instead of the default value:
> {noformat}
> call test123();
> INFO : a{noformat}
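
The expected behavior in the quoted example can be sketched as a small argument-binding routine. This is an illustration only, not the HPL/SQL interpreter's code; `bind_arguments` and `NO_DEFAULT` are hypothetical names introduced for the sketch.

```python
NO_DEFAULT = object()  # marks a parameter declared without a DEFAULT clause


def bind_arguments(params, args):
    """Bind positional call arguments to declared parameters.

    params: list of (name, default) pairs; default is NO_DEFAULT when the
            parameter was declared without a default value.
    args:   list of values passed by the caller.
    """
    if len(args) > len(params):
        raise TypeError("too many arguments")
    bound = {}
    for index, (name, default) in enumerate(params):
        if index < len(args):
            # An explicitly passed value always wins.
            bound[name] = args[index]
        elif default is not NO_DEFAULT:
            # No value passed: fall back to the declared default.
            bound[name] = default
        else:
            raise TypeError(f"missing value for parameter {name!r}")
    return bound
```

With the procedure from the report, `bind_arguments([("a", -110)], [])` binds `a` to the declared default rather than leaving the name unresolved.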





[jira] [Created] (HIVE-28101) [Athena] Add connector for Amazon Athena

2024-02-29 Thread Naveen Gangam (Jira)
Naveen Gangam created HIVE-28101:


 Summary: [Athena] Add connector for Amazon Athena
 Key: HIVE-28101
 URL: https://issues.apache.org/jira/browse/HIVE-28101
 Project: Hive
  Issue Type: Sub-task
  Components: Standalone Metastore
Reporter: Naveen Gangam


We recently added a HIVEJDBC connector for Hive-to-Hive access over JDBC. This 
also seems to work for Hive to EMR with a local catalog. It does not seem to 
work with EMR backed by an AWS Glue Catalog. 

Just filing this jira to assess the need for a connector implementation for 
Amazon Athena with Glue Catalog.

[~zhangbutao] What do you think? I do not have access to a test bed for Athena.





[jira] [Commented] (HIVE-28094) Improve HMS client cache and query cache performance for getTableInternal

2024-02-29 Thread Soumyakanti Das (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822328#comment-17822328
 ] 

Soumyakanti Das commented on HIVE-28094:


After further testing, I found that currently we cannot uniquely identify a 
tableID with the fields of GetTableRequest. So with the current PR, we will run 
into issues if we have a table X, which we then drop and recreate with the same 
name but with an additional column. In this case, we will still get the tableID 
for the older table. Thus, I think the current implementation of only caching 
tableIDs in the query cache is the best we can do - we cannot cache it in the 
HS2 level cache.

I am not planning to work on this in the near future - but I may revisit this 
at a later point.
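
The drop-and-recreate problem can be reproduced with a toy model. This is a sketch under stated assumptions: `ToyMetastore` and `NameKeyedIdCache` are hypothetical classes, and the cache key is reduced to the table name, which simplifies the real GetTableRequest fields.

```python
import itertools


class ToyMetastore:
    """Assigns a fresh table ID on every create, even for a reused name."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.tables = {}  # name -> (table_id, columns)

    def create(self, name, columns):
        self.tables[name] = (next(self._ids), tuple(columns))

    def drop(self, name):
        del self.tables[name]

    def get_table_id(self, name):
        return self.tables[name][0]


class NameKeyedIdCache:
    """Caches table IDs by name only and outlives a single query."""

    def __init__(self, metastore):
        self.metastore = metastore
        self.ids = {}

    def table_id(self, name):
        if name not in self.ids:
            self.ids[name] = self.metastore.get_table_id(name)
        return self.ids[name]


hms = ToyMetastore()
cache = NameKeyedIdCache(hms)

hms.create("x", ["a"])
first = cache.table_id("x")    # populates the long-lived cache

hms.drop("x")
hms.create("x", ["a", "b"])    # same name, additional column, new ID

stale = cache.table_id("x")    # served from the cache
fresh = hms.get_table_id("x")  # what the metastore reports now
```

Because nothing in the key distinguishes the old table from the recreated one, `stale` still equals `first` while `fresh` differs. A cache scoped to a single query does not live long enough to hit this case.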

> Improve HMS client cache and query cache performance for getTableInternal
> -
>
> Key: HIVE-28094
> URL: https://issues.apache.org/jira/browse/HIVE-28094
> Project: Hive
>  Issue Type: Improvement
>  Components: Hive
>Affects Versions: 4.0.0-beta-1
>Reporter: Soumyakanti Das
>Assignee: Soumyakanti Das
>Priority: Major
>  Labels: pull-request-available
>
> Currently we cache calls to the {{getTableInternal}} method in the HMS client 
> cache and the query cache. We also cache table ids in the query cache, but not 
> in the HMS client cache.
>  
> To cache {{getTableInternal}}, we create a CacheKey containing the 
> {{GetTableRequest}} object. However, we do not check if all the necessary 
> fields are set in the key. This results in a lot of cache misses, especially 
> because we rely on {{validWriteIdList}} not being null and {{tableId}} not 
> being -1. The {{GetTableRequest}} object also contains {{catName}}, which is not 
> always set. All these things result in creating duplicate keys and not using 
> the caches efficiently.
>  
> Moreover, {{getTableInternal}} is called from other APIs that are getting 
> cached, e.g. {{getPartitionsByExprInternal}}, so improvements in its 
> performance will positively affect other APIs too.
>  
> *RESULTS:*
> I ran all TPCDS explain cbo queries on my local machine, after cherry-picking 
> [HIVE-28083: Enable HMS client cache and HMS query cache for Explain 
> plans|https://github.com/apache/hive/pull/5092/commits/41a766d6a51480edb505fd53661a03c63ef3937a].
>  Then I analyzed the logs with a simple python script to get min, 25th 
> percentile, median, 75th percentile, and max for PERFLOG logs with this 
> pattern:
> {code:java}
> '
> {code}
> Here are the results.
> *WITHOUT the improvements to {{getTableInternal}} method:*
> |*API*|*MIN*|*25th*|*MEDIAN*|*75th*|*MAX*|
> |*getTable*|2|3|3|4|233|
> |*getTableConstraints*|2|4|4|5|22|
> |*getPartitionsByExpr*|19|22|25|27|2396|
> |*getAggrColStatsFor*|0|125.5|186|284|910|
> |*getTableColumnStatistics*|4|6|7|8|454|
> Cache Stats:
> {code:java}
> CacheStats{hitCount=77464, missCount=11919, loadSuccessCount=0, 
> loadFailureCount=0, totalLoadTime=0, evictionCount=0, evictionWeight=0} {code}
> *WITH the improvements to {{getTableInternal}} method:*
> |*API*|*MIN*|*25th*|*MEDIAN*|*75th*|*MAX*|
> |*getTable*|0|0|0|0|33|
> |*getTableConstraints*|3|4|4|5|20|
> |*getPartitionsByExpr*|14|16|19|21|2247|
> |*getAggrColStatsFor*|0|124.5|187|272.5|936|
> |*getTableColumnStatistics*|0|0|0|1|16|
> Cache Stats:
> {code:java}
> CacheStats{hitCount=81044, missCount=11943, loadSuccessCount=0, 
> loadFailureCount=0, totalLoadTime=0, evictionCount=0, evictionWeight=0} {code}
> We can see that the latency of the APIs and the cache {{hitCount}} both 
> improve with this patch.
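
The five-number summaries in the quoted tables could be computed along these lines. This is a minimal sketch, not the script used for the report: `summarize` is a hypothetical helper, it assumes the durations have already been extracted from the PERFLOG lines (the log pattern is not shown in the message, so parsing is left out), and the choice of linear interpolation for the quartiles is an assumption.

```python
import statistics


def summarize(durations):
    """Return min, quartiles, and max for a list of durations.

    Quartiles use linear interpolation between neighbouring samples
    (the "inclusive" method of statistics.quantiles).
    """
    if len(durations) < 2:
        raise ValueError("need at least two samples")
    p25, median, p75 = statistics.quantiles(durations, n=4, method="inclusive")
    return {
        "min": min(durations),
        "p25": p25,
        "median": median,
        "p75": p75,
        "max": max(durations),
    }
```

Calling `summarize` once per API name on the extracted durations yields one table row per API.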





[jira] [Updated] (HIVE-28100) Fix Some Typos in CachedStore.

2024-02-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-28100:
--
Labels: pull-request-available  (was: )

> Fix Some Typos in CachedStore. 
> ---
>
> Key: HIVE-28100
> URL: https://issues.apache.org/jira/browse/HIVE-28100
> Project: Hive
>  Issue Type: Improvement
>  Components: Metastore
>Affects Versions: 4.1.0
>Reporter: Shilun Fan
>Assignee: Shilun Fan
>Priority: Major
>  Labels: pull-request-available
>
> Fix Some Typos in CachedStore. 





[jira] [Created] (HIVE-28100) Fix Some Typos in CachedStore.

2024-02-29 Thread Shilun Fan (Jira)
Shilun Fan created HIVE-28100:
-

 Summary: Fix Some Typos in CachedStore. 
 Key: HIVE-28100
 URL: https://issues.apache.org/jira/browse/HIVE-28100
 Project: Hive
  Issue Type: Improvement
  Components: Metastore
Affects Versions: 4.1.0
Reporter: Shilun Fan
Assignee: Shilun Fan


Fix Some Typos in CachedStore. 





[jira] [Updated] (HIVE-28099) Fix logging in HMS benchmarks

2024-02-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-28099:
--
Labels: pull-request-available  (was: )

> Fix logging in HMS benchmarks
> -
>
> Key: HIVE-28099
> URL: https://issues.apache.org/jira/browse/HIVE-28099
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zsolt Miskolczi
>Assignee: Zsolt Miskolczi
>Priority: Major
>  Labels: pull-request-available
>
> The logging is completely broken in HMS benchmarks. When we create a Log 
> entry, it only outputs a format pattern string instead of the log message. 
> Example output:
> {noformat}
> %d [%thread] %-5level %logger - %msg{noformat}





[jira] [Assigned] (HIVE-28099) Fix logging in HMS benchmarks

2024-02-29 Thread Zsolt Miskolczi (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zsolt Miskolczi reassigned HIVE-28099:
--

Assignee: Zsolt Miskolczi

> Fix logging in HMS benchmarks
> -
>
> Key: HIVE-28099
> URL: https://issues.apache.org/jira/browse/HIVE-28099
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zsolt Miskolczi
>Assignee: Zsolt Miskolczi
>Priority: Major
>
> The logging is completely broken in HMS benchmarks. When we create a Log 
> entry, it only outputs a format pattern string instead of the log message. 
> Example output:
> {noformat}
> %d [%thread] %-5level %logger - %msg{noformat}





[jira] [Created] (HIVE-28099) Fix logging in HMS benchmarks

2024-02-29 Thread Zsolt Miskolczi (Jira)
Zsolt Miskolczi created HIVE-28099:
--

 Summary: Fix logging in HMS benchmarks
 Key: HIVE-28099
 URL: https://issues.apache.org/jira/browse/HIVE-28099
 Project: Hive
  Issue Type: Improvement
Reporter: Zsolt Miskolczi


The logging is completely broken in HMS benchmarks. When we create a Log entry, 
it only outputs a format pattern string instead of the log message. 

Example output:


{noformat}
%d [%thread] %-5level %logger - %msg{noformat}





[jira] [Resolved] (HIVE-28063) Drop PerfLogger#setPerfLogger method and unused fields/methods

2024-02-29 Thread Stamatis Zampetakis (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stamatis Zampetakis resolved HIVE-28063.

Fix Version/s: 4.1.0
   Resolution: Fixed

Fixed in 
[https://github.com/apache/hive/commit/49e65bdd7fd47adffbc59091eaea1618d90c253a].
Thanks for the reviews, [~abstractdog] and [~aturoczy]!

> Drop PerfLogger#setPerfLogger method and unused fields/methods
> --
>
> Key: HIVE-28063
> URL: https://issues.apache.org/jira/browse/HIVE-28063
> Project: Hive
>  Issue Type: Task
>  Components: Hive, Standalone Metastore
>Reporter: Stamatis Zampetakis
>Assignee: Stamatis Zampetakis
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> The PerfLogger#setPerfLogger is redundant and error-prone. 
> The small number of current uses could be replaced by simply calling the 
> respective getter (which implicitly changes the underlying ThreadLocal 
> variable).
> Ideally, a thread-local variable should never be set after obtaining the 
> initial value. Moreover, allowing any caller to change the thread-local 
> variable can affect the correctness of the program.
> Dropping this method improves the encapsulation and readability of the class.
> The org.apache.hadoop.hive.metastore.metrics.PerfLogger has various unused 
> fields/methods that can be removed as well to improve encapsulation, 
> readability, and maintenance.
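
The getter-only pattern described in the quoted issue can be sketched as follows. This is an illustration, not Hive's PerfLogger: `PerfLog` and `get_perf_log` are hypothetical names, and the `reset` flag is an assumption about how a getter might replace the per-thread instance without exposing a setter.

```python
import threading


class PerfLog:
    """Placeholder for per-thread timing state."""

    def __init__(self):
        self.timings = {}


_local = threading.local()


def get_perf_log(reset=False):
    """Return this thread's PerfLog, creating it on first use.

    With reset=True the existing instance is discarded and replaced.
    There is deliberately no setter: callers cannot install an
    arbitrary instance for the current thread.
    """
    if reset or not hasattr(_local, "perf_log"):
        _local.perf_log = PerfLog()
    return _local.perf_log
```

Every code path that needs the logger goes through the getter, so the thread-local value is only ever created or replaced in one place.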





[jira] [Resolved] (HIVE-28093) Re-execute DAG in case of NoCurrentDAGException

2024-02-29 Thread László Bodor (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor resolved HIVE-28093.
-
Resolution: Fixed

> Re-execute DAG in case of NoCurrentDAGException
> ---
>
> Key: HIVE-28093
> URL: https://issues.apache.org/jira/browse/HIVE-28093
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: 
> compute-1708603165-qlg5-query-coordinator-0-0-1708684369388541000.log
>
>
> This is to adapt the ReExecuteLostAMQueryPlugin to the exception introduced 
> in TEZ-4543 to prevent scenarios when the DAGClient keeps asking for the 
> status of a DAG that is already gone:  
> [^compute-1708603165-qlg5-query-coordinator-0-0-1708684369388541000.log] 
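
The idea of re-running a query when its DAG has disappeared can be sketched as a bounded retry. This is not the ReExecuteLostAMQueryPlugin code: `NoCurrentDAGError`, `run_with_reexecution`, and `max_reexecutions` are hypothetical names, and the real plugin hooks into Hive's re-execution framework rather than wrapping a callable.

```python
class NoCurrentDAGError(Exception):
    """Stand-in for the 'DAG is already gone' condition."""


def run_with_reexecution(submit_query, max_reexecutions=1):
    """Run submit_query, re-submitting it when the DAG has vanished.

    Only NoCurrentDAGError triggers a re-execution; any other exception
    propagates immediately. After max_reexecutions retries the error is
    re-raised instead of polling for a DAG that no longer exists.
    """
    attempts = 0
    while True:
        try:
            return submit_query()
        except NoCurrentDAGError:
            if attempts >= max_reexecutions:
                raise
            attempts += 1
```

Bounding the number of re-executions keeps a persistently failing query from looping forever.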





[jira] [Commented] (HIVE-28093) Re-execute DAG in case of NoCurrentDAGException

2024-02-29 Thread László Bodor (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-28093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822040#comment-17822040
 ] 

László Bodor commented on HIVE-28093:
-

Merged to master. Thanks [~ayushtkn] for the review!

> Re-execute DAG in case of NoCurrentDAGException
> ---
>
> Key: HIVE-28093
> URL: https://issues.apache.org/jira/browse/HIVE-28093
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Attachments: 
> compute-1708603165-qlg5-query-coordinator-0-0-1708684369388541000.log
>
>
> This is to adapt the ReExecuteLostAMQueryPlugin to the exception introduced 
> in TEZ-4543 to prevent scenarios when the DAGClient keeps asking for the 
> status of a DAG that is already gone:  
> [^compute-1708603165-qlg5-query-coordinator-0-0-1708684369388541000.log] 





[jira] [Updated] (HIVE-28093) Re-execute DAG in case of NoCurrentDAGException

2024-02-29 Thread László Bodor (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

László Bodor updated HIVE-28093:

Fix Version/s: 4.0.0

> Re-execute DAG in case of NoCurrentDAGException
> ---
>
> Key: HIVE-28093
> URL: https://issues.apache.org/jira/browse/HIVE-28093
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Bodor
>Assignee: László Bodor
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: 
> compute-1708603165-qlg5-query-coordinator-0-0-1708684369388541000.log
>
>
> This is to adapt the ReExecuteLostAMQueryPlugin to the exception introduced 
> in TEZ-4543 to prevent scenarios when the DAGClient keeps asking for the 
> status of a DAG that is already gone:  
> [^compute-1708603165-qlg5-query-coordinator-0-0-1708684369388541000.log] 





[jira] [Updated] (HIVE-28098) Fails to copy empty column statistics of materialized CTE

2024-02-29 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-28098:
--
Summary: Fails to copy empty column statistics of materialized CTE  (was: 
Fails to generate statistics of materialized CTE)

> Fails to copy empty column statistics of materialized CTE
> -
>
> Key: HIVE-28098
> URL: https://issues.apache.org/jira/browse/HIVE-28098
> Project: Hive
>  Issue Type: Bug
>  Components: Query Planning
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> HIVE-28080 introduced the optimization of materialized CTEs, but it turned 
> out that it failed when statistics were empty.
> This query reproduces the issue.
> {code:java}
> set hive.stats.autogather=false;
> CREATE TABLE src_no_stats AS SELECT '123' as key, 'val123' as value UNION ALL 
> SELECT '9' as key, 'val9' as value;
> set hive.optimize.cte.materialize.threshold=2;
> set hive.optimize.cte.materialize.full.aggregate.only=false;
> EXPLAIN WITH materialized_cte1 AS (
>   SELECT * FROM src_no_stats
> ),
> materialized_cte2 AS (
>   SELECT a.key
>   FROM materialized_cte1 a
>   JOIN materialized_cte1 b ON (a.key = b.key)
> )
> SELECT a.key
> FROM materialized_cte2 a
> JOIN materialized_cte2 b ON (a.key = b.key); {code}
> It throws an error.
> {code:java}
> Error: Error while compiling statement: FAILED: IllegalStateException The 
> size of col stats must be equal to that of schema. Stats = [], Schema = [key] 
> (state=42000,code=4) {code}
> Attaching a debugger shows that the FSO of materialized_cte2 has empty stats 
> because JoinOperator loses stats.
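
The failing precondition, and one possible way to tolerate empty statistics, can be sketched as below. This is an assumption-laden illustration and not the fix applied in Hive: `copy_column_stats` is a hypothetical helper, and treating an empty list as "no statistics available" is only one conceivable guard.

```python
def copy_column_stats(schema, source_stats):
    """Return column stats to attach to a materialized CTE's table.

    An empty source list is treated as "no statistics available" and
    yields None; a non-empty list must line up with the schema one-to-one.
    """
    if not source_stats:
        # Nothing to copy; skip instead of failing the size check.
        return None
    if len(source_stats) != len(schema):
        raise ValueError(
            "The size of col stats must be equal to that of schema. "
            f"Stats = {source_stats}, Schema = {schema}"
        )
    return dict(zip(schema, source_stats))
```

With the reproduction above, the stats list is empty while the schema is `[key]`, so a strict size check alone fails before the guard has a chance to treat the stats as absent.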





[jira] [Created] (HIVE-28098) Fails to generate statistics of materialized CTE

2024-02-29 Thread okumin (Jira)
okumin created HIVE-28098:
-

 Summary: Fails to generate statistics of materialized CTE
 Key: HIVE-28098
 URL: https://issues.apache.org/jira/browse/HIVE-28098
 Project: Hive
  Issue Type: Bug
  Components: Query Planning
Reporter: okumin
Assignee: okumin


HIVE-28080 introduced the optimization of materialized CTEs, but it turned out 
that it failed when statistics were empty.

This query reproduces the issue.
{code:java}
set hive.stats.autogather=false;
CREATE TABLE src_no_stats AS SELECT '123' as key, 'val123' as value UNION ALL 
SELECT '9' as key, 'val9' as value;
set hive.optimize.cte.materialize.threshold=2;
set hive.optimize.cte.materialize.full.aggregate.only=false;

EXPLAIN WITH materialized_cte1 AS (
  SELECT * FROM src_no_stats
),
materialized_cte2 AS (
  SELECT a.key
  FROM materialized_cte1 a
  JOIN materialized_cte1 b ON (a.key = b.key)
)
SELECT a.key
FROM materialized_cte2 a
JOIN materialized_cte2 b ON (a.key = b.key); {code}
It throws an error.
{code:java}
Error: Error while compiling statement: FAILED: IllegalStateException The size 
of col stats must be equal to that of schema. Stats = [], Schema = [key] 
(state=42000,code=4) {code}
Attaching a debugger shows that the FSO of materialized_cte2 has empty stats 
because JoinOperator loses stats.


