[jira] [Comment Edited] (HIVE-25876) Update log4j2 version to 2.17.1

2022-01-18 Thread zengxl (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478361#comment-17478361
 ] 

zengxl edited comment on HIVE-25876 at 1/19/22, 6:58 AM:
-

here is my patch 
{code:java}
diff --git a/pom.xml b/pom.xml
index c06f7f81..808a76a1 100644
--- a/pom.xml
+++ b/pom.xml
@@ -181,7 +181,7 @@
     3.0.3
     0.9.3
     0.9.3
-    2.10.0
+    2.17.1
     2.3
     1.5.6
     1.10.19
     
diff --git a/ql/src/java/org/apache/hadoop/hive/ql/log/SlidingFilenameRolloverStrategy.java b/ql/src/java/org/apache/hadoop/hive/ql/log/SlidingFilenameRolloverStrategy.java
index 664734c7..958dbbcf 100644
--- a/ql/src/java/org/apache/hadoop/hive/ql/log/SlidingFilenameRolloverStrategy.java
+++ b/ql/src/java/org/apache/hadoop/hive/ql/log/SlidingFilenameRolloverStrategy.java
@@ -72,6 +72,12 @@ public String getCurrentFileName(RollingFileManager rollingFileManager) {
     String pattern = rollingFileManager.getPatternProcessor().getPattern();
     return getLogFileName(pattern);
   }
+
+  @Override
+  public void clearCurrentFileName() {
+
+  }
+   /**
    * @return Mangled file name formed by appending the current timestamp
{code}
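The pom.xml hunk in the patch above lost its XML tags in the archive and shows only the bare property values. Assuming the property involved is Hive's `log4j2.version` (the tag name is an inference from the Hive pom, not part of the original mail), the version bump amounts to:

```xml
<!-- pom.xml properties section: bump the Log4j 2 version property.
     Tag name inferred; the archived diff shows only 2.10.0 -> 2.17.1. -->
<log4j2.version>2.17.1</log4j2.version>
```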


was (Author: zengxl):
here is my patch 

> Update log4j2 version to 2.17.1
> ---
>
> Key: HIVE-25876
> URL: https://issues.apache.org/jira/browse/HIVE-25876
> Project: Hive
>  Issue Type: Bug
>  Components: Logging
>Affects Versions: 3.1.2
>Reporter: Anatoly
>Priority: Blocker
>
> Hive version 3.1.2 -> log4j2 -> Should upgrade the version to 2.17.1



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HIVE-25876) Update log4j2 version to 2.17.1

2022-01-18 Thread zengxl (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478361#comment-17478361
 ] 

zengxl commented on HIVE-25876:
---

here is my patch 

> Update log4j2 version to 2.17.1
> ---
>
> Key: HIVE-25876
> URL: https://issues.apache.org/jira/browse/HIVE-25876
> Project: Hive
>  Issue Type: Bug
>  Components: Logging
>Affects Versions: 3.1.2
>Reporter: Anatoly
>Priority: Blocker
>
> Hive version 3.1.2 -> log4j2 -> Should upgrade the version to 2.17.1





[jira] [Updated] (HIVE-25876) Update log4j2 version to 2.17.1

2022-01-18 Thread Anatoly (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anatoly updated HIVE-25876:
---
Description: Hive version 3.1.2 -> log4j2 -> Should upgrade the version to 2.17.1  (was: Hive version 3.1.2 -> log3j -> Should upgrade the version to 2.17.1)

> Update log4j2 version to 2.17.1
> ---
>
> Key: HIVE-25876
> URL: https://issues.apache.org/jira/browse/HIVE-25876
> Project: Hive
>  Issue Type: Bug
>  Components: Logging
>Affects Versions: 3.1.2
>Reporter: Anatoly
>Priority: Blocker
>
> Hive version 3.1.2 -> log4j2 -> Should upgrade the version to 2.17.1





[jira] [Work logged] (HIVE-25864) Hive query optimisation creates wrong plan for predicate pushdown with windowing function

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25864?focusedWorklogId=711048&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-711048
 ]

ASF GitHub Bot logged work on HIVE-25864:
-

Author: ASF GitHub Bot
Created on: 19/Jan/22 02:01
Start Date: 19/Jan/22 02:01
Worklog Time Spent: 10m 
  Work Description: maheshk114 commented on a change in pull request #2943:
URL: https://github.com/apache/hive/pull/2943#discussion_r787291853



##
File path: ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterProjectTransposeRule.java
##
@@ -170,6 +171,7 @@ public void onMatch(RelOptRuleCall call) {
 if (HiveCalciteUtil.isDeterministicFuncWithSingleInputRef(newCondition,
     commonPartitionKeys)) {
   newPartKeyFilConds.add(newCondition);
+  isConversionDone = true;

Review comment:
   I was also thinking that way to keep the code clean, but there would be 
two conversions: one to check whether it is deterministic and one more for the 
projection.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 711048)
Time Spent: 0.5h  (was: 20m)

> Hive query optimisation creates wrong plan for predicate pushdown with 
> windowing function 
> --
>
> Key: HIVE-25864
> URL: https://issues.apache.org/jira/browse/HIVE-25864
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In case of a query with windowing function, the deterministic predicates are 
> pushed down below the window function. Before pushing down, the predicate is 
> converted to refer the project operator values. But the same conversion is 
> done again while creating the project and thus causing wrong plan generation.
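For context, a hypothetical query of the affected shape (the table and column names below are illustrative, not taken from the issue): a deterministic predicate that refers only to the PARTITION BY column can be pushed below the window function, while a predicate on the window output cannot.

```sql
-- Hypothetical schema: emp(empno INT, deptno INT, salary INT).
-- deptno = 10 is deterministic and refers only to the partition key,
-- so it is a candidate for pushdown below ROW_NUMBER();
-- rn = 1 must stay above the window function.
SELECT empno, deptno, rn
FROM (
  SELECT empno, deptno,
         ROW_NUMBER() OVER (PARTITION BY deptno ORDER BY salary DESC) AS rn
  FROM emp
) ranked
WHERE deptno = 10 AND rn = 1;
```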





[jira] [Work logged] (HIVE-25753) Improving performance of getLatestCommittedCompactionInfo

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25753?focusedWorklogId=711041&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-711041
 ]

ASF GitHub Bot logged work on HIVE-25753:
-

Author: ASF GitHub Bot
Created on: 19/Jan/22 01:42
Start Date: 19/Jan/22 01:42
Worklog Time Spent: 10m 
  Work Description: hsnusonic closed pull request #2829:
URL: https://github.com/apache/hive/pull/2829


   




Issue Time Tracking
---

Worklog Id: (was: 711041)
Time Spent: 0.5h  (was: 20m)

> Improving performance of getLatestCommittedCompactionInfo
> -
>
> Key: HIVE-25753
> URL: https://issues.apache.org/jira/browse/HIVE-25753
> Project: Hive
>  Issue Type: Improvement
>  Components: Standalone Metastore
>Reporter: Yu-Wen Lai
>Assignee: Yu-Wen Lai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The API getLatestCommittedCompactionInfo is used by external cache server to 
> check latest compaction id for tables/partitions.
> Previously, we had to set partition names for a partitioned table. That 
> restriction causes a performance issue when a table has lots of partitions. 
> We could remove this restriction so that, when all the partitions of a huge 
> partitioned table are needed, we don't have to set all the partition names. 
> That reduces the size of the request and the overhead of sending it 
> from the client to HMS and from HMS to the db.
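The invalidation pattern the description implies can be sketched as follows. This is a minimal illustration of an external cache keyed by table that reloads an entry when the metastore reports a newer latest-committed compaction id; it is not Hive's actual API or code, and all names in it are assumptions.

```python
class CompactionAwareCache:
    """Sketch of the pattern described above: cached table data stays valid
    only as long as the latest committed compaction id is unchanged."""

    def __init__(self):
        self._entries = {}  # table -> (compaction_id, cached_value)

    def get(self, table, latest_compaction_id, loader):
        entry = self._entries.get(table)
        if entry is not None and entry[0] == latest_compaction_id:
            return entry[1]                      # cache still valid
        value = loader(table)                    # refetch after a compaction
        self._entries[table] = (latest_compaction_id, value)
        return value
```

With this shape, one cheap "latest compaction id" lookup replaces per-partition checks, which is the request-size saving the description is after.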





[jira] [Resolved] (HIVE-25753) Improving performance of getLatestCommittedCompactionInfo

2022-01-18 Thread Naveen Gangam (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naveen Gangam resolved HIVE-25753.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Committed to master. Thank you for the patch [~hsnusonic]. Closing the jira.

> Improving performance of getLatestCommittedCompactionInfo
> -
>
> Key: HIVE-25753
> URL: https://issues.apache.org/jira/browse/HIVE-25753
> Project: Hive
>  Issue Type: Improvement
>  Components: Standalone Metastore
>Reporter: Yu-Wen Lai
>Assignee: Yu-Wen Lai
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The API getLatestCommittedCompactionInfo is used by external cache server to 
> check latest compaction id for tables/partitions.
> Previously, we had to set partition names for a partitioned table. That 
> restriction causes a performance issue when a table has lots of partitions. 
> We could remove this restriction so that, when all the partitions of a huge 
> partitioned table are needed, we don't have to set all the partition names. 
> That reduces the size of the request and the overhead of sending it 
> from the client to HMS and from HMS to the db.





[jira] [Work logged] (HIVE-25753) Improving performance of getLatestCommittedCompactionInfo

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25753?focusedWorklogId=711040&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-711040
 ]

ASF GitHub Bot logged work on HIVE-25753:
-

Author: ASF GitHub Bot
Created on: 19/Jan/22 01:38
Start Date: 19/Jan/22 01:38
Worklog Time Spent: 10m 
  Work Description: nrg4878 commented on pull request #2829:
URL: https://github.com/apache/hive/pull/2829#issuecomment-1015996057


   Fix has been committed to master. Please close the PR. Thank you @hsnusonic 




Issue Time Tracking
---

Worklog Id: (was: 711040)
Time Spent: 20m  (was: 10m)

> Improving performance of getLatestCommittedCompactionInfo
> -
>
> Key: HIVE-25753
> URL: https://issues.apache.org/jira/browse/HIVE-25753
> Project: Hive
>  Issue Type: Improvement
>  Components: Standalone Metastore
>Reporter: Yu-Wen Lai
>Assignee: Yu-Wen Lai
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The API getLatestCommittedCompactionInfo is used by external cache server to 
> check latest compaction id for tables/partitions.
> Previously, we had to set partition names for a partitioned table. That 
> restriction causes a performance issue when a table has lots of partitions. 
> We could remove this restriction so that, when all the partitions of a huge 
> partitioned table are needed, we don't have to set all the partition names. 
> That reduces the size of the request and the overhead of sending it 
> from the client to HMS and from HMS to the db.





[jira] [Work logged] (HIVE-25723) Found some typos

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25723?focusedWorklogId=711015&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-711015
 ]

ASF GitHub Bot logged work on HIVE-25723:
-

Author: ASF GitHub Bot
Created on: 19/Jan/22 00:39
Start Date: 19/Jan/22 00:39
Worklog Time Spent: 10m 
  Work Description: jsoref commented on pull request #2800:
URL: https://github.com/apache/hive/pull/2800#issuecomment-1015963556


   Sigh




Issue Time Tracking
---

Worklog Id: (was: 711015)
Remaining Estimate: 10m  (was: 20m)
Time Spent: 50m  (was: 40m)

> Found some typos
> 
>
> Key: HIVE-25723
> URL: https://issues.apache.org/jira/browse/HIVE-25723
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: All Versions
>Reporter: Feng
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: All Versions
>
> Attachments: DateUtils typo.png, RELEASE_NOTES typo.png
>
>   Original Estimate: 1h
>  Time Spent: 50m
>  Remaining Estimate: 10m
>
> I found some typos in DateUtils.java and RELEASE_NOTES.txt





[jira] [Work logged] (HIVE-25723) Found some typos

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25723?focusedWorklogId=711004&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-711004
 ]

ASF GitHub Bot logged work on HIVE-25723:
-

Author: ASF GitHub Bot
Created on: 19/Jan/22 00:12
Start Date: 19/Jan/22 00:12
Worklog Time Spent: 10m 
  Work Description: github-actions[bot] commented on pull request #2800:
URL: https://github.com/apache/hive/pull/2800#issuecomment-1015948890


   This pull request has been automatically marked as stale because it has not 
had recent activity. It will be closed if no further activity occurs.
   Feel free to reach out on the d...@hive.apache.org list if the patch is in 
need of reviews.




Issue Time Tracking
---

Worklog Id: (was: 711004)
Remaining Estimate: 20m  (was: 0.5h)
Time Spent: 40m  (was: 0.5h)

> Found some typos
> 
>
> Key: HIVE-25723
> URL: https://issues.apache.org/jira/browse/HIVE-25723
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: All Versions
>Reporter: Feng
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: All Versions
>
> Attachments: DateUtils typo.png, RELEASE_NOTES typo.png
>
>   Original Estimate: 1h
>  Time Spent: 40m
>  Remaining Estimate: 20m
>
> I found some typos in DateUtils.java and RELEASE_NOTES.txt





[jira] [Work logged] (HIVE-25873) Fix nested partition statements in Explain DDL

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25873?focusedWorklogId=710978&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710978
 ]

ASF GitHub Bot logged work on HIVE-25873:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 23:34
Start Date: 18/Jan/22 23:34
Worklog Time Spent: 10m 
  Work Description: rbalamohan commented on pull request #2949:
URL: https://github.com/apache/hive/pull/2949#issuecomment-1015930062


   LGTM. +1




Issue Time Tracking
---

Worklog Id: (was: 710978)
Remaining Estimate: 0h
Time Spent: 10m

> Fix nested partition statements in Explain DDL
> --
>
> Key: HIVE-25873
> URL: https://issues.apache.org/jira/browse/HIVE-25873
> Project: Hive
>  Issue Type: Bug
>Reporter: Harshit Gupta
>Assignee: Harshit Gupta
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Explain ddl doesn't generate proper statements for nested partitions.





[jira] [Updated] (HIVE-25873) Fix nested partition statements in Explain DDL

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-25873:
--
Labels: pull-request-available  (was: )

> Fix nested partition statements in Explain DDL
> --
>
> Key: HIVE-25873
> URL: https://issues.apache.org/jira/browse/HIVE-25873
> Project: Hive
>  Issue Type: Bug
>Reporter: Harshit Gupta
>Assignee: Harshit Gupta
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Explain ddl doesn't generate proper statements for nested partitions.





[jira] [Work logged] (HIVE-25824) Upgrade branch-2.3 to log4j 2.17.0

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25824?focusedWorklogId=710931&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710931
 ]

ASF GitHub Bot logged work on HIVE-25824:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 22:08
Start Date: 18/Jan/22 22:08
Worklog Time Spent: 10m 
  Work Description: sunchao commented on pull request #2908:
URL: https://github.com/apache/hive/pull/2908#issuecomment-1015878177


   @Gingernaut I believe Naveen is planning to make a 3.x release soon. You can 
subscribe to https://issues.apache.org/jira/browse/HIVE-25855 for the latest 
update.




Issue Time Tracking
---

Worklog Id: (was: 710931)
Time Spent: 2h 20m  (was: 2h 10m)

> Upgrade branch-2.3 to log4j 2.17.0
> --
>
> Key: HIVE-25824
> URL: https://issues.apache.org/jira/browse/HIVE-25824
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 2.3.8
>Reporter: Luca Toscano
>Assignee: Luca Toscano
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.3.10
>
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> Hi everybody,
> I am wondering if there are any plans to upgrade branch-2.3 to log4j 2.17.0 
> as it was done in HIVE-25795 (and related).
> In Apache Bigtop we created https://github.com/apache/bigtop/pull/844, since 
> the one before the last release (Bigtop 1.5.0) shipped Hive 2.3.6.
> I can try to file a pull request for branch-2.3 adapting what was done for 
> Bigtop (if the branch is still maintained), but I am currently experiencing 
> some mvn package failures (that seem unrelated to log4j), so I'd need some 
> help from you in that case :)





[jira] [Work logged] (HIVE-25824) Upgrade branch-2.3 to log4j 2.17.0

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25824?focusedWorklogId=710928&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710928
 ]

ASF GitHub Bot logged work on HIVE-25824:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 22:05
Start Date: 18/Jan/22 22:05
Worklog Time Spent: 10m 
  Work Description: Gingernaut commented on pull request #2908:
URL: https://github.com/apache/hive/pull/2908#issuecomment-1015876591


   @sunchao any updates on when a 3.x branch update might be released? It seems 
the last release was in July 2020, and given the log4j vulnerability, publishing 
a new version would be a high-priority fix. 




Issue Time Tracking
---

Worklog Id: (was: 710928)
Time Spent: 2h 10m  (was: 2h)

> Upgrade branch-2.3 to log4j 2.17.0
> --
>
> Key: HIVE-25824
> URL: https://issues.apache.org/jira/browse/HIVE-25824
> Project: Hive
>  Issue Type: Improvement
>Affects Versions: 2.3.8
>Reporter: Luca Toscano
>Assignee: Luca Toscano
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.3.10
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Hi everybody,
> I am wondering if there are any plans to upgrade branch-2.3 to log4j 2.17.0 
> as it was done in HIVE-25795 (and related).
> In Apache Bigtop we created https://github.com/apache/bigtop/pull/844, since 
> the one before the last release (Bigtop 1.5.0) shipped Hive 2.3.6.
> I can try to file a pull request for branch-2.3 adapting what was done for 
> Bigtop (if the branch is still maintained), but I am currently experiencing 
> some mvn package failures (that seem unrelated to log4j), so I'd need some 
> help from you in that case :)





[jira] [Work logged] (HIVE-25842) Reimplement delta file metric collection

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25842?focusedWorklogId=710740&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710740
 ]

ASF GitHub Bot logged work on HIVE-25842:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 18:36
Start Date: 18/Jan/22 18:36
Worklog Time Spent: 10m 
  Work Description: klcopp commented on a change in pull request #2916:
URL: https://github.com/apache/hive/pull/2916#discussion_r787044907



##
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/metrics/DeltaFilesMetricReporter.java
##
@@ -435,55 +187,71 @@ private void initObjectsForMetrics() throws Exception {
 .getObjectName());
   }
 
-  private void initCachesForMetrics(HiveConf conf) {
-int maxCacheSize = HiveConf.getIntVar(conf, 
HiveConf.ConfVars.HIVE_TXN_ACID_METRICS_MAX_CACHE_SIZE);
-long duration = HiveConf.getTimeVar(conf,
-HiveConf.ConfVars.HIVE_TXN_ACID_METRICS_CACHE_DURATION, 
TimeUnit.SECONDS);
-
-deltaTopN = new PriorityBlockingQueue<>(maxCacheSize, getComparator());
-smallDeltaTopN = new PriorityBlockingQueue<>(maxCacheSize, 
getComparator());
-obsoleteDeltaTopN = new PriorityBlockingQueue<>(maxCacheSize, 
getComparator());
-
-deltaCache = CacheBuilder.newBuilder()
-  .expireAfterWrite(duration, TimeUnit.SECONDS)
-  .removalListener(notification -> removalPredicate(deltaTopN, 
notification))
-  .softValues()
-  .build();
-
-smallDeltaCache = CacheBuilder.newBuilder()
-  .expireAfterWrite(duration, TimeUnit.SECONDS)
-  .removalListener(notification -> removalPredicate(smallDeltaTopN, 
notification))
-  .softValues()
-  .build();
-
-obsoleteDeltaCache = CacheBuilder.newBuilder()
-  .expireAfterWrite(duration, TimeUnit.SECONDS)
-  .removalListener(notification -> removalPredicate(obsoleteDeltaTopN, 
notification))
-  .softValues()
-  .build();
-  }
-
-  private static Comparator> getComparator() {
-return Comparator.comparing(Pair::getValue);
-  }
+  private final class ReportingTask implements Runnable {
 
-  private void removalPredicate(BlockingQueue> topN, 
RemovalNotification notification) {
-topN.removeIf(item -> item.getKey().equals(notification.getKey()));
-  }
+private final TxnStore txnHandler;
 
-  private final class ReportingTask implements Runnable {
+private ReportingTask(TxnStore txnHandler) {
+  this.txnHandler = txnHandler;
+}
 @Override
 public void run() {
   Metrics metrics = MetricsFactory.getInstance();
   if (metrics != null) {
-obsoleteDeltaCache.cleanUp();
-obsoleteDeltaObject.updateAll(obsoleteDeltaCache.asMap());
+try {
+  LOG.debug("Called reporting task.");
+  List deltas = 
txnHandler.getTopCompactionMetricsDataPerType(maxCacheSize);
+  Map deltasMap = deltas.stream()
+  .filter(d -> d.getMetricType() == 
CompactionMetricsData.MetricType.NUM_DELTAS).collect(
+  Collectors.toMap(item -> getDeltaCountKey(item.getDbName(), 
item.getTblName(), item.getPartitionName()),
+  CompactionMetricsData::getMetricValue));
+  deltaObject.updateAll(deltasMap);
+
+  Map smallDeltasMap = deltas.stream()
+  .filter(d -> d.getMetricType() == 
CompactionMetricsData.MetricType.NUM_SMALL_DELTAS).collect(
+  Collectors.toMap(item -> getDeltaCountKey(item.getDbName(), 
item.getTblName(), item.getPartitionName()),
+  CompactionMetricsData::getMetricValue));
+  smallDeltaObject.updateAll(smallDeltasMap);
+
+  Map obsoleteDeltasMap = deltas.stream()
+  .filter(d -> d.getMetricType() == 
CompactionMetricsData.MetricType.NUM_OBSOLETE_DELTAS).collect(
+  Collectors.toMap(item -> getDeltaCountKey(item.getDbName(), 
item.getTblName(), item.getPartitionName()),
+  CompactionMetricsData::getMetricValue));
+  obsoleteDeltaObject.updateAll(obsoleteDeltasMap);
+} catch (MetaException e) {

Review comment:
   Maybe catch all Throwables here just in case? (and also in run())






Issue Time Tracking
---

Worklog Id: (was: 710740)
Time Spent: 3h 50m  (was: 3h 40m)

> Reimplement delta file metric collection
> 
>
> Key: HIVE-25842
> URL: https://issues.apache.org/jira/browse/HIVE-25842
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 50m
>  Remaining Estimate: 0h
>
> FUNCTIONALITY: Metrics are collected only when a Tez query runs a table 
> (select * and select count( * ) don't update the metrics)
> Metrics aren't updated after compaction or cleaning after compaction, so 
> users will probably see "issues" with compaction (like many active or 
> obsolete or small deltas) that don't exist.
> RISK: Metrics are collected during queries – we tried to put a try-catch 
> around each method in DeltaFilesMetricsReporter but of course this isn't 
> foolproof. This is a HUGE performance and functionality liability. Tests 
> caught some issues, but our tests aren't perfect.

[jira] [Work logged] (HIVE-25842) Reimplement delta file metric collection

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25842?focusedWorklogId=710738&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710738
 ]

ASF GitHub Bot logged work on HIVE-25842:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 18:33
Start Date: 18/Jan/22 18:33
Worklog Time Spent: 10m 
  Work Description: klcopp commented on a change in pull request #2916:
URL: https://github.com/apache/hive/pull/2916#discussion_r787032118



##
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/metrics/DeltaFilesMetricReporter.java
##
@@ -512,7 +284,177 @@ private void shutdown() {
 }
   }
 
-  public static class DeltaFilesMetadata implements Serializable {
-public String dbName, tableName, partitionName;
+  public static void updateMetricsFromInitiator(AcidDirectory dir, String 
dbName, String tableName, String partitionName,
+  Configuration conf, TxnStore txnHandler) {
+LOG.debug("Updating delta file metrics from initiator");
+double deltaPctThreshold = MetastoreConf.getDoubleVar(conf, 
MetastoreConf.ConfVars.METASTORE_DELTAMETRICS_DELTA_PCT_THRESHOLD);
+int deltasThreshold = MetastoreConf.getIntVar(conf, 
MetastoreConf.ConfVars.METASTORE_DELTAMETRICS_DELTA_NUM_THRESHOLD);
+int obsoleteDeltasThreshold = MetastoreConf.getIntVar(conf,
+
MetastoreConf.ConfVars.METASTORE_DELTAMETRICS_OBSOLETE_DELTA_NUM_THRESHOLD);
+try {
+  // We have an AcidDir from the initiator, therefore we can use that to 
calculate active,small, obsolete delta
+  // count
+  long baseSize = getBaseSize(dir);
+
+  int numDeltas = dir.getCurrentDirectories().size();
+  int numSmallDeltas = 0;
+
+  for (AcidUtils.ParsedDelta delta : dir.getCurrentDirectories()) {
+long deltaSize = getDirSize(delta, dir.getFs());
+if (baseSize != 0 && deltaSize / (float) baseSize < deltaPctThreshold) 
{
+  numSmallDeltas++;
+}
+  }
+
+  int numObsoleteDeltas = dir.getObsolete().size();

Review comment:
   I'm seriously wondering if we should.
   Cons:
   - the metric is named "obsolete deltas", but it would == obsolete + aborted deltas
   - We already have metrics about the amount of aborts in the system (gotten 
from metadata)
   
   Pros:
   - The Cleaner's job is to remove aborted directories as well, so including 
metrics about aborted directories would help with observability of Cleaner 
health
   - Aborted directories can clog up the file system just as much as obsolete 
directories
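The small-delta ratio check in the patch quoted above can be sketched as follows. This is a simplified illustration, not Hive's actual code; the function name and parameters are assumptions, and only the `base_size != 0` guard and the size-ratio comparison mirror the Java hunk.

```python
def count_small_deltas(base_size, delta_sizes, delta_pct_threshold):
    """Count deltas whose size is below delta_pct_threshold of the base size.
    When there is no base (base_size == 0) the ratio check is skipped,
    matching the `baseSize != 0 &&` guard in the patch."""
    if base_size == 0:
        return 0
    return sum(1 for size in delta_sizes
               if size / base_size < delta_pct_threshold)
```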






Issue Time Tracking
---

Worklog Id: (was: 710738)
Time Spent: 3h 40m  (was: 3.5h)

> Reimplement delta file metric collection
> 
>
> Key: HIVE-25842
> URL: https://issues.apache.org/jira/browse/HIVE-25842
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> FUNCTIONALITY: Metrics are collected only when a Tez query runs a table 
> (select * and select count( * ) don't update the metrics)
> Metrics aren't updated after compaction or cleaning after compaction, so 
> users will probably see "issues" with compaction (like many active or 
> obsolete or small deltas) that don't exist.
> RISK: Metrics are collected during queries – we tried to put a try-catch 
> around each method in DeltaFilesMetricsReporter but of course this isn't 
> foolproof. This is a HUGE performance and functionality liability. Tests 
> caught some issues, but our tests aren't perfect.





[jira] [Work logged] (HIVE-25842) Reimplement delta file metric collection

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25842?focusedWorklogId=710724&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710724
 ]

ASF GitHub Bot logged work on HIVE-25842:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 18:20
Start Date: 18/Jan/22 18:20
Worklog Time Spent: 10m 
  Work Description: klcopp commented on a change in pull request #2916:
URL: https://github.com/apache/hive/pull/2916#discussion_r787033085



##
File path: standalone-metastore/metastore-server/src/main/sql/derby/hive-schema-4.0.0.derby.sql
##
@@ -661,6 +661,16 @@ CREATE TABLE COMPLETED_COMPACTIONS (
 
 CREATE INDEX COMPLETED_COMPACTIONS_RES ON COMPLETED_COMPACTIONS 
(CC_DATABASE,CC_TABLE,CC_PARTITION);
 
+-- HIVE-25842
+CREATE TABLE COMPACTION_METRICS_CACHE (
+  CMC_DATABASE varchar(128) NOT NULL,
+  CMC_TABLE varchar(128) NOT NULL,
+  CMC_PARTITION varchar(767),
+  CMC_METRIC_TYPE varchar(128) NOT NULL,
+  CMC_METRIC_VALUE integer NOT NULL,

Review comment:
   Currently yes, but if the use of this table is expanded in the future, 
do you think there's any chance we'll want to allow nulls?






Issue Time Tracking
---

Worklog Id: (was: 710724)
Time Spent: 3.5h  (was: 3h 20m)

> Reimplement delta file metric collection
> 
>
> Key: HIVE-25842
> URL: https://issues.apache.org/jira/browse/HIVE-25842
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> FUNCTIONALITY: Metrics are collected only when a Tez query runs a table 
> (select * and select count( * ) don't update the metrics)
> Metrics aren't updated after compaction or cleaning after compaction, so 
> users will probably see "issues" with compaction (like many active or 
> obsolete or small deltas) that don't exist.
> RISK: Metrics are collected during queries – we tried to put a try-catch 
> around each method in DeltaFilesMetricsReporter but of course this isn't 
> foolproof. This is a HUGE performance and functionality liability. Tests 
> caught some issues, but our tests aren't perfect.





[jira] [Work logged] (HIVE-25842) Reimplement delta file metric collection

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25842?focusedWorklogId=710722&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710722
 ]

ASF GitHub Bot logged work on HIVE-25842:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 18:18
Start Date: 18/Jan/22 18:18
Worklog Time Spent: 10m 
  Work Description: klcopp commented on a change in pull request #2916:
URL: https://github.com/apache/hive/pull/2916#discussion_r787032118



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/metrics/DeltaFilesMetricReporter.java
##
@@ -512,7 +284,177 @@ private void shutdown() {
 }
   }
 
-  public static class DeltaFilesMetadata implements Serializable {
-public String dbName, tableName, partitionName;
+  public static void updateMetricsFromInitiator(AcidDirectory dir, String 
dbName, String tableName, String partitionName,
+  Configuration conf, TxnStore txnHandler) {
+LOG.debug("Updating delta file metrics from initiator");
+double deltaPctThreshold = MetastoreConf.getDoubleVar(conf, 
MetastoreConf.ConfVars.METASTORE_DELTAMETRICS_DELTA_PCT_THRESHOLD);
+int deltasThreshold = MetastoreConf.getIntVar(conf, 
MetastoreConf.ConfVars.METASTORE_DELTAMETRICS_DELTA_NUM_THRESHOLD);
+int obsoleteDeltasThreshold = MetastoreConf.getIntVar(conf,
+
MetastoreConf.ConfVars.METASTORE_DELTAMETRICS_OBSOLETE_DELTA_NUM_THRESHOLD);
+try {
+  // We have an AcidDir from the initiator, therefore we can use that to 
calculate active,small, obsolete delta
+  // count
+  long baseSize = getBaseSize(dir);
+
+  int numDeltas = dir.getCurrentDirectories().size();
+  int numSmallDeltas = 0;
+
+  for (AcidUtils.ParsedDelta delta : dir.getCurrentDirectories()) {
+long deltaSize = getDirSize(delta, dir.getFs());
+if (baseSize != 0 && deltaSize / (float) baseSize < deltaPctThreshold) 
{
+  numSmallDeltas++;
+}
+  }
+
+  int numObsoleteDeltas = dir.getObsolete().size();

Review comment:
   I'm seriously wondering if we should.
   Cons:
   - the metric is named "obsolete deltas", but its value would be obsolete + aborted deltas
   - We already have metrics about the amount of aborts in the system (gotten 
from metadata)
   Pros:
   - The Cleaner's job is to remove aborted directories as well ; so including 
metrics about aborted directories would help with observability of Cleaner 
health
   - Aborted directories can clog up the file system just as much as obsolete 
directories
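   For reference, the small-delta classification in the hunk quoted above can be restated as a standalone sketch. The class and method names are illustrative, not Hive's actual API; the threshold plays the role of `METASTORE_DELTAMETRICS_DELTA_PCT_THRESHOLD`:

```java
import java.util.List;

// Illustrative restatement of the quoted hunk: a delta is "small" when its
// size is below a configured fraction of the base size, and the check is
// skipped entirely when there is no base (baseSize == 0).
public class SmallDeltaCount {
    public static int countSmall(long baseSize, List<Long> deltaSizes, double pctThreshold) {
        int numSmall = 0;
        for (long deltaSize : deltaSizes) {
            if (baseSize != 0 && deltaSize / (float) baseSize < pctThreshold) {
                numSmall++;
            }
        }
        return numSmall;
    }

    public static void main(String[] args) {
        // base of 1000 bytes; deltas of 5 and 900 bytes; 1% threshold
        System.out.println(countSmall(1000, List.of(5L, 900L), 0.01)); // 1
    }
}
```

   With a 1000-byte base, only the 5-byte delta falls under a 1% threshold, so the count is 1; a zero-size base disables the check, mirroring the `baseSize != 0` guard.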






Issue Time Tracking
---

Worklog Id: (was: 710722)
Time Spent: 3h 20m  (was: 3h 10m)

> Reimplement delta file metric collection
> 
>
> Key: HIVE-25842
> URL: https://issues.apache.org/jira/browse/HIVE-25842
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> FUNCTIONALITY: Metrics are collected only when a Tez query runs on a table 
> (select * and select count( * ) don't update the metrics)
> Metrics aren't updated after compaction or cleaning after compaction, so 
> users will probably see "issues" with compaction (like many active or 
> obsolete or small deltas) that don't exist.
> RISK: Metrics are collected during queries – we tried to put a try-catch 
> around each method in DeltaFilesMetricsReporter but of course this isn't 
> foolproof. This is a HUGE performance and functionality liability. Tests 
> caught some issues, but our tests aren't perfect.





[jira] [Work logged] (HIVE-25842) Reimplement delta file metric collection

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25842?focusedWorklogId=710716=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710716
 ]

ASF GitHub Bot logged work on HIVE-25842:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 18:15
Start Date: 18/Jan/22 18:15
Worklog Time Spent: 10m 
  Work Description: klcopp commented on a change in pull request #2916:
URL: https://github.com/apache/hive/pull/2916#discussion_r787029629



##
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##
@@ -396,7 +423,9 @@ private boolean removeFiles(String location, 
ValidWriteIdList writeIdList, Compa
 }
 StringBuilder extraDebugInfo = new 
StringBuilder("[").append(obsoleteDirs.stream()
 .map(Path::getName).collect(Collectors.joining(",")));
-return remove(location, ci, obsoleteDirs, true, fs, extraDebugInfo);
+boolean success = remove(location, ci, obsoleteDirs, true, fs, 
extraDebugInfo);
+updateDeltaFilesMetrics(ci.dbname, ci.tableName, ci.partName, 
dir.getObsolete().size());
+return success;

Review comment:
   Let me rephrase then (discussion of aborted directories below):
   Base directories and original files may be in the list of obsolete files the 
Cleaner deletes. Do you think it would be worth filtering the obsolete files’ 
names for “delta”?
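   A minimal sketch of the filtering suggested here, assuming obsolete entries can be distinguished by directory name (`delta_*` / `delete_delta_*` vs. `base_*` directories and original files); the class and method are hypothetical, not Hive's API:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: keep only obsolete directories whose name contains
// "delta" (matches both delta_* and delete_delta_*), so that base
// directories and original files are excluded from the "obsolete deltas" metric.
public class ObsoleteDeltaFilter {
    public static List<String> filterDeltas(List<String> obsoleteDirNames) {
        return obsoleteDirNames.stream()
                .filter(name -> name.contains("delta"))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> obsolete = List.of("base_5", "delta_1_2", "delete_delta_3_3", "000001_0");
        System.out.println(filterDeltas(obsolete)); // [delta_1_2, delete_delta_3_3]
    }
}
```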






Issue Time Tracking
---

Worklog Id: (was: 710716)
Time Spent: 3h 10m  (was: 3h)

> Reimplement delta file metric collection
> 
>
> Key: HIVE-25842
> URL: https://issues.apache.org/jira/browse/HIVE-25842
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> FUNCTIONALITY: Metrics are collected only when a Tez query runs on a table 
> (select * and select count( * ) don't update the metrics)
> Metrics aren't updated after compaction or cleaning after compaction, so 
> users will probably see "issues" with compaction (like many active or 
> obsolete or small deltas) that don't exist.
> RISK: Metrics are collected during queries – we tried to put a try-catch 
> around each method in DeltaFilesMetricsReporter but of course this isn't 
> foolproof. This is a HUGE performance and functionality liability. Tests 
> caught some issues, but our tests aren't perfect.





[jira] [Work logged] (HIVE-25842) Reimplement delta file metric collection

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25842?focusedWorklogId=710713=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710713
 ]

ASF GitHub Bot logged work on HIVE-25842:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 18:13
Start Date: 18/Jan/22 18:13
Worklog Time Spent: 10m 
  Work Description: klcopp commented on a change in pull request #2916:
URL: https://github.com/apache/hive/pull/2916#discussion_r787028622



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/metrics/DeltaFilesMetricReporter.java
##
@@ -398,13 +156,6 @@ private static long getBaseSize(AcidDirectory dir) throws 
IOException {
 return baseSize;
   }
 
-  private static long getModificationTime(AcidUtils.ParsedDirectory dir, 
FileSystem fs) throws IOException {
-return dir.getFiles(fs, Ref.from(false)).stream()
-  .map(HadoopShims.HdfsFileStatusWithId::getFileStatus)
-  .mapToLong(FileStatus::getModificationTime)
-  .max()
-  .orElse(new Date().getTime());
-  }
 
   private static long getDirSize(AcidUtils.ParsedDirectory dir, FileSystem fs) 
throws IOException {

Review comment:
   Do you think that collecting metrics about small deltas is worth the 
amount by which this might slow down the Initiator/Worker/Cleaner?






Issue Time Tracking
---

Worklog Id: (was: 710713)
Time Spent: 3h  (was: 2h 50m)

> Reimplement delta file metric collection
> 
>
> Key: HIVE-25842
> URL: https://issues.apache.org/jira/browse/HIVE-25842
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> FUNCTIONALITY: Metrics are collected only when a Tez query runs on a table 
> (select * and select count( * ) don't update the metrics)
> Metrics aren't updated after compaction or cleaning after compaction, so 
> users will probably see "issues" with compaction (like many active or 
> obsolete or small deltas) that don't exist.
> RISK: Metrics are collected during queries – we tried to put a try-catch 
> around each method in DeltaFilesMetricsReporter but of course this isn't 
> foolproof. This is a HUGE performance and functionality liability. Tests 
> caught some issues, but our tests aren't perfect.





[jira] [Work logged] (HIVE-25842) Reimplement delta file metric collection

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25842?focusedWorklogId=710710=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710710
 ]

ASF GitHub Bot logged work on HIVE-25842:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 18:12
Start Date: 18/Jan/22 18:12
Worklog Time Spent: 10m 
  Work Description: klcopp commented on a change in pull request #2916:
URL: https://github.com/apache/hive/pull/2916#discussion_r787027526



##
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Worker.java
##
@@ -142,6 +144,9 @@ public void init(AtomicBoolean stop) throws Exception {
 super.init(stop);
 this.workerName = getWorkerId();
 setName(workerName);
+metricsEnabled = MetastoreConf.getBoolVar(conf, 
MetastoreConf.ConfVars.METRICS_ENABLED) &&
+MetastoreConf.getBoolVar(conf, 
MetastoreConf.ConfVars.METASTORE_ACIDMETRICS_EXT_ON) &&
+MetastoreConf.getBoolVar(conf, 
MetastoreConf.ConfVars.COMPACTOR_INITIATOR_ON);

Review comment:
   There is a possibility that COMPACTOR_INITIATOR_ON==false on a given HS2 
instance (even if the Initiator / Cleaner are running in some HMS somewhere 
else).
   
   There's already a risk that MetastoreConf.ConfVars.METRICS_ENABLED == false 
on any given HS2, which means that only the Initiator and Cleaner are updating 
the metrics, which means the metric values are incorrect. Adding another config 
(COMPACTOR_INITIATOR_ON) just increases this risk.






Issue Time Tracking
---

Worklog Id: (was: 710710)
Time Spent: 2h 50m  (was: 2h 40m)

> Reimplement delta file metric collection
> 
>
> Key: HIVE-25842
> URL: https://issues.apache.org/jira/browse/HIVE-25842
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> FUNCTIONALITY: Metrics are collected only when a Tez query runs on a table 
> (select * and select count( * ) don't update the metrics)
> Metrics aren't updated after compaction or cleaning after compaction, so 
> users will probably see "issues" with compaction (like many active or 
> obsolete or small deltas) that don't exist.
> RISK: Metrics are collected during queries – we tried to put a try-catch 
> around each method in DeltaFilesMetricsReporter but of course this isn't 
> foolproof. This is a HUGE performance and functionality liability. Tests 
> caught some issues, but our tests aren't perfect.





[jira] [Assigned] (HIVE-25875) Support multiple authentication mechanisms simultaneously

2022-01-18 Thread Naveen Gangam (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naveen Gangam reassigned HIVE-25875:


Assignee: Sai Hemanth Gantasala  (was: Naveen Gangam)

> Support multiple authentication mechanisms simultaneously 
> --
>
> Key: HIVE-25875
> URL: https://issues.apache.org/jira/browse/HIVE-25875
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Affects Versions: 3.1.0
>Reporter: Naveen Gangam
>Assignee: Sai Hemanth Gantasala
>Priority: Major
>
> Currently, HS2 supports a single form of auth on any given instance of 
> HiveServer2. Hive should be able to support multiple auth mechanisms on a 
> single instance, especially with HTTP transport; for example, LDAP and SAML.  
> In both cases, HS2 ends up receiving an Authorization header in the 
> request. Similarly, we could support JWT or other forms of 
> boundary authentication that are done outside of Hive.





[jira] [Assigned] (HIVE-25875) Support multiple authentication mechanisms simultaneously

2022-01-18 Thread Naveen Gangam (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Naveen Gangam reassigned HIVE-25875:



> Support multiple authentication mechanisms simultaneously 
> --
>
> Key: HIVE-25875
> URL: https://issues.apache.org/jira/browse/HIVE-25875
> Project: Hive
>  Issue Type: Improvement
>  Components: HiveServer2
>Affects Versions: 3.1.0
>Reporter: Naveen Gangam
>Assignee: Naveen Gangam
>Priority: Major
>
> Currently, HS2 supports a single form of auth on any given instance of 
> HiveServer2. Hive should be able to support multiple auth mechanisms on a 
> single instance, especially with HTTP transport; for example, LDAP and SAML.  
> In both cases, HS2 ends up receiving an Authorization header in the 
> request. Similarly, we could support JWT or other forms of 
> boundary authentication that are done outside of Hive.





[jira] [Work logged] (HIVE-25864) Hive query optimisation creates wrong plan for predicate pushdown with windowing function

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25864?focusedWorklogId=710650=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710650
 ]

ASF GitHub Bot logged work on HIVE-25864:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 17:03
Start Date: 18/Jan/22 17:03
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #2943:
URL: https://github.com/apache/hive/pull/2943#discussion_r786971885



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveFilterProjectTransposeRule.java
##
@@ -170,6 +171,7 @@ public void onMatch(RelOptRuleCall call) {
 if 
(HiveCalciteUtil.isDeterministicFuncWithSingleInputRef(newCondition,
 commonPartitionKeys)) {
   newPartKeyFilConds.add(newCondition);
+  isConversionDone = true;

Review comment:
   * the line above adds 1 element to `newPartKeyFilConds`
   * the if at line 180 will then only fire when that collection is not 
empty
   * we will also have `isConversionDone` set to true (it is set iff the 
collection is not empty)
   * `getNewProject` has only one call site => it will see 
`isConversionDone=true` all the time
   
   I think we could:
   * simply add `ce` to `newPartKeyFilConds` and let the existing code push 
it... the non-over-related codepath has been using external rexes, but now we 
would add an already-pushed one...
   * I was also thinking of a different approach with a rename of 
`filterCondToPushBelowProj` to `filterCondPushedBelowProj`, but 
it seemed more complicated; 
   
   I think it would be better to add `ce` to that array in line#173.






Issue Time Tracking
---

Worklog Id: (was: 710650)
Time Spent: 20m  (was: 10m)

> Hive query optimisation creates wrong plan for predicate pushdown with 
> windowing function 
> --
>
> Key: HIVE-25864
> URL: https://issues.apache.org/jira/browse/HIVE-25864
> Project: Hive
>  Issue Type: Bug
>Reporter: mahesh kumar behera
>Assignee: mahesh kumar behera
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In the case of a query with a windowing function, the deterministic predicates are 
> pushed down below the window function. Before pushing down, the predicate is 
> converted to refer to the project operator's values. But the same conversion is 
> done again while creating the project, thus causing wrong plan generation.





[jira] [Work logged] (HIVE-24805) Compactor: Initiator shouldn't fetch table details again and again for partitioned tables

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24805?focusedWorklogId=710643=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710643
 ]

ASF GitHub Bot logged work on HIVE-24805:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 16:47
Start Date: 18/Jan/22 16:47
Worklog Time Spent: 10m 
  Work Description: asinkovits commented on a change in pull request #2906:
URL: https://github.com/apache/hive/pull/2906#discussion_r786957161



##
File path: 
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java
##
@@ -589,6 +589,9 @@ public static ConfVars getMetaConf(String name) {
 "in which a warning is being raised if multiple worker version are 
detected.\n" +
 "The setting has no effect if the metastore.metrics.enabled is 
disabled \n" +
 "or the metastore.acidmetrics.thread.on is turned off."),
+
COMPACTOR_METADATA_CACHE_TIMEOUT("metastore.compactor.metadata.cache.timeout",
+  "hive.metastore.compactor.metadata.cache.timeout", 60, TimeUnit.SECONDS,

Review comment:
   The idea was to keep it at a low value so that we don't need extra 
eviction logic. Now that the cache works for a single cycle of the initiator, 
the sole purpose of this config is to be able to control the memory consumption 
of the cache.
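   The trade-off described here — the timeout being the only eviction mechanism, bounding both staleness and memory within one Initiator cycle — can be sketched with a toy TTL cache (plain JDK, no relation to Hive's actual cache implementation):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Toy TTL-only cache: an entry is reloaded once it is older than the
// configured timeout; a timeout of zero effectively disables caching,
// matching "Setting it to zero disables the feature" in the config docs.
public class TtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;
        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public TtlCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    // Returns the cached value if still fresh, otherwise invokes the loader.
    public V get(K key, Function<K, V> loader) {
        final long now = System.currentTimeMillis();
        return map.compute(key, (k, old) ->
                (old != null && old.expiresAtMillis > now)
                        ? old
                        : new Entry<>(loader.apply(k), now + ttlMillis)).value;
    }

    public static void main(String[] args) {
        TtlCache<String, String> cache = new TtlCache<>(60_000);
        // The loader (e.g. an HMS getTable call) runs only on a miss or after expiry.
        System.out.println(cache.get("default.store_returns", k -> "table-metadata"));
    }
}
```

   Memory is bounded only indirectly here: a short TTL means stale entries are replaced on the next access within a cycle, which is why the remaining purpose of the config is controlling how much metadata the cache can hold.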






Issue Time Tracking
---

Worklog Id: (was: 710643)
Time Spent: 3h 10m  (was: 3h)

> Compactor: Initiator shouldn't fetch table details again and again for 
> partitioned tables
> -
>
> Key: HIVE-24805
> URL: https://issues.apache.org/jira/browse/HIVE-24805
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Reporter: Rajesh Balamohan
>Assignee: Antal Sinkovits
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> The Initiator shouldn't fetch table details for all of a table's partitions. When there 
> are a large number of databases/tables, it takes a lot of time for the Initiator to 
> complete its initial iteration, and the load on the DB also goes up.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java#L129
> https://github.com/apache/hive/blob/64bb52316f19426ebea0087ee15e282cbde1d852/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java#L456
> For all the following partitions, table details would be the same. However, 
> it ends up fetching table details from HMS again and again.
> {noformat}
> 2021-02-22 08:13:16,106 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2451899
> 2021-02-22 08:13:16,124 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2451830
> 2021-02-22 08:13:16,140 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2452586
> 2021-02-22 08:13:16,149 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2452698
> 2021-02-22 08:13:16,158 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2452063
> {noformat}





[jira] [Work logged] (HIVE-24805) Compactor: Initiator shouldn't fetch table details again and again for partitioned tables

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24805?focusedWorklogId=710633=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710633
 ]

ASF GitHub Bot logged work on HIVE-24805:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 16:39
Start Date: 18/Jan/22 16:39
Worklog Time Spent: 10m 
  Work Description: asinkovits commented on a change in pull request #2906:
URL: https://github.com/apache/hive/pull/2906#discussion_r786948668



##
File path: 
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java
##
@@ -589,6 +589,9 @@ public static ConfVars getMetaConf(String name) {
 "in which a warning is being raised if multiple worker version are 
detected.\n" +
 "The setting has no effect if the metastore.metrics.enabled is 
disabled \n" +
 "or the metastore.acidmetrics.thread.on is turned off."),
+
COMPACTOR_METADATA_CACHE_TIMEOUT("metastore.compactor.metadata.cache.timeout",
+  "hive.metastore.compactor.metadata.cache.timeout", 60, TimeUnit.SECONDS,
+  "Number of seconds the table/partition metadata are cached by the 
compactor. Setting it to zero disables the feature."),

Review comment:
   Missed this when I removed partition caching. Fixed, thanks.






Issue Time Tracking
---

Worklog Id: (was: 710633)
Time Spent: 3h  (was: 2h 50m)

> Compactor: Initiator shouldn't fetch table details again and again for 
> partitioned tables
> -
>
> Key: HIVE-24805
> URL: https://issues.apache.org/jira/browse/HIVE-24805
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Reporter: Rajesh Balamohan
>Assignee: Antal Sinkovits
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> The Initiator shouldn't fetch table details for all of a table's partitions. When there 
> are a large number of databases/tables, it takes a lot of time for the Initiator to 
> complete its initial iteration, and the load on the DB also goes up.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java#L129
> https://github.com/apache/hive/blob/64bb52316f19426ebea0087ee15e282cbde1d852/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java#L456
> For all the following partitions, table details would be the same. However, 
> it ends up fetching table details from HMS again and again.
> {noformat}
> 2021-02-22 08:13:16,106 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2451899
> 2021-02-22 08:13:16,124 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2451830
> 2021-02-22 08:13:16,140 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2452586
> 2021-02-22 08:13:16,149 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2452698
> 2021-02-22 08:13:16,158 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2452063
> {noformat}





[jira] [Work logged] (HIVE-25842) Reimplement delta file metric collection

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25842?focusedWorklogId=710604=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710604
 ]

ASF GitHub Bot logged work on HIVE-25842:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 15:57
Start Date: 18/Jan/22 15:57
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on a change in pull request #2916:
URL: https://github.com/apache/hive/pull/2916#discussion_r786901861



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/metrics/DeltaFilesMetricReporter.java
##
@@ -512,7 +284,177 @@ private void shutdown() {
 }
   }
 
-  public static class DeltaFilesMetadata implements Serializable {
-public String dbName, tableName, partitionName;
+  public static void updateMetricsFromInitiator(AcidDirectory dir, String 
dbName, String tableName, String partitionName,
+  Configuration conf, TxnStore txnHandler) {
+LOG.debug("Updating delta file metrics from initiator");
+double deltaPctThreshold = MetastoreConf.getDoubleVar(conf, 
MetastoreConf.ConfVars.METASTORE_DELTAMETRICS_DELTA_PCT_THRESHOLD);
+int deltasThreshold = MetastoreConf.getIntVar(conf, 
MetastoreConf.ConfVars.METASTORE_DELTAMETRICS_DELTA_NUM_THRESHOLD);
+int obsoleteDeltasThreshold = MetastoreConf.getIntVar(conf,
+
MetastoreConf.ConfVars.METASTORE_DELTAMETRICS_OBSOLETE_DELTA_NUM_THRESHOLD);
+try {
+  // We have an AcidDir from the initiator, therefore we can use that to 
calculate active,small, obsolete delta
+  // count
+  long baseSize = getBaseSize(dir);
+
+  int numDeltas = dir.getCurrentDirectories().size();
+  int numSmallDeltas = 0;
+
+  for (AcidUtils.ParsedDelta delta : dir.getCurrentDirectories()) {
+long deltaSize = getDirSize(delta, dir.getFs());
+if (baseSize != 0 && deltaSize / (float) baseSize < deltaPctThreshold) 
{
+  numSmallDeltas++;
+}
+  }
+
+  int numObsoleteDeltas = dir.getObsolete().size();

Review comment:
   Do we want to include aborted directories in the obsolete delta count? 






Issue Time Tracking
---

Worklog Id: (was: 710604)
Time Spent: 2h 40m  (was: 2.5h)

> Reimplement delta file metric collection
> 
>
> Key: HIVE-25842
> URL: https://issues.apache.org/jira/browse/HIVE-25842
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> FUNCTIONALITY: Metrics are collected only when a Tez query runs on a table 
> (select * and select count( * ) don't update the metrics)
> Metrics aren't updated after compaction or cleaning after compaction, so 
> users will probably see "issues" with compaction (like many active or 
> obsolete or small deltas) that don't exist.
> RISK: Metrics are collected during queries – we tried to put a try-catch 
> around each method in DeltaFilesMetricsReporter but of course this isn't 
> foolproof. This is a HUGE performance and functionality liability. Tests 
> caught some issues, but our tests aren't perfect.





[jira] [Work logged] (HIVE-25842) Reimplement delta file metric collection

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25842?focusedWorklogId=710602=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710602
 ]

ASF GitHub Bot logged work on HIVE-25842:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 15:56
Start Date: 18/Jan/22 15:56
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on a change in pull request #2916:
URL: https://github.com/apache/hive/pull/2916#discussion_r786900655



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/metrics/DeltaFilesMetricReporter.java
##
@@ -512,7 +284,177 @@ private void shutdown() {
 }
   }
 
-  public static class DeltaFilesMetadata implements Serializable {
-public String dbName, tableName, partitionName;
+  public static void updateMetricsFromInitiator(AcidDirectory dir, String 
dbName, String tableName, String partitionName,
+      Configuration conf, TxnStore txnHandler) {
+    LOG.debug("Updating delta file metrics from initiator");
+    double deltaPctThreshold = MetastoreConf.getDoubleVar(conf, MetastoreConf.ConfVars.METASTORE_DELTAMETRICS_DELTA_PCT_THRESHOLD);
+    int deltasThreshold = MetastoreConf.getIntVar(conf, MetastoreConf.ConfVars.METASTORE_DELTAMETRICS_DELTA_NUM_THRESHOLD);
+    int obsoleteDeltasThreshold = MetastoreConf.getIntVar(conf,
+        MetastoreConf.ConfVars.METASTORE_DELTAMETRICS_OBSOLETE_DELTA_NUM_THRESHOLD);
+    try {
+      // We have an AcidDir from the initiator, therefore we can use that to calculate the active, small and
+      // obsolete delta count
+      long baseSize = getBaseSize(dir);
+
+      int numDeltas = dir.getCurrentDirectories().size();
+      int numSmallDeltas = 0;
+
+      for (AcidUtils.ParsedDelta delta : dir.getCurrentDirectories()) {
+        long deltaSize = getDirSize(delta, dir.getFs());
+        if (baseSize != 0 && deltaSize / (float) baseSize < deltaPctThreshold) {
+          numSmallDeltas++;
+        }
+      }
+
+      int numObsoleteDeltas = dir.getObsolete().size();
+
+      if (numDeltas > deltasThreshold) {
+        updateMetrics(dbName, tableName, partitionName, CompactionMetricsData.MetricType.NUM_DELTAS, numDeltas,
+            txnHandler);
+      }
+
+      if (numSmallDeltas > deltasThreshold) {
+        updateMetrics(dbName, tableName, partitionName, CompactionMetricsData.MetricType.NUM_SMALL_DELTAS,
+            numSmallDeltas, txnHandler);
+      }
+
+      if (numObsoleteDeltas > obsoleteDeltasThreshold) {
+        updateMetrics(dbName, tableName, partitionName, CompactionMetricsData.MetricType.NUM_OBSOLETE_DELTAS,
+            numObsoleteDeltas, txnHandler);
+      }
+
+      LOG.debug("Finished updating delta file metrics from initiator.\n deltaPctThreshold = {}, deltasThreshold = {}, "
+          + "obsoleteDeltasThreshold = {}, numDeltas = {}, numSmallDeltas = {}, numObsoleteDeltas = {}",
+          deltaPctThreshold, deltasThreshold, obsoleteDeltasThreshold, numDeltas, numSmallDeltas, numObsoleteDeltas);
+
+    } catch (Throwable t) {
+      LOG.warn("Unknown throwable caught while updating delta metrics. Metrics will not be updated.", t);
+    }
+  }
+
+  public static void updateMetricsFromWorker(AcidDirectory directory, String dbName, String tableName,
+      String partitionName, CompactionType type, Configuration conf, IMetaStoreClient client) {
+    LOG.debug("Updating delta file metrics from worker");
+    int deltasThreshold = MetastoreConf.getIntVar(conf, MetastoreConf.ConfVars.METASTORE_DELTAMETRICS_DELTA_NUM_THRESHOLD);
+    int obsoleteDeltasThreshold = MetastoreConf.getIntVar(conf,
+        MetastoreConf.ConfVars.METASTORE_DELTAMETRICS_OBSOLETE_DELTA_NUM_THRESHOLD);
+    try {
+      // We have an instance of the AcidDirectory from before the compaction worker was started;
+      // from this we can get how many delta directories existed.
+      // The previously active delta directories are now moved to obsolete.
+      int numObsoleteDeltas = directory.getCurrentDirectories().size();
+      if (numObsoleteDeltas > obsoleteDeltasThreshold) {
+        updateMetrics(dbName, tableName, partitionName, CompactionMetricsMetricType.NUM_OBSOLETE_DELTAS,
+            numObsoleteDeltas, client);
+      }
+
+      // We don't know the size of the newly created delta directories; that would require a fresh AcidDirectory.
+      // Clear the small delta num counter from the cache for this key.
+      client.removeCompactionMetricsData(dbName, tableName, partitionName, CompactionMetricsMetricType.NUM_SMALL_DELTAS);
+
+      // The new number of active delta dirs is either 0, 1 or 2.
+      // If we ran MAJOR compaction, no new delta is created, just the base dir.
+      // If we ran MINOR compaction, we can have 1 or 2 new delta dirs, depending on whether we had deltas or
+      // delete deltas.
+      if (type == CompactionType.MAJOR) {
+
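The initiator-side "small delta" decision above is just a ratio test against the base size. A standalone sketch follows; the flat `countSmallDeltas(long, long[], double)` signature and class name are hypothetical simplifications (the real code reads sizes from an `AcidDirectory` and thresholds from `MetastoreConf`):

```java
// Standalone sketch (not Hive code) of the small-delta ratio test shown in the patch above.
public class SmallDeltaSketch {

    // A delta is "small" when its size relative to the base falls below the
    // configured percentage threshold; when there is no base yet (baseSize == 0),
    // the ratio test is skipped, matching the patch.
    static int countSmallDeltas(long baseSize, long[] deltaSizes, double deltaPctThreshold) {
        int numSmallDeltas = 0;
        for (long deltaSize : deltaSizes) {
            if (baseSize != 0 && deltaSize / (float) baseSize < deltaPctThreshold) {
                numSmallDeltas++;
            }
        }
        return numSmallDeltas;
    }

    public static void main(String[] args) {
        // base of 1000 bytes, 1% threshold: only the 5-byte delta qualifies
        System.out.println(countSmallDeltas(1000L, new long[]{5L, 50L, 500L}, 0.01)); // 1
        // no base yet: nothing is counted, regardless of delta sizes
        System.out.println(countSmallDeltas(0L, new long[]{5L}, 0.01)); // 0
    }
}
```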

[jira] [Work logged] (HIVE-25842) Reimplement delta file metric collection

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25842?focusedWorklogId=710599&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710599
 ]

ASF GitHub Bot logged work on HIVE-25842:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 15:48
Start Date: 18/Jan/22 15:48
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on a change in pull request #2916:
URL: https://github.com/apache/hive/pull/2916#discussion_r786892581



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/metrics/DeltaFilesMetricReporter.java
##
@@ -398,13 +156,6 @@ private static long getBaseSize(AcidDirectory dir) throws IOException {
     return baseSize;
   }
 
-  private static long getModificationTime(AcidUtils.ParsedDirectory dir, FileSystem fs) throws IOException {
-    return dir.getFiles(fs, Ref.from(false)).stream()
-      .map(HadoopShims.HdfsFileStatusWithId::getFileStatus)
-      .mapToLong(FileStatus::getModificationTime)
-      .max()
-      .orElse(new Date().getTime());
-  }
 
   private static long getDirSize(AcidUtils.ParsedDirectory dir, FileSystem fs) throws IOException {

Review comment:
   It will slow it down, but there is no other way we could calculate the 
directory size. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 710599)
Time Spent: 2h 20m  (was: 2h 10m)

> Reimplement delta file metric collection
> 
>
> Key: HIVE-25842
> URL: https://issues.apache.org/jira/browse/HIVE-25842
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> FUNCTIONALITY: Metrics are collected only when a Tez query runs on a table 
> (select * and select count( * ) don't update the metrics)
> Metrics aren't updated after compaction or cleaning after compaction, so 
> users will probably see "issues" with compaction (like many active or 
> obsolete or small deltas) that don't exist.
> RISK: Metrics are collected during queries – we tried to put a try-catch 
> around each method in DeltaFilesMetricsReporter but of course this isn't 
> foolproof. This is a HUGE performance and functionality liability. Tests 
> caught some issues, but our tests aren't perfect.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Work logged] (HIVE-25842) Reimplement delta file metric collection

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25842?focusedWorklogId=710591&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710591
 ]

ASF GitHub Bot logged work on HIVE-25842:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 15:39
Start Date: 18/Jan/22 15:39
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on a change in pull request #2916:
URL: https://github.com/apache/hive/pull/2916#discussion_r786756287



##
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Worker.java
##
@@ -142,6 +144,9 @@ public void init(AtomicBoolean stop) throws Exception {
     super.init(stop);
     this.workerName = getWorkerId();
     setName(workerName);
+    metricsEnabled = MetastoreConf.getBoolVar(conf, MetastoreConf.ConfVars.METRICS_ENABLED) &&

Review comment:
   Done

##
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##
@@ -87,14 +106,15 @@ public void init(AtomicBoolean stop) throws Exception {
     cleanerExecutor = CompactorUtil.createExecutorWithThreadFactory(
         conf.getIntVar(HiveConf.ConfVars.HIVE_COMPACTOR_CLEANER_THREADS_NUM),
         COMPACTOR_CLEANER_THREAD_NAME_FORMAT);
+    metricsEnabled = MetastoreConf.getBoolVar(conf, MetastoreConf.ConfVars.METRICS_ENABLED) &&
+        MetastoreConf.getBoolVar(conf, MetastoreConf.ConfVars.METASTORE_ACIDMETRICS_EXT_ON) &&
+        MetastoreConf.getBoolVar(conf, MetastoreConf.ConfVars.COMPACTOR_INITIATOR_ON);

Review comment:
   It doesn't hurt if we double check :)

##
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Worker.java
##
@@ -142,6 +144,9 @@ public void init(AtomicBoolean stop) throws Exception {
 super.init(stop);
 this.workerName = getWorkerId();
 setName(workerName);
+metricsEnabled = MetastoreConf.getBoolVar(conf, 
MetastoreConf.ConfVars.METRICS_ENABLED) &&
+MetastoreConf.getBoolVar(conf, 
MetastoreConf.ConfVars.METASTORE_ACIDMETRICS_EXT_ON) &&
+MetastoreConf.getBoolVar(conf, 
MetastoreConf.ConfVars.COMPACTOR_INITIATOR_ON);

Review comment:
   Do we want to update metrics, when the initiator/Cleaner is not running? 
Can that be a valid use case?

##
File path: 
ql/src/test/org/apache/hadoop/hive/ql/txn/compactor/TestCompactionMetrics.java
##
@@ -81,6 +81,7 @@
   public void setUp() throws Exception {
     MetastoreConf.setBoolVar(conf, MetastoreConf.ConfVars.METRICS_ENABLED, true);
     MetastoreConf.setBoolVar(conf, MetastoreConf.ConfVars.TXN_USE_MIN_HISTORY_LEVEL, true);
+    MetastoreConf.setBoolVar(conf, MetastoreConf.ConfVars.COMPACTOR_INITIATOR_ON, true);

Review comment:
   We need this flag set `true`, otherwise the metrics are not collected. 

##
File path: 
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HMSHandler.java
##
@@ -8831,6 +8833,34 @@ public void mark_failed(CompactionInfoStruct cr) throws MetaException {
     getTxnHandler().markFailed(CompactionInfo.compactionStructToInfo(cr));
   }
 
+  @Override
+  public CompactionMetricsDataResponse get_compaction_metrics_data(String dbName, String tblName,
+      String partitionName, CompactionMetricsMetricType type) throws MetaException {
+    CompactionMetricsData metricsData =
+        getTxnHandler().getCompactionMetricsData(dbName, tblName, partitionName,
+            CompactionMetricsDataConverter.thriftCompactionMetricType2DbType(type));
+    CompactionMetricsDataResponse response = new CompactionMetricsDataResponse();
+    if (metricsData != null) {
+      response.setData(CompactionMetricsDataConverter.dataToStruct(metricsData));
+    }
+    return response;
+  }
+
+  @Override
+  public boolean update_compaction_metrics_data(CompactionMetricsDataStruct struct, int version) throws MetaException {
+    return getTxnHandler().updateCompactionMetricsData(CompactionMetricsDataConverter.structToData(struct), version);

Review comment:
   Per java doc, the object must be always non-null.

##
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Worker.java
##
@@ -671,6 +679,13 @@ private String getWorkerId() {
     return name.toString();
   }
 
+  private void updateDeltaFilesMetrics(AcidDirectory directory, String dbName, String tableName, String partName,
+      CompactionType type) {
+    if (metricsEnabled) {
+      DeltaFilesMetricReporter.updateMetricsFromWorker(directory, dbName, tableName, partName, type, conf, msc);

Review comment:
   All the `updateMetricsFrom*` methods are static. They are completely 
stateless, and the outcome of the metrics computation is stored in the backend 
DB, which is accessible by all the compaction threads regardless of which 
process is hosting them. 

##
File path: service/src/java/org/apache/hive/service/server/HiveServer2.java

[jira] [Work logged] (HIVE-25266) Fix TestWarehouseExternalDir

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25266?focusedWorklogId=710552&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710552
 ]

ASF GitHub Bot logged work on HIVE-25266:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 14:42
Start Date: 18/Jan/22 14:42
Worklog Time Spent: 10m 
  Work Description: mbathori-cloudera commented on pull request #2951:
URL: https://github.com/apache/hive/pull/2951#issuecomment-1015478633


   The TestCliDriver[mapjoin_memcheck] test failure seems to be an intermittent 
issue. It is passing locally.




Issue Time Tracking
---

Worklog Id: (was: 710552)
Time Spent: 20m  (was: 10m)

> Fix TestWarehouseExternalDir
> 
>
> Key: HIVE-25266
> URL: https://issues.apache.org/jira/browse/HIVE-25266
> Project: Hive
>  Issue Type: Bug
>Reporter: Zoltan Haindrich
>Assignee: Mark Bathori
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> test is unstable 
> http://ci.hive.apache.org/job/hive-flaky-check/244/





[jira] [Updated] (HIVE-25266) Fix TestWarehouseExternalDir

2022-01-18 Thread Mark Bathori (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Bathori updated HIVE-25266:

Status: Patch Available  (was: Open)

> Fix TestWarehouseExternalDir
> 
>
> Key: HIVE-25266
> URL: https://issues.apache.org/jira/browse/HIVE-25266
> Project: Hive
>  Issue Type: Bug
>Reporter: Zoltan Haindrich
>Assignee: Mark Bathori
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> test is unstable 
> http://ci.hive.apache.org/job/hive-flaky-check/244/





[jira] [Assigned] (HIVE-25266) Fix TestWarehouseExternalDir

2022-01-18 Thread Mark Bathori (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Bathori reassigned HIVE-25266:
---

Assignee: Mark Bathori

> Fix TestWarehouseExternalDir
> 
>
> Key: HIVE-25266
> URL: https://issues.apache.org/jira/browse/HIVE-25266
> Project: Hive
>  Issue Type: Bug
>Reporter: Zoltan Haindrich
>Assignee: Mark Bathori
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> test is unstable 
> http://ci.hive.apache.org/job/hive-flaky-check/244/





[jira] [Commented] (HIVE-23959) Provide an option to wipe out column stats for partitioned tables in case of column removal

2022-01-18 Thread Stamatis Zampetakis (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-23959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478361#comment-17478361
 ] 

Stamatis Zampetakis commented on HIVE-23959:


Before this change a DDL statement updating a column in a partitioned table 
would remove the statistics for the updated column from every partition but 
would leave the stats for other columns intact.

After this change, if the appropriate configuration property is set, updating a 
column removes *all* partition statistics (for all columns of the table).

[~kgyrtkirk]  is my understanding correct or did I miss something?

> Provide an option to wipe out column stats for partitioned tables in case of 
> column removal
> ---
>
> Key: HIVE-23959
> URL: https://issues.apache.org/jira/browse/HIVE-23959
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> in case of column removal / replacement - an update for each partition is 
> neccessary; which could take a while.
> goal here is to provide an option to switch to the bulk removal of column 
> statistics instead of working hard to retain as much as possible from the old 
> stats.





[jira] [Commented] (HIVE-25874) Slow filter evaluation of nest struct fields in vectorized executions

2022-01-18 Thread Zoltan Haindrich (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-25874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477819#comment-17477819
 ] 

Zoltan Haindrich commented on HIVE-25874:
-

The issue is caused by VectorStructField not resetting the output vector - which causes the array in it to retain all previous elements, and it will keep expanding the backing vector.

It took 21 minutes to execute the query before the patch; after it, 2 seconds.
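The failure mode described above can be illustrated without any Hive types. The sketch below is a toy model (plain JDK collections, hypothetical class and method names, not Hive's `ColumnVector` API): when the reused output buffer is not cleared between batches, it retains every previous element and its backing storage keeps growing.

```java
import java.util.ArrayList;
import java.util.List;

// Not Hive code: a toy "output vector" reused across batches, showing why a
// missing per-batch reset makes the buffer retain stale rows and keep growing.
public class MissingResetSketch {
    private final List<Integer> output = new ArrayList<>();

    // mimics an evaluator writing one batch into a reused output buffer;
    // 'reset' models the fix of clearing the output vector per batch
    int evaluateBatch(int[] batch, boolean reset) {
        if (reset) {
            output.clear(); // the fix: reset before writing the new batch
        }
        for (int v : batch) {
            output.add(v);
        }
        return output.size(); // elements downstream operators now have to process
    }

    public static void main(String[] args) {
        int[] batch = {1, 2, 3};

        MissingResetSketch buggy = new MissingResetSketch();
        buggy.evaluateBatch(batch, false);
        System.out.println(buggy.evaluateBatch(batch, false)); // 6: stale rows retained

        MissingResetSketch fixed = new MissingResetSketch();
        fixed.evaluateBatch(batch, true);
        System.out.println(fixed.evaluateBatch(batch, true)); // 3: per-batch size only
    }
}
```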

> Slow filter evaluation of nest struct fields in vectorized executions
> -
>
> Key: HIVE-25874
> URL: https://issues.apache.org/jira/browse/HIVE-25874
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> time is spent at resizing vectors around 
> [here|https://github.com/apache/hive/blob/200c0bf1feb259f4d95bf065a2ab38fe684383da/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/ColumnVector.java#L252]
>  or in some other "ensureSize" method
> {code:java}
> create table t as
> select
> named_struct('id',13,'str','string','nest',named_struct('id',12,'str','string','arr',array('value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value')))
> s;
> -- go up to 1M rows
> insert into table t select * from t union all select * from t union all 
> select * from t union all select * from t union all select * from t union all 
> select * from t union all select * from t union all select * from t union all 
> select * from t;
> insert into table t select * from t union all select * from t union all 
> select * from t union all select * from t union all select * from t union all 
> select * from t union all select * from t union all select * from t union all 
> select * from t;
> insert into table t select * from t union all select * from t union all 
> select * from t union all select * from t union all select * from t union all 
> select * from t union all select * from t union all select * from t union all 
> select * from t;
> insert into table t select * from t union all select * from t union all 
> select * from t union all select * from t union all select * from t union all 
> select * from t union all select * from t union all select * from t union all 
> select * from t;
> insert into table t select * from t union all select * from t union all 
> select * from t union all select * from t union all select * from t union all 
> select * from t union all select * from t union all select * from t union all 
> select * from t;
> -- insert into table t select * from t union all select * from t union all 
> select * from t union all select * from t union all select * from t union all 
> select * from t union all select * from t union all select * from t union all 
> select * from t;
> set hive.fetch.task.conversion=none;
> select count(1) from t;
> --explain
> select s
> .id from t
> where 
> s
> .nest
> .id  > 0;
>  {code}
> interestingly; the issue is not present:
> * for a query not looking into the nested struct
> * and in case the struct with the array is at the top level
> {code}
> select count(1) from t;
> --explain
> select s
> .id from t
> where 
> s
> -- .nest
> .id  > 0;
> {code}





[jira] [Updated] (HIVE-25874) Slow filter evaluation of nest struct fields in vectorized executions

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-25874:
--
Labels: pull-request-available  (was: )

> Slow filter evaluation of nest struct fields in vectorized executions
> -
>
> Key: HIVE-25874
> URL: https://issues.apache.org/jira/browse/HIVE-25874
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>





[jira] [Work logged] (HIVE-25874) Slow filter evaluation of nest struct fields in vectorized executions

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25874?focusedWorklogId=710456&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710456
 ]

ASF GitHub Bot logged work on HIVE-25874:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 12:31
Start Date: 18/Jan/22 12:31
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk opened a new pull request #2952:
URL: https://github.com/apache/hive/pull/2952


   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   




Issue Time Tracking
---

Worklog Id: (was: 710456)
Remaining Estimate: 0h
Time Spent: 10m

> Slow filter evaluation of nest struct fields in vectorized executions
> -
>
> Key: HIVE-25874
> URL: https://issues.apache.org/jira/browse/HIVE-25874
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>





[jira] [Assigned] (HIVE-25874) Slow filter evaluation of nest struct fields in vectorized executions

2022-01-18 Thread Zoltan Haindrich (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Haindrich reassigned HIVE-25874:
---

Assignee: Zoltan Haindrich

> Slow filter evaluation of nest struct fields in vectorized executions
> -
>
> Key: HIVE-25874
> URL: https://issues.apache.org/jira/browse/HIVE-25874
> Project: Hive
>  Issue Type: Improvement
>Reporter: Zoltan Haindrich
>Assignee: Zoltan Haindrich
>Priority: Major
>





[jira] [Work logged] (HIVE-25871) Hive should set name mapping table property for migrated Iceberg tables

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25871?focusedWorklogId=710433&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710433
 ]

ASF GitHub Bot logged work on HIVE-25871:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 11:47
Start Date: 18/Jan/22 11:47
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on a change in pull request #2948:
URL: https://github.com/apache/hive/pull/2948#discussion_r786672462



##
File path: 
iceberg/iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergMigration.java
##
@@ -225,12 +226,15 @@ private void validateSd(Table hmsTable, String format) {
   private void validateTblProps(Table hmsTable, boolean migrationSucceeded) {
     String migratedProp = hmsTable.getParameters().get(HiveIcebergMetaHook.MIGRATED_TO_ICEBERG);
     String tableTypeProp = hmsTable.getParameters().get(BaseMetastoreTableOperations.TABLE_TYPE_PROP);
+    String nameMappingProp = hmsTable.getParameters().get(TableProperties.DEFAULT_NAME_MAPPING);
     if (migrationSucceeded) {
       Assert.assertTrue(Boolean.parseBoolean(migratedProp));
       Assert.assertEquals(BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.toUpperCase(), tableTypeProp);
+      Assert.assertTrue(nameMappingProp != null && !nameMappingProp.isEmpty());
     } else {
       Assert.assertNull(migratedProp);
       Assert.assertNotEquals(BaseMetastoreTableOperations.ICEBERG_TABLE_TYPE_VALUE.toUpperCase(), tableTypeProp);
+      Assert.assertTrue(nameMappingProp == null);

Review comment:
   nit: could use Assert.assertNull here






Issue Time Tracking
---

Worklog Id: (was: 710433)
Time Spent: 20m  (was: 10m)

> Hive should set name mapping table property for migrated Iceberg tables
> ---
>
> Key: HIVE-25871
> URL: https://issues.apache.org/jira/browse/HIVE-25871
> Project: Hive
>  Issue Type: Bug
>Reporter: Zoltán Borók-Nagy
>Assignee: Zoltán Borók-Nagy
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Hive should  set the name-mapping table property during table migration.
> It would be useful for [column 
> projection|https://iceberg.apache.org/#spec/#column-projection] for files 
> without field ids.





[jira] [Work logged] (HIVE-25266) Fix TestWarehouseExternalDir

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25266?focusedWorklogId=710420&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710420
 ]

ASF GitHub Bot logged work on HIVE-25266:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 11:25
Start Date: 18/Jan/22 11:25
Worklog Time Spent: 10m 
  Work Description: mbathori-cloudera opened a new pull request #2951:
URL: https://github.com/apache/hive/pull/2951


   ### What changes were proposed in this pull request?
   - moved the initialization from the constructor to `BeforeClass`
   - moved opening and closing the database connection into the `setup`/`teardown` methods
   - removed unnecessary logging
   - swapped the deprecated warehouse default database location constant
   
   ### Why are the changes needed?
   Fix the flakiness of `TestWarehouseExternalDir` tests.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Run the hive-flaky-check job: 
http://ci.hive.apache.org/job/hive-flaky-check/509
   The tests were executed successfully 87 times; the interrupting error seems 
to be environment-related and not caused by the test.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: gitbox-unsubscr...@hive.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 710420)
Remaining Estimate: 0h
Time Spent: 10m

> Fix TestWarehouseExternalDir
> 
>
> Key: HIVE-25266
> URL: https://issues.apache.org/jira/browse/HIVE-25266
> Project: Hive
>  Issue Type: Bug
>Reporter: Zoltan Haindrich
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> test is unstable 
> http://ci.hive.apache.org/job/hive-flaky-check/244/





[jira] [Updated] (HIVE-25266) Fix TestWarehouseExternalDir

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-25266:
--
Labels: pull-request-available  (was: )

> Fix TestWarehouseExternalDir
> 
>
> Key: HIVE-25266
> URL: https://issues.apache.org/jira/browse/HIVE-25266
> Project: Hive
>  Issue Type: Bug
>Reporter: Zoltan Haindrich
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> test is unstable 
> http://ci.hive.apache.org/job/hive-flaky-check/244/





[jira] [Updated] (HIVE-25874) Slow filter evaluation of nest struct fields in vectorized executions

2022-01-18 Thread Zoltan Haindrich (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Haindrich updated HIVE-25874:

Description: 
Time is spent resizing vectors around [here|https://github.com/apache/hive/blob/200c0bf1feb259f4d95bf065a2ab38fe684383da/storage-api/src/java/org/apache/hadoop/hive/ql/exec/vector/ColumnVector.java#L252] or in some other "ensureSize" method.

{code:java}
create table t as
select
named_struct('id',13,'str','string','nest',named_struct('id',12,'str','string','arr',array('value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value','value')))
s;

-- go up to 1M rows
insert into table t select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t;
insert into table t select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t;
insert into table t select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t;
insert into table t select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t;
insert into table t select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t;
-- insert into table t select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t union all select * from t;

set hive.fetch.task.conversion=none;

select count(1) from t;
--explain
select s.id from t
where s
.nest
.id > 0;
{code}


Interestingly, the issue is not present:
* for a query that does not look into the nested struct
* when the struct containing the array is at the top level

{code}
select count(1) from t;
--explain
select s.id from t
where s
-- .nest
.id > 0;
{code}
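
The cost pattern behind this can be sketched outside Hive. The sketch below is an illustrative model only (not Hive's actual ColumnVector code): if each "ensureSize"-style call grows the backing array to exactly the requested size, the total number of copied elements is quadratic in the row count, while capacity doubling keeps it linear.

```java
// Illustrative cost model: counts how many elements get copied while
// appending n items under two growth policies.
public class ResizeCost {
    // Grow capacity to exactly the requested size on every append:
    // every append copies the whole array, so total copies are O(n^2).
    static long growExact(int n) {
        long copied = 0;
        int cap = 0;
        for (int i = 0; i < n; i++) {
            copied += cap; // copy existing elements into the new array
            cap = i + 1;
        }
        return copied;
    }

    // Double the capacity whenever it is exceeded: total copies stay O(n).
    static long growDoubling(int n) {
        long copied = 0;
        int cap = 1;
        for (int size = 1; size <= n; size++) {
            if (size > cap) {
                copied += cap;
                cap *= 2;
            }
        }
        return copied;
    }

    public static void main(String[] args) {
        int n = 1 << 20; // ~1M rows, as in the repro above
        System.out.println("grow-exact copies:    " + growExact(n));
        System.out.println("grow-doubling copies: " + growDoubling(n));
    }
}
```

For ~1M rows the exact-growth policy copies hundreds of billions of elements versus about a million for doubling, which is consistent with the profile pointing at vector resizing.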


[jira] [Work logged] (HIVE-24805) Compactor: Initiator shouldn't fetch table details again and again for partitioned tables

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24805?focusedWorklogId=710329=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710329
 ]

ASF GitHub Bot logged work on HIVE-24805:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 08:41
Start Date: 18/Jan/22 08:41
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2906:
URL: https://github.com/apache/hive/pull/2906#discussion_r786521394



##
File path: 
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java
##
@@ -589,6 +589,9 @@ public static ConfVars getMetaConf(String name) {
 "in which a warning is being raised if multiple worker version are 
detected.\n" +
 "The setting has no effect if the metastore.metrics.enabled is 
disabled \n" +
 "or the metastore.acidmetrics.thread.on is turned off."),
+
COMPACTOR_METADATA_CACHE_TIMEOUT("metastore.compactor.metadata.cache.timeout",
+  "hive.metastore.compactor.metadata.cache.timeout", 60, TimeUnit.SECONDS,

Review comment:
   Might be too small. If a query runs for more than a minute, then we will 
not cross-cache between the Initiator and the Cleaner.
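
The trade-off under discussion can be sketched with a minimal time-bounded cache. The class and method names below are hypothetical, not Hive's actual compactor code: entries older than the TTL are reloaded, so the Initiator and Cleaner only share a lookup if both run within the window.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Illustrative sketch of a TTL cache for table/partition metadata.
public class TtlCache<K, V> {
    private record Entry<W>(W value, long loadedAt) {}

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public TtlCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

    // Reload when the entry is older than the TTL; a TTL of zero therefore
    // always reloads, matching "setting it to zero disables the feature".
    public V get(K key, Supplier<V> loader) {
        long now = System.currentTimeMillis();
        Entry<V> e = map.get(key);
        if (e == null || now - e.loadedAt() >= ttlMillis) {
            e = new Entry<>(loader.get(), now);
            map.put(key, e);
        }
        return e.value();
    }

    public static void main(String[] args) {
        TtlCache<String, String> cache = new TtlCache<>(60_000);
        // First call hits the "metastore"; the second call within the TTL is
        // served from the cache, so the loader is not invoked again.
        System.out.println(cache.get("db.tbl", () -> "table details (fetched)"));
        System.out.println(cache.get("db.tbl", () -> "table details (fetched again)"));
    }
}
```

The reviewer's point is then visible directly: with a 60-second TTL, any two compactor threads more than a minute apart each pay a full metastore fetch.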






Issue Time Tracking
---

Worklog Id: (was: 710329)
Time Spent: 2h 50m  (was: 2h 40m)

> Compactor: Initiator shouldn't fetch table details again and again for 
> partitioned tables
> -
>
> Key: HIVE-24805
> URL: https://issues.apache.org/jira/browse/HIVE-24805
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Reporter: Rajesh Balamohan
>Assignee: Antal Sinkovits
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> The Initiator shouldn't fetch table details for each of a table's partitions. 
> When there is a large number of databases/tables, it takes a lot of time for 
> the Initiator to complete its initial iteration, and the load on the DB also goes up.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java#L129
> https://github.com/apache/hive/blob/64bb52316f19426ebea0087ee15e282cbde1d852/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java#L456
> For all the following partitions, table details would be the same. However, 
> it ends up fetching table details from HMS again and again.
> {noformat}
> 2021-02-22 08:13:16,106 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2451899
> 2021-02-22 08:13:16,124 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2451830
> 2021-02-22 08:13:16,140 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2452586
> 2021-02-22 08:13:16,149 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2452698
> 2021-02-22 08:13:16,158 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2452063
> {noformat}





[jira] [Work logged] (HIVE-24805) Compactor: Initiator shouldn't fetch table details again and again for partitioned tables

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24805?focusedWorklogId=710326=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710326
 ]

ASF GitHub Bot logged work on HIVE-24805:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 08:39
Start Date: 18/Jan/22 08:39
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2906:
URL: https://github.com/apache/hive/pull/2906#discussion_r786519996



##
File path: 
standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/conf/MetastoreConf.java
##
@@ -589,6 +589,9 @@ public static ConfVars getMetaConf(String name) {
 "in which a warning is being raised if multiple worker version are 
detected.\n" +
 "The setting has no effect if the metastore.metrics.enabled is 
disabled \n" +
 "or the metastore.acidmetrics.thread.on is turned off."),
+
COMPACTOR_METADATA_CACHE_TIMEOUT("metastore.compactor.metadata.cache.timeout",
+  "hive.metastore.compactor.metadata.cache.timeout", 60, TimeUnit.SECONDS,
+  "Number of seconds the table/partition metadata are cached by the 
compactor. Setting it to zero disables the feature."),

Review comment:
   This is only at the table level.






Issue Time Tracking
---

Worklog Id: (was: 710326)
Time Spent: 2h 40m  (was: 2.5h)

> Compactor: Initiator shouldn't fetch table details again and again for 
> partitioned tables
> -
>
> Key: HIVE-24805
> URL: https://issues.apache.org/jira/browse/HIVE-24805
> Project: Hive
>  Issue Type: Improvement
>  Components: Transactions
>Reporter: Rajesh Balamohan
>Assignee: Antal Sinkovits
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> The Initiator shouldn't fetch table details for each of a table's partitions. 
> When there is a large number of databases/tables, it takes a lot of time for 
> the Initiator to complete its initial iteration, and the load on the DB also goes up.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java#L129
> https://github.com/apache/hive/blob/64bb52316f19426ebea0087ee15e282cbde1d852/ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java#L456
> For all the following partitions, table details would be the same. However, 
> it ends up fetching table details from HMS again and again.
> {noformat}
> 2021-02-22 08:13:16,106 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2451899
> 2021-02-22 08:13:16,124 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2451830
> 2021-02-22 08:13:16,140 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2452586
> 2021-02-22 08:13:16,149 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2452698
> 2021-02-22 08:13:16,158 INFO  
> org.apache.hadoop.hive.ql.txn.compactor.Initiator: [Thread-11]: Checking to 
> see if we should compact 
> tpcds_bin_partitioned_orc_1000.store_returns_tmp2.sr_returned_date_sk=2452063
> {noformat}





[jira] [Work logged] (HIVE-25842) Reimplement delta file metric collection

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25842?focusedWorklogId=710319=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710319
 ]

ASF GitHub Bot logged work on HIVE-25842:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 08:24
Start Date: 18/Jan/22 08:24
Worklog Time Spent: 10m 
  Work Description: deniskuzZ commented on a change in pull request #2916:
URL: https://github.com/apache/hive/pull/2916#discussion_r786508087



##
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Worker.java
##
@@ -671,6 +679,13 @@ private String getWorkerId() {
 return name.toString();
   }
 
+  private void updateDeltaFilesMetrics(AcidDirectory directory, String dbName, 
String tableName, String partName,

Review comment:
   Move this to a common place if possible. updateDeltaFilesMetrics could be 
triggered by multiple threads, leading to resource starvation; think about 
introducing a connection pool.






Issue Time Tracking
---

Worklog Id: (was: 710319)
Time Spent: 2h  (was: 1h 50m)

> Reimplement delta file metric collection
> 
>
> Key: HIVE-25842
> URL: https://issues.apache.org/jira/browse/HIVE-25842
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> FUNCTIONALITY: Metrics are collected only when a Tez query runs on a table 
> (select * and select count( * ) don't update the metrics).
> Metrics aren't updated after compaction or cleaning after compaction, so 
> users will probably see "issues" with compaction (like many active or 
> obsolete or small deltas) that don't exist.
> RISK: Metrics are collected during queries – we tried to put a try-catch 
> around each method in DeltaFilesMetricsReporter but of course this isn't 
> foolproof. This is a HUGE performance and functionality liability. Tests 
> caught some issues, but our tests aren't perfect.





[jira] [Work logged] (HIVE-25842) Reimplement delta file metric collection

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25842?focusedWorklogId=710317=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710317
 ]

ASF GitHub Bot logged work on HIVE-25842:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 08:21
Start Date: 18/Jan/22 08:21
Worklog Time Spent: 10m 
  Work Description: deniskuzZ commented on a change in pull request #2916:
URL: https://github.com/apache/hive/pull/2916#discussion_r786506022



##
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java
##
@@ -381,17 +397,19 @@ public CompactionType run() throws Exception {
 }
   }
 
-  private CompactionType determineCompactionType(CompactionInfo ci, 
ValidWriteIdList writeIds,
- StorageDescriptor sd, 
Map tblproperties)
+  private AcidDirectory getAcidDirectory(StorageDescriptor sd,ValidWriteIdList 
writeIds) throws IOException {

Review comment:
   space






Issue Time Tracking
---

Worklog Id: (was: 710317)
Time Spent: 1h 50m  (was: 1h 40m)

> Reimplement delta file metric collection
> 
>
> Key: HIVE-25842
> URL: https://issues.apache.org/jira/browse/HIVE-25842
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> FUNCTIONALITY: Metrics are collected only when a Tez query runs on a table 
> (select * and select count( * ) don't update the metrics).
> Metrics aren't updated after compaction or cleaning after compaction, so 
> users will probably see "issues" with compaction (like many active or 
> obsolete or small deltas) that don't exist.
> RISK: Metrics are collected during queries – we tried to put a try-catch 
> around each method in DeltaFilesMetricsReporter but of course this isn't 
> foolproof. This is a HUGE performance and functionality liability. Tests 
> caught some issues, but our tests aren't perfect.





[jira] [Work logged] (HIVE-25842) Reimplement delta file metric collection

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25842?focusedWorklogId=710316=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710316
 ]

ASF GitHub Bot logged work on HIVE-25842:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 08:18
Start Date: 18/Jan/22 08:18
Worklog Time Spent: 10m 
  Work Description: deniskuzZ commented on a change in pull request #2916:
URL: https://github.com/apache/hive/pull/2916#discussion_r786503651



##
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java
##
@@ -331,6 +339,12 @@ private boolean 
foundCurrentOrFailedCompactions(ShowCompactResponse compactions,
 }
 return false;
   }
+  
+  private void updateDeltaFilesMetrics(AcidDirectory directory, String dbName, 
String tableName, String partName) {

Review comment:
   I think this method could be extracted to a common place, such as 
MetaStoreCompactorThread, to be used by both the Cleaner and the Initiator.






Issue Time Tracking
---

Worklog Id: (was: 710316)
Time Spent: 1h 40m  (was: 1.5h)

> Reimplement delta file metric collection
> 
>
> Key: HIVE-25842
> URL: https://issues.apache.org/jira/browse/HIVE-25842
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> FUNCTIONALITY: Metrics are collected only when a Tez query runs on a table 
> (select * and select count( * ) don't update the metrics).
> Metrics aren't updated after compaction or cleaning after compaction, so 
> users will probably see "issues" with compaction (like many active or 
> obsolete or small deltas) that don't exist.
> RISK: Metrics are collected during queries – we tried to put a try-catch 
> around each method in DeltaFilesMetricsReporter but of course this isn't 
> foolproof. This is a HUGE performance and functionality liability. Tests 
> caught some issues, but our tests aren't perfect.





[jira] [Work logged] (HIVE-25842) Reimplement delta file metric collection

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25842?focusedWorklogId=710315=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710315
 ]

ASF GitHub Bot logged work on HIVE-25842:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 08:16
Start Date: 18/Jan/22 08:16
Worklog Time Spent: 10m 
  Work Description: deniskuzZ commented on a change in pull request #2916:
URL: https://github.com/apache/hive/pull/2916#discussion_r786502218



##
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Initiator.java
##
@@ -275,6 +276,13 @@ public void init(AtomicBoolean stop) throws Exception {
 compactionExecutor = CompactorUtil.createExecutorWithThreadFactory(
 conf.getIntVar(HiveConf.ConfVars.HIVE_COMPACTOR_REQUEST_QUEUE),
 COMPACTOR_INTIATOR_THREAD_NAME_FORMAT);
+metricsEnabled = MetastoreConf.getBoolVar(conf, 
MetastoreConf.ConfVars.METRICS_ENABLED) &&

Review comment:
   Same as above; the question is whether we need to support runtime changes 
to the metrics configs.






Issue Time Tracking
---

Worklog Id: (was: 710315)
Time Spent: 1.5h  (was: 1h 20m)

> Reimplement delta file metric collection
> 
>
> Key: HIVE-25842
> URL: https://issues.apache.org/jira/browse/HIVE-25842
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> FUNCTIONALITY: Metrics are collected only when a Tez query runs on a table 
> (select * and select count( * ) don't update the metrics).
> Metrics aren't updated after compaction or cleaning after compaction, so 
> users will probably see "issues" with compaction (like many active or 
> obsolete or small deltas) that don't exist.
> RISK: Metrics are collected during queries – we tried to put a try-catch 
> around each method in DeltaFilesMetricsReporter but of course this isn't 
> foolproof. This is a HUGE performance and functionality liability. Tests 
> caught some issues, but our tests aren't perfect.





[jira] [Work logged] (HIVE-25842) Reimplement delta file metric collection

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25842?focusedWorklogId=710312=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710312
 ]

ASF GitHub Bot logged work on HIVE-25842:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 08:14
Start Date: 18/Jan/22 08:14
Worklog Time Spent: 10m 
  Work Description: deniskuzZ commented on a change in pull request #2916:
URL: https://github.com/apache/hive/pull/2916#discussion_r786500771



##
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##
@@ -396,7 +423,9 @@ private boolean removeFiles(String location, 
ValidWriteIdList writeIdList, Compa
 }
 StringBuilder extraDebugInfo = new 
StringBuilder("[").append(obsoleteDirs.stream()
 .map(Path::getName).collect(Collectors.joining(",")));
-return remove(location, ci, obsoleteDirs, true, fs, extraDebugInfo);
+boolean success = remove(location, ci, obsoleteDirs, true, fs, 
extraDebugInfo);
+updateDeltaFilesMetrics(ci.dbname, ci.tableName, ci.partName, 
dir.getObsolete().size());

Review comment:
   There are a few remove methods; see line #338 (soft-drop partition). 
updateDeltaFilesMetrics should be called there as well, or if possible we 
should move it inside remove at line #431.






Issue Time Tracking
---

Worklog Id: (was: 710312)
Time Spent: 1h 20m  (was: 1h 10m)

> Reimplement delta file metric collection
> 
>
> Key: HIVE-25842
> URL: https://issues.apache.org/jira/browse/HIVE-25842
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> FUNCTIONALITY: Metrics are collected only when a Tez query runs on a table 
> (select * and select count( * ) don't update the metrics).
> Metrics aren't updated after compaction or cleaning after compaction, so 
> users will probably see "issues" with compaction (like many active or 
> obsolete or small deltas) that don't exist.
> RISK: Metrics are collected during queries – we tried to put a try-catch 
> around each method in DeltaFilesMetricsReporter but of course this isn't 
> foolproof. This is a HUGE performance and functionality liability. Tests 
> caught some issues, but our tests aren't perfect.





[jira] [Work logged] (HIVE-25842) Reimplement delta file metric collection

2022-01-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-25842?focusedWorklogId=710308=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-710308
 ]

ASF GitHub Bot logged work on HIVE-25842:
-

Author: ASF GitHub Bot
Created on: 18/Jan/22 08:06
Start Date: 18/Jan/22 08:06
Worklog Time Spent: 10m 
  Work Description: deniskuzZ commented on a change in pull request #2916:
URL: https://github.com/apache/hive/pull/2916#discussion_r786495219



##
File path: ql/src/java/org/apache/hadoop/hive/ql/txn/compactor/Cleaner.java
##
@@ -87,14 +106,15 @@ public void init(AtomicBoolean stop) throws Exception {
 cleanerExecutor = CompactorUtil.createExecutorWithThreadFactory(
 
conf.getIntVar(HiveConf.ConfVars.HIVE_COMPACTOR_CLEANER_THREADS_NUM),
 COMPACTOR_CLEANER_THREAD_NAME_FORMAT);
+metricsEnabled = MetastoreConf.getBoolVar(conf, 
MetastoreConf.ConfVars.METRICS_ENABLED) &&

Review comment:
   With that change you won't be able to enable/disable metrics collection 
at runtime; is that intentional?
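
The difference being pointed out can be sketched as follows. This is illustrative only (the names are hypothetical, and `liveConf` stands in for the live HiveConf/MetastoreConf values): a flag snapshotted in init() never sees later changes, while a per-cycle read does.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class ConfigToggle {
    // Stand-in for the live configuration value.
    static final AtomicBoolean liveConf = new AtomicBoolean(true);

    // Pattern in the patch: the flag is snapshotted once in init() ...
    static boolean initSnapshot;
    static void init() { initSnapshot = liveConf.get(); }

    // ... versus re-reading the config on every cleaner/initiator cycle,
    // which would honor runtime changes.
    static boolean perCycle() { return liveConf.get(); }

    public static void main(String[] args) {
        init();
        liveConf.set(false); // operator disables metrics at runtime
        System.out.println("snapshot sees:  " + initSnapshot); // still true
        System.out.println("per-cycle sees: " + perCycle());   // false
    }
}
```

Reading the flag once is cheaper and simpler; re-reading per cycle keeps the config toggleable without a service restart.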






Issue Time Tracking
---

Worklog Id: (was: 710308)
Time Spent: 1h 10m  (was: 1h)

> Reimplement delta file metric collection
> 
>
> Key: HIVE-25842
> URL: https://issues.apache.org/jira/browse/HIVE-25842
> Project: Hive
>  Issue Type: Improvement
>Reporter: László Pintér
>Assignee: László Pintér
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> FUNCTIONALITY: Metrics are collected only when a Tez query runs a table 
> (select * and select count( * ) don't update the metrics)
> Metrics aren't updated after compaction or cleaning after compaction, so 
> users will probably see "issues" with compaction (like many active or 
> obsolete or small deltas) that don't exist.
> RISK: Metrics are collected during queries – we tried to put a try-catch 
> around each method in DeltaFilesMetricsReporter but of course this isn't 
> foolproof. This is a HUGE performance and functionality liability. Tests 
> caught some issues, but our tests aren't perfect.


