Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-23 Thread via GitHub


ramitg254 commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3790293809

   Thanks! @dengzhhu653 @kasakrisz @saihemanth-cloudera @deniskuzZ for review


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-22 Thread via GitHub


dengzhhu653 merged PR #6089:
URL: https://github.com/apache/hive/pull/6089


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-22 Thread via GitHub


dengzhhu653 commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3788242767

   Filed the HIVE-29416 to track the optimization.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-21 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3777259188

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [65 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [1.8% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-21 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2711755061


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1957,307 +2010,224 @@ private List 
aggrStatsUseJava(String catName, String dbName
 areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
   }
 
-  private List aggrStatsUseDB(String catName, String 
dbName,
-  String tableName, List partNames, List colNames, String 
engine,
-  boolean areAllPartsFound, boolean useDensityFunctionForNDVEstimation, 
double ndvTuner) throws MetaException {
+  private List aggrStatsUseDB(String catName, String 
dbName, String tableName,
+   List partNames, 
List colNames, String engine,
+   boolean 
useDensityFunctionForNDVEstimation, double ndvTuner)
+  throws MetaException {
 // TODO: all the extrapolation logic should be moved out of this class,
 // only mechanical data retrieval should remain here.
-String commonPrefix = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
-+ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), 
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
-+ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), 
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
-+ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
-+ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), "
-// The following data is used to compute a partitioned table's NDV 
based
-// on partitions' NDV when useDensityFunctionForNDVEstimation = true. 
Global NDVs cannot be
-// accurately derived from partition NDVs, because the domain of 
column value two partitions
-// can overlap. If there is no overlap then global NDV is just the sum
-// of partition NDVs (UpperBound). But if there is some overlay then
-// global NDV can be anywhere between sum of partition NDVs (no 
overlap)
-// and same as one of the partition NDV (domain of column value in all 
other
-// partitions is subset of the domain value in one of the partition)
-// (LowerBound).But under uniform distribution, we can roughly 
estimate the global
-// NDV by leveraging the min/max values.
-// And, we also guarantee that the estimation makes sense by comparing 
it to the
-// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
-// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
-+ 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? ";
-String queryText = null;
-long start = 0;
-long end = 0;
+String queryText = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
++ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), 
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
++ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), 
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
++ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
++ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), "
+// The following data is used to compute a partitioned table's NDV 
based
+// on partitions' NDV when useDensityFunctionForNDVEstimation = 
true. Global NDVs cannot be
+// accurately derived from partition NDVs, because the domain of 
column value two partitions
+// can overlap. If there is no overlap then global NDV is just the 
sum
+// of partition NDVs (UpperBound). But if there is some overlay 
then
+// global NDV can be anywhere between sum of partition NDVs (no 
overlap)
+// and same as one of the partition NDV (domain of column value in 
all other
+// partitions is subset of the domain value in one of the 
partition)
+// (LowerBound).But under uniform distribution, we can roughly 
estimate the global
+// NDV by leveraging the min/max values.
+// And, we also guarantee that the estimation makes sense by 
comparing it to the
+// UpperBound (c

Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-21 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3776771237

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [71 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [1.8% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-20 Thread via GitHub


dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2710878500


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1957,307 +2010,224 @@ private List 
aggrStatsUseJava(String catName, String dbName
 areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
   }
 
-  private List aggrStatsUseDB(String catName, String 
dbName,
-  String tableName, List partNames, List colNames, String 
engine,
-  boolean areAllPartsFound, boolean useDensityFunctionForNDVEstimation, 
double ndvTuner) throws MetaException {
+  private List aggrStatsUseDB(String catName, String 
dbName, String tableName,
+   List partNames, 
List colNames, String engine,
+   boolean 
useDensityFunctionForNDVEstimation, double ndvTuner)
+  throws MetaException {
 // TODO: all the extrapolation logic should be moved out of this class,
 // only mechanical data retrieval should remain here.
-String commonPrefix = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
-+ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), 
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
-+ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), 
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
-+ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
-+ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), "
-// The following data is used to compute a partitioned table's NDV 
based
-// on partitions' NDV when useDensityFunctionForNDVEstimation = true. 
Global NDVs cannot be
-// accurately derived from partition NDVs, because the domain of 
column value two partitions
-// can overlap. If there is no overlap then global NDV is just the sum
-// of partition NDVs (UpperBound). But if there is some overlay then
-// global NDV can be anywhere between sum of partition NDVs (no 
overlap)
-// and same as one of the partition NDV (domain of column value in all 
other
-// partitions is subset of the domain value in one of the partition)
-// (LowerBound).But under uniform distribution, we can roughly 
estimate the global
-// NDV by leveraging the min/max values.
-// And, we also guarantee that the estimation makes sense by comparing 
it to the
-// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
-// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
-+ 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? ";
-String queryText = null;
-long start = 0;
-long end = 0;
+String queryText = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
++ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), 
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
++ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), 
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
++ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
++ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), "
+// The following data is used to compute a partitioned table's NDV 
based
+// on partitions' NDV when useDensityFunctionForNDVEstimation = 
true. Global NDVs cannot be
+// accurately derived from partition NDVs, because the domain of 
column value two partitions
+// can overlap. If there is no overlap then global NDV is just the 
sum
+// of partition NDVs (UpperBound). But if there is some overlay 
then
+// global NDV can be anywhere between sum of partition NDVs (no 
overlap)
+// and same as one of the partition NDV (domain of column value in 
all other
+// partitions is subset of the domain value in one of the 
partition)
+// (LowerBound).But under uniform distribution, we can roughly 
estimate the global
+// NDV by leveraging the min/max values.
+// And, we also guarantee that the estimation makes sense by 
comparing it to the
+// UpperBound 

Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-20 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3774543723

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [413 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [1.8% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-20 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2709521876


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1884,14 +1904,8 @@ private List 
columnStatisticsObjForPartitions(
 return Batchable.runBatched(batchSize, colNames, new Batchable() {

Review Comment:
   done moved to `DirectSqlAggrStats.java`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-20 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2709518413


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1957,307 +2010,224 @@ private List 
aggrStatsUseJava(String catName, String dbName
 areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
   }
 
-  private List aggrStatsUseDB(String catName, String 
dbName,
-  String tableName, List partNames, List colNames, String 
engine,
-  boolean areAllPartsFound, boolean useDensityFunctionForNDVEstimation, 
double ndvTuner) throws MetaException {
+  private List aggrStatsUseDB(String catName, String 
dbName, String tableName,
+   List partNames, 
List colNames, String engine,
+   boolean 
useDensityFunctionForNDVEstimation, double ndvTuner)
+  throws MetaException {
 // TODO: all the extrapolation logic should be moved out of this class,
 // only mechanical data retrieval should remain here.
-String commonPrefix = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
-+ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), 
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
-+ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), 
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
-+ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
-+ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), "
-// The following data is used to compute a partitioned table's NDV 
based
-// on partitions' NDV when useDensityFunctionForNDVEstimation = true. 
Global NDVs cannot be
-// accurately derived from partition NDVs, because the domain of 
column value two partitions
-// can overlap. If there is no overlap then global NDV is just the sum
-// of partition NDVs (UpperBound). But if there is some overlay then
-// global NDV can be anywhere between sum of partition NDVs (no 
overlap)
-// and same as one of the partition NDV (domain of column value in all 
other
-// partitions is subset of the domain value in one of the partition)
-// (LowerBound).But under uniform distribution, we can roughly 
estimate the global
-// NDV by leveraging the min/max values.
-// And, we also guarantee that the estimation makes sense by comparing 
it to the
-// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
-// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
-+ 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? ";
-String queryText = null;
-long start = 0;
-long end = 0;
+String queryText = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
++ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), 
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
++ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), 
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
++ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
++ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), "
+// The following data is used to compute a partitioned table's NDV 
based
+// on partitions' NDV when useDensityFunctionForNDVEstimation = 
true. Global NDVs cannot be
+// accurately derived from partition NDVs, because the domain of 
column value two partitions
+// can overlap. If there is no overlap then global NDV is just the 
sum
+// of partition NDVs (UpperBound). But if there is some overlay 
then
+// global NDV can be anywhere between sum of partition NDVs (no 
overlap)
+// and same as one of the partition NDV (domain of column value in 
all other
+// partitions is subset of the domain value in one of the 
partition)
+// (LowerBound).But under uniform distribution, we can roughly 
estimate the global
+// NDV by leveraging the min/max values.
+// And, we also guarantee that the estimation makes sense by 
comparing it to the
+// UpperBound (c

Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-20 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2708622327


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1957,307 +2010,224 @@ private List 
aggrStatsUseJava(String catName, String dbName
 areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
   }
 
-  private List aggrStatsUseDB(String catName, String 
dbName,
-  String tableName, List partNames, List colNames, String 
engine,
-  boolean areAllPartsFound, boolean useDensityFunctionForNDVEstimation, 
double ndvTuner) throws MetaException {
+  private List aggrStatsUseDB(String catName, String 
dbName, String tableName,
+   List partNames, 
List colNames, String engine,
+   boolean 
useDensityFunctionForNDVEstimation, double ndvTuner)
+  throws MetaException {
 // TODO: all the extrapolation logic should be moved out of this class,
 // only mechanical data retrieval should remain here.
-String commonPrefix = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
-+ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), 
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
-+ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), 
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
-+ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
-+ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), "
-// The following data is used to compute a partitioned table's NDV 
based
-// on partitions' NDV when useDensityFunctionForNDVEstimation = true. 
Global NDVs cannot be
-// accurately derived from partition NDVs, because the domain of 
column value two partitions
-// can overlap. If there is no overlap then global NDV is just the sum
-// of partition NDVs (UpperBound). But if there is some overlay then
-// global NDV can be anywhere between sum of partition NDVs (no 
overlap)
-// and same as one of the partition NDV (domain of column value in all 
other
-// partitions is subset of the domain value in one of the partition)
-// (LowerBound).But under uniform distribution, we can roughly 
estimate the global
-// NDV by leveraging the min/max values.
-// And, we also guarantee that the estimation makes sense by comparing 
it to the
-// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
-// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
-+ 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? ";
-String queryText = null;
-long start = 0;
-long end = 0;
+String queryText = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
++ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), 
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
++ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), 
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
++ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
++ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), "
+// The following data is used to compute a partitioned table's NDV 
based
+// on partitions' NDV when useDensityFunctionForNDVEstimation = 
true. Global NDVs cannot be
+// accurately derived from partition NDVs, because the domain of 
column value two partitions
+// can overlap. If there is no overlap then global NDV is just the 
sum
+// of partition NDVs (UpperBound). But if there is some overlay 
then
+// global NDV can be anywhere between sum of partition NDVs (no 
overlap)
+// and same as one of the partition NDV (domain of column value in 
all other
+// partitions is subset of the domain value in one of the 
partition)
+// (LowerBound).But under uniform distribution, we can roughly 
estimate the global
+// NDV by leveraging the min/max values.
+// And, we also guarantee that the estimation makes sense by 
comparing it to the
+// UpperBound (c

Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-20 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2708622327


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1957,307 +2010,224 @@ private List 
aggrStatsUseJava(String catName, String dbName
 areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
   }
 
-  private List aggrStatsUseDB(String catName, String 
dbName,
-  String tableName, List partNames, List colNames, String 
engine,
-  boolean areAllPartsFound, boolean useDensityFunctionForNDVEstimation, 
double ndvTuner) throws MetaException {
+  private List aggrStatsUseDB(String catName, String 
dbName, String tableName,
+   List partNames, 
List colNames, String engine,
+   boolean 
useDensityFunctionForNDVEstimation, double ndvTuner)
+  throws MetaException {
 // TODO: all the extrapolation logic should be moved out of this class,
 // only mechanical data retrieval should remain here.
-String commonPrefix = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
-+ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), 
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
-+ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), 
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
-+ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
-+ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), "
-// The following data is used to compute a partitioned table's NDV 
based
-// on partitions' NDV when useDensityFunctionForNDVEstimation = true. 
Global NDVs cannot be
-// accurately derived from partition NDVs, because the domain of 
column value two partitions
-// can overlap. If there is no overlap then global NDV is just the sum
-// of partition NDVs (UpperBound). But if there is some overlay then
-// global NDV can be anywhere between sum of partition NDVs (no 
overlap)
-// and same as one of the partition NDV (domain of column value in all 
other
-// partitions is subset of the domain value in one of the partition)
-// (LowerBound).But under uniform distribution, we can roughly 
estimate the global
-// NDV by leveraging the min/max values.
-// And, we also guarantee that the estimation makes sense by comparing 
it to the
-// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
-// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
-+ 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? ";
-String queryText = null;
-long start = 0;
-long end = 0;
+String queryText = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
++ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), 
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
++ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), 
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
++ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
++ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), "
+// The following data is used to compute a partitioned table's NDV 
based
+// on partitions' NDV when useDensityFunctionForNDVEstimation = 
true. Global NDVs cannot be
+// accurately derived from partition NDVs, because the domain of 
column value two partitions
+// can overlap. If there is no overlap then global NDV is just the 
sum
+// of partition NDVs (UpperBound). But if there is some overlay 
then
+// global NDV can be anywhere between sum of partition NDVs (no 
overlap)
+// and same as one of the partition NDV (domain of column value in 
all other
+// partitions is subset of the domain value in one of the 
partition)
+// (LowerBound).But under uniform distribution, we can roughly 
estimate the global
+// NDV by leveraging the min/max values.
+// And, we also guarantee that the estimation makes sense by 
comparing it to the
+// UpperBound (c

Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-20 Thread via GitHub


dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2707224857


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1884,14 +1904,8 @@ private List 
columnStatisticsObjForPartitions(
 return Batchable.runBatched(batchSize, colNames, new Batchable() {

Review Comment:
   could you please move all `aggrColStatsForPartitions` related methods to a 
new class, similar to `DirectSqlUpdatePart`



##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1884,14 +1904,8 @@ private List 
columnStatisticsObjForPartitions(
 return Batchable.runBatched(batchSize, colNames, new Batchable() {

Review Comment:
   Could you please move all `aggrColStatsForPartitions` related methods to a 
new class, similar to `DirectSqlUpdatePart`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-20 Thread via GitHub


dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2707211354


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1957,307 +2010,224 @@ private List 
aggrStatsUseJava(String catName, String dbName
 areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
   }
 
-  private List aggrStatsUseDB(String catName, String 
dbName,
-  String tableName, List partNames, List colNames, String 
engine,
-  boolean areAllPartsFound, boolean useDensityFunctionForNDVEstimation, 
double ndvTuner) throws MetaException {
+  private List aggrStatsUseDB(String catName, String 
dbName, String tableName,
+   List partNames, 
List colNames, String engine,
+   boolean 
useDensityFunctionForNDVEstimation, double ndvTuner)
+  throws MetaException {
 // TODO: all the extrapolation logic should be moved out of this class,
 // only mechanical data retrieval should remain here.
-String commonPrefix = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
-+ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), 
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
-+ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), 
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
-+ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
-+ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), "
-// The following data is used to compute a partitioned table's NDV 
based
-// on partitions' NDV when useDensityFunctionForNDVEstimation = true. 
Global NDVs cannot be
-// accurately derived from partition NDVs, because the domain of 
column value two partitions
-// can overlap. If there is no overlap then global NDV is just the sum
-// of partition NDVs (UpperBound). But if there is some overlay then
-// global NDV can be anywhere between sum of partition NDVs (no 
overlap)
-// and same as one of the partition NDV (domain of column value in all 
other
-// partitions is subset of the domain value in one of the partition)
-// (LowerBound).But under uniform distribution, we can roughly 
estimate the global
-// NDV by leveraging the min/max values.
-// And, we also guarantee that the estimation makes sense by comparing 
it to the
-// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
-// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
-+ 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? ";
-String queryText = null;
-long start = 0;
-long end = 0;
+String queryText = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
++ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), 
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
++ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), 
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
++ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
++ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), "
+// The following data is used to compute a partitioned table's NDV 
based
+// on partitions' NDV when useDensityFunctionForNDVEstimation = 
true. Global NDVs cannot be
+// accurately derived from partition NDVs, because the domain of 
column value two partitions
+// can overlap. If there is no overlap then global NDV is just the 
sum
+// of partition NDVs (UpperBound). But if there is some overlay 
then
+// global NDV can be anywhere between sum of partition NDVs (no 
overlap)
+// and same as one of the partition NDV (domain of column value in 
all other
+// partitions is subset of the domain value in one of the 
partition)
+// (LowerBound).But under uniform distribution, we can roughly 
estimate the global
+// NDV by leveraging the min/max values.
+// And, we also guarantee that the estimation makes sense by 
comparing it to the
+// UpperBound 

Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-20 Thread via GitHub


dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2707211354


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1957,307 +2010,224 @@ private List 
aggrStatsUseJava(String catName, String dbName
 areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
   }
 
-  private List aggrStatsUseDB(String catName, String 
dbName,
-  String tableName, List partNames, List colNames, String 
engine,
-  boolean areAllPartsFound, boolean useDensityFunctionForNDVEstimation, 
double ndvTuner) throws MetaException {
+  private List aggrStatsUseDB(String catName, String 
dbName, String tableName,
+   List partNames, 
List colNames, String engine,
+   boolean 
useDensityFunctionForNDVEstimation, double ndvTuner)
+  throws MetaException {
 // TODO: all the extrapolation logic should be moved out of this class,
 // only mechanical data retrieval should remain here.
-String commonPrefix = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
-+ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), 
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
-+ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), 
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
-+ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
-+ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), "
-// The following data is used to compute a partitioned table's NDV 
based
-// on partitions' NDV when useDensityFunctionForNDVEstimation = true. 
Global NDVs cannot be
-// accurately derived from partition NDVs, because the domain of 
column value two partitions
-// can overlap. If there is no overlap then global NDV is just the sum
-// of partition NDVs (UpperBound). But if there is some overlay then
-// global NDV can be anywhere between sum of partition NDVs (no 
overlap)
-// and same as one of the partition NDV (domain of column value in all 
other
-// partitions is subset of the domain value in one of the partition)
-// (LowerBound).But under uniform distribution, we can roughly 
estimate the global
-// NDV by leveraging the min/max values.
-// And, we also guarantee that the estimation makes sense by comparing 
it to the
-// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
-// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
-+ 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? ";
-String queryText = null;
-long start = 0;
-long end = 0;
+String queryText = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
++ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), 
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
++ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), 
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
++ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
++ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), "
+// The following data is used to compute a partitioned table's NDV 
based
+// on partitions' NDV when useDensityFunctionForNDVEstimation = 
true. Global NDVs cannot be
+// accurately derived from partition NDVs, because the domain of 
column value two partitions
+// can overlap. If there is no overlap then global NDV is just the 
sum
+// of partition NDVs (UpperBound). But if there is some overlay 
then
+// global NDV can be anywhere between sum of partition NDVs (no 
overlap)
+// and same as one of the partition NDV (domain of column value in 
all other
+// partitions is subset of the domain value in one of the 
partition)
+// (LowerBound).But under uniform distribution, we can roughly 
estimate the global
+// NDV by leveraging the min/max values.
+// And, we also guarantee that the estimation makes sense by 
comparing it to the
+// UpperBound 

Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-20 Thread via GitHub


dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2707208061


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1957,307 +2010,224 @@ private List 
aggrStatsUseJava(String catName, String dbName
 areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
   }
 
-  private List aggrStatsUseDB(String catName, String 
dbName,
-  String tableName, List partNames, List colNames, String 
engine,
-  boolean areAllPartsFound, boolean useDensityFunctionForNDVEstimation, 
double ndvTuner) throws MetaException {
+  private List aggrStatsUseDB(String catName, String 
dbName, String tableName,
+   List partNames, 
List colNames, String engine,
+   boolean 
useDensityFunctionForNDVEstimation, double ndvTuner)
+  throws MetaException {
 // TODO: all the extrapolation logic should be moved out of this class,
 // only mechanical data retrieval should remain here.
-String commonPrefix = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
-+ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), 
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
-+ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), 
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
-+ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
-+ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), "
-// The following data is used to compute a partitioned table's NDV 
based
-// on partitions' NDV when useDensityFunctionForNDVEstimation = true. 
Global NDVs cannot be
-// accurately derived from partition NDVs, because the domain of 
column value two partitions
-// can overlap. If there is no overlap then global NDV is just the sum
-// of partition NDVs (UpperBound). But if there is some overlay then
-// global NDV can be anywhere between sum of partition NDVs (no 
overlap)
-// and same as one of the partition NDV (domain of column value in all 
other
-// partitions is subset of the domain value in one of the partition)
-// (LowerBound).But under uniform distribution, we can roughly 
estimate the global
-// NDV by leveraging the min/max values.
-// And, we also guarantee that the estimation makes sense by comparing 
it to the
-// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
-// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
-+ 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? ";
-String queryText = null;
-long start = 0;
-long end = 0;
+String queryText = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
++ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), 
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
++ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), 
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
++ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
++ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), "
+// The following data is used to compute a partitioned table's NDV 
based
+// on partitions' NDV when useDensityFunctionForNDVEstimation = 
true. Global NDVs cannot be
+// accurately derived from partition NDVs, because the domain of 
column value two partitions
+// can overlap. If there is no overlap then global NDV is just the 
sum
+// of partition NDVs (UpperBound). But if there is some overlay 
then
+// global NDV can be anywhere between sum of partition NDVs (no 
overlap)
+// and same as one of the partition NDV (domain of column value in 
all other
+// partitions is subset of the domain value in one of the 
partition)
+// (LowerBound).But under uniform distribution, we can roughly 
estimate the global
+// NDV by leveraging the min/max values.
+// And, we also guarantee that the estimation makes sense by 
comparing it to the
+// UpperBound 

Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-19 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3769336143

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [23 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [3.2% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-19 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3768189816

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [23 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [3.2% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-19 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2704473151


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String 
catName, String dbName,
   + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
   + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
   + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? "
-  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
   + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
   + " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " + 
PART_COL_STATS + ".\"COLUMN_TYPE\"";

Review Comment:
   @dengzhhu653 removed `areAllPartsFound` and tried to keep `aggrStatsUseDB` 
simple. Please have a look



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-19 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2703738929


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2159,102 +2187,180 @@ private List 
aggrStatsUseDB(String catName, String dbName,
   if (index == null) {
 index = IExtrapolatePartStatus.indexMaps.get("default");
   }
+
+  //for avg calculation
+  queryText = "select " + 
"sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as 
decimal)),"
+  + 
"sum((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
+  + "sum((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
+  + "count(1)"
+  + " from " + PART_COL_STATS + ""
+  + " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
+  + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " 
+ TBLS + ".\"TBL_ID\""
+  + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
+  + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" 
= ? and " + TBLS + ".\"TBL_NAME\" = ? "
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+  + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+  + " group by \"COLUMN_NAME\"";
+
+  columnWisePartitionBatches =
+  columnWisePartitionBatcher(queryText, catName, dbName, 
tableName, partNames, engine, doTrace);
+  Integer[] sumNdvIndices = new Integer[]{14, 15, 16};
+  try {
+list = Batchable.runBatched(batchSize, 
Collections.singletonList(colName), columnWisePartitionBatches);
+for (Object[] batch : list) {
+  // filling in ndv sums for avg calculation
+  for (int i = 0; i < 3; i++) {
+row[sumNdvIndices[i]] = 
MetastoreDirectSqlUtils.sum(row[sumNdvIndices[i]], batch[i]);
+  }
+  // filling in count rows for avg calculation
+  row[17] = MetastoreDirectSqlUtils.sum(row[17], batch[3]);
+}
+  } finally {
+columnWisePartitionBatches.closeAllQueries();
+  }
   for (int colStatIndex : index) {
 String colStatName = 
IExtrapolatePartStatus.colStatNames[colStatIndex];
 // if the aggregation type is sum, we do a scale-up
 if (IExtrapolatePartStatus.aggrTypes[colStatIndex] == 
IExtrapolatePartStatus.AggrType.Sum) {
   Object o = sumMap.get(colName).get(colStatIndex);
+  // +3 only for the case of SUM_NUM_DISTINCTS which is after 
count rows index
+  int rowIndex = (colStatIndex == 15) ? colStatIndex + 3 : 
colStatIndex + 2;
   if (o == null) {
-row[2 + colStatIndex] = null;
+row[rowIndex] = null;
   } else {
 Long val = MetastoreDirectSqlUtils.extractSqlLong(o);
-row[2 + colStatIndex] = val / sumVal * (partNames.size());
+row[rowIndex] = val / sumVal * (partNames.size());
   }
 } else if (IExtrapolatePartStatus.aggrTypes[colStatIndex] == 
IExtrapolatePartStatus.AggrType.Min
 || IExtrapolatePartStatus.aggrTypes[colStatIndex] == 
IExtrapolatePartStatus.AggrType.Max) {
   // if the aggregation type is min/max, we extrapolate from the
   // left/right borders
-  if (!decimal) {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from 
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + 
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + 
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + 
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " order by \"" + colStatName + "\"";
-  } else {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from 
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + 
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
- 

Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-08 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2674852462


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String 
catName, String dbName,
   + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
   + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
   + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? "
-  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
   + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
   + " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " + 
PART_COL_STATS + ".\"COLUMN_TYPE\"";

Review Comment:
   @dengzhhu653 is there anything else that I should address?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2026-01-08 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2674852462


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String 
catName, String dbName,
   + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
   + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
   + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? "
-  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
   + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
   + " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " + 
PART_COL_STATS + ".\"COLUMN_TYPE\"";

Review Comment:
   @dengzhhu653 is there anything else that needed to be taken care?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-29 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2652181949


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String 
catName, String dbName,
   + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
   + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
   + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? "
-  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
   + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
   + " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " + 
PART_COL_STATS + ".\"COLUMN_TYPE\"";

Review Comment:
   yes



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-29 Thread via GitHub


dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2652059815


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String 
catName, String dbName,
   + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
   + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
   + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? "
-  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
   + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
   + " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " + 
PART_COL_STATS + ".\"COLUMN_TYPE\"";

Review Comment:
   this extrapolation is just based on the existing stats on colA, am I right?
   here the `val` is sum(colA) of the partition: part1,part2,part3, sumVal is 
3, and partNames.size() is 5.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-28 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3694658584

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [21 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-28 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2649590246


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String 
catName, String dbName,
   + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
   + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
   + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? "
-  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
   + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
   + " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " + 
PART_COL_STATS + ".\"COLUMN_TYPE\"";

Review Comment:
   yes as you mentioned some assumption is need for missing two stats so this 
else block is taking care of that only as it is first finding the the available 
stats which are already there then extrapolate its results with some logic like:
   ```
   if (IExtrapolatePartStatus.aggrTypes[colStatIndex] == 
IExtrapolatePartStatus.AggrType.Sum) {
 Object o = sumMap.get(colName).get(colStatIndex);
 // +3 only for the case of SUM_NUM_DISTINCTS which is after 
count rows index
 int rowIndex = (colStatIndex == 15) ? colStatIndex + 3 : 
colStatIndex + 2;
 if (o == null) {
   row[rowIndex] = null;
 } else {
   Long val = MetastoreDirectSqlUtils.extractSqlLong(o);
   row[rowIndex] = val / sumVal * (partNames.size());
 }
   }
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-21 Thread via GitHub


dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2638281029


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String 
catName, String dbName,
   + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
   + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
   + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? "
-  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
   + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
   + " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " + 
PART_COL_STATS + ".\"COLUMN_TYPE\"";

Review Comment:
   I think this is not fully right. let's take an example:
   In db, colA has the stats for part1, part2, part3, and colB has stats for 
part1,part2,part3,par4,part5 
   If we want to aggregate the column stats among the 
partition(part1,part2,part3), as colA and colB both have the partition column 
stats, we just fetch the stats and aggregate them.
   however if we want the aggregated stats for 
partition(part1,part2,part3,part4,par5), as colB has all of them, so just 
aggregate them. While for colA, as part4 and part5 are missing, we need some 
assumptions on those two missing stats.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-21 Thread via GitHub


dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2638281029


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String 
catName, String dbName,
   + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
   + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
   + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? "
-  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
   + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
   + " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " + 
PART_COL_STATS + ".\"COLUMN_TYPE\"";

Review Comment:
   I think this is not fully right. let's take an example:
   In db, colA has the stats for part1, part2, part3, and colB has stats for 
part1,part2,part3,par4,part5 
   If we want to aggregate the column stats among the 
partition(part1,part2,part3), as colA and colB both have the partition column 
stats, we just fetch the stats and aggregate them.
   however if we want the aggregated stats for 
partition(part1,part2,part3,part4,par5), as colB has all of them, so just 
aggregate them. While for colA, as part4 and part5 is missing, we need some 
assumptions on those two missing stats.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-19 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3674066508

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [21 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-18 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2634078220


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String 
catName, String dbName,
   + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
   + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
   + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? "
-  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
   + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
   + " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " + 
PART_COL_STATS + ".\"COLUMN_TYPE\"";

Review Comment:
   like let's say for colA stats computed when table has 5 parts and for colB 
stats computed when table has 8 parts
   so here `areAllPartsFound` will be false



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-18 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2634066049


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String 
catName, String dbName,
   + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
   + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
   + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? "
-  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
   + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
   + " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " + 
PART_COL_STATS + ".\"COLUMN_TYPE\"";

Review Comment:
   correct me if I'm wrong:
   we have `ANALYZE TABLE {table name} PARTITION({part cols}) COMPUTE 
STATISTICS FOR COLUMNS {columns passed};`
   so due to this query there can be scenarios where partitions found for a 
column of a table where stats have beeen computed can vary from column to 
column of a table and in that case partNames will be fixed and 
`areAllPartsFound` can be false and we need to handle this scenario as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-18 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2634066049


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String 
catName, String dbName,
   + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
   + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
   + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? "
-  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
   + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
   + " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " + 
PART_COL_STATS + ".\"COLUMN_TYPE\"";

Review Comment:
   correct me if I'm wrong:
   we have `ANALYZE TABLE {table name} PARTITION({part cols}) COMPUTE 
STATISTICS FOR COLUMNS {columns passed};`
   so due to this query there can be scenarios where partitions found for a 
column of a table where stats have beeen computed can vary from column to 
column of a table and in that case `partNames ` will be fixed and 
`areAllPartsFound` can be false and we need to handle this scenario as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-18 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2634066049


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String 
catName, String dbName,
   + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
   + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
   + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? "
-  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
   + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
   + " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " + 
PART_COL_STATS + ".\"COLUMN_TYPE\"";

Review Comment:
   correct me if I'm wrong:
   we have ANALYZE TABLE {table name} PARTITION({part cols}) COMPUTE STATISTICS 
FOR COLUMNS {columns passed};
   so due to this query there can be scenarios where partitions found for a 
column of a table where stats have beeen computed can vary from column to 
column of a table and in that case partNames will be fixed and 
`areAllPartsFound` can be false and we need to handle this scenario as well.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-18 Thread via GitHub


dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2634019511


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String 
catName, String dbName,
   + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
   + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
   + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? "
-  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
   + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
   + " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " + 
PART_COL_STATS + ".\"COLUMN_TYPE\"";

Review Comment:
   This method aggregates the column stats for `partNames`, this is known 
beforehand, why do we should take account of the "new partitions"?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-18 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2633995211


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String 
catName, String dbName,
   + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
   + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
   + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? "
-  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
   + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
   + " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " + 
PART_COL_STATS + ".\"COLUMN_TYPE\"";

Review Comment:
   also this else branch of  `areAllPartsFound` should not be removed as it is 
helpful in cases when statistics are computed before and some new partitions 
are added later on so some partition stats will be missing in that we will need 
this branch for that.
   I have added the test case `testMissingPartsStatsAggrWithBackendDB` for this 
scenario in `TestObjectStore.java` as well



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-18 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2633985059


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String 
catName, String dbName,
   + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
   + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
   + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? "
-  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
   + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
   + " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " + 
PART_COL_STATS + ".\"COLUMN_TYPE\"";

Review Comment:
   I agree this is same and can be merged with common prefix but we should not 
do that as this query fetches way less parameters than `commonPrefix` one and 
it would be faster than it and also this query is necessary prerequisite for 
finding `noExtraColumnNames` later on which uses `commonPrefix` query so we 
can't just store the result of query execution of `commonPrefix` one once and 
use for both this prerequisite and for `noExtraColumnNames` as columns needed 
to be passed to both of them will be different.
   but I have replaced `count("PART_COL_STATS"."PART_ID")` with `count(1)`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-18 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2633903272


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -165,6 +166,26 @@ class MetaStoreDirectSql {
   private DirectSqlUpdatePart directSqlUpdatePart;
   private DirectSqlInsertPart directSqlInsertPart;
 
+  private static final int COLNAME = 0;

Review Comment:
   done added as an enum in the IExtrapolatePartStatus.java



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-18 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3670668944

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [16 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-17 Thread via GitHub


dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2629934599


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String 
catName, String dbName,
   + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
   + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
   + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? "
-  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
   + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
   + " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " + 
PART_COL_STATS + ".\"COLUMN_TYPE\"";

Review Comment:
   the `areAllPartsFound` makes little sense here, we can remove this branch



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-17 Thread via GitHub


dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2629878760


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -165,6 +166,26 @@ class MetaStoreDirectSql {
   private DirectSqlUpdatePart directSqlUpdatePart;
   private DirectSqlInsertPart directSqlInsertPart;
 
+  private static final int COLNAME = 0;

Review Comment:
   personally I dont like this number of static fields, maybe we could move 
them somewhere, like IExtrapolatePartStatus



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-17 Thread via GitHub


dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2629860466


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String 
catName, String dbName,
   + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
   + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
   + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? "
-  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
   + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
   + " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " + 
PART_COL_STATS + ".\"COLUMN_TYPE\"";

Review Comment:
   In this query, the `count("PART_COL_STATS"."PART_ID")` would be the same as 
`count(1)` in the `commonPrefix`, so this query can be merged as well.
   - For each column name, if the count == partNames.size() or count < 2, then 
we simply merge the stats;
   - otherwise, we need extrapolation for this column, 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-16 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3659478674

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [14 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-16 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2622247286


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1977,47 +2034,29 @@ private List aggrStatsUseDB(String 
catName, String dbName,
 // And, we also guarantee that the estimation makes sense by comparing 
it to the
 // UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
 // and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
-+ 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
++ 
"count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as 
decimal)),"

Review Comment:
   my bad, done updated with count(1)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-15 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2621939204


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1977,47 +2034,29 @@ private List aggrStatsUseDB(String 
catName, String dbName,
 // And, we also guarantee that the estimation makes sense by comparing 
it to the
 // UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
 // and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
-+ 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
++ 
"count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as 
decimal)),"

Review Comment:
   and also if I am understanding it right only one out of long, double and big 
decimal  expression will be not null according to col type so it will need 
checks to find it earlier in order to use count(1)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-15 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2621939204


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1977,47 +2034,29 @@ private List aggrStatsUseDB(String 
catName, String dbName,
 // And, we also guarantee that the estimation makes sense by comparing 
it to the
 // UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
 // and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
-+ 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
++ 
"count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as 
decimal)),"

Review Comment:
   and also if I am understanding it right only one out of long, double and big 
decimal  expression will be not null according to col type so it will need 
checks to find it earlier in order to use count(1)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-15 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2621910371


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1977,47 +2034,29 @@ private List aggrStatsUseDB(String 
catName, String dbName,
 // And, we also guarantee that the estimation makes sense by comparing 
it to the
 // UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
 // and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
-+ 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
++ 
"count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as 
decimal)),"

Review Comment:
   are you saying that instead of count(expression) we can use count(1) as 
whenever it is not null all rows will be counted?
   but isn't this count(expr) is more readable and understandable



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-15 Thread via GitHub


ramitg254 commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3658972819

   > could you please consider this comments? [#6089 
(comment)](https://github.com/apache/hive/pull/6089#discussion_r2589091584) 
[#6089 
(comment)](https://github.com/apache/hive/pull/6089#discussion_r2589103939)
   
   @dengzhhu653 I have already considered and implemented these comments:
   1. https://github.com/apache/hive/pull/6089#discussion_r2621892496
   2. https://github.com/apache/hive/pull/6089#discussion_r2621896928


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-15 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2621896928


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2108,33 +2126,44 @@ private List aggrStatsUseDB(String 
catName, String dbName,
 + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
 + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
 + " group by " + PART_COL_STATS + ".\"COLUMN_NAME\"";
-start = doTrace ? System.nanoTime() : 0;
-try (QueryWrapper query = new 
QueryWrapper(pm.newQuery("javax.jdo.query.SQL", queryText))) {
-  List extraColumnNames = new ArrayList();
+
+columnWisePartitionBatches =
+columnWisePartitionBatcher(queryText, catName, dbName, 
tableName, partNames, engine, doTrace);
+try {
+  List extraColumnNames = new ArrayList<>();
   extraColumnNames.addAll(extraColumnNameTypeParts.keySet());
-  Object qResult = executeWithArray(query.getInnerQuery(),
-  prepareParams(catName, dbName, tableName, partNames,
-  extraColumnNames, engine), queryText);
-  if (qResult == null) {
-return Collections.emptyList();
+  List unmergedList = Batchable.runBatched(batchSize, 
extraColumnNames, columnWisePartitionBatches);
+  Map colSumStatsMap = new HashMap<>();
+  for (Object[] row : unmergedList) {

Review Comment:
   using unmergedList and map for mergedRow as mentioned in: 
https://github.com/apache/hive/pull/6089#discussion_r2589103939



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-15 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2621892496


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2159,102 +2188,185 @@ private List 
aggrStatsUseDB(String catName, String dbName,
   if (index == null) {
 index = IExtrapolatePartStatus.indexMaps.get("default");
   }
+
+  //for avg calculation
+  queryText = "select " + 
"sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as 
decimal)),"
+  + 
"count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as 
decimal)),"
+  + 
"sum((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
+  + 
"count((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
+  + "sum((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
+  + "count((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
+  + " from " + PART_COL_STATS + ""
+  + " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
+  + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " 
+ TBLS + ".\"TBL_ID\""
+  + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
+  + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" 
= ? and " + TBLS + ".\"TBL_NAME\" = ? "
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+  + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+  + " group by \"COLUMN_NAME\"";
+
+  columnWisePartitionBatches =
+  columnWisePartitionBatcher(queryText, catName, dbName, 
tableName, partNames, engine, doTrace);
+  Object[] sum = new Object[3];
+  Object[] count = new Object[3];
+  Integer[] avgIndex = new Integer[]{14, 15, 16};
+  try {
+list = Batchable.runBatched(batchSize, 
Collections.singletonList(colName), columnWisePartitionBatches);
+for (int i = 0; i < 6; i += 2) {
+  for (Object[] batch : list) {
+sum[i / 2] = MetastoreDirectSqlUtils.sum(sum[i / 2], batch[i]);
+count[i / 2] = MetastoreDirectSqlUtils.sum(count[i / 2], 
batch[i + 1]);
+  }
+  // filling in sum and count in row for avg calculation later on
+  row[avgIndex[i / 2] + i / 2] = sum[i / 2];
+  row[avgIndex[i / 2] + i / 2 + 1] = count[i / 2];
+}
+  } finally {
+columnWisePartitionBatches.closeAllQueries();
+  }
   for (int colStatIndex : index) {
 String colStatName = 
IExtrapolatePartStatus.colStatNames[colStatIndex];
 // if the aggregation type is sum, we do a scale-up
 if (IExtrapolatePartStatus.aggrTypes[colStatIndex] == 
IExtrapolatePartStatus.AggrType.Sum) {
   Object o = sumMap.get(colName).get(colStatIndex);
+  // +5 only for the case of SUM_NUM_DISTINCTS which is after avg 
indices
+  int rowIndex = (colStatIndex == 15) ? colStatIndex + 5 : 
colStatIndex + 2;
   if (o == null) {
-row[2 + colStatIndex] = null;
+row[rowIndex] = null;
   } else {
 Long val = MetastoreDirectSqlUtils.extractSqlLong(o);
-row[2 + colStatIndex] = val / sumVal * (partNames.size());
+row[rowIndex] = val / sumVal * (partNames.size());
   }
 } else if (IExtrapolatePartStatus.aggrTypes[colStatIndex] == 
IExtrapolatePartStatus.AggrType.Min
 || IExtrapolatePartStatus.aggrTypes[colStatIndex] == 
IExtrapolatePartStatus.AggrType.Max) {
   // if the aggregation type is min/max, we extrapolate from the
   // left/right borders
-  if (!decimal) {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from 
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + 
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + 
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + 
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS 

Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-15 Thread via GitHub


dengzhhu653 commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3658797922

   could you please consider this comments?
   https://github.com/apache/hive/pull/6089#discussion_r2589091584
   https://github.com/apache/hive/pull/6089#discussion_r2589103939
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-15 Thread via GitHub


dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2621735849


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1977,47 +2034,29 @@ private List aggrStatsUseDB(String 
catName, String dbName,
 // And, we also guarantee that the estimation makes sense by comparing 
it to the
 // UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
 // and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
-+ 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
++ 
"count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as 
decimal)),"

Review Comment:
   looks a single count `count(1)` is enough for me, 
   `count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as 
decimal))`
   `count((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\")`
   `count((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\")`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-15 Thread via GitHub


dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2621735849


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1977,47 +2034,29 @@ private List aggrStatsUseDB(String 
catName, String dbName,
 // And, we also guarantee that the estimation makes sense by comparing 
it to the
 // UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
 // and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
-+ 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
++ 
"count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as 
decimal)),"

Review Comment:
   looks a single count is enough for me, 
   `count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as 
decimal))`
   `count((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\")`
   `count((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\")`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-12 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3645388441

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [16 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-09 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3632597922

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [14 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-09 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2602634109


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1954,9 +2007,20 @@ private List 
aggrStatsUseJava(String catName, String dbName
 areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
   }
 
-  private List aggrStatsUseDB(String catName, String 
dbName,
-  String tableName, List partNames, List colNames, String 
engine,
-  boolean areAllPartsFound, boolean useDensityFunctionForNDVEstimation, 
double ndvTuner) throws MetaException {
+  private Map> columnWiseSubList(List list) {
+Map> colSubList = new HashMap<>();
+for (Object[] row : list) {
+  String colName = (String) row[0];
+  colSubList.putIfAbsent(colName, new ArrayList<>());

Review Comment:
   Thanks for explaining it , I misunderstood it earlier so I have updated it 
with this implementation 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-09 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2602628155


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2174,86 +2270,132 @@ private List 
aggrStatsUseDB(String catName, String dbName,
 || IExtrapolatePartStatus.aggrTypes[colStatIndex] == 
IExtrapolatePartStatus.AggrType.Max) {
   // if the aggregation type is min/max, we extrapolate from the
   // left/right borders
-  if (!decimal) {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from 
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + 
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + 
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + 
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " order by \"" + colStatName + "\"";
-  } else {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from 
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + 
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + 
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + 
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " order by cast(\"" + colStatName + "\" as decimal)";
-  }
-  start = doTrace ? System.nanoTime() : 0;
-  try (QueryWrapper query = new 
QueryWrapper(pm.newQuery("javax.jdo.query.SQL", queryText))) {
-Object qResult = executeWithArray(query.getInnerQuery(),
-prepareParams(catName, dbName, tableName, partNames, 
Arrays.asList(colName), engine), queryText);
-if (qResult == null) {
-  return Collections.emptyList();
-}
-fqr = (ForwardQueryResult) qResult;
-Object[] min = (Object[]) (fqr.get(0));
-Object[] max = (Object[]) (fqr.get(fqr.size() - 1));
-end = doTrace ? System.nanoTime() : 0;
-MetastoreDirectSqlUtils.timingTrace(doTrace, queryText, start, 
end);
-if (min[0] == null || max[0] == null) {
-  row[2 + colStatIndex] = null;
-} else {
-  row[2 + colStatIndex] = extrapolateMethod
-  .extrapolate(min, max, colStatIndex, indexMap);
-}
-  }
-} else {
-  // if the aggregation type is avg, we use the average on the 
existing ones.
-  queryText = "select "
-  + 
"avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as 
decimal)),"
-  + 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-  + "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\")"
-  + " from " + PART_COL_STATS + ""
+  String orderByExpr = decimal ? "cast(\"" + colStatName + "\" as 
decimal)" : "\"" + colStatName + "\"";
+
+  queryText = "select \"" + colStatName + "\",\"PART_NAME\" from " 
+ PART_COL_STATS
   + " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
   + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" 
= " + TBLS + ".\"TBL_ID\""
   + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + 
DBS + ".\"DB_ID\""
   + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + 
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".

Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-09 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2602606147


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2174,86 +2270,132 @@ private List 
aggrStatsUseDB(String catName, String dbName,
 || IExtrapolatePartStatus.aggrTypes[colStatIndex] == 
IExtrapolatePartStatus.AggrType.Max) {
   // if the aggregation type is min/max, we extrapolate from the
   // left/right borders
-  if (!decimal) {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from 
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + 
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + 
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + 
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " order by \"" + colStatName + "\"";
-  } else {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from 
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + 
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + 
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + 
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " order by cast(\"" + colStatName + "\" as decimal)";
-  }
-  start = doTrace ? System.nanoTime() : 0;
-  try (QueryWrapper query = new 
QueryWrapper(pm.newQuery("javax.jdo.query.SQL", queryText))) {
-Object qResult = executeWithArray(query.getInnerQuery(),
-prepareParams(catName, dbName, tableName, partNames, 
Arrays.asList(colName), engine), queryText);
-if (qResult == null) {
-  return Collections.emptyList();
-}
-fqr = (ForwardQueryResult) qResult;
-Object[] min = (Object[]) (fqr.get(0));
-Object[] max = (Object[]) (fqr.get(fqr.size() - 1));
-end = doTrace ? System.nanoTime() : 0;
-MetastoreDirectSqlUtils.timingTrace(doTrace, queryText, start, 
end);
-if (min[0] == null || max[0] == null) {
-  row[2 + colStatIndex] = null;
-} else {
-  row[2 + colStatIndex] = extrapolateMethod
-  .extrapolate(min, max, colStatIndex, indexMap);
-}
-  }
-} else {
-  // if the aggregation type is avg, we use the average on the 
existing ones.
-  queryText = "select "
-  + 
"avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as 
decimal)),"
-  + 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-  + "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\")"
-  + " from " + PART_COL_STATS + ""
+  String orderByExpr = decimal ? "cast(\"" + colStatName + "\" as 
decimal)" : "\"" + colStatName + "\"";
+
+  queryText = "select \"" + colStatName + "\",\"PART_NAME\" from " 
+ PART_COL_STATS
   + " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
   + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" 
= " + TBLS + ".\"TBL_ID\""
   + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + 
DBS + ".\"DB_ID\""
   + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + 
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
+  + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+  + " and " + PARTITIONS + ".

Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-09 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2602575145


##
itests/qtest/src/test/java/org/apache/hadoop/hive/cli/TestTezBatchedStatsCliDriver.java:
##
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.cli;
+
+import org.apache.hadoop.hive.cli.control.CliAdapter;
+import org.apache.hadoop.hive.cli.control.CliConfigs;
+import org.junit.ClassRule;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TestRule;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+import org.junit.runners.Parameterized.Parameters;
+
+import java.io.File;
+import java.util.List;
+
+@RunWith(Parameterized.class)
+public class TestTezBatchedStatsCliDriver {

Review Comment:
   Thanks for pointing it, I agree with this and added corresponding test in 
`TestObjectStore.java` and removed earlier one



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-08 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3629061660

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [14 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-05 Thread via GitHub


kasakrisz commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2588754346


##
itests/qtest/src/test/java/org/apache/hadoop/hive/cli/TestTezBatchedStatsCliDriver.java:
##
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.cli;
+
+import org.apache.hadoop.hive.cli.control.CliAdapter;
+import org.apache.hadoop.hive.cli.control.CliConfigs;
+import org.junit.ClassRule;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TestRule;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+import org.junit.runners.Parameterized.Parameters;
+
+import java.io.File;
+import java.util.List;
+
+@RunWith(Parameterized.class)
+public class TestTezBatchedStatsCliDriver {

Review Comment:
   IMHO there is no need to add a new driver for testing the 
`ObjectStore.get_aggr_stats_for` method. Doing so would imply that we need new 
drivers to cover all combinations of `hive.stats.fetch.bitvector`, 
`hive.stats.fetch.kll`, `hive.metastore.direct.sql.batch.size`, etc. The number 
of drivers would explode if we followed this pattern.
   
   
   Please take a look at an existing test case. It is not a q-test–based one, 
but a good integration test that focuses only on `ObjectStore`.
   
https://github.com/apache/hive/blob/ca105f8124072d19d88a83b2ced613d326c9a26b/standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestObjectStore.java#L1025-L1047
   (The test method name `testAggrStatsUseDB` is misleading, though, because 
the test actually calls the `ObjectStore.get_aggr_stats_for` method, which 
eventually calls `MetaStoreDirectSql.testAggrStatsUseDB`.)
   
   Configurations can be easily changed in these kinds of tests via the 
MetastoreConf object. We also have better ways to write assertions. These tests 
run faster than q-tests.
   
   I temporarily added:
   ```
   MetastoreConf.setLongVar(conf2, ConfVars.DIRECT_SQL_PARTITION_BATCH_SIZE, 2);
   ```
   and I was able to reproduce the bug described in 
[HIVE-29203](https://issues.apache.org/jira/browse/HIVE-29203). Both with 
aggregating in the backend db (`aggrStatsUseDB`) and HMS side 
(`aggrStatsUseJava`).
   



##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2174,86 +2270,132 @@ private List 
aggrStatsUseDB(String catName, String dbName,
 || IExtrapolatePartStatus.aggrTypes[colStatIndex] == 
IExtrapolatePartStatus.AggrType.Max) {
   // if the aggregation type is min/max, we extrapolate from the
   // left/right borders
-  if (!decimal) {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from 
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + 
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + 
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + 
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " order by \"" + colStatName + "\"";
-  } else {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from 
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + 
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + 
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + 
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + P

Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


ramitg254 commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3605393243

   and sonar is pointing out indentation issues which i don't think needs to be 
addressed as I kept it same as we have in those other drive files and config 
file and others are large brain methods like `aggrStatsUseDB` which are already 
part of code and splitting it up into separate methods with entire 
implementation intact can take up time and may break some scenarios addressed 
in that method


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


ramitg254 commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3605348137

   Hi,
   so it is regarding the unit tests added:
   1. why I am going for driver based approach instead of simply adding a q 
file under an already present driver:
  -> the `-Dhive.stats.fetch.bitvector` and 
`-Dhive.metastore.direct.sql.batch.size` can't be set at the query time inside 
a q file and needed to be set in `hive-site.xml`
  
2. why going for a docker image instead of a q file loading the data into 
the table:
  -> these changes are for scenario where statistics have been computed 
already and then someone specify this property 
`hive.metastore.direct.sql.batch.size` and restart hms so in that case results 
are different. But to replicate this in unit test it won't be possible if I 
define table and load data in q file as ` it will give ORM related exception 
when placing debug pointer although getting executed` . So what i did is 
created a `test_stats` table having  data in 13 partitions and then compute 
statistics for its columns and built a docker image with the stats dump and 
used that docker image.
 
   the q files added are performing self join when 
`hive.metastore.direct.sql.batch.size=5` and the 
`hive.metastore.direct.sql.batch.size` is set according to the q file name.
   
   you can validate the results via:
   cherry-picking the unit tests added commit on the top of the current master 
branch and run:
   `mvn test -pl itests/qtest -Pitests -Dtest=TestTezBatchedStatsCliDriver  
-Dqfile=sketch_query.q -Dtest.output.overwrite`
   similar for `no_sketch_query.q`
   and you will see different results from expected one added in commit and if 
you just drop the property `hive.metastore.direct.sql.batch.size` from the 
`hive-site.xml`  added for this driver you will see the expected results which 
is the validation for corrected expected output with these changes in case of 
batching.
   
   FYI: the stats are added in such a way that overlapping values lie in 
between batches of size 5 resulting in just taking maximum of ndv is not enough 
to evaluate ndv calculated without batching as it is the case failing only 
otherwise it passes even after having redundant per column elements which just 
simply gets concatenated via simple function like maximum and  all on the 
higher level
 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


ramitg254 commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3605270439

   > I think we can make the `aggrStatsUseDB` be simpler,
   > 
   > 1. run the below SQL batched by part names:
   > 
   > ```java
   >String query  = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
   >+ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), 
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
   >+ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), 
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
   >+ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
   >+ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), "
   >// The following data is used to compute a partitioned table's NDV 
based
   >// on partitions' NDV when useDensityFunctionForNDVEstimation = 
true. Global NDVs cannot be
   >// accurately derived from partition NDVs, because the domain of 
column value two partitions
   >// can overlap. If there is no overlap then global NDV is just the 
sum
   >// of partition NDVs (UpperBound). But if there is some overlay then
   >// global NDV can be anywhere between sum of partition NDVs (no 
overlap)
   >// and same as one of the partition NDV (domain of column value in 
all other
   >// partitions is subset of the domain value in one of the partition)
   >// (LowerBound).But under uniform distribution, we can roughly 
estimate the global
   >// NDV by leveraging the min/max values.
   >// And, we also guarantee that the estimation makes sense by 
comparing it to the
   >// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
   >// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
   >+ 
"sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as 
decimal)),"
   >+ 
"sum((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
   >+ "sum((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
   >+ "count(PARTITIONS + ".\"PART_ID\"),"
   >+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + "" + " 
inner join " + PARTITIONS + " on "
   >+ PART_COL_STATS + ".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\"" 
+ " inner join " + TBLS + " on " + PARTITIONS
   >+ ".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\"" + " inner join " + DBS + 
" on " + TBLS + ".\"DB_ID\" = " + DBS
   >+ ".\"DB_ID\"" + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS 
+ ".\"NAME\" = ? and " + TBLS
   >+ ".\"TBL_NAME\" = ?  and \"COLUMN_NAME\" in (%1$s)" + " and " + 
PARTITIONS + ".\"PART_NAME\" in (%2$s)"
   >  + " and \"ENGINE\" = ? " + " group by \"COLUMN_NAME\", 
\"COLUMN_TYPE\"";
   > ```
   > 
   > 2. For each batch, merge the returned rows as we does in this PR:
   > 
   > ```java
   > try {
   >  List list = Batchable.runBatched(batchSize, partNames, b);
   >  Map> colSubList = columnWiseSubList(list);
   >  for (Map.Entry> entry : colSubList.entrySet()) 
{
   >colStats.add(columnStatisticsObjWithAdjustedNDV(entry.getValue(), 
0, useDensityFunctionForNDVEstimation, ndvTuner));
   >Deadline.checkTimeout();
   >  }
   >} finally {
   >  b.closeAllQueries();
   >}
   > ```
   > 
   >  3. for each `ColumnStatisticsObj` got from step 2, change the `sum` by 
`sum * partNames.size() / count(PARTITIONS + ".\"PART_ID\") for this column`
   > 
   > By this, we can make the `aggrStatsUseDB` simple and clear, and don't send 
the unnecessary queries back and forth to backing db.
   
   IIUC you are talking about the different queries for the `else` block of `if 
(areAllPartsFound)` but I don't have the unit test to test this behaviour as it 
is for a broken behaviour when a column is not present in all partiitions. So, 
I didn't change any implementation and queries for those case and only replaced 
the results of list with a map and then performed merging and add to the final 
stats list so that it won't break any functionality.
   
   I can't change the implementation without having a unit test for that 
behaviour.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583757083


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2098,29 +2121,39 @@ private List aggrStatsUseDB(String 
catName, String dbName,
 }
 // get sum for all columns to reduce the number of queries
 Map> sumMap = new HashMap>();
-queryText = "select \"COLUMN_NAME\", sum(\"NUM_NULLS\"), 
sum(\"NUM_TRUES\"), sum(\"NUM_FALSES\"), sum(\"NUM_DISTINCTS\")"
-+ " from " + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = 
? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(extraColumnNameTypeParts.size()) + ")"
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\"";
-start = doTrace ? System.nanoTime() : 0;
-try (QueryWrapper query = new 
QueryWrapper(pm.newQuery("javax.jdo.query.SQL", queryText))) {
+queryText =
+"select \"COLUMN_NAME\", sum(\"NUM_NULLS\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), sum(\"NUM_DISTINCTS\")"
++ " from " + PART_COL_STATS + " inner join " + PARTITIONS + " 
on " + PART_COL_STATS + ".\"PART_ID\" = "
++ PARTITIONS + ".\"PART_ID\"" + " inner join " + TBLS + " on " 
+ PARTITIONS + ".\"TBL_ID\" = " + TBLS
++ ".\"TBL_ID\"" + " inner join " + DBS + " on " + TBLS + 
".\"DB_ID\" = " + DBS + ".\"DB_ID\""
++ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + 
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
++ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(extraColumnNameTypeParts.size()) + ")"
++ " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")" + " and "
++ PART_COL_STATS + ".\"ENGINE\" = ? " + " group by " + 
PART_COL_STATS + ".\"COLUMN_NAME\"";
+
+b = jobsBatching(queryText, catName, dbName, tableName, partNames, 
engine, doTrace);
+try {
   List extraColumnNames = new ArrayList();
   extraColumnNames.addAll(extraColumnNameTypeParts.keySet());
-  Object qResult = executeWithArray(query.getInnerQuery(),
-  prepareParams(catName, dbName, tableName, partNames,
-  extraColumnNames, engine), queryText);
-  if (qResult == null) {
-return Collections.emptyList();
+  List unmergedList = Batchable.runBatched(batchSize, 
extraColumnNames, b);
+  Map> colSubList = 
columnWiseSubList(unmergedList);
+  List mergedList = new ArrayList<>();
+  for (Map.Entry> entry : 
colSubList.entrySet()) {

Review Comment:
   I don't think it can be avoided same as the above mentioned reason



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583755269


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2098,29 +2121,39 @@ private List aggrStatsUseDB(String 
catName, String dbName,
 }
 // get sum for all columns to reduce the number of queries
 Map> sumMap = new HashMap>();
-queryText = "select \"COLUMN_NAME\", sum(\"NUM_NULLS\"), 
sum(\"NUM_TRUES\"), sum(\"NUM_FALSES\"), sum(\"NUM_DISTINCTS\")"
-+ " from " + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = 
? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(extraColumnNameTypeParts.size()) + ")"
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\"";
-start = doTrace ? System.nanoTime() : 0;
-try (QueryWrapper query = new 
QueryWrapper(pm.newQuery("javax.jdo.query.SQL", queryText))) {
+queryText =
+"select \"COLUMN_NAME\", sum(\"NUM_NULLS\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), sum(\"NUM_DISTINCTS\")"
++ " from " + PART_COL_STATS + " inner join " + PARTITIONS + " 
on " + PART_COL_STATS + ".\"PART_ID\" = "
++ PARTITIONS + ".\"PART_ID\"" + " inner join " + TBLS + " on " 
+ PARTITIONS + ".\"TBL_ID\" = " + TBLS
++ ".\"TBL_ID\"" + " inner join " + DBS + " on " + TBLS + 
".\"DB_ID\" = " + DBS + ".\"DB_ID\""
++ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + 
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
++ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" + 
makeParams(extraColumnNameTypeParts.size()) + ")"
++ " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")" + " and "
++ PART_COL_STATS + ".\"ENGINE\" = ? " + " group by " + 
PART_COL_STATS + ".\"COLUMN_NAME\"";

Review Comment:
   done, restored 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583754720


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1977,116 +2020,96 @@ private List 
aggrStatsUseDB(String catName, String dbName,
 // And, we also guarantee that the estimation makes sense by comparing 
it to the
 // UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
 // and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
-+ 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? ";
++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
++ 
"count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as 
decimal)),"
++ 
"sum((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
++ 
"count((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
++ "sum((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "count((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + "" + " inner 
join " + PARTITIONS + " on "
++ PART_COL_STATS + ".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\"" + " 
inner join " + TBLS + " on " + PARTITIONS
++ ".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\"" + " inner join " + DBS + " 
on " + TBLS + ".\"DB_ID\" = " + DBS
++ ".\"DB_ID\"" + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + 
".\"NAME\" = ? and " + TBLS
++ ".\"TBL_NAME\" = ? ";
 String queryText = null;
-long start = 0;
-long end = 0;
 
 boolean doTrace = LOG.isDebugEnabled();
 ForwardQueryResult fqr = null;
 // Check if the status of all the columns of all the partitions exists
 // Extrapolation is not needed.
 if (areAllPartsFound) {
-  queryText = commonPrefix + " and \"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
-  + " and \"ENGINE\" = ? "
-  + " group by \"COLUMN_NAME\", \"COLUMN_TYPE\"";
-  start = doTrace ? System.nanoTime() : 0;
-  try (QueryWrapper query = new 
QueryWrapper(pm.newQuery("javax.jdo.query.SQL", queryText))) {
-Object qResult = executeWithArray(query.getInnerQuery(),
-prepareParams(catName, dbName, tableName, partNames, colNames,
-engine), queryText);
-if (qResult == null) {
-  return Collections.emptyList();
-}
-end = doTrace ? System.nanoTime() : 0;
-MetastoreDirectSqlUtils.timingTrace(doTrace, queryText, start, end);
-List list = MetastoreDirectSqlUtils.ensureList(qResult);
-List colStats =
-new ArrayList(list.size());
-for (Object[] row : list) {
-  colStats.add(prepareCSObjWithAdjustedNDV(row, 0,
-  useDensityFunctionForNDVEstimation, ndvTuner));
+  queryText = commonPrefix + " and \"COLUMN_NAME\" in (%1$s)" + " and " + 
PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+  + " and \"ENGINE\" = ? " + " group by \"COLUMN_NAME\", 
\"COLUMN_TYPE\"";
+  Batchable b = jobsBatching(queryText, catName, dbName, 
tableName, partNames, engine, doTrace);
+  List colStats = new ArrayList<>(colNames.size());
+  try {
+List list = Batchable.runBatched(batchSize, colNames, b);
+Map> colSubList = columnWiseSubList(list);
+for (Map.Entry> entry : colSubList.entrySet()) {

Review Comment:
   I understand but we will need map to store the redundant column results 
obtained from different batches for same column name and those needs to be 
merged and other issue is we can't be sure of the no. of elements for a 
particular column name which needs to be merged as there are else cases for 
`areAllPartsFound` otherwise we would have gone for sorting the obtained 
results and then merging the results obtained at a particular gap.
   
   currently we are only calling `columnStatisticsObjWithAdjustedNDV` once per 
column as we are calling it w

Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583707418


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2254,6 +2292,69 @@ private List aggrStatsUseDB(String 
catName, String dbName,
 }
   }
 
+  private ColumnStatisticsObj 
columnStatisticsObjWithAdjustedNDV(List list, int i,
+ boolean 
useDensityFunctionForNDVEstimation, double ndvTuner)
+  throws MetaException {

Review Comment:
   I am not sure about this as this method is replacing 
`prepareCSObjWithAdjustedNDV` method and it is only limited to `aggrStatsUseDB` 
and not other places



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583697128


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2254,6 +2292,69 @@ private List aggrStatsUseDB(String 
catName, String dbName,
 }
   }
 
+  private ColumnStatisticsObj 
columnStatisticsObjWithAdjustedNDV(List list, int i,
+ boolean 
useDensityFunctionForNDVEstimation, double ndvTuner)
+  throws MetaException {
+if (list.isEmpty()) {
+  return null;
+}
+ColumnStatisticsData data = new ColumnStatisticsData();
+int j = i;
+Object[] row = list.getFirst();
+String colName = (String) row[j++];
+String colType = (String) row[j++];
+ColumnStatisticsObj cso = new ColumnStatisticsObj(colName, colType, data);
+Object llow = row[j++];
+Object lhigh = row[j++];
+Object dlow = row[j++];
+Object dhigh = row[j++];
+Object declow = row[j++];
+Object dechigh = row[j++];
+Object nulls = row[j++];
+Object dist = row[j++];
+Object avglen = row[j++];
+Object maxlen = row[j++];
+Object trues = row[j++];
+Object falses = row[j++];
+Object sumLong = row[j++];
+Object countLong = row[j++];
+Object sumDouble = row[j++];
+Object countDouble = row[j++];
+Object sumDecimal = row[j++];
+Object countDecimal = row[j++];
+Object sumDist = row[j];
+for (int k = 1; k < list.size(); k++) {

Review Comment:
   I don't think we should do that as current loops start with `index =1`  and 
it is able to replace the method `prepareCSObjWithAdjustedNDV` and if we try to 
implement it in the other way the we will need to take care of the conditionals 
and return statement inside loop which will not be cleaner than current one



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583697643


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2254,6 +2292,69 @@ private List aggrStatsUseDB(String 
catName, String dbName,
 }
   }
 
+  private ColumnStatisticsObj 
columnStatisticsObjWithAdjustedNDV(List list, int i,
+ boolean 
useDensityFunctionForNDVEstimation, double ndvTuner)
+  throws MetaException {
+if (list.isEmpty()) {
+  return null;
+}
+ColumnStatisticsData data = new ColumnStatisticsData();
+int j = i;
+Object[] row = list.getFirst();
+String colName = (String) row[j++];
+String colType = (String) row[j++];
+ColumnStatisticsObj cso = new ColumnStatisticsObj(colName, colType, data);
+Object llow = row[j++];
+Object lhigh = row[j++];
+Object dlow = row[j++];
+Object dhigh = row[j++];
+Object declow = row[j++];
+Object dechigh = row[j++];
+Object nulls = row[j++];
+Object dist = row[j++];
+Object avglen = row[j++];
+Object maxlen = row[j++];
+Object trues = row[j++];
+Object falses = row[j++];
+Object sumLong = row[j++];
+Object countLong = row[j++];
+Object sumDouble = row[j++];
+Object countDouble = row[j++];
+Object sumDecimal = row[j++];
+Object countDecimal = row[j++];
+Object sumDist = row[j];
+for (int k = 1; k < list.size(); k++) {
+  j = i + 2;
+  row = list.get(k);
+  llow = MetastoreDirectSqlUtils.min(llow, row[j++]);
+  lhigh = MetastoreDirectSqlUtils.max(lhigh, row[j++]);
+  dlow = MetastoreDirectSqlUtils.min(dlow, row[j++]);
+  dhigh = MetastoreDirectSqlUtils.max(dhigh, row[j++]);
+  declow = MetastoreDirectSqlUtils.min(declow, row[j++]);
+  dechigh = MetastoreDirectSqlUtils.max(dechigh, row[j++]);
+  nulls = MetastoreDirectSqlUtils.sum(nulls, row[j++]);
+  dist = MetastoreDirectSqlUtils.max(dist, row[j++]);
+  avglen = MetastoreDirectSqlUtils.max(avglen, row[j++]);
+  maxlen = MetastoreDirectSqlUtils.max(maxlen, row[j++]);
+  trues = MetastoreDirectSqlUtils.sum(trues, row[j++]);
+  falses = MetastoreDirectSqlUtils.sum(falses, row[j++]);
+  sumLong = MetastoreDirectSqlUtils.sum(sumLong, row[j++]);
+  countLong = MetastoreDirectSqlUtils.sum(countLong, row[j++]);
+  sumDouble = MetastoreDirectSqlUtils.sum(sumDouble, row[j++]);
+  countDouble = MetastoreDirectSqlUtils.sum(countDouble, row[j++]);
+  sumDecimal = MetastoreDirectSqlUtils.sum(sumDecimal, row[j++]);
+  countDecimal = MetastoreDirectSqlUtils.sum(countDecimal, row[j++]);
+  sumDist = MetastoreDirectSqlUtils.sum(sumDist, row[j]);

Review Comment:
   done, added constants in place of it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583688671


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2254,6 +2292,69 @@ private List aggrStatsUseDB(String 
catName, String dbName,
 }
   }
 
+  private ColumnStatisticsObj 
columnStatisticsObjWithAdjustedNDV(List list, int i,
+ boolean 
useDensityFunctionForNDVEstimation, double ndvTuner)
+  throws MetaException {
+if (list.isEmpty()) {
+  return null;
+}
+ColumnStatisticsData data = new ColumnStatisticsData();
+int j = i;
+Object[] row = list.getFirst();
+String colName = (String) row[j++];
+String colType = (String) row[j++];
+ColumnStatisticsObj cso = new ColumnStatisticsObj(colName, colType, data);
+Object llow = row[j++];
+Object lhigh = row[j++];
+Object dlow = row[j++];
+Object dhigh = row[j++];
+Object declow = row[j++];
+Object dechigh = row[j++];
+Object nulls = row[j++];
+Object dist = row[j++];
+Object avglen = row[j++];
+Object maxlen = row[j++];
+Object trues = row[j++];
+Object falses = row[j++];
+Object sumLong = row[j++];
+Object countLong = row[j++];
+Object sumDouble = row[j++];
+Object countDouble = row[j++];
+Object sumDecimal = row[j++];
+Object countDecimal = row[j++];

Review Comment:
   done, added constants in place of it



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583687667


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2254,6 +2292,69 @@ private List aggrStatsUseDB(String 
catName, String dbName,
 }
   }
 
+  private ColumnStatisticsObj 
columnStatisticsObjWithAdjustedNDV(List list, int i,

Review Comment:
   done, renamed list to `columnBatchesOutput` and removed it as it was not 
needed
   and you also need to use `-Dhive.stats.fetch.bitvector=false` also to step 
into this method as it is associated with `aggrStatsUseDB`. 
   and if you want to compare results then first run the command without 
`-Dhive.metastore.direct.sql.batch.size=1000` and store the results and then 
compare those results with including 
`-Dhive.metastore.direct.sql.batch.size=1000` to compare the results produced 
after this.
   
   this need to be done because `aggrStatsUseDB` sometimes produces different 
results from expected as with disabled bit vector and disabled kll sketch the 
exact ndv  estimation is not possible in some cases so approximate results are 
produced by `aggrStatsUseDB` and these changes are keeping the `aggrStatsUseDB` 
batched results consistent with `aggrStatsUseDB` unbatched results.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583655581


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2174,76 +2240,48 @@ private List aggrStatsUseDB(String 
catName, String dbName,
 || IExtrapolatePartStatus.aggrTypes[colStatIndex] == 
IExtrapolatePartStatus.AggrType.Max) {
   // if the aggregation type is min/max, we extrapolate from the
   // left/right borders
-  if (!decimal) {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from 
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + 
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + 
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + 
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " order by \"" + colStatName + "\"";
-  } else {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from 
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + 
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + 
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + 
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " order by cast(\"" + colStatName + "\" as decimal)";
-  }
-  start = doTrace ? System.nanoTime() : 0;
-  try (QueryWrapper query = new 
QueryWrapper(pm.newQuery("javax.jdo.query.SQL", queryText))) {
-Object qResult = executeWithArray(query.getInnerQuery(),
-prepareParams(catName, dbName, tableName, partNames, 
Arrays.asList(colName), engine), queryText);
-if (qResult == null) {
-  return Collections.emptyList();
+  String orderByExpr = decimal ? "cast(\"" + colStatName + "\" as 
decimal)" : "\"" + colStatName + "\"";
+
+  queryText =
+  "select \"" + colStatName + "\",\"PART_NAME\" from " + 
PART_COL_STATS + " inner join " + PARTITIONS
+  + " on " + PART_COL_STATS + ".\"PART_ID\" = " + 
PARTITIONS + ".\"PART_ID\"" + " inner join "
+  + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + TBLS + 
".\"TBL_ID\"" + " inner join " + DBS
+  + " on " + TBLS + ".\"DB_ID\" = " + DBS + ".\"DB_ID\"" + 
" where " + DBS
+  + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? and " 
+ TBLS + ".\"TBL_NAME\" = ? " + " and "
+  + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)" + " and 
" + PARTITIONS
+  + ".\"PART_NAME\" in (%2$s)" + " and " + PART_COL_STATS 
+ ".\"ENGINE\" = ? " + " order by "
+  + orderByExpr;

Review Comment:
   done restored the formatting



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583654940


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1977,116 +2020,96 @@ private List 
aggrStatsUseDB(String catName, String dbName,
 // And, we also guarantee that the estimation makes sense by comparing 
it to the
 // UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
 // and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
-+ 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? ";
++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
++ 
"count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as 
decimal)),"
++ 
"sum((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
++ 
"count((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
++ "sum((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "count((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + "" + " inner 
join " + PARTITIONS + " on "
++ PART_COL_STATS + ".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\"" + " 
inner join " + TBLS + " on " + PARTITIONS
++ ".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\"" + " inner join " + DBS + " 
on " + TBLS + ".\"DB_ID\" = " + DBS
++ ".\"DB_ID\"" + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + 
".\"NAME\" = ? and " + TBLS
++ ".\"TBL_NAME\" = ? ";
 String queryText = null;
-long start = 0;
-long end = 0;
 
 boolean doTrace = LOG.isDebugEnabled();
 ForwardQueryResult fqr = null;
 // Check if the status of all the columns of all the partitions exists
 // Extrapolation is not needed.
 if (areAllPartsFound) {
-  queryText = commonPrefix + " and \"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
-  + " and \"ENGINE\" = ? "
-  + " group by \"COLUMN_NAME\", \"COLUMN_TYPE\"";
-  start = doTrace ? System.nanoTime() : 0;
-  try (QueryWrapper query = new 
QueryWrapper(pm.newQuery("javax.jdo.query.SQL", queryText))) {
-Object qResult = executeWithArray(query.getInnerQuery(),
-prepareParams(catName, dbName, tableName, partNames, colNames,
-engine), queryText);
-if (qResult == null) {
-  return Collections.emptyList();
-}
-end = doTrace ? System.nanoTime() : 0;
-MetastoreDirectSqlUtils.timingTrace(doTrace, queryText, start, end);
-List list = MetastoreDirectSqlUtils.ensureList(qResult);
-List colStats =
-new ArrayList(list.size());
-for (Object[] row : list) {
-  colStats.add(prepareCSObjWithAdjustedNDV(row, 0,
-  useDensityFunctionForNDVEstimation, ndvTuner));
+  queryText = commonPrefix + " and \"COLUMN_NAME\" in (%1$s)" + " and " + 
PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+  + " and \"ENGINE\" = ? " + " group by \"COLUMN_NAME\", 
\"COLUMN_TYPE\"";
+  Batchable b = jobsBatching(queryText, catName, dbName, 
tableName, partNames, engine, doTrace);
+  List colStats = new ArrayList<>(colNames.size());
+  try {
+List list = Batchable.runBatched(batchSize, colNames, b);
+Map> colSubList = columnWiseSubList(list);
+for (Map.Entry> entry : colSubList.entrySet()) {
+  colStats.add(columnStatisticsObjWithAdjustedNDV(entry.getValue(), 0, 
useDensityFunctionForNDVEstimation, ndvTuner));
   Deadline.checkTimeout();
 }
-return colStats;
+  } finally {
+b.closeAllQueries();
   }
+  return colStats;
 } else {
   // Extrapolation is needed for some columns.
   // In this case, at least a column status for a partition is missing.
   // We need to extrapolate this partition based on the other partitions
   List colStats = new 
ArrayList(colNames.size());
-  query

Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583652938


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1942,6 +1937,43 @@ private List 
columnStatisticsObjForPartitionsBatch(String c
 }
   }
 
+  private Batchable jobsBatching(final String queryText0, 
final String catName, final String dbName,
+  final String tableName, final List partNames, final String 
engine, final boolean doTrace) {
+return new Batchable() {
+  @Override
+  public List run(final List inputColNames)
+  throws MetaException {
+Batchable b2 = new Batchable() {

Review Comment:
   done, I went with `partitionBatchesFetcher`, it was sounding more clearer to 
me



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583646648


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1942,6 +1937,43 @@ private List 
columnStatisticsObjForPartitionsBatch(String c
 }
   }
 
+  private Batchable jobsBatching(final String queryText0, 
final String catName, final String dbName,

Review Comment:
   done renamed to `columnWisePartitionBatcher`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583648972


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1977,116 +2020,96 @@ private List 
aggrStatsUseDB(String catName, String dbName,
 // And, we also guarantee that the estimation makes sense by comparing 
it to the
 // UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
 // and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
-+ 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? ";
++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
++ 
"count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as 
decimal)),"
++ 
"sum((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
++ 
"count((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
++ "sum((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "count((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + "" + " inner 
join " + PARTITIONS + " on "
++ PART_COL_STATS + ".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\"" + " 
inner join " + TBLS + " on " + PARTITIONS
++ ".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\"" + " inner join " + DBS + " 
on " + TBLS + ".\"DB_ID\" = " + DBS
++ ".\"DB_ID\"" + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + 
".\"NAME\" = ? and " + TBLS
++ ".\"TBL_NAME\" = ? ";
 String queryText = null;
-long start = 0;
-long end = 0;
 
 boolean doTrace = LOG.isDebugEnabled();
 ForwardQueryResult fqr = null;
 // Check if the status of all the columns of all the partitions exists
 // Extrapolation is not needed.
 if (areAllPartsFound) {
-  queryText = commonPrefix + " and \"COLUMN_NAME\" in (" + 
makeParams(colNames.size()) + ")"
-  + " and " + PARTITIONS + ".\"PART_NAME\" in (" + 
makeParams(partNames.size()) + ")"
-  + " and \"ENGINE\" = ? "
-  + " group by \"COLUMN_NAME\", \"COLUMN_TYPE\"";
-  start = doTrace ? System.nanoTime() : 0;
-  try (QueryWrapper query = new 
QueryWrapper(pm.newQuery("javax.jdo.query.SQL", queryText))) {
-Object qResult = executeWithArray(query.getInnerQuery(),
-prepareParams(catName, dbName, tableName, partNames, colNames,
-engine), queryText);
-if (qResult == null) {
-  return Collections.emptyList();
-}
-end = doTrace ? System.nanoTime() : 0;
-MetastoreDirectSqlUtils.timingTrace(doTrace, queryText, start, end);
-List list = MetastoreDirectSqlUtils.ensureList(qResult);
-List colStats =
-new ArrayList(list.size());
-for (Object[] row : list) {
-  colStats.add(prepareCSObjWithAdjustedNDV(row, 0,
-  useDensityFunctionForNDVEstimation, ndvTuner));
+  queryText = commonPrefix + " and \"COLUMN_NAME\" in (%1$s)" + " and " + 
PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+  + " and \"ENGINE\" = ? " + " group by \"COLUMN_NAME\", 
\"COLUMN_TYPE\"";
+  Batchable b = jobsBatching(queryText, catName, dbName, 
tableName, partNames, engine, doTrace);
+  List colStats = new ArrayList<>(colNames.size());
+  try {
+List list = Batchable.runBatched(batchSize, colNames, b);
+Map> colSubList = columnWiseSubList(list);
+for (Map.Entry> entry : colSubList.entrySet()) {
+  colStats.add(columnStatisticsObjWithAdjustedNDV(entry.getValue(), 0, 
useDensityFunctionForNDVEstimation, ndvTuner));
   Deadline.checkTimeout();
 }
-return colStats;
+  } finally {
+b.closeAllQueries();
   }
+  return colStats;
 } else {
   // Extrapolation is needed for some columns.
   // In this case, at least a column status for a partition is missing.
   // We need to extrapolate this partition based on the other partitions
   List colStats = new 
ArrayList(colNames.size());
-  query

Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583648609


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1977,116 +2020,96 @@ private List 
aggrStatsUseDB(String catName, String dbName,
 // And, we also guarantee that the estimation makes sense by comparing 
it to the
 // UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
 // and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
-+ 
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + 
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + 
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + 
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? 
and " + TBLS + ".\"TBL_NAME\" = ? ";
++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" 
as decimal)),"
++ 
"count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as 
decimal)),"
++ 
"sum((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
++ 
"count((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
++ "sum((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "count((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + "" + " inner 
join " + PARTITIONS + " on "
++ PART_COL_STATS + ".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\"" + " 
inner join " + TBLS + " on " + PARTITIONS
++ ".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\"" + " inner join " + DBS + " 
on " + TBLS + ".\"DB_ID\" = " + DBS
++ ".\"DB_ID\"" + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + 
".\"NAME\" = ? and " + TBLS
++ ".\"TBL_NAME\" = ? ";

Review Comment:
   done, restored the original one



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583646648


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1942,6 +1937,43 @@ private List 
columnStatisticsObjForPartitionsBatch(String c
 }
   }
 
+  private Batchable jobsBatching(final String queryText0, 
final String catName, final String dbName,

Review Comment:
   done renamed to columnWisePartitionBatcher



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3602358839

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [45 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-12-02 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3601903089

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [45 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-11-30 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3593774118

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [45 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-11-30 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3592985471

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [61 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-11-30 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3592523773

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [61 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-11-30 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3592468808

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [61 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-11-29 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3591895211

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [61 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-11-26 Thread via GitHub


dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2564992611


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2254,6 +2292,69 @@ private List aggrStatsUseDB(String 
catName, String dbName,
 }
   }
 
+  private ColumnStatisticsObj 
columnStatisticsObjWithAdjustedNDV(List list, int i,
+ boolean 
useDensityFunctionForNDVEstimation, double ndvTuner)
+  throws MetaException {

Review Comment:
   can this method move into `StatObjectConverter`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-11-26 Thread via GitHub


dengzhhu653 commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3581136612

   I think we can make the `aggrStatsUseDB` be simpler, 
1. run the below SQL batched by part names:
```java
   String query  = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
   + "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"), 
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
   + "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)), 
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
   + "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
   + "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"), 
sum(\"NUM_FALSES\"), "
   // The following data is used to compute a partitioned table's NDV 
based
   // on partitions' NDV when useDensityFunctionForNDVEstimation = 
true. Global NDVs cannot be
   // accurately derived from partition NDVs, because the domain of 
column value two partitions
   // can overlap. If there is no overlap then global NDV is just the 
sum
   // of partition NDVs (UpperBound). But if there is some overlay then
   // global NDV can be anywhere between sum of partition NDVs (no 
overlap)
   // and same as one of the partition NDV (domain of column value in 
all other
   // partitions is subset of the domain value in one of the partition)
   // (LowerBound).But under uniform distribution, we can roughly 
estimate the global
   // NDV by leveraging the min/max values.
   // And, we also guarantee that the estimation makes sense by 
comparing it to the
   // UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
   // and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
   + 
"sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as 
decimal)),"
   + 
"sum((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
   + "sum((cast(\"BIG_DECIMAL_HIGH_VALUE\" as 
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
   + "count(PARTITIONS + ".\"PART_ID\"),"
   + "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + "" + " 
inner join " + PARTITIONS + " on "
   + PART_COL_STATS + ".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\"" + 
" inner join " + TBLS + " on " + PARTITIONS
   + ".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\"" + " inner join " + DBS + " 
on " + TBLS + ".\"DB_ID\" = " + DBS
   + ".\"DB_ID\"" + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + 
".\"NAME\" = ? and " + TBLS
   + ".\"TBL_NAME\" = ?  and \"COLUMN_NAME\" in (%1$s)" + " and " + 
PARTITIONS + ".\"PART_NAME\" in (%2$s)"
 + " and \"ENGINE\" = ? " + " group by \"COLUMN_NAME\", 
\"COLUMN_TYPE\"";
   ```
  2. For each batch, merge the returned rows as we does in this PR:
  ```java
 try {
   List list = Batchable.runBatched(batchSize, partNames, b);
   Map> colSubList = columnWiseSubList(list);
   for (Map.Entry> entry : 
colSubList.entrySet()) {
 colStats.add(columnStatisticsObjWithAdjustedNDV(entry.getValue(), 
0, useDensityFunctionForNDVEstimation, ndvTuner));
 Deadline.checkTimeout();
   }
 } finally {
   b.closeAllQueries();
 }
   ```
3.  for each `ColumnStatisticsObj` got from step 2, change the `sum` by 
`sum * partNames.size() / count(PARTITIONS + ".\"PART_ID\") for this column`

   By this, we can make the  `aggrStatsUseDB` simple and clear, and don't send 
the unnecessary queries back and forth to backing db.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-11-25 Thread via GitHub


kasakrisz commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2560044635


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2254,6 +2292,69 @@ private List aggrStatsUseDB(String 
catName, String dbName,
 }
   }
 
+  private ColumnStatisticsObj 
columnStatisticsObjWithAdjustedNDV(List list, int i,
+ boolean 
useDensityFunctionForNDVEstimation, double ndvTuner)
+  throws MetaException {
+if (list.isEmpty()) {
+  return null;
+}
+ColumnStatisticsData data = new ColumnStatisticsData();
+int j = i;
+Object[] row = list.getFirst();
+String colName = (String) row[j++];
+String colType = (String) row[j++];
+ColumnStatisticsObj cso = new ColumnStatisticsObj(colName, colType, data);
+Object llow = row[j++];
+Object lhigh = row[j++];
+Object dlow = row[j++];
+Object dhigh = row[j++];
+Object declow = row[j++];
+Object dechigh = row[j++];
+Object nulls = row[j++];
+Object dist = row[j++];
+Object avglen = row[j++];
+Object maxlen = row[j++];
+Object trues = row[j++];
+Object falses = row[j++];
+Object sumLong = row[j++];
+Object countLong = row[j++];
+Object sumDouble = row[j++];
+Object countDouble = row[j++];
+Object sumDecimal = row[j++];
+Object countDecimal = row[j++];
+Object sumDist = row[j];
+for (int k = 1; k < list.size(); k++) {

Review Comment:
   Can this be replaced with
   ```
   for (Object[] row : list)
   ```
   ?



##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1942,6 +1937,43 @@ private List 
columnStatisticsObjForPartitionsBatch(String c
 }
   }
 
+  private Batchable jobsBatching(final String queryText0, 
final String catName, final String dbName,

Review Comment:
   Can this method be used for other than loading per partition column 
statistics? If not could you please rename it how more specific name.



##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2254,6 +2292,69 @@ private List aggrStatsUseDB(String 
catName, String dbName,
 }
   }
 
+  private ColumnStatisticsObj 
columnStatisticsObjWithAdjustedNDV(List list, int i,
+ boolean 
useDensityFunctionForNDVEstimation, double ndvTuner)
+  throws MetaException {
+if (list.isEmpty()) {
+  return null;
+}
+ColumnStatisticsData data = new ColumnStatisticsData();
+int j = i;
+Object[] row = list.getFirst();
+String colName = (String) row[j++];
+String colType = (String) row[j++];
+ColumnStatisticsObj cso = new ColumnStatisticsObj(colName, colType, data);
+Object llow = row[j++];
+Object lhigh = row[j++];
+Object dlow = row[j++];
+Object dhigh = row[j++];
+Object declow = row[j++];
+Object dechigh = row[j++];
+Object nulls = row[j++];
+Object dist = row[j++];
+Object avglen = row[j++];
+Object maxlen = row[j++];
+Object trues = row[j++];
+Object falses = row[j++];
+Object sumLong = row[j++];
+Object countLong = row[j++];
+Object sumDouble = row[j++];
+Object countDouble = row[j++];
+Object sumDecimal = row[j++];
+Object countDecimal = row[j++];
+Object sumDist = row[j];
+for (int k = 1; k < list.size(); k++) {
+  j = i + 2;
+  row = list.get(k);
+  llow = MetastoreDirectSqlUtils.min(llow, row[j++]);
+  lhigh = MetastoreDirectSqlUtils.max(lhigh, row[j++]);
+  dlow = MetastoreDirectSqlUtils.min(dlow, row[j++]);
+  dhigh = MetastoreDirectSqlUtils.max(dhigh, row[j++]);
+  declow = MetastoreDirectSqlUtils.min(declow, row[j++]);
+  dechigh = MetastoreDirectSqlUtils.max(dechigh, row[j++]);
+  nulls = MetastoreDirectSqlUtils.sum(nulls, row[j++]);
+  dist = MetastoreDirectSqlUtils.max(dist, row[j++]);
+  avglen = MetastoreDirectSqlUtils.max(avglen, row[j++]);
+  maxlen = MetastoreDirectSqlUtils.max(maxlen, row[j++]);
+  trues = MetastoreDirectSqlUtils.sum(trues, row[j++]);
+  falses = MetastoreDirectSqlUtils.sum(falses, row[j++]);
+  sumLong = MetastoreDirectSqlUtils.sum(sumLong, row[j++]);
+  countLong = MetastoreDirectSqlUtils.sum(countLong, row[j++]);
+  sumDouble = MetastoreDirectSqlUtils.sum(sumDouble, row[j++]);
+  countDouble = MetastoreDirectSqlUtils.sum(countDouble, row[j++]);
+  sumDecimal = MetastoreDirectSqlUtils.sum(sumDecimal, row[j++]);
+  countDecimal = MetastoreDirectSqlUtils.sum(countDecimal, row[j++]);
+  sumDist = MetastoreDirectSqlUtils.sum(sumDist, row[j]);

Review Comment:
   Please use constant indices instead of `j++`.



##
standalone-metastore/metastore-server/src/

Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-11-25 Thread via GitHub


ramitg254 commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3575750117

   > @ramitg254 Thanks for reporting this bug and working on the fix. I have a 
few questions to get a better picture of the issue:
   > 
   > > I ran a script which executes the queries of pattern cbo_* and query* 
under perf directory with
   > 
   > 1. Did you actually execute all TPC-DS queries or just compile them? The 
driver `TestTezTPCDS30TBPerfCliDriver` doesn't execute the queries since the 
data is not available. It uses a Postgres HMS backend db dump to simulate an 
environment where the TPC-DS schema exists and calls Hive's SQL compiler using 
the `explain` and `explain cbo` commands.
   > 2. Do the numbers you shared in the tables (Apache master version, with 
aggrStatsUseDB/without aggrStatsUseDB and with batching/without batching) show 
the overall compilation time of all queries?
   > 3. I haven't found any new tests in this patch, no golden file changes 
either. Could you please provide a minimal repro of the issue? Please don't 
copy-paste any q file of the tpc-ds queries. IIUC this issue should be 
reproducible with a table having a few partitions and a batch size smaller than 
the number of partitions. Unit tests are also welcome.
   
   This only addresses the fact that stats results shouldn't change after batch 
size is defined to some value.
   
   1. I didn't ran the actual queries only the simulation used by the 
`TestTezTPCDS30TBPerfCliDriver` and ran those explain queries as you mentioned 
and recorded their time elapsed. 
   
   2. yes this the overall compilation time of the queries executed via 
`TestTezTPCDS30TBPerfCliDriver` under different scenarios
   
   3. I'll try to add some unit test to replicate this behaviour.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-11-25 Thread via GitHub


kasakrisz commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3575499587

   @ramitg254 
   Thanks for reporting this bug and working on the fix. I have a few questions 
to get a better picture of the issue:
   
   > I ran a script which executes the queries of pattern cbo_* and query* 
under perf directory with
   
   1. Did you actually execute all TPC-DS queries or just compile them? The 
driver `TestTezTPCDS30TBPerfCliDriver` doesn't execute the queries since the 
data is not available. It uses a Postgres HMS backend db dump to simulate an 
environment where the TPC-DS schema exists and calls Hive's SQL compiler using 
the `explain` and `explain cbo` commands.
   
   2. Do the numbers you shared in the tables (Apache master version, with 
aggrStatsUseDB/without aggrStatsUseDB and with batching/without batching) show 
the overall compilation time of all queries?
   
   3. I haven't found any new tests in this patch, no golden file changes 
either. Could you please provide a minimal repro of the issue? Please don't 
copy-paste any q file of the tpc-ds queries. IIUC this issue should be 
reproducible with a table having a few partitions and a batch size smaller than 
the number of partitions. Unit tests are also welcome. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-11-20 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3557376806

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [17 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-11-19 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3556284727

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [17 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-11-19 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3554262804

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [17 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-11-18 Thread via GitHub


ramitg254 commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3547579033

   Hi,
   I ran a script which executes the queries of pattern cbo_* and query* under 
perf directory with `-Dhive.stats.fetch.bitvector=false`  and with the postgres 
dump the overall time elapsed results were like below:
   https://github.com/user-attachments/assets/0a577bdd-c4b7-49a7-9ddf-26ce35ca5343";
 />
   
   and since there is significant gap in overall time elapsed to execute all 
the queries and also results of some queries varies after the drop of 
`aggrStatsUseDB` in case of `-Dhive.stats.fetch.bitvector=false`. Considering 
this performance drop and varied results I don't think we should  go ahead with 
dropping of `aggrStatsUseDB`.
   
   Since currently that's the case so I have raised this current patch which 
addresses my original concern of handling the behaviour of merging in case of 
direct sql batch size defined and in this patch I am just adding merging logic 
which and no change in the earlier behaviour so there won't be any performance 
issues as well  but similar like I tested it out with same set of queries and 
config and results which I saved earlier originally get generated via 
`aggrStatsUseDB` didn't change with this logic as well and their time elapsed 
results are somewhat like this:
   https://github.com/user-attachments/assets/7241c86b-08a6-406f-b589-87e7b8be0ceb";
 />
   
   @dengzhhu653 @deniskuzZ @kasakrisz @saihemanth-cloudera  please share your 
thoughts on this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-11-18 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3547021003

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [31 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-11-16 Thread via GitHub


sonarqubecloud[bot] commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3539012333

   ## [![Quality Gate 
Passed](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/checks/QualityGateBadge/qg-passed-20px.png
 'Quality Gate 
Passed')](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) 
**Quality Gate passed**  
   Issues  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [31 New 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/accepted-16px.png
 '') [0 Accepted 
issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED)
   
   Measures  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0 Security 
Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Coverage on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)
  
   
![](https://sonarsource.github.io/sonarcloud-github-static-resources/v2/common/passed-16px.png
 '') [0.0% Duplication on New 
Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list)
  
 
   [See analysis details on SonarQube 
Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-11-03 Thread via GitHub


ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2486500601


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1953,306 +1942,6 @@ private List 
aggrStatsUseJava(String catName, String dbName
 areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
   }
 
-  private List aggrStatsUseDB(String catName, String 
dbName,

Review Comment:
   @saihemanth-cloudera  as per HIVE-16997 bit vector is set to true for tpcds 
tests even if default value is false making it go back to `aggrStatsUseJava` 
and even if i set it to false to fall back to `aggrStatsUseDB` the results are 
different for .out files of tpcds queries.
   So my concern is do I need to first obtain those different results generated 
via `aggrStatsUseDB` and then run performance testing while validating with 
those results.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-10-18 Thread via GitHub


dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2375233641


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1880,14 +1880,8 @@ private List 
columnStatisticsObjForPartitions(
 return Batchable.runBatched(batchSize, colNames, new Batchable() {
   @Override
   public List run(final List inputColNames) 
throws MetaException {
-return Batchable.runBatched(batchSize, partNames, new 
Batchable() {
-  @Override
-  public List run(List inputPartNames) 
throws MetaException {
-return columnStatisticsObjForPartitionsBatch(catName, dbName, 
tableName, inputPartNames,
-inputColNames, engine, areAllPartsFound, 
useDensityFunctionForNDVEstimation, ndvTuner,
-enableBitVector, enableKll);
-  }
-});
+return columnStatisticsObjForPartitionsBatch(catName, dbName, 
tableName, partNames, inputColNames, engine,

Review Comment:
   nit: can we just aggrStatsUseJava directly here and remove the 
columnStatisticsObjForPartitionsBatch



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-10-18 Thread via GitHub


saihemanth-cloudera commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2414814623


##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1953,306 +1942,6 @@ private List 
aggrStatsUseJava(String catName, String dbName
 areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
   }
 
-  private List aggrStatsUseDB(String catName, String 
dbName,

Review Comment:
   > Do we have tests using other dbs than Derby and the PostgreSQL image?
   Oracle has a 1000 parameter limit. It would be nice to run performance 
testing on PostgreSQL, Oracle, mysql before removing this optimization.



##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1953,306 +1942,6 @@ private List 
aggrStatsUseJava(String catName, String dbName
 areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
   }
 
-  private List aggrStatsUseDB(String catName, String 
dbName,

Review Comment:
   > Do we have tests using other dbs than Derby and the PostgreSQL image?
   
   Oracle has a 1000 parameter limit. It would be nice to run performance 
testing on PostgreSQL, Oracle, mysql before removing this optimization.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]

2025-10-18 Thread via GitHub


ramitg254 commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3351135571

   > > ### How was this patch tested?
   > > by running tpcds tests locally with setting 
hive.metastore.direct.sql.batch.size as 1000 in HiveConf.java and 
MetastoreConf.java.
   > 
   > Hi @ramitg254, Would the tpcds tests fail without this PR? BTW, What is 
the size of the tpcds test dataset?
   
   Hi @zhangbutao, without this pr tpcds tests will fail whenever 
`hive.metastore.direct.sql.batch.size` > 0 in cases partition size is greater 
than the batch size.
   there were around 55 tpcds test failures when property is set to 1000, and 
one such example is query16.q 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



  1   2   >