Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3790293809 Thanks! @dengzhhu653 @kasakrisz @saihemanth-cloudera @deniskuzZ for review -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 merged PR #6089: URL: https://github.com/apache/hive/pull/6089 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3788242767 Filed the HIVE-29416 to track the optimization. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3777259188 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [65 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [1.8% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2711755061
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1957,307 +2010,224 @@ private List
aggrStatsUseJava(String catName, String dbName
areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
}
- private List aggrStatsUseDB(String catName, String
dbName,
- String tableName, List partNames, List colNames, String
engine,
- boolean areAllPartsFound, boolean useDensityFunctionForNDVEstimation,
double ndvTuner) throws MetaException {
+ private List aggrStatsUseDB(String catName, String
dbName, String tableName,
+ List partNames,
List colNames, String engine,
+ boolean
useDensityFunctionForNDVEstimation, double ndvTuner)
+ throws MetaException {
// TODO: all the extrapolation logic should be moved out of this class,
// only mechanical data retrieval should remain here.
-String commonPrefix = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
-+ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"),
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
-+ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)),
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
-+ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
-+ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), "
-// The following data is used to compute a partitioned table's NDV
based
-// on partitions' NDV when useDensityFunctionForNDVEstimation = true.
Global NDVs cannot be
-// accurately derived from partition NDVs, because the domain of
column value two partitions
-// can overlap. If there is no overlap then global NDV is just the sum
-// of partition NDVs (UpperBound). But if there is some overlay then
-// global NDV can be anywhere between sum of partition NDVs (no
overlap)
-// and same as one of the partition NDV (domain of column value in all
other
-// partitions is subset of the domain value in one of the partition)
-// (LowerBound).But under uniform distribution, we can roughly
estimate the global
-// NDV by leveraging the min/max values.
-// And, we also guarantee that the estimation makes sense by comparing
it to the
-// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
-// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\"
as decimal)),"
-+
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? ";
-String queryText = null;
-long start = 0;
-long end = 0;
+String queryText = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
++ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"),
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
++ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)),
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
++ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
++ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), "
+// The following data is used to compute a partitioned table's NDV
based
+// on partitions' NDV when useDensityFunctionForNDVEstimation =
true. Global NDVs cannot be
+// accurately derived from partition NDVs, because the domain of
column value two partitions
+// can overlap. If there is no overlap then global NDV is just the
sum
+// of partition NDVs (UpperBound). But if there is some overlay
then
+// global NDV can be anywhere between sum of partition NDVs (no
overlap)
+// and same as one of the partition NDV (domain of column value in
all other
+// partitions is subset of the domain value in one of the
partition)
+// (LowerBound).But under uniform distribution, we can roughly
estimate the global
+// NDV by leveraging the min/max values.
+// And, we also guarantee that the estimation makes sense by
comparing it to the
+// UpperBound (c
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3776771237 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [71 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [1.8% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2710878500
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1957,307 +2010,224 @@ private List
aggrStatsUseJava(String catName, String dbName
areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
}
- private List aggrStatsUseDB(String catName, String
dbName,
- String tableName, List partNames, List colNames, String
engine,
- boolean areAllPartsFound, boolean useDensityFunctionForNDVEstimation,
double ndvTuner) throws MetaException {
+ private List aggrStatsUseDB(String catName, String
dbName, String tableName,
+ List partNames,
List colNames, String engine,
+ boolean
useDensityFunctionForNDVEstimation, double ndvTuner)
+ throws MetaException {
// TODO: all the extrapolation logic should be moved out of this class,
// only mechanical data retrieval should remain here.
-String commonPrefix = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
-+ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"),
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
-+ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)),
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
-+ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
-+ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), "
-// The following data is used to compute a partitioned table's NDV
based
-// on partitions' NDV when useDensityFunctionForNDVEstimation = true.
Global NDVs cannot be
-// accurately derived from partition NDVs, because the domain of
column value two partitions
-// can overlap. If there is no overlap then global NDV is just the sum
-// of partition NDVs (UpperBound). But if there is some overlay then
-// global NDV can be anywhere between sum of partition NDVs (no
overlap)
-// and same as one of the partition NDV (domain of column value in all
other
-// partitions is subset of the domain value in one of the partition)
-// (LowerBound).But under uniform distribution, we can roughly
estimate the global
-// NDV by leveraging the min/max values.
-// And, we also guarantee that the estimation makes sense by comparing
it to the
-// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
-// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\"
as decimal)),"
-+
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? ";
-String queryText = null;
-long start = 0;
-long end = 0;
+String queryText = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
++ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"),
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
++ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)),
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
++ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
++ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), "
+// The following data is used to compute a partitioned table's NDV
based
+// on partitions' NDV when useDensityFunctionForNDVEstimation =
true. Global NDVs cannot be
+// accurately derived from partition NDVs, because the domain of
column value two partitions
+// can overlap. If there is no overlap then global NDV is just the
sum
+// of partition NDVs (UpperBound). But if there is some overlay
then
+// global NDV can be anywhere between sum of partition NDVs (no
overlap)
+// and same as one of the partition NDV (domain of column value in
all other
+// partitions is subset of the domain value in one of the
partition)
+// (LowerBound).But under uniform distribution, we can roughly
estimate the global
+// NDV by leveraging the min/max values.
+// And, we also guarantee that the estimation makes sense by
comparing it to the
+// UpperBound
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3774543723 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [413 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [1.8% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2709521876
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1884,14 +1904,8 @@ private List
columnStatisticsObjForPartitions(
return Batchable.runBatched(batchSize, colNames, new Batchable() {
Review Comment:
done moved to `DirectSqlAggrStats.java`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2709518413
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1957,307 +2010,224 @@ private List
aggrStatsUseJava(String catName, String dbName
areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
}
- private List aggrStatsUseDB(String catName, String
dbName,
- String tableName, List partNames, List colNames, String
engine,
- boolean areAllPartsFound, boolean useDensityFunctionForNDVEstimation,
double ndvTuner) throws MetaException {
+ private List aggrStatsUseDB(String catName, String
dbName, String tableName,
+ List partNames,
List colNames, String engine,
+ boolean
useDensityFunctionForNDVEstimation, double ndvTuner)
+ throws MetaException {
// TODO: all the extrapolation logic should be moved out of this class,
// only mechanical data retrieval should remain here.
-String commonPrefix = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
-+ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"),
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
-+ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)),
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
-+ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
-+ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), "
-// The following data is used to compute a partitioned table's NDV
based
-// on partitions' NDV when useDensityFunctionForNDVEstimation = true.
Global NDVs cannot be
-// accurately derived from partition NDVs, because the domain of
column value two partitions
-// can overlap. If there is no overlap then global NDV is just the sum
-// of partition NDVs (UpperBound). But if there is some overlay then
-// global NDV can be anywhere between sum of partition NDVs (no
overlap)
-// and same as one of the partition NDV (domain of column value in all
other
-// partitions is subset of the domain value in one of the partition)
-// (LowerBound).But under uniform distribution, we can roughly
estimate the global
-// NDV by leveraging the min/max values.
-// And, we also guarantee that the estimation makes sense by comparing
it to the
-// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
-// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\"
as decimal)),"
-+
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? ";
-String queryText = null;
-long start = 0;
-long end = 0;
+String queryText = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
++ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"),
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
++ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)),
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
++ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
++ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), "
+// The following data is used to compute a partitioned table's NDV
based
+// on partitions' NDV when useDensityFunctionForNDVEstimation =
true. Global NDVs cannot be
+// accurately derived from partition NDVs, because the domain of
column value two partitions
+// can overlap. If there is no overlap then global NDV is just the
sum
+// of partition NDVs (UpperBound). But if there is some overlay
then
+// global NDV can be anywhere between sum of partition NDVs (no
overlap)
+// and same as one of the partition NDV (domain of column value in
all other
+// partitions is subset of the domain value in one of the
partition)
+// (LowerBound).But under uniform distribution, we can roughly
estimate the global
+// NDV by leveraging the min/max values.
+// And, we also guarantee that the estimation makes sense by
comparing it to the
+// UpperBound (c
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2708622327
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1957,307 +2010,224 @@ private List
aggrStatsUseJava(String catName, String dbName
areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
}
- private List aggrStatsUseDB(String catName, String
dbName,
- String tableName, List partNames, List colNames, String
engine,
- boolean areAllPartsFound, boolean useDensityFunctionForNDVEstimation,
double ndvTuner) throws MetaException {
+ private List aggrStatsUseDB(String catName, String
dbName, String tableName,
+ List partNames,
List colNames, String engine,
+ boolean
useDensityFunctionForNDVEstimation, double ndvTuner)
+ throws MetaException {
// TODO: all the extrapolation logic should be moved out of this class,
// only mechanical data retrieval should remain here.
-String commonPrefix = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
-+ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"),
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
-+ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)),
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
-+ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
-+ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), "
-// The following data is used to compute a partitioned table's NDV
based
-// on partitions' NDV when useDensityFunctionForNDVEstimation = true.
Global NDVs cannot be
-// accurately derived from partition NDVs, because the domain of
column value two partitions
-// can overlap. If there is no overlap then global NDV is just the sum
-// of partition NDVs (UpperBound). But if there is some overlay then
-// global NDV can be anywhere between sum of partition NDVs (no
overlap)
-// and same as one of the partition NDV (domain of column value in all
other
-// partitions is subset of the domain value in one of the partition)
-// (LowerBound).But under uniform distribution, we can roughly
estimate the global
-// NDV by leveraging the min/max values.
-// And, we also guarantee that the estimation makes sense by comparing
it to the
-// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
-// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\"
as decimal)),"
-+
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? ";
-String queryText = null;
-long start = 0;
-long end = 0;
+String queryText = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
++ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"),
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
++ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)),
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
++ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
++ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), "
+// The following data is used to compute a partitioned table's NDV
based
+// on partitions' NDV when useDensityFunctionForNDVEstimation =
true. Global NDVs cannot be
+// accurately derived from partition NDVs, because the domain of
column value two partitions
+// can overlap. If there is no overlap then global NDV is just the
sum
+// of partition NDVs (UpperBound). But if there is some overlay
then
+// global NDV can be anywhere between sum of partition NDVs (no
overlap)
+// and same as one of the partition NDV (domain of column value in
all other
+// partitions is subset of the domain value in one of the
partition)
+// (LowerBound).But under uniform distribution, we can roughly
estimate the global
+// NDV by leveraging the min/max values.
+// And, we also guarantee that the estimation makes sense by
comparing it to the
+// UpperBound (c
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2708622327
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1957,307 +2010,224 @@ private List
aggrStatsUseJava(String catName, String dbName
areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
}
- private List aggrStatsUseDB(String catName, String
dbName,
- String tableName, List partNames, List colNames, String
engine,
- boolean areAllPartsFound, boolean useDensityFunctionForNDVEstimation,
double ndvTuner) throws MetaException {
+ private List aggrStatsUseDB(String catName, String
dbName, String tableName,
+ List partNames,
List colNames, String engine,
+ boolean
useDensityFunctionForNDVEstimation, double ndvTuner)
+ throws MetaException {
// TODO: all the extrapolation logic should be moved out of this class,
// only mechanical data retrieval should remain here.
-String commonPrefix = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
-+ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"),
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
-+ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)),
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
-+ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
-+ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), "
-// The following data is used to compute a partitioned table's NDV
based
-// on partitions' NDV when useDensityFunctionForNDVEstimation = true.
Global NDVs cannot be
-// accurately derived from partition NDVs, because the domain of
column value two partitions
-// can overlap. If there is no overlap then global NDV is just the sum
-// of partition NDVs (UpperBound). But if there is some overlay then
-// global NDV can be anywhere between sum of partition NDVs (no
overlap)
-// and same as one of the partition NDV (domain of column value in all
other
-// partitions is subset of the domain value in one of the partition)
-// (LowerBound).But under uniform distribution, we can roughly
estimate the global
-// NDV by leveraging the min/max values.
-// And, we also guarantee that the estimation makes sense by comparing
it to the
-// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
-// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\"
as decimal)),"
-+
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? ";
-String queryText = null;
-long start = 0;
-long end = 0;
+String queryText = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
++ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"),
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
++ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)),
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
++ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
++ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), "
+// The following data is used to compute a partitioned table's NDV
based
+// on partitions' NDV when useDensityFunctionForNDVEstimation =
true. Global NDVs cannot be
+// accurately derived from partition NDVs, because the domain of
column value two partitions
+// can overlap. If there is no overlap then global NDV is just the
sum
+// of partition NDVs (UpperBound). But if there is some overlay
then
+// global NDV can be anywhere between sum of partition NDVs (no
overlap)
+// and same as one of the partition NDV (domain of column value in
all other
+// partitions is subset of the domain value in one of the
partition)
+// (LowerBound).But under uniform distribution, we can roughly
estimate the global
+// NDV by leveraging the min/max values.
+// And, we also guarantee that the estimation makes sense by
comparing it to the
+// UpperBound (c
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2707224857
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1884,14 +1904,8 @@ private List
columnStatisticsObjForPartitions(
return Batchable.runBatched(batchSize, colNames, new Batchable() {
Review Comment:
could you please move all `aggrColStatsForPartitions` related methods to a
new class, similar to `DirectSqlUpdatePart`
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1884,14 +1904,8 @@ private List
columnStatisticsObjForPartitions(
return Batchable.runBatched(batchSize, colNames, new Batchable() {
Review Comment:
Could you please move all `aggrColStatsForPartitions` related methods to a
new class, similar to `DirectSqlUpdatePart`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2707211354
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1957,307 +2010,224 @@ private List
aggrStatsUseJava(String catName, String dbName
areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
}
- private List aggrStatsUseDB(String catName, String
dbName,
- String tableName, List partNames, List colNames, String
engine,
- boolean areAllPartsFound, boolean useDensityFunctionForNDVEstimation,
double ndvTuner) throws MetaException {
+ private List aggrStatsUseDB(String catName, String
dbName, String tableName,
+ List partNames,
List colNames, String engine,
+ boolean
useDensityFunctionForNDVEstimation, double ndvTuner)
+ throws MetaException {
// TODO: all the extrapolation logic should be moved out of this class,
// only mechanical data retrieval should remain here.
-String commonPrefix = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
-+ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"),
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
-+ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)),
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
-+ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
-+ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), "
-// The following data is used to compute a partitioned table's NDV
based
-// on partitions' NDV when useDensityFunctionForNDVEstimation = true.
Global NDVs cannot be
-// accurately derived from partition NDVs, because the domain of
column value two partitions
-// can overlap. If there is no overlap then global NDV is just the sum
-// of partition NDVs (UpperBound). But if there is some overlay then
-// global NDV can be anywhere between sum of partition NDVs (no
overlap)
-// and same as one of the partition NDV (domain of column value in all
other
-// partitions is subset of the domain value in one of the partition)
-// (LowerBound).But under uniform distribution, we can roughly
estimate the global
-// NDV by leveraging the min/max values.
-// And, we also guarantee that the estimation makes sense by comparing
it to the
-// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
-// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\"
as decimal)),"
-+
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? ";
-String queryText = null;
-long start = 0;
-long end = 0;
+String queryText = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
++ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"),
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
++ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)),
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
++ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
++ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), "
+// The following data is used to compute a partitioned table's NDV
based
+// on partitions' NDV when useDensityFunctionForNDVEstimation =
true. Global NDVs cannot be
+// accurately derived from partition NDVs, because the domain of
column value two partitions
+// can overlap. If there is no overlap then global NDV is just the
sum
+// of partition NDVs (UpperBound). But if there is some overlay
then
+// global NDV can be anywhere between sum of partition NDVs (no
overlap)
+// and same as one of the partition NDV (domain of column value in
all other
+// partitions is subset of the domain value in one of the
partition)
+// (LowerBound).But under uniform distribution, we can roughly
estimate the global
+// NDV by leveraging the min/max values.
+// And, we also guarantee that the estimation makes sense by
comparing it to the
+// UpperBound
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2707211354
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1957,307 +2010,224 @@ private List
aggrStatsUseJava(String catName, String dbName
areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
}
- private List aggrStatsUseDB(String catName, String
dbName,
- String tableName, List partNames, List colNames, String
engine,
- boolean areAllPartsFound, boolean useDensityFunctionForNDVEstimation,
double ndvTuner) throws MetaException {
+ private List aggrStatsUseDB(String catName, String
dbName, String tableName,
+ List partNames,
List colNames, String engine,
+ boolean
useDensityFunctionForNDVEstimation, double ndvTuner)
+ throws MetaException {
// TODO: all the extrapolation logic should be moved out of this class,
// only mechanical data retrieval should remain here.
-String commonPrefix = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
-+ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"),
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
-+ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)),
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
-+ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
-+ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), "
-// The following data is used to compute a partitioned table's NDV
based
-// on partitions' NDV when useDensityFunctionForNDVEstimation = true.
Global NDVs cannot be
-// accurately derived from partition NDVs, because the domain of
column value two partitions
-// can overlap. If there is no overlap then global NDV is just the sum
-// of partition NDVs (UpperBound). But if there is some overlay then
-// global NDV can be anywhere between sum of partition NDVs (no
overlap)
-// and same as one of the partition NDV (domain of column value in all
other
-// partitions is subset of the domain value in one of the partition)
-// (LowerBound).But under uniform distribution, we can roughly
estimate the global
-// NDV by leveraging the min/max values.
-// And, we also guarantee that the estimation makes sense by comparing
it to the
-// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
-// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\"
as decimal)),"
-+
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? ";
-String queryText = null;
-long start = 0;
-long end = 0;
+String queryText = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
++ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"),
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
++ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)),
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
++ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
++ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), "
+// The following data is used to compute a partitioned table's NDV
based
+// on partitions' NDV when useDensityFunctionForNDVEstimation =
true. Global NDVs cannot be
+// accurately derived from partition NDVs, because the domain of
column value two partitions
+// can overlap. If there is no overlap then global NDV is just the
sum
+// of partition NDVs (UpperBound). But if there is some overlay
then
+// global NDV can be anywhere between sum of partition NDVs (no
overlap)
+// and same as one of the partition NDV (domain of column value in
all other
+// partitions is subset of the domain value in one of the
partition)
+// (LowerBound).But under uniform distribution, we can roughly
estimate the global
+// NDV by leveraging the min/max values.
+// And, we also guarantee that the estimation makes sense by
comparing it to the
+// UpperBound
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2707208061
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1957,307 +2010,224 @@ private List
aggrStatsUseJava(String catName, String dbName
areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
}
- private List aggrStatsUseDB(String catName, String
dbName,
- String tableName, List partNames, List colNames, String
engine,
- boolean areAllPartsFound, boolean useDensityFunctionForNDVEstimation,
double ndvTuner) throws MetaException {
+ private List aggrStatsUseDB(String catName, String
dbName, String tableName,
+ List partNames,
List colNames, String engine,
+ boolean
useDensityFunctionForNDVEstimation, double ndvTuner)
+ throws MetaException {
// TODO: all the extrapolation logic should be moved out of this class,
// only mechanical data retrieval should remain here.
-String commonPrefix = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
-+ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"),
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
-+ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)),
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
-+ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
-+ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), "
-// The following data is used to compute a partitioned table's NDV
based
-// on partitions' NDV when useDensityFunctionForNDVEstimation = true.
Global NDVs cannot be
-// accurately derived from partition NDVs, because the domain of
column value two partitions
-// can overlap. If there is no overlap then global NDV is just the sum
-// of partition NDVs (UpperBound). But if there is some overlay then
-// global NDV can be anywhere between sum of partition NDVs (no
overlap)
-// and same as one of the partition NDV (domain of column value in all
other
-// partitions is subset of the domain value in one of the partition)
-// (LowerBound).But under uniform distribution, we can roughly
estimate the global
-// NDV by leveraging the min/max values.
-// And, we also guarantee that the estimation makes sense by comparing
it to the
-// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
-// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\"
as decimal)),"
-+
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? ";
-String queryText = null;
-long start = 0;
-long end = 0;
+String queryText = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
++ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"),
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
++ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)),
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
++ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
++ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), "
+// The following data is used to compute a partitioned table's NDV
based
+// on partitions' NDV when useDensityFunctionForNDVEstimation =
true. Global NDVs cannot be
+// accurately derived from partition NDVs, because the domain of
column value two partitions
+// can overlap. If there is no overlap then global NDV is just the
sum
+// of partition NDVs (UpperBound). But if there is some overlay
then
+// global NDV can be anywhere between sum of partition NDVs (no
overlap)
+// and same as one of the partition NDV (domain of column value in
all other
+// partitions is subset of the domain value in one of the
partition)
+// (LowerBound).But under uniform distribution, we can roughly
estimate the global
+// NDV by leveraging the min/max values.
+// And, we also guarantee that the estimation makes sense by
comparing it to the
+// UpperBound
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3769336143 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [23 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [3.2% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3768189816 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [23 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [3.2% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2704473151
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String
catName, String dbName,
+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? "
- + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " +
PART_COL_STATS + ".\"COLUMN_TYPE\"";
Review Comment:
@dengzhhu653 removed `areAllPartsFound` and tried to keep `aggrStatsUseDB`
simple. Please have a look
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2703738929
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2159,102 +2187,180 @@ private List
aggrStatsUseDB(String catName, String dbName,
if (index == null) {
index = IExtrapolatePartStatus.indexMaps.get("default");
}
+
+ //for avg calculation
+ queryText = "select " +
"sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as
decimal)),"
+ +
"sum((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
+ + "sum((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
+ + "count(1)"
+ + " from " + PART_COL_STATS + ""
+ + " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
+ + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = "
+ TBLS + ".\"TBL_ID\""
+ + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
+ + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\"
= ? and " + TBLS + ".\"TBL_NAME\" = ? "
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ + " group by \"COLUMN_NAME\"";
+
+ columnWisePartitionBatches =
+ columnWisePartitionBatcher(queryText, catName, dbName,
tableName, partNames, engine, doTrace);
+ Integer[] sumNdvIndices = new Integer[]{14, 15, 16};
+ try {
+list = Batchable.runBatched(batchSize,
Collections.singletonList(colName), columnWisePartitionBatches);
+for (Object[] batch : list) {
+ // filling in ndv sums for avg calculation
+ for (int i = 0; i < 3; i++) {
+row[sumNdvIndices[i]] =
MetastoreDirectSqlUtils.sum(row[sumNdvIndices[i]], batch[i]);
+ }
+ // filling in count rows for avg calculation
+ row[17] = MetastoreDirectSqlUtils.sum(row[17], batch[3]);
+}
+ } finally {
+columnWisePartitionBatches.closeAllQueries();
+ }
for (int colStatIndex : index) {
String colStatName =
IExtrapolatePartStatus.colStatNames[colStatIndex];
// if the aggregation type is sum, we do a scale-up
if (IExtrapolatePartStatus.aggrTypes[colStatIndex] ==
IExtrapolatePartStatus.AggrType.Sum) {
Object o = sumMap.get(colName).get(colStatIndex);
+ // +3 only for the case of SUM_NUM_DISTINCTS which is after
count rows index
+ int rowIndex = (colStatIndex == 15) ? colStatIndex + 3 :
colStatIndex + 2;
if (o == null) {
-row[2 + colStatIndex] = null;
+row[rowIndex] = null;
} else {
Long val = MetastoreDirectSqlUtils.extractSqlLong(o);
-row[2 + colStatIndex] = val / sumVal * (partNames.size());
+row[rowIndex] = val / sumVal * (partNames.size());
}
} else if (IExtrapolatePartStatus.aggrTypes[colStatIndex] ==
IExtrapolatePartStatus.AggrType.Min
|| IExtrapolatePartStatus.aggrTypes[colStatIndex] ==
IExtrapolatePartStatus.AggrType.Max) {
// if the aggregation type is min/max, we extrapolate from the
// left/right borders
- if (!decimal) {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS +
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " +
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS +
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " order by \"" + colStatName + "\"";
- } else {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS +
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2674852462
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String
catName, String dbName,
+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? "
- + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " +
PART_COL_STATS + ".\"COLUMN_TYPE\"";
Review Comment:
@dengzhhu653 is there anything else that I should address?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2674852462
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String
catName, String dbName,
+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? "
- + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " +
PART_COL_STATS + ".\"COLUMN_TYPE\"";
Review Comment:
@dengzhhu653 is there anything else that needed to be taken care?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2652181949
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String
catName, String dbName,
+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? "
- + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " +
PART_COL_STATS + ".\"COLUMN_TYPE\"";
Review Comment:
yes
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2652059815
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String
catName, String dbName,
+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? "
- + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " +
PART_COL_STATS + ".\"COLUMN_TYPE\"";
Review Comment:
this extrapolation is just based on the existing stats on colA, am I right?
here the `val` is sum(colA) of the partition: part1,part2,part3, sumVal is
3, and partNames.size() is 5.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3694658584 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [21 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [0.0% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2649590246
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String
catName, String dbName,
+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? "
- + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " +
PART_COL_STATS + ".\"COLUMN_TYPE\"";
Review Comment:
yes as you mentioned some assumption is need for missing two stats so this
else block is taking care of that only as it is first finding the the available
stats which are already there then extrapolate its results with some logic like:
```
if (IExtrapolatePartStatus.aggrTypes[colStatIndex] ==
IExtrapolatePartStatus.AggrType.Sum) {
Object o = sumMap.get(colName).get(colStatIndex);
// +3 only for the case of SUM_NUM_DISTINCTS which is after
count rows index
int rowIndex = (colStatIndex == 15) ? colStatIndex + 3 :
colStatIndex + 2;
if (o == null) {
row[rowIndex] = null;
} else {
Long val = MetastoreDirectSqlUtils.extractSqlLong(o);
row[rowIndex] = val / sumVal * (partNames.size());
}
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2638281029
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String
catName, String dbName,
+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? "
- + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " +
PART_COL_STATS + ".\"COLUMN_TYPE\"";
Review Comment:
I think this is not fully right. let's take an example:
In db, colA has the stats for part1, part2, part3, and colB has stats for
part1,part2,part3,par4,part5
If we want to aggregate the column stats among the
partition(part1,part2,part3), as colA and colB both have the partition column
stats, we just fetch the stats and aggregate them.
however if we want the aggregated stats for
partition(part1,part2,part3,part4,par5), as colB has all of them, so just
aggregate them. While for colA, as part4 and part5 are missing, we need some
assumptions on those two missing stats.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2638281029
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String
catName, String dbName,
+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? "
- + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " +
PART_COL_STATS + ".\"COLUMN_TYPE\"";
Review Comment:
I think this is not fully right. let's take an example:
In db, colA has the stats for part1, part2, part3, and colB has stats for
part1,part2,part3,par4,part5
If we want to aggregate the column stats among the
partition(part1,part2,part3), as colA and colB both have the partition column
stats, we just fetch the stats and aggregate them.
however if we want the aggregated stats for
partition(part1,part2,part3,part4,par5), as colB has all of them, so just
aggregate them. While for colA, as part4 and part5 is missing, we need some
assumptions on those two missing stats.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3674066508 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [21 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [0.0% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2634078220
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String
catName, String dbName,
+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? "
- + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " +
PART_COL_STATS + ".\"COLUMN_TYPE\"";
Review Comment:
like let's say for colA stats computed when table has 5 parts and for colB
stats computed when table has 8 parts
so here `areAllPartsFound` will be false
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2634066049
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String
catName, String dbName,
+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? "
- + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " +
PART_COL_STATS + ".\"COLUMN_TYPE\"";
Review Comment:
correct me if I'm wrong:
we have `ANALYZE TABLE {table name} PARTITION({part cols}) COMPUTE
STATISTICS FOR COLUMNS {columns passed};`
so due to this query there can be scenarios where partitions found for a
column of a table where stats have beeen computed can vary from column to
column of a table and in that case partNames will be fixed and
`areAllPartsFound` can be false and we need to handle this scenario as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2634066049
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String
catName, String dbName,
+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? "
- + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " +
PART_COL_STATS + ".\"COLUMN_TYPE\"";
Review Comment:
correct me if I'm wrong:
we have `ANALYZE TABLE {table name} PARTITION({part cols}) COMPUTE
STATISTICS FOR COLUMNS {columns passed};`
so due to this query there can be scenarios where partitions found for a
column of a table where stats have beeen computed can vary from column to
column of a table and in that case `partNames ` will be fixed and
`areAllPartsFound` can be false and we need to handle this scenario as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2634066049
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String
catName, String dbName,
+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? "
- + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " +
PART_COL_STATS + ".\"COLUMN_TYPE\"";
Review Comment:
correct me if I'm wrong:
we have ANALYZE TABLE {table name} PARTITION({part cols}) COMPUTE STATISTICS
FOR COLUMNS {columns passed};
so due to this query there can be scenarios where partitions found for a
column of a table where stats have beeen computed can vary from column to
column of a table and in that case partNames will be fixed and
`areAllPartsFound` can be false and we need to handle this scenario as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2634019511
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String
catName, String dbName,
+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? "
- + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " +
PART_COL_STATS + ".\"COLUMN_TYPE\"";
Review Comment:
This method aggregates the column stats for `partNames`, this is known
beforehand, why do we should take account of the "new partitions"?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2633995211
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String
catName, String dbName,
+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? "
- + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " +
PART_COL_STATS + ".\"COLUMN_TYPE\"";
Review Comment:
also this else branch of `areAllPartsFound` should not be removed as it is
helpful in cases when statistics are computed before and some new partitions
are added later on so some partition stats will be missing in that we will need
this branch for that.
I have added the test case `testMissingPartsStatsAggrWithBackendDB` for this
scenario in `TestObjectStore.java` as well
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2633985059
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String
catName, String dbName,
+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? "
- + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " +
PART_COL_STATS + ".\"COLUMN_TYPE\"";
Review Comment:
I agree this is same and can be merged with common prefix but we should not
do that as this query fetches way less parameters than `commonPrefix` one and
it would be faster than it and also this query is necessary prerequisite for
finding `noExtraColumnNames` later on which uses `commonPrefix` query so we
can't just store the result of query execution of `commonPrefix` one once and
use for both this prerequisite and for `noExtraColumnNames` as columns needed
to be passed to both of them will be different.
but I have replaced `count("PART_COL_STATS"."PART_ID")` with `count(1)`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2633903272
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -165,6 +166,26 @@ class MetaStoreDirectSql {
private DirectSqlUpdatePart directSqlUpdatePart;
private DirectSqlInsertPart directSqlInsertPart;
+ private static final int COLNAME = 0;
Review Comment:
done added as an enum in the IExtrapolatePartStatus.java
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3670668944 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [16 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [0.0% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2629934599
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String
catName, String dbName,
+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? "
- + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " +
PART_COL_STATS + ".\"COLUMN_TYPE\"";
Review Comment:
the `areAllPartsFound` makes little sense here, we can remove this branch
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2629878760
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -165,6 +166,26 @@ class MetaStoreDirectSql {
private DirectSqlUpdatePart directSqlUpdatePart;
private DirectSqlInsertPart directSqlInsertPart;
+ private static final int COLNAME = 0;
Review Comment:
personally I dont like this number of static fields, maybe we could move
them somewhere, like IExtrapolatePartStatus
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2629860466
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2029,65 +2064,48 @@ private List aggrStatsUseDB(String
catName, String dbName,
+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? "
- + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\", " +
PART_COL_STATS + ".\"COLUMN_TYPE\"";
Review Comment:
In this query, the `count("PART_COL_STATS"."PART_ID")` would be the same as
`count(1)` in the `commonPrefix`, so this query can be merged as well.
- For each column name, if the count == partNames.size() or count < 2, then
we simply merge the stats;
- otherwise, we need extrapolation for this column,
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3659478674 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [14 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [0.0% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089: URL: https://github.com/apache/hive/pull/6089#discussion_r2622247286 ## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java: ## @@ -1977,47 +2034,29 @@ private List aggrStatsUseDB(String catName, String dbName, // And, we also guarantee that the estimation makes sense by comparing it to the // UpperBound (calculated by "sum(\"NUM_DISTINCTS\")") // and LowerBound (calculated by "max(\"NUM_DISTINCTS\")") -+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," -+ "avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\")," -+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\")," ++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," ++ "count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," Review Comment: my bad, done updated with count(1) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089: URL: https://github.com/apache/hive/pull/6089#discussion_r2621939204 ## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java: ## @@ -1977,47 +2034,29 @@ private List aggrStatsUseDB(String catName, String dbName, // And, we also guarantee that the estimation makes sense by comparing it to the // UpperBound (calculated by "sum(\"NUM_DISTINCTS\")") // and LowerBound (calculated by "max(\"NUM_DISTINCTS\")") -+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," -+ "avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\")," -+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\")," ++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," ++ "count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," Review Comment: and also if I am understanding it right only one out of long, double and big decimal expression will be not null according to col type so it will need checks to find it earlier in order to use count(1) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089: URL: https://github.com/apache/hive/pull/6089#discussion_r2621939204 ## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java: ## @@ -1977,47 +2034,29 @@ private List aggrStatsUseDB(String catName, String dbName, // And, we also guarantee that the estimation makes sense by comparing it to the // UpperBound (calculated by "sum(\"NUM_DISTINCTS\")") // and LowerBound (calculated by "max(\"NUM_DISTINCTS\")") -+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," -+ "avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\")," -+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\")," ++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," ++ "count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," Review Comment: and also if I am understanding it right only one out of long, double and big decimal expression will be not null according to col type so it will need checks to find it earlier in order to use count(1) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089: URL: https://github.com/apache/hive/pull/6089#discussion_r2621910371 ## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java: ## @@ -1977,47 +2034,29 @@ private List aggrStatsUseDB(String catName, String dbName, // And, we also guarantee that the estimation makes sense by comparing it to the // UpperBound (calculated by "sum(\"NUM_DISTINCTS\")") // and LowerBound (calculated by "max(\"NUM_DISTINCTS\")") -+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," -+ "avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\")," -+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\")," ++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," ++ "count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," Review Comment: are you saying that instead of count(expression) we can use count(1) as whenever it is not null all rows will be counted? but isn't this count(expr) is more readable and understandable -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3658972819 > could you please consider this comments? [#6089 (comment)](https://github.com/apache/hive/pull/6089#discussion_r2589091584) [#6089 (comment)](https://github.com/apache/hive/pull/6089#discussion_r2589103939) @dengzhhu653 I have already considered and implemented these comments: 1. https://github.com/apache/hive/pull/6089#discussion_r2621892496 2. https://github.com/apache/hive/pull/6089#discussion_r2621896928 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2621896928
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2108,33 +2126,44 @@ private List aggrStatsUseDB(String
catName, String dbName,
+ " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\"";
-start = doTrace ? System.nanoTime() : 0;
-try (QueryWrapper query = new
QueryWrapper(pm.newQuery("javax.jdo.query.SQL", queryText))) {
- List extraColumnNames = new ArrayList();
+
+columnWisePartitionBatches =
+columnWisePartitionBatcher(queryText, catName, dbName,
tableName, partNames, engine, doTrace);
+try {
+ List extraColumnNames = new ArrayList<>();
extraColumnNames.addAll(extraColumnNameTypeParts.keySet());
- Object qResult = executeWithArray(query.getInnerQuery(),
- prepareParams(catName, dbName, tableName, partNames,
- extraColumnNames, engine), queryText);
- if (qResult == null) {
-return Collections.emptyList();
+ List unmergedList = Batchable.runBatched(batchSize,
extraColumnNames, columnWisePartitionBatches);
+ Map colSumStatsMap = new HashMap<>();
+ for (Object[] row : unmergedList) {
Review Comment:
using unmergedList and map for mergedRow as mentioned in:
https://github.com/apache/hive/pull/6089#discussion_r2589103939
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2621892496
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2159,102 +2188,185 @@ private List
aggrStatsUseDB(String catName, String dbName,
if (index == null) {
index = IExtrapolatePartStatus.indexMaps.get("default");
}
+
+ //for avg calculation
+ queryText = "select " +
"sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as
decimal)),"
+ +
"count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as
decimal)),"
+ +
"sum((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
+ +
"count((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
+ + "sum((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
+ + "count((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
+ + " from " + PART_COL_STATS + ""
+ + " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
+ + " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = "
+ TBLS + ".\"TBL_ID\""
+ + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
+ + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\"
= ? and " + TBLS + ".\"TBL_NAME\" = ? "
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ + " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
+ + " group by \"COLUMN_NAME\"";
+
+ columnWisePartitionBatches =
+ columnWisePartitionBatcher(queryText, catName, dbName,
tableName, partNames, engine, doTrace);
+ Object[] sum = new Object[3];
+ Object[] count = new Object[3];
+ Integer[] avgIndex = new Integer[]{14, 15, 16};
+ try {
+list = Batchable.runBatched(batchSize,
Collections.singletonList(colName), columnWisePartitionBatches);
+for (int i = 0; i < 6; i += 2) {
+ for (Object[] batch : list) {
+sum[i / 2] = MetastoreDirectSqlUtils.sum(sum[i / 2], batch[i]);
+count[i / 2] = MetastoreDirectSqlUtils.sum(count[i / 2],
batch[i + 1]);
+ }
+ // filling in sum and count in row for avg calculation later on
+ row[avgIndex[i / 2] + i / 2] = sum[i / 2];
+ row[avgIndex[i / 2] + i / 2 + 1] = count[i / 2];
+}
+ } finally {
+columnWisePartitionBatches.closeAllQueries();
+ }
for (int colStatIndex : index) {
String colStatName =
IExtrapolatePartStatus.colStatNames[colStatIndex];
// if the aggregation type is sum, we do a scale-up
if (IExtrapolatePartStatus.aggrTypes[colStatIndex] ==
IExtrapolatePartStatus.AggrType.Sum) {
Object o = sumMap.get(colName).get(colStatIndex);
+ // +5 only for the case of SUM_NUM_DISTINCTS which is after avg
indices
+ int rowIndex = (colStatIndex == 15) ? colStatIndex + 5 :
colStatIndex + 2;
if (o == null) {
-row[2 + colStatIndex] = null;
+row[rowIndex] = null;
} else {
Long val = MetastoreDirectSqlUtils.extractSqlLong(o);
-row[2 + colStatIndex] = val / sumVal * (partNames.size());
+row[rowIndex] = val / sumVal * (partNames.size());
}
} else if (IExtrapolatePartStatus.aggrTypes[colStatIndex] ==
IExtrapolatePartStatus.AggrType.Min
|| IExtrapolatePartStatus.aggrTypes[colStatIndex] ==
IExtrapolatePartStatus.AggrType.Max) {
// if the aggregation type is min/max, we extrapolate from the
// left/right borders
- if (!decimal) {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS +
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " +
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS +
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3658797922 could you please consider this comments? https://github.com/apache/hive/pull/6089#discussion_r2589091584 https://github.com/apache/hive/pull/6089#discussion_r2589103939 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 commented on code in PR #6089: URL: https://github.com/apache/hive/pull/6089#discussion_r2621735849 ## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java: ## @@ -1977,47 +2034,29 @@ private List aggrStatsUseDB(String catName, String dbName, // And, we also guarantee that the estimation makes sense by comparing it to the // UpperBound (calculated by "sum(\"NUM_DISTINCTS\")") // and LowerBound (calculated by "max(\"NUM_DISTINCTS\")") -+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," -+ "avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\")," -+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\")," ++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," ++ "count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," Review Comment: looks a single count `count(1)` is enough for me, `count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))` `count((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\")` `count((cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\")` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 commented on code in PR #6089: URL: https://github.com/apache/hive/pull/6089#discussion_r2621735849 ## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java: ## @@ -1977,47 +2034,29 @@ private List aggrStatsUseDB(String catName, String dbName, // And, we also guarantee that the estimation makes sense by comparing it to the // UpperBound (calculated by "sum(\"NUM_DISTINCTS\")") // and LowerBound (calculated by "max(\"NUM_DISTINCTS\")") -+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," -+ "avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\")," -+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\")," ++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," ++ "count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," Review Comment: looks a single count is enough for me, `count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))` `count((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\")` `count((cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\")` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3645388441 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [16 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [0.0% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3632597922 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [14 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [0.0% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2602634109
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1954,9 +2007,20 @@ private List
aggrStatsUseJava(String catName, String dbName
areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner);
}
- private List aggrStatsUseDB(String catName, String
dbName,
- String tableName, List partNames, List colNames, String
engine,
- boolean areAllPartsFound, boolean useDensityFunctionForNDVEstimation,
double ndvTuner) throws MetaException {
+ private Map> columnWiseSubList(List list) {
+Map> colSubList = new HashMap<>();
+for (Object[] row : list) {
+ String colName = (String) row[0];
+ colSubList.putIfAbsent(colName, new ArrayList<>());
Review Comment:
Thanks for explaining it , I misunderstood it earlier so I have updated it
with this implementation
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2602628155
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2174,86 +2270,132 @@ private List
aggrStatsUseDB(String catName, String dbName,
|| IExtrapolatePartStatus.aggrTypes[colStatIndex] ==
IExtrapolatePartStatus.AggrType.Max) {
// if the aggregation type is min/max, we extrapolate from the
// left/right borders
- if (!decimal) {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS +
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " +
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS +
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " order by \"" + colStatName + "\"";
- } else {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS +
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " +
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS +
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " order by cast(\"" + colStatName + "\" as decimal)";
- }
- start = doTrace ? System.nanoTime() : 0;
- try (QueryWrapper query = new
QueryWrapper(pm.newQuery("javax.jdo.query.SQL", queryText))) {
-Object qResult = executeWithArray(query.getInnerQuery(),
-prepareParams(catName, dbName, tableName, partNames,
Arrays.asList(colName), engine), queryText);
-if (qResult == null) {
- return Collections.emptyList();
-}
-fqr = (ForwardQueryResult) qResult;
-Object[] min = (Object[]) (fqr.get(0));
-Object[] max = (Object[]) (fqr.get(fqr.size() - 1));
-end = doTrace ? System.nanoTime() : 0;
-MetastoreDirectSqlUtils.timingTrace(doTrace, queryText, start,
end);
-if (min[0] == null || max[0] == null) {
- row[2 + colStatIndex] = null;
-} else {
- row[2 + colStatIndex] = extrapolateMethod
- .extrapolate(min, max, colStatIndex, indexMap);
-}
- }
-} else {
- // if the aggregation type is avg, we use the average on the
existing ones.
- queryText = "select "
- +
"avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as
decimal)),"
- +
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
- + "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\")"
- + " from " + PART_COL_STATS + ""
+ String orderByExpr = decimal ? "cast(\"" + colStatName + "\" as
decimal)" : "\"" + colStatName + "\"";
+
+ queryText = "select \"" + colStatName + "\",\"PART_NAME\" from "
+ PART_COL_STATS
+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\"
= " + TBLS + ".\"TBL_ID\""
+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " +
DBS + ".\"DB_ID\""
+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS +
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
- + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2602606147
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2174,86 +2270,132 @@ private List
aggrStatsUseDB(String catName, String dbName,
|| IExtrapolatePartStatus.aggrTypes[colStatIndex] ==
IExtrapolatePartStatus.AggrType.Max) {
// if the aggregation type is min/max, we extrapolate from the
// left/right borders
- if (!decimal) {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS +
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " +
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS +
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " order by \"" + colStatName + "\"";
- } else {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS +
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " +
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS +
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " order by cast(\"" + colStatName + "\" as decimal)";
- }
- start = doTrace ? System.nanoTime() : 0;
- try (QueryWrapper query = new
QueryWrapper(pm.newQuery("javax.jdo.query.SQL", queryText))) {
-Object qResult = executeWithArray(query.getInnerQuery(),
-prepareParams(catName, dbName, tableName, partNames,
Arrays.asList(colName), engine), queryText);
-if (qResult == null) {
- return Collections.emptyList();
-}
-fqr = (ForwardQueryResult) qResult;
-Object[] min = (Object[]) (fqr.get(0));
-Object[] max = (Object[]) (fqr.get(fqr.size() - 1));
-end = doTrace ? System.nanoTime() : 0;
-MetastoreDirectSqlUtils.timingTrace(doTrace, queryText, start,
end);
-if (min[0] == null || max[0] == null) {
- row[2 + colStatIndex] = null;
-} else {
- row[2 + colStatIndex] = extrapolateMethod
- .extrapolate(min, max, colStatIndex, indexMap);
-}
- }
-} else {
- // if the aggregation type is avg, we use the average on the
existing ones.
- queryText = "select "
- +
"avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as
decimal)),"
- +
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
- + "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\")"
- + " from " + PART_COL_STATS + ""
+ String orderByExpr = decimal ? "cast(\"" + colStatName + "\" as
decimal)" : "\"" + colStatName + "\"";
+
+ queryText = "select \"" + colStatName + "\",\"PART_NAME\" from "
+ PART_COL_STATS
+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\"
= " + TBLS + ".\"TBL_ID\""
+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " +
DBS + ".\"DB_ID\""
+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS +
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
- + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
+ + " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)"
+ + " and " + PARTITIONS + ".
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2602575145
##
itests/qtest/src/test/java/org/apache/hadoop/hive/cli/TestTezBatchedStatsCliDriver.java:
##
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.cli;
+
+import org.apache.hadoop.hive.cli.control.CliAdapter;
+import org.apache.hadoop.hive.cli.control.CliConfigs;
+import org.junit.ClassRule;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TestRule;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+import org.junit.runners.Parameterized.Parameters;
+
+import java.io.File;
+import java.util.List;
+
+@RunWith(Parameterized.class)
+public class TestTezBatchedStatsCliDriver {
Review Comment:
Thanks for pointing it, I agree with this and added corresponding test in
`TestObjectStore.java` and removed earlier one
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3629061660 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [14 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [0.0% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
kasakrisz commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2588754346
##
itests/qtest/src/test/java/org/apache/hadoop/hive/cli/TestTezBatchedStatsCliDriver.java:
##
@@ -0,0 +1,64 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hive.cli;
+
+import org.apache.hadoop.hive.cli.control.CliAdapter;
+import org.apache.hadoop.hive.cli.control.CliConfigs;
+import org.junit.ClassRule;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.TestRule;
+import org.junit.runner.RunWith;
+import org.junit.runners.Parameterized;
+import org.junit.runners.Parameterized.Parameters;
+
+import java.io.File;
+import java.util.List;
+
+@RunWith(Parameterized.class)
+public class TestTezBatchedStatsCliDriver {
Review Comment:
IMHO there is no need to add a new driver for testing the
`ObjectStore.get_aggr_stats_for` method. Doing so would imply that we need new
drivers to cover all combinations of `hive.stats.fetch.bitvector`,
`hive.stats.fetch.kll`, `hive.metastore.direct.sql.batch.size`, etc. The number
of drivers would explode if we followed this pattern.
Please take a look at an existing test case. It is not a q-test–based one,
but a good integration test that focuses only on `ObjectStore`.
https://github.com/apache/hive/blob/ca105f8124072d19d88a83b2ced613d326c9a26b/standalone-metastore/metastore-server/src/test/java/org/apache/hadoop/hive/metastore/TestObjectStore.java#L1025-L1047
(The test method name `testAggrStatsUseDB` is misleading, though, because
the test actually calls the `ObjectStore.get_aggr_stats_for` method, which
eventually calls `MetaStoreDirectSql.testAggrStatsUseDB`.)
Configurations can be easily changed in these kinds of tests via the
MetastoreConf object. We also have better ways to write assertions. These tests
run faster than q-tests.
I temporarily added:
```
MetastoreConf.setLongVar(conf2, ConfVars.DIRECT_SQL_PARTITION_BATCH_SIZE, 2);
```
and I was able to reproduce the bug described in
[HIVE-29203](https://issues.apache.org/jira/browse/HIVE-29203). Both with
aggregating in the backend db (`aggrStatsUseDB`) and HMS side
(`aggrStatsUseJava`).
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2174,86 +2270,132 @@ private List
aggrStatsUseDB(String catName, String dbName,
|| IExtrapolatePartStatus.aggrTypes[colStatIndex] ==
IExtrapolatePartStatus.AggrType.Max) {
// if the aggregation type is min/max, we extrapolate from the
// left/right borders
- if (!decimal) {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS +
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " +
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS +
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " order by \"" + colStatName + "\"";
- } else {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS +
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " +
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS +
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + P
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3605393243 and sonar is pointing out indentation issues which i don't think needs to be addressed as I kept it same as we have in those other drive files and config file and others are large brain methods like `aggrStatsUseDB` which are already part of code and splitting it up into separate methods with entire implementation intact can take up time and may break some scenarios addressed in that method -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3605348137 Hi, so it is regarding the unit tests added: 1. why I am going for driver based approach instead of simply adding a q file under an already present driver: -> the `-Dhive.stats.fetch.bitvector` and `-Dhive.metastore.direct.sql.batch.size` can't be set at the query time inside a q file and needed to be set in `hive-site.xml` 2. why going for a docker image instead of a q file loading the data into the table: -> these changes are for scenario where statistics have been computed already and then someone specify this property `hive.metastore.direct.sql.batch.size` and restart hms so in that case results are different. But to replicate this in unit test it won't be possible if I define table and load data in q file as ` it will give ORM related exception when placing debug pointer although getting executed` . So what i did is created a `test_stats` table having data in 13 partitions and then compute statistics for its columns and built a docker image with the stats dump and used that docker image. the q files added are performing self join when `hive.metastore.direct.sql.batch.size=5` and the `hive.metastore.direct.sql.batch.size` is set according to the q file name. you can validate the results via: cherry-picking the unit tests added commit on the top of the current master branch and run: `mvn test -pl itests/qtest -Pitests -Dtest=TestTezBatchedStatsCliDriver -Dqfile=sketch_query.q -Dtest.output.overwrite` similar for `no_sketch_query.q` and you will see different results from expected one added in commit and if you just drop the property `hive.metastore.direct.sql.batch.size` from the `hive-site.xml` added for this driver you will see the expected results which is the validation for corrected expected output with these changes in case of batching. FYI: the stats are added in such a way that overlapping values lie in between batches of size 5 resulting in just taking maximum of ndv is not enough to evaluate ndv calculated without batching as it is the case failing only otherwise it passes even after having redundant per column elements which just simply gets concatenated via simple function like maximum and all on the higher level -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3605270439
> I think we can make the `aggrStatsUseDB` be simpler,
>
> 1. run the below SQL batched by part names:
>
> ```java
>String query = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
>+ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"),
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
>+ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)),
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
>+ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
>+ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), "
>// The following data is used to compute a partitioned table's NDV
based
>// on partitions' NDV when useDensityFunctionForNDVEstimation =
true. Global NDVs cannot be
>// accurately derived from partition NDVs, because the domain of
column value two partitions
>// can overlap. If there is no overlap then global NDV is just the
sum
>// of partition NDVs (UpperBound). But if there is some overlay then
>// global NDV can be anywhere between sum of partition NDVs (no
overlap)
>// and same as one of the partition NDV (domain of column value in
all other
>// partitions is subset of the domain value in one of the partition)
>// (LowerBound).But under uniform distribution, we can roughly
estimate the global
>// NDV by leveraging the min/max values.
>// And, we also guarantee that the estimation makes sense by
comparing it to the
>// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
>// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
>+
"sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as
decimal)),"
>+
"sum((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
>+ "sum((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
>+ "count(PARTITIONS + ".\"PART_ID\"),"
>+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + "" + "
inner join " + PARTITIONS + " on "
>+ PART_COL_STATS + ".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
+ " inner join " + TBLS + " on " + PARTITIONS
>+ ".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\"" + " inner join " + DBS +
" on " + TBLS + ".\"DB_ID\" = " + DBS
>+ ".\"DB_ID\"" + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS
+ ".\"NAME\" = ? and " + TBLS
>+ ".\"TBL_NAME\" = ? and \"COLUMN_NAME\" in (%1$s)" + " and " +
PARTITIONS + ".\"PART_NAME\" in (%2$s)"
> + " and \"ENGINE\" = ? " + " group by \"COLUMN_NAME\",
\"COLUMN_TYPE\"";
> ```
>
> 2. For each batch, merge the returned rows as we does in this PR:
>
> ```java
> try {
> List list = Batchable.runBatched(batchSize, partNames, b);
> Map> colSubList = columnWiseSubList(list);
> for (Map.Entry> entry : colSubList.entrySet())
{
>colStats.add(columnStatisticsObjWithAdjustedNDV(entry.getValue(),
0, useDensityFunctionForNDVEstimation, ndvTuner));
>Deadline.checkTimeout();
> }
>} finally {
> b.closeAllQueries();
>}
> ```
>
> 3. for each `ColumnStatisticsObj` got from step 2, change the `sum` by
`sum * partNames.size() / count(PARTITIONS + ".\"PART_ID\") for this column`
>
> By this, we can make the `aggrStatsUseDB` simple and clear, and don't send
the unnecessary queries back and forth to backing db.
IIUC you are talking about the different queries for the `else` block of `if
(areAllPartsFound)` but I don't have the unit test to test this behaviour as it
is for a broken behaviour when a column is not present in all partiitions. So,
I didn't change any implementation and queries for those case and only replaced
the results of list with a map and then performed merging and add to the final
stats list so that it won't break any functionality.
I can't change the implementation without having a unit test for that
behaviour.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583757083
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2098,29 +2121,39 @@ private List aggrStatsUseDB(String
catName, String dbName,
}
// get sum for all columns to reduce the number of queries
Map> sumMap = new HashMap>();
-queryText = "select \"COLUMN_NAME\", sum(\"NUM_NULLS\"),
sum(\"NUM_TRUES\"), sum(\"NUM_FALSES\"), sum(\"NUM_DISTINCTS\")"
-+ " from " + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" =
? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(extraColumnNameTypeParts.size()) + ")"
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\"";
-start = doTrace ? System.nanoTime() : 0;
-try (QueryWrapper query = new
QueryWrapper(pm.newQuery("javax.jdo.query.SQL", queryText))) {
+queryText =
+"select \"COLUMN_NAME\", sum(\"NUM_NULLS\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), sum(\"NUM_DISTINCTS\")"
++ " from " + PART_COL_STATS + " inner join " + PARTITIONS + "
on " + PART_COL_STATS + ".\"PART_ID\" = "
++ PARTITIONS + ".\"PART_ID\"" + " inner join " + TBLS + " on "
+ PARTITIONS + ".\"TBL_ID\" = " + TBLS
++ ".\"TBL_ID\"" + " inner join " + DBS + " on " + TBLS +
".\"DB_ID\" = " + DBS + ".\"DB_ID\""
++ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS +
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
++ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(extraColumnNameTypeParts.size()) + ")"
++ " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")" + " and "
++ PART_COL_STATS + ".\"ENGINE\" = ? " + " group by " +
PART_COL_STATS + ".\"COLUMN_NAME\"";
+
+b = jobsBatching(queryText, catName, dbName, tableName, partNames,
engine, doTrace);
+try {
List extraColumnNames = new ArrayList();
extraColumnNames.addAll(extraColumnNameTypeParts.keySet());
- Object qResult = executeWithArray(query.getInnerQuery(),
- prepareParams(catName, dbName, tableName, partNames,
- extraColumnNames, engine), queryText);
- if (qResult == null) {
-return Collections.emptyList();
+ List unmergedList = Batchable.runBatched(batchSize,
extraColumnNames, b);
+ Map> colSubList =
columnWiseSubList(unmergedList);
+ List mergedList = new ArrayList<>();
+ for (Map.Entry> entry :
colSubList.entrySet()) {
Review Comment:
I don't think it can be avoided same as the above mentioned reason
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583755269
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2098,29 +2121,39 @@ private List aggrStatsUseDB(String
catName, String dbName,
}
// get sum for all columns to reduce the number of queries
Map> sumMap = new HashMap>();
-queryText = "select \"COLUMN_NAME\", sum(\"NUM_NULLS\"),
sum(\"NUM_TRUES\"), sum(\"NUM_FALSES\"), sum(\"NUM_DISTINCTS\")"
-+ " from " + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" =
? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(extraColumnNameTypeParts.size()) + ")"
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " group by " + PART_COL_STATS + ".\"COLUMN_NAME\"";
-start = doTrace ? System.nanoTime() : 0;
-try (QueryWrapper query = new
QueryWrapper(pm.newQuery("javax.jdo.query.SQL", queryText))) {
+queryText =
+"select \"COLUMN_NAME\", sum(\"NUM_NULLS\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), sum(\"NUM_DISTINCTS\")"
++ " from " + PART_COL_STATS + " inner join " + PARTITIONS + "
on " + PART_COL_STATS + ".\"PART_ID\" = "
++ PARTITIONS + ".\"PART_ID\"" + " inner join " + TBLS + " on "
+ PARTITIONS + ".\"TBL_ID\" = " + TBLS
++ ".\"TBL_ID\"" + " inner join " + DBS + " on " + TBLS +
".\"DB_ID\" = " + DBS + ".\"DB_ID\""
++ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS +
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
++ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" in (" +
makeParams(extraColumnNameTypeParts.size()) + ")"
++ " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")" + " and "
++ PART_COL_STATS + ".\"ENGINE\" = ? " + " group by " +
PART_COL_STATS + ".\"COLUMN_NAME\"";
Review Comment:
done, restored
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583754720
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1977,116 +2020,96 @@ private List
aggrStatsUseDB(String catName, String dbName,
// And, we also guarantee that the estimation makes sense by comparing
it to the
// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\"
as decimal)),"
-+
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? ";
++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\"
as decimal)),"
++
"count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as
decimal)),"
++
"sum((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
++
"count((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
++ "sum((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "count((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + "" + " inner
join " + PARTITIONS + " on "
++ PART_COL_STATS + ".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\"" + "
inner join " + TBLS + " on " + PARTITIONS
++ ".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\"" + " inner join " + DBS + "
on " + TBLS + ".\"DB_ID\" = " + DBS
++ ".\"DB_ID\"" + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS +
".\"NAME\" = ? and " + TBLS
++ ".\"TBL_NAME\" = ? ";
String queryText = null;
-long start = 0;
-long end = 0;
boolean doTrace = LOG.isDebugEnabled();
ForwardQueryResult fqr = null;
// Check if the status of all the columns of all the partitions exists
// Extrapolation is not needed.
if (areAllPartsFound) {
- queryText = commonPrefix + " and \"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
- + " and \"ENGINE\" = ? "
- + " group by \"COLUMN_NAME\", \"COLUMN_TYPE\"";
- start = doTrace ? System.nanoTime() : 0;
- try (QueryWrapper query = new
QueryWrapper(pm.newQuery("javax.jdo.query.SQL", queryText))) {
-Object qResult = executeWithArray(query.getInnerQuery(),
-prepareParams(catName, dbName, tableName, partNames, colNames,
-engine), queryText);
-if (qResult == null) {
- return Collections.emptyList();
-}
-end = doTrace ? System.nanoTime() : 0;
-MetastoreDirectSqlUtils.timingTrace(doTrace, queryText, start, end);
-List list = MetastoreDirectSqlUtils.ensureList(qResult);
-List colStats =
-new ArrayList(list.size());
-for (Object[] row : list) {
- colStats.add(prepareCSObjWithAdjustedNDV(row, 0,
- useDensityFunctionForNDVEstimation, ndvTuner));
+ queryText = commonPrefix + " and \"COLUMN_NAME\" in (%1$s)" + " and " +
PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ + " and \"ENGINE\" = ? " + " group by \"COLUMN_NAME\",
\"COLUMN_TYPE\"";
+ Batchable b = jobsBatching(queryText, catName, dbName,
tableName, partNames, engine, doTrace);
+ List colStats = new ArrayList<>(colNames.size());
+ try {
+List list = Batchable.runBatched(batchSize, colNames, b);
+Map> colSubList = columnWiseSubList(list);
+for (Map.Entry> entry : colSubList.entrySet()) {
Review Comment:
I understand but we will need map to store the redundant column results
obtained from different batches for same column name and those needs to be
merged and other issue is we can't be sure of the no. of elements for a
particular column name which needs to be merged as there are else cases for
`areAllPartsFound` otherwise we would have gone for sorting the obtained
results and then merging the results obtained at a particular gap.
currently we are only calling `columnStatisticsObjWithAdjustedNDV` once per
column as we are calling it w
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583707418
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2254,6 +2292,69 @@ private List aggrStatsUseDB(String
catName, String dbName,
}
}
+ private ColumnStatisticsObj
columnStatisticsObjWithAdjustedNDV(List list, int i,
+ boolean
useDensityFunctionForNDVEstimation, double ndvTuner)
+ throws MetaException {
Review Comment:
I am not sure about this as this method is replacing
`prepareCSObjWithAdjustedNDV` method and it is only limited to `aggrStatsUseDB`
and not other places
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583697128
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2254,6 +2292,69 @@ private List aggrStatsUseDB(String
catName, String dbName,
}
}
+ private ColumnStatisticsObj
columnStatisticsObjWithAdjustedNDV(List list, int i,
+ boolean
useDensityFunctionForNDVEstimation, double ndvTuner)
+ throws MetaException {
+if (list.isEmpty()) {
+ return null;
+}
+ColumnStatisticsData data = new ColumnStatisticsData();
+int j = i;
+Object[] row = list.getFirst();
+String colName = (String) row[j++];
+String colType = (String) row[j++];
+ColumnStatisticsObj cso = new ColumnStatisticsObj(colName, colType, data);
+Object llow = row[j++];
+Object lhigh = row[j++];
+Object dlow = row[j++];
+Object dhigh = row[j++];
+Object declow = row[j++];
+Object dechigh = row[j++];
+Object nulls = row[j++];
+Object dist = row[j++];
+Object avglen = row[j++];
+Object maxlen = row[j++];
+Object trues = row[j++];
+Object falses = row[j++];
+Object sumLong = row[j++];
+Object countLong = row[j++];
+Object sumDouble = row[j++];
+Object countDouble = row[j++];
+Object sumDecimal = row[j++];
+Object countDecimal = row[j++];
+Object sumDist = row[j];
+for (int k = 1; k < list.size(); k++) {
Review Comment:
I don't think we should do that as current loops start with `index =1` and
it is able to replace the method `prepareCSObjWithAdjustedNDV` and if we try to
implement it in the other way the we will need to take care of the conditionals
and return statement inside loop which will not be cleaner than current one
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583697643
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2254,6 +2292,69 @@ private List aggrStatsUseDB(String
catName, String dbName,
}
}
+ private ColumnStatisticsObj
columnStatisticsObjWithAdjustedNDV(List list, int i,
+ boolean
useDensityFunctionForNDVEstimation, double ndvTuner)
+ throws MetaException {
+if (list.isEmpty()) {
+ return null;
+}
+ColumnStatisticsData data = new ColumnStatisticsData();
+int j = i;
+Object[] row = list.getFirst();
+String colName = (String) row[j++];
+String colType = (String) row[j++];
+ColumnStatisticsObj cso = new ColumnStatisticsObj(colName, colType, data);
+Object llow = row[j++];
+Object lhigh = row[j++];
+Object dlow = row[j++];
+Object dhigh = row[j++];
+Object declow = row[j++];
+Object dechigh = row[j++];
+Object nulls = row[j++];
+Object dist = row[j++];
+Object avglen = row[j++];
+Object maxlen = row[j++];
+Object trues = row[j++];
+Object falses = row[j++];
+Object sumLong = row[j++];
+Object countLong = row[j++];
+Object sumDouble = row[j++];
+Object countDouble = row[j++];
+Object sumDecimal = row[j++];
+Object countDecimal = row[j++];
+Object sumDist = row[j];
+for (int k = 1; k < list.size(); k++) {
+ j = i + 2;
+ row = list.get(k);
+ llow = MetastoreDirectSqlUtils.min(llow, row[j++]);
+ lhigh = MetastoreDirectSqlUtils.max(lhigh, row[j++]);
+ dlow = MetastoreDirectSqlUtils.min(dlow, row[j++]);
+ dhigh = MetastoreDirectSqlUtils.max(dhigh, row[j++]);
+ declow = MetastoreDirectSqlUtils.min(declow, row[j++]);
+ dechigh = MetastoreDirectSqlUtils.max(dechigh, row[j++]);
+ nulls = MetastoreDirectSqlUtils.sum(nulls, row[j++]);
+ dist = MetastoreDirectSqlUtils.max(dist, row[j++]);
+ avglen = MetastoreDirectSqlUtils.max(avglen, row[j++]);
+ maxlen = MetastoreDirectSqlUtils.max(maxlen, row[j++]);
+ trues = MetastoreDirectSqlUtils.sum(trues, row[j++]);
+ falses = MetastoreDirectSqlUtils.sum(falses, row[j++]);
+ sumLong = MetastoreDirectSqlUtils.sum(sumLong, row[j++]);
+ countLong = MetastoreDirectSqlUtils.sum(countLong, row[j++]);
+ sumDouble = MetastoreDirectSqlUtils.sum(sumDouble, row[j++]);
+ countDouble = MetastoreDirectSqlUtils.sum(countDouble, row[j++]);
+ sumDecimal = MetastoreDirectSqlUtils.sum(sumDecimal, row[j++]);
+ countDecimal = MetastoreDirectSqlUtils.sum(countDecimal, row[j++]);
+ sumDist = MetastoreDirectSqlUtils.sum(sumDist, row[j]);
Review Comment:
done, added constants in place of it
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583688671
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2254,6 +2292,69 @@ private List aggrStatsUseDB(String
catName, String dbName,
}
}
+ private ColumnStatisticsObj
columnStatisticsObjWithAdjustedNDV(List list, int i,
+ boolean
useDensityFunctionForNDVEstimation, double ndvTuner)
+ throws MetaException {
+if (list.isEmpty()) {
+ return null;
+}
+ColumnStatisticsData data = new ColumnStatisticsData();
+int j = i;
+Object[] row = list.getFirst();
+String colName = (String) row[j++];
+String colType = (String) row[j++];
+ColumnStatisticsObj cso = new ColumnStatisticsObj(colName, colType, data);
+Object llow = row[j++];
+Object lhigh = row[j++];
+Object dlow = row[j++];
+Object dhigh = row[j++];
+Object declow = row[j++];
+Object dechigh = row[j++];
+Object nulls = row[j++];
+Object dist = row[j++];
+Object avglen = row[j++];
+Object maxlen = row[j++];
+Object trues = row[j++];
+Object falses = row[j++];
+Object sumLong = row[j++];
+Object countLong = row[j++];
+Object sumDouble = row[j++];
+Object countDouble = row[j++];
+Object sumDecimal = row[j++];
+Object countDecimal = row[j++];
Review Comment:
done, added constants in place of it
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089: URL: https://github.com/apache/hive/pull/6089#discussion_r2583687667 ## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java: ## @@ -2254,6 +2292,69 @@ private List aggrStatsUseDB(String catName, String dbName, } } + private ColumnStatisticsObj columnStatisticsObjWithAdjustedNDV(List list, int i, Review Comment: done, renamed list to `columnBatchesOutput` and removed it as it was not needed and you also need to use `-Dhive.stats.fetch.bitvector=false` also to step into this method as it is associated with `aggrStatsUseDB`. and if you want to compare results then first run the command without `-Dhive.metastore.direct.sql.batch.size=1000` and store the results and then compare those results with including `-Dhive.metastore.direct.sql.batch.size=1000` to compare the results produced after this. this need to be done because `aggrStatsUseDB` sometimes produces different results from expected as with disabled bit vector and disabled kll sketch the exact ndv estimation is not possible in some cases so approximate results are produced by `aggrStatsUseDB` and these changes are keeping the `aggrStatsUseDB` batched results consistent with `aggrStatsUseDB` unbatched results. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583655581
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2174,76 +2240,48 @@ private List aggrStatsUseDB(String
catName, String dbName,
|| IExtrapolatePartStatus.aggrTypes[colStatIndex] ==
IExtrapolatePartStatus.AggrType.Max) {
// if the aggregation type is min/max, we extrapolate from the
// left/right borders
- if (!decimal) {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS +
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " +
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS +
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " order by \"" + colStatName + "\"";
- } else {
-queryText = "select \"" + colStatName + "\",\"PART_NAME\" from
" + PART_COL_STATS
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS +
".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " +
DBS + ".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS +
".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "
-+ " and " + PART_COL_STATS + ".\"COLUMN_NAME\" = ? "
-+ " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
-+ " and " + PART_COL_STATS + ".\"ENGINE\" = ? "
-+ " order by cast(\"" + colStatName + "\" as decimal)";
- }
- start = doTrace ? System.nanoTime() : 0;
- try (QueryWrapper query = new
QueryWrapper(pm.newQuery("javax.jdo.query.SQL", queryText))) {
-Object qResult = executeWithArray(query.getInnerQuery(),
-prepareParams(catName, dbName, tableName, partNames,
Arrays.asList(colName), engine), queryText);
-if (qResult == null) {
- return Collections.emptyList();
+ String orderByExpr = decimal ? "cast(\"" + colStatName + "\" as
decimal)" : "\"" + colStatName + "\"";
+
+ queryText =
+ "select \"" + colStatName + "\",\"PART_NAME\" from " +
PART_COL_STATS + " inner join " + PARTITIONS
+ + " on " + PART_COL_STATS + ".\"PART_ID\" = " +
PARTITIONS + ".\"PART_ID\"" + " inner join "
+ + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + TBLS +
".\"TBL_ID\"" + " inner join " + DBS
+ + " on " + TBLS + ".\"DB_ID\" = " + DBS + ".\"DB_ID\"" +
" where " + DBS
+ + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? and "
+ TBLS + ".\"TBL_NAME\" = ? " + " and "
+ + PART_COL_STATS + ".\"COLUMN_NAME\" in (%1$s)" + " and
" + PARTITIONS
+ + ".\"PART_NAME\" in (%2$s)" + " and " + PART_COL_STATS
+ ".\"ENGINE\" = ? " + " order by "
+ + orderByExpr;
Review Comment:
done restored the formatting
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583654940
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1977,116 +2020,96 @@ private List
aggrStatsUseDB(String catName, String dbName,
// And, we also guarantee that the estimation makes sense by comparing
it to the
// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\"
as decimal)),"
-+
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? ";
++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\"
as decimal)),"
++
"count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as
decimal)),"
++
"sum((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
++
"count((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
++ "sum((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "count((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + "" + " inner
join " + PARTITIONS + " on "
++ PART_COL_STATS + ".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\"" + "
inner join " + TBLS + " on " + PARTITIONS
++ ".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\"" + " inner join " + DBS + "
on " + TBLS + ".\"DB_ID\" = " + DBS
++ ".\"DB_ID\"" + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS +
".\"NAME\" = ? and " + TBLS
++ ".\"TBL_NAME\" = ? ";
String queryText = null;
-long start = 0;
-long end = 0;
boolean doTrace = LOG.isDebugEnabled();
ForwardQueryResult fqr = null;
// Check if the status of all the columns of all the partitions exists
// Extrapolation is not needed.
if (areAllPartsFound) {
- queryText = commonPrefix + " and \"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
- + " and \"ENGINE\" = ? "
- + " group by \"COLUMN_NAME\", \"COLUMN_TYPE\"";
- start = doTrace ? System.nanoTime() : 0;
- try (QueryWrapper query = new
QueryWrapper(pm.newQuery("javax.jdo.query.SQL", queryText))) {
-Object qResult = executeWithArray(query.getInnerQuery(),
-prepareParams(catName, dbName, tableName, partNames, colNames,
-engine), queryText);
-if (qResult == null) {
- return Collections.emptyList();
-}
-end = doTrace ? System.nanoTime() : 0;
-MetastoreDirectSqlUtils.timingTrace(doTrace, queryText, start, end);
-List list = MetastoreDirectSqlUtils.ensureList(qResult);
-List colStats =
-new ArrayList(list.size());
-for (Object[] row : list) {
- colStats.add(prepareCSObjWithAdjustedNDV(row, 0,
- useDensityFunctionForNDVEstimation, ndvTuner));
+ queryText = commonPrefix + " and \"COLUMN_NAME\" in (%1$s)" + " and " +
PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ + " and \"ENGINE\" = ? " + " group by \"COLUMN_NAME\",
\"COLUMN_TYPE\"";
+ Batchable b = jobsBatching(queryText, catName, dbName,
tableName, partNames, engine, doTrace);
+ List colStats = new ArrayList<>(colNames.size());
+ try {
+List list = Batchable.runBatched(batchSize, colNames, b);
+Map> colSubList = columnWiseSubList(list);
+for (Map.Entry> entry : colSubList.entrySet()) {
+ colStats.add(columnStatisticsObjWithAdjustedNDV(entry.getValue(), 0,
useDensityFunctionForNDVEstimation, ndvTuner));
Deadline.checkTimeout();
}
-return colStats;
+ } finally {
+b.closeAllQueries();
}
+ return colStats;
} else {
// Extrapolation is needed for some columns.
// In this case, at least a column status for a partition is missing.
// We need to extrapolate this partition based on the other partitions
List colStats = new
ArrayList(colNames.size());
- query
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583652938
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1942,6 +1937,43 @@ private List
columnStatisticsObjForPartitionsBatch(String c
}
}
+ private Batchable jobsBatching(final String queryText0,
final String catName, final String dbName,
+ final String tableName, final List partNames, final String
engine, final boolean doTrace) {
+return new Batchable() {
+ @Override
+ public List run(final List inputColNames)
+ throws MetaException {
+Batchable b2 = new Batchable() {
Review Comment:
done, I went with `partitionBatchesFetcher`, it was sounding more clearer to
me
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089: URL: https://github.com/apache/hive/pull/6089#discussion_r2583646648 ## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java: ## @@ -1942,6 +1937,43 @@ private List columnStatisticsObjForPartitionsBatch(String c } } + private Batchable jobsBatching(final String queryText0, final String catName, final String dbName, Review Comment: done renamed to `columnWisePartitionBatcher` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2583648972
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1977,116 +2020,96 @@ private List
aggrStatsUseDB(String catName, String dbName,
// And, we also guarantee that the estimation makes sense by comparing
it to the
// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
-+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\"
as decimal)),"
-+
"avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
-+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
-+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + ""
-+ " inner join " + PARTITIONS + " on " + PART_COL_STATS +
".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\""
-+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " +
TBLS + ".\"TBL_ID\""
-+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS +
".\"DB_ID\""
-+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ?
and " + TBLS + ".\"TBL_NAME\" = ? ";
++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\"
as decimal)),"
++
"count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as
decimal)),"
++
"sum((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
++
"count((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
++ "sum((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "count((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
++ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + "" + " inner
join " + PARTITIONS + " on "
++ PART_COL_STATS + ".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\"" + "
inner join " + TBLS + " on " + PARTITIONS
++ ".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\"" + " inner join " + DBS + "
on " + TBLS + ".\"DB_ID\" = " + DBS
++ ".\"DB_ID\"" + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS +
".\"NAME\" = ? and " + TBLS
++ ".\"TBL_NAME\" = ? ";
String queryText = null;
-long start = 0;
-long end = 0;
boolean doTrace = LOG.isDebugEnabled();
ForwardQueryResult fqr = null;
// Check if the status of all the columns of all the partitions exists
// Extrapolation is not needed.
if (areAllPartsFound) {
- queryText = commonPrefix + " and \"COLUMN_NAME\" in (" +
makeParams(colNames.size()) + ")"
- + " and " + PARTITIONS + ".\"PART_NAME\" in (" +
makeParams(partNames.size()) + ")"
- + " and \"ENGINE\" = ? "
- + " group by \"COLUMN_NAME\", \"COLUMN_TYPE\"";
- start = doTrace ? System.nanoTime() : 0;
- try (QueryWrapper query = new
QueryWrapper(pm.newQuery("javax.jdo.query.SQL", queryText))) {
-Object qResult = executeWithArray(query.getInnerQuery(),
-prepareParams(catName, dbName, tableName, partNames, colNames,
-engine), queryText);
-if (qResult == null) {
- return Collections.emptyList();
-}
-end = doTrace ? System.nanoTime() : 0;
-MetastoreDirectSqlUtils.timingTrace(doTrace, queryText, start, end);
-List list = MetastoreDirectSqlUtils.ensureList(qResult);
-List colStats =
-new ArrayList(list.size());
-for (Object[] row : list) {
- colStats.add(prepareCSObjWithAdjustedNDV(row, 0,
- useDensityFunctionForNDVEstimation, ndvTuner));
+ queryText = commonPrefix + " and \"COLUMN_NAME\" in (%1$s)" + " and " +
PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ + " and \"ENGINE\" = ? " + " group by \"COLUMN_NAME\",
\"COLUMN_TYPE\"";
+ Batchable b = jobsBatching(queryText, catName, dbName,
tableName, partNames, engine, doTrace);
+ List colStats = new ArrayList<>(colNames.size());
+ try {
+List list = Batchable.runBatched(batchSize, colNames, b);
+Map> colSubList = columnWiseSubList(list);
+for (Map.Entry> entry : colSubList.entrySet()) {
+ colStats.add(columnStatisticsObjWithAdjustedNDV(entry.getValue(), 0,
useDensityFunctionForNDVEstimation, ndvTuner));
Deadline.checkTimeout();
}
-return colStats;
+ } finally {
+b.closeAllQueries();
}
+ return colStats;
} else {
// Extrapolation is needed for some columns.
// In this case, at least a column status for a partition is missing.
// We need to extrapolate this partition based on the other partitions
List colStats = new
ArrayList(colNames.size());
- query
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089: URL: https://github.com/apache/hive/pull/6089#discussion_r2583648609 ## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java: ## @@ -1977,116 +2020,96 @@ private List aggrStatsUseDB(String catName, String dbName, // And, we also guarantee that the estimation makes sense by comparing it to the // UpperBound (calculated by "sum(\"NUM_DISTINCTS\")") // and LowerBound (calculated by "max(\"NUM_DISTINCTS\")") -+ "avg((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," -+ "avg((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\")," -+ "avg((cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\")," -+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + "" -+ " inner join " + PARTITIONS + " on " + PART_COL_STATS + ".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\"" -+ " inner join " + TBLS + " on " + PARTITIONS + ".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\"" -+ " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS + ".\"DB_ID\"" -+ " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? and " + TBLS + ".\"TBL_NAME\" = ? "; ++ "sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," ++ "count((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as decimal))," ++ "sum((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\")," ++ "count((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\")," ++ "sum((cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\")," ++ "count((cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\")," ++ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + "" + " inner join " + PARTITIONS + " on " ++ PART_COL_STATS + ".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\"" + " inner join " + TBLS + " on " + PARTITIONS ++ ".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\"" + " inner join " + DBS + " on " + TBLS + ".\"DB_ID\" = " + DBS ++ ".\"DB_ID\"" + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS + ".\"NAME\" = ? and " + TBLS ++ ".\"TBL_NAME\" = ? "; Review Comment: done, restored the original one -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089: URL: https://github.com/apache/hive/pull/6089#discussion_r2583646648 ## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java: ## @@ -1942,6 +1937,43 @@ private List columnStatisticsObjForPartitionsBatch(String c } } + private Batchable jobsBatching(final String queryText0, final String catName, final String dbName, Review Comment: done renamed to columnWisePartitionBatcher -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3602358839 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [45 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [0.0% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3601903089 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [45 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [0.0% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3593774118 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [45 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [0.0% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3592985471 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [61 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [0.0% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3592523773 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [61 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [0.0% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3592468808 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [61 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [0.0% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3591895211 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [61 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [0.0% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2564992611
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2254,6 +2292,69 @@ private List aggrStatsUseDB(String
catName, String dbName,
}
}
+ private ColumnStatisticsObj
columnStatisticsObjWithAdjustedNDV(List list, int i,
+ boolean
useDensityFunctionForNDVEstimation, double ndvTuner)
+ throws MetaException {
Review Comment:
can this method move into `StatObjectConverter`?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 commented on PR #6089:
URL: https://github.com/apache/hive/pull/6089#issuecomment-3581136612
I think we can make the `aggrStatsUseDB` be simpler,
1. run the below SQL batched by part names:
```java
String query = "select \"COLUMN_NAME\", \"COLUMN_TYPE\", "
+ "min(\"LONG_LOW_VALUE\"), max(\"LONG_HIGH_VALUE\"),
min(\"DOUBLE_LOW_VALUE\"), max(\"DOUBLE_HIGH_VALUE\"), "
+ "min(cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal)),
max(cast(\"BIG_DECIMAL_HIGH_VALUE\" as decimal)), "
+ "sum(\"NUM_NULLS\"), max(\"NUM_DISTINCTS\"), "
+ "max(\"AVG_COL_LEN\"), max(\"MAX_COL_LEN\"), sum(\"NUM_TRUES\"),
sum(\"NUM_FALSES\"), "
// The following data is used to compute a partitioned table's NDV
based
// on partitions' NDV when useDensityFunctionForNDVEstimation =
true. Global NDVs cannot be
// accurately derived from partition NDVs, because the domain of
column value two partitions
// can overlap. If there is no overlap then global NDV is just the
sum
// of partition NDVs (UpperBound). But if there is some overlay then
// global NDV can be anywhere between sum of partition NDVs (no
overlap)
// and same as one of the partition NDV (domain of column value in
all other
// partitions is subset of the domain value in one of the partition)
// (LowerBound).But under uniform distribution, we can roughly
estimate the global
// NDV by leveraging the min/max values.
// And, we also guarantee that the estimation makes sense by
comparing it to the
// UpperBound (calculated by "sum(\"NUM_DISTINCTS\")")
// and LowerBound (calculated by "max(\"NUM_DISTINCTS\")")
+
"sum((\"LONG_HIGH_VALUE\"-\"LONG_LOW_VALUE\")/cast(\"NUM_DISTINCTS\" as
decimal)),"
+
"sum((\"DOUBLE_HIGH_VALUE\"-\"DOUBLE_LOW_VALUE\")/\"NUM_DISTINCTS\"),"
+ "sum((cast(\"BIG_DECIMAL_HIGH_VALUE\" as
decimal)-cast(\"BIG_DECIMAL_LOW_VALUE\" as decimal))/\"NUM_DISTINCTS\"),"
+ "count(PARTITIONS + ".\"PART_ID\"),"
+ "sum(\"NUM_DISTINCTS\")" + " from " + PART_COL_STATS + "" + "
inner join " + PARTITIONS + " on "
+ PART_COL_STATS + ".\"PART_ID\" = " + PARTITIONS + ".\"PART_ID\"" +
" inner join " + TBLS + " on " + PARTITIONS
+ ".\"TBL_ID\" = " + TBLS + ".\"TBL_ID\"" + " inner join " + DBS + "
on " + TBLS + ".\"DB_ID\" = " + DBS
+ ".\"DB_ID\"" + " where " + DBS + ".\"CTLG_NAME\" = ? and " + DBS +
".\"NAME\" = ? and " + TBLS
+ ".\"TBL_NAME\" = ? and \"COLUMN_NAME\" in (%1$s)" + " and " +
PARTITIONS + ".\"PART_NAME\" in (%2$s)"
+ " and \"ENGINE\" = ? " + " group by \"COLUMN_NAME\",
\"COLUMN_TYPE\"";
```
2. For each batch, merge the returned rows as we does in this PR:
```java
try {
List list = Batchable.runBatched(batchSize, partNames, b);
Map> colSubList = columnWiseSubList(list);
for (Map.Entry> entry :
colSubList.entrySet()) {
colStats.add(columnStatisticsObjWithAdjustedNDV(entry.getValue(),
0, useDensityFunctionForNDVEstimation, ndvTuner));
Deadline.checkTimeout();
}
} finally {
b.closeAllQueries();
}
```
3. for each `ColumnStatisticsObj` got from step 2, change the `sum` by
`sum * partNames.size() / count(PARTITIONS + ".\"PART_ID\") for this column`
By this, we can make the `aggrStatsUseDB` simple and clear, and don't send
the unnecessary queries back and forth to backing db.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
kasakrisz commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2560044635
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2254,6 +2292,69 @@ private List aggrStatsUseDB(String
catName, String dbName,
}
}
+ private ColumnStatisticsObj
columnStatisticsObjWithAdjustedNDV(List list, int i,
+ boolean
useDensityFunctionForNDVEstimation, double ndvTuner)
+ throws MetaException {
+if (list.isEmpty()) {
+ return null;
+}
+ColumnStatisticsData data = new ColumnStatisticsData();
+int j = i;
+Object[] row = list.getFirst();
+String colName = (String) row[j++];
+String colType = (String) row[j++];
+ColumnStatisticsObj cso = new ColumnStatisticsObj(colName, colType, data);
+Object llow = row[j++];
+Object lhigh = row[j++];
+Object dlow = row[j++];
+Object dhigh = row[j++];
+Object declow = row[j++];
+Object dechigh = row[j++];
+Object nulls = row[j++];
+Object dist = row[j++];
+Object avglen = row[j++];
+Object maxlen = row[j++];
+Object trues = row[j++];
+Object falses = row[j++];
+Object sumLong = row[j++];
+Object countLong = row[j++];
+Object sumDouble = row[j++];
+Object countDouble = row[j++];
+Object sumDecimal = row[j++];
+Object countDecimal = row[j++];
+Object sumDist = row[j];
+for (int k = 1; k < list.size(); k++) {
Review Comment:
Can this be replaced with
```
for (Object[] row : list)
```
?
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1942,6 +1937,43 @@ private List
columnStatisticsObjForPartitionsBatch(String c
}
}
+ private Batchable jobsBatching(final String queryText0,
final String catName, final String dbName,
Review Comment:
Can this method be used for other than loading per partition column
statistics? If not could you please rename it how more specific name.
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -2254,6 +2292,69 @@ private List aggrStatsUseDB(String
catName, String dbName,
}
}
+ private ColumnStatisticsObj
columnStatisticsObjWithAdjustedNDV(List list, int i,
+ boolean
useDensityFunctionForNDVEstimation, double ndvTuner)
+ throws MetaException {
+if (list.isEmpty()) {
+ return null;
+}
+ColumnStatisticsData data = new ColumnStatisticsData();
+int j = i;
+Object[] row = list.getFirst();
+String colName = (String) row[j++];
+String colType = (String) row[j++];
+ColumnStatisticsObj cso = new ColumnStatisticsObj(colName, colType, data);
+Object llow = row[j++];
+Object lhigh = row[j++];
+Object dlow = row[j++];
+Object dhigh = row[j++];
+Object declow = row[j++];
+Object dechigh = row[j++];
+Object nulls = row[j++];
+Object dist = row[j++];
+Object avglen = row[j++];
+Object maxlen = row[j++];
+Object trues = row[j++];
+Object falses = row[j++];
+Object sumLong = row[j++];
+Object countLong = row[j++];
+Object sumDouble = row[j++];
+Object countDouble = row[j++];
+Object sumDecimal = row[j++];
+Object countDecimal = row[j++];
+Object sumDist = row[j];
+for (int k = 1; k < list.size(); k++) {
+ j = i + 2;
+ row = list.get(k);
+ llow = MetastoreDirectSqlUtils.min(llow, row[j++]);
+ lhigh = MetastoreDirectSqlUtils.max(lhigh, row[j++]);
+ dlow = MetastoreDirectSqlUtils.min(dlow, row[j++]);
+ dhigh = MetastoreDirectSqlUtils.max(dhigh, row[j++]);
+ declow = MetastoreDirectSqlUtils.min(declow, row[j++]);
+ dechigh = MetastoreDirectSqlUtils.max(dechigh, row[j++]);
+ nulls = MetastoreDirectSqlUtils.sum(nulls, row[j++]);
+ dist = MetastoreDirectSqlUtils.max(dist, row[j++]);
+ avglen = MetastoreDirectSqlUtils.max(avglen, row[j++]);
+ maxlen = MetastoreDirectSqlUtils.max(maxlen, row[j++]);
+ trues = MetastoreDirectSqlUtils.sum(trues, row[j++]);
+ falses = MetastoreDirectSqlUtils.sum(falses, row[j++]);
+ sumLong = MetastoreDirectSqlUtils.sum(sumLong, row[j++]);
+ countLong = MetastoreDirectSqlUtils.sum(countLong, row[j++]);
+ sumDouble = MetastoreDirectSqlUtils.sum(sumDouble, row[j++]);
+ countDouble = MetastoreDirectSqlUtils.sum(countDouble, row[j++]);
+ sumDecimal = MetastoreDirectSqlUtils.sum(sumDecimal, row[j++]);
+ countDecimal = MetastoreDirectSqlUtils.sum(countDecimal, row[j++]);
+ sumDist = MetastoreDirectSqlUtils.sum(sumDist, row[j]);
Review Comment:
Please use constant indices instead of `j++`.
##
standalone-metastore/metastore-server/src/
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3575750117 > @ramitg254 Thanks for reporting this bug and working on the fix. I have a few questions to get a better picture of the issue: > > > I ran a script which executes the queries of pattern cbo_* and query* under perf directory with > > 1. Did you actually execute all TPC-DS queries or just compile them? The driver `TestTezTPCDS30TBPerfCliDriver` doesn't execute the queries since the data is not available. It uses a Postgres HMS backend db dump to simulate an environment where the TPC-DS schema exists and calls Hive's SQL compiler using the `explain` and `explain cbo` commands. > 2. Do the numbers you shared in the tables (Apache master version, with aggrStatsUseDB/without aggrStatsUseDB and with batching/without batching) show the overall compilation time of all queries? > 3. I haven't found any new tests in this patch, no golden file changes either. Could you please provide a minimal repro of the issue? Please don't copy-paste any q file of the tpc-ds queries. IIUC this issue should be reproducible with a table having a few partitions and a batch size smaller than the number of partitions. Unit tests are also welcome. This only addresses the fact that stats results shouldn't change after batch size is defined to some value. 1. I didn't ran the actual queries only the simulation used by the `TestTezTPCDS30TBPerfCliDriver` and ran those explain queries as you mentioned and recorded their time elapsed. 2. yes this the overall compilation time of the queries executed via `TestTezTPCDS30TBPerfCliDriver` under different scenarios 3. I'll try to add some unit test to replicate this behaviour. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
kasakrisz commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3575499587 @ramitg254 Thanks for reporting this bug and working on the fix. I have a few questions to get a better picture of the issue: > I ran a script which executes the queries of pattern cbo_* and query* under perf directory with 1. Did you actually execute all TPC-DS queries or just compile them? The driver `TestTezTPCDS30TBPerfCliDriver` doesn't execute the queries since the data is not available. It uses a Postgres HMS backend db dump to simulate an environment where the TPC-DS schema exists and calls Hive's SQL compiler using the `explain` and `explain cbo` commands. 2. Do the numbers you shared in the tables (Apache master version, with aggrStatsUseDB/without aggrStatsUseDB and with batching/without batching) show the overall compilation time of all queries? 3. I haven't found any new tests in this patch, no golden file changes either. Could you please provide a minimal repro of the issue? Please don't copy-paste any q file of the tpc-ds queries. IIUC this issue should be reproducible with a table having a few partitions and a batch size smaller than the number of partitions. Unit tests are also welcome. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3557376806 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [17 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [0.0% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3556284727 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [17 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [0.0% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3554262804 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [17 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [0.0% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3547579033 Hi, I ran a script which executes the queries of pattern cbo_* and query* under perf directory with `-Dhive.stats.fetch.bitvector=false` and with the postgres dump the overall time elapsed results were like below: https://github.com/user-attachments/assets/0a577bdd-c4b7-49a7-9ddf-26ce35ca5343"; /> and since there is significant gap in overall time elapsed to execute all the queries and also results of some queries varies after the drop of `aggrStatsUseDB` in case of `-Dhive.stats.fetch.bitvector=false`. Considering this performance drop and varied results I don't think we should go ahead with dropping of `aggrStatsUseDB`. Since currently that's the case so I have raised this current patch which addresses my original concern of handling the behaviour of merging in case of direct sql batch size defined and in this patch I am just adding merging logic which and no change in the earlier behaviour so there won't be any performance issues as well but similar like I tested it out with same set of queries and config and results which I saved earlier originally get generated via `aggrStatsUseDB` didn't change with this logic as well and their time elapsed results are somewhat like this: https://github.com/user-attachments/assets/7241c86b-08a6-406f-b589-87e7b8be0ceb"; /> @dengzhhu653 @deniskuzZ @kasakrisz @saihemanth-cloudera please share your thoughts on this -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3547021003 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [31 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [0.0% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
sonarqubecloud[bot] commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3539012333 ## [](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) **Quality Gate passed** Issues  [31 New issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0 Accepted issues](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6089&issueStatuses=ACCEPTED) Measures  [0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=6089&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true)  [0.0% Coverage on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_coverage&view=list)  [0.0% Duplication on New Code](https://sonarcloud.io/component_measures?id=apache_hive&pullRequest=6089&metric=new_duplicated_lines_density&view=list) [See analysis details on SonarQube Cloud](https://sonarcloud.io/dashboard?id=apache_hive&pullRequest=6089) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on code in PR #6089: URL: https://github.com/apache/hive/pull/6089#discussion_r2486500601 ## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java: ## @@ -1953,306 +1942,6 @@ private List aggrStatsUseJava(String catName, String dbName areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner); } - private List aggrStatsUseDB(String catName, String dbName, Review Comment: @saihemanth-cloudera as per HIVE-16997 bit vector is set to true for tpcds tests even if default value is false making it go back to `aggrStatsUseJava` and even if i set it to false to fall back to `aggrStatsUseDB` the results are different for .out files of tpcds queries. So my concern is do I need to first obtain those different results generated via `aggrStatsUseDB` and then run performance testing while validating with those results. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
dengzhhu653 commented on code in PR #6089:
URL: https://github.com/apache/hive/pull/6089#discussion_r2375233641
##
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java:
##
@@ -1880,14 +1880,8 @@ private List
columnStatisticsObjForPartitions(
return Batchable.runBatched(batchSize, colNames, new Batchable() {
@Override
public List run(final List inputColNames)
throws MetaException {
-return Batchable.runBatched(batchSize, partNames, new
Batchable() {
- @Override
- public List run(List inputPartNames)
throws MetaException {
-return columnStatisticsObjForPartitionsBatch(catName, dbName,
tableName, inputPartNames,
-inputColNames, engine, areAllPartsFound,
useDensityFunctionForNDVEstimation, ndvTuner,
-enableBitVector, enableKll);
- }
-});
+return columnStatisticsObjForPartitionsBatch(catName, dbName,
tableName, partNames, inputColNames, engine,
Review Comment:
nit: can we just aggrStatsUseJava directly here and remove the
columnStatisticsObjForPartitionsBatch
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
saihemanth-cloudera commented on code in PR #6089: URL: https://github.com/apache/hive/pull/6089#discussion_r2414814623 ## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java: ## @@ -1953,306 +1942,6 @@ private List aggrStatsUseJava(String catName, String dbName areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner); } - private List aggrStatsUseDB(String catName, String dbName, Review Comment: > Do we have tests using other dbs than Derby and the PostgreSQL image? Oracle has a 1000 parameter limit. It would be nice to run performance testing on PostgreSQL, Oracle, mysql before removing this optimization. ## standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java: ## @@ -1953,306 +1942,6 @@ private List aggrStatsUseJava(String catName, String dbName areAllPartsFound, useDensityFunctionForNDVEstimation, ndvTuner); } - private List aggrStatsUseDB(String catName, String dbName, Review Comment: > Do we have tests using other dbs than Derby and the PostgreSQL image? Oracle has a 1000 parameter limit. It would be nice to run performance testing on PostgreSQL, Oracle, mysql before removing this optimization. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] HIVE-29203:get_aggr_stats_for doesn't aggregate stats when direct sql… [hive]
ramitg254 commented on PR #6089: URL: https://github.com/apache/hive/pull/6089#issuecomment-3351135571 > > ### How was this patch tested? > > by running tpcds tests locally with setting hive.metastore.direct.sql.batch.size as 1000 in HiveConf.java and MetastoreConf.java. > > Hi @ramitg254, Would the tpcds tests fail without this PR? BTW, What is the size of the tpcds test dataset? Hi @zhangbutao, without this pr tpcds tests will fail whenever `hive.metastore.direct.sql.batch.size` > 0 in cases partition size is greater than the batch size. there were around 55 tpcds test failures when property is set to 1000, and one such example is query16.q -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
