[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=584904=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-584904 ]

ASF GitHub Bot logged work on HIVE-24928:

Author: ASF GitHub Bot
Created on: 19/Apr/21 06:22
Start Date: 19/Apr/21 06:22
Worklog Time Spent: 10m
Work Description: lcspinter merged pull request #2111:
URL: https://github.com/apache/hive/pull/2111

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking

Worklog Id: (was: 584904)
Time Spent: 6h 20m (was: 6h 10m)

> In case of non-native tables use basic statistics from HiveStorageHandler
>
> Key: HIVE-24928
> URL: https://issues.apache.org/jira/browse/HIVE-24928
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 4.0.0
> Reporter: László Pintér
> Assignee: László Pintér
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 6h 20m
> Remaining Estimate: 0h
>
> When we are running `ANALYZE TABLE ... COMPUTE STATISTICS` or `ANALYZE TABLE
> ... COMPUTE STATISTICS FOR COLUMNS` all the basic statistics are collected by
> the BasicStatsTask class. This class tries to estimate the statistics by
> scanning the directory of the table.
> In the case of non-native tables (iceberg, hbase), the table directory might
> contain metadata files as well, which would be counted by the BasicStatsTask
> when calculating basic stats.
> Instead of having this logic, the HiveStorageHandler implementation should
> provide basic statistics.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=581853=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-581853 ]

ASF GitHub Bot logged work on HIVE-24928:

Author: ASF GitHub Bot
Created on: 13/Apr/21 15:27
Start Date: 13/Apr/21 15:27
Worklog Time Spent: 10m
Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r612552854

## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/TaskCompiler.java ##

@@ -422,7 +422,7 @@ private String extractTableFullName(StatsTask tsk) throws SemanticException {
       TableSpec tableSpec = new TableSpec(table, partitions);
       tableScan.getConf().getTableMetadata().setTableSpec(tableSpec);
 
-      if (BasicStatsNoJobTask.canUseFooterScan(table, inputFormat)) {
+      if (BasicStatsNoJobTask.canUseColumnStats(table, inputFormat)) {

Review comment:
Right, fixed it.

Issue Time Tracking

Worklog Id: (was: 581853)
Time Spent: 6h 10m (was: 6h)

[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=581849=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-581849 ]

ASF GitHub Bot logged work on HIVE-24928:

Author: ASF GitHub Bot
Created on: 13/Apr/21 15:26
Start Date: 13/Apr/21 15:26
Worklog Time Spent: 10m
Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r612551277

## File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java ##

@@ -153,6 +156,37 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
     return predicate;
   }
 
+  @Override
+  public boolean canProvideBasicStatistics() {
+    return true;
+  }
+
+  @Override
+  public Map getBasicStatistics(TableDesc tableDesc) {
+    Table table = Catalogs.loadTable(conf, tableDesc.getProperties());
+    Map stats = new HashMap<>();
+    if (table.currentSnapshot() != null) {
+      Map summary = table.currentSnapshot().summary();
+      if (summary != null) {
+        if (summary.containsKey(SnapshotSummary.TOTAL_DATA_FILES_PROP)) {
+          stats.put(StatsSetupConst.NUM_FILES, summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP));
+        }
+        if (summary.containsKey(SnapshotSummary.TOTAL_RECORDS_PROP)) {
+          stats.put(StatsSetupConst.ROW_COUNT, summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
+        }
+        // TODO: add TOTAL_SIZE when iceberg 0.12 is released
+        if (summary.containsKey("total-files-size")) {
+          stats.put(StatsSetupConst.TOTAL_SIZE, summary.get("total-files-size"));
+        }
+      }
+    } else {
+      stats.put(StatsSetupConst.NUM_FILES, "0");

Review comment:
In the case of an empty table, the current snapshot is null. I thought setting all the basic stats to 0 is the right approach since we don't have any data.
When the summary of the snapshot is not available I return an empty statistics map.

Issue Time Tracking

Worklog Id: (was: 581849)
Time Spent: 6h (was: 5h 50m)

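The distinction drawn in this comment (zeros for an empty table, an empty map when the summary is missing) can be sketched on its own. This is a minimal sketch, not the code from the pull request: `SnapshotStatsSketch` and `toBasicStats` are hypothetical names, the snapshot summary is modelled as a plain `Map`, and the summary keys `total-data-files` and `total-records` are assumed values for the `SnapshotSummary` constants (only `total-files-size` is quoted in the diff).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the handler logic discussed above; not the PR's code.
final class SnapshotStatsSketch {
  // Hive-side stat names, as listed in the HiveStorageHandler javadoc of this PR.
  static final String NUM_FILES = "numFiles";
  static final String ROW_COUNT = "numRows";
  static final String TOTAL_SIZE = "totalSize";

  // Assumed snapshot-summary keys; only "total-files-size" appears literally in the diff.
  static final String TOTAL_DATA_FILES = "total-data-files";
  static final String TOTAL_RECORDS = "total-records";
  static final String TOTAL_FILES_SIZE = "total-files-size";

  /**
   * @param hasSnapshot false models an empty table (no current snapshot)
   * @param summary     the snapshot summary, null when it is not available
   */
  static Map<String, String> toBasicStats(boolean hasSnapshot, Map<String, String> summary) {
    Map<String, String> stats = new HashMap<>();
    if (!hasSnapshot) {
      // Empty table: there is no data, so every basic stat is known to be zero.
      stats.put(NUM_FILES, "0");
      stats.put(ROW_COUNT, "0");
      stats.put(TOTAL_SIZE, "0");
      return stats;
    }
    if (summary == null) {
      // Summary unavailable: report nothing rather than something inaccurate.
      return stats;
    }
    copyIfPresent(summary, TOTAL_DATA_FILES, stats, NUM_FILES);
    copyIfPresent(summary, TOTAL_RECORDS, stats, ROW_COUNT);
    copyIfPresent(summary, TOTAL_FILES_SIZE, stats, TOTAL_SIZE);
    return stats;
  }

  // Copies one summary entry under its Hive name, skipping keys the summary lacks.
  private static void copyIfPresent(Map<String, String> from, String fromKey,
                                    Map<String, String> to, String toKey) {
    String value = from.get(fromKey);
    if (value != null) {
      to.put(toKey, value);
    }
  }
}
```

Copying only the keys that are present keeps a partially populated summary from producing invented values.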
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=581843=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-581843 ]

ASF GitHub Bot logged work on HIVE-24928:

Author: ASF GitHub Bot
Created on: 13/Apr/21 15:19
Start Date: 13/Apr/21 15:19
Worklog Time Spent: 10m
Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r612545094

## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java ##

@@ -119,16 +129,83 @@ public String getName() {
     return "STATS-NO-JOB";
   }
 
-  static class StatItem {
-    Partish partish;
-    Map params;
-    Object result;
+  abstract static class StatCollector implements Runnable {
+
+    protected Partish partish;
+    protected Object result;
+    protected LogHelper console;
+
+    public static Function SIMPLE_NAME_FUNCTION =
+        sc -> String.format("%s#%s", sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+    public static Function EXTRACT_RESULT_FUNCTION = sc -> (Partition) sc.result();
+
+    abstract Partish partish();
+    abstract boolean isValid();
+    abstract Object result();
+    abstract void init(HiveConf conf, LogHelper console) throws IOException;
+
+    protected String toString(Map parameters) {
+      return StatsSetupConst.SUPPORTED_STATS.stream().map(st -> st + "=" + parameters.get(st))
+          .collect(Collectors.joining(", "));
+    }
   }
 
-  static class FooterStatCollector implements Runnable {
+  static class HiveStorageHandlerStatCollector extends StatCollector {
+
+    public HiveStorageHandlerStatCollector(Partish partish) {
+      this.partish = partish;
+    }
+
+    @Override
+    public void init(HiveConf conf, LogHelper console) throws IOException {
+      this.console = console;
+    }
+
+    @Override
+    public void run() {
+      try {
+        Table table = partish.getTable();
+        Map parameters = partish.getPartParameters();
+        TableDesc tableDesc = Utilities.getTableDesc(table);
+        Map basicStatistics = table.getStorageHandler().getBasicStatistics(tableDesc);

Review comment:
Correct, I missed that. I will provide the `partish` object which is enough to calculate the table/partition stats on StorageHandler side.

Issue Time Tracking

Worklog Id: (was: 581843)
Time Spent: 5h 50m (was: 5h 40m)

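To illustrate why a table-or-partition object gives the handler more to work with than a table description alone, here is a minimal sketch under stated assumptions. `PartishSketch` and `PartitionAwareHandlerSketch` are hypothetical stand-ins, not Hive's `Partish` or any real storage handler, and the string-keyed lookup is only one possible way to tell a table apart from its partitions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for Hive's Partish: either a whole table or one partition of it.
final class PartishSketch {
  final String table;
  final Optional<String> partition;

  private PartishSketch(String table, Optional<String> partition) {
    this.table = table;
    this.partition = partition;
  }

  static PartishSketch ofTable(String table) {
    return new PartishSketch(table, Optional.empty());
  }

  static PartishSketch ofPartition(String table, String partition) {
    return new PartishSketch(table, Optional.of(partition));
  }

  // Lookup key: "table" for the whole table, "table/partition" for a single partition.
  String key() {
    return partition.map(p -> table + "/" + p).orElse(table);
  }
}

// Hypothetical handler that can answer per table or per partition, depending on its argument.
final class PartitionAwareHandlerSketch {
  private final Map<String, Map<String, String>> statsByKey = new HashMap<>();

  void put(PartishSketch target, Map<String, String> stats) {
    statsByKey.put(target.key(), stats);
  }

  // Returns null when nothing is known for the target, mirroring the "can be null" contract.
  Map<String, String> getBasicStatistics(PartishSketch target) {
    return statsByKey.get(target.key());
  }
}
```

With a signature that only receives the table, the second lookup below could not be expressed: `getBasicStatistics(PartishSketch.ofTable("db.t"))` versus `getBasicStatistics(PartishSketch.ofPartition("db.t", "p=1"))`.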
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=580872=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-580872 ]

ASF GitHub Bot logged work on HIVE-24928:

Author: ASF GitHub Bot
Created on: 12/Apr/21 10:29
Start Date: 12/Apr/21 10:29
Worklog Time Spent: 10m
Work Description: kgyrtkirk commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r611489108

## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/TaskCompiler.java ##

@@ -422,7 +422,7 @@ private String extractTableFullName(StatsTask tsk) throws SemanticException {
       TableSpec tableSpec = new TableSpec(table, partitions);
       tableScan.getConf().getTableMetadata().setTableSpec(tableSpec);
 
-      if (BasicStatsNoJobTask.canUseFooterScan(table, inputFormat)) {
+      if (BasicStatsNoJobTask.canUseColumnStats(table, inputFormat)) {

Review comment:
we have a `BasicStatsNoJobTask.canUseStats` and a `BasicStatsNoJobTask.canUseColumnStats` - I think "footerscan" is the basicstats stuff; could this be a typo?

## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java ##

@@ -90,12 +91,21 @@ public BasicStatsNoJobTask(HiveConf conf, BasicStatsNoJobWork work) {
     console = new LogHelper(LOG);
   }
 
-  public static boolean canUseFooterScan(
+  public static boolean canUseStats(
       Table table, Class inputFormat) {
+    return canUseColumnStats(table, inputFormat) || useBasicStatsFromStorageHandler(table);
+  }
+
+  public static boolean canUseColumnStats(Table table, Class inputFormat) {

Review comment:
this has nothing to do with column stats - that's a different thing

## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java ##

@@ -312,23 +384,23 @@ private int aggregateStats(ExecutorService threadPool, Hive db) {
     return ret;
   }
 
-  private int updatePartitions(Hive db, List scs, Table table) throws InvalidOperationException, HiveException {
+  private int updatePartitions(Hive db, List scs, Table table) throws InvalidOperationException, HiveException {
 
     String tableFullName = table.getFullyQualifiedName();
 
     if (scs.isEmpty()) {
       return 0;
     }
     if (work.isStatsReliable()) {

Review comment:
note: it might make sense to communicate this `work.isStatsReliable` somehow to the `StatCollector` so it can make that `LOG` entry if it has to...

## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java ##

@@ -119,16 +129,83 @@ public String getName() {
     return "STATS-NO-JOB";
   }
 
-  static class StatItem {
-    Partish partish;
-    Map params;
-    Object result;
+  abstract static class StatCollector implements Runnable {
+
+    protected Partish partish;
+    protected Object result;
+    protected LogHelper console;
+
+    public static Function SIMPLE_NAME_FUNCTION =
+        sc -> String.format("%s#%s", sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+    public static Function EXTRACT_RESULT_FUNCTION = sc -> (Partition) sc.result();
+
+    abstract Partish partish();
+    abstract boolean isValid();
+    abstract Object result();
+    abstract void init(HiveConf conf, LogHelper console) throws IOException;
+
+    protected String toString(Map parameters) {
+      return StatsSetupConst.SUPPORTED_STATS.stream().map(st -> st + "=" + parameters.get(st))
+          .collect(Collectors.joining(", "));
+    }
   }
 
-  static class FooterStatCollector implements Runnable {
+  static class HiveStorageHandlerStatCollector extends StatCollector {
+
+    public HiveStorageHandlerStatCollector(Partish partish) {
+      this.partish = partish;
+    }
+
+    @Override
+    public void init(HiveConf conf, LogHelper console) throws IOException {
+      this.console = console;
+    }
+
+    @Override
+    public void run() {
+      try {
+        Table table = partish.getTable();
+        Map parameters = partish.getPartParameters();
+        TableDesc tableDesc = Utilities.getTableDesc(table);
+        Map basicStatistics = table.getStorageHandler().getBasicStatistics(tableDesc);
+
+        StatsSetupConst.setBasicStatsState(parameters, StatsSetupConst.TRUE);

Review comment:
I don't understand why we make changes to the `Table` when we could be updating infos of a partition as well...
I guess in case of Iceberg you will not have regular partitions, so it will probably work correctly for that.
I think here you want to change `parameters`.

## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java ##

@@ -90,12 +91,21 @@ public BasicStatsNoJobTask(HiveConf conf, BasicStatsNoJobWork work) {
     console = new LogHelper(LOG);
   }
 
-  public static boolean

[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=579836=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579836 ]

ASF GitHub Bot logged work on HIVE-24928:

Author: ASF GitHub Bot
Created on: 09/Apr/21 07:45
Start Date: 09/Apr/21 07:45
Worklog Time Spent: 10m
Work Description: pvary commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r610417708

## File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java ##

@@ -153,6 +156,37 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
     return predicate;
   }
 
+  @Override
+  public boolean canProvideBasicStatistics() {
+    return true;
+  }
+
+  @Override
+  public Map getBasicStatistics(TableDesc tableDesc) {
+    Table table = Catalogs.loadTable(conf, tableDesc.getProperties());
+    Map stats = new HashMap<>();
+    if (table.currentSnapshot() != null) {
+      Map summary = table.currentSnapshot().summary();
+      if (summary != null) {
+        if (summary.containsKey(SnapshotSummary.TOTAL_DATA_FILES_PROP)) {
+          stats.put(StatsSetupConst.NUM_FILES, summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP));
+        }
+        if (summary.containsKey(SnapshotSummary.TOTAL_RECORDS_PROP)) {
+          stats.put(StatsSetupConst.ROW_COUNT, summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
+        }
+        // TODO: add TOTAL_SIZE when iceberg 0.12 is released
+        if (summary.containsKey("total-files-size")) {
+          stats.put(StatsSetupConst.TOTAL_SIZE, summary.get("total-files-size"));
+        }
+      }
+    } else {
+      stats.put(StatsSetupConst.NUM_FILES, "0");

Review comment:
Is this for empty table, or when we do not have statistics at hand?
We might want to handle the situation when we do not have statistics calculated yet, or we have an incomplete table info.

On the Iceberg dev list I have seen this conversation: https://mail-archives.apache.org/mod_mbox/iceberg-dev/202104.mbox/%3c9a11adb4-27d8-40f1-8141-531287c03...@gmail.com%3e
> So the tldr, Missing is OK, but inaccurate is not

Issue Time Tracking

Worklog Id: (was: 579836)
Time Spent: 5.5h (was: 5h 20m)

[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=579835=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579835 ]

ASF GitHub Bot logged work on HIVE-24928:

Author: ASF GitHub Bot
Created on: 09/Apr/21 07:40
Start Date: 09/Apr/21 07:40
Worklog Time Spent: 10m
Work Description: pvary commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r610415034

## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java ##

@@ -119,16 +129,83 @@ public String getName() {
     return "STATS-NO-JOB";
   }
 
-  static class StatItem {
-    Partish partish;
-    Map params;
-    Object result;
+  abstract static class StatCollector implements Runnable {
+
+    protected Partish partish;
+    protected Object result;
+    protected LogHelper console;
+
+    public static Function SIMPLE_NAME_FUNCTION =
+        sc -> String.format("%s#%s", sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+    public static Function EXTRACT_RESULT_FUNCTION = sc -> (Partition) sc.result();
+
+    abstract Partish partish();
+    abstract boolean isValid();
+    abstract Object result();
+    abstract void init(HiveConf conf, LogHelper console) throws IOException;
+
+    protected String toString(Map parameters) {
+      return StatsSetupConst.SUPPORTED_STATS.stream().map(st -> st + "=" + parameters.get(st))
+          .collect(Collectors.joining(", "));
+    }
   }
 
-  static class FooterStatCollector implements Runnable {
+  static class HiveStorageHandlerStatCollector extends StatCollector {
+
+    public HiveStorageHandlerStatCollector(Partish partish) {
+      this.partish = partish;
+    }
+
+    @Override
+    public void init(HiveConf conf, LogHelper console) throws IOException {
+      this.console = console;
+    }
+
+    @Override
+    public void run() {
+      try {
+        Table table = partish.getTable();
+        Map parameters = partish.getPartParameters();
+        TableDesc tableDesc = Utilities.getTableDesc(table);
+        Map basicStatistics = table.getStorageHandler().getBasicStatistics(tableDesc);

Review comment:
If the table were partitioned, this would not provide enough information to the StorageHandler to generate partition-related statistics. Either we should document it or provide some info to the StorageHandler to calculate partition statistics.

Issue Time Tracking

Worklog Id: (was: 579835)
Time Spent: 5h 20m (was: 5h 10m)

[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=579834=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579834 ]

ASF GitHub Bot logged work on HIVE-24928:

Author: ASF GitHub Bot
Created on: 09/Apr/21 07:38
Start Date: 09/Apr/21 07:38
Worklog Time Spent: 10m
Work Description: pvary commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r610413396

## File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveStorageHandler.java ##

@@ -197,4 +197,22 @@ default boolean addDynamicSplitPruningEdge(ExprNodeDesc syntheticFilterPredicate
   default Map getOperatorDescProperties(OperatorDesc operatorDesc, Map initialProps) {
     return initialProps;
   }
+
+  /**
+   * Return some basic statistics (numRows, numFiles, totalSize) calculated by the underlying storage handler
+   * implementation.
+   * @param tableDesc a valid table description, used to load the table
+   * @return map of basic statistics, can be null
+   */
+  default Map getBasicStatistics(TableDesc tableDesc) {
+    return null;
+  }
+
+  /**
+   * Check if the storage handler can provide basic statistics.
+   * @return true if the storage handler can supply the basic statistics
+   */
+  default boolean canProvideBasicStatistics() {

Review comment:
Ok.. I see why it is separated...

Issue Time Tracking

Worklog Id: (was: 579834)
Time Spent: 5h 10m (was: 5h)

[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=579827=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579827 ]

ASF GitHub Bot logged work on HIVE-24928:

Author: ASF GitHub Bot
Created on: 09/Apr/21 07:35
Start Date: 09/Apr/21 07:35
Worklog Time Spent: 10m
Work Description: pvary commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r610411498

## File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveStorageHandler.java ##

@@ -197,4 +197,22 @@ default boolean addDynamicSplitPruningEdge(ExprNodeDesc syntheticFilterPredicate
   default Map getOperatorDescProperties(OperatorDesc operatorDesc, Map initialProps) {
     return initialProps;
   }
+
+  /**
+   * Return some basic statistics (numRows, numFiles, totalSize) calculated by the underlying storage handler
+   * implementation.
+   * @param tableDesc a valid table description, used to load the table
+   * @return map of basic statistics, can be null
+   */
+  default Map getBasicStatistics(TableDesc tableDesc) {
+    return null;
+  }
+
+  /**
+   * Check if the storage handler can provide basic statistics.
+   * @return true if the storage handler can supply the basic statistics
+   */
+  default boolean canProvideBasicStatistics() {

Review comment:
Do we need both methods? Wouldn't it be better to handle `null` from `getBasicStatistics()` as `!canProvideBasicStatistics()`?

Issue Time Tracking

Worklog Id: (was: 579827)
Time Spent: 5h (was: 4h 50m)

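The trade-off behind this question can be sketched with a simplified mirror of the two default methods. This is a sketch under stated assumptions, not Hive's interface: `StatsProviderSketch` and `StatsLookupSketch` are hypothetical names, and the table is reduced to a plain `String`. One plausible reason for keeping a separate capability check, assumed here rather than stated in the thread, is that it lets the caller avoid a potentially expensive table load just to learn that no statistics are offered.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified mirror of the two default methods under review.
interface StatsProviderSketch {
  // Cheap capability flag; handlers opt in by overriding it.
  default boolean canProvideBasicStatistics() {
    return false;
  }

  // May be expensive (for example, it may load the table); null means "no statistics".
  default Map<String, String> getBasicStatistics(String tableName) {
    return null;
  }
}

final class StatsLookupSketch {
  // Asks the cheap question first and only then fetches the statistics.
  static Optional<Map<String, String>> lookup(StatsProviderSketch handler, String tableName) {
    if (!handler.canProvideBasicStatistics()) {
      return Optional.empty();
    }
    return Optional.ofNullable(handler.getBasicStatistics(tableName));
  }
}
```

Folding both methods into a single null check, as the comment suggests, would simplify the interface at the price of calling `getBasicStatistics` whenever the capability has to be decided.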
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=579825=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579825 ]

ASF GitHub Bot logged work on HIVE-24928:

Author: ASF GitHub Bot
Created on: 09/Apr/21 07:33
Start Date: 09/Apr/21 07:33
Worklog Time Spent: 10m
Work Description: pvary commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r610410409

## File path: iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java ##

@@ -92,6 +97,11 @@
           Types.TimestampType.withoutZone(), Types.StringType.get(), Types.BinaryType.get(),
           Types.DecimalType.of(3, 1), Types.UUIDType.get(), Types.FixedType.ofLength(5),
           Types.TimeType.get());
 
+  private static final Map STATS_MAPPING = ImmutableMap.of(

Review comment:
nit: maybe newline

Issue Time Tracking

Worklog Id: (was: 579825)
Time Spent: 4h 50m (was: 4h 40m)

[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577490=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577490 ]

ASF GitHub Bot logged work on HIVE-24928:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 06/Apr/21 11:21
Start Date: 06/Apr/21 11:21
Worklog Time Spent: 10m
Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607762003

## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java ##
@@ -90,12 +91,27 @@ public BasicStatsNoJobTask(HiveConf conf, BasicStatsNoJobWork work) {
     console = new LogHelper(LOG);
   }

-  public static boolean canUseFooterScan(
+  public static boolean canUseStats(
       Table table, Class inputFormat) {
+    return (OrcInputFormat.class.isAssignableFrom(inputFormat) && !AcidUtils.isFullAcidTable(table))
+        || MapredParquetInputFormat.class.isAssignableFrom(inputFormat)
+        || useBasicStatsFromStorageHandler(table);
+  }
+
+  public static boolean canUseColumnStats(Table table, Class inputFormat) {
     return (OrcInputFormat.class.isAssignableFrom(inputFormat) && !AcidUtils.isFullAcidTable(table))
         || MapredParquetInputFormat.class.isAssignableFrom(inputFormat);
   }

+  private static boolean useBasicStatsFromStorageHandler(Table table) {
+    if (table.isNonNative()) {
+      TableDesc tableDesc = Utilities.getTableDesc(table);
+      return table.getStorageHandler().getBasicStatistics(tableDesc) != null;

Review comment:
Introduced a new method on the interface to check whether the storage handler can provide stats.

Issue Time Tracking
-------------------
Worklog Id: (was: 577490)
Time Spent: 4h 40m (was: 4.5h)
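The exchange above turns on separating a cheap capability check from the potentially expensive statistics computation. A minimal sketch of that idea, assuming a simplified interface: `StatsProvidingHandler`, `canProvideBasicStatistics`, `CountingHandler`, and `StatsPlanner` are hypothetical names for illustration, not the actual `HiveStorageHandler` API or the code merged in the pull request.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a storage handler interface (not Hive's real API).
interface StatsProvidingHandler {
  // Cheap capability check: must not compute any statistics.
  default boolean canProvideBasicStatistics() {
    return false;
  }

  // Potentially expensive: actually gathers the statistics.
  default Map<String, String> getBasicStatistics() {
    return Collections.emptyMap();
  }
}

// Example handler that counts how often the expensive call is made.
final class CountingHandler implements StatsProvidingHandler {
  int statsCalls = 0;

  @Override
  public boolean canProvideBasicStatistics() {
    return true;
  }

  @Override
  public Map<String, String> getBasicStatistics() {
    statsCalls++;
    Map<String, String> stats = new HashMap<>();
    stats.put("numFiles", "3");
    stats.put("numRows", "42");
    return stats;
  }
}

final class StatsPlanner {
  private StatsPlanner() {
  }

  // Decides at planning time whether handler-provided stats can be used,
  // without triggering the statistics computation itself.
  static boolean useBasicStatsFromStorageHandler(boolean nonNative, StatsProvidingHandler handler) {
    return nonNative && handler != null && handler.canProvideBasicStatistics();
  }
}
```

With this shape, the planning decision never calls `getBasicStatistics()`, so nothing is computed and then thrown away; the statistics are only gathered later, when the stats task actually needs them.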
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577488=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577488 ]

ASF GitHub Bot logged work on HIVE-24928:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 06/Apr/21 11:20
Start Date: 06/Apr/21 11:20
Worklog Time Spent: 10m
Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607761311

## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java ##
@@ -119,16 +135,92 @@ public String getName() {
     return "STATS-NO-JOB";
   }

-  static class StatItem {
-    Partish partish;
-    Map params;
-    Object result;
+  abstract static class StatCollector implements Runnable {
+
+    protected Partish partish;
+    protected Object result;
+    protected LogHelper console;
+
+    public static Function SIMPLE_NAME_FUNCTION =
+        sc -> String.format("%s#%s", sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+    public static Function EXTRACT_RESULT_FUNCTION = input -> (Partition) input.result();

Review comment:
This was a legacy code snippet. Fixed it.

Issue Time Tracking
-------------------
Worklog Id: (was: 577488)
Time Spent: 4h 20m (was: 4h 10m)
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577489=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577489 ]

ASF GitHub Bot logged work on HIVE-24928:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 06/Apr/21 11:20
Start Date: 06/Apr/21 11:20
Worklog Time Spent: 10m
Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607761504

## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java ##
@@ -119,16 +135,92 @@ public String getName() {
     return "STATS-NO-JOB";
   }

-  static class StatItem {
-    Partish partish;
-    Map params;
-    Object result;
+  abstract static class StatCollector implements Runnable {
+
+    protected Partish partish;
+    protected Object result;
+    protected LogHelper console;
+
+    public static Function SIMPLE_NAME_FUNCTION =
+        sc -> String.format("%s#%s", sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+    public static Function EXTRACT_RESULT_FUNCTION = input -> (Partition) input.result();
+
+    abstract Partish partish();
+    abstract boolean isValid();
+    abstract Object result();
+    abstract void init(HiveConf conf, LogHelper console) throws IOException;
+
+    protected String toString(Map parameters) {
+      StringBuilder builder = new StringBuilder();

Review comment:
Again, legacy code :), but I changed it.

Issue Time Tracking
-------------------
Worklog Id: (was: 577489)
Time Spent: 4.5h (was: 4h 20m)
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577484=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577484 ]

ASF GitHub Bot logged work on HIVE-24928:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 06/Apr/21 11:16
Start Date: 06/Apr/21 11:16
Worklog Time Spent: 10m
Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607759142

## File path: iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java ##
@@ -826,4 +866,31 @@ private StringBuilder buildComplexTypeInnerQuery(Object field, Type type) {
     }
     return query;
   }
+
+  private void validateBasicStats(Table table, String tableName) {
+    List describeResult = shell.executeStatement("DESCRIBE EXTENDED " + tableName);
+    Optional tableInfo =

Review comment:
Thanks for letting me know. I changed the test.

Issue Time Tracking
-------------------
Worklog Id: (was: 577484)
Time Spent: 4h (was: 3h 50m)
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577485=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577485 ]

ASF GitHub Bot logged work on HIVE-24928:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 06/Apr/21 11:16
Start Date: 06/Apr/21 11:16
Worklog Time Spent: 10m
Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607759343

## File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveStorageHandler.java ##
@@ -197,4 +197,8 @@ default boolean addDynamicSplitPruningEdge(ExprNodeDesc syntheticFilterPredicate
   default Map getOperatorDescProperties(OperatorDesc operatorDesc, Map initialProps) {
     return initialProps;
   }
+
+  default Map getBasicStatistics(TableDesc tableDesc) {

Review comment:
Right, fixed it.

Issue Time Tracking
-------------------
Worklog Id: (was: 577485)
Time Spent: 4h 10m (was: 4h)
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577480=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577480 ]

ASF GitHub Bot logged work on HIVE-24928:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 06/Apr/21 11:15
Start Date: 06/Apr/21 11:15
Worklog Time Spent: 10m
Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607758461

## File path: iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java ##
@@ -826,4 +866,31 @@ private StringBuilder buildComplexTypeInnerQuery(Object field, Type type) {
     }
     return query;
   }
+
+  private void validateBasicStats(Table table, String tableName) {

Review comment:
I would rather keep the separate `tableName` param. `table.name()` returns the name in `hive.default.customers` format.

Issue Time Tracking
-------------------
Worklog Id: (was: 577480)
Time Spent: 3h 50m (was: 3h 40m)
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577475=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577475 ]

ASF GitHub Bot logged work on HIVE-24928:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 06/Apr/21 11:08
Start Date: 06/Apr/21 11:08
Worklog Time Spent: 10m
Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607754506

## File path: iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java ##
@@ -92,6 +99,11 @@
       Types.TimestampType.withoutZone(), Types.StringType.get(), Types.BinaryType.get(), Types.DecimalType.of(3, 1),
       Types.UUIDType.get(), Types.FixedType.ofLength(5), Types.TimeType.get());

+  private static final Map STATS_MAPPING = ImmutableMap.of(
+      StatsSetupConst.NUM_FILES, SnapshotSummary.TOTAL_DATA_FILES_PROP,
+      StatsSetupConst.ROW_COUNT, SnapshotSummary.TOTAL_RECORDS_PROP
+      // TODO: add ROW_COUNT -> TOTAL_SIZE mapping after iceberg 0.12 is released

Review comment:
Fixed it.

Issue Time Tracking
-------------------
Worklog Id: (was: 577475)
Time Spent: 3h 40m (was: 3.5h)
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577424=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577424 ]

ASF GitHub Bot logged work on HIVE-24928:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 06/Apr/21 09:06
Start Date: 06/Apr/21 09:06
Worklog Time Spent: 10m
Work Description: marton-bod commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607675365

## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java ##
@@ -119,16 +135,92 @@ public String getName() {
     return "STATS-NO-JOB";
   }

-  static class StatItem {
-    Partish partish;
-    Map params;
-    Object result;
+  abstract static class StatCollector implements Runnable {
+
+    protected Partish partish;
+    protected Object result;
+    protected LogHelper console;
+
+    public static Function SIMPLE_NAME_FUNCTION =
+        sc -> String.format("%s#%s", sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+    public static Function EXTRACT_RESULT_FUNCTION = input -> (Partition) input.result();
+
+    abstract Partish partish();
+    abstract boolean isValid();
+    abstract Object result();
+    abstract void init(HiveConf conf, LogHelper console) throws IOException;
+
+    protected String toString(Map parameters) {
+      StringBuilder builder = new StringBuilder();

Review comment:
nit: no strong opinion here, but I usually find it more readable to use streams and then concatenate the results using `Collectors.joining(",");`

Issue Time Tracking
-------------------
Worklog Id: (was: 577424)
Time Spent: 3.5h (was: 3h 20m)
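The stream-based alternative suggested in the review comment could look roughly like this. It is a sketch only: `ParamFormatter` and the `key=value` output format are assumptions, not the code from the pull request, and sorting by key is added here purely to make the output deterministic.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper illustrating Collectors.joining instead of a StringBuilder loop.
final class ParamFormatter {
  private ParamFormatter() {
  }

  // Renders the map as "key=value" pairs separated by commas.
  // The TreeMap copy sorts entries by key (assumption: keys are non-null).
  static String toString(Map<String, String> parameters) {
    return new TreeMap<>(parameters).entrySet().stream()
        .map(e -> e.getKey() + "=" + e.getValue())
        .collect(Collectors.joining(","));
  }
}
```

Compared with a manual `StringBuilder` loop, `Collectors.joining` handles the separator itself, so there is no trailing comma to trim, and an empty map yields an empty string.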
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577422=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577422 ]

ASF GitHub Bot logged work on HIVE-24928:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 06/Apr/21 09:03
Start Date: 06/Apr/21 09:03
Worklog Time Spent: 10m
Work Description: marton-bod commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607672874

## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java ##
@@ -119,16 +135,92 @@ public String getName() {
     return "STATS-NO-JOB";
   }

-  static class StatItem {
-    Partish partish;
-    Map params;
-    Object result;
+  abstract static class StatCollector implements Runnable {
+
+    protected Partish partish;
+    protected Object result;
+    protected LogHelper console;
+
+    public static Function SIMPLE_NAME_FUNCTION =
+        sc -> String.format("%s#%s", sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+    public static Function EXTRACT_RESULT_FUNCTION = input -> (Partition) input.result();

Review comment:
nit: since we called the StatCollector `sc` in the above lambda, can we rename `input` to `sc` here as well?

Issue Time Tracking
-------------------
Worklog Id: (was: 577422)
Time Spent: 3h 20m (was: 3h 10m)
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577421=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577421 ]

ASF GitHub Bot logged work on HIVE-24928:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 06/Apr/21 09:02
Start Date: 06/Apr/21 09:02
Worklog Time Spent: 10m
Work Description: marton-bod commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607671652

## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java ##
@@ -90,12 +91,27 @@ public BasicStatsNoJobTask(HiveConf conf, BasicStatsNoJobWork work) {
     console = new LogHelper(LOG);
   }

-  public static boolean canUseFooterScan(
+  public static boolean canUseStats(
       Table table, Class inputFormat) {
+    return (OrcInputFormat.class.isAssignableFrom(inputFormat) && !AcidUtils.isFullAcidTable(table))
+        || MapredParquetInputFormat.class.isAssignableFrom(inputFormat)
+        || useBasicStatsFromStorageHandler(table);
+  }
+
+  public static boolean canUseColumnStats(Table table, Class inputFormat) {
     return (OrcInputFormat.class.isAssignableFrom(inputFormat) && !AcidUtils.isFullAcidTable(table))
         || MapredParquetInputFormat.class.isAssignableFrom(inputFormat);
   }

+  private static boolean useBasicStatsFromStorageHandler(Table table) {
+    if (table.isNonNative()) {
+      TableDesc tableDesc = Utilities.getTableDesc(table);
+      return table.getStorageHandler().getBasicStatistics(tableDesc) != null;

Review comment:
Aren't we calculating all the stats here by calling `getBasicStatistics`, just to then discard the results?

Issue Time Tracking
-------------------
Worklog Id: (was: 577421)
Time Spent: 3h 10m (was: 3h)
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577417=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577417 ]

ASF GitHub Bot logged work on HIVE-24928:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 06/Apr/21 08:57
Start Date: 06/Apr/21 08:57
Worklog Time Spent: 10m
Work Description: marton-bod commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607667423

## File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveStorageHandler.java ##
@@ -197,4 +197,8 @@ default boolean addDynamicSplitPruningEdge(ExprNodeDesc syntheticFilterPredicate
   default Map getOperatorDescProperties(OperatorDesc operatorDesc, Map initialProps) {
     return initialProps;
   }
+
+  default Map getBasicStatistics(TableDesc tableDesc) {

Review comment:
Since it's a new interface method, can you add some javadoc please?

Issue Time Tracking
-------------------
Worklog Id: (was: 577417)
Time Spent: 3h (was: 2h 50m)
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577416=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577416 ]

ASF GitHub Bot logged work on HIVE-24928:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 06/Apr/21 08:55
Start Date: 06/Apr/21 08:55
Worklog Time Spent: 10m
Work Description: marton-bod commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607666001

## File path: iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java ##
@@ -826,4 +866,31 @@ private StringBuilder buildComplexTypeInnerQuery(Object field, Type type) {
     }
     return query;
   }
+
+  private void validateBasicStats(Table table, String tableName) {
+    List describeResult = shell.executeStatement("DESCRIBE EXTENDED " + tableName);
+    Optional tableInfo =

Review comment:
Instead of using a `describe` command, wouldn't it be cleaner to load the HMS table and check its HMS params? Then we wouldn't need all the parsing. It's done in a similar fashion here: https://github.com/apache/iceberg/blob/master/mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerNoScan.java#L585

Issue Time Tracking
-------------------
Worklog Id: (was: 577416)
Time Spent: 2h 50m (was: 2h 40m)
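A sketch of the kind of check the reviewer suggests: compare the table parameters held by the metastore directly against the Iceberg snapshot summary, with no parsing of `DESCRIBE` output. Both sides are modeled as plain maps here; `BasicStatsCheck` is a hypothetical helper, and the literal keys (`numFiles`, `numRows`, `total-data-files`, `total-records`) are assumed values standing in for the `StatsSetupConst` and `SnapshotSummary` constants. How the two maps are obtained from the metastore and from Iceberg is left out on purpose.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical test helper; both inputs are plain maps rather than real HMS or Iceberg objects.
final class BasicStatsCheck {
  // Metastore stat key -> snapshot summary key (literal values are assumptions).
  static final Map<String, String> STATS_MAPPING = new LinkedHashMap<>();

  static {
    STATS_MAPPING.put("numFiles", "total-data-files");
    STATS_MAPPING.put("numRows", "total-records");
  }

  private BasicStatsCheck() {
  }

  // Throws if any mapped statistic is missing or differs between the two sources.
  static void validate(Map<String, String> hmsParams, Map<String, String> snapshotSummary) {
    for (Map.Entry<String, String> mapping : STATS_MAPPING.entrySet()) {
      String actual = hmsParams.get(mapping.getKey());
      String expected = snapshotSummary.get(mapping.getValue());
      if (expected == null || !expected.equals(actual)) {
        throw new IllegalStateException(String.format(
            "Stat %s: expected %s (from %s) but found %s",
            mapping.getKey(), expected, mapping.getValue(), actual));
      }
    }
  }
}
```

Keeping the key mapping in one table means a further statistic, such as the total size mentioned in the TODO, can be added later as a single extra entry.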
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577413=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577413 ]

ASF GitHub Bot logged work on HIVE-24928:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 06/Apr/21 08:49
Start Date: 06/Apr/21 08:49
Worklog Time Spent: 10m
Work Description: marton-bod commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607660940

## File path: iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java ##
@@ -826,4 +866,31 @@ private StringBuilder buildComplexTypeInnerQuery(Object field, Type type) {
     }
     return query;
   }
+
+  private void validateBasicStats(Table table, String tableName) {

Review comment:
Do we need the tableName param here? Can we use table.name()?

Issue Time Tracking
-------------------
Worklog Id: (was: 577413)
Time Spent: 2h 40m (was: 2.5h)
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577411=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577411 ] ASF GitHub Bot logged work on HIVE-24928: - Author: ASF GitHub Bot Created on: 06/Apr/21 08:47 Start Date: 06/Apr/21 08:47 Worklog Time Spent: 10m Work Description: marton-bod commented on a change in pull request #2111: URL: https://github.com/apache/hive/pull/2111#discussion_r607658821 ## File path: iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java ## @@ -92,6 +99,11 @@ Types.TimestampType.withoutZone(), Types.StringType.get(), Types.BinaryType.get(), Types.DecimalType.of(3, 1), Types.UUIDType.get(), Types.FixedType.ofLength(5), Types.TimeType.get()); + private static final Map STATS_MAPPING = ImmutableMap.of( + StatsSetupConst.NUM_FILES, SnapshotSummary.TOTAL_DATA_FILES_PROP, + StatsSetupConst.ROW_COUNT, SnapshotSummary.TOTAL_RECORDS_PROP + // TODO: add ROW_COUNT -> TOTAL_SIZE mapping after iceberg 0.12 is released Review comment: small typo: TOTAL_SIZE -> TOTAL_FILE_SIZE_PROP mapping -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 577411) Time Spent: 2.5h (was: 2h 20m) > In case of non-native tables use basic statistics from HiveStorageHandler > - > > Key: HIVE-24928 > URL: https://issues.apache.org/jira/browse/HIVE-24928 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 4.0.0 >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 2.5h > Remaining Estimate: 0h > > When we are running `ANALYZE TABLE ... COMPUTE STATISTICS` or `ANALYZE TABLE > ... 
COMPUTE STATISTICS FOR COLUMNS` all the basic statistics are collected by > the BasicStatsTask class. This class tries to estimate the statistics by > scanning the directory of the table. > In the case of non-native tables (iceberg, hbase), the table directory might > contain metadata files as well, which would be counted by the BasicStatsTask > when calculating basic stats. > Instead of having this logic, the HiveStorageHandler implementation should > provide basic statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
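The mapping discussed in this review comment, from Hive basic-stat keys to Iceberg snapshot-summary keys, can be sketched in isolation. This is only an illustration: the string literals below are stand-ins for the `StatsSetupConst` and `SnapshotSummary` constants and their exact values are assumptions, `StatsMappingSketch` and `statsMapping` are hypothetical names, and pairing `TOTAL_SIZE` with `TOTAL_FILE_SIZE_PROP` is one reading of the "small typo" remark rather than the merged code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StatsMappingSketch {
  // Stand-ins for Hive's StatsSetupConst keys (literal values are assumptions).
  static final String NUM_FILES = "numFiles";
  static final String ROW_COUNT = "numRows";
  static final String TOTAL_SIZE = "totalSize";

  // Stand-ins for Iceberg's SnapshotSummary keys (literal values are assumptions).
  static final String TOTAL_DATA_FILES_PROP = "total-data-files";
  static final String TOTAL_RECORDS_PROP = "total-records";
  static final String TOTAL_FILE_SIZE_PROP = "total-files-size";

  /**
   * Builds the Hive-stat-key to Iceberg-summary-key mapping.
   * The size entry is optional because the review notes it depends on a later Iceberg release.
   */
  static Map<String, String> statsMapping(boolean includeTotalSize) {
    Map<String, String> mapping = new LinkedHashMap<>();
    mapping.put(NUM_FILES, TOTAL_DATA_FILES_PROP);
    mapping.put(ROW_COUNT, TOTAL_RECORDS_PROP);
    if (includeTotalSize) {
      mapping.put(TOTAL_SIZE, TOTAL_FILE_SIZE_PROP);
    }
    return mapping;
  }
}
```

Keeping the size entry behind a flag mirrors the TODO in the quoted diff: the test can assert on two keys now and on three once the summary property is available.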
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577410=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577410 ] ASF GitHub Bot logged work on HIVE-24928: - Author: ASF GitHub Bot Created on: 06/Apr/21 08:31 Start Date: 06/Apr/21 08:31 Worklog Time Spent: 10m Work Description: lcspinter commented on pull request #2111: URL: https://github.com/apache/hive/pull/2111#issuecomment-813936795 @marton-bod @szlta If you have time, could you please review this PR? Thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 577410) Time Spent: 2h 20m (was: 2h 10m) > In case of non-native tables use basic statistics from HiveStorageHandler > - > > Key: HIVE-24928 > URL: https://issues.apache.org/jira/browse/HIVE-24928 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 4.0.0 >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 2h 20m > Remaining Estimate: 0h > > When we are running `ANALYZE TABLE ... COMPUTE STATISTICS` or `ANALYZE TABLE > ... COMPUTE STATISTICS FOR COLUMNS` all the basic statistics are collected by > the BasicStatsTask class. This class tries to estimate the statistics by > scanning the directory of the table. > In the case of non-native tables (iceberg, hbase), the table directory might > contain metadata files as well, which would be counted by the BasicStatsTask > when calculating basic stats. > Instead of having this logic, the HiveStorageHandler implementation should > provide basic statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=576104=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-576104 ] ASF GitHub Bot logged work on HIVE-24928: - Author: ASF GitHub Bot Created on: 02/Apr/21 13:54 Start Date: 02/Apr/21 13:54 Worklog Time Spent: 10m Work Description: lcspinter commented on pull request #2111: URL: https://github.com/apache/hive/pull/2111#issuecomment-812540900 @pvary @kgyrtkirk Could you please have a second look at this PR? Thanks -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 576104) Time Spent: 2h 10m (was: 2h) > In case of non-native tables use basic statistics from HiveStorageHandler > - > > Key: HIVE-24928 > URL: https://issues.apache.org/jira/browse/HIVE-24928 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 4.0.0 >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > > When we are running `ANALYZE TABLE ... COMPUTE STATISTICS` or `ANALYZE TABLE > ... COMPUTE STATISTICS FOR COLUMNS` all the basic statistics are collected by > the BasicStatsTask class. This class tries to estimate the statistics by > scanning the directory of the table. > In the case of non-native tables (iceberg, hbase), the table directory might > contain metadata files as well, which would be counted by the BasicStatsTask > when calculating basic stats. > Instead of having this logic, the HiveStorageHandler implementation should > provide basic statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=574276=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-574276 ] ASF GitHub Bot logged work on HIVE-24928: - Author: ASF GitHub Bot Created on: 30/Mar/21 17:32 Start Date: 30/Mar/21 17:32 Worklog Time Spent: 10m Work Description: lcspinter opened a new pull request #2111: URL: https://github.com/apache/hive/pull/2111 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 574276) Time Spent: 2h (was: 1h 50m) > In case of non-native tables use basic statistics from HiveStorageHandler > - > > Key: HIVE-24928 > URL: https://issues.apache.org/jira/browse/HIVE-24928 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 4.0.0 >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 2h > Remaining Estimate: 0h > > When we are running `ANALYZE TABLE ... COMPUTE STATISTICS` or `ANALYZE TABLE > ... COMPUTE STATISTICS FOR COLUMNS` all the basic statistics are collected by > the BasicStatsTask class. This class tries to estimate the statistics by > scanning the directory of the table. > In the case of non-native tables (iceberg, hbase), the table directory might > contain metadata files as well, which would be counted by the BasicStatsTask > when calculating basic stats. > Instead of having this logic, the HiveStorageHandler implementation should > provide basic statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=574275=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-574275 ] ASF GitHub Bot logged work on HIVE-24928: - Author: ASF GitHub Bot Created on: 30/Mar/21 17:32 Start Date: 30/Mar/21 17:32 Worklog Time Spent: 10m Work Description: lcspinter closed pull request #2111: URL: https://github.com/apache/hive/pull/2111 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 574275) Time Spent: 1h 50m (was: 1h 40m) > In case of non-native tables use basic statistics from HiveStorageHandler > - > > Key: HIVE-24928 > URL: https://issues.apache.org/jira/browse/HIVE-24928 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 4.0.0 >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > When we are running `ANALYZE TABLE ... COMPUTE STATISTICS` or `ANALYZE TABLE > ... COMPUTE STATISTICS FOR COLUMNS` all the basic statistics are collected by > the BasicStatsTask class. This class tries to estimate the statistics by > scanning the directory of the table. > In the case of non-native tables (iceberg, hbase), the table directory might > contain metadata files as well, which would be counted by the BasicStatsTask > when calculating basic stats. > Instead of having this logic, the HiveStorageHandler implementation should > provide basic statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=573581=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-573581 ] ASF GitHub Bot logged work on HIVE-24928: - Author: ASF GitHub Bot Created on: 29/Mar/21 16:04 Start Date: 29/Mar/21 16:04 Worklog Time Spent: 10m Work Description: lcspinter commented on a change in pull request #2111: URL: https://github.com/apache/hive/pull/2111#discussion_r603420853 ## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java ## @@ -118,11 +118,15 @@ public String getName() { private boolean isMissingAcidState = false; private BasicStatsWork work; private boolean followedColStats1; +private boolean isBasicStatProvided; +private Map providedBasicStats; -public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf conf, boolean followedColStats2) { +public BasicStatsProcessor(Table table, Partish partish, BasicStatsWork work, boolean followedColStats2) { Review comment: Per our discussion, I moved my changes to the `BasicStatsNoJobTask`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 573581) Time Spent: 1h 40m (was: 1.5h) > In case of non-native tables use basic statistics from HiveStorageHandler > - > > Key: HIVE-24928 > URL: https://issues.apache.org/jira/browse/HIVE-24928 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 4.0.0 >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 1h 40m > Remaining Estimate: 0h > > When we are running `ANALYZE TABLE ... COMPUTE STATISTICS` or `ANALYZE TABLE > ... COMPUTE STATISTICS FOR COLUMNS` all the basic statistics are collected by > the BasicStatsTask class. 
This class tries to estimate the statistics by > scanning the directory of the table. > In the case of non-native tables (iceberg, hbase), the table directory might > contain metadata files as well, which would be counted by the BasicStatsTask > when calculating basic stats. > Instead of having this logic, the HiveStorageHandler implementation should > provide basic statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=573579=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-573579 ] ASF GitHub Bot logged work on HIVE-24928: - Author: ASF GitHub Bot Created on: 29/Mar/21 16:03 Start Date: 29/Mar/21 16:03 Worklog Time Spent: 10m Work Description: lcspinter commented on a change in pull request #2111: URL: https://github.com/apache/hive/pull/2111#discussion_r603420091 ## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java ## @@ -118,11 +118,15 @@ public String getName() { private boolean isMissingAcidState = false; private BasicStatsWork work; private boolean followedColStats1; +private boolean isBasicStatProvided; +private Map providedBasicStats; -public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf conf, boolean followedColStats2) { +public BasicStatsProcessor(Table table, Partish partish, BasicStatsWork work, boolean followedColStats2) { Review comment: Removed it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 573579) Time Spent: 1.5h (was: 1h 20m) > In case of non-native tables use basic statistics from HiveStorageHandler > - > > Key: HIVE-24928 > URL: https://issues.apache.org/jira/browse/HIVE-24928 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 4.0.0 >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 1.5h > Remaining Estimate: 0h > > When we are running `ANALYZE TABLE ... COMPUTE STATISTICS` or `ANALYZE TABLE > ... COMPUTE STATISTICS FOR COLUMNS` all the basic statistics are collected by > the BasicStatsTask class. 
This class tries to estimate the statistics by > scanning the directory of the table. > In the case of non-native tables (iceberg, hbase), the table directory might > contain metadata files as well, which would be counted by the BasicStatsTask > when calculating basic stats. > Instead of having this logic, the HiveStorageHandler implementation should > provide basic statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=573578=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-573578 ] ASF GitHub Bot logged work on HIVE-24928: - Author: ASF GitHub Bot Created on: 29/Mar/21 16:02 Start Date: 29/Mar/21 16:02 Worklog Time Spent: 10m Work Description: lcspinter commented on a change in pull request #2111: URL: https://github.com/apache/hive/pull/2111#discussion_r603419311 ## File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java ## @@ -153,6 +156,24 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese return predicate; } + @Override + public Map getBasicStatistics(TableDesc tableDesc) { Review comment: We need the TableDesc, since the properties which are required to load the iceberg table are stored there. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 573578) Time Spent: 1h 20m (was: 1h 10m) > In case of non-native tables use basic statistics from HiveStorageHandler > - > > Key: HIVE-24928 > URL: https://issues.apache.org/jira/browse/HIVE-24928 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 4.0.0 >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > When we are running `ANALYZE TABLE ... COMPUTE STATISTICS` or `ANALYZE TABLE > ... COMPUTE STATISTICS FOR COLUMNS` all the basic statistics are collected by > the BasicStatsTask class. This class tries to estimate the statistics by > scanning the directory of the table. 
> In the case of non-native tables (iceberg, hbase), the table directory might > contain metadata files as well, which would be counted by the BasicStatsTask > when calculating basic stats. > Instead of having this logic, the HiveStorageHandler implementation should > provide basic statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=573368=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-573368 ] ASF GitHub Bot logged work on HIVE-24928: - Author: ASF GitHub Bot Created on: 29/Mar/21 08:50 Start Date: 29/Mar/21 08:50 Worklog Time Spent: 10m Work Description: kgyrtkirk commented on a change in pull request #2111: URL: https://github.com/apache/hive/pull/2111#discussion_r603114825 ## File path: ql/src/java/org/apache/hadoop/hive/ql/parse/TaskCompiler.java ## @@ -433,6 +434,12 @@ private String extractTableFullName(StatsTask tsk) throws SemanticException { return TaskFactory.get(columnStatsWork); } else { BasicStatsWork statsWork = new BasicStatsWork(tableScan.getConf().getTableMetadata().getTableSpec()); + for (MapWork mapWork : (Collection) currentTask.getMapWork()) { +if (mapWork.getAliasToPartnInfo() != null && mapWork.getAliasToPartnInfo().containsKey(table.getTableName())) { Review comment: we have a full table object passed to the `StatsWork` constructor - what's wrong with that? 
## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java ## @@ -258,7 +268,7 @@ private int aggregateStats(Hive db) { Partish p; partishes.add(p = new Partish.PTable(table)); -BasicStatsProcessor basicStatsProcessor = new BasicStatsProcessor(p, work, conf, followedColStats); +BasicStatsProcessor basicStatsProcessor = new BasicStatsProcessor(table, p, work, followedColStats); Review comment: I don't think we need these table calls - you may simply use `Partish#getTable` to get access to the table object later ## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java ## @@ -118,11 +118,15 @@ public String getName() { private boolean isMissingAcidState = false; private BasicStatsWork work; private boolean followedColStats1; +private boolean isBasicStatProvided; +private Map providedBasicStats; -public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf conf, boolean followedColStats2) { +public BasicStatsProcessor(Table table, Partish partish, BasicStatsWork work, boolean followedColStats2) { Review comment: yes; you should use the partish: `Partish.buildFor(table)` adding a `Table` here will cause confusion... ## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java ## @@ -118,11 +118,15 @@ public String getName() { private boolean isMissingAcidState = false; private BasicStatsWork work; private boolean followedColStats1; +private boolean isBasicStatProvided; +private Map providedBasicStats; -public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf conf, boolean followedColStats2) { +public BasicStatsProcessor(Table table, Partish partish, BasicStatsWork work, boolean followedColStats2) { Review comment: you seem to have added totally independent conditional logic to this class; wouldn't it be easier to simply introduce a new `Processor` class for the purpose you are targeting? 
## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java ## @@ -118,11 +118,15 @@ public String getName() { private boolean isMissingAcidState = false; private BasicStatsWork work; private boolean followedColStats1; +private boolean isBasicStatProvided; +private Map providedBasicStats; -public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf conf, boolean followedColStats2) { +public BasicStatsProcessor(Table table, Partish partish, BasicStatsWork work, boolean followedColStats2) { this.partish = partish; this.work = work; followedColStats1 = followedColStats2; + providedBasicStats = table.getStorageHandler().getBasicStatistics(work.getTableDesc()); Review comment: I believe you don't need the `tableDesc` for this method -> you will not need to add it to `StatsWork` ## File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java ## @@ -153,6 +156,24 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese return predicate; } + @Override + public Map getBasicStatistics(TableDesc tableDesc) { Review comment: this method seems to be using a `TableDesc` to be able to identify the underlying Iceberg table - wouldn't it be possible to do the same from a `Table` object? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at:
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=573352=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-573352 ] ASF GitHub Bot logged work on HIVE-24928: - Author: ASF GitHub Bot Created on: 29/Mar/21 07:48 Start Date: 29/Mar/21 07:48 Worklog Time Spent: 10m Work Description: pvary commented on a change in pull request #2111: URL: https://github.com/apache/hive/pull/2111#discussion_r603077680 ## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java ## @@ -118,11 +118,15 @@ public String getName() { private boolean isMissingAcidState = false; private BasicStatsWork work; private boolean followedColStats1; +private boolean isBasicStatProvided; +private Map providedBasicStats; -public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf conf, boolean followedColStats2) { +public BasicStatsProcessor(Table table, Partish partish, BasicStatsWork work, boolean followedColStats2) { this.partish = partish; this.work = work; followedColStats1 = followedColStats2; + providedBasicStats = table.getStorageHandler().getBasicStatistics(work.getTableDesc()); Review comment: Should we call this only if the table `isNonNative`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 573352) Time Spent: 1h (was: 50m) > In case of non-native tables use basic statistics from HiveStorageHandler > - > > Key: HIVE-24928 > URL: https://issues.apache.org/jira/browse/HIVE-24928 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 4.0.0 >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 1h > Remaining Estimate: 0h > > When we are running `ANALYZE TABLE ... 
COMPUTE STATISTICS` or `ANALYZE TABLE > ... COMPUTE STATISTICS FOR COLUMNS` all the basic statistics are collected by > the BasicStatsTask class. This class tries to estimate the statistics by > scanning the directory of the table. > In the case of non-native tables (iceberg, hbase), the table directory might > contain metadata files as well, which would be counted by the BasicStatsTask > when calculating basic stats. > Instead of having this logic, the HiveStorageHandler implementation should > provide basic statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
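The question raised in this comment, whether the storage handler should be consulted only for non-native tables, can be illustrated with a small guard. This is a sketch under stated assumptions, not the code from the pull request: `TableLike` and `StatsProvider` are hypothetical stand-ins for Hive's `Table` and `HiveStorageHandler`, reduced to the calls the guard needs, and `providedBasicStats` is an illustrative helper name.

```java
import java.util.Collections;
import java.util.Map;

public class NonNativeStatsGuard {
  /** Hypothetical stand-in for the storage handler hook discussed in the review. */
  interface StatsProvider {
    Map<String, String> getBasicStatistics();
  }

  /** Hypothetical stand-in for a Hive table: only what the guard needs. */
  interface TableLike {
    boolean isNonNative();
    StatsProvider getStorageHandler();
  }

  /**
   * Returns handler-provided stats for non-native tables and an empty map otherwise,
   * so native tables never touch the storage handler.
   */
  static Map<String, String> providedBasicStats(TableLike table) {
    if (table == null || !table.isNonNative()) {
      return Collections.<String, String>emptyMap();
    }
    StatsProvider handler = table.getStorageHandler();
    if (handler == null) {
      return Collections.<String, String>emptyMap();
    }
    Map<String, String> stats = handler.getBasicStatistics();
    return stats == null ? Collections.<String, String>emptyMap() : stats;
  }
}
```

Returning an empty map rather than null lets the caller treat "nothing provided" and "native table" the same way and fall back to its usual directory scan.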
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=573351=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-573351 ] ASF GitHub Bot logged work on HIVE-24928: - Author: ASF GitHub Bot Created on: 29/Mar/21 07:47 Start Date: 29/Mar/21 07:47 Worklog Time Spent: 10m Work Description: pvary commented on a change in pull request #2111: URL: https://github.com/apache/hive/pull/2111#discussion_r603077185 ## File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java ## @@ -118,11 +118,15 @@ public String getName() { private boolean isMissingAcidState = false; private BasicStatsWork work; private boolean followedColStats1; +private boolean isBasicStatProvided; +private Map providedBasicStats; -public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf conf, boolean followedColStats2) { +public BasicStatsProcessor(Table table, Partish partish, BasicStatsWork work, boolean followedColStats2) { Review comment: My feeling is that `Partish` should be some general object covering both `Table` and `Partition`. How hard/wasteful would it be to add the required data to the `Partish` object? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 573351) Time Spent: 50m (was: 40m) > In case of non-native tables use basic statistics from HiveStorageHandler > - > > Key: HIVE-24928 > URL: https://issues.apache.org/jira/browse/HIVE-24928 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 4.0.0 >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 50m > Remaining Estimate: 0h > > When we are running `ANALYZE TABLE ... COMPUTE STATISTICS` or `ANALYZE TABLE > ... 
COMPUTE STATISTICS FOR COLUMNS` all the basic statistics are collected by > the BasicStatsTask class. This class tries to estimate the statistics by > scanning the directory of the table. > In the case of non-native tables (iceberg, hbase), the table directory might > contain metadata files as well, which would be counted by the BasicStatsTask > when calculating basic stats. > Instead of having this logic, the HiveStorageHandler implementation should > provide basic statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=573349=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-573349 ] ASF GitHub Bot logged work on HIVE-24928: - Author: ASF GitHub Bot Created on: 29/Mar/21 07:44 Start Date: 29/Mar/21 07:44 Worklog Time Spent: 10m Work Description: pvary commented on a change in pull request #2111: URL: https://github.com/apache/hive/pull/2111#discussion_r603075162 ## File path: iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java ## @@ -92,6 +98,9 @@ Types.TimestampType.withoutZone(), Types.StringType.get(), Types.BinaryType.get(), Types.DecimalType.of(3, 1), Types.UUIDType.get(), Types.FixedType.ofLength(5), Types.TimeType.get()); + private static final Map STATS_MAPPING = ImmutableMap.of( Review comment: Maybe TODO here as well to check the `TOTAL_SIZE` too? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 573349) Time Spent: 40m (was: 0.5h) > In case of non-native tables use basic statistics from HiveStorageHandler > - > > Key: HIVE-24928 > URL: https://issues.apache.org/jira/browse/HIVE-24928 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 4.0.0 >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 40m > Remaining Estimate: 0h > > When we are running `ANALYZE TABLE ... COMPUTE STATISTICS` or `ANALYZE TABLE > ... COMPUTE STATISTICS FOR COLUMNS` all the basic statistics are collected by > the BasicStatsTask class. This class tries to estimate the statistics by > scanning the directory of the table. 
> In the case of non-native tables (iceberg, hbase), the table directory might > contain metadata files as well, which would be counted by the BasicStatsTask > when calculating basic stats. > Instead of having this logic, the HiveStorageHandler implementation should > provide basic statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=573348=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-573348 ] ASF GitHub Bot logged work on HIVE-24928: - Author: ASF GitHub Bot Created on: 29/Mar/21 07:42 Start Date: 29/Mar/21 07:42 Worklog Time Spent: 10m Work Description: pvary commented on a change in pull request #2111: URL: https://github.com/apache/hive/pull/2111#discussion_r603074290 ## File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java ## @@ -153,6 +156,24 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese return predicate; } + @Override + public Map getBasicStatistics(TableDesc tableDesc) { +Table table = Catalogs.loadTable(conf, tableDesc.getProperties()); +Map summary = table.currentSnapshot().summary(); +Map stats = new HashMap<>(); +if (summary.containsKey(SnapshotSummary.TOTAL_DATA_FILES_PROP)) { + stats.put(StatsSetupConst.NUM_FILES, summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP)); +} +if (summary.containsKey(SnapshotSummary.TOTAL_RECORDS_PROP)) { + stats.put(StatsSetupConst.ROW_COUNT, summary.get(SnapshotSummary.TOTAL_RECORDS_PROP)); +} +// TODO: add TOTAL_SIZE when iceberg 0.12 is released Review comment: With this TODO, do we need to comment out the code below? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking --- Worklog Id: (was: 573348) Time Spent: 0.5h (was: 20m) > In case of non-native tables use basic statistics from HiveStorageHandler > - > > Key: HIVE-24928 > URL: https://issues.apache.org/jira/browse/HIVE-24928 > Project: Hive > Issue Type: Bug > Components: Hive >Affects Versions: 4.0.0 >Reporter: László Pintér >Assignee: László Pintér >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > When we are running `ANALYZE TABLE ... COMPUTE STATISTICS` or `ANALYZE TABLE > ... COMPUTE STATISTICS FOR COLUMNS` all the basic statistics are collected by > the BasicStatsTask class. This class tries to estimate the statistics by > scanning the directory of the table. > In the case of non-native tables (iceberg, hbase), the table directory might > contain metadata files as well, which would be counted by the BasicStatsTask > when calculating basic stats. > Instead of having this logic, the HiveStorageHandler implementation should > provide basic statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005)
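The summary-to-stats copy in the `getBasicStatistics` diff quoted above can be sketched on its own, with a guard for a missing summary. This is a hedged illustration rather than the merged implementation: the key literals stand in for the `StatsSetupConst` and `SnapshotSummary` constants (their values are assumptions), `BasicStatsFromSummary`, `toBasicStats` and `copyIfPresent` are hypothetical names, and the null case assumes a table that has no snapshot summary yet.

```java
import java.util.HashMap;
import java.util.Map;

public class BasicStatsFromSummary {
  // Stand-ins for Hive's StatsSetupConst keys (literal values are assumptions).
  static final String NUM_FILES = "numFiles";
  static final String ROW_COUNT = "numRows";

  // Stand-ins for Iceberg's SnapshotSummary keys (literal values are assumptions).
  static final String TOTAL_DATA_FILES_PROP = "total-data-files";
  static final String TOTAL_RECORDS_PROP = "total-records";

  /** Converts a snapshot summary into Hive basic stats; a null summary yields no stats. */
  static Map<String, String> toBasicStats(Map<String, String> summary) {
    Map<String, String> stats = new HashMap<>();
    if (summary == null) {
      return stats;
    }
    copyIfPresent(summary, TOTAL_DATA_FILES_PROP, stats, NUM_FILES);
    copyIfPresent(summary, TOTAL_RECORDS_PROP, stats, ROW_COUNT);
    return stats;
  }

  /** Copies one value under a new key, skipping entries the summary does not carry. */
  private static void copyIfPresent(Map<String, String> from, String fromKey,
                                    Map<String, String> to, String toKey) {
    String value = from.get(fromKey);
    if (value != null) {
      to.put(toKey, value);
    }
  }
}
```

For example, a summary holding only `total-records=42` produces a stats map with `numRows=42` and no `numFiles` entry.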
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=573347&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-573347 ]

ASF GitHub Bot logged work on HIVE-24928:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 29/Mar/21 07:41
Start Date: 29/Mar/21 07:41
Worklog Time Spent: 10m
Work Description: pvary commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r603073899

## File path: iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java ##
@@ -153,6 +156,24 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
     return predicate;
   }

+  @Override
+  public Map getBasicStatistics(TableDesc tableDesc) {
+    Table table = Catalogs.loadTable(conf, tableDesc.getProperties());
+    Map summary = table.currentSnapshot().summary();

Review comment:
Could we check that the summary is always non-null?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 573347)
Time Spent: 20m (was: 10m)

> In case of non-native tables use basic statistics from HiveStorageHandler
> -------------------------------------------------------------------------
>
> Key: HIVE-24928
> URL: https://issues.apache.org/jira/browse/HIVE-24928
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 4.0.0
> Reporter: László Pintér
> Assignee: László Pintér
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> When we are running `ANALYZE TABLE ... COMPUTE STATISTICS` or `ANALYZE TABLE
> ... COMPUTE STATISTICS FOR COLUMNS` all the basic statistics are collected by
> the BasicStatsTask class. This class tries to estimate the statistics by
> scanning the directory of the table.
> In the case of non-native tables (iceberg, hbase), the table directory might
> contain metadata files as well, which would be counted by the BasicStatsTask
> when calculating basic stats.
> Instead of having this logic, the HiveStorageHandler implementation should
> provide basic statistics.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
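The review comment above asks whether the snapshot summary can be relied on to be non-null. A minimal sketch of a null-tolerant mapping follows; it is not the code merged in the pull request. To stay self-contained it works on plain maps rather than Iceberg or Hive types, and the string keys are assumptions meant to mirror `SnapshotSummary.TOTAL_DATA_FILES_PROP`, `SnapshotSummary.TOTAL_RECORDS_PROP`, `StatsSetupConst.NUM_FILES` and `StatsSetupConst.ROW_COUNT`. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: copies snapshot-summary entries onto Hive-style basic-stat keys. */
final class BasicStatsSketch {
  // Assumed values of the Iceberg SnapshotSummary constants named in the diff.
  static final String TOTAL_DATA_FILES = "total-data-files";
  static final String TOTAL_RECORDS = "total-records";
  // Assumed values of the Hive StatsSetupConst constants named in the diff.
  static final String NUM_FILES = "numFiles";
  static final String ROW_COUNT = "numRows";

  private BasicStatsSketch() {
  }

  /**
   * Returns the basic statistics found in the summary.
   * A null summary (for example, a table with no snapshot yet) yields an empty map.
   */
  static Map<String, String> toBasicStats(Map<String, String> summary) {
    Map<String, String> stats = new HashMap<>();
    if (summary == null) {
      return stats;
    }
    copyIfPresent(summary, TOTAL_DATA_FILES, stats, NUM_FILES);
    copyIfPresent(summary, TOTAL_RECORDS, stats, ROW_COUNT);
    return stats;
  }

  // Copies a single entry under a new key; missing or null values are skipped.
  private static void copyIfPresent(Map<String, String> from, String fromKey,
      Map<String, String> to, String toKey) {
    String value = from.get(fromKey);
    if (value != null) {
      to.put(toKey, value);
    }
  }
}
```

In a real handler the caller would also need to guard against `table.currentSnapshot()` itself being null before asking for the summary; the sketch assumes that check happens before `toBasicStats` is called, or that null is passed through.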
[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler
[ https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=571000&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-571000 ]

ASF GitHub Bot logged work on HIVE-24928:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 24/Mar/21 08:48
Start Date: 24/Mar/21 08:48
Worklog Time Spent: 10m
Work Description: lcspinter opened a new pull request #2111:
URL: https://github.com/apache/hive/pull/2111

…veStorageHandler

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 571000)
Remaining Estimate: 0h
Time Spent: 10m

> In case of non-native tables use basic statistics from HiveStorageHandler
> -------------------------------------------------------------------------
>
> Key: HIVE-24928
> URL: https://issues.apache.org/jira/browse/HIVE-24928
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 4.0.0
> Reporter: László Pintér
> Assignee: László Pintér
> Priority: Major
> Fix For: 4.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When we are running `ANALYZE TABLE ... COMPUTE STATISTICS` or `ANALYZE TABLE
> ... COMPUTE STATISTICS FOR COLUMNS` all the basic statistics are collected by
> the BasicStatsTask class. This class tries to estimate the statistics by
> scanning the directory of the table.
> In the case of non-native tables (iceberg, hbase), the table directory might
> contain metadata files as well, which would be counted by the BasicStatsTask
> when calculating basic stats.
> Instead of having this logic, the HiveStorageHandler implementation should
> provide basic statistics.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
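The issue description says that scanning the table directory also counts metadata files. The following is a small illustration of that overcount, not Hive's actual BasicStatsTask logic. The `data/` and `metadata/` prefixes, the file names, and every class and method name are hypothetical and chosen only to make the difference visible.

```java
import java.util.List;

/** Sketch only: contrasts a naive directory scan with counting data files alone. */
final class DirectoryScanSketch {

  /** A file under the table location; the path is relative to the table root. */
  static final class FileEntry {
    final String path;
    final long sizeBytes;

    FileEntry(String path, long sizeBytes) {
      this.path = path;
      this.sizeBytes = sizeBytes;
    }
  }

  private DirectoryScanSketch() {
  }

  /** What a plain directory scan sees: every file, metadata included. */
  static long countAllFiles(List<FileEntry> files) {
    return files.size();
  }

  /** What the table format itself would report: data files only (assumed "data/" prefix). */
  static long countDataFiles(List<FileEntry> files) {
    return files.stream().filter(f -> f.path.startsWith("data/")).count();
  }

  /** Total size over all files, as a directory scan would compute it. */
  static long totalSizeAllFiles(List<FileEntry> files) {
    return files.stream().mapToLong(f -> f.sizeBytes).sum();
  }
}
```

With two data files and two metadata files under the table root, `countAllFiles` reports 4 while `countDataFiles` reports 2, which is the kind of discrepancy the issue proposes to avoid by asking the storage handler for its own numbers.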