[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=584904&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-584904
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 19/Apr/21 06:22
Start Date: 19/Apr/21 06:22
Worklog Time Spent: 10m 
  Work Description: lcspinter merged pull request #2111:
URL: https://github.com/apache/hive/pull/2111


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 584904)
Time Spent: 6h 20m  (was: 6h 10m)

> In case of non-native tables use basic statistics from HiveStorageHandler
> -
>
> Key: HIVE-24928
> URL: https://issues.apache.org/jira/browse/HIVE-24928
> Project: Hive
> Issue Type: Bug
> Components: Hive
> Affects Versions: 4.0.0
> Reporter: László Pintér
> Assignee: László Pintér
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
> Time Spent: 6h 20m
> Remaining Estimate: 0h
>
> When we run `ANALYZE TABLE ... COMPUTE STATISTICS` or `ANALYZE TABLE
> ... COMPUTE STATISTICS FOR COLUMNS`, all the basic statistics are collected by
> the BasicStatsTask class. This class tries to estimate the statistics by
> scanning the directory of the table.
> In the case of non-native tables (Iceberg, HBase), the table directory might
> also contain metadata files, which BasicStatsTask would count when
> calculating basic stats.
> Instead of relying on this logic, the HiveStorageHandler implementation should
> provide the basic statistics.
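
To illustrate the problem described above, here is a minimal Java sketch. It is not Hive code: the class name, the method names, the file names, and the `data/` and `metadata/` layout are all assumptions chosen for illustration. It only shows how a directory-wide count overstates the number of data files when metadata files live under the same table directory.

```java
import java.util.Arrays;
import java.util.List;

public class DirectoryScanSketch {
  // Hypothetical listing of everything under a non-native table's directory.
  public static final List<String> FILES = Arrays.asList(
      "data/00000-0-a.parquet",
      "data/00001-0-b.parquet",
      "metadata/v1.metadata.json",
      "metadata/snap-1.avro");

  // A directory scan counts every file it finds, metadata included.
  public static long countAllFiles(List<String> files) {
    return files.size();
  }

  // Only the table format knows which files actually hold rows
  // (assumed here to be the ones under data/).
  public static long countDataFiles(List<String> files) {
    return files.stream().filter(f -> f.startsWith("data/")).count();
  }

  public static void main(String[] args) {
    System.out.println("directory scan: " + countAllFiles(FILES));
    System.out.println("data files:     " + countDataFiles(FILES));
  }
}
```

The gap between the two counts is the reason the issue asks the storage handler, rather than a directory scan, to supply the numbers.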



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=581853&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-581853
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 13/Apr/21 15:27
Start Date: 13/Apr/21 15:27
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r612552854



##
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/TaskCompiler.java
##
@@ -422,7 +422,7 @@ private String extractTableFullName(StatsTask tsk) throws 
SemanticException {
 TableSpec tableSpec = new TableSpec(table, partitions);
 tableScan.getConf().getTableMetadata().setTableSpec(tableSpec);
 
-if (BasicStatsNoJobTask.canUseFooterScan(table, inputFormat)) {
+if (BasicStatsNoJobTask.canUseColumnStats(table, inputFormat)) {

Review comment:
   Right, fixed it.






Issue Time Tracking
---

Worklog Id: (was: 581853)
Time Spent: 6h 10m  (was: 6h)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=581849&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-581849
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 13/Apr/21 15:26
Start Date: 13/Apr/21 15:26
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r612551277



##
File path: 
iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##
@@ -153,6 +156,37 @@ public DecomposedPredicate decomposePredicate(JobConf 
jobConf, Deserializer dese
 return predicate;
   }
 
+  @Override
+  public boolean canProvideBasicStatistics() {
+return true;
+  }
+
+  @Override
+  public Map getBasicStatistics(TableDesc tableDesc) {
+Table table = Catalogs.loadTable(conf, tableDesc.getProperties());
+Map stats = new HashMap<>();
+if (table.currentSnapshot() != null) {
+  Map summary = table.currentSnapshot().summary();
+  if (summary != null) {
+if (summary.containsKey(SnapshotSummary.TOTAL_DATA_FILES_PROP)) {
+  stats.put(StatsSetupConst.NUM_FILES, 
summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP));
+}
+if (summary.containsKey(SnapshotSummary.TOTAL_RECORDS_PROP)) {
+  stats.put(StatsSetupConst.ROW_COUNT, 
summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
+}
+// TODO: add TOTAL_SIZE when iceberg 0.12 is released
+if (summary.containsKey("total-files-size")) {
+  stats.put(StatsSetupConst.TOTAL_SIZE, 
summary.get("total-files-size"));
+}
+  }
+} else {
+  stats.put(StatsSetupConst.NUM_FILES, "0");

Review comment:
   In the case of an empty table, the current snapshot is null. I thought
   setting all the basic stats to 0 was the right approach, since we don't
   have any data.
   When the summary of the snapshot is not available, I return an empty
   statistics map.
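
The distinction drawn in this comment (zeros for an empty table, an empty map when the summary is missing) can be sketched as follows. This is a simplified stand-in, not the code from the pull request: the class, the `hasSnapshot` flag, and the helper are hypothetical, and the string keys are assumed to mirror the `StatsSetupConst` and `SnapshotSummary` constants used in the diff.

```java
import java.util.HashMap;
import java.util.Map;

public class SnapshotStatsSketch {

  // hasSnapshot == false models an empty table (no current snapshot).
  // summary == null models a snapshot whose summary is unavailable.
  public static Map<String, String> basicStats(boolean hasSnapshot, Map<String, String> summary) {
    Map<String, String> stats = new HashMap<>();
    if (!hasSnapshot) {
      // Empty table: zero is an accurate answer, so report it.
      stats.put("numFiles", "0");
      stats.put("numRows", "0");
      stats.put("totalSize", "0");
      return stats;
    }
    if (summary == null) {
      // Unknown is not the same as zero: leave the map empty.
      return stats;
    }
    copyIfPresent(summary, "total-data-files", stats, "numFiles");
    copyIfPresent(summary, "total-records", stats, "numRows");
    copyIfPresent(summary, "total-files-size", stats, "totalSize");
    return stats;
  }

  // Copies a value only when the source actually has it, so missing stays missing.
  private static void copyIfPresent(Map<String, String> from, String fromKey,
      Map<String, String> to, String toKey) {
    if (from.containsKey(fromKey)) {
      to.put(toKey, from.get(fromKey));
    }
  }

  public static void main(String[] args) {
    Map<String, String> summary = new HashMap<>();
    summary.put("total-records", "42");
    System.out.println(basicStats(false, null));
    System.out.println(basicStats(true, null));
    System.out.println(basicStats(true, summary));
  }
}
```

Keeping "unknown" as an absent key rather than a zero lets the caller tell an empty table apart from a table whose statistics simply are not available.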






Issue Time Tracking
---

Worklog Id: (was: 581849)
Time Spent: 6h  (was: 5h 50m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-13 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=581843&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-581843
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 13/Apr/21 15:19
Start Date: 13/Apr/21 15:19
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r612545094



##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##
@@ -119,16 +129,83 @@ public String getName() {
 return "STATS-NO-JOB";
   }
 
-  static class StatItem {
-Partish partish;
-Map params;
-Object result;
+  abstract static class StatCollector implements Runnable {
+
+protected Partish partish;
+protected Object result;
+protected LogHelper console;
+
+public static Function SIMPLE_NAME_FUNCTION =
+sc -> String.format("%s#%s", 
sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+public static Function EXTRACT_RESULT_FUNCTION = 
sc -> (Partition) sc.result();
+
+abstract Partish partish();
+abstract boolean isValid();
+abstract Object result();
+abstract void init(HiveConf conf, LogHelper console) throws IOException;
+
+protected String toString(Map parameters) {
+  return StatsSetupConst.SUPPORTED_STATS.stream().map(st -> st + "=" + 
parameters.get(st))
+  .collect(Collectors.joining(", "));
+}
   }
 
-  static class FooterStatCollector implements Runnable {
+  static class HiveStorageHandlerStatCollector extends StatCollector {
+
+public HiveStorageHandlerStatCollector(Partish partish) {
+  this.partish = partish;
+}
+
+@Override
+public void init(HiveConf conf, LogHelper console) throws IOException {
+  this.console = console;
+}
+
+@Override
+public void run() {
+  try {
+Table table = partish.getTable();
+Map parameters = partish.getPartParameters();
+TableDesc tableDesc = Utilities.getTableDesc(table);
+Map basicStatistics = 
table.getStorageHandler().getBasicStatistics(tableDesc);

Review comment:
   Correct, I missed that. I will pass the `partish` object, which is
   enough to calculate the table/partition stats on the StorageHandler side.
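
A rough sketch of what passing the target to the handler could look like. Everything here is hypothetical: `Target` is a stand-in for Hive's `Partish`, and the `Handler` class and its methods are invented for illustration and are not the signatures from the pull request.

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionAwareStatsSketch {

  // Hypothetical stand-in for Hive's Partish: a table, or one partition of it.
  public static final class Target {
    public final String table;
    public final String partition; // null means "the whole table"

    public Target(String table, String partition) {
      this.table = table;
      this.partition = partition;
    }
  }

  // Hypothetical handler that keeps row counts per partition.
  public static final class Handler {
    private final Map<String, Long> rowsPerPartition = new HashMap<>();

    public void addRows(String partition, long rows) {
      rowsPerPartition.merge(partition, rows, Long::sum);
    }

    // With the target in hand, the handler can scope the answer itself.
    public Map<String, String> getBasicStatistics(Target target) {
      long rows;
      if (target.partition == null) {
        rows = rowsPerPartition.values().stream().mapToLong(Long::longValue).sum();
      } else {
        rows = rowsPerPartition.getOrDefault(target.partition, 0L);
      }
      Map<String, String> stats = new HashMap<>();
      stats.put("numRows", Long.toString(rows));
      return stats;
    }
  }

  public static void main(String[] args) {
    Handler handler = new Handler();
    handler.addRows("p=1", 10);
    handler.addRows("p=2", 5);
    System.out.println(handler.getBasicStatistics(new Target("t", null)));
    System.out.println(handler.getBasicStatistics(new Target("t", "p=1")));
  }
}
```

A table-only argument cannot express the second call, which is the gap raised in the review.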






Issue Time Tracking
---

Worklog Id: (was: 581843)
Time Spent: 5h 50m  (was: 5h 40m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-12 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=580872&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-580872
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 12/Apr/21 10:29
Start Date: 12/Apr/21 10:29
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r611489108



##
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/TaskCompiler.java
##
@@ -422,7 +422,7 @@ private String extractTableFullName(StatsTask tsk) throws 
SemanticException {
 TableSpec tableSpec = new TableSpec(table, partitions);
 tableScan.getConf().getTableMetadata().setTableSpec(tableSpec);
 
-if (BasicStatsNoJobTask.canUseFooterScan(table, inputFormat)) {
+if (BasicStatsNoJobTask.canUseColumnStats(table, inputFormat)) {

Review comment:
   We have both `BasicStatsNoJobTask.canUseStats` and
   `BasicStatsNoJobTask.canUseColumnStats`. I think "footerscan" is the
   basic stats part; could this be a typo?

##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##
@@ -90,12 +91,21 @@ public BasicStatsNoJobTask(HiveConf conf, 
BasicStatsNoJobWork work) {
 console = new LogHelper(LOG);
   }
 
-  public static boolean canUseFooterScan(
+  public static boolean canUseStats(
   Table table, Class inputFormat) {
+  return canUseColumnStats(table, inputFormat) || 
useBasicStatsFromStorageHandler(table);
+  }
+
+  public static boolean canUseColumnStats(Table table, Class inputFormat) {

Review comment:
   This has nothing to do with column stats; that's a different thing.

##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##
@@ -312,23 +384,23 @@ private int aggregateStats(ExecutorService threadPool, 
Hive db) {
 return ret;
   }
 
-  private int updatePartitions(Hive db, List scs, Table 
table) throws InvalidOperationException, HiveException {
+  private int updatePartitions(Hive db, List scs, Table table) 
throws InvalidOperationException, HiveException {
 
 String tableFullName = table.getFullyQualifiedName();
 
 if (scs.isEmpty()) {
   return 0;
 }
 if (work.isStatsReliable()) {

Review comment:
   Note: it might make sense to communicate `work.isStatsReliable` to the
   `StatCollector` somehow, so it can make that `LOG` entry if it has to...

##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##
@@ -119,16 +129,83 @@ public String getName() {
 return "STATS-NO-JOB";
   }
 
-  static class StatItem {
-Partish partish;
-Map params;
-Object result;
+  abstract static class StatCollector implements Runnable {
+
+protected Partish partish;
+protected Object result;
+protected LogHelper console;
+
+public static Function SIMPLE_NAME_FUNCTION =
+sc -> String.format("%s#%s", 
sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+public static Function EXTRACT_RESULT_FUNCTION = 
sc -> (Partition) sc.result();
+
+abstract Partish partish();
+abstract boolean isValid();
+abstract Object result();
+abstract void init(HiveConf conf, LogHelper console) throws IOException;
+
+protected String toString(Map parameters) {
+  return StatsSetupConst.SUPPORTED_STATS.stream().map(st -> st + "=" + 
parameters.get(st))
+  .collect(Collectors.joining(", "));
+}
   }
 
-  static class FooterStatCollector implements Runnable {
+  static class HiveStorageHandlerStatCollector extends StatCollector {
+
+public HiveStorageHandlerStatCollector(Partish partish) {
+  this.partish = partish;
+}
+
+@Override
+public void init(HiveConf conf, LogHelper console) throws IOException {
+  this.console = console;
+}
+
+@Override
+public void run() {
+  try {
+Table table = partish.getTable();
+Map parameters = partish.getPartParameters();
+TableDesc tableDesc = Utilities.getTableDesc(table);
+Map basicStatistics = 
table.getStorageHandler().getBasicStatistics(tableDesc);
+
+StatsSetupConst.setBasicStatsState(parameters, StatsSetupConst.TRUE);

Review comment:
   I don't understand why we make changes to the `Table` when we could be
   updating the info of a partition as well...
   I guess in the case of Iceberg you will not have regular partitions, so it
   will probably work correctly for that.
   
   I think here you want to change `parameters`.

##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##
@@ -90,12 +91,21 @@ public BasicStatsNoJobTask(HiveConf conf, 
BasicStatsNoJobWork work) {
 console = new LogHelper(LOG);
   }
 
-  public static boolean 

[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=579836&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579836
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 09/Apr/21 07:45
Start Date: 09/Apr/21 07:45
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r610417708



##
File path: 
iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##
@@ -153,6 +156,37 @@ public DecomposedPredicate decomposePredicate(JobConf 
jobConf, Deserializer dese
 return predicate;
   }
 
+  @Override
+  public boolean canProvideBasicStatistics() {
+return true;
+  }
+
+  @Override
+  public Map getBasicStatistics(TableDesc tableDesc) {
+Table table = Catalogs.loadTable(conf, tableDesc.getProperties());
+Map stats = new HashMap<>();
+if (table.currentSnapshot() != null) {
+  Map summary = table.currentSnapshot().summary();
+  if (summary != null) {
+if (summary.containsKey(SnapshotSummary.TOTAL_DATA_FILES_PROP)) {
+  stats.put(StatsSetupConst.NUM_FILES, 
summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP));
+}
+if (summary.containsKey(SnapshotSummary.TOTAL_RECORDS_PROP)) {
+  stats.put(StatsSetupConst.ROW_COUNT, 
summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
+}
+// TODO: add TOTAL_SIZE when iceberg 0.12 is released
+if (summary.containsKey("total-files-size")) {
+  stats.put(StatsSetupConst.TOTAL_SIZE, 
summary.get("total-files-size"));
+}
+  }
+} else {
+  stats.put(StatsSetupConst.NUM_FILES, "0");

Review comment:
   Is this for an empty table, or for when we do not have statistics at hand?
   We might want to handle the situation where the statistics have not been
   calculated yet, or where the table info is incomplete.
   
   On the Iceberg dev list I have seen this conversation:
   
https://mail-archives.apache.org/mod_mbox/iceberg-dev/202104.mbox/%3c9a11adb4-27d8-40f1-8141-531287c03...@gmail.com%3e
   
   > So the tldr, Missing is OK, but inaccurate is not






Issue Time Tracking
---

Worklog Id: (was: 579836)
Time Spent: 5.5h  (was: 5h 20m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=579835&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579835
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 09/Apr/21 07:40
Start Date: 09/Apr/21 07:40
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r610415034



##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##
@@ -119,16 +129,83 @@ public String getName() {
 return "STATS-NO-JOB";
   }
 
-  static class StatItem {
-Partish partish;
-Map params;
-Object result;
+  abstract static class StatCollector implements Runnable {
+
+protected Partish partish;
+protected Object result;
+protected LogHelper console;
+
+public static Function SIMPLE_NAME_FUNCTION =
+sc -> String.format("%s#%s", 
sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+public static Function EXTRACT_RESULT_FUNCTION = 
sc -> (Partition) sc.result();
+
+abstract Partish partish();
+abstract boolean isValid();
+abstract Object result();
+abstract void init(HiveConf conf, LogHelper console) throws IOException;
+
+protected String toString(Map parameters) {
+  return StatsSetupConst.SUPPORTED_STATS.stream().map(st -> st + "=" + 
parameters.get(st))
+  .collect(Collectors.joining(", "));
+}
   }
 
-  static class FooterStatCollector implements Runnable {
+  static class HiveStorageHandlerStatCollector extends StatCollector {
+
+public HiveStorageHandlerStatCollector(Partish partish) {
+  this.partish = partish;
+}
+
+@Override
+public void init(HiveConf conf, LogHelper console) throws IOException {
+  this.console = console;
+}
+
+@Override
+public void run() {
+  try {
+Table table = partish.getTable();
+Map parameters = partish.getPartParameters();
+TableDesc tableDesc = Utilities.getTableDesc(table);
+Map basicStatistics = 
table.getStorageHandler().getBasicStatistics(tableDesc);

Review comment:
   If the table were partitioned, this would not give the StorageHandler
   enough information to generate partition-related statistics.
   We should either document this or pass some information to the
   StorageHandler so that it can calculate partition statistics.






Issue Time Tracking
---

Worklog Id: (was: 579835)
Time Spent: 5h 20m  (was: 5h 10m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=579834&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579834
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 09/Apr/21 07:38
Start Date: 09/Apr/21 07:38
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r610413396



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveStorageHandler.java
##
@@ -197,4 +197,22 @@ default boolean addDynamicSplitPruningEdge(ExprNodeDesc 
syntheticFilterPredicate
   default Map getOperatorDescProperties(OperatorDesc 
operatorDesc, Map initialProps) {
 return initialProps;
   }
+
+  /**
+   * Return some basic statistics (numRows, numFiles, totalSize) calculated by 
the underlying storage handler
+   * implementation.
+   * @param tableDesc a valid table description, used to load the table
+   * @return map of basic statistics, can be null
+   */
+  default Map getBasicStatistics(TableDesc tableDesc) {
+return null;
+  }
+
+  /**
+   * Check if the storage handler can provide basic statistics.
+   * @return true if the storage handler can supply the basic statistics
+   */
+  default boolean canProvideBasicStatistics() {

Review comment:
   Ok.. I see why it is separated...






Issue Time Tracking
---

Worklog Id: (was: 579834)
Time Spent: 5h 10m  (was: 5h)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=579827&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579827
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 09/Apr/21 07:35
Start Date: 09/Apr/21 07:35
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r610411498



##
File path: 
ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveStorageHandler.java
##
@@ -197,4 +197,22 @@ default boolean addDynamicSplitPruningEdge(ExprNodeDesc 
syntheticFilterPredicate
   default Map getOperatorDescProperties(OperatorDesc 
operatorDesc, Map initialProps) {
 return initialProps;
   }
+
+  /**
+   * Return some basic statistics (numRows, numFiles, totalSize) calculated by 
the underlying storage handler
+   * implementation.
+   * @param tableDesc a valid table description, used to load the table
+   * @return map of basic statistics, can be null
+   */
+  default Map getBasicStatistics(TableDesc tableDesc) {
+return null;
+  }
+
+  /**
+   * Check if the storage handler can provide basic statistics.
+   * @return true if the storage handler can supply the basic statistics
+   */
+  default boolean canProvideBasicStatistics() {

Review comment:
   Do we need both methods?
   Wouldn't it be better to handle `null` from `getBasicStatistics()` as 
`!canProvideBasicStatistics()`?
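
The trade-off behind this question can be sketched in plain Java. The thread does not spell out why the two methods were kept separate, so the reasoning below is an assumption: a null check has to run the full statistics call just to answer a yes/no question, while a dedicated method does not. The interface is a hypothetical, argument-free simplification of the two default methods in the diff, and `CountingHandler` is invented for the demonstration.

```java
import java.util.Collections;
import java.util.Map;

public class CapabilityCheckSketch {

  // Hypothetical, simplified version of the two default methods under review.
  public interface Handler {
    default Map<String, String> getBasicStatistics() {
      return null;
    }

    default boolean canProvideBasicStatistics() {
      return false;
    }
  }

  // A handler that counts how often its (potentially expensive) stats call runs.
  public static final class CountingHandler implements Handler {
    public int statsCalls = 0;

    @Override
    public Map<String, String> getBasicStatistics() {
      statsCalls++;
      return Collections.singletonMap("numRows", "0");
    }

    @Override
    public boolean canProvideBasicStatistics() {
      return true;
    }
  }

  // Option 1: treat null as "cannot provide". Needs a full stats call to decide.
  public static boolean canProvideViaNull(Handler handler) {
    return handler.getBasicStatistics() != null;
  }

  // Option 2: ask the dedicated method. No stats are computed.
  public static boolean canProvideViaFlag(Handler handler) {
    return handler.canProvideBasicStatistics();
  }

  public static void main(String[] args) {
    CountingHandler handler = new CountingHandler();
    System.out.println(canProvideViaFlag(handler) + ", stats calls: " + handler.statsCalls);
    System.out.println(canProvideViaNull(handler) + ", stats calls: " + handler.statsCalls);
  }
}
```

If loading the table is costly, as it may be when a catalog lookup is involved, the second option keeps the capability check cheap.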






Issue Time Tracking
---

Worklog Id: (was: 579827)
Time Spent: 5h  (was: 4h 50m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=579825&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-579825
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 09/Apr/21 07:33
Start Date: 09/Apr/21 07:33
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r610410409



##
File path: 
iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##
@@ -92,6 +97,11 @@
   Types.TimestampType.withoutZone(), Types.StringType.get(), 
Types.BinaryType.get(),
   Types.DecimalType.of(3, 1), Types.UUIDType.get(), 
Types.FixedType.ofLength(5),
   Types.TimeType.get());
+  private static final Map STATS_MAPPING = ImmutableMap.of(

Review comment:
   nit: maybe newline






Issue Time Tracking
---

Worklog Id: (was: 579825)
Time Spent: 4h 50m  (was: 4h 40m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577490&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577490
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 06/Apr/21 11:21
Start Date: 06/Apr/21 11:21
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607762003



##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##
@@ -90,12 +91,27 @@ public BasicStatsNoJobTask(HiveConf conf, BasicStatsNoJobWork work) {
 console = new LogHelper(LOG);
   }
 
-  public static boolean canUseFooterScan(
+  public static boolean canUseStats(
   Table table, Class inputFormat) {
+  return (OrcInputFormat.class.isAssignableFrom(inputFormat) && !AcidUtils.isFullAcidTable(table))
+  || MapredParquetInputFormat.class.isAssignableFrom(inputFormat)
+  || useBasicStatsFromStorageHandler(table);
+  }
+
+  public static boolean canUseColumnStats(Table table, Class inputFormat) {
 return (OrcInputFormat.class.isAssignableFrom(inputFormat) && !AcidUtils.isFullAcidTable(table))
 || MapredParquetInputFormat.class.isAssignableFrom(inputFormat);
   }
 
+  private static boolean useBasicStatsFromStorageHandler(Table table) {
+if (table.isNonNative()) {
+  TableDesc tableDesc = Utilities.getTableDesc(table);
+  return table.getStorageHandler().getBasicStatistics(tableDesc) != null;

Review comment:
   Introduced a new method on the interface to check whether the storage 
handler can provide stats. 
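The idea behind this change can be sketched roughly as follows. This is only an illustration under stated assumptions: `StatsAwareHandler`, `CountingHandler`, `StatsPlanner` and the method name `canProvideBasicStatistics` are hypothetical stand-ins, not the actual Hive interface, and the table is reduced to a plain name. The point is that the capability check returns a boolean without computing any statistics.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a storage handler interface (not the real Hive API).
interface StatsAwareHandler {

  /** Cheap capability check; implementations should not compute statistics here. */
  default boolean canProvideBasicStatistics() {
    return false;
  }

  /** Potentially expensive call that returns the basic statistics of a table. */
  default Map<String, String> getBasicStatistics(String tableName) {
    return Collections.emptyMap();
  }
}

// Example handler that records how often the expensive call is made.
final class CountingHandler implements StatsAwareHandler {
  int statsCalls = 0;

  @Override
  public boolean canProvideBasicStatistics() {
    return true;
  }

  @Override
  public Map<String, String> getBasicStatistics(String tableName) {
    statsCalls++;
    Map<String, String> stats = new HashMap<>();
    stats.put("numFiles", "3");
    stats.put("numRows", "42");
    return stats;
  }
}

final class StatsPlanner {
  private StatsPlanner() {
  }

  /** Decides whether handler-provided stats can be used, without computing them. */
  static boolean useBasicStatsFromStorageHandler(boolean nonNative, StatsAwareHandler handler) {
    return nonNative && handler.canProvideBasicStatistics();
  }
}
```

With this shape, the planning step only asks the handler whether it can provide stats, and the expensive `getBasicStatistics` call happens once, when the stats are actually needed.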






Issue Time Tracking
---

Worklog Id: (was: 577490)
Time Spent: 4h 40m  (was: 4.5h)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577488&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577488
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 06/Apr/21 11:20
Start Date: 06/Apr/21 11:20
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607761311



##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##
@@ -119,16 +135,92 @@ public String getName() {
 return "STATS-NO-JOB";
   }
 
-  static class StatItem {
-Partish partish;
-Map params;
-Object result;
+  abstract static class StatCollector implements Runnable {
+
+protected Partish partish;
+protected Object result;
+protected LogHelper console;
+
+public static Function SIMPLE_NAME_FUNCTION =
+sc -> String.format("%s#%s", sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+public static Function EXTRACT_RESULT_FUNCTION = input -> (Partition) input.result();

Review comment:
   This was a legacy code snippet. Fixed it.






Issue Time Tracking
---

Worklog Id: (was: 577488)
Time Spent: 4h 20m  (was: 4h 10m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577489&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577489
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 06/Apr/21 11:20
Start Date: 06/Apr/21 11:20
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607761504



##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##
@@ -119,16 +135,92 @@ public String getName() {
 return "STATS-NO-JOB";
   }
 
-  static class StatItem {
-Partish partish;
-Map params;
-Object result;
+  abstract static class StatCollector implements Runnable {
+
+protected Partish partish;
+protected Object result;
+protected LogHelper console;
+
+public static Function SIMPLE_NAME_FUNCTION =
+sc -> String.format("%s#%s", sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+public static Function EXTRACT_RESULT_FUNCTION = input -> (Partition) input.result();
+
+abstract Partish partish();
+abstract boolean isValid();
+abstract Object result();
+abstract void init(HiveConf conf, LogHelper console) throws IOException;
+
+protected String toString(Map parameters) {
+  StringBuilder builder = new StringBuilder();

Review comment:
   Again, legacy code :), but I changed it. 






Issue Time Tracking
---

Worklog Id: (was: 577489)
Time Spent: 4.5h  (was: 4h 20m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577484&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577484
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 06/Apr/21 11:16
Start Date: 06/Apr/21 11:16
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607759142



##
File path: iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##
@@ -826,4 +866,31 @@ private StringBuilder buildComplexTypeInnerQuery(Object field, Type type) {
 }
 return query;
   }
+
+  private void validateBasicStats(Table table, String tableName) {
+List describeResult = shell.executeStatement("DESCRIBE EXTENDED " + tableName);
+Optional tableInfo =

Review comment:
   Thanks for letting me know. I changed the test. 






Issue Time Tracking
---

Worklog Id: (was: 577484)
Time Spent: 4h  (was: 3h 50m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577485&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577485
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 06/Apr/21 11:16
Start Date: 06/Apr/21 11:16
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607759343



##
File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveStorageHandler.java
##
@@ -197,4 +197,8 @@ default boolean addDynamicSplitPruningEdge(ExprNodeDesc syntheticFilterPredicate
   default Map getOperatorDescProperties(OperatorDesc operatorDesc, Map initialProps) {
 return initialProps;
   }
+
+  default Map getBasicStatistics(TableDesc tableDesc) {

Review comment:
   Right, fixed it.






Issue Time Tracking
---

Worklog Id: (was: 577485)
Time Spent: 4h 10m  (was: 4h)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577480&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577480
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 06/Apr/21 11:15
Start Date: 06/Apr/21 11:15
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607758461



##
File path: iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##
@@ -826,4 +866,31 @@ private StringBuilder buildComplexTypeInnerQuery(Object field, Type type) {
 }
 return query;
   }
+
+  private void validateBasicStats(Table table, String tableName) {

Review comment:
   I would rather keep the separate tableName param. table.name() returns 
it in `hive.default.customers` format.
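To illustrate the difference between the two name forms: the three-part name carries a leading catalog segment, so it cannot be dropped into a `DESCRIBE EXTENDED` statement as-is. The helper below is a hypothetical sketch (`TableNames.withoutCatalog` is not part of the PR) of the kind of conversion that would be needed if the separate `tableName` parameter were removed.

```java
// Hypothetical helper, shown only to illustrate the name formats discussed above.
final class TableNames {
  private TableNames() {
  }

  /**
   * Drops the leading catalog segment from a three-part name such as
   * "hive.default.customers". Names with any other number of segments
   * are returned unchanged.
   */
  static String withoutCatalog(String fullName) {
    String[] parts = fullName.split("\\.");
    return parts.length == 3 ? parts[1] + "." + parts[2] : fullName;
  }
}
```

Keeping the explicit `tableName` parameter avoids this extra parsing step in the test.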






Issue Time Tracking
---

Worklog Id: (was: 577480)
Time Spent: 3h 50m  (was: 3h 40m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577475&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577475
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 06/Apr/21 11:08
Start Date: 06/Apr/21 11:08
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607754506



##
File path: iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##
@@ -92,6 +99,11 @@
   Types.TimestampType.withoutZone(), Types.StringType.get(), Types.BinaryType.get(),
   Types.DecimalType.of(3, 1), Types.UUIDType.get(), Types.FixedType.ofLength(5),
   Types.TimeType.get());
+  private static final Map STATS_MAPPING = ImmutableMap.of(
+  StatsSetupConst.NUM_FILES, SnapshotSummary.TOTAL_DATA_FILES_PROP,
+  StatsSetupConst.ROW_COUNT, SnapshotSummary.TOTAL_RECORDS_PROP
+  // TODO: add ROW_COUNT -> TOTAL_SIZE mapping after iceberg 0.12 is released

Review comment:
   Fixed it.






Issue Time Tracking
---

Worklog Id: (was: 577475)
Time Spent: 3h 40m  (was: 3.5h)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577424&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577424
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 06/Apr/21 09:06
Start Date: 06/Apr/21 09:06
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607675365



##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##
@@ -119,16 +135,92 @@ public String getName() {
 return "STATS-NO-JOB";
   }
 
-  static class StatItem {
-Partish partish;
-Map params;
-Object result;
+  abstract static class StatCollector implements Runnable {
+
+protected Partish partish;
+protected Object result;
+protected LogHelper console;
+
+public static Function SIMPLE_NAME_FUNCTION =
+sc -> String.format("%s#%s", sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+public static Function EXTRACT_RESULT_FUNCTION = input -> (Partition) input.result();
+
+abstract Partish partish();
+abstract boolean isValid();
+abstract Object result();
+abstract void init(HiveConf conf, LogHelper console) throws IOException;
+
+protected String toString(Map parameters) {
+  StringBuilder builder = new StringBuilder();

Review comment:
   nit: no strong opinion here, but I usually find it more readable to use 
streams and then concatenate the results using `Collectors.joining(",");`
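A minimal sketch of the stream-based variant suggested here, assuming the method renders each entry as `key=value` (the body of the original `toString(Map)` is not shown in the diff, so that format is an assumption). `ParamFormatter` and `format` are hypothetical names; the entries are sorted by key only to make the output deterministic.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper illustrating the Collectors.joining approach.
final class ParamFormatter {
  private ParamFormatter() {
  }

  /** Renders the parameters as comma-separated key=value pairs, ordered by key. */
  static String format(Map<String, String> parameters) {
    return new TreeMap<>(parameters).entrySet().stream()
        .map(entry -> entry.getKey() + "=" + entry.getValue())
        .collect(Collectors.joining(","));
  }
}
```

Compared with a `StringBuilder` loop, `Collectors.joining` handles the separator between elements itself, so there is no trailing comma to trim.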






Issue Time Tracking
---

Worklog Id: (was: 577424)
Time Spent: 3.5h  (was: 3h 20m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577422&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577422
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 06/Apr/21 09:03
Start Date: 06/Apr/21 09:03
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607672874



##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##
@@ -119,16 +135,92 @@ public String getName() {
 return "STATS-NO-JOB";
   }
 
-  static class StatItem {
-Partish partish;
-Map params;
-Object result;
+  abstract static class StatCollector implements Runnable {
+
+protected Partish partish;
+protected Object result;
+protected LogHelper console;
+
+public static Function SIMPLE_NAME_FUNCTION =
+sc -> String.format("%s#%s", sc.partish().getTable().getCompleteName(), sc.partish().getPartishType());
+
+public static Function EXTRACT_RESULT_FUNCTION = input -> (Partition) input.result();

Review comment:
   nit: since we called the StatCollector `sc` in the above lambda, can we 
rename `input` to `sc` here as well?






Issue Time Tracking
---

Worklog Id: (was: 577422)
Time Spent: 3h 20m  (was: 3h 10m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577421&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577421
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 06/Apr/21 09:02
Start Date: 06/Apr/21 09:02
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607671652



##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsNoJobTask.java
##
@@ -90,12 +91,27 @@ public BasicStatsNoJobTask(HiveConf conf, BasicStatsNoJobWork work) {
 console = new LogHelper(LOG);
   }
 
-  public static boolean canUseFooterScan(
+  public static boolean canUseStats(
   Table table, Class inputFormat) {
+  return (OrcInputFormat.class.isAssignableFrom(inputFormat) && !AcidUtils.isFullAcidTable(table))
+  || MapredParquetInputFormat.class.isAssignableFrom(inputFormat)
+  || useBasicStatsFromStorageHandler(table);
+  }
+
+  public static boolean canUseColumnStats(Table table, Class inputFormat) {
 return (OrcInputFormat.class.isAssignableFrom(inputFormat) && !AcidUtils.isFullAcidTable(table))
 || MapredParquetInputFormat.class.isAssignableFrom(inputFormat);
   }
 
+  private static boolean useBasicStatsFromStorageHandler(Table table) {
+if (table.isNonNative()) {
+  TableDesc tableDesc = Utilities.getTableDesc(table);
+  return table.getStorageHandler().getBasicStatistics(tableDesc) != null;

Review comment:
   Aren't we calculating all the stats here by calling 
`getBasicStatistics`, just to then discard the results?






Issue Time Tracking
---

Worklog Id: (was: 577421)
Time Spent: 3h 10m  (was: 3h)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577417&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577417
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 06/Apr/21 08:57
Start Date: 06/Apr/21 08:57
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607667423



##
File path: ql/src/java/org/apache/hadoop/hive/ql/metadata/HiveStorageHandler.java
##
@@ -197,4 +197,8 @@ default boolean addDynamicSplitPruningEdge(ExprNodeDesc syntheticFilterPredicate
   default Map getOperatorDescProperties(OperatorDesc operatorDesc, Map initialProps) {
 return initialProps;
   }
+
+  default Map getBasicStatistics(TableDesc tableDesc) {

Review comment:
   Since it's a new interface method, can you add some javadoc please?






Issue Time Tracking
---

Worklog Id: (was: 577417)
Time Spent: 3h  (was: 2h 50m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577416&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577416
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 06/Apr/21 08:55
Start Date: 06/Apr/21 08:55
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607666001



##
File path: iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##
@@ -826,4 +866,31 @@ private StringBuilder buildComplexTypeInnerQuery(Object field, Type type) {
 }
 return query;
   }
+
+  private void validateBasicStats(Table table, String tableName) {
+List describeResult = shell.executeStatement("DESCRIBE EXTENDED " + tableName);
+Optional tableInfo =

Review comment:
   Instead of using a `describe` command, wouldn't it be cleaner to load 
the HMS table and check its HMS params? Then we wouldn't need all the parsing. 
It's done in a similar fashion here: 
https://github.com/apache/iceberg/blob/master/mr/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerNoScan.java#L585
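As a rough sketch of what such a parameter-based check could look like: once the HMS table parameters and the Iceberg snapshot summary are both available as plain maps, the comparison needs no text parsing. `BasicStatsCheck` and `mismatches` are hypothetical names, and how the two maps are obtained from the metastore client and the Iceberg table is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical test helper; it works on plain maps so it stays independent of Hive and Iceberg APIs.
final class BasicStatsCheck {
  private BasicStatsCheck() {
  }

  /**
   * Compares table parameters against a snapshot summary.
   * The mapping goes from a Hive stat key to the corresponding summary key.
   * Returns one message per Hive stat key whose value does not match.
   */
  static Map<String, String> mismatches(Map<String, String> tableParams,
                                        Map<String, String> snapshotSummary,
                                        Map<String, String> mapping) {
    Map<String, String> result = new LinkedHashMap<>();
    for (Map.Entry<String, String> entry : mapping.entrySet()) {
      String expected = snapshotSummary.get(entry.getValue());
      String actual = tableParams.get(entry.getKey());
      if (!Objects.equals(expected, actual)) {
        result.put(entry.getKey(), "expected " + expected + " but was " + actual);
      }
    }
    return result;
  }
}
```

A test would then assert that the returned map is empty, with the mapping playing the role of `STATS_MAPPING` from the test class.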






Issue Time Tracking
---

Worklog Id: (was: 577416)
Time Spent: 2h 50m  (was: 2h 40m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577413=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577413
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 06/Apr/21 08:49
Start Date: 06/Apr/21 08:49
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607660940



##
File path: 
iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##
@@ -826,4 +866,31 @@ private StringBuilder buildComplexTypeInnerQuery(Object field, Type type) {
 }
 return query;
   }
+
+  private void validateBasicStats(Table table, String tableName) {

Review comment:
   Do we need the tableName param here? Can we use table.name()?






Issue Time Tracking
---

Worklog Id: (was: 577413)
Time Spent: 2h 40m  (was: 2.5h)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577411=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577411
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 06/Apr/21 08:47
Start Date: 06/Apr/21 08:47
Worklog Time Spent: 10m 
  Work Description: marton-bod commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r607658821



##
File path: 
iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##
@@ -92,6 +99,11 @@
   Types.TimestampType.withoutZone(), Types.StringType.get(), Types.BinaryType.get(),
   Types.DecimalType.of(3, 1), Types.UUIDType.get(), Types.FixedType.ofLength(5),
   Types.TimeType.get());
+  private static final Map STATS_MAPPING = ImmutableMap.of(
+  StatsSetupConst.NUM_FILES, SnapshotSummary.TOTAL_DATA_FILES_PROP,
+  StatsSetupConst.ROW_COUNT, SnapshotSummary.TOTAL_RECORDS_PROP
+  // TODO: add ROW_COUNT -> TOTAL_SIZE mapping after iceberg 0.12 is released

Review comment:
   small typo: TOTAL_SIZE -> TOTAL_FILE_SIZE_PROP mapping






Issue Time Tracking
---

Worklog Id: (was: 577411)
Time Spent: 2.5h  (was: 2h 20m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-06 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=577410=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-577410
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 06/Apr/21 08:31
Start Date: 06/Apr/21 08:31
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on pull request #2111:
URL: https://github.com/apache/hive/pull/2111#issuecomment-813936795


   @marton-bod @szlta If you have time, could you please review this PR? Thanks




Issue Time Tracking
---

Worklog Id: (was: 577410)
Time Spent: 2h 20m  (was: 2h 10m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-04-02 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=576104=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-576104
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 02/Apr/21 13:54
Start Date: 02/Apr/21 13:54
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on pull request #2111:
URL: https://github.com/apache/hive/pull/2111#issuecomment-812540900


   @pvary @kgyrtkirk Could you please have a second look at this PR? Thanks




Issue Time Tracking
---

Worklog Id: (was: 576104)
Time Spent: 2h 10m  (was: 2h)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-03-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=574276=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-574276
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 30/Mar/21 17:32
Start Date: 30/Mar/21 17:32
Worklog Time Spent: 10m 
  Work Description: lcspinter opened a new pull request #2111:
URL: https://github.com/apache/hive/pull/2111


   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   




Issue Time Tracking
---

Worklog Id: (was: 574276)
Time Spent: 2h  (was: 1h 50m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-03-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=574275=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-574275
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 30/Mar/21 17:32
Start Date: 30/Mar/21 17:32
Worklog Time Spent: 10m 
  Work Description: lcspinter closed pull request #2111:
URL: https://github.com/apache/hive/pull/2111


   




Issue Time Tracking
---

Worklog Id: (was: 574275)
Time Spent: 1h 50m  (was: 1h 40m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-03-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=573581=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-573581
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 29/Mar/21 16:04
Start Date: 29/Mar/21 16:04
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r603420853



##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java
##
@@ -118,11 +118,15 @@ public String getName() {
 private boolean isMissingAcidState = false;
 private BasicStatsWork work;
 private boolean followedColStats1;
+private boolean isBasicStatProvided;
+private Map providedBasicStats;
 
-public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf conf, boolean followedColStats2) {
+public BasicStatsProcessor(Table table, Partish partish, BasicStatsWork work, boolean followedColStats2) {

Review comment:
   Per our discussion, I moved my changes to the `BasicStatsNoJobTask`. 






Issue Time Tracking
---

Worklog Id: (was: 573581)
Time Spent: 1h 40m  (was: 1.5h)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-03-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=573579=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-573579
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 29/Mar/21 16:03
Start Date: 29/Mar/21 16:03
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r603420091



##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java
##
@@ -118,11 +118,15 @@ public String getName() {
 private boolean isMissingAcidState = false;
 private BasicStatsWork work;
 private boolean followedColStats1;
+private boolean isBasicStatProvided;
+private Map providedBasicStats;
 
-public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf conf, boolean followedColStats2) {
+public BasicStatsProcessor(Table table, Partish partish, BasicStatsWork work, boolean followedColStats2) {

Review comment:
   Removed it.






Issue Time Tracking
---

Worklog Id: (was: 573579)
Time Spent: 1.5h  (was: 1h 20m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-03-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=573578=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-573578
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 29/Mar/21 16:02
Start Date: 29/Mar/21 16:02
Worklog Time Spent: 10m 
  Work Description: lcspinter commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r603419311



##
File path: 
iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##
@@ -153,6 +156,24 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
 return predicate;
   }
 
+  @Override
+  public Map getBasicStatistics(TableDesc tableDesc) {

Review comment:
   We need the TableDesc, since the properties which are required to load 
the iceberg table are stored there. 






Issue Time Tracking
---

Worklog Id: (was: 573578)
Time Spent: 1h 20m  (was: 1h 10m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-03-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=573368=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-573368
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 29/Mar/21 08:50
Start Date: 29/Mar/21 08:50
Worklog Time Spent: 10m 
  Work Description: kgyrtkirk commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r603114825



##
File path: ql/src/java/org/apache/hadoop/hive/ql/parse/TaskCompiler.java
##
@@ -433,6 +434,12 @@ private String extractTableFullName(StatsTask tsk) throws SemanticException {
   return TaskFactory.get(columnStatsWork);
 } else {
   BasicStatsWork statsWork = new BasicStatsWork(tableScan.getConf().getTableMetadata().getTableSpec());
+  for (MapWork mapWork :  (Collection) currentTask.getMapWork()) {
+if (mapWork.getAliasToPartnInfo() != null && mapWork.getAliasToPartnInfo().containsKey(table.getTableName())) {

Review comment:
   we have a full table object passed to the `StatsWork` constructor - 
what's wrong with that?

##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java
##
@@ -258,7 +268,7 @@ private int aggregateStats(Hive db) {
 Partish p;
 partishes.add(p = new Partish.PTable(table));
 
-BasicStatsProcessor basicStatsProcessor = new BasicStatsProcessor(p, work, conf, followedColStats);
+BasicStatsProcessor basicStatsProcessor = new BasicStatsProcessor(table, p, work, followedColStats);

Review comment:
   I don't think we need these table calls - you may simply use
`Partish#getTable` to get access to the table object later
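
To make the suggestion concrete, here is a minimal sketch of reaching the table through the partish rather than passing it separately. `TableSketch`, `PartishSketch` and `ProcessorSketch` are hypothetical, heavily simplified stand-ins; they only mirror the `getTable` and `buildFor` names mentioned in this review and are not the real Hive classes.

```java
// Hypothetical, simplified stand-in for a Hive table object.
final class TableSketch {
  private final String name;
  TableSketch(String name) { this.name = name; }
  String getName() { return name; }
}

// Hypothetical stand-in for Partish: an abstraction that always knows its table.
abstract class PartishSketch {
  abstract TableSketch getTable();

  static PartishSketch buildFor(TableSketch table) {
    return new PTable(table);
  }

  static final class PTable extends PartishSketch {
    private final TableSketch table;
    PTable(TableSketch table) { this.table = table; }
    @Override TableSketch getTable() { return table; }
  }
}

final class ProcessorSketch {
  private final PartishSketch partish;

  // No separate table parameter: the table is reachable through the partish.
  ProcessorSketch(PartishSketch partish) { this.partish = partish; }

  String tableName() { return partish.getTable().getName(); }
}
```

Keeping a single source for the table avoids the two arguments drifting apart.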

##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java
##
@@ -118,11 +118,15 @@ public String getName() {
 private boolean isMissingAcidState = false;
 private BasicStatsWork work;
 private boolean followedColStats1;
+private boolean isBasicStatProvided;
+private Map providedBasicStats;
 
-public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf conf, boolean followedColStats2) {
+public BasicStatsProcessor(Table table, Partish partish, BasicStatsWork work, boolean followedColStats2) {

Review comment:
   yes; you should use the partish: `Partish.buildFor(table)`
   
   adding a `Table` here will cause confusion...

##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java
##
@@ -118,11 +118,15 @@ public String getName() {
 private boolean isMissingAcidState = false;
 private BasicStatsWork work;
 private boolean followedColStats1;
+private boolean isBasicStatProvided;
+private Map providedBasicStats;
 
-public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf conf, boolean followedColStats2) {
+public BasicStatsProcessor(Table table, Partish partish, BasicStatsWork work, boolean followedColStats2) {

Review comment:
   you seem to have added totally independent conditional logic to this
class; wouldn't it be easier to simply introduce a new `Processor` class for
the purpose you are targeting?
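
One way to read this suggestion, as a sketch under assumptions rather than the change that was eventually merged: keep the scanning logic and the handler-provided logic in two separate processor classes behind a common interface. All names below are hypothetical, and the `numFiles` / `numRows` keys are illustrative literals.

```java
import java.util.Map;

// Hypothetical common contract for anything that yields basic stats.
interface StatsProcessorSketch {
  Map<String, String> collect();
}

// Existing behaviour: derive stats from what is found in the table directory.
final class DirectoryScanProcessor implements StatsProcessorSketch {
  private final int fileCount;
  DirectoryScanProcessor(int fileCount) { this.fileCount = fileCount; }
  @Override public Map<String, String> collect() {
    return Map.of("numFiles", String.valueOf(fileCount));
  }
}

// New behaviour, kept in its own class: pass through what the storage handler reported.
final class HandlerProvidedProcessor implements StatsProcessorSketch {
  private final Map<String, String> provided;
  HandlerProvidedProcessor(Map<String, String> provided) {
    this.provided = Map.copyOf(provided);
  }
  @Override public Map<String, String> collect() {
    return provided;
  }
}
```

The caller then picks one implementation up front, so neither class needs an `isBasicStatProvided`-style flag.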

##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java
##
@@ -118,11 +118,15 @@ public String getName() {
 private boolean isMissingAcidState = false;
 private BasicStatsWork work;
 private boolean followedColStats1;
+private boolean isBasicStatProvided;
+private Map providedBasicStats;
 
-public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf conf, boolean followedColStats2) {
+public BasicStatsProcessor(Table table, Partish partish, BasicStatsWork work, boolean followedColStats2) {
   this.partish = partish;
   this.work = work;
   followedColStats1 = followedColStats2;
+  providedBasicStats = table.getStorageHandler().getBasicStatistics(work.getTableDesc());

Review comment:
   I believe you don't need the `tableDesc` for this method -> you will not 
need to add it to `StatsWork` 

##
File path: 
iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##
@@ -153,6 +156,24 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
 return predicate;
   }
 
+  @Override
+  public Map getBasicStatistics(TableDesc tableDesc) {

Review comment:
   this method seems to be using a `TableDesc` to be able to identify the
underlying Iceberg table - wouldn't it be possible to do the same from a
`Table` object?





[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-03-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=573352=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-573352
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 29/Mar/21 07:48
Start Date: 29/Mar/21 07:48
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r603077680



##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java
##
@@ -118,11 +118,15 @@ public String getName() {
 private boolean isMissingAcidState = false;
 private BasicStatsWork work;
 private boolean followedColStats1;
+private boolean isBasicStatProvided;
+private Map providedBasicStats;
 
-public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf conf, boolean followedColStats2) {
+public BasicStatsProcessor(Table table, Partish partish, BasicStatsWork work, boolean followedColStats2) {
   this.partish = partish;
   this.work = work;
   followedColStats1 = followedColStats2;
+  providedBasicStats = table.getStorageHandler().getBasicStatistics(work.getTableDesc());

Review comment:
   Should we call this only if the table `isNonNative`?






Issue Time Tracking
---

Worklog Id: (was: 573352)
Time Spent: 1h  (was: 50m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-03-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=573351=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-573351
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 29/Mar/21 07:47
Start Date: 29/Mar/21 07:47
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r603077185



##
File path: ql/src/java/org/apache/hadoop/hive/ql/stats/BasicStatsTask.java
##
@@ -118,11 +118,15 @@ public String getName() {
 private boolean isMissingAcidState = false;
 private BasicStatsWork work;
 private boolean followedColStats1;
+private boolean isBasicStatProvided;
+private Map providedBasicStats;
 
-public BasicStatsProcessor(Partish partish, BasicStatsWork work, HiveConf conf, boolean followedColStats2) {
+public BasicStatsProcessor(Table table, Partish partish, BasicStatsWork work, boolean followedColStats2) {

Review comment:
   My feeling is that `Partish` should be some general object covering both 
`Table` and `Partition`. How hard/wasteful would it be to add the required data 
to the `Partish` object?






Issue Time Tracking
---

Worklog Id: (was: 573351)
Time Spent: 50m  (was: 40m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-03-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=573349&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-573349
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 29/Mar/21 07:44
Start Date: 29/Mar/21 07:44
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r603075162



##
File path: 
iceberg-handler/src/test/java/org/apache/iceberg/mr/hive/TestHiveIcebergStorageHandlerWithEngine.java
##
@@ -92,6 +98,9 @@
   Types.TimestampType.withoutZone(), Types.StringType.get(), Types.BinaryType.get(),
   Types.DecimalType.of(3, 1), Types.UUIDType.get(), Types.FixedType.ofLength(5),
   Types.TimeType.get());
+  private static final Map STATS_MAPPING = ImmutableMap.of(

Review comment:
   Maybe add a TODO here as well, to check `TOTAL_SIZE`?






Issue Time Tracking
---

Worklog Id: (was: 573349)
Time Spent: 40m  (was: 0.5h)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-03-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=573348&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-573348
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 29/Mar/21 07:42
Start Date: 29/Mar/21 07:42
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r603074290



##
File path: 
iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##
@@ -153,6 +156,24 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
     return predicate;
   }
 
+  @Override
+  public Map<String, String> getBasicStatistics(TableDesc tableDesc) {
+    Table table = Catalogs.loadTable(conf, tableDesc.getProperties());
+    Map<String, String> summary = table.currentSnapshot().summary();
+    Map<String, String> stats = new HashMap<>();
+    if (summary.containsKey(SnapshotSummary.TOTAL_DATA_FILES_PROP)) {
+      stats.put(StatsSetupConst.NUM_FILES, summary.get(SnapshotSummary.TOTAL_DATA_FILES_PROP));
+    }
+    if (summary.containsKey(SnapshotSummary.TOTAL_RECORDS_PROP)) {
+      stats.put(StatsSetupConst.ROW_COUNT, summary.get(SnapshotSummary.TOTAL_RECORDS_PROP));
+    }
+    // TODO: add TOTAL_SIZE when iceberg 0.12 is released

Review comment:
   With this TODO, do we need to comment out the code below?






Issue Time Tracking
---

Worklog Id: (was: 573348)
Time Spent: 0.5h  (was: 20m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-03-29 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=573347&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-573347
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 29/Mar/21 07:41
Start Date: 29/Mar/21 07:41
Worklog Time Spent: 10m 
  Work Description: pvary commented on a change in pull request #2111:
URL: https://github.com/apache/hive/pull/2111#discussion_r603073899



##
File path: 
iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java
##
@@ -153,6 +156,24 @@ public DecomposedPredicate decomposePredicate(JobConf jobConf, Deserializer dese
     return predicate;
   }
 
+  @Override
+  public Map<String, String> getBasicStatistics(TableDesc tableDesc) {
+    Table table = Catalogs.loadTable(conf, tableDesc.getProperties());
+    Map<String, String> summary = table.currentSnapshot().summary();

Review comment:
   Could we check that the summary is always non-null?
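
   A minimal sketch of what such a guard could look like, written against plain `java.util` types so it stands alone. It assumes the summary may be absent, for example if the table has no current snapshot yet, and returns an empty map in that case. The class name `SnapshotStatsSketch` and the method `basicStats` are hypothetical, and the string literals are my assumption about the values behind the `SnapshotSummary` and `StatsSetupConst` constants used in the diff above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Hive or Iceberg.
class SnapshotStatsSketch {
  // Assumed values of SnapshotSummary.TOTAL_DATA_FILES_PROP / TOTAL_RECORDS_PROP.
  static final String TOTAL_DATA_FILES = "total-data-files";
  static final String TOTAL_RECORDS = "total-records";
  // Assumed values of StatsSetupConst.NUM_FILES / ROW_COUNT.
  static final String NUM_FILES = "numFiles";
  static final String ROW_COUNT = "numRows";

  // Maps a snapshot summary to basic stats; a null summary yields no stats
  // instead of a NullPointerException.
  static Map<String, String> basicStats(Map<String, String> summary) {
    if (summary == null) {
      return Collections.emptyMap();
    }
    Map<String, String> stats = new HashMap<>();
    copyIfPresent(summary, TOTAL_DATA_FILES, stats, NUM_FILES);
    copyIfPresent(summary, TOTAL_RECORDS, stats, ROW_COUNT);
    return stats;
  }

  // Copies one entry under a new key, skipping keys that are missing.
  private static void copyIfPresent(Map<String, String> from, String fromKey,
                                    Map<String, String> to, String toKey) {
    String value = from.get(fromKey);
    if (value != null) {
      to.put(toKey, value);
    }
  }
}
```

   The caller would need the same care one level up: if `table.currentSnapshot()` can return null, it has to be checked before `summary()` is called, and null passed on to a helper like this one.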






Issue Time Tracking
---

Worklog Id: (was: 573347)
Time Spent: 20m  (was: 10m)



[jira] [Work logged] (HIVE-24928) In case of non-native tables use basic statistics from HiveStorageHandler

2021-03-24 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-24928?focusedWorklogId=571000&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-571000
 ]

ASF GitHub Bot logged work on HIVE-24928:
-

Author: ASF GitHub Bot
Created on: 24/Mar/21 08:48
Start Date: 24/Mar/21 08:48
Worklog Time Spent: 10m 
  Work Description: lcspinter opened a new pull request #2111:
URL: https://github.com/apache/hive/pull/2111


   …veStorageHandler
   
   
   
   ### What changes were proposed in this pull request?
   
   
   
   ### Why are the changes needed?
   
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   
   ### How was this patch tested?
   
   




Issue Time Tracking
---

Worklog Id: (was: 571000)
Remaining Estimate: 0h
Time Spent: 10m
