[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files

2020-02-17 Thread GitBox
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE 
TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380237113
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/metadata/MetadataAggregateHelper.java
 ##
 @@ -117,16 +120,16 @@ private void createAggregatorInternal() {
   }
 }
 
-for (SchemaPath excludedColumn : excludedColumns) {
-  if 
(excludedColumn.equals(SchemaPath.getSimplePath(columnNamesOptions.rowGroupStart()))
-  || 
excludedColumn.equals(SchemaPath.getSimplePath(columnNamesOptions.rowGroupLength(
 {
-LogicalExpression lastModifiedTime = new FunctionCall("any_value",
+for (SchemaPath nonSchemaColumn : context.metadataColumns()) {
 
 Review comment:
   Sorry, missed it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files

2020-02-17 Thread GitBox
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE 
TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380236720
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/metastore/ColumnNamesOptions.java
 ##
 @@ -40,6 +41,7 @@ public ColumnNamesOptions(OptionManager optionManager) {
 this.rowGroupStart = 
optionManager.getOption(ExecConstants.IMPLICIT_ROW_GROUP_START_COLUMN_LABEL).string_val;
 this.rowGroupLength = 
optionManager.getOption(ExecConstants.IMPLICIT_ROW_GROUP_LENGTH_COLUMN_LABEL).string_val;
 this.lastModifiedTime = 
optionManager.getOption(ExecConstants.IMPLICIT_LAST_MODIFIED_TIME_COLUMN_LABEL).string_val;
+this.projectMetadataColumn = 
optionManager.getOption(ExecConstants.IMPLICIT_PROJECT_METADATA_COLUMN_LABEL).string_val;
 
 Review comment:
   Good idea, thanks, done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files

2020-02-17 Thread GitBox
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE 
TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380128666
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/MetastoreAnalyzeTableHandler.java
 ##
 @@ -406,13 +406,13 @@ private DrillRel getTableAggRelNode(DrillRel 
convertedRelNode, boolean createNew
 SchemaPath lastModifiedTimeField =
 
SchemaPath.getSimplePath(config.getContext().getOptions().getString(ExecConstants.IMPLICIT_LAST_MODIFIED_TIME_COLUMN_LABEL));
 
-List excludedColumns = Arrays.asList(locationField, 
lastModifiedTimeField);
+List nonSchemaColumns = Arrays.asList(locationField, 
lastModifiedTimeField);
 
 Review comment:
   Thanks, renamed here and in other places.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files

2020-02-17 Thread GitBox
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE 
TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380124653
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java
 ##
 @@ -237,6 +238,18 @@ private IterOutcome internalNext() {
   logger.trace("currentReader.next return recordCount={}", recordCount);
   Preconditions.checkArgument(recordCount >= 0, "recordCount from 
RecordReader.next() should not be negative");
   boolean isNewSchema = mutator.isNewSchema();
+  // adds additional record for the case of making scan for obtaining 
metadata if required
+  if (implicitValues != null) {
+String projectMetadataColumn = 
context.getOptions().getOption(ExecConstants.IMPLICIT_PROJECT_METADATA_COLUMN_LABEL).string_val;
+if (recordCount > 0) {
+  // sets implicit value to false to signalize that some results were 
returned and there is no need for creating additional record
 
 Review comment:
   Thanks, updated the comment and added more details.
   
   Regarding the concept of the additional record, I will try to explain how 
Metastore collects the data in general cases, it may help to understand the 
reason for such a decision.
   
   Drill Metastore may collect metadata for every file or row group, so 
aggregation calls for every column with grouping by `fqn`, `rgi`, `dirX`... 
columns were added.
   This approach works fine for the case of non-empty files and row groups, but 
when an empty file is present, no data is passed to the aggregation from the 
Scan, so Metastore was ignoring such files.
   To solve this problem, I have added this logic to return a single record for 
the case when no data was read with the correct values of implicit columns, and 
this additional implicit column helps to distinguish such records and collect 
info about rows count, schema, etc.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files

2020-02-17 Thread GitBox
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE 
TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380077493
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java
 ##
 @@ -511,6 +511,11 @@ private ExecConstants() {
   new OptionDescription("Available as of Drill 1.17. Sets the implicit 
column name for the lastModifiedTime column. " +
   "For internal usage when producing Metastore analyze."));
 
+  public static final String IMPLICIT_PROJECT_METADATA_COLUMN_LABEL = 
"drill.exec.storage.implicit.project_metadata.column.label";
+  public static final OptionValidator 
IMPLICIT_PROJECT_METADATA_COLUMN_LABEL_VALIDATOR = new 
StringValidator(IMPLICIT_PROJECT_METADATA_COLUMN_LABEL,
+  new OptionDescription("Available as of Drill 1.18. Sets the implicit 
column name for the $project_metadata$ column. " +
 
 Review comment:
   Good point about that. I specified version here and in other places to be 
consistent with other options descriptions.
   I think adding version in options descriptions was done to simplify updating 
docs for Drill Web site - there is no need to look up for the commit date and 
version of Drill, where it was added, just copy and paste it from Drill Web-UI, 
or from this class.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files

2020-02-17 Thread GitBox
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE 
TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380128453
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/metadata/MetadataAggregateHelper.java
 ##
 @@ -313,8 +317,19 @@ private void addColumnAggregateCalls(FieldReference 
fieldRef, String fieldName)
   if (interestingColumns == null || interestingColumns.contains(fieldRef)) 
{
 // collect statistics for all or only interesting columns if they are 
specified
 
AnalyzeColumnUtils.COLUMN_STATISTICS_FUNCTIONS.forEach((statisticsKind, 
sqlKind) -> {
+  // constructs "case when is not null projectMetadataColumn then 
column1 else null end" call
+  // to avoid using default values for required columns when data for 
empty result is obtained
 
 Review comment:
   Thanks for pointing this. Unfortunately, we can't use a plain SQL approach 
to collect metadata, since we do not have information about the schema, so we 
create aggregate calls dynamically. But Drill uses inbuilt aggregate functions 
for collecting summary statistics (`MIN`, `MAX`, ...).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files

2020-02-17 Thread GitBox
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE 
TABLE ... REFRESH METADATA does not work for empty Parquet files
URL: https://github.com/apache/drill/pull/1985#discussion_r380081293
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/metastore/analyze/MetadataAggregateContext.java
 ##
 @@ -63,8 +67,8 @@ public boolean createNewAggregations() {
   }
 
   @JsonProperty
-  public List excludedColumns() {
-return excludedColumns;
+  public List nonSchemaColumns() {
 
 Review comment:
   Thanks, `metadataColumns` name looks better, renamed this field.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services