[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files URL: https://github.com/apache/drill/pull/1985#discussion_r380237113 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/metadata/MetadataAggregateHelper.java ## @@ -117,16 +120,16 @@ private void createAggregatorInternal() { } } -for (SchemaPath excludedColumn : excludedColumns) { - if (excludedColumn.equals(SchemaPath.getSimplePath(columnNamesOptions.rowGroupStart())) - || excludedColumn.equals(SchemaPath.getSimplePath(columnNamesOptions.rowGroupLength( { -LogicalExpression lastModifiedTime = new FunctionCall("any_value", +for (SchemaPath nonSchemaColumn : context.metadataColumns()) { Review comment: Sorry, missed it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files URL: https://github.com/apache/drill/pull/1985#discussion_r380236720 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/metastore/ColumnNamesOptions.java ## @@ -40,6 +41,7 @@ public ColumnNamesOptions(OptionManager optionManager) { this.rowGroupStart = optionManager.getOption(ExecConstants.IMPLICIT_ROW_GROUP_START_COLUMN_LABEL).string_val; this.rowGroupLength = optionManager.getOption(ExecConstants.IMPLICIT_ROW_GROUP_LENGTH_COLUMN_LABEL).string_val; this.lastModifiedTime = optionManager.getOption(ExecConstants.IMPLICIT_LAST_MODIFIED_TIME_COLUMN_LABEL).string_val; +this.projectMetadataColumn = optionManager.getOption(ExecConstants.IMPLICIT_PROJECT_METADATA_COLUMN_LABEL).string_val; Review comment: Good idea, thanks, done. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files URL: https://github.com/apache/drill/pull/1985#discussion_r380128666 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/MetastoreAnalyzeTableHandler.java ## @@ -406,13 +406,13 @@ private DrillRel getTableAggRelNode(DrillRel convertedRelNode, boolean createNew SchemaPath lastModifiedTimeField = SchemaPath.getSimplePath(config.getContext().getOptions().getString(ExecConstants.IMPLICIT_LAST_MODIFIED_TIME_COLUMN_LABEL)); -List excludedColumns = Arrays.asList(locationField, lastModifiedTimeField); +List nonSchemaColumns = Arrays.asList(locationField, lastModifiedTimeField); Review comment: Thanks, renamed here and in other places. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files URL: https://github.com/apache/drill/pull/1985#discussion_r380124653 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java ## @@ -237,6 +238,18 @@ private IterOutcome internalNext() { logger.trace("currentReader.next return recordCount={}", recordCount); Preconditions.checkArgument(recordCount >= 0, "recordCount from RecordReader.next() should not be negative"); boolean isNewSchema = mutator.isNewSchema(); + // adds additional record for the case of making scan for obtaining metadata if required + if (implicitValues != null) { +String projectMetadataColumn = context.getOptions().getOption(ExecConstants.IMPLICIT_PROJECT_METADATA_COLUMN_LABEL).string_val; +if (recordCount > 0) { + // sets implicit value to false to signalize that some results were returned and there is no need for creating additional record Review comment: Thanks, updated the comment and added more details. Regarding the concept of the additional record, I will try to explain how Metastore collects the data in general cases, it may help to understand the reason for such a decision. Drill Metastore may collect metadata for every file or row group, so aggregation calls for every column with grouping by `fqn`, `rgi`, `dirX`... columns were added. This approach works fine for the case of non-empty files and row groups, but when an empty file is present, no data is passed to the aggregation from the Scan, so Metastore was ignoring such files. To solve this problem, I have added this logic to return a single record for the case when no data was read with the correct values of implicit columns, and this additional implicit column helps to distinguish such records and collect info about rows count, schema, etc. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files URL: https://github.com/apache/drill/pull/1985#discussion_r380077493 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ## @@ -511,6 +511,11 @@ private ExecConstants() { new OptionDescription("Available as of Drill 1.17. Sets the implicit column name for the lastModifiedTime column. " + "For internal usage when producing Metastore analyze.")); + public static final String IMPLICIT_PROJECT_METADATA_COLUMN_LABEL = "drill.exec.storage.implicit.project_metadata.column.label"; + public static final OptionValidator IMPLICIT_PROJECT_METADATA_COLUMN_LABEL_VALIDATOR = new StringValidator(IMPLICIT_PROJECT_METADATA_COLUMN_LABEL, + new OptionDescription("Available as of Drill 1.18. Sets the implicit column name for the $project_metadata$ column. " + Review comment: Good point about that. I specified version here and in other places to be consistent with other options descriptions. I think adding version in options descriptions was done to simplify updating docs for Drill Web site - there is no need to look up for the commit date and version of Drill, where it was added, just copy and paste it from Drill Web-UI, or from this class. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files URL: https://github.com/apache/drill/pull/1985#discussion_r380128453 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/metadata/MetadataAggregateHelper.java ## @@ -313,8 +317,19 @@ private void addColumnAggregateCalls(FieldReference fieldRef, String fieldName) if (interestingColumns == null || interestingColumns.contains(fieldRef)) { // collect statistics for all or only interesting columns if they are specified AnalyzeColumnUtils.COLUMN_STATISTICS_FUNCTIONS.forEach((statisticsKind, sqlKind) -> { + // constructs "case when is not null projectMetadataColumn then column1 else null end" call + // to avoid using default values for required columns when data for empty result is obtained Review comment: Thanks for pointing this. Unfortunately, we can't use a plain SQL approach to collect metadata, since we do not have information about the schema, so we create aggregate calls dynamically. But Drill uses inbuilt aggregate functions for collecting summary statistics (`MIN`, `MAX`, ...). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [drill] vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files
vvysotskyi commented on a change in pull request #1985: DRILL-7565: ANALYZE TABLE ... REFRESH METADATA does not work for empty Parquet files URL: https://github.com/apache/drill/pull/1985#discussion_r380081293 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/metastore/analyze/MetadataAggregateContext.java ## @@ -63,8 +67,8 @@ public boolean createNewAggregations() { } @JsonProperty - public List excludedColumns() { -return excludedColumns; + public List nonSchemaColumns() { Review comment: Thanks, `metadataColumns` name looks better, renamed this field. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services