[GitHub] [incubator-iceberg] waterlx edited a comment on issue #279: Transforming timestamp to date should produce date
waterlx edited a comment on issue #279: Transforming timestamp to date should produce date URL: https://github.com/apache/incubator-iceberg/issues/279#issuecomment-521086258 @rdblue, regarding > @manishmalhotrawork, only the `day` transform should return a date. Years, months, and hours should continue returning ordinals like before. so you mean that we are supposed to keep `org.apache.iceberg.transforms.Timestamps` as it is (as `Transform`) and keep `org.apache.iceberg.transforms.Dates` as it is (as `Transform`), so as not to change the return of `Transforms.year(), month() and hour()`? Do you propose to only change the output of Transforms.day() to `Transform`? Do I get it right? @manishmalhotrawork yes, I am working on this issue. But please feel free to go ahead to share your thoughts or upload patches, I am new to iceberg This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] waterlx commented on issue #279: Transforming timestamp to date should produce date
waterlx commented on issue #279: Transforming timestamp to date should produce date URL: https://github.com/apache/incubator-iceberg/issues/279#issuecomment-521086258 @rdblue > @manishmalhotrawork, only the `day` transform should return a date. Years, months, and hours should continue returning ordinals like before. so you mean that we are supposed to keep `org.apache.iceberg.transforms.Timestamps` as it is (as `Transform`) and keep `org.apache.iceberg.transforms.Dates` as it is (as `Transform`, but only change the output of Transforms.day() to `Transform`? Do I get it right? @manishmalhotrawork yes, I am working on this issue. But please feel free to go ahead to share your thoughts or upload patches, I am new to iceberg This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] fbocse commented on a change in pull request #244: [WIP] Add new Update schema API to support adding of all new fields detected based on a companion schema
fbocse commented on a change in pull request #244: [WIP] Add new Update schema API to support adding of all new fields detected based on a companion schema URL: https://github.com/apache/incubator-iceberg/pull/244#discussion_r313681588 ## File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java ## @@ -186,6 +188,88 @@ public UpdateSchema updateColumnDoc(String name, String doc) { return this; } + @Override + public UpdateSchema equateNewFields(Schema schema) { +Preconditions.checkArgument(schema != null, "Cannot add new fields off of a null schema"); +List wrappedTypes = TypeUtil.CompanionSchemaVisitor.schema(this.schema, schema, +new CompanionSchemaNewFieldsCollector()); + +// With Iceberg we can only add to pre-existing fields so we need to "hydrate" all nested types. +wrappedTypes.forEach( +wrappedType -> { + Map.Entry entry = inflate(wrappedType); + // Based on being top-level or nested we use the corresponding API for adding new columns to the schema + Preconditions.checkArgument(entry != null, "Cannot inflate types of null predecessor"); + if (entry.getKey().isEmpty()) { +addColumn(entry.getValue().name(), entry.getValue().type(), entry.getValue().doc()); + } else { +addColumn(entry.getKey(), entry.getValue().name(), entry.getValue().type(), entry.getValue().doc()); + } +} +); + +return this; + } + + /** + * Given a nested type structure this will provide a fully hydrated {@link Types.NestedField} accompanied by + * predecessor name as dot separated string. + * TODO - any suggestions for modelling a tuple other than java.util.Map.Entry? + * + * @param wrappedType the nested type we want to inflate (will use method recursively for non-primitive structures) + * @return a tuple having a String representation of the type predecessor as key and the nested field as value + */ + private Map.Entry inflate(T wrappedType) { Review comment: Good point, when writing the code I did kind of ended up with this synthetic `WrappedType` construct but after putting together the code for inflating nested types I should just ditch using the `WrappedType` construct altogether. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] fbocse commented on a change in pull request #244: [WIP] Add new Update schema API to support adding of all new fields detected based on a companion schema
fbocse commented on a change in pull request #244: [WIP] Add new Update schema API to support adding of all new fields detected based on a companion schema URL: https://github.com/apache/incubator-iceberg/pull/244#discussion_r313681188 ## File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java ## @@ -186,6 +188,88 @@ public UpdateSchema updateColumnDoc(String name, String doc) { return this; } + @Override + public UpdateSchema equateNewFields(Schema schema) { +Preconditions.checkArgument(schema != null, "Cannot add new fields off of a null schema"); +List wrappedTypes = TypeUtil.CompanionSchemaVisitor.schema(this.schema, schema, +new CompanionSchemaNewFieldsCollector()); + +// With Iceberg we can only add to pre-existing fields so we need to "hydrate" all nested types. +wrappedTypes.forEach( +wrappedType -> { + Map.Entry entry = inflate(wrappedType); + // Based on being top-level or nested we use the corresponding API for adding new columns to the schema + Preconditions.checkArgument(entry != null, "Cannot inflate types of null predecessor"); + if (entry.getKey().isEmpty()) { +addColumn(entry.getValue().name(), entry.getValue().type(), entry.getValue().doc()); + } else { +addColumn(entry.getKey(), entry.getValue().name(), entry.getValue().type(), entry.getValue().doc()); + } +} +); + +return this; + } + + /** + * Given a nested type structure this will provide a fully hydrated {@link Types.NestedField} accompanied by + * predecessor name as dot separated string. + * TODO - any suggestions for modelling a tuple other than java.util.Map.Entry? + * + * @param wrappedType the nested type we want to inflate (will use method recursively for non-primitive structures) + * @return a tuple having a String representation of the type predecessor as key and the nested field as value Review comment: Yes, will rename to `parent`, in the case of a hierarchy `parent` sounds better. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] xabriel commented on a change in pull request #377: Add FindFiles helper API
xabriel commented on a change in pull request #377: Add FindFiles helper API URL: https://github.com/apache/incubator-iceberg/pull/377#discussion_r313667733 ## File path: core/src/main/java/org/apache/iceberg/FindFiles.java ## @@ -0,0 +1,195 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.iceberg; + +import com.google.common.base.Preconditions; +import java.time.Instant; +import java.time.LocalDateTime; +import java.time.ZoneId; +import java.time.format.DateTimeFormatter; +import java.util.Arrays; +import java.util.List; +import org.apache.iceberg.expressions.Expression; +import org.apache.iceberg.expressions.Expressions; +import org.apache.iceberg.io.CloseableIterable; + +public class FindFiles { + private static final DateTimeFormatter DATE_FORMAT = DateTimeFormatter.ofPattern("-MM-dd HH:mm:ss.SSS"); + + public static Builder in(Table table) { +return new Builder(table); + } + + public static class Builder { +private final Table table; +private final TableOperations ops; +private boolean caseSensitive = true; +private Long snapshotId = null; +private Expression rowFilter = Expressions.alwaysTrue(); +private Expression fileFilter = Expressions.alwaysTrue(); +private Expression partitionFilter = Expressions.alwaysTrue(); + +public Builder(Table table) { + this.table = table; + this.ops = ((HasTableOperations) table).operations(); +} + +public Builder caseInsensitive() { + this.caseSensitive = false; + return this; +} + +public Builder caseSensitive(boolean caseSensitive) { + this.caseSensitive = caseSensitive; + return this; +} + +/** + * Base results on the given snapshot. + * + * @param snapshotId a snapshot ID + * @return this for method chaining + */ +public Builder inSnapshot(long snapshotId) { + Preconditions.checkArgument(this.snapshotId == null, + "Cannot set snapshot multiple times, already set to id=%s", snapshotId); + Preconditions.checkArgument(table.snapshot(snapshotId) != null, + "Cannot find snapshot for id=%s", snapshotId); + this.snapshotId = snapshotId; + return this; +} + +/** + * Base results on files in the snapshot that was current as of a timestamp. + * + * @param timestampMillis a timestamp in milliseconds + * @return this for method chaining + */ +public Builder asOfTime(long timestampMillis) { + Preconditions.checkArgument(this.snapshotId == null, + "Cannot set snapshot multiple times, already set to id=%s", snapshotId); + + Long lastSnapshotId = null; + for (HistoryEntry logEntry : ops.current().snapshotLog()) { +if (logEntry.timestampMillis() <= timestampMillis) { + lastSnapshotId = logEntry.snapshotId(); +} Review comment: I assume `snapshotLog()` is in ascending order. Should we then `break` after the condition is not true anymore? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #357: Add in and not in predicates
jun-he commented on a change in pull request #357: Add in and not in predicates URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313653184 ## File path: api/src/main/java/org/apache/iceberg/expressions/Evaluator.java ## @@ -134,18 +134,18 @@ public Boolean or(Boolean leftResult, Boolean rightResult) { } @Override -public Boolean in(BoundReference ref, Literal lit) { - throw new UnsupportedOperationException("In is not supported yet"); +public Boolean in(BoundReference ref, LiteralSet literalSet) { + return literalSet.contains(ref.get(struct)); } @Override -public Boolean notIn(BoundReference ref, Literal lit) { - return !in(ref, lit); +public Boolean notIn(BoundReference ref, LiteralSet literalSet) { + return !in(ref, literalSet); } @Override public Boolean startsWith(BoundReference ref, Literal lit) { - return ((String) ref.get(struct)).startsWith((String) lit.value()); + return ((String) ref.get(struct)).startsWith(lit.value().toString()); Review comment: Here, I changed the construction of `StringLiteral` to be ``` StringLiteral(CharSequence value) { super(new CharSeqWrapper(value)); } ``` The other way is to only wrap `CharSequence` within LiteralSet. In that way, there is no need to change `startsWith`. Wrapping inside seems cleaner with less side effect and I will update the PR accordingly. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on issue #372: Allow read split behavior overrides
rdblue commented on issue #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372#issuecomment-521056199 Merged. Thanks @xabriel! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue merged pull request #372: Allow read split behavior overrides
rdblue merged pull request #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue opened a new pull request #378: Handle null values in transforms
rdblue opened a new pull request #378: Handle null values in transforms URL: https://github.com/apache/incubator-iceberg/pull/378 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] xabriel commented on a change in pull request #372: Allow read split behavior overrides
xabriel commented on a change in pull request #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313655207 ## File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java ## @@ -54,22 +56,24 @@ private final TableOperations ops; private final Table table; private final Long snapshotId; + private final Map options; private final Schema schema; private final Expression rowFilter; private final boolean caseSensitive; private final boolean colStats; private final Collection selectedColumns; protected BaseTableScan(TableOperations ops, Table table, Schema schema) { -this(ops, table, null, schema, Expressions.alwaysTrue(), true, false, null); +this(ops, table, null, null, schema, Expressions.alwaysTrue(), true, false, null); } - protected BaseTableScan(TableOperations ops, Table table, Long snapshotId, Schema schema, -Expression rowFilter, boolean caseSensitive, boolean colStats, + protected BaseTableScan(TableOperations ops, Table table, Long snapshotId, Map options, Review comment: Fixed to be added to the end. The reason it was in the middle before is that it used to be `splitOptions` and those were related to `snapshotId` as in that they modify the read behavior. Now they're generic so agreed it makes sense to keep at end. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] xabriel commented on a change in pull request #372: Allow read split behavior overrides
xabriel commented on a change in pull request #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313654409 ## File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java ## @@ -54,22 +56,24 @@ private final TableOperations ops; private final Table table; private final Long snapshotId; + private final Map options; private final Schema schema; private final Expression rowFilter; private final boolean caseSensitive; private final boolean colStats; private final Collection selectedColumns; protected BaseTableScan(TableOperations ops, Table table, Schema schema) { -this(ops, table, null, schema, Expressions.alwaysTrue(), true, false, null); +this(ops, table, null, null, schema, Expressions.alwaysTrue(), true, false, null); Review comment: Using an `ImmutableMap` all around now. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] xabriel commented on a change in pull request #372: Allow read split behavior overrides
xabriel commented on a change in pull request #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313654308 ## File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java ## @@ -123,31 +128,46 @@ public TableScan asOfTime(long timestampMillis) { return useSnapshot(lastSnapshotId); } + @Override + public TableScan option(String property, String value) { +Map localOptions = Maps.newHashMap(); Review comment: Agreed. Fixed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #357: Add in and not in predicates
jun-he commented on a change in pull request #357: Add in and not in predicates URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313653184 ## File path: api/src/main/java/org/apache/iceberg/expressions/Evaluator.java ## @@ -134,18 +134,18 @@ public Boolean or(Boolean leftResult, Boolean rightResult) { } @Override -public Boolean in(BoundReference ref, Literal lit) { - throw new UnsupportedOperationException("In is not supported yet"); +public Boolean in(BoundReference ref, LiteralSet literalSet) { + return literalSet.contains(ref.get(struct)); } @Override -public Boolean notIn(BoundReference ref, Literal lit) { - return !in(ref, lit); +public Boolean notIn(BoundReference ref, LiteralSet literalSet) { + return !in(ref, literalSet); } @Override public Boolean startsWith(BoundReference ref, Literal lit) { - return ((String) ref.get(struct)).startsWith((String) lit.value()); + return ((String) ref.get(struct)).startsWith(lit.value().toString()); Review comment: Here, I changed the construction of `StringLiteral` to be ``` StringLiteral(CharSequence value) { super(new CharSeqWrapper(value)); } ``` The other way is to only wrap `CharSequence` within LiteralSet. In that way, there is no need to change `startsWith`. This seems cleaner with less side effect and I will update the PR accordingly. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #357: Add in and not in predicates
jun-he commented on a change in pull request #357: Add in and not in predicates URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313653184 ## File path: api/src/main/java/org/apache/iceberg/expressions/Evaluator.java ## @@ -134,18 +134,18 @@ public Boolean or(Boolean leftResult, Boolean rightResult) { } @Override -public Boolean in(BoundReference ref, Literal lit) { - throw new UnsupportedOperationException("In is not supported yet"); +public Boolean in(BoundReference ref, LiteralSet literalSet) { + return literalSet.contains(ref.get(struct)); } @Override -public Boolean notIn(BoundReference ref, Literal lit) { - return !in(ref, lit); +public Boolean notIn(BoundReference ref, LiteralSet literalSet) { + return !in(ref, literalSet); } @Override public Boolean startsWith(BoundReference ref, Literal lit) { - return ((String) ref.get(struct)).startsWith((String) lit.value()); + return ((String) ref.get(struct)).startsWith(lit.value().toString()); Review comment: Here, I changed the construction of `StringLiteral` to be ```StringLiteral(CharSequence value) { super(new CharSeqWrapper(value)); }``` The other way is to only wrap `CharSequence` within LiteralSet. In that way, there is no need to change `startsWith`. This seems cleaner with less side effect and I will update the PR accordingly. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #357: Add in and not in predicates
jun-he commented on a change in pull request #357: Add in and not in predicates URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313653184 ## File path: api/src/main/java/org/apache/iceberg/expressions/Evaluator.java ## @@ -134,18 +134,18 @@ public Boolean or(Boolean leftResult, Boolean rightResult) { } @Override -public Boolean in(BoundReference ref, Literal lit) { - throw new UnsupportedOperationException("In is not supported yet"); +public Boolean in(BoundReference ref, LiteralSet literalSet) { + return literalSet.contains(ref.get(struct)); } @Override -public Boolean notIn(BoundReference ref, Literal lit) { - return !in(ref, lit); +public Boolean notIn(BoundReference ref, LiteralSet literalSet) { + return !in(ref, literalSet); } @Override public Boolean startsWith(BoundReference ref, Literal lit) { - return ((String) ref.get(struct)).startsWith((String) lit.value()); + return ((String) ref.get(struct)).startsWith(lit.value().toString()); Review comment: Here, I changed the construction of `StringLiteral` to be ```StringLiteral(CharSequence value) { super(new CharSeqWrapper(value)); } ``` The other way is to only wrap `CharSequence` within LiteralSet. In that way, there is no need to change `startsWith`. This seems cleaner with less side effect and I will update the PR accordingly. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue opened a new pull request #377: Add FindFiles helper API
rdblue opened a new pull request #377: Add FindFiles helper API URL: https://github.com/apache/incubator-iceberg/pull/377 This adds `FindFiles`, a helper API for working with files in a table. This API is an alternative to building a scan and calling `planFiles`. Using a scan returns `FileScanTask` instead of `DataFile` instances and send a scan notification, which may not be useful when inspecting a table's files. This is intended to be used for inspecting a table's contents without needing a scan. For example, a merge process could use this API to locate files under a certain size in a partition to merge. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313647841 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -88,6 +121,40 @@ public OverwriteFiles validateAddedFiles() { } } +if (conflictingAppendsRowFilter != null) { + PartitionSpec spec = writeSpec(); + Expression inclusiveExpr = Projections.inclusive(spec).project(conflictingAppendsRowFilter); + Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr); + + List newFiles = collectNewFiles(base); + for (DataFile newFile : newFiles) { +// we do partition-level conflict resolution right now +// we can enhance it by leveraging column stats and MetricsEvaluator +ValidationException.check( +!inclusive.eval(newFile.partition()), +"A conflicting file was appended that matches filter '%s': %s", +conflictingAppendsRowFilter, newFile.path()); + } +} + return super.apply(base); } + + private List collectNewFiles(TableMetadata meta) { +List newFiles = new ArrayList<>(); + +Long currentSnapshotId = meta.currentSnapshot() == null ? null : meta.currentSnapshot().snapshotId(); +while (currentSnapshotId != null && !currentSnapshotId.equals(readSnapshotId)) { + Snapshot currentSnapshot = meta.snapshot(currentSnapshotId); + + if (currentSnapshot == null) { +throw new ValidationException("Cannot find snapshot %d. Was it expired?", currentSnapshotId); Review comment: If `meta.currentSnapshot` is null, this will print "Cannot find snapshot null". I think it would be better to start the method with a precondition that checks that meta has a current snapshot and fails with a better error message if it doesn't. Also, there are a few reasons why the snapshot can't be found. The base snapshot may have been expired, but it also could have been rolled back. I think this should just state "Cannot determine history between base snapshot %s and current %s". This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313646140 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -27,10 +30,15 @@ import org.apache.iceberg.expressions.StrictMetricsEvaluator; public class OverwriteData extends MergingSnapshotProducer implements OverwriteFiles { - private boolean validateAddedFiles = false; + private boolean overwriteByRowFilter; + private boolean validateAddedFilesMatchRowFilter; + private boolean deleteFilesByPath; + private Long readSnapshotId; + private Expression conflictingAppendsRowFilter; protected OverwriteData(TableOperations ops) { super(ops); +failMissingDeletePaths(); Review comment: Is this always required? It seems like it might only be needed when validating conflicting operations. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313645766 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -74,5 +90,22 @@ * * @return this for method chaining */ - OverwriteFiles validateAddedFiles(); + OverwriteFiles validateAddedFilesMatchRowFilter(); + + /** + * Enables validation of files that are added concurrently. Review comment: I'd like to add something about the type of validation. Does something like this make sense? > Enables validation that files added concurrently do not conflict with this commit's operation. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313645460 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -74,5 +90,22 @@ * * @return this for method chaining */ - OverwriteFiles validateAddedFiles(); + OverwriteFiles validateAddedFilesMatchRowFilter(); + + /** + * Enables validation of files that are added concurrently. + * + * This method should be called when the table is queried to determine files to delete/append. + * If a concurrent operation commits a new file after the data was read and that file might + * contain rows matching the specified row filter, the overwrite operation will detect + * this during retries and fail. + * + * Calling this method with a correct row filter is required to maintain serializable isolation. + * Otherwise, the isolation level will be snapshot isolation. + * + * @param readSnapshotId the snapshot id that was used to read the data + * @param rowFilter an expression on rows in the table + * @return this for method chaining + */ + OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression rowFilter); Review comment: Can we rename `rowFilter` to `conflictDetectionFilter`? Why use `Long` instead of `long`? That would allow passing null. Is that intentional? If null, then use the base snapshot ID? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313644916 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -74,5 +90,22 @@ * * @return this for method chaining */ - OverwriteFiles validateAddedFiles(); + OverwriteFiles validateAddedFilesMatchRowFilter(); Review comment: Since a row filter is passed to `validateNoConflictingAppends`, I think this should be `validateAddedFilesMatchOverwriteFilter` instead. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313643617 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -65,6 +70,17 @@ */ OverwriteFiles addFile(DataFile file); + /** + * Delete a {@link DataFile} from the table. + * + * Note that either this method or {@link OverwriteFiles#overwriteByRowFilter(Expression)} + * should be used to mark deleted files, not both. Review comment: Like my comment above, I don't think this paragraph is needed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313643529 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -50,6 +49,12 @@ * * Files that may contain some rows that match the expression and some rows that do not will * result in a {@link ValidationException}. + * + * This has no requirements for the latest snapshot and will not fail based on other snapshot + * changes. + * + * Note that either this method or {@link OverwriteFiles#deleteFile(DataFile)} + * should be used to mark deleted files, not both. Review comment: This update isn't quite true. There are methods to determine which files to delete, and there are separate ways to configure validations. When this operation is used to produce an idempotent overwrite, where it can be completed regardless of what files are added after it starts, it doesn't matter that the files were matched by a filter or directly by `deleteFile`. For example, I might want to overwrite an old file with new values for a column that was added. That overwrite is idempotent, but might use `deleteFile`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides
rdblue commented on a change in pull request #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313640101 ## File path: site/docs/configuration.md ## @@ -10,7 +10,7 @@ Iceberg tables support table properties to configure table behavior, like the de | | -- | -- | | read.split.target-size | 134217728 (128 MB) | Target size when combining input splits| | read.split.planning-lookback | 10 | Number of bins to consider when combining input splits | - +| read.split.open-file-cost| 4194304 (4 MB) | The estimated cost to open a file, used as a minimum weight when combining splits. | Review comment: Thanks for catching this oversight! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides
rdblue commented on a change in pull request #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313639861 ## File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java ## @@ -54,22 +56,24 @@ private final TableOperations ops; private final Table table; private final Long snapshotId; + private final Map options; private final Schema schema; private final Expression rowFilter; private final boolean caseSensitive; private final boolean colStats; private final Collection selectedColumns; protected BaseTableScan(TableOperations ops, Table table, Schema schema) { -this(ops, table, null, schema, Expressions.alwaysTrue(), true, false, null); +this(ops, table, null, null, schema, Expressions.alwaysTrue(), true, false, null); } - protected BaseTableScan(TableOperations ops, Table table, Long snapshotId, Schema schema, -Expression rowFilter, boolean caseSensitive, boolean colStats, + protected BaseTableScan(TableOperations ops, Table table, Long snapshotId, Map options, Review comment: Why add options in the middle instead of at the end? Seems like you're changing more lines by displacing the other arguments. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides
rdblue commented on a change in pull request #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313639689 ## File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java ## @@ -54,22 +56,24 @@ private final TableOperations ops; private final Table table; private final Long snapshotId; + private final Map options; private final Schema schema; private final Expression rowFilter; private final boolean caseSensitive; private final boolean colStats; private final Collection selectedColumns; protected BaseTableScan(TableOperations ops, Table table, Schema schema) { -this(ops, table, null, schema, Expressions.alwaysTrue(), true, false, null); +this(ops, table, null, null, schema, Expressions.alwaysTrue(), true, false, null); Review comment: You can avoid null checks by always having a map instance. Using `ImmutableMap.of()` would be easier. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides
rdblue commented on a change in pull request #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313639543 ## File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java ## @@ -123,31 +128,46 @@ public TableScan asOfTime(long timestampMillis) { return useSnapshot(lastSnapshotId); } + @Override + public TableScan option(String property, String value) { +Map localOptions = Maps.newHashMap(); Review comment: As long as this creates a new map each time, I think it makes sense to use an ImmutableMap: ```java ImmutableMap.Builder builder = ImmutableMap.builder(); builder.putAll(options); builder.put(property, value); return newRefinedScan(..., builder.build()); ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides
rdblue commented on a change in pull request #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313638753 ## File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java ## @@ -176,11 +188,26 @@ public Expression filter() { @Override public CloseableIterable planTasks() { -long splitSize = targetSplitSize(ops); -int lookback = ops.current().propertyAsInt( -TableProperties.SPLIT_LOOKBACK, TableProperties.SPLIT_LOOKBACK_DEFAULT); -long openFileCost = ops.current().propertyAsLong( -TableProperties.SPLIT_OPEN_FILE_COST, TableProperties.SPLIT_OPEN_FILE_COST_DEFAULT); +long splitSize; +if (splitOptions != null && splitOptions.getSplitSize() != null) { + splitSize = splitOptions.getSplitSize(); +} else { + splitSize = targetSplitSize(ops); +} +int lookback; +if (splitOptions != null && splitOptions.getSplitLookback() != null) { + lookback = splitOptions.getSplitLookback(); +} else { + lookback = ops.current().propertyAsInt( + TableProperties.SPLIT_LOOKBACK, TableProperties.SPLIT_LOOKBACK_DEFAULT); Review comment: I think accepting both is a good idea, but we can add that later. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides
rdblue commented on a change in pull request #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313638753 ## File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java ## @@ -176,11 +188,26 @@ public Expression filter() { @Override public CloseableIterable planTasks() { -long splitSize = targetSplitSize(ops); -int lookback = ops.current().propertyAsInt( -TableProperties.SPLIT_LOOKBACK, TableProperties.SPLIT_LOOKBACK_DEFAULT); -long openFileCost = ops.current().propertyAsLong( -TableProperties.SPLIT_OPEN_FILE_COST, TableProperties.SPLIT_OPEN_FILE_COST_DEFAULT); +long splitSize; +if (splitOptions != null && splitOptions.getSplitSize() != null) { + splitSize = splitOptions.getSplitSize(); +} else { + splitSize = targetSplitSize(ops); +} +int lookback; +if (splitOptions != null && splitOptions.getSplitLookback() != null) { + lookback = splitOptions.getSplitLookback(); +} else { + lookback = ops.current().propertyAsInt( + TableProperties.SPLIT_LOOKBACK, TableProperties.SPLIT_LOOKBACK_DEFAULT); Review comment: I think accepting both is a good idea This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #357: Add in and not in predicates
rdblue commented on a change in pull request #357: Add in and not in predicates URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313633929 ## File path: api/src/main/java/org/apache/iceberg/expressions/Evaluator.java ## @@ -134,18 +134,18 @@ public Boolean or(Boolean leftResult, Boolean rightResult) { } @Override -public Boolean in(BoundReference ref, Literal lit) { - throw new UnsupportedOperationException("In is not supported yet"); +public Boolean in(BoundReference ref, LiteralSet literalSet) { + return literalSet.contains(ref.get(struct)); } @Override -public Boolean notIn(BoundReference ref, Literal lit) { - return !in(ref, lit); +public Boolean notIn(BoundReference ref, LiteralSet literalSet) { + return !in(ref, literalSet); } @Override public Boolean startsWith(BoundReference ref, Literal lit) { - return ((String) ref.get(struct)).startsWith((String) lit.value()); + return ((String) ref.get(struct)).startsWith(lit.value().toString()); Review comment: Why did the type used by the literal change? I don't think it needs to. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue opened a new pull request #376: Do not use original manifest list when adding filters
rdblue opened a new pull request #376: Do not use original manifest list when adding filters URL: https://github.com/apache/incubator-iceberg/pull/376 This fixes a bug in `ManifestGroup` that caused all manifests to be read instead of correctly filtering when either `ignoreExisting` or `ignoreDeleted` was used because those filters were applied to the wrong variable, `manifests` instead of `filteredManifests`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #357: Add in and not in predicates
jun-he commented on a change in pull request #357: Add in and not in predicates URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313593900 ## File path: api/src/main/java/org/apache/iceberg/expressions/Predicate.java ## @@ -19,15 +19,44 @@ package org.apache.iceberg.expressions; +import com.google.common.base.Preconditions; +import com.google.common.collect.ImmutableSet; +import java.util.Collection; +import java.util.Set; +import java.util.stream.Collectors; +import java.util.stream.Stream; + public abstract class Predicate implements Expression { private final Operation op; private final R ref; private final Literal literal; + private final Set> literalSet; Review comment: @rdblue Sorry for the confusion. In this change, only factory methods in `Expressions` and the `bind` in `UnboundPredicate` can change the returned predicate. What I meant is that the factory method `in` will return an `EQ` predicate if only a single element is provided because UnboundPredicate factory method for IN operation with a single item should return `EQ` predicate. As the `bind` returns `Expression`, it is straightforward to separated the `BoundPredicate`. But in the case of the factory method, if it also returns `Expression` or a new interface class instead of `UnboundPredicate`, then later when it is used, it has to be explicitly cast by checking the `operation`. This seems over-complicated compared to the current approach. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #357: Add in and not in predicates
jun-he commented on a change in pull request #357: Add in and not in predicates URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313593900 ## File path: api/src/main/java/org/apache/iceberg/expressions/Predicate.java ## @@ -19,15 +19,44 @@ package org.apache.iceberg.expressions; +import com.google.common.base.Preconditions; +import com.google.common.collect.ImmutableSet; +import java.util.Collection; +import java.util.Set; +import java.util.stream.Collectors; +import java.util.stream.Stream; + public abstract class Predicate implements Expression { private final Operation op; private final R ref; private final Literal literal; + private final Set> literalSet; Review comment: @rdblue Sorry for the misunderstanding. In this change, only factory methods in `Expressions` and the `bind` in `UnboundPredicate` can change the returned predicate. What I meant is that the factory method `in` will return an `EQ` predicate if only a single element is provided because UnboundPredicate factory method for IN operation with a single item should return `EQ` predicate. As the `bind` returns `Expression`, it is straightforward to separated the `BoundPredicate`. But in the case of the factory method, if it also returns `Expression` or a new interface class instead of `UnboundPredicate`, then later when it is used, it has to be explicitly cast by checking the `operation`. This seems over-complicated compared to the current approach. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] xabriel commented on a change in pull request #372: Allow read split behavior overrides
xabriel commented on a change in pull request #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313589124 ## File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java ## @@ -176,11 +188,26 @@ public Expression filter() { @Override public CloseableIterable planTasks() { -long splitSize = targetSplitSize(ops); -int lookback = ops.current().propertyAsInt( -TableProperties.SPLIT_LOOKBACK, TableProperties.SPLIT_LOOKBACK_DEFAULT); -long openFileCost = ops.current().propertyAsLong( -TableProperties.SPLIT_OPEN_FILE_COST, TableProperties.SPLIT_OPEN_FILE_COST_DEFAULT); +long splitSize; +if (splitOptions != null && splitOptions.getSplitSize() != null) { + splitSize = splitOptions.getSplitSize(); +} else { + splitSize = targetSplitSize(ops); +} +int lookback; +if (splitOptions != null && splitOptions.getSplitLookback() != null) { + lookback = splitOptions.getSplitLookback(); +} else { + lookback = ops.current().propertyAsInt( + TableProperties.SPLIT_LOOKBACK, TableProperties.SPLIT_LOOKBACK_DEFAULT); Review comment: Went with short option names for the Spark `Reader`, but full property name when passed to the `TableScan` object. Let me know if this matches the precedent? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #357: Add in and not in predicates
jun-he commented on a change in pull request #357: Add in and not in predicates URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313576493 ## File path: api/src/main/java/org/apache/iceberg/expressions/Evaluator.java ## @@ -134,18 +134,18 @@ public Boolean or(Boolean leftResult, Boolean rightResult) { } @Override -public Boolean in(BoundReference ref, Literal lit) { - throw new UnsupportedOperationException("In is not supported yet"); +public Boolean in(BoundReference ref, LiteralSet literalSet) { + return literalSet.contains(ref.get(struct)); } @Override -public Boolean notIn(BoundReference ref, Literal lit) { - return !in(ref, lit); +public Boolean notIn(BoundReference ref, LiteralSet literalSet) { + return !in(ref, literalSet); } @Override public Boolean startsWith(BoundReference ref, Literal lit) { - return ((String) ref.get(struct)).startsWith((String) lit.value()); + return ((String) ref.get(struct)).startsWith(lit.value().toString()); Review comment: The issue is that `lit.value()` in this case will return a `CharSeqWrapper` instead of `String`. It cannot be cast to `String`. If the underlying is a `String`, `toString()` will still return it without extra costs. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] xabriel commented on a change in pull request #372: Allow read split behavior overrides
xabriel commented on a change in pull request #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313525003 ## File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java ## @@ -176,11 +188,26 @@ public Expression filter() { @Override public CloseableIterable planTasks() { -long splitSize = targetSplitSize(ops); -int lookback = ops.current().propertyAsInt( -TableProperties.SPLIT_LOOKBACK, TableProperties.SPLIT_LOOKBACK_DEFAULT); -long openFileCost = ops.current().propertyAsLong( -TableProperties.SPLIT_OPEN_FILE_COST, TableProperties.SPLIT_OPEN_FILE_COST_DEFAULT); +long splitSize; +if (splitOptions != null && splitOptions.getSplitSize() != null) { + splitSize = splitOptions.getSplitSize(); +} else { + splitSize = targetSplitSize(ops); +} +int lookback; +if (splitOptions != null && splitOptions.getSplitLookback() != null) { + lookback = splitOptions.getSplitLookback(); +} else { + lookback = ops.current().propertyAsInt( + TableProperties.SPLIT_LOOKBACK, TableProperties.SPLIT_LOOKBACK_DEFAULT); Review comment: Do you mean to not accept, say, `read.split.target-size` in favor of `split-size`, or that we should accept both? Re updating the docs, will do. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] xabriel commented on issue #371: Fix reading of old .metadata.json.gz files
xabriel commented on issue #371: Fix reading of old .metadata.json.gz files URL: https://github.com/apache/incubator-iceberg/pull/371#issuecomment-520931166 I think the guarantee should match what we have for table evolution: If Iceberg generated it, then Iceberg should be able to read it. And we should have the tests to prove it. Having said that, I do understand that we are still moving fast and that there has been no official release yet. So I would be ok if we decide to drop this particular backwards compatible code when we cut the first release. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides
rdblue commented on a change in pull request #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313520937 ## File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java ## @@ -176,11 +188,26 @@ public Expression filter() { @Override public CloseableIterable planTasks() { -long splitSize = targetSplitSize(ops); -int lookback = ops.current().propertyAsInt( -TableProperties.SPLIT_LOOKBACK, TableProperties.SPLIT_LOOKBACK_DEFAULT); -long openFileCost = ops.current().propertyAsLong( -TableProperties.SPLIT_OPEN_FILE_COST, TableProperties.SPLIT_OPEN_FILE_COST_DEFAULT); +long splitSize; +if (splitOptions != null && splitOptions.getSplitSize() != null) { + splitSize = splitOptions.getSplitSize(); +} else { + splitSize = targetSplitSize(ops); +} +int lookback; +if (splitOptions != null && splitOptions.getSplitLookback() != null) { + lookback = splitOptions.getSplitLookback(); +} else { + lookback = ops.current().propertyAsInt( + TableProperties.SPLIT_LOOKBACK, TableProperties.SPLIT_LOOKBACK_DEFAULT); Review comment: Actually, "snapshot-id" is a read option, not a write option: http://iceberg.apache.org/configuration/#read-options Not sure how I missed that. Anyway, I think we've set a precedent that we should use shorter option names for interactive cases like this, instead of full config names. Could you also update the read options table? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on issue #365: test cases for parquetMetrics with multiple rowgroup
rdblue commented on issue #365: test cases for parquetMetrics with multiple rowgroup URL: https://github.com/apache/incubator-iceberg/pull/365#issuecomment-520927967 Thanks for contributing this, @manishmalhotrawork! I merged it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue merged pull request #365: test cases for parquetMetrics with multiple rowgroup
rdblue merged pull request #365: test cases for parquetMetrics with multiple rowgroup URL: https://github.com/apache/incubator-iceberg/pull/365 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides
rdblue commented on a change in pull request #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313517386 ## File path: api/src/main/java/org/apache/iceberg/TableScan.java ## @@ -58,6 +58,15 @@ */ TableScan asOfTime(long timestampMillis); + /** + * Create a new {@link TableScan} from this scan's configuration that will override the {@link Table}'s read splitting + * behavior based on the incoming {@link SplitOptions}. + * + * @param splitOptions a {@link SplitOptions} containing the splitting options + * @return a new scan based on this with the splitting options set + */ + TableScan splitOptions(SplitOptions splitOptions); Review comment: This would also make it easier to translate from Spark to Iceberg. You'd just need to add all of the options to the scan. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides
rdblue commented on a change in pull request #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313516781 ## File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java ## @@ -176,11 +188,26 @@ public Expression filter() { @Override public CloseableIterable planTasks() { -long splitSize = targetSplitSize(ops); -int lookback = ops.current().propertyAsInt( -TableProperties.SPLIT_LOOKBACK, TableProperties.SPLIT_LOOKBACK_DEFAULT); -long openFileCost = ops.current().propertyAsLong( -TableProperties.SPLIT_OPEN_FILE_COST, TableProperties.SPLIT_OPEN_FILE_COST_DEFAULT); +long splitSize; +if (splitOptions != null && splitOptions.getSplitSize() != null) { + splitSize = splitOptions.getSplitSize(); +} else { + splitSize = targetSplitSize(ops); +} +int lookback; +if (splitOptions != null && splitOptions.getSplitLookback() != null) { + lookback = splitOptions.getSplitLookback(); +} else { + lookback = ops.current().propertyAsInt( + TableProperties.SPLIT_LOOKBACK, TableProperties.SPLIT_LOOKBACK_DEFAULT); Review comment: On the write path, we decided to go with easier to remember option names, like "snapshot-id" and "write-format", to be less verbose. What do you think about doing that here? This could be "lookback", "split-size", and "file-open-cost". This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides
rdblue commented on a change in pull request #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313515860 ## File path: api/src/main/java/org/apache/iceberg/TableScan.java ## @@ -58,6 +58,15 @@ */ TableScan asOfTime(long timestampMillis); + /** + * Create a new {@link TableScan} from this scan's configuration that will override the {@link Table}'s read splitting + * behavior based on the incoming {@link SplitOptions}. + * + * @param splitOptions a {@link SplitOptions} containing the splitting options + * @return a new scan based on this with the splitting options set + */ + TableScan splitOptions(SplitOptions splitOptions); Review comment: What about using more generic options that are string keys and values? Something like this: ```java public TableScan option(String property, String value); ``` That would make this more generic so we can use options to control behavior other than splits, without adding a lot of methods to the scan API. What do you think? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on issue #372: Allow read split behavior overrides
rdblue commented on issue #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372#issuecomment-520925050 I restarted the tests. The Hive tests are a little flaky right now. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue merged pull request #375: Vectorized iceberg reader
rdblue merged pull request #375: Vectorized iceberg reader URL: https://github.com/apache/incubator-iceberg/pull/375 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue merged pull request #371: Fix reading of old .metadata.json.gz files
rdblue merged pull request #371: Fix reading of old .metadata.json.gz files URL: https://github.com/apache/incubator-iceberg/pull/371 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on issue #371: Fix reading of old .metadata.json.gz files
rdblue commented on issue #371: Fix reading of old .metadata.json.gz files URL: https://github.com/apache/incubator-iceberg/pull/371#issuecomment-520924428 Thanks for fixing this, @xabriel! I didn't realize that #258 had broken the old extension. I'm a little concerned about needing to check for so many file extensions for compatibility; hopefully we can standardize this and eventually stop. Would be great to hear what you think the guarantee from Iceberg should be. FYI @aokolnychyi. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] manishmalhotrawork commented on a change in pull request #365: test cases for parquetMetrics with multiple rowgroup
manishmalhotrawork commented on a change in pull request #365: test cases for parquetMetrics with multiple rowgroup URL: https://github.com/apache/incubator-iceberg/pull/365#discussion_r313506603 ## File path: core/src/test/java/org/apache/iceberg/TestMetrics.java ## @@ -270,6 +289,97 @@ public void testMetricsForNullColumns() throws IOException { assertBounds(1, IntegerType.get(), null, null, metrics); } + @Test + public void testMetricsForTopLevelWithMultipleRG() throws Exception { Review comment: thanks @rdblue. please let me know if with changes it looks good. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] manishmalhotrawork commented on a change in pull request #365: test cases for parquetMetrics with multiple rowgroup
manishmalhotrawork commented on a change in pull request #365: test cases for parquetMetrics with multiple rowgroup URL: https://github.com/apache/incubator-iceberg/pull/365#discussion_r313506252 ## File path: core/src/test/java/org/apache/iceberg/TestMetrics.java ## @@ -270,6 +289,97 @@ public void testMetricsForNullColumns() throws IOException { assertBounds(1, IntegerType.get(), null, null, metrics); } + @Test + public void testMetricsForTopLevelWithMultipleRG() throws Exception { +Schema schema = getSimpleSchema(); +int minimumRowGroupRecordCount = 100; Review comment: updated. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] manishmalhotrawork commented on a change in pull request #365: test cases for parquetMetrics with multiple rowgroup
manishmalhotrawork commented on a change in pull request #365: test cases for parquetMetrics with multiple rowgroup URL: https://github.com/apache/incubator-iceberg/pull/365#discussion_r313506183 ## File path: core/src/test/java/org/apache/iceberg/TestMetrics.java ## @@ -270,6 +289,97 @@ public void testMetricsForNullColumns() throws IOException { assertBounds(1, IntegerType.get(), null, null, metrics); } + @Test + public void testMetricsForTopLevelWithMultipleRG() throws Exception { +Schema schema = getSimpleSchema(); +int minimumRowGroupRecordCount = 100; +List records = new ArrayList<>(201); + +for (int i = 0; i < 201; i++) { + GenericData.Record newRecord = new GenericData.Record(AvroSchemaUtil.convert(schema.asStruct())); + newRecord.put("booleanCol", i == 0 ? false : true); + newRecord.put("intCol", i + 1); + newRecord.put("longCol", i == 0 ? null : i + 1L); + newRecord.put("floatCol", i + 1.0F); + newRecord.put("doubleCol", i == 0 ? null : i + 1.0D); + newRecord.put("decimalCol", i == 0 ? null : new BigDecimal(i + "").add(new BigDecimal("1.00"))); + newRecord.put("stringCol", "AAA"); + newRecord.put("dateCol", i + 1); + newRecord.put("timeCol", i + 1L); + newRecord.put("timestampCol", i + 1L); + newRecord.put("uuidCol", uuid); + newRecord.put("fixedCol", fixed); + newRecord.put("binaryCol", "S".getBytes()); + records.add(newRecord); +} + +// create parquet file with multiple row groups. by using smaller number of bytes +File parquetFile = writeRecords( +schema, +ImmutableMap.of(PARQUET_ROW_GROUP_SIZE_BYTES, Integer.toString(ROW_GROUP_SIZE)), +records.toArray(new GenericData.Record[] {})); + +Assert.assertNotNull(parquetFile); +// rowgroup size should be > 1 +Assert.assertEquals(3, splitCount(parquetFile)); + +Metrics metrics = getMetrics(localInput(parquetFile)); +Assert.assertEquals(201L, (long) metrics.recordCount()); +assertCounts(1, 201L, 0L, metrics); +assertBounds(1, Types.BooleanType.get(), false, true, metrics); +assertBounds(2, Types.IntegerType.get(), 1, 201, metrics); +assertCounts(3, 201L, 1L, metrics); +assertBounds(3, Types.LongType.get(), 2L, 201L, metrics); +assertCounts(4, 201L, 0L, metrics); +assertBounds(4, Types.FloatType.get(), 1.0F, 201.0F, metrics); +assertCounts(5, 201L, 1L, metrics); +assertBounds(5, Types.DoubleType.get(), 2.0D, 201.0D, metrics); +assertCounts(6, 201L, 1L, metrics); +assertBounds(6, Types.DecimalType.of(10, 2), new BigDecimal("2.00"), +new BigDecimal("201.00"), metrics); + } + + @Test + public void testMetricsForNestedStructFieldsWithMultipleRG() throws IOException { +int minimumRowGroupRecordCount = 101; Review comment: updated. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] xabriel commented on issue #372: Allow read split behavior overrides
xabriel commented on issue #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372#issuecomment-520912826 Not sure how to retest. Failed tests do look unrelated, caused by a `java.net.SocketException`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #357: Add in and not in predicates
rdblue commented on a change in pull request #357: Add in and not in predicates URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313499455 ## File path: api/src/main/java/org/apache/iceberg/expressions/Evaluator.java ## @@ -134,18 +134,18 @@ public Boolean or(Boolean leftResult, Boolean rightResult) { } @Override -public Boolean in(BoundReference ref, Literal lit) { - throw new UnsupportedOperationException("In is not supported yet"); +public Boolean in(BoundReference ref, LiteralSet literalSet) { + return literalSet.contains(ref.get(struct)); } @Override -public Boolean notIn(BoundReference ref, Literal lit) { - return !in(ref, lit); +public Boolean notIn(BoundReference ref, LiteralSet literalSet) { + return !in(ref, literalSet); } @Override public Boolean startsWith(BoundReference ref, Literal lit) { - return ((String) ref.get(struct)).startsWith((String) lit.value()); + return ((String) ref.get(struct)).startsWith(lit.value().toString()); Review comment: This isn't a related change, and the literal is guaranteed to be a string at this point. I don't think this should be included in this PR. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #357: Add in and not in predicates
rdblue commented on a change in pull request #357: Add in and not in predicates URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313498614 ## File path: api/src/main/java/org/apache/iceberg/expressions/Predicate.java ## @@ -19,15 +19,44 @@ package org.apache.iceberg.expressions; +import com.google.common.base.Preconditions; +import com.google.common.collect.ImmutableSet; +import java.util.Collection; +import java.util.Set; +import java.util.stream.Collectors; +import java.util.stream.Stream; + public abstract class Predicate implements Expression { private final Operation op; private final R ref; private final Literal literal; + private final Set> literalSet; Review comment: > UnboundPredicate for IN predicate with a single item should return an EQ predicate. The separation will complicate it. I disagree because that over-complicates the predicate classes. The factory methods in `Expressions` are allowed to change the predicate that is returned; for example, `and(true, p)` returns just `p`. That's what we should use to return an equality predicate. The predicate class itself should not include logic for changing itself. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] anjalinorwood opened a new pull request #375: Vectorized iceberg reader
anjalinorwood opened a new pull request #375: Vectorized iceberg reader URL: https://github.com/apache/incubator-iceberg/pull/375 This commit: + implements batched version of iterators (BatchedPageReader, ColumnarBatchReader etc) as the value-based iterator abstractions were taking up too much CPU. + moves the reads close to page level by consulting the definition levels and directly reading from values vector. Co-Authored-By: Samarth Jain This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] xabriel commented on issue #372: Allow read split behavior overrides
xabriel commented on issue #372: Allow read split behavior overrides URL: https://github.com/apache/incubator-iceberg/pull/372#issuecomment-520893129 retest This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #357: Add in and not in predicates
jun-he commented on a change in pull request #357: Add in and not in predicates URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313463995 ## File path: api/src/main/java/org/apache/iceberg/expressions/ExpressionVisitors.java ## @@ -89,11 +91,11 @@ public R or(R leftResult, R rightResult) { return null; } -public R in(BoundReference ref, Literal lit) { +public R in(BoundReference ref, Set> lits) { return null; Review comment: This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] chenjunjiedada opened a new pull request #374: Migrate spark table to iceberg table
chenjunjiedada opened a new pull request #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374 This PR converts a spark table to iceberg table. It assumes no parallel operation is going on the original table and target table and thus is not thread-safe. PR is still working in progress. Unit tests will be added when APIs is settled down. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] chenjunjiedada closed pull request #373: Migrate spark table to iceberg table
chenjunjiedada closed pull request #373: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/373 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] chenjunjiedada opened a new pull request #373: Migrate spark table to iceberg table
chenjunjiedada opened a new pull request #373: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/373 This PR converts a spark table to iceberg table. It assumes no parallel operation is going on the original table and target table and thus is not thread-safe. PR is still working in progress. Unit tests will be added when APIs is settled down. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #357: Add in and not in predicates
jun-he commented on a change in pull request #357: Add in and not in predicates URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313259192 ## File path: api/src/main/java/org/apache/iceberg/expressions/Evaluator.java ## @@ -134,13 +135,13 @@ public Boolean or(Boolean leftResult, Boolean rightResult) { } @Override -public Boolean in(BoundReference ref, Literal lit) { - throw new UnsupportedOperationException("In is not supported yet"); +public Boolean in(BoundReference ref, Set> lits) { + return lits.contains(Literals.from(ref.get(struct))); Review comment: Oops, I missed your comment. Yep, you are right. I found this issue and came up with a similar solution. Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] manishmalhotrawork commented on a change in pull request #365: test cases for parquetMetrics with multiple rowgroup
manishmalhotrawork commented on a change in pull request #365: test cases for parquetMetrics with multiple rowgroup URL: https://github.com/apache/incubator-iceberg/pull/365#discussion_r313238208 ## File path: core/src/test/java/org/apache/iceberg/TestMetrics.java ## @@ -63,32 +69,39 @@ */ public abstract class TestMetrics { + public static final int ROW_GROUP_SIZE = 1600; private final UUID uuid = UUID.randomUUID(); private final GenericFixed fixed = new GenericData.Fixed( org.apache.avro.Schema.createFixed("fixedCol", null, null, 4), "abcd".getBytes(StandardCharsets.UTF_8)); + private static StructType leafStructType = StructType.of( Review comment: updated. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] manishmalhotrawork commented on a change in pull request #365: test cases for parquetMetrics with multiple rowgroup
manishmalhotrawork commented on a change in pull request #365: test cases for parquetMetrics with multiple rowgroup URL: https://github.com/apache/incubator-iceberg/pull/365#discussion_r313238331 ## File path: core/src/test/java/org/apache/iceberg/TestMetrics.java ## @@ -153,6 +166,24 @@ public void testMetricsForTopLevelFields() throws IOException { ByteBuffer.wrap("S".getBytes()), ByteBuffer.wrap("W".getBytes()), metrics); } + private Schema getSimpleSchema() { Review comment: sure, done. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] manishmalhotrawork commented on a change in pull request #365: test cases for parquetMetrics with multiple rowgroup
manishmalhotrawork commented on a change in pull request #365: test cases for parquetMetrics with multiple rowgroup URL: https://github.com/apache/incubator-iceberg/pull/365#discussion_r313238232 ## File path: core/src/test/java/org/apache/iceberg/TestMetrics.java ## @@ -270,6 +289,97 @@ public void testMetricsForNullColumns() throws IOException { assertBounds(1, IntegerType.get(), null, null, metrics); } + @Test + public void testMetricsForTopLevelWithMultipleRG() throws Exception { Review comment: updated. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org