[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r317800614 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() { } } +if (conflictDetectionFilter != null) { + PartitionSpec spec = writeSpec(); + Expression inclusiveExpr = Projections.inclusive(spec).project(conflictDetectionFilter); + Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr); + + InclusiveMetricsEvaluator metrics = new InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter); + + List newFiles = collectNewFiles(base); + for (DataFile newFile : newFiles) { +ValidationException.check( +!inclusive.eval(newFile.partition()) || !metrics.eval(newFile), +"A conflicting file was appended that matches filter '%s': %s", +conflictDetectionFilter, newFile.path()); + } +} + return super.apply(base); } + + private List collectNewFiles(TableMetadata meta) { +List newFiles = new ArrayList<>(); + +Long currentSnapshotId = meta.currentSnapshot() == null ? null : meta.currentSnapshot().snapshotId(); +while (currentSnapshotId != null && !currentSnapshotId.equals(readSnapshotId)) { Review comment: Okay, it should be fine if readSnapshotId is always explicitly set. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r317799344 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() { } } +if (conflictDetectionFilter != null) { + PartitionSpec spec = writeSpec(); + Expression inclusiveExpr = Projections.inclusive(spec).project(conflictDetectionFilter); + Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr); + + InclusiveMetricsEvaluator metrics = new InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter); + + List newFiles = collectNewFiles(base); + for (DataFile newFile : newFiles) { +ValidationException.check( +!inclusive.eval(newFile.partition()) || !metrics.eval(newFile), +"A conflicting file was appended that matches filter '%s': %s", +conflictDetectionFilter, newFile.path()); + } +} + return super.apply(base); } + + private List collectNewFiles(TableMetadata meta) { +List newFiles = new ArrayList<>(); + +Long currentSnapshotId = meta.currentSnapshot() == null ? null : meta.currentSnapshot().snapshotId(); Review comment: Right, that works. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315464699 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -56,14 +67,34 @@ public OverwriteFiles addFile(DataFile file) { } @Override - public OverwriteFiles validateAddedFiles() { -this.validateAddedFiles = true; + public OverwriteFiles deleteFile(DataFile file) { +Preconditions.checkState(!overwriteByRowFilter, +"Either overwriteByRowFilter or deleteFile should be used, not both"); Review comment: I don't think these are mutually exclusive. I could still create an idempotent overwrite that deletes files directly. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315464625 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -56,14 +67,34 @@ public OverwriteFiles addFile(DataFile file) { } @Override - public OverwriteFiles validateAddedFiles() { -this.validateAddedFiles = true; + public OverwriteFiles deleteFile(DataFile file) { +Preconditions.checkState(!overwriteByRowFilter, +"Either overwriteByRowFilter or deleteFile should be used, not both"); +this.deleteFilesByPath = true; +delete(file.path()); +return this; + } + + @Override + public OverwriteFiles validateAddedFilesMatchOverwriteFilter() { +Preconditions.checkState(!deleteFilesByPath, +"Added files can be validated only if overwriteByRowFilter is used"); Review comment: I don't think these are mutually exclusive. I could still delete by filter and want to validate no files matching a filter (probably the same one) have been added. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315464103 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() { } } +if (conflictDetectionFilter != null) { + PartitionSpec spec = writeSpec(); + Expression inclusiveExpr = Projections.inclusive(spec).project(conflictDetectionFilter); + Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr); + + InclusiveMetricsEvaluator metrics = new InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter); + + List newFiles = collectNewFiles(base); + for (DataFile newFile : newFiles) { +ValidationException.check( +!inclusive.eval(newFile.partition()) || !metrics.eval(newFile), +"A conflicting file was appended that matches filter '%s': %s", +conflictDetectionFilter, newFile.path()); + } +} + return super.apply(base); } + + private List collectNewFiles(TableMetadata meta) { +List newFiles = new ArrayList<>(); + +Long currentSnapshotId = meta.currentSnapshot() == null ? null : meta.currentSnapshot().snapshotId(); +while (currentSnapshotId != null && !currentSnapshotId.equals(readSnapshotId)) { Review comment: If `readSnapshotId` is null, then this use the _entire_ table history? Or am I missing where `readSnapshotId` is defaulted to the current snapshot ID when the operation starts? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315464206 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -19,15 +19,23 @@ package org.apache.iceberg; +import com.google.common.base.Preconditions; +import com.google.common.collect.Iterables; +import java.util.ArrayList; import java.util.List; import org.apache.iceberg.exceptions.ValidationException; import org.apache.iceberg.expressions.Evaluator; import org.apache.iceberg.expressions.Expression; +import org.apache.iceberg.expressions.InclusiveMetricsEvaluator; import org.apache.iceberg.expressions.Projections; import org.apache.iceberg.expressions.StrictMetricsEvaluator; public class OverwriteData extends MergingSnapshotProducer implements OverwriteFiles { - private boolean validateAddedFiles = false; + private boolean overwriteByRowFilter; + private boolean validateAddedFilesMatchOverwriteFilter; + private boolean deleteFilesByPath; + private Long readSnapshotId; + private Expression conflictDetectionFilter; Review comment: I'd prefer to add explicit defaults for readability. Is there a reason to remove them? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315463778 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -74,5 +85,23 @@ * * @return this for method chaining */ - OverwriteFiles validateAddedFiles(); + OverwriteFiles validateAddedFilesMatchOverwriteFilter(); + + /** + * Enables validation that files added concurrently do not conflict with this commit's operation. + * + * This method should be called when the table is queried to determine which files to delete/append. + * If a concurrent operation commits a new file after the data was read and that file might + * contain rows matching the specified conflict detection filter, the overwrite operation + * will detect this during retries and fail. + * + * Calling this method with a correct conflict detection filter is required to maintain + * serializable isolation for eager update/delete operations. Otherwise, the isolation level + * will be snapshot isolation. + * + * @param readSnapshotId the snapshot id that was used to read the data Review comment: This should document what happens when `readSnapshotId` is null since that's supported. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315463525 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() { } } +if (conflictDetectionFilter != null) { + PartitionSpec spec = writeSpec(); + Expression inclusiveExpr = Projections.inclusive(spec).project(conflictDetectionFilter); + Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr); + + InclusiveMetricsEvaluator metrics = new InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter); + + List newFiles = collectNewFiles(base); + for (DataFile newFile : newFiles) { +ValidationException.check( +!inclusive.eval(newFile.partition()) || !metrics.eval(newFile), +"A conflicting file was appended that matches filter '%s': %s", Review comment: Minor: This isn't quite correct. This doesn't guarantee that a file matches the filter, just that it may contain data that matches the filter. For example, if a table is bucketed by `id` and a file has ids `x` and `z` where `x < y` and `y < z` and `x`, `y`, and `z` all hash to the same bucket, then `id = y` would match the file even though `y` isn't actually in it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315463525 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() { } } +if (conflictDetectionFilter != null) { + PartitionSpec spec = writeSpec(); + Expression inclusiveExpr = Projections.inclusive(spec).project(conflictDetectionFilter); + Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr); + + InclusiveMetricsEvaluator metrics = new InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter); + + List newFiles = collectNewFiles(base); + for (DataFile newFile : newFiles) { +ValidationException.check( +!inclusive.eval(newFile.partition()) || !metrics.eval(newFile), +"A conflicting file was appended that matches filter '%s': %s", Review comment: Minor: This isn't quite correct. This doesn't guarantee that a file matches the filter, just that it may contain data that matches the filter. For example, if a table is bucketed by `id` and a file has ids `x` and `z` where `x < y` and `y < z` and `x`, `y`, and `z` all hash to the same bucket, then this would match the file even though `y` isn't actually in it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315462678 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() { } } +if (conflictDetectionFilter != null) { + PartitionSpec spec = writeSpec(); + Expression inclusiveExpr = Projections.inclusive(spec).project(conflictDetectionFilter); + Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr); + + InclusiveMetricsEvaluator metrics = new InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter); + + List newFiles = collectNewFiles(base); + for (DataFile newFile : newFiles) { +ValidationException.check( +!inclusive.eval(newFile.partition()) || !metrics.eval(newFile), +"A conflicting file was appended that matches filter '%s': %s", +conflictDetectionFilter, newFile.path()); + } +} + return super.apply(base); } + + private List collectNewFiles(TableMetadata meta) { +List newFiles = new ArrayList<>(); + +Long currentSnapshotId = meta.currentSnapshot() == null ? null : meta.currentSnapshot().snapshotId(); Review comment: I think that if the current snapshot is null, the loop is skipped and this returns an empty list. That's not correct, right? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315462125 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -74,5 +85,23 @@ * * @return this for method chaining */ - OverwriteFiles validateAddedFiles(); + OverwriteFiles validateAddedFilesMatchOverwriteFilter(); + + /** + * Enables validation that files added concurrently do not conflict with this commit's operation. + * + * This method should be called when the table is queried to determine which files to delete/append. + * If a concurrent operation commits a new file after the data was read and that file might + * contain rows matching the specified conflict detection filter, the overwrite operation + * will detect this during retries and fail. + * + * Calling this method with a correct conflict detection filter is required to maintain Review comment: I agree with your take on it, though I'm not sure that I agree that compaction isn't serializable. That operation takes effect when it is committed and can be relocated in the order of commits without changing the work it did. But once it is committed, it is present in the order of operations and that won't change. Anyway, I think that isolation level isn't a useful table config. Some individual operations will probably want to relax it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315461591 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -24,18 +24,17 @@ import org.apache.iceberg.expressions.Projections; /** - * API for overwriting files in a table by filter expression. + * API for overwriting files in a table. * * This API accumulates file additions and produces a new {@link Snapshot} of the table by replacing - * all the files that match the filter expression with the set of additions. This operation is used - * to implement idempotent writes that always replace a section of a table with new data. + * all the deleted files with the set of additions. This operation is used to implement idempotent + * writes that always replace a section of a table with new data or update/delete operations that + * eagerly overwrite files. Review comment: Looks good. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r314385635 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -88,6 +121,40 @@ public OverwriteFiles validateAddedFiles() { } } +if (conflictingAppendsRowFilter != null) { + PartitionSpec spec = writeSpec(); + Expression inclusiveExpr = Projections.inclusive(spec).project(conflictingAppendsRowFilter); + Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr); + + List newFiles = collectNewFiles(base); + for (DataFile newFile : newFiles) { +// we do partition-level conflict resolution right now +// we can enhance it by leveraging column stats and MetricsEvaluator +ValidationException.check( +!inclusive.eval(newFile.partition()), +"A conflicting file was appended that matches filter '%s': %s", +conflictingAppendsRowFilter, newFile.path()); + } +} + return super.apply(base); } + + private List collectNewFiles(TableMetadata meta) { +List newFiles = new ArrayList<>(); + +Long currentSnapshotId = meta.currentSnapshot() == null ? null : meta.currentSnapshot().snapshotId(); +while (currentSnapshotId != null && !currentSnapshotId.equals(readSnapshotId)) { + Snapshot currentSnapshot = meta.snapshot(currentSnapshotId); + + if (currentSnapshot == null) { +throw new ValidationException("Cannot find snapshot %d. Was it expired?", currentSnapshotId); Review comment: I agree that this would work correctly, but "Cannot find snapshot null" isn't a very helpful error message. I think it makes sense to catch that case earlier and have a better error. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313647841 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -88,6 +121,40 @@ public OverwriteFiles validateAddedFiles() { } } +if (conflictingAppendsRowFilter != null) { + PartitionSpec spec = writeSpec(); + Expression inclusiveExpr = Projections.inclusive(spec).project(conflictingAppendsRowFilter); + Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr); + + List newFiles = collectNewFiles(base); + for (DataFile newFile : newFiles) { +// we do partition-level conflict resolution right now +// we can enhance it by leveraging column stats and MetricsEvaluator +ValidationException.check( +!inclusive.eval(newFile.partition()), +"A conflicting file was appended that matches filter '%s': %s", +conflictingAppendsRowFilter, newFile.path()); + } +} + return super.apply(base); } + + private List collectNewFiles(TableMetadata meta) { +List newFiles = new ArrayList<>(); + +Long currentSnapshotId = meta.currentSnapshot() == null ? null : meta.currentSnapshot().snapshotId(); +while (currentSnapshotId != null && !currentSnapshotId.equals(readSnapshotId)) { + Snapshot currentSnapshot = meta.snapshot(currentSnapshotId); + + if (currentSnapshot == null) { +throw new ValidationException("Cannot find snapshot %d. Was it expired?", currentSnapshotId); Review comment: If `meta.currentSnapshot` is null, this will print "Cannot find snapshot null". I think it would be better to start the method with a precondition that checks that meta has a current snapshot and fails with a better error message if it doesn't. Also, there are a few reasons why the snapshot can't be found. The base snapshot may have been expired, but it also could have been rolled back. I think this should just state "Cannot determine history between base snapshot %s and current %s". This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313646140 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -27,10 +30,15 @@ import org.apache.iceberg.expressions.StrictMetricsEvaluator; public class OverwriteData extends MergingSnapshotProducer implements OverwriteFiles { - private boolean validateAddedFiles = false; + private boolean overwriteByRowFilter; + private boolean validateAddedFilesMatchRowFilter; + private boolean deleteFilesByPath; + private Long readSnapshotId; + private Expression conflictingAppendsRowFilter; protected OverwriteData(TableOperations ops) { super(ops); +failMissingDeletePaths(); Review comment: Is this always required? It seems like it might only be needed when validating conflicting operations. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313645766 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -74,5 +90,22 @@ * * @return this for method chaining */ - OverwriteFiles validateAddedFiles(); + OverwriteFiles validateAddedFilesMatchRowFilter(); + + /** + * Enables validation of files that are added concurrently. Review comment: I'd like to add something about the type of validation. Does something like this make sense? > Enables validation that files added concurrently do not conflict with this commit's operation. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313645460 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -74,5 +90,22 @@ * * @return this for method chaining */ - OverwriteFiles validateAddedFiles(); + OverwriteFiles validateAddedFilesMatchRowFilter(); + + /** + * Enables validation of files that are added concurrently. + * + * This method should be called when the table is queried to determine files to delete/append. + * If a concurrent operation commits a new file after the data was read and that file might + * contain rows matching the specified row filter, the overwrite operation will detect + * this during retries and fail. + * + * Calling this method with a correct row filter is required to maintain serializable isolation. + * Otherwise, the isolation level will be snapshot isolation. + * + * @param readSnapshotId the snapshot id that was used to read the data + * @param rowFilter an expression on rows in the table + * @return this for method chaining + */ + OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression rowFilter); Review comment: Can we rename `rowFilter` to `conflictDetectionFilter`? Why use `Long` instead of `long`? That would allow passing null. Is that intentional? If null, then use the base snapshot ID? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313644916 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -74,5 +90,22 @@ * * @return this for method chaining */ - OverwriteFiles validateAddedFiles(); + OverwriteFiles validateAddedFilesMatchRowFilter(); Review comment: Since a row filter is passed to `validateNoConflictingAppends`, I think this should be `validateAddedFilesMatchOverwriteFilter` instead. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313643617 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -65,6 +70,17 @@ */ OverwriteFiles addFile(DataFile file); + /** + * Delete a {@link DataFile} from the table. + * + * Note that either this method or {@link OverwriteFiles#overwriteByRowFilter(Expression)} + * should be used to mark deleted files, not both. Review comment: Like my comment above, I don't think this paragraph is needed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313643529 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -50,6 +49,12 @@ * * Files that may contain some rows that match the expression and some rows that do not will * result in a {@link ValidationException}. + * + * This has no requirements for the latest snapshot and will not fail based on other snapshot + * changes. + * + * Note that either this method or {@link OverwriteFiles#deleteFile(DataFile)} + * should be used to mark deleted files, not both. Review comment: This update isn't quite true. There are methods to determine which files to delete, and there are separate ways to configure validations. When this operation is used to produce an idempotent overwrite, where it can be completed regardless of what files are added after it starts, it doesn't matter that the files were matched by a filter or directly by `deleteFile`. For example, I might want to overwrite an old file with new values for a column that was added. That overwrite is idempotent, but might use `deleteFile`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org