[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r317800614
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() {
   }
 }
 
+if (conflictDetectionFilter != null) {
+  PartitionSpec spec = writeSpec();
+  Expression inclusiveExpr = 
Projections.inclusive(spec).project(conflictDetectionFilter);
+  Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr);
+
+  InclusiveMetricsEvaluator metrics = new 
InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter);
+
+  List newFiles = collectNewFiles(base);
+  for (DataFile newFile : newFiles) {
+ValidationException.check(
+!inclusive.eval(newFile.partition()) || !metrics.eval(newFile),
+"A conflicting file was appended that matches filter '%s': %s",
+conflictDetectionFilter, newFile.path());
+  }
+}
+
 return super.apply(base);
   }
+
+  private List collectNewFiles(TableMetadata meta) {
+List newFiles = new ArrayList<>();
+
+Long currentSnapshotId = meta.currentSnapshot() == null ? null : 
meta.currentSnapshot().snapshotId();
+while (currentSnapshotId != null && 
!currentSnapshotId.equals(readSnapshotId)) {
 
 Review comment:
   Okay, it should be fine if readSnapshotId is always explicitly set.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r317799344
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() {
   }
 }
 
+if (conflictDetectionFilter != null) {
+  PartitionSpec spec = writeSpec();
+  Expression inclusiveExpr = 
Projections.inclusive(spec).project(conflictDetectionFilter);
+  Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr);
+
+  InclusiveMetricsEvaluator metrics = new 
InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter);
+
+  List newFiles = collectNewFiles(base);
+  for (DataFile newFile : newFiles) {
+ValidationException.check(
+!inclusive.eval(newFile.partition()) || !metrics.eval(newFile),
+"A conflicting file was appended that matches filter '%s': %s",
+conflictDetectionFilter, newFile.path());
+  }
+}
+
 return super.apply(base);
   }
+
+  private List collectNewFiles(TableMetadata meta) {
+List newFiles = new ArrayList<>();
+
+Long currentSnapshotId = meta.currentSnapshot() == null ? null : 
meta.currentSnapshot().snapshotId();
 
 Review comment:
   Right, that works.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315464699
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -56,14 +67,34 @@ public OverwriteFiles addFile(DataFile file) {
   }
 
   @Override
-  public OverwriteFiles validateAddedFiles() {
-this.validateAddedFiles = true;
+  public OverwriteFiles deleteFile(DataFile file) {
+Preconditions.checkState(!overwriteByRowFilter,
+"Either overwriteByRowFilter or deleteFile should be used, not both");
 
 Review comment:
   I don't think these are mutually exclusive. I could still create an 
idempotent overwrite that deletes files directly.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315464625
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -56,14 +67,34 @@ public OverwriteFiles addFile(DataFile file) {
   }
 
   @Override
-  public OverwriteFiles validateAddedFiles() {
-this.validateAddedFiles = true;
+  public OverwriteFiles deleteFile(DataFile file) {
+Preconditions.checkState(!overwriteByRowFilter,
+"Either overwriteByRowFilter or deleteFile should be used, not both");
+this.deleteFilesByPath = true;
+delete(file.path());
+return this;
+  }
+
+  @Override
+  public OverwriteFiles validateAddedFilesMatchOverwriteFilter() {
+Preconditions.checkState(!deleteFilesByPath,
+"Added files can be validated only if overwriteByRowFilter is used");
 
 Review comment:
   I don't think these are mutually exclusive. I could still delete by filter 
and want to validate no files matching a filter (probably the same one) have 
been added.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315464103
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() {
   }
 }
 
+if (conflictDetectionFilter != null) {
+  PartitionSpec spec = writeSpec();
+  Expression inclusiveExpr = 
Projections.inclusive(spec).project(conflictDetectionFilter);
+  Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr);
+
+  InclusiveMetricsEvaluator metrics = new 
InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter);
+
+  List newFiles = collectNewFiles(base);
+  for (DataFile newFile : newFiles) {
+ValidationException.check(
+!inclusive.eval(newFile.partition()) || !metrics.eval(newFile),
+"A conflicting file was appended that matches filter '%s': %s",
+conflictDetectionFilter, newFile.path());
+  }
+}
+
 return super.apply(base);
   }
+
+  private List collectNewFiles(TableMetadata meta) {
+List newFiles = new ArrayList<>();
+
+Long currentSnapshotId = meta.currentSnapshot() == null ? null : 
meta.currentSnapshot().snapshotId();
+while (currentSnapshotId != null && 
!currentSnapshotId.equals(readSnapshotId)) {
 
 Review comment:
   If `readSnapshotId` is null, then this use the _entire_ table history? Or am 
I missing where `readSnapshotId` is defaulted to the current snapshot ID when 
the operation starts?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315464206
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -19,15 +19,23 @@
 
 package org.apache.iceberg;
 
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Iterables;
+import java.util.ArrayList;
 import java.util.List;
 import org.apache.iceberg.exceptions.ValidationException;
 import org.apache.iceberg.expressions.Evaluator;
 import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.InclusiveMetricsEvaluator;
 import org.apache.iceberg.expressions.Projections;
 import org.apache.iceberg.expressions.StrictMetricsEvaluator;
 
 public class OverwriteData extends MergingSnapshotProducer 
implements OverwriteFiles {
-  private boolean validateAddedFiles = false;
+  private boolean overwriteByRowFilter;
+  private boolean validateAddedFilesMatchOverwriteFilter;
+  private boolean deleteFilesByPath;
+  private Long readSnapshotId;
+  private Expression conflictDetectionFilter;
 
 Review comment:
   I'd prefer to add explicit defaults for readability. Is there a reason to 
remove them?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315463778
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
 ##
 @@ -74,5 +85,23 @@
*
* @return this for method chaining
*/
-  OverwriteFiles validateAddedFiles();
+  OverwriteFiles validateAddedFilesMatchOverwriteFilter();
+
+  /**
+   * Enables validation that files added concurrently do not conflict with 
this commit's operation.
+   * 
+   * This method should be called when the table is queried to determine which 
files to delete/append.
+   * If a concurrent operation commits a new file after the data was read and 
that file might
+   * contain rows matching the specified conflict detection filter, the 
overwrite operation
+   * will detect this during retries and fail.
+   * 
+   * Calling this method with a correct conflict detection filter is required 
to maintain
+   * serializable isolation for eager update/delete operations. Otherwise, the 
isolation level
+   * will be snapshot isolation.
+   *
+   * @param readSnapshotId the snapshot id that was used to read the data
 
 Review comment:
   This should document what happens when `readSnapshotId` is null since that's 
supported.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315463525
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() {
   }
 }
 
+if (conflictDetectionFilter != null) {
+  PartitionSpec spec = writeSpec();
+  Expression inclusiveExpr = 
Projections.inclusive(spec).project(conflictDetectionFilter);
+  Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr);
+
+  InclusiveMetricsEvaluator metrics = new 
InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter);
+
+  List newFiles = collectNewFiles(base);
+  for (DataFile newFile : newFiles) {
+ValidationException.check(
+!inclusive.eval(newFile.partition()) || !metrics.eval(newFile),
+"A conflicting file was appended that matches filter '%s': %s",
 
 Review comment:
   Minor: This isn't quite correct. This doesn't guarantee that a file matches 
the filter, just that it may contain data that matches the filter. For example, 
if a table is bucketed by `id` and a file has ids `x` and `z` where `x < y` and 
`y < z` and `x`, `y`, and `z` all hash to the same bucket, then `id = y` would 
match the file even though `y` isn't actually in it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315463525
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() {
   }
 }
 
+if (conflictDetectionFilter != null) {
+  PartitionSpec spec = writeSpec();
+  Expression inclusiveExpr = 
Projections.inclusive(spec).project(conflictDetectionFilter);
+  Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr);
+
+  InclusiveMetricsEvaluator metrics = new 
InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter);
+
+  List newFiles = collectNewFiles(base);
+  for (DataFile newFile : newFiles) {
+ValidationException.check(
+!inclusive.eval(newFile.partition()) || !metrics.eval(newFile),
+"A conflicting file was appended that matches filter '%s': %s",
 
 Review comment:
   Minor: This isn't quite correct. This doesn't guarantee that a file matches 
the filter, just that it may contain data that matches the filter. For example, 
if a table is bucketed by `id` and a file has ids `x` and `z` where `x < y` and 
`y < z` and `x`, `y`, and `z` all hash to the same bucket, then this would 
match the file even though `y` isn't actually in it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315462678
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() {
   }
 }
 
+if (conflictDetectionFilter != null) {
+  PartitionSpec spec = writeSpec();
+  Expression inclusiveExpr = 
Projections.inclusive(spec).project(conflictDetectionFilter);
+  Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr);
+
+  InclusiveMetricsEvaluator metrics = new 
InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter);
+
+  List newFiles = collectNewFiles(base);
+  for (DataFile newFile : newFiles) {
+ValidationException.check(
+!inclusive.eval(newFile.partition()) || !metrics.eval(newFile),
+"A conflicting file was appended that matches filter '%s': %s",
+conflictDetectionFilter, newFile.path());
+  }
+}
+
 return super.apply(base);
   }
+
+  private List collectNewFiles(TableMetadata meta) {
+List newFiles = new ArrayList<>();
+
+Long currentSnapshotId = meta.currentSnapshot() == null ? null : 
meta.currentSnapshot().snapshotId();
 
 Review comment:
   I think that if the current snapshot is null, the loop is skipped and this 
returns an empty list. That's not correct, right?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315462125
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
 ##
 @@ -74,5 +85,23 @@
*
* @return this for method chaining
*/
-  OverwriteFiles validateAddedFiles();
+  OverwriteFiles validateAddedFilesMatchOverwriteFilter();
+
+  /**
+   * Enables validation that files added concurrently do not conflict with 
this commit's operation.
+   * 
+   * This method should be called when the table is queried to determine which 
files to delete/append.
+   * If a concurrent operation commits a new file after the data was read and 
that file might
+   * contain rows matching the specified conflict detection filter, the 
overwrite operation
+   * will detect this during retries and fail.
+   * 
+   * Calling this method with a correct conflict detection filter is required 
to maintain
 
 Review comment:
   I agree with your take on it, though I'm not sure that I agree that 
compaction isn't serializable. That operation takes effect when it is committed 
and can be relocated in the order of commits without changing the work it did. 
But once it is committed, it is present in the order of operations and that 
won't change.
   
   Anyway, I think that isolation level isn't a useful table config. Some 
individual operations will probably want to relax it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315461591
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
 ##
 @@ -24,18 +24,17 @@
 import org.apache.iceberg.expressions.Projections;
 
 /**
- * API for overwriting files in a table by filter expression.
+ * API for overwriting files in a table.
  * 
  * This API accumulates file additions and produces a new {@link Snapshot} of 
the table by replacing
- * all the files that match the filter expression with the set of additions. 
This operation is used
- * to implement idempotent writes that always replace a section of a table 
with new data.
+ * all the deleted files with the set of additions. This operation is used to 
implement idempotent
+ * writes that always replace a section of a table with new data or 
update/delete operations that
+ * eagerly overwrite files.
 
 Review comment:
   Looks good.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-15 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r314385635
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -88,6 +121,40 @@ public OverwriteFiles validateAddedFiles() {
   }
 }
 
+if (conflictingAppendsRowFilter != null) {
+  PartitionSpec spec = writeSpec();
+  Expression inclusiveExpr = 
Projections.inclusive(spec).project(conflictingAppendsRowFilter);
+  Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr);
+
+  List newFiles = collectNewFiles(base);
+  for (DataFile newFile : newFiles) {
+// we do partition-level conflict resolution right now
+// we can enhance it by leveraging column stats and MetricsEvaluator
+ValidationException.check(
+!inclusive.eval(newFile.partition()),
+"A conflicting file was appended that matches filter '%s': %s",
+conflictingAppendsRowFilter, newFile.path());
+  }
+}
+
 return super.apply(base);
   }
+
+  private List collectNewFiles(TableMetadata meta) {
+List newFiles = new ArrayList<>();
+
+Long currentSnapshotId = meta.currentSnapshot() == null ? null : 
meta.currentSnapshot().snapshotId();
+while (currentSnapshotId != null && 
!currentSnapshotId.equals(readSnapshotId)) {
+  Snapshot currentSnapshot = meta.snapshot(currentSnapshotId);
+
+  if (currentSnapshot == null) {
+throw new ValidationException("Cannot find snapshot %d. Was it 
expired?", currentSnapshotId);
 
 Review comment:
   I agree that this would work correctly, but "Cannot find snapshot null" 
isn't a very helpful error message. I think it makes sense to catch that case 
earlier and have a better error.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313647841
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -88,6 +121,40 @@ public OverwriteFiles validateAddedFiles() {
   }
 }
 
+if (conflictingAppendsRowFilter != null) {
+  PartitionSpec spec = writeSpec();
+  Expression inclusiveExpr = 
Projections.inclusive(spec).project(conflictingAppendsRowFilter);
+  Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr);
+
+  List newFiles = collectNewFiles(base);
+  for (DataFile newFile : newFiles) {
+// we do partition-level conflict resolution right now
+// we can enhance it by leveraging column stats and MetricsEvaluator
+ValidationException.check(
+!inclusive.eval(newFile.partition()),
+"A conflicting file was appended that matches filter '%s': %s",
+conflictingAppendsRowFilter, newFile.path());
+  }
+}
+
 return super.apply(base);
   }
+
+  private List collectNewFiles(TableMetadata meta) {
+List newFiles = new ArrayList<>();
+
+Long currentSnapshotId = meta.currentSnapshot() == null ? null : 
meta.currentSnapshot().snapshotId();
+while (currentSnapshotId != null && 
!currentSnapshotId.equals(readSnapshotId)) {
+  Snapshot currentSnapshot = meta.snapshot(currentSnapshotId);
+
+  if (currentSnapshot == null) {
+throw new ValidationException("Cannot find snapshot %d. Was it 
expired?", currentSnapshotId);
 
 Review comment:
   If `meta.currentSnapshot` is null, this will print "Cannot find snapshot 
null". I think it would be better to start the method with a precondition that 
checks that meta has a current snapshot and fails with a better error message 
if it doesn't.
   
   Also, there are a few reasons why the snapshot can't be found. The base 
snapshot may have been expired, but it also could have been rolled back. I 
think this should just state "Cannot determine history between base snapshot %s 
and current %s".


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313646140
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -27,10 +30,15 @@
 import org.apache.iceberg.expressions.StrictMetricsEvaluator;
 
 public class OverwriteData extends MergingSnapshotProducer 
implements OverwriteFiles {
-  private boolean validateAddedFiles = false;
+  private boolean overwriteByRowFilter;
+  private boolean validateAddedFilesMatchRowFilter;
+  private boolean deleteFilesByPath;
+  private Long readSnapshotId;
+  private Expression conflictingAppendsRowFilter;
 
   protected OverwriteData(TableOperations ops) {
 super(ops);
+failMissingDeletePaths();
 
 Review comment:
   Is this always required? It seems like it might only be needed when 
validating conflicting operations.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313645766
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
 ##
 @@ -74,5 +90,22 @@
*
* @return this for method chaining
*/
-  OverwriteFiles validateAddedFiles();
+  OverwriteFiles validateAddedFilesMatchRowFilter();
+
+  /**
+   * Enables validation of files that are added concurrently.
 
 Review comment:
   I'd like to add something about the type of validation. Does something like 
this make sense?
   
   > Enables validation that files added concurrently do not conflict with this 
commit's operation.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313645460
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
 ##
 @@ -74,5 +90,22 @@
*
* @return this for method chaining
*/
-  OverwriteFiles validateAddedFiles();
+  OverwriteFiles validateAddedFilesMatchRowFilter();
+
+  /**
+   * Enables validation of files that are added concurrently.
+   * 
+   * This method should be called when the table is queried to determine files 
to delete/append.
+   * If a concurrent operation commits a new file after the data was read and 
that file might
+   * contain rows matching the specified row filter, the overwrite operation 
will detect
+   * this during retries and fail.
+   * 
+   * Calling this method with a correct row filter is required to maintain 
serializable isolation.
+   * Otherwise, the isolation level will be snapshot isolation.
+   *
+   * @param readSnapshotId the snapshot id that was used to read the data
+   * @param rowFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression 
rowFilter);
 
 Review comment:
   Can we rename `rowFilter` to `conflictDetectionFilter`?
   
   Why use `Long` instead of `long`? That would allow passing null. Is that 
intentional? If null, then use the base snapshot ID?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313644916
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
 ##
 @@ -74,5 +90,22 @@
*
* @return this for method chaining
*/
-  OverwriteFiles validateAddedFiles();
+  OverwriteFiles validateAddedFilesMatchRowFilter();
 
 Review comment:
   Since a row filter is passed to `validateNoConflictingAppends`, I think this 
should be `validateAddedFilesMatchOverwriteFilter` instead.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313643617
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
 ##
 @@ -65,6 +70,17 @@
*/
   OverwriteFiles addFile(DataFile file);
 
+  /**
+   * Delete a {@link DataFile} from the table.
+   * 
+   * Note that either this method or {@link 
OverwriteFiles#overwriteByRowFilter(Expression)}
+   * should be used to mark deleted files, not both.
 
 Review comment:
   Like my comment above, I don't think this paragraph is needed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313643529
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
 ##
 @@ -50,6 +49,12 @@
* 
* Files that may contain some rows that match the expression and some rows 
that do not will
* result in a {@link ValidationException}.
+   * 
+   * This has no requirements for the latest snapshot and will not fail based 
on other snapshot
+   * changes.
+   * 
+   * Note that either this method or {@link 
OverwriteFiles#deleteFile(DataFile)}
+   * should be used to mark deleted files, not both.
 
 Review comment:
   This update isn't quite true. There are methods to determine which files to 
delete, and there are separate ways to configure validations.
   
   When this operation is used to produce an idempotent overwrite, where it can 
be completed regardless of what files are added after it starts, it doesn't 
matter that the files were matched by a filter or directly by `deleteFile`. For 
example, I might want to overwrite an old file with new values for a column 
that was added. That overwrite is idempotent, but might use `deleteFile`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org