[GitHub] [incubator-iceberg] jerryshao commented on a change in pull request #395: Remove Hadoop Configuration dependency in BaseMetastoreTableOperations

2019-08-19 Thread GitBox
jerryshao commented on a change in pull request #395: Remove Hadoop 
Configuration dependency in BaseMetastoreTableOperations
URL: https://github.com/apache/incubator-iceberg/pull/395#discussion_r315498411
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java
 ##
 @@ -43,17 +41,15 @@
   private static final String METADATA_FOLDER_NAME = "metadata";
   private static final String DATA_FOLDER_NAME = "data";
 
-  private final Configuration conf;
   private final FileIO fileIo;
 
   private TableMetadata currentMetadata = null;
   private String currentMetadataLocation = null;
   private boolean shouldRefresh = true;
   private int version = -1;
 
-  protected BaseMetastoreTableOperations(Configuration conf) {
-    this.conf = conf;
-    this.fileIo = new HadoopFileIO(conf);
+  protected BaseMetastoreTableOperations(FileIO fileIo) {
 
 Review comment:
   @aokolnychyi please help to review again, thanks!
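   
   For context, after the change in the diff above a subclass supplies its own `FileIO` instead of having the base class build a `HadoopFileIO` from a Hadoop `Configuration`. A minimal sketch of the new call pattern (the subclass name is hypothetical; `HadoopFileIO(conf)` is taken from the removed lines):
   
       import org.apache.hadoop.conf.Configuration;
       import org.apache.iceberg.BaseMetastoreTableOperations;
       import org.apache.iceberg.hadoop.HadoopFileIO;
       
       // hypothetical subclass, shown only to illustrate the constructor change
       abstract class ExampleTableOperations extends BaseMetastoreTableOperations {
         ExampleTableOperations(Configuration conf) {
           // the base class no longer touches Hadoop; the subclass chooses the FileIO
           super(new HadoopFileIO(conf));
         }
       }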





[GitHub] [incubator-iceberg] jerryshao commented on a change in pull request #395: Remove Hadoop Configuration dependency in BaseMetastoreTableOperations

2019-08-19 Thread GitBox
jerryshao commented on a change in pull request #395: Remove Hadoop 
Configuration dependency in BaseMetastoreTableOperations
URL: https://github.com/apache/incubator-iceberg/pull/395#discussion_r315483176
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java
 ##
 @@ -43,17 +41,15 @@
   private static final String METADATA_FOLDER_NAME = "metadata";
   private static final String DATA_FOLDER_NAME = "data";
 
-  private final Configuration conf;
   private final FileIO fileIo;
 
   private TableMetadata currentMetadata = null;
   private String currentMetadataLocation = null;
   private boolean shouldRefresh = true;
   private int version = -1;
 
-  protected BaseMetastoreTableOperations(Configuration conf) {
-    this.conf = conf;
-    this.fileIo = new HadoopFileIO(conf);
+  protected BaseMetastoreTableOperations(FileIO fileIo) {
 
 Review comment:
   OK, let me update the code.





[GitHub] [incubator-iceberg] rdblue closed issue #269: Fix BaseTableScan to include column_sizes in stats columns

2019-08-19 Thread GitBox
rdblue closed issue #269: Fix BaseTableScan to include column_sizes in stats 
columns
URL: https://github.com/apache/incubator-iceberg/issues/269
 
 
   





[GitHub] [incubator-iceberg] rdblue merged pull request #393: include column_sizes in stats columns

2019-08-19 Thread GitBox
rdblue merged pull request #393: include column_sizes in stats columns
URL: https://github.com/apache/incubator-iceberg/pull/393
 
 
   





[GitHub] [incubator-iceberg] rdblue commented on issue #393: include column_sizes in stats columns

2019-08-19 Thread GitBox
rdblue commented on issue #393: include column_sizes in stats columns
URL: https://github.com/apache/incubator-iceberg/pull/393#issuecomment-522805147
 
 
   Looks great, thanks for fixing this @manishmalhotrawork!





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #362: Support create and replace transactions in Catalog

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #362: Support create and replace 
transactions in Catalog
URL: https://github.com/apache/incubator-iceberg/pull/362#discussion_r315465619
 
 

 ##
 File path: hive/src/test/java/org/apache/iceberg/hive/HiveCreateReplaceTableTest.java
 ##
 @@ -0,0 +1,211 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg.hive;
+
+import com.google.common.collect.Maps;
+import java.io.IOException;
+import java.util.HashMap;
+import org.apache.hadoop.hive.conf.HiveConf;
+import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
+import org.apache.hadoop.hive.metastore.api.Database;
+import org.apache.iceberg.PartitionSpec;
+import org.apache.iceberg.Schema;
+import org.apache.iceberg.Table;
+import org.apache.iceberg.Transaction;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.exceptions.NoSuchTableException;
+import org.apache.iceberg.types.Types;
+import org.junit.After;
+import org.junit.AfterClass;
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.rules.ExpectedException;
+import org.junit.rules.TemporaryFolder;
+
+import static org.apache.iceberg.PartitionSpec.builderFor;
+import static org.apache.iceberg.types.Types.NestedField.required;
+
+public class HiveCreateReplaceTableTest {
+
+  private static final String DB_NAME = "hivedb";
+  private static final String TABLE_NAME = "tbl";
+  private static final TableIdentifier TABLE_IDENTIFIER = TableIdentifier.of(DB_NAME, TABLE_NAME);
+  private static final Schema SCHEMA = new Schema(
+      required(3, "id", Types.IntegerType.get()),
+      required(4, "data", Types.StringType.get())
+  );
+  private static final PartitionSpec SPEC = builderFor(SCHEMA)
+      .identity("id")
+      .build();
+
+  private static TestHiveMetastore metastore;
+  private static HiveMetaStoreClient metastoreClient;
+  private static HiveConf hiveConf;
+  private static HiveCatalog catalog;
+
+  @Rule
+  public ExpectedException exceptionRule = ExpectedException.none();
+  @Rule
+  public TemporaryFolder temp = new TemporaryFolder();
+
+  private String tableLocation;
+
+  @BeforeClass
+  public static void startMetastore() throws Exception {
+    HiveCreateReplaceTableTest.metastore = new TestHiveMetastore();
+    metastore.start();
+    HiveCreateReplaceTableTest.hiveConf = metastore.hiveConf();
+    HiveCreateReplaceTableTest.metastoreClient = new HiveMetaStoreClient(hiveConf);
+    String dbPath = metastore.getDatabasePath(DB_NAME);
+    Database db = new Database(DB_NAME, "description", dbPath, new HashMap<>());
+    metastoreClient.createDatabase(db);
+    HiveCreateReplaceTableTest.catalog = new HiveCatalog(hiveConf);
+  }
+
+  @AfterClass
+  public static void stopMetastore() {
+    catalog.close();
+    HiveCreateReplaceTableTest.catalog = null;
+
+    metastoreClient.close();
+    HiveCreateReplaceTableTest.metastoreClient = null;
+
+    metastore.stop();
+    HiveCreateReplaceTableTest.metastore = null;
+  }
+
+  @Before
+  public void createTableLocation() throws IOException {
+    tableLocation = temp.newFolder("hive-").getPath();
+  }
+
+  @After
+  public void cleanup() {
+    catalog.dropTable(TABLE_IDENTIFIER);
+  }
+
+  @Test
+  public void testCreateTableTxn() {
+    Assert.assertFalse("Table should not exist", catalog.tableExists(TABLE_IDENTIFIER));
+
+    Transaction txn = catalog.newCreateTableTransaction(
+        TABLE_IDENTIFIER, SCHEMA, SPEC, tableLocation, Maps.newHashMap());
+    txn.updateProperties()
+        .set("prop", "value")
+        .commit();
+
+    // verify the table is still not visible before the transaction is committed
+    Assert.assertFalse(catalog.tableExists(TABLE_IDENTIFIER));
+
+    txn.commitTransaction();
+
+    Table table = catalog.loadTable(TABLE_IDENTIFIER);
+    Assert.assertEquals("Table props should match", "value", table.properties().get("prop"));
+  }
+
+  @Test
+  public void testCreateTableTxnTableCreatedConcurrently() {
+    exceptionRule.expect(RuntimeException.class);
+

[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315464699
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -56,14 +67,34 @@ public OverwriteFiles addFile(DataFile file) {
   }
 
   @Override
-  public OverwriteFiles validateAddedFiles() {
-    this.validateAddedFiles = true;
+  public OverwriteFiles deleteFile(DataFile file) {
+    Preconditions.checkState(!overwriteByRowFilter,
+        "Either overwriteByRowFilter or deleteFile should be used, not both");
 
 Review comment:
   I don't think these are mutually exclusive. I could still create an 
idempotent overwrite that deletes files directly.
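   
   To make that concrete, a hedged sketch of the combination described above: an idempotent overwrite that uses a row filter and also deletes specific files directly. Method names follow this PR's diff (`overwriteByRowFilter`, `deleteFile`, `addFile`); the combination itself is the reviewer's suggestion, not code from the PR:
   
       import org.apache.iceberg.DataFile;
       import org.apache.iceberg.Table;
       import org.apache.iceberg.expressions.Expression;
       
       class IdempotentOverwriteSketch {
         // staleFile and rewrittenFile are assumed to be discovered by the caller
         static void overwrite(Table table, Expression rowFilter, DataFile staleFile, DataFile rewrittenFile) {
           table.newOverwrite()
               .overwriteByRowFilter(rowFilter) // replace everything matching the filter
               .deleteFile(staleFile)           // and also drop one known file directly
               .addFile(rewrittenFile)          // adding its rewritten replacement
               .commit();
         }
       }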





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315464625
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -56,14 +67,34 @@ public OverwriteFiles addFile(DataFile file) {
   }
 
   @Override
-  public OverwriteFiles validateAddedFiles() {
-    this.validateAddedFiles = true;
+  public OverwriteFiles deleteFile(DataFile file) {
+    Preconditions.checkState(!overwriteByRowFilter,
+        "Either overwriteByRowFilter or deleteFile should be used, not both");
+    this.deleteFilesByPath = true;
+    delete(file.path());
+    return this;
+  }
+
+  @Override
+  public OverwriteFiles validateAddedFilesMatchOverwriteFilter() {
+    Preconditions.checkState(!deleteFilesByPath,
+        "Added files can be validated only if overwriteByRowFilter is used");
 
 Review comment:
   I don't think these are mutually exclusive. I could still delete by filter and want to validate that no files matching a filter (probably the same one) have been added.





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315464103
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() {
   }
 }
 
+    if (conflictDetectionFilter != null) {
+      PartitionSpec spec = writeSpec();
+      Expression inclusiveExpr = Projections.inclusive(spec).project(conflictDetectionFilter);
+      Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr);
+
+      InclusiveMetricsEvaluator metrics = new InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter);
+
+      List<DataFile> newFiles = collectNewFiles(base);
+      for (DataFile newFile : newFiles) {
+        ValidationException.check(
+            !inclusive.eval(newFile.partition()) || !metrics.eval(newFile),
+            "A conflicting file was appended that matches filter '%s': %s",
+            conflictDetectionFilter, newFile.path());
+      }
+    }
+
 return super.apply(base);
   }
+
+  private List<DataFile> collectNewFiles(TableMetadata meta) {
+    List<DataFile> newFiles = new ArrayList<>();
+
+    Long currentSnapshotId = meta.currentSnapshot() == null ? null : meta.currentSnapshot().snapshotId();
+    while (currentSnapshotId != null && !currentSnapshotId.equals(readSnapshotId)) {
 
 Review comment:
   If `readSnapshotId` is null, then this uses the _entire_ table history? Or am I missing where `readSnapshotId` is defaulted to the current snapshot ID when the operation starts?
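   
   One hedged way to address this (a sketch, not the actual patch; the helper is hypothetical and `base` stands for the table metadata at the start of the operation):
   
       // fall back to the current snapshot when the caller passes null, so a
       // null readSnapshotId never forces a walk over the entire table history
       private Long effectiveReadSnapshotId(TableMetadata base) {
         if (readSnapshotId != null) {
           return readSnapshotId;
         }
         return base.currentSnapshot() == null ? null : base.currentSnapshot().snapshotId();
       }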





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315464206
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -19,15 +19,23 @@
 
 package org.apache.iceberg;
 
+import com.google.common.base.Preconditions;
+import com.google.common.collect.Iterables;
+import java.util.ArrayList;
 import java.util.List;
 import org.apache.iceberg.exceptions.ValidationException;
 import org.apache.iceberg.expressions.Evaluator;
 import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.InclusiveMetricsEvaluator;
 import org.apache.iceberg.expressions.Projections;
 import org.apache.iceberg.expressions.StrictMetricsEvaluator;
 
 public class OverwriteData extends MergingSnapshotProducer<OverwriteFiles> implements OverwriteFiles {
-  private boolean validateAddedFiles = false;
+  private boolean overwriteByRowFilter;
+  private boolean validateAddedFilesMatchOverwriteFilter;
+  private boolean deleteFilesByPath;
+  private Long readSnapshotId;
+  private Expression conflictDetectionFilter;
 
 Review comment:
   I'd prefer to add explicit defaults for readability. Is there a reason to 
remove them?
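   
   For reference, the explicit-default version asked for above would read (semantically identical under Java's default field values; this is purely a readability change):
   
       private boolean overwriteByRowFilter = false;
       private boolean validateAddedFilesMatchOverwriteFilter = false;
       private boolean deleteFilesByPath = false;
       private Long readSnapshotId = null;
       private Expression conflictDetectionFilter = null;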





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315463778
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
 ##
 @@ -74,5 +85,23 @@
*
* @return this for method chaining
*/
-  OverwriteFiles validateAddedFiles();
+  OverwriteFiles validateAddedFilesMatchOverwriteFilter();
+
+  /**
+   * Enables validation that files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called when the table is queried to determine which files to delete/append.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
+   * serializable isolation for eager update/delete operations. Otherwise, the isolation level
+   * will be snapshot isolation.
+   *
+   * @param readSnapshotId the snapshot id that was used to read the data
 
 Review comment:
   This should document what happens when `readSnapshotId` is null since that's 
supported.
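   
   A possible wording for that doc, assuming a null `readSnapshotId` is meant to validate against the table's entire current history (which is what the `collectNewFiles` loop does when the id never matches):
   
       + * @param readSnapshotId the snapshot id that was used to read the data, or null;
       + *                       when null, every file in the table's current history is
       + *                       checked against the conflict detection filter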





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315463525
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() {
   }
 }
 
+    if (conflictDetectionFilter != null) {
+      PartitionSpec spec = writeSpec();
+      Expression inclusiveExpr = Projections.inclusive(spec).project(conflictDetectionFilter);
+      Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr);
+
+      InclusiveMetricsEvaluator metrics = new InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter);
+
+      List<DataFile> newFiles = collectNewFiles(base);
+      for (DataFile newFile : newFiles) {
+        ValidationException.check(
+            !inclusive.eval(newFile.partition()) || !metrics.eval(newFile),
+            "A conflicting file was appended that matches filter '%s': %s",
 
 Review comment:
   Minor: This isn't quite correct. This doesn't guarantee that a file matches 
the filter, just that it may contain data that matches the filter. For example, 
if a table is bucketed by `id` and a file has ids `x` and `z` where `x < y` and 
`y < z` and `x`, `y`, and `z` all hash to the same bucket, then `id = y` would 
match the file even though `y` isn't actually in it.
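   
   The same false positive can be reproduced with a toy bucket function; this sketch uses `id % 4` as a stand-in for Iceberg's real Murmur3-based bucket transform:
   
       public class InclusiveFalsePositive {
         // stand-in bucket function; Iceberg actually uses a Murmur3 hash
         static int bucket(int id) {
           return id % 4;
         }
       
         public static void main(String[] args) {
           int x = 4;   // in the file
           int z = 12;  // in the file
           int y = 8;   // NOT in the file, but in the same bucket and within [x, z]
       
           // the inclusive partition projection of "id = y" keeps any file in bucket(y),
           // and the file's lower/upper bounds [x, z] also contain y, so the check
           // flags the file even though y is not actually in it
           boolean partitionMatch = bucket(y) == bucket(x);
           boolean metricsMatch = x <= y && y <= z;
           System.out.println(partitionMatch && metricsMatch); // true: a false positive
         }
       }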







[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315462678
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() {
   }
 }
 
+    if (conflictDetectionFilter != null) {
+      PartitionSpec spec = writeSpec();
+      Expression inclusiveExpr = Projections.inclusive(spec).project(conflictDetectionFilter);
+      Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr);
+
+      InclusiveMetricsEvaluator metrics = new InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter);
+
+      List<DataFile> newFiles = collectNewFiles(base);
+      for (DataFile newFile : newFiles) {
+        ValidationException.check(
+            !inclusive.eval(newFile.partition()) || !metrics.eval(newFile),
+            "A conflicting file was appended that matches filter '%s': %s",
+            conflictDetectionFilter, newFile.path());
+      }
+    }
+
     return super.apply(base);
   }
+
+  private List<DataFile> collectNewFiles(TableMetadata meta) {
+    List<DataFile> newFiles = new ArrayList<>();
+
+    Long currentSnapshotId = meta.currentSnapshot() == null ? null : meta.currentSnapshot().snapshotId();
 
 Review comment:
   I think that if the current snapshot is null, the loop is skipped and this 
returns an empty list. That's not correct, right?





[GitHub] [incubator-iceberg] manishmalhotrawork commented on issue #193: Support Page Skipping in Iceberg Parquet Reader

2019-08-19 Thread GitBox
manishmalhotrawork commented on issue #193: Support Page Skipping in Iceberg 
Parquet Reader
URL: https://github.com/apache/incubator-iceberg/issues/193#issuecomment-522801245
 
 
   @aokolnychyi this is interesting. Is this still valid?





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315462125
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
 ##
 @@ -74,5 +85,23 @@
*
* @return this for method chaining
*/
-  OverwriteFiles validateAddedFiles();
+  OverwriteFiles validateAddedFilesMatchOverwriteFilter();
+
+  /**
+   * Enables validation that files added concurrently do not conflict with this commit's operation.
+   * <p>
+   * This method should be called when the table is queried to determine which files to delete/append.
+   * If a concurrent operation commits a new file after the data was read and that file might
+   * contain rows matching the specified conflict detection filter, the overwrite operation
+   * will detect this during retries and fail.
+   * <p>
+   * Calling this method with a correct conflict detection filter is required to maintain
 
 Review comment:
   I agree with your take on it, though I'm not sure that I agree that 
compaction isn't serializable. That operation takes effect when it is committed 
and can be relocated in the order of commits without changing the work it did. 
But once it is committed, it is present in the order of operations and that 
won't change.
   
   Anyway, I think that isolation level isn't a useful table config. Some 
individual operations will probably want to relax it.
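   
   Purely illustrative, what a per-operation knob (rather than a table property) might look like; nothing like this exists in the PR, and the names are invented:
   
       enum IsolationLevel {
         SNAPSHOT,     // only validate that the files being replaced still exist
         SERIALIZABLE  // additionally validate files appended concurrently
       }
       
       // compaction could request SNAPSHOT; eager update/delete could request SERIALIZABLE
       interface ConfigurableOverwrite {
         ConfigurableOverwrite isolationLevel(IsolationLevel level);
       }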





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315461591
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
 ##
 @@ -24,18 +24,17 @@
 import org.apache.iceberg.expressions.Projections;
 
 /**
- * API for overwriting files in a table by filter expression.
+ * API for overwriting files in a table.
  * <p>
  * This API accumulates file additions and produces a new {@link Snapshot} of the table by replacing
- * all the files that match the filter expression with the set of additions. This operation is used
- * to implement idempotent writes that always replace a section of a table with new data.
+ * all the deleted files with the set of additions. This operation is used to implement idempotent
+ * writes that always replace a section of a table with new data or update/delete operations that
+ * eagerly overwrite files.
 
 Review comment:
   Looks good.





[GitHub] [incubator-iceberg] rdblue commented on issue #279: Transforming timestamp to date should produce date

2019-08-19 Thread GitBox
rdblue commented on issue #279: Transforming timestamp to date should produce 
date
URL: 
https://github.com/apache/incubator-iceberg/issues/279#issuecomment-522800271
 
 
   > can you point me to any other test case or code path where [getResultType] is being used, so it can be verified.
   
   There might not be one. Thanks for adding tests!
   
   > If we are returning DateType.get for Days why not for Month/Year as well
   
   The values for those aren't stored as dates. For example, the year 1990 is stored as 20 (years since 1970).
   
   > Anything specific I need to test/validate if this change impacts the 
metadata tables?
   
   Existing tests should cover it. The change should automatically update the 
Spark representation as well. Just fix any test failures, if there are any.
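   
   The ordinal arithmetic behind that answer, shown with java.time: a day ordinal is exactly what a date is, while year and month ordinals have no date representation.
   
       import java.time.LocalDate;
       import java.time.temporal.ChronoUnit;
       
       public class TransformOrdinals {
         public static void main(String[] args) {
           LocalDate epoch = LocalDate.ofEpochDay(0); // 1970-01-01
           // a day ordinal IS a date, so DateType fits the Days transform
           long days = ChronoUnit.DAYS.between(epoch, LocalDate.of(2019, 8, 19));    // 18127
           // year/month ordinals are plain integers with no date meaning
           long years = ChronoUnit.YEARS.between(epoch, LocalDate.of(1990, 1, 1));   // 20
           long months = ChronoUnit.MONTHS.between(epoch, LocalDate.of(1990, 1, 1)); // 240
           System.out.printf("days=%d years=%d months=%d%n", days, years, months);
         }
       }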





[GitHub] [incubator-iceberg] manishmalhotrawork commented on issue #279: Transforming timestamp to date should produce date

2019-08-19 Thread GitBox
manishmalhotrawork commented on issue #279: Transforming timestamp to date 
should produce date
URL: 
https://github.com/apache/incubator-iceberg/issues/279#issuecomment-522798222
 
 
   @rdblue quick questions.
   
   1. While validating my changes, I didn't see that `Transform.getResultType` is being used apart from `PartitionSpec.partitionType()`.
   Can you point me to any other test case or code path where it's being used, so it can be verified?
   
   (though I'm adding test cases to validate `Days.getResultType()` and `Timestamps.getResultType()`)
   
   2. If we are returning DateType.get for Days, why not for Month/Year as well? For those types as well, it will return an integer ordinal.
   
   3.
   > That will cause the partition tuple's field type to be a date, which should also cause the metadata table to display formatted dates instead of the day ordinal in Spark.
   
   I believe the metadata table referenced here is related to #252? Is there anything specific I need to test/validate if this change impacts the metadata tables?





[GitHub] [incubator-iceberg] danielcweeks merged pull request #381: [python]Bringing expression implementations back into synchronization

2019-08-19 Thread GitBox
danielcweeks merged pull request #381: [python]Bringing expression 
implementations back into synchronization
URL: https://github.com/apache/incubator-iceberg/pull/381
 
 
   





[GitHub] [incubator-iceberg] danielcweeks merged pull request #382: [python]Bringing type module into sync

2019-08-19 Thread GitBox
danielcweeks merged pull request #382: [python]Bringing type module into sync
URL: https://github.com/apache/incubator-iceberg/pull/382
 
 
   





[GitHub] [incubator-iceberg] manishmalhotrawork commented on issue #393: include column_sizes in stats columns

2019-08-19 Thread GitBox
manishmalhotrawork commented on issue #393: include column_sizes in stats 
columns
URL: https://github.com/apache/incubator-iceberg/pull/393#issuecomment-522790993
 
 
   @rdblue please see if we can merge it. Thanks!





[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #374: Migrate spark table to iceberg table

2019-08-19 Thread GitBox
aokolnychyi commented on a change in pull request #374: Migrate spark table to 
iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315169933
 
 

 ##
 File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala
 ##
 @@ -297,5 +301,88 @@ object SparkTableUtil {
       )
     }
   }
+
+  def buildManifest(table: Table,
+                    sparkDataFiles: Seq[SparkDataFile],
+                    partitionSpec: PartitionSpec): ManifestFile = {
+    val outputFile = table.io
+      .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString))
+    val writer = ManifestWriter.write(partitionSpec, outputFile)
+    try {
+      for (file <- sparkDataFiles) {
+        writer.add(file.toDataFile(partitionSpec))
+      }
+    } finally {
+      writer.close()
+    }
+
+    writer.toManifestFile
+  }
+
+  def partitionToMap(partition: String): Map[String, String] = {
+    val map = new mutable.HashMap[String, String]()
+    val list = partition.split("/")
+    list.foreach { str =>
+      val kv = str.split("=")
+      map.put(kv(0), kv(1))
+    }
+
+    map.toMap
+  }
+
+  /**
+   * Migrate a Spark table to an Iceberg table.
+   *
+   * The migration uses the Spark session to get table metadata. It assumes no
+   * operation is going on in the original table or the target table, and thus is not
+   * thread-safe.
+   *
+   * @param dbName the database name of the table to be migrated
+   * @param tableName the table to be migrated
+   * @param table the target table to migrate into
+   *
+   * @return table the target table
+   */
+  def migrateSparkTable(dbName: String, tableName: String, table: Table): Table = {
 
 Review comment:
   I totally agree. I think we cannot really implement "migrate" before we have 
proper DS V2 APIs coming in Spark 3.0. My comment was only about naming. 





[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #374: Migrate spark table to iceberg table

2019-08-19 Thread GitBox
chenjunjiedada commented on a change in pull request #374: Migrate spark table 
to iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315162272
 
 

 ##
 File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala
 ##
 @@ -297,5 +301,88 @@ object SparkTableUtil {
       )
     }
   }
+
+  def buildManifest(table: Table,
+                    sparkDataFiles: Seq[SparkDataFile],
+                    partitionSpec: PartitionSpec): ManifestFile = {
+    val outputFile = table.io
+      .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString))
+    val writer = ManifestWriter.write(partitionSpec, outputFile)
+    try {
+      for (file <- sparkDataFiles) {
+        writer.add(file.toDataFile(partitionSpec))
+      }
+    } finally {
+      writer.close()
+    }
+
+    writer.toManifestFile
+  }
+
+  def partitionToMap(partition: String): Map[String, String] = {
+    val map = new mutable.HashMap[String, String]()
+    val list = partition.split("/")
+    list.foreach { str =>
+      val kv = str.split("=")
+      map.put(kv(0), kv(1))
+    }
+
+    map.toMap
+  }
+
+  /**
+   * Migrate a Spark table to an Iceberg table.
+   *
+   * The migration uses the Spark session to get table metadata. It assumes no
+   * operation is going on in the original table or the target table, and thus is not
+   * thread-safe.
+   *
+   * @param dbName the database name of the table to be migrated
+   * @param tableName the table to be migrated
+   * @param table the target table to migrate into
+   *
+   * @return table the target table
+   */
+  def migrateSparkTable(dbName: String, tableName: String, table: Table): Table = {
 
 Review comment:
   Aha, it does look like the "import data" behaviour. When I tried to use Iceberg, the first thing I had to do was "import table". If we want to implement a migration that includes replacing the pointer, more steps would be needed. I can look into it further. At this stage, I'd like to implement "import data", since that is also needed as the first step of the migration, IIUC.





[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #395: Remove Hadoop Configuration dependency in BaseMetastoreTableOperations

2019-08-19 Thread GitBox
aokolnychyi commented on a change in pull request #395: Remove Hadoop 
Configuration dependency in BaseMetastoreTableOperations
URL: https://github.com/apache/incubator-iceberg/pull/395#discussion_r315155798
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java
 ##
 @@ -43,17 +41,15 @@
   private static final String METADATA_FOLDER_NAME = "metadata";
   private static final String DATA_FOLDER_NAME = "data";
 
-  private final Configuration conf;
   private final FileIO fileIo;
 
   private TableMetadata currentMetadata = null;
   private String currentMetadataLocation = null;
   private boolean shouldRefresh = true;
   private int version = -1;
 
-  protected BaseMetastoreTableOperations(Configuration conf) {
-    this.conf = conf;
-    this.fileIo = new HadoopFileIO(conf);
+  protected BaseMetastoreTableOperations(FileIO fileIo) {
 
 Review comment:
   As an alternative, we can consider providing an abstract method that subclasses should implement. That way, we have more flexibility over initializing `FileIO`. A call to the parent constructor must always be the first statement in a child constructor, which means we would have problems whenever initializing `FileIO` requires calling another method first.
   
   @jerryshao what do you think?
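   
   A minimal sketch of that alternative, assuming an abstract accessor in place of the constructor parameter (illustrative only, not the code in this PR):
   
       import org.apache.iceberg.TableOperations;
       import org.apache.iceberg.io.FileIO;
       
       abstract class AbstractMetastoreTableOperations implements TableOperations {
         // subclasses decide how and when to build the FileIO, so its
         // initialization is not constrained by the "super() must be the
         // first statement" rule that a constructor parameter imposes
         protected abstract FileIO fileIo();
       }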





[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #374: Migrate spark table to iceberg table

2019-08-19 Thread GitBox
chenjunjiedada commented on a change in pull request #374: Migrate spark table 
to iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315150212
 
 

 ##
 File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala
 ##
 @@ -297,5 +301,88 @@ object SparkTableUtil {
       )
     }
   }
+
+  def buildManifest(table: Table,
+                    sparkDataFiles: Seq[SparkDataFile],
+                    partitionSpec: PartitionSpec): ManifestFile = {
+    val outputFile = table.io
+      .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString))
+    val writer = ManifestWriter.write(partitionSpec, outputFile)
+    try {
+      for (file <- sparkDataFiles) {
+        writer.add(file.toDataFile(partitionSpec))
+      }
+    } finally {
+      writer.close()
+    }
+
+    writer.toManifestFile
+  }
+
+  def partitionToMap(partition: String): Map[String, String] = {
+    val map = new mutable.HashMap[String, String]()
+    val list = partition.split("/")
+    list.foreach { str =>
+      val kv = str.split("=")
+      map.put(kv(0), kv(1))
+    }
+
+    map.toMap
+  }
+
+  /**
+   * Migrate a Spark table to an Iceberg table.
+   *
+   * The migration uses the Spark session to get table metadata. It assumes no
+   * operation is going on in the original table or the target table, and thus is not
+   * thread-safe.
+   *
+   * @param dbName the database name of the table to be migrated
+   * @param tableName the table to be migrated
+   * @param table the target table to migrate into
+   *
+   * @return table the target table
+   */
+  def migrateSparkTable(dbName: String, tableName: String, table: Table): Table = {
 
 Review comment:
   Makes sense. I will update it in the next commit.





[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-19 Thread GitBox
aokolnychyi commented on a change in pull request #351: Extend Iceberg with a 
way to overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315093153
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
 ##
 @@ -74,5 +85,23 @@
*
* @return this for method chaining
*/
-  OverwriteFiles validateAddedFiles();
+  OverwriteFiles validateAddedFilesMatchOverwriteFilter();
+
+  /**
+   * Enables validation that files added concurrently do not conflict with 
this commit's operation.
+   * 
+   * This method should be called when the table is queried to determine which 
files to delete/append.
+   * If a concurrent operation commits a new file after the data was read and 
that file might
+   * contain rows matching the specified conflict detection filter, the 
overwrite operation
+   * will detect this during retries and fail.
+   * 
+   * Calling this method with a correct conflict detection filter is required 
to maintain
 
 Review comment:
   We can also consider making the isolation level a table property and 
validating it here. I can think of multiple operations where this will be 
applicable: compaction, update, delete, merge into. However, I am not sure we 
want the same isolation level for all operations. Compaction is not 
serializable in its current form: we don't check files added concurrently 
(which seems correct to me). Maybe, the isolation level shouldn't be a table 
property but rather a config of the operation itself. That way, compaction will 
be using snapshot isolation and update/deletes might be serializable.
   
   @rdblue what are your thoughts?




