[GitHub] [incubator-iceberg] jerryshao commented on a change in pull request #395: Remove Hadoop Configuration dependency in BaseMetastoreTableOperations
jerryshao commented on a change in pull request #395: Remove Hadoop Configuration dependency in BaseMetastoreTableOperations URL: https://github.com/apache/incubator-iceberg/pull/395#discussion_r315498411 ## File path: core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java ## @@ -43,17 +41,15 @@ private static final String METADATA_FOLDER_NAME = "metadata"; private static final String DATA_FOLDER_NAME = "data"; - private final Configuration conf; private final FileIO fileIo; private TableMetadata currentMetadata = null; private String currentMetadataLocation = null; private boolean shouldRefresh = true; private int version = -1; - protected BaseMetastoreTableOperations(Configuration conf) { -this.conf = conf; -this.fileIo = new HadoopFileIO(conf); + protected BaseMetastoreTableOperations(FileIO fileIo) { Review comment: @aokolnychyi please help to review again, thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] jerryshao commented on a change in pull request #395: Remove Hadoop Configuration dependency in BaseMetastoreTableOperations
jerryshao commented on a change in pull request #395: Remove Hadoop Configuration dependency in BaseMetastoreTableOperations URL: https://github.com/apache/incubator-iceberg/pull/395#discussion_r315483176 ## File path: core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java ## @@ -43,17 +41,15 @@ private static final String METADATA_FOLDER_NAME = "metadata"; private static final String DATA_FOLDER_NAME = "data"; - private final Configuration conf; private final FileIO fileIo; private TableMetadata currentMetadata = null; private String currentMetadataLocation = null; private boolean shouldRefresh = true; private int version = -1; - protected BaseMetastoreTableOperations(Configuration conf) { -this.conf = conf; -this.fileIo = new HadoopFileIO(conf); + protected BaseMetastoreTableOperations(FileIO fileIo) { Review comment: OK, let me update the code. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue closed issue #269: Fix BaseTableScan to include column_sizes in stats columns
rdblue closed issue #269: Fix BaseTableScan to include column_sizes in stats columns URL: https://github.com/apache/incubator-iceberg/issues/269 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue merged pull request #393: include column_sizes in stats columns
rdblue merged pull request #393: include column_sizes in stats columns URL: https://github.com/apache/incubator-iceberg/pull/393 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on issue #393: include column_sizes in stats columns
rdblue commented on issue #393: include column_sizes in stats columns URL: https://github.com/apache/incubator-iceberg/pull/393#issuecomment-522805147 Looks great, thanks for fixing this @manishmalhotrawork! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #362: Support create and replace transactions in Catalog
rdblue commented on a change in pull request #362: Support create and replace transactions in Catalog URL: https://github.com/apache/incubator-iceberg/pull/362#discussion_r315465619 ## File path: hive/src/test/java/org/apache/iceberg/hive/HiveCreateReplaceTableTest.java ## @@ -0,0 +1,211 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +package org.apache.iceberg.hive; + +import com.google.common.collect.Maps; +import java.io.IOException; +import java.util.HashMap; +import org.apache.hadoop.hive.conf.HiveConf; +import org.apache.hadoop.hive.metastore.HiveMetaStoreClient; +import org.apache.hadoop.hive.metastore.api.Database; +import org.apache.iceberg.PartitionSpec; +import org.apache.iceberg.Schema; +import org.apache.iceberg.Table; +import org.apache.iceberg.Transaction; +import org.apache.iceberg.catalog.TableIdentifier; +import org.apache.iceberg.exceptions.AlreadyExistsException; +import org.apache.iceberg.exceptions.NoSuchTableException; +import org.apache.iceberg.types.Types; +import org.junit.After; +import org.junit.AfterClass; +import org.junit.Assert; +import org.junit.Before; +import org.junit.BeforeClass; +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.ExpectedException; +import org.junit.rules.TemporaryFolder; + +import static org.apache.iceberg.PartitionSpec.builderFor; +import static org.apache.iceberg.types.Types.NestedField.required; + +public class HiveCreateReplaceTableTest { + + private static final String DB_NAME = "hivedb"; + private static final String TABLE_NAME = "tbl"; + private static final TableIdentifier TABLE_IDENTIFIER = TableIdentifier.of(DB_NAME, TABLE_NAME); + private static final Schema SCHEMA = new Schema( + required(3, "id", Types.IntegerType.get()), + required(4, "data", Types.StringType.get()) + ); + private static final PartitionSpec SPEC = builderFor(SCHEMA) + .identity("id") + .build(); + + private static TestHiveMetastore metastore; + private static HiveMetaStoreClient metastoreClient; + private static HiveConf hiveConf; + private static HiveCatalog catalog; + + @Rule + public ExpectedException exceptionRule = ExpectedException.none(); + @Rule + public TemporaryFolder temp = new TemporaryFolder(); + + private String tableLocation; + + @BeforeClass + public static void startMetastore() throws Exception { +HiveCreateReplaceTableTest.metastore = new TestHiveMetastore(); +metastore.start(); +HiveCreateReplaceTableTest.hiveConf = metastore.hiveConf(); +HiveCreateReplaceTableTest.metastoreClient = new HiveMetaStoreClient(hiveConf); +String dbPath = metastore.getDatabasePath(DB_NAME); +Database db = new Database(DB_NAME, "description", dbPath, new HashMap<>()); +metastoreClient.createDatabase(db); +HiveCreateReplaceTableTest.catalog = new HiveCatalog(hiveConf); + } + + @AfterClass + public static void stopMetastore() { +catalog.close(); +HiveCreateReplaceTableTest.catalog = null; + +metastoreClient.close(); +HiveCreateReplaceTableTest.metastoreClient = null; + +metastore.stop(); +HiveCreateReplaceTableTest.metastore = null; + } + + @Before + public void createTableLocation() throws IOException { +tableLocation = temp.newFolder("hive-").getPath(); + } + + @After + public void cleanup() { +catalog.dropTable(TABLE_IDENTIFIER); + } + + @Test + public void testCreateTableTxn() { +Assert.assertFalse("Table should not exist", catalog.tableExists(TABLE_IDENTIFIER)); + +Transaction txn = catalog.newCreateTableTransaction( +TABLE_IDENTIFIER, SCHEMA, SPEC, tableLocation, Maps.newHashMap()); +txn.updateProperties() +.set("prop", "value") +.commit(); + +// verify the table is still not visible before the transaction is committed +Assert.assertFalse(catalog.tableExists(TABLE_IDENTIFIER)); + +txn.commitTransaction(); + +Table table = catalog.loadTable(TABLE_IDENTIFIER); +Assert.assertEquals("Table props should match", "value", table.properties().get("prop")); + } + + @Test + public void testCreateTableTxnTableCreatedConcurrently() { +exceptionRule.expect(RuntimeException.class); +
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315464699 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -56,14 +67,34 @@ public OverwriteFiles addFile(DataFile file) { } @Override - public OverwriteFiles validateAddedFiles() { -this.validateAddedFiles = true; + public OverwriteFiles deleteFile(DataFile file) { +Preconditions.checkState(!overwriteByRowFilter, +"Either overwriteByRowFilter or deleteFile should be used, not both"); Review comment: I don't think these are mutually exclusive. I could still create an idempotent overwrite that deletes files directly. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315464625 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -56,14 +67,34 @@ public OverwriteFiles addFile(DataFile file) { } @Override - public OverwriteFiles validateAddedFiles() { -this.validateAddedFiles = true; + public OverwriteFiles deleteFile(DataFile file) { +Preconditions.checkState(!overwriteByRowFilter, +"Either overwriteByRowFilter or deleteFile should be used, not both"); +this.deleteFilesByPath = true; +delete(file.path()); +return this; + } + + @Override + public OverwriteFiles validateAddedFilesMatchOverwriteFilter() { +Preconditions.checkState(!deleteFilesByPath, +"Added files can be validated only if overwriteByRowFilter is used"); Review comment: I don't think these are mutually exclusive. I could still delete by filter and want to validate no files matching a filter (probably the same one) have been added. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315464103 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() { } } +if (conflictDetectionFilter != null) { + PartitionSpec spec = writeSpec(); + Expression inclusiveExpr = Projections.inclusive(spec).project(conflictDetectionFilter); + Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr); + + InclusiveMetricsEvaluator metrics = new InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter); + + List newFiles = collectNewFiles(base); + for (DataFile newFile : newFiles) { +ValidationException.check( +!inclusive.eval(newFile.partition()) || !metrics.eval(newFile), +"A conflicting file was appended that matches filter '%s': %s", +conflictDetectionFilter, newFile.path()); + } +} + return super.apply(base); } + + private List collectNewFiles(TableMetadata meta) { +List newFiles = new ArrayList<>(); + +Long currentSnapshotId = meta.currentSnapshot() == null ? null : meta.currentSnapshot().snapshotId(); +while (currentSnapshotId != null && !currentSnapshotId.equals(readSnapshotId)) { Review comment: If `readSnapshotId` is null, then this use the _entire_ table history? Or am I missing where `readSnapshotId` is defaulted to the current snapshot ID when the operation starts? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315464206 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -19,15 +19,23 @@ package org.apache.iceberg; +import com.google.common.base.Preconditions; +import com.google.common.collect.Iterables; +import java.util.ArrayList; import java.util.List; import org.apache.iceberg.exceptions.ValidationException; import org.apache.iceberg.expressions.Evaluator; import org.apache.iceberg.expressions.Expression; +import org.apache.iceberg.expressions.InclusiveMetricsEvaluator; import org.apache.iceberg.expressions.Projections; import org.apache.iceberg.expressions.StrictMetricsEvaluator; public class OverwriteData extends MergingSnapshotProducer implements OverwriteFiles { - private boolean validateAddedFiles = false; + private boolean overwriteByRowFilter; + private boolean validateAddedFilesMatchOverwriteFilter; + private boolean deleteFilesByPath; + private Long readSnapshotId; + private Expression conflictDetectionFilter; Review comment: I'd prefer to add explicit defaults for readability. Is there a reason to remove them? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315463778 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -74,5 +85,23 @@ * * @return this for method chaining */ - OverwriteFiles validateAddedFiles(); + OverwriteFiles validateAddedFilesMatchOverwriteFilter(); + + /** + * Enables validation that files added concurrently do not conflict with this commit's operation. + * + * This method should be called when the table is queried to determine which files to delete/append. + * If a concurrent operation commits a new file after the data was read and that file might + * contain rows matching the specified conflict detection filter, the overwrite operation + * will detect this during retries and fail. + * + * Calling this method with a correct conflict detection filter is required to maintain + * serializable isolation for eager update/delete operations. Otherwise, the isolation level + * will be snapshot isolation. + * + * @param readSnapshotId the snapshot id that was used to read the data Review comment: This should document what happens when `readSnapshotId` is null since that's supported. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315463525 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() { } } +if (conflictDetectionFilter != null) { + PartitionSpec spec = writeSpec(); + Expression inclusiveExpr = Projections.inclusive(spec).project(conflictDetectionFilter); + Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr); + + InclusiveMetricsEvaluator metrics = new InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter); + + List newFiles = collectNewFiles(base); + for (DataFile newFile : newFiles) { +ValidationException.check( +!inclusive.eval(newFile.partition()) || !metrics.eval(newFile), +"A conflicting file was appended that matches filter '%s': %s", Review comment: Minor: This isn't quite correct. This doesn't guarantee that a file matches the filter, just that it may contain data that matches the filter. For example, if a table is bucketed by `id` and a file has ids `x` and `z` where `x < y` and `y < z` and `x`, `y`, and `z` all hash to the same bucket, then `id = y` would match the file even though `y` isn't actually in it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315463525 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() { } } +if (conflictDetectionFilter != null) { + PartitionSpec spec = writeSpec(); + Expression inclusiveExpr = Projections.inclusive(spec).project(conflictDetectionFilter); + Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr); + + InclusiveMetricsEvaluator metrics = new InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter); + + List newFiles = collectNewFiles(base); + for (DataFile newFile : newFiles) { +ValidationException.check( +!inclusive.eval(newFile.partition()) || !metrics.eval(newFile), +"A conflicting file was appended that matches filter '%s': %s", Review comment: Minor: This isn't quite correct. This doesn't guarantee that a file matches the filter, just that it may contain data that matches the filter. For example, if a table is bucketed by `id` and a file has ids `x` and `z` where `x < y` and `y < z` and `x`, `y`, and `z` all hash to the same bucket, then this would match the file even though `y` isn't actually in it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315462678 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() { } } +if (conflictDetectionFilter != null) { + PartitionSpec spec = writeSpec(); + Expression inclusiveExpr = Projections.inclusive(spec).project(conflictDetectionFilter); + Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr); + + InclusiveMetricsEvaluator metrics = new InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter); + + List newFiles = collectNewFiles(base); + for (DataFile newFile : newFiles) { +ValidationException.check( +!inclusive.eval(newFile.partition()) || !metrics.eval(newFile), +"A conflicting file was appended that matches filter '%s': %s", +conflictDetectionFilter, newFile.path()); + } +} + return super.apply(base); } + + private List collectNewFiles(TableMetadata meta) { +List newFiles = new ArrayList<>(); + +Long currentSnapshotId = meta.currentSnapshot() == null ? null : meta.currentSnapshot().snapshotId(); Review comment: I think that if the current snapshot is null, the loop is skipped and this returns an empty list. That's not correct, right? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] manishmalhotrawork commented on issue #193: Support Page Skipping in Iceberg Parquet Reader
manishmalhotrawork commented on issue #193: Support Page Skipping in Iceberg Parquet Reader URL: https://github.com/apache/incubator-iceberg/issues/193#issuecomment-522801245 @aokolnychyi this is interesting. is this still valid ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315462125 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -74,5 +85,23 @@ * * @return this for method chaining */ - OverwriteFiles validateAddedFiles(); + OverwriteFiles validateAddedFilesMatchOverwriteFilter(); + + /** + * Enables validation that files added concurrently do not conflict with this commit's operation. + * + * This method should be called when the table is queried to determine which files to delete/append. + * If a concurrent operation commits a new file after the data was read and that file might + * contain rows matching the specified conflict detection filter, the overwrite operation + * will detect this during retries and fail. + * + * Calling this method with a correct conflict detection filter is required to maintain Review comment: I agree with your take on it, though I'm not sure that I agree that compaction isn't serializable. That operation takes effect when it is committed and can be relocated in the order of commits without changing the work it did. But once it is committed, it is present in the order of operations and that won't change. Anyway, I think that isolation level isn't a useful table config. Some individual operations will probably want to relax it. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315461591 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -24,18 +24,17 @@ import org.apache.iceberg.expressions.Projections; /** - * API for overwriting files in a table by filter expression. + * API for overwriting files in a table. * * This API accumulates file additions and produces a new {@link Snapshot} of the table by replacing - * all the files that match the filter expression with the set of additions. This operation is used - * to implement idempotent writes that always replace a section of a table with new data. + * all the deleted files with the set of additions. This operation is used to implement idempotent + * writes that always replace a section of a table with new data or update/delete operations that + * eagerly overwrite files. Review comment: Looks good. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on issue #279: Transforming timestamp to date should produce date
rdblue commented on issue #279: Transforming timestamp to date should produce date URL: https://github.com/apache/incubator-iceberg/issues/279#issuecomment-522800271 > can you point me to any other test-case or code path, where [getResultType] is being used can be verified. There might not be one. Thanks for adding tests! > If we are returning DateType.get for Days why not for Month/Year as well The values for those aren't stored as dates. For example, 1990 is stored as 20. > Anything specific I need to test/validate if this change impacts the metadata tables? Existing tests should cover it. The change should automatically update the Spark representation as well. Just fix any test failures, if there are any. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] manishmalhotrawork commented on issue #279: Transforming timestamp to date should produce date
manishmalhotrawork commented on issue #279: Transforming timestamp to date should produce date URL: https://github.com/apache/incubator-iceberg/issues/279#issuecomment-522798222 @rdblue quick questions. 1. while validating my changes, didn't see that transfor.getResultType is being used apart from `PartitionSpec.partitionType()`. can you point me to any other test-case or code path, where its being used can be verified. ( though Im adding test-cases to validate `Days.getResultType()` and `Timestamps.getResultType()`) 2. If we are returning DateType.get for Days why not for Month/Year as well. As for those types as well it will return Integer/day ordinal? 3. >That will cause the partition tuple's field type to be a date, which should also cause the metadata table to display formatted dates instead of the day ordinal in Spark. I believe metadata table referenced here is related to #252? Anything specific I need to test/validate if this change impacts the metadata tables? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] danielcweeks merged pull request #381: [python]Bringing expression implementations back into synchronization
danielcweeks merged pull request #381: [python]Bringing expression implementations back into synchronization URL: https://github.com/apache/incubator-iceberg/pull/381 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] danielcweeks merged pull request #382: [python]Bringing type module into sync
danielcweeks merged pull request #382: [python]Bringing type module into sync URL: https://github.com/apache/incubator-iceberg/pull/382 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] manishmalhotrawork commented on issue #393: include column_sizes in stats columns
manishmalhotrawork commented on issue #393: include column_sizes in stats columns URL: https://github.com/apache/incubator-iceberg/pull/393#issuecomment-522790993 @rdblue please see, if we can merge it. thanks ! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #374: Migrate spark table to iceberg table
aokolnychyi commented on a change in pull request #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315169933 ## File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala ## @@ -297,5 +301,88 @@ object SparkTableUtil { ) } } + + def buildManifest(table: Table, +sparkDataFiles: Seq[SparkDataFile], +partitionSpec: PartitionSpec): ManifestFile = { +val outputFile = table.io + .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString)) +val writer = ManifestWriter.write(partitionSpec, outputFile) +try { + for (file <- sparkDataFiles) { +writer.add(file.toDataFile(partitionSpec)) + } +} finally { + writer.close() +} + +writer.toManifestFile + } + + def partitionToMap(partition: String): Map[String, String] = { +val map = new mutable.HashMap[String, String]() +val list = partition.split("/") +list.foreach { str => + val kv = str.split("=") + map.put(kv(0), kv(1)) +} + +map.toMap + } + + /** + * Migrate a spark table to a iceberg table. + * + * The migration uses the spark session to get table metadata. It assumes no + * operation is going on original table and target table and thus is not + * thread-safe. + * + * @param dbName the database name of the table to be migrated + * @param tableName the table to be migrated + * @param table the target table to migrate in + * + * @return table the target table + */ + def migrateSparkTable(dbName: String, tableName: String, table: Table): Table = { Review comment: I totally agree. I think we cannot really implement "migrate" before we have proper DS V2 APIs coming in Spark 3.0. My comment was only about naming. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #374: Migrate spark table to iceberg table
chenjunjiedada commented on a change in pull request #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315162272 ## File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala ## @@ -297,5 +301,88 @@ object SparkTableUtil { ) } } + + def buildManifest(table: Table, +sparkDataFiles: Seq[SparkDataFile], +partitionSpec: PartitionSpec): ManifestFile = { +val outputFile = table.io + .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString)) +val writer = ManifestWriter.write(partitionSpec, outputFile) +try { + for (file <- sparkDataFiles) { +writer.add(file.toDataFile(partitionSpec)) + } +} finally { + writer.close() +} + +writer.toManifestFile + } + + def partitionToMap(partition: String): Map[String, String] = { +val map = new mutable.HashMap[String, String]() +val list = partition.split("/") +list.foreach { str => + val kv = str.split("=") + map.put(kv(0), kv(1)) +} + +map.toMap + } + + /** + * Migrate a spark table to a iceberg table. + * + * The migration uses the spark session to get table metadata. It assumes no + * operation is going on original table and target table and thus is not + * thread-safe. + * + * @param dbName the database name of the table to be migrated + * @param tableName the table to be migrated + * @param table the target table to migrate in + * + * @return table the target table + */ + def migrateSparkTable(dbName: String, tableName: String, table: Table): Table = { Review comment: Aha, it does like the "import data" behaviour. When I tried to use the iceberg, the first thing I had to do is to "import table". If we want to implement a migration include replace the pointer, more steps should be needed. I can have a look further. At this stage, I 'd like to implement "import data" since that should also be needed at first in the migration IIUC. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #395: Remove Hadoop Configuration dependency in BaseMetastoreTableOperations
aokolnychyi commented on a change in pull request #395: Remove Hadoop Configuration dependency in BaseMetastoreTableOperations URL: https://github.com/apache/incubator-iceberg/pull/395#discussion_r315155798 ## File path: core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java ## @@ -43,17 +41,15 @@ private static final String METADATA_FOLDER_NAME = "metadata"; private static final String DATA_FOLDER_NAME = "data"; - private final Configuration conf; private final FileIO fileIo; private TableMetadata currentMetadata = null; private String currentMetadataLocation = null; private boolean shouldRefresh = true; private int version = -1; - protected BaseMetastoreTableOperations(Configuration conf) { -this.conf = conf; -this.fileIo = new HadoopFileIO(conf); + protected BaseMetastoreTableOperations(FileIO fileIo) { Review comment: As an alternative, we can consider providing an abstract method that subclasses should implement. That way, we have more flexibility over initializing `FileIO`. Calling parent constructors must always be the first call in child constructors, which means we will have problems when initializing `FileIO` requires calling another method. @jerryshao what do you think? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #374: Migrate spark table to iceberg table
chenjunjiedada commented on a change in pull request #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315150212 ## File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala ## @@ -297,5 +301,88 @@ object SparkTableUtil { ) } } + + def buildManifest(table: Table, +sparkDataFiles: Seq[SparkDataFile], +partitionSpec: PartitionSpec): ManifestFile = { +val outputFile = table.io + .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString)) +val writer = ManifestWriter.write(partitionSpec, outputFile) +try { + for (file <- sparkDataFiles) { +writer.add(file.toDataFile(partitionSpec)) + } +} finally { + writer.close() +} + +writer.toManifestFile + } + + def partitionToMap(partition: String): Map[String, String] = { +val map = new mutable.HashMap[String, String]() +val list = partition.split("/") +list.foreach { str => + val kv = str.split("=") + map.put(kv(0), kv(1)) +} + +map.toMap + } + + /** + * Migrate a spark table to a iceberg table. + * + * The migration uses the spark session to get table metadata. It assumes no + * operation is going on original table and target table and thus is not + * thread-safe. + * + * @param dbName the database name of the table to be migrated + * @param tableName the table to be migrated + * @param table the target table to migrate in + * + * @return table the target table + */ + def migrateSparkTable(dbName: String, tableName: String, table: Table): Table = { Review comment: Make sense. I will update in next commit. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
aokolnychyi commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315093153 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -74,5 +85,23 @@ * * @return this for method chaining */ - OverwriteFiles validateAddedFiles(); + OverwriteFiles validateAddedFilesMatchOverwriteFilter(); + + /** + * Enables validation that files added concurrently do not conflict with this commit's operation. + * + * This method should be called when the table is queried to determine which files to delete/append. + * If a concurrent operation commits a new file after the data was read and that file might + * contain rows matching the specified conflict detection filter, the overwrite operation + * will detect this during retries and fail. + * + * Calling this method with a correct conflict detection filter is required to maintain Review comment: We can also consider making the isolation level a table property and validating it here. I can think of multiple operations where this will be applicable: compaction, update, delete, merge into. However, I am not sure we want the same isolation level for all operations. Compaction is not serializable in its current form: we don't check files added concurrently (which seems correct to me). Maybe, the isolation level shouldn't be a table property but rather a config of the operation itself. That way, compaction will be using snapshot isolation and update/deletes might be serializable. @rdblue what are your thoughts? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
aokolnychyi commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315093153 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -74,5 +85,23 @@ * * @return this for method chaining */ - OverwriteFiles validateAddedFiles(); + OverwriteFiles validateAddedFilesMatchOverwriteFilter(); + + /** + * Enables validation that files added concurrently do not conflict with this commit's operation. + * + * This method should be called when the table is queried to determine which files to delete/append. + * If a concurrent operation commits a new file after the data was read and that file might + * contain rows matching the specified conflict detection filter, the overwrite operation + * will detect this during retries and fail. + * + * Calling this method with a correct conflict detection filter is required to maintain Review comment: We can also consider making the isolation level a table property and validating it here. I can think of multiple operations where this will be applicable: compaction, update, delete, merge into. However, I am not sure we want the same isolation level for all operations. Compaction is not serializable in its current form: we don't check files added concurrently (which seems correct to me). Maybe, the isolation level shouldn't be a table property but rather a config of the operation itself. That way, compaction will be using snapshot isolation and update/deletes might require serializable. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
aokolnychyi commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315093153 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -74,5 +85,23 @@ * * @return this for method chaining */ - OverwriteFiles validateAddedFiles(); + OverwriteFiles validateAddedFilesMatchOverwriteFilter(); + + /** + * Enables validation that files added concurrently do not conflict with this commit's operation. + * + * This method should be called when the table is queried to determine which files to delete/append. + * If a concurrent operation commits a new file after the data was read and that file might + * contain rows matching the specified conflict detection filter, the overwrite operation + * will detect this during retries and fail. + * + * Calling this method with a correct conflict detection filter is required to maintain Review comment: We can also consider making the isolation level a table property and validating it here. Maybe, in the future, we will have more operations where the isolation level can be lower to allow better performance. Having said that, if such a property exists, it has to be respected by all operations. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
aokolnychyi commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r315093153 ## File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java ## @@ -74,5 +85,23 @@ * * @return this for method chaining */ - OverwriteFiles validateAddedFiles(); + OverwriteFiles validateAddedFilesMatchOverwriteFilter(); + + /** + * Enables validation that files added concurrently do not conflict with this commit's operation. + * + * This method should be called when the table is queried to determine which files to delete/append. + * If a concurrent operation commits a new file after the data was read and that file might + * contain rows matching the specified conflict detection filter, the overwrite operation + * will detect this during retries and fail. + * + * Calling this method with a correct conflict detection filter is required to maintain Review comment: We can also consider making the isolation level a table property and validating it here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org