[GitHub] [incubator-iceberg] moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317856977

File path: site/docs/custom-catalog.md

````diff
@@ -0,0 +1,149 @@
+# Custom Catalog Implementation
+
+It's possible to read an iceberg table either from an hdfs path or from a hive table. It's also possible to use a custom metastore in place of hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide implementation on how to read and write metadata
+
+Example:
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+  private String dbName;
+  private String tableName;
+  private Configuration conf;
+  private FileIO fileIO;
+
+  protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
+    this.conf = conf;
+    this.dbName = dbName;
+    this.tableName = tableName;
+  }
+
+  // The doRefresh method should provide implementation on how to get the metadata location
+  @Override
+  public void doRefresh() {
+
+    // Example custom service which returns the metadata location given a dbName and tableName
+    String metadataLocation = CustomService.getMetadataForTable(conf, dbName, tableName);
+
+    // Use existing method to refresh metadata
+    refreshFromMetadataLocation(metadataLocation);
+
+  }
+
+  // The doCommit method should provide implementation on how to update with metadata location atomically
+  @Override
+  public void doCommit(TableMetadata base, TableMetadata metadata) {
+    // if the metadata is already out of date, reject it
+    if (base != current()) {
````

Review comment: @aokolnychyi @rdblue I was wondering if it makes sense to have this and the following check as part of the `commit()` method instead, before calling `doCommit`?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org
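The refactoring being discussed — hoisting the stale-base and no-op checks out of each `doCommit` implementation into a shared `commit()` wrapper — can be sketched with a toy base class. All names here (`ToyTableOperations`, `RecordingOps`) are illustrative, not the actual Iceberg `BaseMetastoreTableOperations` API:

```java
// Sketch only: the final commit() wrapper performs the shared validation once,
// so subclasses implement just the atomic metadata swap in doCommit().
abstract class ToyTableOperations {
    private Object currentMetadata;

    ToyTableOperations(Object initial) {
        this.currentMetadata = initial;
    }

    Object current() {
        return currentMetadata;
    }

    // Shared wrapper: validate here instead of in every doCommit implementation.
    final void commit(Object base, Object metadata) {
        // if the metadata is already out of date, reject it
        if (base != current()) {
            throw new IllegalStateException("Cannot commit: stale table metadata");
        }
        // if the metadata is not changed, return early
        if (base == metadata) {
            return;
        }
        doCommit(base, metadata);
        this.currentMetadata = metadata;
    }

    // Subclasses only provide the metadata-location update.
    protected abstract void doCommit(Object base, Object metadata);
}

class RecordingOps extends ToyTableOperations {
    boolean committed = false;

    RecordingOps(Object initial) {
        super(initial);
    }

    @Override
    protected void doCommit(Object base, Object metadata) {
        committed = true;
    }
}
```

With this shape, a subclass that forgets the stale-base check can no longer commit over someone else's update, which is the motivation behind the question above.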
[GitHub] [incubator-iceberg] moulimukherjee commented on issue #416: Adding docs on how to use custom catalog with iceberg
moulimukherjee commented on issue #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#issuecomment-525092405

PTAL again @rdblue @aokolnychyi
[GitHub] [incubator-iceberg] moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317856128

File path: site/docs/custom-catalog.md

````diff
@@ -0,0 +1,134 @@
+# Custom Catalog Implementation
+
+It's possible to read an iceberg table either from an hdfs path or from a hive table. It's also possible to use a custom metastore in place of hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide implementation on how to read and write metadata
+
+Example:
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+  private String dbName;
+  private String tableName;
+  private Configuration conf;
+
+  protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
````

Review comment: Updated
[GitHub] [incubator-iceberg] moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317855049

File path: site/docs/custom-catalog.md

````diff
@@ -0,0 +1,134 @@
+# Custom Catalog Implementation
+
+It's possible to read an iceberg table either from an hdfs path or from a hive table. It's also possible to use a custom metastore in place of hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide implementation on how to read and write metadata
+
+Example:
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+  private String dbName;
+  private String tableName;
+  private Configuration conf;
+
+  protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
+    super(conf);
+    this.conf = conf;
+    this.dbName = dbName;
+    this.tableName = tableName;
+  }
+
+  // The refresh method should provide implementation on how to get the metadata location
+  @Override
+  public TableMetadata refresh() {
````

Review comment: Updated to `doRefresh`
[GitHub] [incubator-iceberg] moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317851135

File path: site/docs/custom-catalog.md

````diff
@@ -0,0 +1,134 @@
+# Custom Catalog Implementation
+
+It's possible to read an iceberg table either from an hdfs path or from a hive table. It's also possible to use a custom metastore in place of hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide implementation on how to read and write metadata
+
+Example:
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+  private String dbName;
+  private String tableName;
+  private Configuration conf;
+
+  protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
````

Review comment: Ack.
[GitHub] [incubator-iceberg] moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317851046

File path: site/docs/custom-catalog.md

````diff
@@ -0,0 +1,134 @@
+# Custom Catalog Implementation
+
+It's possible to read an iceberg table either from an hdfs path or from a hive table. It's also possible to use a custom metastore in place of hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide implementation on how to read and write metadata
+
+Example:
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+  private String dbName;
+  private String tableName;
+  private Configuration conf;
+
+  protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
+    super(conf);
+    this.conf = conf;
+    this.dbName = dbName;
+    this.tableName = tableName;
+  }
+
+  // The refresh method should provide implementation on how to get the metadata location
+  @Override
+  public TableMetadata refresh() {
````

Review comment: Ack, I see that it's merged, will update this
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #388: Handle rollback in snapshot expiration
rdblue commented on a change in pull request #388: Handle rollback in snapshot expiration
URL: https://github.com/apache/incubator-iceberg/pull/388#discussion_r317843266

File path: core/src/main/java/org/apache/iceberg/RemoveSnapshots.java

```diff
@@ -120,56 +123,128 @@ public void commit() {
       }
     });

-    LOG.info("Committed snapshot changes; cleaning up expired manifests and data files.");
+    cleanExpiredSnapshots();
+  }

+  private void cleanExpiredSnapshots() {
     // clean up the expired snapshots:
     // 1. Get a list of the snapshots that were removed
     // 2. Delete any data files that were deleted by those snapshots and are not in the table
     // 3. Delete any manifests that are no longer used by current snapshots
     // 4. Delete the manifest lists
+    TableMetadata current = ops.refresh();
+
+    Set<Long> validIds = Sets.newHashSet();
+    for (Snapshot snapshot : current.snapshots()) {
+      validIds.add(snapshot.snapshotId());
+    }
+
+    Set<Long> expiredIds = Sets.newHashSet();
+    for (Snapshot snapshot : base.snapshots()) {
+      long snapshotId = snapshot.snapshotId();
+      if (!validIds.contains(snapshotId)) {
+        // the snapshot was expired
+        LOG.info("Expired snapshot: {}", snapshot);
+        expiredIds.add(snapshotId);
+      }
+    }
+
+    if (expiredIds.isEmpty()) {
+      // if no snapshots were expired, skip cleanup
+      return;
+    }
+
+    LOG.info("Committed snapshot changes; cleaning up expired manifests and data files.");
+
+    cleanExpiredFiles(current.snapshots(), validIds, expiredIds);
+  }
+
+  @SuppressWarnings("checkstyle:CyclomaticComplexity")
+  private void cleanExpiredFiles(List<Snapshot> snapshots, Set<Long> validIds, Set<Long> expiredIds) {
     // Reads and deletes are done using Tasks.foreach(...).suppressFailureWhenFinished to complete
     // as much of the delete work as possible and avoid orphaned data or manifest files.

-    TableMetadata current = ops.refresh();
-    Set<Long> currentIds = Sets.newHashSet();
-    Set<ManifestFile> currentManifests = Sets.newHashSet();
-    for (Snapshot snapshot : current.snapshots()) {
-      currentIds.add(snapshot.snapshotId());
-      currentManifests.addAll(snapshot.manifests());
+    // this is the set of ancestors of the current table state. when removing snapshots, this must
+    // only remove files that were deleted in an ancestor of the current table state to avoid
+    // physically deleting files that were logically deleted in a commit that was rolled back.
+    Set<Long> ancestorIds = Sets.newHashSet(SnapshotUtil.ancestorIds(base.currentSnapshot(), base::snapshot));
+
+    Set<ManifestFile> validManifests = Sets.newHashSet();
+    Set<ManifestFile> manifestsToScan = Sets.newHashSet();
+    for (Snapshot snapshot : snapshots) {
+      validIds.add(snapshot.snapshotId());
```

Review comment: You're right. I'll remove it.
[GitHub] [incubator-iceberg] Parth-Brahmbhatt commented on a change in pull request #388: Handle rollback in snapshot expiration
Parth-Brahmbhatt commented on a change in pull request #388: Handle rollback in snapshot expiration
URL: https://github.com/apache/incubator-iceberg/pull/388#discussion_r317838958

File path: core/src/main/java/org/apache/iceberg/RemoveSnapshots.java

```diff
@@ -120,56 +123,128 @@ public void commit() {
       }
     });

-    LOG.info("Committed snapshot changes; cleaning up expired manifests and data files.");
+    cleanExpiredSnapshots();
+  }

+  private void cleanExpiredSnapshots() {
     // clean up the expired snapshots:
     // 1. Get a list of the snapshots that were removed
     // 2. Delete any data files that were deleted by those snapshots and are not in the table
     // 3. Delete any manifests that are no longer used by current snapshots
     // 4. Delete the manifest lists
+    TableMetadata current = ops.refresh();
+
+    Set<Long> validIds = Sets.newHashSet();
+    for (Snapshot snapshot : current.snapshots()) {
+      validIds.add(snapshot.snapshotId());
+    }
+
+    Set<Long> expiredIds = Sets.newHashSet();
+    for (Snapshot snapshot : base.snapshots()) {
+      long snapshotId = snapshot.snapshotId();
+      if (!validIds.contains(snapshotId)) {
+        // the snapshot was expired
+        LOG.info("Expired snapshot: {}", snapshot);
+        expiredIds.add(snapshotId);
+      }
+    }
+
+    if (expiredIds.isEmpty()) {
+      // if no snapshots were expired, skip cleanup
+      return;
+    }
+
+    LOG.info("Committed snapshot changes; cleaning up expired manifests and data files.");
+
+    cleanExpiredFiles(current.snapshots(), validIds, expiredIds);
+  }
+
+  @SuppressWarnings("checkstyle:CyclomaticComplexity")
+  private void cleanExpiredFiles(List<Snapshot> snapshots, Set<Long> validIds, Set<Long> expiredIds) {
     // Reads and deletes are done using Tasks.foreach(...).suppressFailureWhenFinished to complete
     // as much of the delete work as possible and avoid orphaned data or manifest files.

-    TableMetadata current = ops.refresh();
-    Set<Long> currentIds = Sets.newHashSet();
-    Set<ManifestFile> currentManifests = Sets.newHashSet();
-    for (Snapshot snapshot : current.snapshots()) {
-      currentIds.add(snapshot.snapshotId());
-      currentManifests.addAll(snapshot.manifests());
+    // this is the set of ancestors of the current table state. when removing snapshots, this must
+    // only remove files that were deleted in an ancestor of the current table state to avoid
+    // physically deleting files that were logically deleted in a commit that was rolled back.
+    Set<Long> ancestorIds = Sets.newHashSet(SnapshotUtil.ancestorIds(base.currentSnapshot(), base::snapshot));
+
+    Set<ManifestFile> validManifests = Sets.newHashSet();
+    Set<ManifestFile> manifestsToScan = Sets.newHashSet();
+    for (Snapshot snapshot : snapshots) {
+      validIds.add(snapshot.snapshotId());
```

Review comment: this seems unnecessary, the passed in validIds should already have these snapshotIds.
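The ancestor-tracking idea in this hunk — only files deleted in an ancestor of the current snapshot may be physically removed — boils down to walking parent pointers from the current snapshot. A standalone sketch of that walk, using a plain child-to-parent id map instead of the real `SnapshotUtil`/`Snapshot` types:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class AncestorWalk {
    // Collect the ids on the path from currentId back to the root.
    // parentIds maps child -> parent; an absent key means the snapshot has no parent.
    static Set<Long> ancestorIds(long currentId, Map<Long, Long> parentIds) {
        Set<Long> ancestors = new HashSet<>();
        Long id = currentId;
        // add() returning false also guards against a malformed cyclic history
        while (id != null && ancestors.add(id)) {
            id = parentIds.get(id);
        }
        return ancestors;
    }
}
```

For a history where the current chain is 5 → 3 → 1 and snapshot 4 (parent 2 → 1) was rolled back, the walk from 5 never visits 4, which is exactly why files logically deleted by 4 must not be physically removed during expiration.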
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource
rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r317841160

File path: api/src/test/java/org/apache/iceberg/expressions/TestInclusiveMetricsEvaluator.java

```diff
@@ -355,4 +390,40 @@ public void testCaseInsensitiveIntegerNotEqRewritten() {
   public void testCaseSensitiveIntegerNotEqRewritten() {
     boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, not(equal("ID", 5)), true).eval(FILE);
   }
+
+  @Test
+  public void testStringStartsWith() {
+    boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "a"), true).eval(FILE);
+    Assert.assertTrue("Should read: no stats", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "a"), true).eval(FILE_2);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "aa"), true).eval(FILE_2);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "aaa"), true).eval(FILE_2);
```

Review comment: I don't think there is an equivalent test for the upper bound, where the bound is shorter so the startsWith prefix is truncated. Can you add one?
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource
rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r317841001

File path: api/src/test/java/org/apache/iceberg/expressions/TestInclusiveMetricsEvaluator.java

```diff
@@ -355,4 +390,40 @@ public void testCaseInsensitiveIntegerNotEqRewritten() {
   public void testCaseSensitiveIntegerNotEqRewritten() {
     boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, not(equal("ID", 5)), true).eval(FILE);
   }
+
+  @Test
+  public void testStringStartsWith() {
+    boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "a"), true).eval(FILE);
+    Assert.assertTrue("Should read: no stats", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "a"), true).eval(FILE_2);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "aa"), true).eval(FILE_2);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "aaa"), true).eval(FILE_2);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "1s"), true).eval(FILE_3);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "1str1x"), true).eval(FILE_3);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "ff"), true).eval(FILE_4);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "aB"), true).eval(FILE_2);
```

Review comment: Nit: For the purpose of testing, I think it is more clear to use all lower case or all upper case. It may not be obvious that `"B" < "a"`.
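The case rdblue asks about — a column bound that is shorter than the `startsWith` prefix, e.g. because the stored bound was truncated — reduces to comparing the prefix against each bound over their common length. A simplified inclusive check in that spirit (a sketch, not the actual `InclusiveMetricsEvaluator` implementation):

```java
class StartsWithCheck {
    // Conservative test: could a file with min value `lower` and max value `upper`
    // contain any string starting with `prefix`? Compares only over the common
    // length, so a bound shorter than the prefix never rules the file out by length alone.
    static boolean mightMatch(String lower, String upper, String prefix) {
        int lowerLen = Math.min(lower.length(), prefix.length());
        // every value is >= lower: if lower's slice already sorts after the prefix, no match
        if (lower.substring(0, lowerLen).compareTo(prefix.substring(0, lowerLen)) > 0) {
            return false;
        }
        int upperLen = Math.min(upper.length(), prefix.length());
        // every value is <= upper: if upper's slice sorts before the prefix, no match
        if (upper.substring(0, upperLen).compareTo(prefix.substring(0, upperLen)) < 0) {
            return false;
        }
        return true;
    }
}
```

Note the short-upper-bound case: with bounds `["a", "b"]` and prefix `"bb"`, only the first character is compared, so the file is (correctly) kept — the true maximum could be `"bz"` before truncation. This is the behavior the requested test would pin down.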
[GitHub] [incubator-iceberg] rdblue opened a new pull request #418: Add missing manifest cleanup tests
rdblue opened a new pull request #418: Add missing manifest cleanup tests
URL: https://github.com/apache/incubator-iceberg/pull/418
[GitHub] [incubator-iceberg] moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317833163

File path: site/docs/custom-catalog.md

````diff
@@ -0,0 +1,134 @@
+# Custom Catalog Implementation
+
+It's possible to read an iceberg table either from an hdfs path or from a hive table. It's also possible to use a custom metastore in place of hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide implementation on how to read and write metadata
+
+Example:
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+  private String dbName;
+  private String tableName;
+  private Configuration conf;
+
+  protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
+    super(conf);
+    this.conf = conf;
+    this.dbName = dbName;
+    this.tableName = tableName;
+  }
+
+  // The refresh method should provide implementation on how to get the metadata location
+  @Override
+  public TableMetadata refresh() {
+
+    // Example custom service which returns the metadata location given a dbName and tableName
+    val metadataLocation = CustomService.getMetadataForTable(conf, dbName, tableName)
+
+    // Use existing method to refresh metadata
+    refreshFromMetadataLocation(metadataLocation);
+
+    // Use existing method to return the table metadata
+    return current();
+  }
+
+  // The commit method should provide implementation on how to persist the metadata location
+  @Override
+  public void commit(TableMetadata base, TableMetadata metadata) {
+    // if the metadata is already out of date, reject it
+    if (base != current()) {
+      throw new CommitFailedException("Cannot commit: stale table metadata for %s.%s", dbName, tableName);
+    }
+
+    // if the metadata is not changed, return early
+    if (base == metadata) {
+      return;
+    }
+
+    // Write new metadata
+    String newMetadataLocation = writeNewMetadata(metadata, currentVersion() + 1);
+
+    // Example custom service which updates the metadata location for the given db and table
+    CustomService.updateMetadataLocation(dbName, tableName, newMetadataLocation);
````

Review comment: Ack.
[GitHub] [incubator-iceberg] rdblue closed issue #261: Support create and replace transactions in Catalog
rdblue closed issue #261: Support create and replace transactions in Catalog
URL: https://github.com/apache/incubator-iceberg/issues/261
[GitHub] [incubator-iceberg] rdblue merged pull request #362: Support create and replace transactions in Catalog
rdblue merged pull request #362: Support create and replace transactions in Catalog
URL: https://github.com/apache/incubator-iceberg/pull/362
[GitHub] [incubator-iceberg] rdblue commented on issue #362: Support create and replace transactions in Catalog
rdblue commented on issue #362: Support create and replace transactions in Catalog
URL: https://github.com/apache/incubator-iceberg/pull/362#issuecomment-525064046

+1 Thanks for fixing this, @aokolnychyi!
[GitHub] [incubator-iceberg] shardulm94 opened a new issue #417: Adding support for time-based partitioning on long column type
shardulm94 opened a new issue #417: Adding support for time-based partitioning on long column type
URL: https://github.com/apache/incubator-iceberg/issues/417

We have datasets which store milliseconds since epoch as a long column type. This column is currently being used for time-based partitioning. However, Iceberg only supports time-based partitioning on timestamp and date column types with microsecond precision. Also, the transforms are all internal to Iceberg and not something that can be provided externally. How can we go about adding support for such a case in our environment? Do you think this is something that can be added to Iceberg itself?
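One workaround for the use case described in this issue — not a built-in Iceberg transform — is to derive a day ordinal from the epoch-milliseconds long before partitioning on it. The class and method names below are illustrative; the arithmetic is just floor division by the number of milliseconds in a day:

```java
import java.util.concurrent.TimeUnit;

class MillisPartition {
    // Days since 1970-01-01 (UTC) for an epoch-milliseconds timestamp.
    // Math.floorDiv (rather than /) keeps pre-epoch values in the correct day bucket.
    static long epochMillisToDay(long millis) {
        return Math.floorDiv(millis, TimeUnit.DAYS.toMillis(1));
    }
}
```

Partitioning on such a derived column mirrors what Iceberg's own `day` transform does for microsecond timestamps, so data written this way could later migrate to a native transform if one is added.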
[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
aokolnychyi commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317827313

File path: site/docs/custom-catalog.md

````diff
@@ -0,0 +1,134 @@
+# Custom Catalog Implementation
+
+It's possible to read an iceberg table either from an hdfs path or from a hive table. It's also possible to use a custom metastore in place of hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide implementation on how to read and write metadata
+
+Example:
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+  private String dbName;
+  private String tableName;
+  private Configuration conf;
+
+  protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
+    super(conf);
+    this.conf = conf;
+    this.dbName = dbName;
+    this.tableName = tableName;
+  }
+
+  // The refresh method should provide implementation on how to get the metadata location
+  @Override
+  public TableMetadata refresh() {
````

Review comment: I think #362 is quite close. Shall we merge that first and update this one to use `doRefresh`?
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317824379 ## File path: site/docs/custom-catalog.md ## @@ -0,0 +1,134 @@ +# Custom Catalog Implementation + +It's possible to read an iceberg table either from an hdfs path or from a hive table. It's also possible to use a custom metastore in place of hive. The steps to do that are as follows. + +- [Custom TableOperations](#custom-table-operations-implementation) +- [Custom Catalog](#custom-table-implementation) +- [Custom IcebergSource](#custom-icebergsource) + +### Custom table operations implementation +Extend `BaseMetastoreTableOperations` to provide implementation on how to read and write metadata + +Example: +```java +class CustomTableOperations extends BaseMetastoreTableOperations { + private String dbName; + private String tableName; + private Configuration conf; + + protected CustomTableOperations(Configuration conf, String dbName, String tableName) { +super(conf); +this.conf = conf; +this.dbName = dbName; +this.tableName = tableName; + } + + // The refresh method should provide implementation on how to get the metadata location + @Override + public TableMetadata refresh() { + +// Example custom service which returns the metadata location given a dbName and tableName +val metadataLocation = CustomService.getMetadataForTable(conf, dbName, tableName) + +// Use existing method to refresh metadata +refreshFromMetadataLocation(metadataLocation); + +// Use existing method to return the table metadata +return current(); + } + + // The commit method should provide implementation on how to persist the metadata location + @Override + public void commit(TableMetadata base, TableMetadata metadata) { +// if the metadata is already out of date, reject it +if (base != current()) { + throw new CommitFailedException("Cannot commit: stale table metadata for %s.%s", dbName, tableName); +} + +// if the 
metadata is not changed, return early +if (base == metadata) { + return; +} + +// Write new metadata +String newMetadataLocation = writeNewMetadata(metadata, currentVersion() + 1); + +// Example custom service which updates the metadata location for the given db and table +CustomService.updateMetadataLocation(dbName, tableName, newMetadataLocation); + +// Use existing method to request a refresh +requestRefresh(); + } +} +``` + +### Custom table implementation +Extend `BaseMetastoreCatalog` to provide default warehouse locations and instantiate `CustomTableOperations` + +Example: +```java +public class CustomCatalog extends BaseMetastoreCatalog { + + private Configuration configuration; + + public CustomCatalog(Configuration configuration) { +this.configuration = configuration; + } + + @Override + protected TableOperations newTableOps(TableIdentifier tableIdentifier) { +String dbName = tableIdentifier.namespace().level(0); +String tableName = tableIdentifier.name(); +// instantiate the CustomTableOperations +return new CustomTableOperations(configuration, dbName, tableName); + } + + @Override + protected String defaultWarehouseLocation(TableIdentifier tableIdentifier) { + +// Can choose to use any other configuration name +String tableLocation = configuration.get("custom.iceberg.table.location"); Review comment: This is a warehouse location, right? If so, it may be more clear to use `"custom.iceberg.warehouse.location"` instead.
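The distinction rdblue points out can be illustrated with a hypothetical helper: the config key names a warehouse root, and the catalog derives each table's default location under it. The `<db>.db/<table>` layout below is an assumption borrowed from the Hive-style warehouse convention, not a fixed API.

```java
// Hypothetical sketch: derive a table's default location from a warehouse
// root. The "<db>.db/<table>" layout mirrors the Hive-style convention;
// the key name and layout are assumptions, not Iceberg API.
public class WarehouseLocations {
    static String defaultWarehouseLocation(String warehouseRoot, String dbName, String tableName) {
        return String.format("%s/%s.db/%s", warehouseRoot, dbName, tableName);
    }

    public static void main(String[] args) {
        // e.g. configuration.get("custom.iceberg.warehouse.location") -> warehouseRoot
        System.out.println(defaultWarehouseLocation("hdfs://nn/warehouse", "db", "events"));
    }
}
```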
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317824181 ## File path: site/docs/custom-catalog.md ## @@ -0,0 +1,134 @@ +# Custom Catalog Implementation + +It's possible to read an iceberg table either from an hdfs path or from a hive table. It's also possible to use a custom metastore in place of hive. The steps to do that are as follows. + +- [Custom TableOperations](#custom-table-operations-implementation) +- [Custom Catalog](#custom-table-implementation) +- [Custom IcebergSource](#custom-icebergsource) + +### Custom table operations implementation +Extend `BaseMetastoreTableOperations` to provide implementation on how to read and write metadata + +Example: +```java +class CustomTableOperations extends BaseMetastoreTableOperations { + private String dbName; + private String tableName; + private Configuration conf; + + protected CustomTableOperations(Configuration conf, String dbName, String tableName) { Review comment: This is probably a good thing to note. This is how you can customize table implementations: `FileIO` and `LocationProvider` can be defaulted, or set to new implementations by returning them from this class.
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317823840 ## File path: site/docs/custom-catalog.md ## @@ -0,0 +1,134 @@ +# Custom Catalog Implementation + +It's possible to read an iceberg table either from an hdfs path or from a hive table. It's also possible to use a custom metastore in place of hive. The steps to do that are as follows. + +- [Custom TableOperations](#custom-table-operations-implementation) +- [Custom Catalog](#custom-table-implementation) +- [Custom IcebergSource](#custom-icebergsource) + +### Custom table operations implementation
Extend `BaseMetastoreTableOperations` to provide implementation on how to read and write metadata + +Example: +```java +class CustomTableOperations extends BaseMetastoreTableOperations { + private String dbName; + private String tableName; + private Configuration conf; + + protected CustomTableOperations(Configuration conf, String dbName, String tableName) { Review comment: `BaseMetastoreTableOperations` was recently refactored to avoid passing a Hadoop `Configuration`. So now this also requires implementing `io` to return a `HadoopFileIO`, like this: https://github.com/apache/incubator-iceberg/blob/master/hive/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java#L79-L85
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317822579 ## File path: site/docs/custom-catalog.md ## @@ -0,0 +1,134 @@ +# Custom Catalog Implementation + +It's possible to read an iceberg table either from an hdfs path or from a hive table. It's also possible to use a custom metastore in place of hive. The steps to do that are as follows. + +- [Custom TableOperations](#custom-table-operations-implementation) +- [Custom Catalog](#custom-table-implementation) +- [Custom IcebergSource](#custom-icebergsource) + +### Custom table operations implementation +Extend `BaseMetastoreTableOperations` to provide implementation on how to read and write metadata + +Example: +```java +class CustomTableOperations extends BaseMetastoreTableOperations { + private String dbName; + private String tableName; + private Configuration conf; + + protected CustomTableOperations(Configuration conf, String dbName, String tableName) { +super(conf); +this.conf = conf; +this.dbName = dbName; +this.tableName = tableName; + } + + // The refresh method should provide implementation on how to get the metadata location + @Override + public TableMetadata refresh() { + +// Example custom service which returns the metadata location given a dbName and tableName +val metadataLocation = CustomService.getMetadataForTable(conf, dbName, tableName) + +// Use existing method to refresh metadata +refreshFromMetadataLocation(metadataLocation); + +// Use existing method to return the table metadata +return current(); + } + + // The commit method should provide implementation on how to persist the metadata location + @Override + public void commit(TableMetadata base, TableMetadata metadata) { +// if the metadata is already out of date, reject it +if (base != current()) { + throw new CommitFailedException("Cannot commit: stale table metadata for %s.%s", dbName, tableName); +} + +// if the 
metadata is not changed, return early +if (base == metadata) { + return; +} + +// Write new metadata +String newMetadataLocation = writeNewMetadata(metadata, currentVersion() + 1); + +// Example custom service which updates the metadata location for the given db and table +CustomService.updateMetadataLocation(dbName, tableName, newMetadataLocation); Review comment: This actually needs to pass the current metadata location to `CustomService`. Otherwise, `CustomService` can't guarantee that the location update is an atomic swap, and the atomic swap is required for consistency. Without it, two processes could each start a change, validate that metadata hasn't changed, and both call this update at the same time. The second one would win because the update has no way to reject a commit that isn't based on the other process's changes.
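The atomic-swap contract described in this review can be modeled in isolation, with a `java.util.concurrent.atomic.AtomicReference` standing in for the hypothetical `CustomService`. The point is that the committer must pass the metadata location its changes were based on, so a stale commit fails instead of silently winning. Nothing below is Iceberg API; it only models the required semantics.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch: compare-and-swap semantics for the metadata-location pointer.
// An AtomicReference stands in for the hypothetical CustomService backend.
public class MetadataPointer {
    private final AtomicReference<String> location;

    MetadataPointer(String initialLocation) {
        this.location = new AtomicReference<>(initialLocation);
    }

    // succeeds only if no other commit landed since expectedCurrent was read
    boolean updateMetadataLocation(String expectedCurrent, String newLocation) {
        return location.compareAndSet(expectedCurrent, newLocation);
    }

    public static void main(String[] args) {
        MetadataPointer svc = new MetadataPointer("v1.metadata.json");
        // two writers both read v1 and race; only one swap can win
        System.out.println(svc.updateMetadataLocation("v1.metadata.json", "v2-a.metadata.json")); // true
        System.out.println(svc.updateMetadataLocation("v1.metadata.json", "v2-b.metadata.json")); // false: stale base
    }
}
```

A losing writer would then refresh, re-validate its changes against the new metadata, and retry the swap.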
[GitHub] [incubator-iceberg] moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317815623 ## File path: site/docs/custom-catalog.md ## @@ -0,0 +1,138 @@ +# Custom Catalog Implementation + +It's possible to read an iceberg table either from an hdfs path or from a hive table. It's also possible to use a custom metastore in place of hive. The steps to do that are as follows. + +- [Custom TableOperations](#custom-table-operations-implementation) +- [Custom Catalog](#custom-table-implementation) +- [Custom IcebergSource](#custom-icebergsource) + +### Custom table operations implementation +Extend `BaseMetastoreTableOperations` to provide implementation on how to read and write metadata + +Example: +```java + +class CustomTableOperations extends BaseMetastoreTableOperations { +private String dbName; +private String tableName; +private Configuration conf; + +protected CustomTableOperations(Configuration conf, String dbName, String tableName) { +super(conf); +this.conf = conf; +this.dbName = dbName; +this.tableName = tableName; +} + +// The refresh method should provide implementation on how to get the metadata location +@Override +public TableMetadata refresh() { + +// Example custom service which returns the metadata location given a dbName and tableName +val metadataLocation = CustomService.getMetadataForTable(conf, dbName, tableName) Review comment: Fixed
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #362: Support create and replace transactions in Catalog
rdblue commented on a change in pull request #362: Support create and replace transactions in Catalog URL: https://github.com/apache/incubator-iceberg/pull/362#discussion_r317815165 ## File path: core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java ## @@ -66,6 +67,35 @@ public int currentVersion() { return version; } + @Override + public TableMetadata refresh() { Review comment: FYI @moulimukherjee and @aokolnychyi: Mouli just added docs for implementing your own `BaseMetastoreTableOperations`. We'll want to update those for this change to override `doRefresh` at some point, depending on when this makes it in.
[GitHub] [incubator-iceberg] moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317814151 ## File path: site/docs/custom-catalog.md ## @@ -0,0 +1,138 @@ +# Custom Catalog Implementation + +It's possible to read an iceberg table either from an hdfs path or from a hive table. It's also possible to use a custom metastore in place of hive. The steps to do that are as follows. + +- [Custom TableOperations](#custom-table-operations-implementation) +- [Custom Catalog](#custom-table-implementation) +- [Custom IcebergSource](#custom-icebergsource) + +### Custom table operations implementation +Extend `BaseMetastoreTableOperations` to provide implementation on how to read and write metadata + +Example: +```java + +class CustomTableOperations extends BaseMetastoreTableOperations { +private String dbName; +private String tableName; +private Configuration conf; + +protected CustomTableOperations(Configuration conf, String dbName, String tableName) { +super(conf); +this.conf = conf; +this.dbName = dbName; +this.tableName = tableName; +} + +// The refresh method should provide implementation on how to get the metadata location +@Override +public TableMetadata refresh() { + +// Example custom service which returns the metadata location given a dbName and tableName +val metadataLocation = CustomService.getMetadataForTable(conf, dbName, tableName) Review comment: ah, correcting it
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317813862 ## File path: site/docs/custom-catalog.md ## @@ -0,0 +1,138 @@ +# Custom Catalog Implementation + +It's possible to read an iceberg table either from an hdfs path or from a hive table. It's also possible to use a custom metastore in place of hive. The steps to do that are as follows. + +- [Custom TableOperations](#custom-table-operations-implementation) +- [Custom Catalog](#custom-table-implementation) +- [Custom IcebergSource](#custom-icebergsource) + +### Custom table operations implementation +Extend `BaseMetastoreTableOperations` to provide implementation on how to read and write metadata + +Example: +```java + +class CustomTableOperations extends BaseMetastoreTableOperations { +private String dbName; +private String tableName; +private Configuration conf; + +protected CustomTableOperations(Configuration conf, String dbName, String tableName) { +super(conf); +this.conf = conf; +this.dbName = dbName; +this.tableName = tableName; +} + +// The refresh method should provide implementation on how to get the metadata location +@Override +public TableMetadata refresh() { + +// Example custom service which returns the metadata location given a dbName and tableName +val metadataLocation = CustomService.getMetadataForTable(conf, dbName, tableName) Review comment: Nit: Looks like indentation is off in this file.
[GitHub] [incubator-iceberg] moulimukherjee opened a new pull request #416: Adding docs on how to use custom catalog with iceberg
moulimukherjee opened a new pull request #416: Adding docs on how to use custom catalog with iceberg URL: https://github.com/apache/incubator-iceberg/pull/416 Adding docs on how to use custom catalog with iceberg r? @rdblue
[GitHub] [incubator-iceberg] rdblue commented on issue #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on issue #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#issuecomment-525039053 +1. I'm going to merge this. Thanks for the thorough unit tests, those look great.
[GitHub] [incubator-iceberg] rdblue merged pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue merged pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351
[GitHub] [incubator-iceberg] rdblue closed issue #316: Provide an API to modify records within files
rdblue closed issue #316: Provide an API to modify records within files URL: https://github.com/apache/incubator-iceberg/issues/316
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r317800614 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() { } } +if (conflictDetectionFilter != null) { + PartitionSpec spec = writeSpec(); + Expression inclusiveExpr = Projections.inclusive(spec).project(conflictDetectionFilter); + Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr); + + InclusiveMetricsEvaluator metrics = new InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter); + + List newFiles = collectNewFiles(base); + for (DataFile newFile : newFiles) { +ValidationException.check( +!inclusive.eval(newFile.partition()) || !metrics.eval(newFile), +"A conflicting file was appended that matches filter '%s': %s", +conflictDetectionFilter, newFile.path()); + } +} + return super.apply(base); } + + private List collectNewFiles(TableMetadata meta) { +List newFiles = new ArrayList<>(); + +Long currentSnapshotId = meta.currentSnapshot() == null ? null : meta.currentSnapshot().snapshotId(); +while (currentSnapshotId != null && !currentSnapshotId.equals(readSnapshotId)) { Review comment: Okay, it should be fine if readSnapshotId is always explicitly set.
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r317799344 ## File path: core/src/main/java/org/apache/iceberg/OverwriteData.java ## @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() { } } +if (conflictDetectionFilter != null) { + PartitionSpec spec = writeSpec(); + Expression inclusiveExpr = Projections.inclusive(spec).project(conflictDetectionFilter); + Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr); + + InclusiveMetricsEvaluator metrics = new InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter); + + List newFiles = collectNewFiles(base); + for (DataFile newFile : newFiles) { +ValidationException.check( +!inclusive.eval(newFile.partition()) || !metrics.eval(newFile), +"A conflicting file was appended that matches filter '%s': %s", +conflictDetectionFilter, newFile.path()); + } +} + return super.apply(base); } + + private List collectNewFiles(TableMetadata meta) { +List newFiles = new ArrayList<>(); + +Long currentSnapshotId = meta.currentSnapshot() == null ? null : meta.currentSnapshot().snapshotId(); Review comment: Right, that works.
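The loop being reviewed in `collectNewFiles` can be illustrated standalone. A `Map` stands in for `TableMetadata`'s snapshot-to-parent lookup; walking parent pointers from the current snapshot back to `readSnapshotId` yields every snapshot committed in between, and terminates at `null` when history runs out, which is why an explicitly set `readSnapshotId` matters. This is an illustration of the traversal, not Iceberg code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch: collect snapshots committed after readSnapshotId by following
// parent pointers from the current snapshot. A Map<child, parent> stands
// in for TableMetadata's snapshot lookup.
public class SnapshotWalk {
    static List<Long> snapshotsSince(Map<Long, Long> parentOf, Long currentSnapshotId, Long readSnapshotId) {
        List<Long> newSnapshots = new ArrayList<>();
        while (currentSnapshotId != null && !currentSnapshotId.equals(readSnapshotId)) {
            newSnapshots.add(currentSnapshotId);
            currentSnapshotId = parentOf.get(currentSnapshotId); // null once history runs out
        }
        return newSnapshots;
    }

    public static void main(String[] args) {
        // chain: 3 -> 2 -> 1; the operation read from snapshot 1
        System.out.println(snapshotsSince(Map.of(3L, 2L, 2L, 1L), 3L, 1L)); // snapshots 3 and 2 are new
    }
}
```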
[GitHub] [incubator-iceberg] aokolnychyi commented on issue #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
aokolnychyi commented on issue #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#issuecomment-525028721 I've updated this PR but there are still a couple of open questions.
[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
aokolnychyi commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r317793406 ## File path: api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java ## @@ -52,7 +52,7 @@ private MetricsEvalVisitor visitor() { return visitors.get(); } - InclusiveMetricsEvaluator(Schema schema, Expression unbound) { + public InclusiveMetricsEvaluator(Schema schema, Expression unbound) { Review comment: Let's address this in #413 together with other places.
[GitHub] [incubator-iceberg] aokolnychyi commented on issue #398: Push down StringStartsWith in Spark IcebergSource
aokolnychyi commented on issue #398: Push down StringStartsWith in Spark IcebergSource URL: https://github.com/apache/incubator-iceberg/pull/398#issuecomment-524985833 I've updated this PR to use `toByteBuffer` in `Literal`.
[GitHub] [incubator-iceberg] rdblue merged pull request #414: Add toByteBuffer to Literal
rdblue merged pull request #414: Add toByteBuffer to Literal URL: https://github.com/apache/incubator-iceberg/pull/414
[GitHub] [incubator-iceberg] rdblue commented on issue #414: Add toByteBuffer to Literal
rdblue commented on issue #414: Add toByteBuffer to Literal URL: https://github.com/apache/incubator-iceberg/pull/414#issuecomment-524964776 +1. I'll merge this when tests pass.
[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #414: Add toByteBuffer to Literal
aokolnychyi commented on a change in pull request #414: Add toByteBuffer to Literal URL: https://github.com/apache/incubator-iceberg/pull/414#discussion_r317723243 ## File path: api/src/main/java/org/apache/iceberg/expressions/Literal.java ## @@ -102,4 +102,8 @@ * @return a comparator for T objects */ Comparator comparator(); + + default ByteBuffer toByteBuffer() { Review comment: Added. Let me know what you think.
[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #414: Add toByteBuffer to Literal
aokolnychyi commented on a change in pull request #414: Add toByteBuffer to Literal URL: https://github.com/apache/incubator-iceberg/pull/414#discussion_r317713814 ## File path: api/src/main/java/org/apache/iceberg/expressions/Literal.java ## @@ -102,4 +102,8 @@ * @return a comparator for T objects */ Comparator comparator(); + + default ByteBuffer toByteBuffer() { +throw new UnsupportedOperationException("toByteBuffer is not supported"); Review comment: It is left unimplemented in `AboveMax` and `BelowMin`.
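As a rough illustration of what `toByteBuffer` yields for a long literal: Iceberg's single-value serialization stores primitives as little-endian bytes. The class and method below are stand-ins for illustration, not the `Literal` API itself.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: little-endian single-value serialization of a long, as Iceberg's
// spec describes. Illustrative only; not the Literal.toByteBuffer implementation.
public class LongLiteralBytes {
    static ByteBuffer toByteBuffer(long value) {
        ByteBuffer buf = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
        buf.putLong(0, value);
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer buf = toByteBuffer(1L);
        System.out.println(buf.get(0)); // least-significant byte comes first
    }
}
```

Sentinel literals like `AboveMax` and `BelowMin` have no byte representation, which is why they leave the default `UnsupportedOperationException` in place.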
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table
rdblue commented on a change in pull request #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r317712572 ## File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala ## @@ -297,5 +302,81 @@ object SparkTableUtil { ) } } + + private def buildManifest(table: Table, + sparkDataFiles: Seq[SparkDataFile], + partitionSpec: PartitionSpec): ManifestFile = { +val outputFile = table.io + .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString)) Review comment: Maybe you should have users pass in a location to use for this?
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table
rdblue commented on a change in pull request #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r317712345 ## File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala ## @@ -297,5 +302,81 @@ object SparkTableUtil { ) } } + + private def buildManifest(table: Table, + sparkDataFiles: Seq[SparkDataFile], + partitionSpec: PartitionSpec): ManifestFile = { +val outputFile = table.io + .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString)) Review comment: This location must be a shared location, either in HDFS or a distributed FS like S3. It can't be on the local host.
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table
rdblue commented on a change in pull request #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r317711705 ## File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala ## @@ -297,5 +302,81 @@ object SparkTableUtil { ) } } + + private def buildManifest(table: Table, + sparkDataFiles: Seq[SparkDataFile], + partitionSpec: PartitionSpec): ManifestFile = { +val outputFile = table.io + .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString)) +val writer = ManifestWriter.write(partitionSpec, outputFile) +try { + sparkDataFiles.foreach { file => +writer.add(file.toDataFile(partitionSpec)) + } +} finally { + writer.close() +} + +writer.toManifestFile + } + + /** + * Import a spark table to a iceberg table. + * + * The import uses the spark session to get table metadata. It assumes no + * operation is going on original table and target table and thus is not + * thread-safe. + * + * @param source the database name of the table to be import + * @param location the location used to store table metadata + * + * @return table the imported table + */ + def importSparkTable(source: TableIdentifier, location: String): Table = { +val sparkSession = SparkSession.builder().getOrCreate() +import sparkSession.sqlContext.implicits._ + +val dbName = source.database.getOrElse("default") +val tableName = source.table + +if (!sparkSession.catalog.tableExists(dbName, tableName)) { + throw new NoSuchTableException(s"Table $dbName.$tableName does not exist") +} + +val partitionSpec = SparkSchemaUtil.specForTable(sparkSession, s"$dbName.$tableName") +val conf = sparkSession.sparkContext.hadoopConfiguration +val tables = new HadoopTables(conf) +val schema = SparkSchemaUtil.schemaForTable(sparkSession, s"$dbName.$tableName") +val table = tables.create(schema, partitionSpec, ImmutableMap.of(), location) +val appender = table.newAppend() + +if (partitionSpec == PartitionSpec.unpartitioned) { + val 
tableMetadata = sparkSession.sessionState.catalog.getTableMetadata(source) + val format = tableMetadata.provider.getOrElse("none") + + if (format != "avro" && format != "parquet" && format != "orc") { +throw new UnsupportedOperationException(s"Unsupported format: $format") + } + listPartition(Map.empty[String, String], tableMetadata.location.toString, +format).foreach{f => appender.appendFile(f.toDataFile(PartitionSpec.unpartitioned))} + appender.commit() +} else { + val partitions = partitionDF(sparkSession, s"$dbName.$tableName") + partitions.flatMap { row => +listPartition(row.getMap[String, String](0).toMap, row.getString(1), row.getString(2)) + }.coalesce(1).mapPartitions { Review comment: And here's the `Manifest` case class:

```scala
private[sql] case class Manifest(location: String, fileLength: Long, specId: Int) {
  def toManifestFile: ManifestFile = new ManifestFile {
    override def path: String = location
    override def length: Long = fileLength
    override def partitionSpecId: Int = specId
    override def snapshotId: java.lang.Long = null
    override def addedFilesCount: Integer = null
    override def existingFilesCount: Integer = null
    override def deletedFilesCount: Integer = null
    override def partitions: java.util.List[ManifestFile.PartitionFieldSummary] = null
    override def copy: ManifestFile = this
  }
}
```
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table
rdblue commented on a change in pull request #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r317711446 ## File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala ## @@ -297,5 +302,81 @@ object SparkTableUtil { ) } } + + private def buildManifest(table: Table, + sparkDataFiles: Seq[SparkDataFile], + partitionSpec: PartitionSpec): ManifestFile = { +val outputFile = table.io + .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString)) +val writer = ManifestWriter.write(partitionSpec, outputFile) +try { + sparkDataFiles.foreach { file => +writer.add(file.toDataFile(partitionSpec)) + } +} finally { + writer.close() +} + +writer.toManifestFile + } + + /** + * Import a spark table to a iceberg table. + * + * The import uses the spark session to get table metadata. It assumes no + * operation is going on original table and target table and thus is not + * thread-safe. + * + * @param source the database name of the table to be import + * @param location the location used to store table metadata + * + * @return table the imported table + */ + def importSparkTable(source: TableIdentifier, location: String): Table = { +val sparkSession = SparkSession.builder().getOrCreate() +import sparkSession.sqlContext.implicits._ + +val dbName = source.database.getOrElse("default") +val tableName = source.table + +if (!sparkSession.catalog.tableExists(dbName, tableName)) { + throw new NoSuchTableException(s"Table $dbName.$tableName does not exist") +} + +val partitionSpec = SparkSchemaUtil.specForTable(sparkSession, s"$dbName.$tableName") +val conf = sparkSession.sparkContext.hadoopConfiguration +val tables = new HadoopTables(conf) +val schema = SparkSchemaUtil.schemaForTable(sparkSession, s"$dbName.$tableName") +val table = tables.create(schema, partitionSpec, ImmutableMap.of(), location) +val appender = table.newAppend() + +if (partitionSpec == PartitionSpec.unpartitioned) { + val 
tableMetadata = sparkSession.sessionState.catalog.getTableMetadata(source) + val format = tableMetadata.provider.getOrElse("none") + + if (format != "avro" && format != "parquet" && format != "orc") { +throw new UnsupportedOperationException(s"Unsupported format: $format") + } + listPartition(Map.empty[String, String], tableMetadata.location.toString, +format).foreach{f => appender.appendFile(f.toDataFile(PartitionSpec.unpartitioned))} + appender.commit() +} else { + val partitions = partitionDF(sparkSession, s"$dbName.$tableName") + partitions.flatMap { row => +listPartition(row.getMap[String, String](0).toMap, row.getString(1), row.getString(2)) + }.coalesce(1).mapPartitions { Review comment: Here's `writeManifest`:

```scala
def writeManifest(
    conf: SerializableConfiguration,
    spec: PartitionSpec,
    basePath: String): Iterator[SparkDataFile] => Iterator[Manifest] = {
  files =>
    if (files.hasNext) {
      val ctx = TaskContext.get()
      val manifestLocation = new Path(basePath,
        s"stage-${ctx.stageId()}-task-${ctx.taskAttemptId()}-manifest.avro").toString
      val io = new HadoopFileIO(conf.value)
      val writer = ManifestWriter.write(spec, io.newOutputFile(manifestLocation))
      try {
        files.foreach { file =>
          writer.add(file.toDataFile(spec))
        }
      } finally {
        writer.close()
      }
      val manifest = writer.toManifestFile
      Seq(Manifest(manifest.path, manifest.length, manifest.partitionSpecId)).iterator
    } else {
      Seq.empty.iterator
    }
}
```
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table
rdblue commented on a change in pull request #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r317710889 ## File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala ## @@ -297,5 +302,81 @@ object SparkTableUtil { ) } } + + private def buildManifest(table: Table, + sparkDataFiles: Seq[SparkDataFile], + partitionSpec: PartitionSpec): ManifestFile = { +val outputFile = table.io + .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString)) +val writer = ManifestWriter.write(partitionSpec, outputFile) +try { + sparkDataFiles.foreach { file => +writer.add(file.toDataFile(partitionSpec)) + } +} finally { + writer.close() +} + +writer.toManifestFile + } + + /** + * Import a spark table to a iceberg table. + * + * The import uses the spark session to get table metadata. It assumes no + * operation is going on original table and target table and thus is not + * thread-safe. + * + * @param source the database name of the table to be import + * @param location the location used to store table metadata + * + * @return table the imported table + */ + def importSparkTable(source: TableIdentifier, location: String): Table = { +val sparkSession = SparkSession.builder().getOrCreate() +import sparkSession.sqlContext.implicits._ + +val dbName = source.database.getOrElse("default") +val tableName = source.table + +if (!sparkSession.catalog.tableExists(dbName, tableName)) { + throw new NoSuchTableException(s"Table $dbName.$tableName does not exist") +} + +val partitionSpec = SparkSchemaUtil.specForTable(sparkSession, s"$dbName.$tableName") +val conf = sparkSession.sparkContext.hadoopConfiguration +val tables = new HadoopTables(conf) +val schema = SparkSchemaUtil.schemaForTable(sparkSession, s"$dbName.$tableName") +val table = tables.create(schema, partitionSpec, ImmutableMap.of(), location) +val appender = table.newAppend() + +if (partitionSpec == PartitionSpec.unpartitioned) { + val 
tableMetadata = sparkSession.sessionState.catalog.getTableMetadata(source) + val format = tableMetadata.provider.getOrElse("none") + + if (format != "avro" && format != "parquet" && format != "orc") { +throw new UnsupportedOperationException(s"Unsupported format: $format") + } + listPartition(Map.empty[String, String], tableMetadata.location.toString, +format).foreach{f => appender.appendFile(f.toDataFile(PartitionSpec.unpartitioned))} + appender.commit() +} else { + val partitions = partitionDF(sparkSession, s"$dbName.$tableName") + partitions.flatMap { row => +listPartition(row.getMap[String, String](0).toMap, row.getString(1), row.getString(2)) + }.coalesce(1).mapPartitions { +files => + val tables = new HadoopTables(new Configuration()) Review comment: Can you use `SerializableConfiguration` instead? This should use Spark's `hadoopConfiguration`.
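The `SerializableConfiguration` suggestion matters because Hadoop's `Configuration` is not `Serializable`, so it cannot be captured directly in a closure that Spark ships to executors. Below is a minimal, hedged sketch of the wrapper pattern in plain Java: the non-serializable object's key/value state is written out instead of the object itself, and the object is rebuilt on read. `FakeConf` and `SerializableConf` are hypothetical stand-ins for the real Hadoop and Spark classes.

```java
import java.io.*;
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-in for Hadoop's Configuration: deliberately NOT Serializable.
class FakeConf {
    final Map<String, String> props = new HashMap<>();
    void set(String k, String v) { props.put(k, v); }
    String get(String k) { return props.get(k); }
}

// Sketch of the SerializableConfiguration pattern: snapshot the wrapped
// object's state in writeObject and rebuild the object in readObject.
class SerializableConf implements Serializable {
    private transient FakeConf value; // transient: not serialized directly

    SerializableConf(FakeConf value) { this.value = value; }

    FakeConf value() { return value; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeObject(new HashMap<>(value.props)); // serialize the state, not the object
    }

    @SuppressWarnings("unchecked")
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        value = new FakeConf();
        value.props.putAll((Map<String, String>) in.readObject());
    }
}

public class ConfRoundTrip {
    public static void main(String[] args) throws Exception {
        FakeConf conf = new FakeConf();
        conf.set("fs.defaultFS", "hdfs://nn:8020");

        // Round-trip through Java serialization, as a Spark closure would be.
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(bytes);
        out.writeObject(new SerializableConf(conf));
        out.flush();
        SerializableConf copy = (SerializableConf) new ObjectInputStream(
            new ByteArrayInputStream(bytes.toByteArray())).readObject();

        System.out.println(copy.value().get("fs.defaultFS")); // prints hdfs://nn:8020
    }
}
```

Spark's real `SerializableConfiguration` does essentially this for `Configuration`, which is why passing it into `mapPartitions` works where a raw `new Configuration()` on the executor loses the driver's settings.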
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table
rdblue commented on a change in pull request #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r317711151 ## File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala ## @@ -297,5 +302,81 @@ object SparkTableUtil { ) } } + + private def buildManifest(table: Table, + sparkDataFiles: Seq[SparkDataFile], + partitionSpec: PartitionSpec): ManifestFile = { +val outputFile = table.io + .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString)) +val writer = ManifestWriter.write(partitionSpec, outputFile) +try { + sparkDataFiles.foreach { file => +writer.add(file.toDataFile(partitionSpec)) + } +} finally { + writer.close() +} + +writer.toManifestFile + } + + /** + * Import a spark table to a iceberg table. + * + * The import uses the spark session to get table metadata. It assumes no + * operation is going on original table and target table and thus is not + * thread-safe. + * + * @param source the database name of the table to be import + * @param location the location used to store table metadata + * + * @return table the imported table + */ + def importSparkTable(source: TableIdentifier, location: String): Table = { +val sparkSession = SparkSession.builder().getOrCreate() +import sparkSession.sqlContext.implicits._ + +val dbName = source.database.getOrElse("default") +val tableName = source.table + +if (!sparkSession.catalog.tableExists(dbName, tableName)) { + throw new NoSuchTableException(s"Table $dbName.$tableName does not exist") +} + +val partitionSpec = SparkSchemaUtil.specForTable(sparkSession, s"$dbName.$tableName") +val conf = sparkSession.sparkContext.hadoopConfiguration +val tables = new HadoopTables(conf) +val schema = SparkSchemaUtil.schemaForTable(sparkSession, s"$dbName.$tableName") +val table = tables.create(schema, partitionSpec, ImmutableMap.of(), location) +val appender = table.newAppend() + +if (partitionSpec == PartitionSpec.unpartitioned) { + val 
tableMetadata = sparkSession.sessionState.catalog.getTableMetadata(source) + val format = tableMetadata.provider.getOrElse("none") + + if (format != "avro" && format != "parquet" && format != "orc") { Review comment: This should use the same logic that `partitionDF` uses to detect the format. It should check the serde's input format.
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table
rdblue commented on a change in pull request #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r317710578 ## File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala ## @@ -297,5 +302,81 @@ object SparkTableUtil { ) } } + + private def buildManifest(table: Table, + sparkDataFiles: Seq[SparkDataFile], + partitionSpec: PartitionSpec): ManifestFile = { +val outputFile = table.io + .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString)) +val writer = ManifestWriter.write(partitionSpec, outputFile) +try { + sparkDataFiles.foreach { file => +writer.add(file.toDataFile(partitionSpec)) + } +} finally { + writer.close() +} + +writer.toManifestFile + } + + /** + * Import a spark table to a iceberg table. + * + * The import uses the spark session to get table metadata. It assumes no + * operation is going on original table and target table and thus is not + * thread-safe. + * + * @param source the database name of the table to be import + * @param location the location used to store table metadata + * + * @return table the imported table + */ + def importSparkTable(source: TableIdentifier, location: String): Table = { +val sparkSession = SparkSession.builder().getOrCreate() +import sparkSession.sqlContext.implicits._ + +val dbName = source.database.getOrElse("default") +val tableName = source.table + +if (!sparkSession.catalog.tableExists(dbName, tableName)) { + throw new NoSuchTableException(s"Table $dbName.$tableName does not exist") +} + +val partitionSpec = SparkSchemaUtil.specForTable(sparkSession, s"$dbName.$tableName") +val conf = sparkSession.sparkContext.hadoopConfiguration +val tables = new HadoopTables(conf) +val schema = SparkSchemaUtil.schemaForTable(sparkSession, s"$dbName.$tableName") +val table = tables.create(schema, partitionSpec, ImmutableMap.of(), location) +val appender = table.newAppend() + +if (partitionSpec == PartitionSpec.unpartitioned) { + val 
tableMetadata = sparkSession.sessionState.catalog.getTableMetadata(source) + val format = tableMetadata.provider.getOrElse("none") + + if (format != "avro" && format != "parquet" && format != "orc") { +throw new UnsupportedOperationException(s"Unsupported format: $format") + } + listPartition(Map.empty[String, String], tableMetadata.location.toString, +format).foreach{f => appender.appendFile(f.toDataFile(PartitionSpec.unpartitioned))} + appender.commit() +} else { + val partitions = partitionDF(sparkSession, s"$dbName.$tableName") + partitions.flatMap { row => +listPartition(row.getMap[String, String](0).toMap, row.getString(1), row.getString(2)) + }.coalesce(1).mapPartitions { Review comment: Why `coalesce(1)` here? In our version of this, we add a sort by file name and build manifests in parallel:

```scala
val tempPath = new Path(s"hdfs:/tmp/iceberg-conversions/$applicationId")

val manifests: Seq[ManifestFile] = files
  .repartition(100) // repartition to shuffle the data and not list partitions twice
  .orderBy($"path")
  .mapPartitions(writeManifest) // writes manifests to tempPath
  .collect()
  .map(_.toManifestFile)

val append = table.newAppend
manifests.foreach(append.appendManifest)
append.commit()
```
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table
rdblue commented on a change in pull request #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r317708705 ## File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala ## @@ -297,5 +301,81 @@ object SparkTableUtil { ) } } + + private def buildManifest(table: Table, +sparkDataFiles: Seq[SparkDataFile], +partitionSpec: PartitionSpec): ManifestFile = { +val outputFile = table.io + .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString)) +val writer = ManifestWriter.write(partitionSpec, outputFile) +try { + for (file <- sparkDataFiles) { +writer.add(file.toDataFile(partitionSpec)) + } +} finally { + writer.close() +} + +writer.toManifestFile + } + + /** + * Import a spark table to a iceberg table. + * + * The import uses the spark session to get table metadata. It assumes no + * operation is going on original table and target table and thus is not + * thread-safe. + * + * @param source the database name of the table to be import + * @param location the location used to store table metadata + * + * @return table the imported table + */ + def importSparkTable(source: TableIdentifier, location: String): Table = { +val sparkSession = SparkSession.builder().getOrCreate() +import sparkSession.sqlContext.implicits._ + +val dbName = source.database.getOrElse("default") +val tableName = source.table + +if (!sparkSession.catalog.tableExists(dbName, tableName)) { + throw new NoSuchTableException(s"Table $dbName.$tableName does not exist") +} + +val partitionSpec = SparkSchemaUtil.specForTable(sparkSession, s"$dbName.$tableName") +val conf = sparkSession.sparkContext.hadoopConfiguration +val tables = new HadoopTables(conf) +val schema = SparkSchemaUtil.schemaForTable(sparkSession, s"$dbName.$tableName") +val table = tables.create(schema, partitionSpec, ImmutableMap.of(), location) +val appender = table.newAppend() + +if (partitionSpec == PartitionSpec.unpartitioned) { Review 
comment: No, that's okay. If the table isn't partitioned, we don't need to get the partitions as a dataframe. It is fine to run this on the driver in that case.
[GitHub] [incubator-iceberg] rdblue commented on issue #412: Use null counts in metrics evaluators
rdblue commented on issue #412: Use null counts in metrics evaluators URL: https://github.com/apache/incubator-iceberg/pull/412#issuecomment-524949101 Thanks @aokolnychyi!
[GitHub] [incubator-iceberg] rdblue merged pull request #412: Use null counts in metrics evaluators
rdblue merged pull request #412: Use null counts in metrics evaluators URL: https://github.com/apache/incubator-iceberg/pull/412
[GitHub] [incubator-iceberg] rdblue closed issue #408: Use null counts in metrics evaluators
rdblue closed issue #408: Use null counts in metrics evaluators URL: https://github.com/apache/incubator-iceberg/issues/408
[GitHub] [incubator-iceberg] rdblue commented on issue #414: Add toByteBuffer to Literal
rdblue commented on issue #414: Add toByteBuffer to Literal URL: https://github.com/apache/incubator-iceberg/pull/414#issuecomment-524945345 Thanks @aokolnychyi! I just found a couple of minor things to fix.
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #414: Add toByteBuffer to Literal
rdblue commented on a change in pull request #414: Add toByteBuffer to Literal URL: https://github.com/apache/incubator-iceberg/pull/414#discussion_r317701995 ## File path: api/src/main/java/org/apache/iceberg/expressions/Literal.java ## @@ -102,4 +102,8 @@ * @return a comparator for T objects */ Comparator comparator(); + + default ByteBuffer toByteBuffer() { +throw new UnsupportedOperationException("toByteBuffer is not supported"); Review comment: Is this left unimplemented for any type? If it isn't then I think we can remove the default implementation.
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #414: Add toByteBuffer to Literal
rdblue commented on a change in pull request #414: Add toByteBuffer to Literal URL: https://github.com/apache/incubator-iceberg/pull/414#discussion_r317701749 ## File path: api/src/main/java/org/apache/iceberg/expressions/Literal.java ## @@ -102,4 +102,8 @@ * @return a comparator for T objects */ Comparator comparator(); + + default ByteBuffer toByteBuffer() { Review comment: This needs some documentation to explain how the value is serialized to a byte buffer. I think it uses the [single-value serialization format](http://iceberg.apache.org/spec/#appendix-d-single-value-serialization), so linking to that would be helpful.
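For context on the linked appendix: the spec's single-value serialization stores fixed-width numeric types as little-endian bytes and strings as UTF-8 bytes. The following is a hedged sketch of what `toByteBuffer()` might produce for an `int` or `String` literal, using only `java.nio` — an illustration of the spec's encoding, not Iceberg's actual implementation (`SingleValueSer` is a hypothetical class name).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SingleValueSer {
    // int: 4 bytes, little-endian, per the single-value serialization appendix.
    static ByteBuffer intToByteBuffer(int value) {
        ByteBuffer buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(value);
        buf.flip(); // make the buffer readable from position 0
        return buf;
    }

    // string: the UTF-8 bytes of the value, no length prefix.
    static ByteBuffer stringToByteBuffer(String value) {
        return ByteBuffer.wrap(value.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(intToByteBuffer(34).array())); // → [34, 0, 0, 0]
        System.out.println(stringToByteBuffer("iceberg").remaining());    // → 7
    }
}
```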
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #414: Add toByteBuffer to Literal
rdblue commented on a change in pull request #414: Add toByteBuffer to Literal URL: https://github.com/apache/incubator-iceberg/pull/414#discussion_r317700813 ## File path: api/src/main/java/org/apache/iceberg/expressions/Literals.java ## @@ -94,6 +95,7 @@ private Literals() { private abstract static class BaseLiteral implements Literal { private final T value; +private volatile ByteBuffer byteBuffer; Review comment: Please explicitly initialize variables either here or in constructors.
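The `volatile ByteBuffer byteBuffer` field supports a lazy-caching idiom: compute the buffer once, publish it through a volatile write, and tolerate the benign race of occasionally computing it twice (harmless here because the result is deterministic). A sketch of that idiom with the field explicitly initialized, as the review asks — `CachedLiteral` is a hypothetical illustration, not Iceberg's code:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class CachedLiteral {
    private final int value;
    private volatile ByteBuffer cached = null; // explicit initialization, per the review

    public CachedLiteral(int value) {
        this.value = value;
    }

    public ByteBuffer toByteBuffer() {
        if (cached == null) { // benign race: two threads may both compute, same result
            ByteBuffer buf = ByteBuffer.allocate(4).order(ByteOrder.LITTLE_ENDIAN);
            buf.putInt(value);
            buf.flip();
            cached = buf; // volatile write safely publishes the fully built buffer
        }
        return cached;
    }
}
```

One caveat worth noting in real code: handing out the cached buffer shares its mutable position and limit, so a production implementation would likely return `cached.duplicate()` or a read-only view.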
[GitHub] [incubator-iceberg] rdblue merged pull request #350: Add dropTable purge option to Catalog API
rdblue merged pull request #350: Add dropTable purge option to Catalog API URL: https://github.com/apache/incubator-iceberg/pull/350
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #406: Throw an exception when using Iceberg with Spark embedded metastore
rdblue commented on a change in pull request #406: Throw an exception when using Iceberg with Spark embedded metastore URL: https://github.com/apache/incubator-iceberg/pull/406#discussion_r317698991 ## File path: hive/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java ## @@ -183,6 +183,12 @@ public void commit(TableMetadata base, TableMetadata metadata) { } threw = false; } catch (TException | UnknownHostException e) { + if (e.getMessage().contains("Table/View 'HIVE_LOCKS' does not exist")) { +LOG.error("Failed to acquire locks from metastore because 'HIVE_LOCKS' doesn't exist, " + +"this probably happened when using embedded metastore or doesn't create transactional" + +" meta table. Please reconfigure and start the metastore.", e); Review comment: This should also be thrown instead of logged.
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #406: Throw an exception when using Iceberg with Spark embedded metastore
rdblue commented on a change in pull request #406: Throw an exception when using Iceberg with Spark embedded metastore URL: https://github.com/apache/incubator-iceberg/pull/406#discussion_r317698659 ## File path: hive/src/main/java/org/apache/iceberg/hive/HiveClientPool.java ## @@ -44,6 +48,14 @@ protected HiveMetaStoreClient newClient() { return new HiveMetaStoreClient(hiveConf); } catch (MetaException e) { throw new RuntimeMetaException(e, "Failed to connect to Hive Metastore"); +} catch (Throwable t) { + if (t.getMessage().contains("Another instance of Derby may have already booted")) { +LOG.error("Another instance of Derby may have already booted. This is happened when using" + +" embedded metastore, which only supports one client at time. Please change to use " + +"other metastores which could support multiple clients, like metastore service.", t); Review comment: Can we change the error message to something a bit shorter? How about: "Failed to start an embedded metastore because Derby supports only one client at a time. To fix this, use a metastore that supports multiple clients."
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #406: Throw an exception when using Iceberg with Spark embedded metastore
rdblue commented on a change in pull request #406: Throw an exception when using Iceberg with Spark embedded metastore URL: https://github.com/apache/incubator-iceberg/pull/406#discussion_r317698081 ## File path: hive/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java ## @@ -183,6 +183,12 @@ public void commit(TableMetadata base, TableMetadata metadata) { } threw = false; } catch (TException | UnknownHostException e) { + if (e.getMessage().contains("Table/View 'HIVE_LOCKS' does not exist")) { +LOG.error("Failed to acquire locks from metastore because 'HIVE_LOCKS' doesn't exist, " + +"this probably happened when using embedded metastore or doesn't create transactional" + +" meta table. Please reconfigure and start the metastore.", e); Review comment: This recommendation isn't very helpful because it isn't clear what "the metastore" is. How about this instead: "To fix this, use an alternative metastore".
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #406: Throw an exception when using Iceberg with Spark embedded metastore
rdblue commented on a change in pull request #406: Throw an exception when using Iceberg with Spark embedded metastore
URL: https://github.com/apache/incubator-iceberg/pull/406#discussion_r317697046

## File path: hive/src/main/java/org/apache/iceberg/hive/HiveClientPool.java

@@ -44,6 +48,14 @@ protected HiveMetaStoreClient newClient() {
      return new HiveMetaStoreClient(hiveConf);
    } catch (MetaException e) {
      throw new RuntimeMetaException(e, "Failed to connect to Hive Metastore");
+    } catch (Throwable t) {
+      if (t.getMessage().contains("Another instance of Derby may have already booted")) {
+        LOG.error("Another instance of Derby may have already booted. This is happened when using" +
+            " embedded metastore, which only supports one client at time. Please change to use " +
+            "other metastores which could support multiple clients, like metastore service.", t);

Review comment: It is fine to log this message, but I think the exception that is thrown should also contain it.
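Taken together, the two review comments above ask for a shorter message that also travels with the thrown exception, not just the log. A possible shape is sketched below; this is not the actual HiveClientPool code — the `translate` helper is hypothetical, and a plain `RuntimeException` stands in for Iceberg's `RuntimeMetaException` so the example is dependency-free.

```java
// Hedged sketch of the error-translation pattern discussed in the review.
// The hint string is the shorter wording proposed by the reviewer.
class MetastoreErrors {
  static final String DERBY_HINT =
      "Failed to start an embedded metastore because Derby supports only one client at a time. "
          + "To fix this, use a metastore that supports multiple clients.";

  // Translate a low-level failure into an exception that carries the hint,
  // so callers see it even if they never read the logs.
  static RuntimeException translate(Throwable t) {
    String msg = t.getMessage();
    if (msg != null && msg.contains("Another instance of Derby may have already booted")) {
      return new RuntimeException(DERBY_HINT, t);
    }
    return new RuntimeException("Failed to connect to Hive Metastore", t);
  }
}
```

A caller would log and then `throw MetastoreErrors.translate(t)`, so the actionable message appears in both the log and the stack trace.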
[GitHub] [incubator-iceberg] rdblue commented on issue #280: Add persistent IDs to partition fields
rdblue commented on issue #280: Add persistent IDs to partition fields URL: https://github.com/apache/incubator-iceberg/issues/280#issuecomment-524938283 @manishmalhotrawork, those IDs have different contexts. The source ID in a partition field is the ID of the source data column in the table schema. The ID added by partitionType is the ID in the manifest file schema.
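The two ID contexts rdblue describes can be illustrated with a tiny self-contained model (this is NOT Iceberg's implementation; the 1000 offset matches the 1000+ IDs assigned in PartitionSpec.partitionType, as noted in the discussion):

```java
// Model of the two ID spaces: a partition field keeps the sourceId of the
// table-schema column it derives from, while partitionType() assigns
// separate IDs, offset by 1000, for the partition struct stored in
// manifest files so they never collide with table-schema column IDs.
class PartitionFieldIds {
  static final int PARTITION_DATA_ID_START = 1000;

  // ID of the i-th partition field in the manifest file schema
  static int manifestFieldId(int positionInSpec) {
    return PARTITION_DATA_ID_START + positionInSpec;
  }
}
```

So a spec whose first field is derived from table column 2 reports sourceId 2, while that same field appears in the manifest file schema with ID 1000.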
[GitHub] [incubator-iceberg] manishmalhotrawork edited a comment on issue #280: Add persistent IDs to partition fields
manishmalhotrawork edited a comment on issue #280: Add persistent IDs to partition fields URL: https://github.com/apache/incubator-iceberg/issues/280#issuecomment-524733183 @rdblue thanks for the answer. Curious about the flow and how fieldIds are maintained in PartitionSpec vs Table Schema. Here [PartitionSpec is initialized](https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L63) and fields are copied to the partition spec, which also includes the original IDs assigned to the fields. And in the [partitionType method](https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L113) a new fieldId (1000+) is assigned for the partitionFields, and a StructType with those fields is returned. Not sure if I'm missing something. So, why are the original fieldIds and the IDs assigned in `partitionSpec` different? How is the `StructType` returned by `partitionSpec` with IDs of 1000+ used?
[GitHub] [incubator-iceberg] rdblue commented on issue #412: Use null counts in metrics evaluators
rdblue commented on issue #412: Use null counts in metrics evaluators
URL: https://github.com/apache/incubator-iceberg/pull/412#issuecomment-524914156

> Is the current implementation of eq in StrictMetricsEvaluator safe? What if we have truncated min/max values?

I think the bounds check is safe with truncated bounds. The check is that the literal value is equal to lower and upper bounds, so we would know that there is only one value that is equal to the filter's literal. That can only happen when the bounds have not been truncated because the truncated upper bound is never equal to the truncated lower bound unless the value was already smaller than the truncation length.

Here's an example: Say the real min and max are ["aa", ""] and we are truncating to 2 codepoints. Then the bounds will be ["aa", "ab"]. Because there is no value equal to both bounds, `eq` will never return `ROWS_MUST_MATCH`. Alternatively, if there is just one value, "aa", then both min/max and truncated bounds will be "aa". In that case, `eq` will detect that "aa" is equal to both bounds and return that the predicate is guaranteed to be true by the bounds.
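The reasoning above can be checked with a small sketch. This is NOT Iceberg's actual bound-truncation code (`UnicodeUtil`): it assumes plain chars, ignores surrogate pairs, and ignores overflow when incrementing the last character.

```java
// Simplified string-bound truncation. Lower bounds truncate down to a
// prefix; upper bounds truncate up by incrementing the last kept character,
// so the bound stays >= every value that shares the prefix.
class Bounds {
  static String truncateLower(String s, int len) {
    return s.length() <= len ? s : s.substring(0, len);
  }

  static String truncateUpper(String s, int len) {
    if (s.length() <= len) {
      return s;
    }
    char last = s.charAt(len - 1);
    return s.substring(0, len - 1) + (char) (last + 1);
  }

  // Strict eq: the predicate is guaranteed true for every row only when
  // lower == literal == upper, i.e. the column holds exactly one value.
  static boolean eqRowsMustMatch(String lower, String upper, String literal) {
    return lower.equals(literal) && upper.equals(literal);
  }
}
```

For instance, with a hypothetical real max like "aaccc" truncated to 2 characters, the bounds come out ["aa", "ab"]; no string equals both, so `eq` can never claim `ROWS_MUST_MATCH` from truncated bounds, matching the argument above.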
[GitHub] [incubator-iceberg] aokolnychyi commented on issue #412: Use null counts in metrics evaluators
aokolnychyi commented on issue #412: Use null counts in metrics evaluators URL: https://github.com/apache/incubator-iceberg/pull/412#issuecomment-524900800 @rdblue not directly related to this PR, but is the current implementation of `eq` in `StrictMetricsEvaluator` safe? What if we have truncated min/max values?
[GitHub] [incubator-iceberg] chenjunjiedada opened a new issue #415: supports unpartitioned table in partitionDF
chenjunjiedada opened a new issue #415: supports unpartitioned table in partitionDF
URL: https://github.com/apache/incubator-iceberg/issues/415

partitionDF throws an exception when executed on an unpartitioned table; see the following exception:

```
org.apache.hadoop.hive.ql.metadata.HiveException: Table item is not a partitioned table
    at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2123)
    at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2156)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitions$1.apply(HiveClientImpl.scala:670)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitions$1.apply(HiveClientImpl.scala:662)
    at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
    at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
    at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
    at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
    at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitions(HiveClientImpl.scala:662)
    at org.apache.spark.sql.hive.client.HiveClient$class.getPartitions(HiveClient.scala:210)
    at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitions(HiveClientImpl.scala:83)
    at org.apache.iceberg.spark.hacks.Hive.partitions(Hive.java:48)
    at org.apache.iceberg.spark.SparkTableUtil$.partitionDF(SparkTableUtil.scala:56)
```
[GitHub] [incubator-iceberg] jerryshao commented on issue #406: Throw an exception when using Iceberg with Spark embedded metastore
jerryshao commented on issue #406: Throw an exception when using Iceberg with Spark embedded metastore URL: https://github.com/apache/incubator-iceberg/pull/406#issuecomment-524843479 @rdblue please help to take another look, thanks!