[GitHub] [incubator-iceberg] moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg

2019-08-26 Thread GitBox
moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317856977
 
 

 ##
 File path: site/docs/custom-catalog.md
 ##
 @@ -0,0 +1,149 @@
+# Custom Catalog Implementation
+
+It's possible to read an Iceberg table either from an HDFS path or from a Hive table. It's also possible to use a custom metastore in place of Hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide an implementation of how to read and write metadata.
+
+Example:
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+  private String dbName;
+  private String tableName;
+  private Configuration conf;
+  private FileIO fileIO;
+
+  protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
+    this.conf = conf;
+    this.dbName = dbName;
+    this.tableName = tableName;
+  }
+
+  // The doRefresh method should provide an implementation of how to get the metadata location
+  @Override
+  public void doRefresh() {
+
+    // Example custom service which returns the metadata location given a dbName and tableName
+    String metadataLocation = CustomService.getMetadataForTable(conf, dbName, tableName);
+
+    // Use existing method to refresh metadata
+    refreshFromMetadataLocation(metadataLocation);
+
+  }
+
+  // The doCommit method should provide an implementation of how to update the metadata location atomically
+  @Override
+  public void doCommit(TableMetadata base, TableMetadata metadata) {
+    // if the metadata is already out of date, reject it
+    if (base != current()) {
 
 Review comment:
   @aokolnychyi @rdblue I was wondering if it makes sense to have this and the following check as part of the `commit()` method instead, before calling `doCommit`?
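
A rough sketch of the suggestion (not code from the PR): `commit()` in the base class would perform the common validations, then delegate to the `doCommit` hook.

```java
// Hypothetical base-class commit(): validate first, then call the subclass hook.
@Override
public void commit(TableMetadata base, TableMetadata metadata) {
  // if the metadata is already out of date, reject it
  if (base != current()) {
    throw new CommitFailedException("Cannot commit: stale table metadata");
  }

  // if the metadata is not changed, return early
  if (base == metadata) {
    return;
  }

  doCommit(base, metadata);
  requestRefresh();
}
```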




[GitHub] [incubator-iceberg] moulimukherjee commented on issue #416: Adding docs on how to use custom catalog with iceberg

2019-08-26 Thread GitBox
moulimukherjee commented on issue #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#issuecomment-525092405
 
 
   PTAL again @rdblue @aokolnychyi 





[GitHub] [incubator-iceberg] moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg

2019-08-26 Thread GitBox
moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317856128
 
 

 ##
 File path: site/docs/custom-catalog.md
 ##
 @@ -0,0 +1,134 @@
+# Custom Catalog Implementation
+
+It's possible to read an Iceberg table either from an HDFS path or from a Hive table. It's also possible to use a custom metastore in place of Hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide an implementation of how to read and write metadata.
+
+Example:
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+  private String dbName;
+  private String tableName;
+  private Configuration conf;
+
+  protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
 
 Review comment:
   Updated





[GitHub] [incubator-iceberg] moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg

2019-08-26 Thread GitBox
moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317855049
 
 

 ##
 File path: site/docs/custom-catalog.md
 ##
 @@ -0,0 +1,134 @@
+# Custom Catalog Implementation
+
+It's possible to read an Iceberg table either from an HDFS path or from a Hive table. It's also possible to use a custom metastore in place of Hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide an implementation of how to read and write metadata.
+
+Example:
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+  private String dbName;
+  private String tableName;
+  private Configuration conf;
+
+  protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
+    super(conf);
+    this.conf = conf;
+    this.dbName = dbName;
+    this.tableName = tableName;
+  }
+
+  // The refresh method should provide an implementation of how to get the metadata location
+  @Override
+  public TableMetadata refresh() {
 
 Review comment:
   Updated to `doRefresh`





[GitHub] [incubator-iceberg] moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg

2019-08-26 Thread GitBox
moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317851135
 
 

 ##
 File path: site/docs/custom-catalog.md
 ##
 @@ -0,0 +1,134 @@
+# Custom Catalog Implementation
+
+It's possible to read an Iceberg table either from an HDFS path or from a Hive table. It's also possible to use a custom metastore in place of Hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide an implementation of how to read and write metadata.
+
+Example:
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+  private String dbName;
+  private String tableName;
+  private Configuration conf;
+
+  protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
 
 Review comment:
   Ack.





[GitHub] [incubator-iceberg] moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg

2019-08-26 Thread GitBox
moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317851046
 
 

 ##
 File path: site/docs/custom-catalog.md
 ##
 @@ -0,0 +1,134 @@
+# Custom Catalog Implementation
+
+It's possible to read an Iceberg table either from an HDFS path or from a Hive table. It's also possible to use a custom metastore in place of Hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide an implementation of how to read and write metadata.
+
+Example:
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+  private String dbName;
+  private String tableName;
+  private Configuration conf;
+
+  protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
+    super(conf);
+    this.conf = conf;
+    this.dbName = dbName;
+    this.tableName = tableName;
+  }
+
+  // The refresh method should provide an implementation of how to get the metadata location
+  @Override
+  public TableMetadata refresh() {
 
 Review comment:
   Ack, I see that it's merged, will update this.





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #388: Handle rollback in snapshot expiration

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #388: Handle rollback in snapshot expiration
URL: https://github.com/apache/incubator-iceberg/pull/388#discussion_r317843266
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/RemoveSnapshots.java
 ##
 @@ -120,56 +123,128 @@ public void commit() {
       }
     });
 
-    LOG.info("Committed snapshot changes; cleaning up expired manifests and data files.");
+    cleanExpiredSnapshots();
+  }
 
+  private void cleanExpiredSnapshots() {
     // clean up the expired snapshots:
     // 1. Get a list of the snapshots that were removed
     // 2. Delete any data files that were deleted by those snapshots and are not in the table
     // 3. Delete any manifests that are no longer used by current snapshots
     // 4. Delete the manifest lists
 
+    TableMetadata current = ops.refresh();
+
+    Set<Long> validIds = Sets.newHashSet();
+    for (Snapshot snapshot : current.snapshots()) {
+      validIds.add(snapshot.snapshotId());
+    }
+
+    Set<Long> expiredIds = Sets.newHashSet();
+    for (Snapshot snapshot : base.snapshots()) {
+      long snapshotId = snapshot.snapshotId();
+      if (!validIds.contains(snapshotId)) {
+        // the snapshot was expired
+        LOG.info("Expired snapshot: {}", snapshot);
+        expiredIds.add(snapshotId);
+      }
+    }
+
+    if (expiredIds.isEmpty()) {
+      // if no snapshots were expired, skip cleanup
+      return;
+    }
+
+    LOG.info("Committed snapshot changes; cleaning up expired manifests and data files.");
+
+    cleanExpiredFiles(current.snapshots(), validIds, expiredIds);
+  }
+
+  @SuppressWarnings("checkstyle:CyclomaticComplexity")
+  private void cleanExpiredFiles(List<Snapshot> snapshots, Set<Long> validIds, Set<Long> expiredIds) {
     // Reads and deletes are done using Tasks.foreach(...).suppressFailureWhenFinished to complete
     // as much of the delete work as possible and avoid orphaned data or manifest files.
 
-    TableMetadata current = ops.refresh();
-    Set<Long> currentIds = Sets.newHashSet();
-    Set<ManifestFile> currentManifests = Sets.newHashSet();
-    for (Snapshot snapshot : current.snapshots()) {
-      currentIds.add(snapshot.snapshotId());
-      currentManifests.addAll(snapshot.manifests());
+    // this is the set of ancestors of the current table state. when removing snapshots, this must
+    // only remove files that were deleted in an ancestor of the current table state to avoid
+    // physically deleting files that were logically deleted in a commit that was rolled back.
+    Set<Long> ancestorIds = Sets.newHashSet(SnapshotUtil.ancestorIds(base.currentSnapshot(), base::snapshot));
+
+    Set<ManifestFile> validManifests = Sets.newHashSet();
+    Set<ManifestFile> manifestsToScan = Sets.newHashSet();
+    for (Snapshot snapshot : snapshots) {
+      validIds.add(snapshot.snapshotId());
 Review comment:
   You're right. I'll remove it.





[GitHub] [incubator-iceberg] Parth-Brahmbhatt commented on a change in pull request #388: Handle rollback in snapshot expiration

2019-08-26 Thread GitBox
Parth-Brahmbhatt commented on a change in pull request #388: Handle rollback in snapshot expiration
URL: https://github.com/apache/incubator-iceberg/pull/388#discussion_r317838958
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/RemoveSnapshots.java
 ##
 @@ -120,56 +123,128 @@ public void commit() {
       }
     });
 
-    LOG.info("Committed snapshot changes; cleaning up expired manifests and data files.");
+    cleanExpiredSnapshots();
+  }
 
+  private void cleanExpiredSnapshots() {
     // clean up the expired snapshots:
     // 1. Get a list of the snapshots that were removed
     // 2. Delete any data files that were deleted by those snapshots and are not in the table
     // 3. Delete any manifests that are no longer used by current snapshots
     // 4. Delete the manifest lists
 
+    TableMetadata current = ops.refresh();
+
+    Set<Long> validIds = Sets.newHashSet();
+    for (Snapshot snapshot : current.snapshots()) {
+      validIds.add(snapshot.snapshotId());
+    }
+
+    Set<Long> expiredIds = Sets.newHashSet();
+    for (Snapshot snapshot : base.snapshots()) {
+      long snapshotId = snapshot.snapshotId();
+      if (!validIds.contains(snapshotId)) {
+        // the snapshot was expired
+        LOG.info("Expired snapshot: {}", snapshot);
+        expiredIds.add(snapshotId);
+      }
+    }
+
+    if (expiredIds.isEmpty()) {
+      // if no snapshots were expired, skip cleanup
+      return;
+    }
+
+    LOG.info("Committed snapshot changes; cleaning up expired manifests and data files.");
+
+    cleanExpiredFiles(current.snapshots(), validIds, expiredIds);
+  }
+
+  @SuppressWarnings("checkstyle:CyclomaticComplexity")
+  private void cleanExpiredFiles(List<Snapshot> snapshots, Set<Long> validIds, Set<Long> expiredIds) {
     // Reads and deletes are done using Tasks.foreach(...).suppressFailureWhenFinished to complete
     // as much of the delete work as possible and avoid orphaned data or manifest files.
 
-    TableMetadata current = ops.refresh();
-    Set<Long> currentIds = Sets.newHashSet();
-    Set<ManifestFile> currentManifests = Sets.newHashSet();
-    for (Snapshot snapshot : current.snapshots()) {
-      currentIds.add(snapshot.snapshotId());
-      currentManifests.addAll(snapshot.manifests());
+    // this is the set of ancestors of the current table state. when removing snapshots, this must
+    // only remove files that were deleted in an ancestor of the current table state to avoid
+    // physically deleting files that were logically deleted in a commit that was rolled back.
+    Set<Long> ancestorIds = Sets.newHashSet(SnapshotUtil.ancestorIds(base.currentSnapshot(), base::snapshot));
+
+    Set<ManifestFile> validManifests = Sets.newHashSet();
+    Set<ManifestFile> manifestsToScan = Sets.newHashSet();
+    for (Snapshot snapshot : snapshots) {
+      validIds.add(snapshot.snapshotId());
 
 Review comment:
   This seems unnecessary; the passed-in `validIds` should already have these snapshotIds.





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r317841160
 
 

 ##
 File path: api/src/test/java/org/apache/iceberg/expressions/TestInclusiveMetricsEvaluator.java
 ##
 @@ -355,4 +390,40 @@ public void testCaseInsensitiveIntegerNotEqRewritten() {
   public void testCaseSensitiveIntegerNotEqRewritten() {
     boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, not(equal("ID", 5)), true).eval(FILE);
   }
+
+  @Test
+  public void testStringStartsWith() {
+    boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "a"), true).eval(FILE);
+    Assert.assertTrue("Should read: no stats", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "a"), true).eval(FILE_2);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "aa"), true).eval(FILE_2);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "aaa"), true).eval(FILE_2);
 
 Review comment:
   I don't think there is an equivalent test for the upper bound, where the bound is shorter so the startsWith prefix is truncated. Can you add one?
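
For illustration only, a sketch of the missing case, assuming a hypothetical fixture `FILE_5` whose upper bound for `required` is stored as a truncated value such as `"ee"`:

```java
// Hypothetical FILE_5: upper bound for "required" is "ee", truncated from longer
// values. The prefix "eeff" is longer than the stored bound, but the bound's two
// characters match the prefix's first two, so the file must still be read.
shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "eeff"), true).eval(FILE_5);
Assert.assertTrue("Should read: truncated upper bound may still match", shouldRead);
```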





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #398: Push down StringStartsWith in Spark IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#discussion_r317841001
 
 

 ##
 File path: api/src/test/java/org/apache/iceberg/expressions/TestInclusiveMetricsEvaluator.java
 ##
 @@ -355,4 +390,40 @@ public void testCaseInsensitiveIntegerNotEqRewritten() {
   public void testCaseSensitiveIntegerNotEqRewritten() {
     boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, not(equal("ID", 5)), true).eval(FILE);
   }
+
+  @Test
+  public void testStringStartsWith() {
+    boolean shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "a"), true).eval(FILE);
+    Assert.assertTrue("Should read: no stats", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "a"), true).eval(FILE_2);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "aa"), true).eval(FILE_2);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "aaa"), true).eval(FILE_2);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "1s"), true).eval(FILE_3);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "1str1x"), true).eval(FILE_3);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "ff"), true).eval(FILE_4);
+    Assert.assertTrue("Should read: range matches", shouldRead);
+
+    shouldRead = new InclusiveMetricsEvaluator(SCHEMA, startsWith("required", "aB"), true).eval(FILE_2);
 
 Review comment:
   Nit: For the purpose of testing, I think it is more clear to use all lower case or all upper case. It may not be obvious that `"B" < "a"`.
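
For context, Java string comparison is by UTF-16 code unit, where uppercase letters sort before lowercase:

```java
// 'B' is 0x42 and 'a' is 0x61, so "B" < "a" lexicographically;
// this prints a negative number.
System.out.println("B".compareTo("a"));
```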





[GitHub] [incubator-iceberg] rdblue opened a new pull request #418: Add missing manifest cleanup tests

2019-08-26 Thread GitBox
rdblue opened a new pull request #418: Add missing manifest cleanup tests
URL: https://github.com/apache/incubator-iceberg/pull/418
 
 
   





[GitHub] [incubator-iceberg] moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg

2019-08-26 Thread GitBox
moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317833163
 
 

 ##
 File path: site/docs/custom-catalog.md
 ##
 @@ -0,0 +1,134 @@
+# Custom Catalog Implementation
+
+It's possible to read an Iceberg table either from an HDFS path or from a Hive table. It's also possible to use a custom metastore in place of Hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide an implementation of how to read and write metadata.
+
+Example:
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+  private String dbName;
+  private String tableName;
+  private Configuration conf;
+
+  protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
+    super(conf);
+    this.conf = conf;
+    this.dbName = dbName;
+    this.tableName = tableName;
+  }
+
+  // The refresh method should provide an implementation of how to get the metadata location
+  @Override
+  public TableMetadata refresh() {
+
+    // Example custom service which returns the metadata location given a dbName and tableName
+    val metadataLocation = CustomService.getMetadataForTable(conf, dbName, tableName)
+
+    // Use existing method to refresh metadata
+    refreshFromMetadataLocation(metadataLocation);
+
+    // Use existing method to return the table metadata
+    return current();
+  }
+
+  // The commit method should provide an implementation of how to persist the metadata location
+  @Override
+  public void commit(TableMetadata base, TableMetadata metadata) {
+    // if the metadata is already out of date, reject it
+    if (base != current()) {
+      throw new CommitFailedException("Cannot commit: stale table metadata for %s.%s", dbName, tableName);
+    }
+
+    // if the metadata is not changed, return early
+    if (base == metadata) {
+      return;
+    }
+
+    // Write new metadata
+    String newMetadataLocation = writeNewMetadata(metadata, currentVersion() + 1);
+
+    // Example custom service which updates the metadata location for the given db and table
+    CustomService.updateMetadataLocation(dbName, tableName, newMetadataLocation);
 
 Review comment:
   Ack.





[GitHub] [incubator-iceberg] rdblue closed issue #261: Support create and replace transactions in Catalog

2019-08-26 Thread GitBox
rdblue closed issue #261: Support create and replace transactions in Catalog
URL: https://github.com/apache/incubator-iceberg/issues/261
 
 
   





[GitHub] [incubator-iceberg] rdblue merged pull request #362: Support create and replace transactions in Catalog

2019-08-26 Thread GitBox
rdblue merged pull request #362: Support create and replace transactions in Catalog
URL: https://github.com/apache/incubator-iceberg/pull/362
 
 
   





[GitHub] [incubator-iceberg] rdblue commented on issue #362: Support create and replace transactions in Catalog

2019-08-26 Thread GitBox
rdblue commented on issue #362: Support create and replace transactions in Catalog
URL: https://github.com/apache/incubator-iceberg/pull/362#issuecomment-525064046
 
 
   +1
   
   Thanks for fixing this, @aokolnychyi!





[GitHub] [incubator-iceberg] shardulm94 opened a new issue #417: Adding support for time-based partitioning on long column type

2019-08-26 Thread GitBox
shardulm94 opened a new issue #417: Adding support for time-based partitioning on long column type
URL: https://github.com/apache/incubator-iceberg/issues/417
 
 
   We have datasets which store milliseconds since epoch as a long column type. This column is currently being used for time-based partitioning. However, Iceberg only supports time-based partitioning on timestamp and date column types with microsecond precision. Also, the transforms are all internal to Iceberg and not something that can be provided externally. How can we go about adding support for such a case in our environment? Do you think this is something that can be added to Iceberg itself?
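
One possible workaround sketch (plain Spark, not an Iceberg API; names are illustrative): derive a real timestamp column from the epoch-millis long before writing, so Iceberg's built-in date/hour transforms can partition on the derived column. This assumes `df` is the source `Dataset<Row>` with a long column `event_millis`.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

// Milliseconds since epoch -> seconds -> timestamp; Spark casts numeric values
// to timestamps as seconds since epoch, so divide by 1000 first.
Dataset<Row> withTs = df.withColumn(
    "event_ts", col("event_millis").divide(1000).cast("timestamp"));
// A partition spec can then use Iceberg's timestamp transforms on "event_ts".
```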
   
   





[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg

2019-08-26 Thread GitBox
aokolnychyi commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317827313
 
 

 ##
 File path: site/docs/custom-catalog.md
 ##
 @@ -0,0 +1,134 @@
+# Custom Catalog Implementation
+
+It's possible to read an Iceberg table either from an HDFS path or from a Hive table. It's also possible to use a custom metastore in place of Hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide an implementation of how to read and write metadata.
+
+Example:
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+  private String dbName;
+  private String tableName;
+  private Configuration conf;
+
+  protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
+    super(conf);
+    this.conf = conf;
+    this.dbName = dbName;
+    this.tableName = tableName;
+  }
+
+  // The refresh method should provide an implementation of how to get the metadata location
+  @Override
+  public TableMetadata refresh() {
 
 Review comment:
   I think #362 is quite close. Shall we merge that first and update this one to use `doRefresh`?





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317824379
 
 

 ##
 File path: site/docs/custom-catalog.md
 ##
 @@ -0,0 +1,134 @@
+# Custom Catalog Implementation
+
+It's possible to read an Iceberg table either from an HDFS path or from a Hive table. It's also possible to use a custom metastore in place of Hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide an implementation of how to read and write metadata.
+
+Example:
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+  private String dbName;
+  private String tableName;
+  private Configuration conf;
+
+  protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
+    super(conf);
+    this.conf = conf;
+    this.dbName = dbName;
+    this.tableName = tableName;
+  }
+
+  // The refresh method should provide an implementation of how to get the metadata location
+  @Override
+  public TableMetadata refresh() {
+
+    // Example custom service which returns the metadata location given a dbName and tableName
+    val metadataLocation = CustomService.getMetadataForTable(conf, dbName, tableName)
+
+    // Use existing method to refresh metadata
+    refreshFromMetadataLocation(metadataLocation);
+
+    // Use existing method to return the table metadata
+    return current();
+  }
+
+  // The commit method should provide an implementation of how to persist the metadata location
+  @Override
+  public void commit(TableMetadata base, TableMetadata metadata) {
+    // if the metadata is already out of date, reject it
+    if (base != current()) {
+      throw new CommitFailedException("Cannot commit: stale table metadata for %s.%s", dbName, tableName);
+    }
+
+    // if the metadata is not changed, return early
+    if (base == metadata) {
+      return;
+    }
+
+    // Write new metadata
+    String newMetadataLocation = writeNewMetadata(metadata, currentVersion() + 1);
+
+    // Example custom service which updates the metadata location for the given db and table
+    CustomService.updateMetadataLocation(dbName, tableName, newMetadataLocation);
+
+    // Use existing method to request a refresh
+    requestRefresh();
+  }
+}
+```
+
+### Custom table implementation
+Extend `BaseMetastoreCatalog` to provide default warehouse locations and instantiate `CustomTableOperations`.
+
+Example:
+```java
+public class CustomCatalog extends BaseMetastoreCatalog {
+
+  private Configuration configuration;
+
+  public CustomCatalog(Configuration configuration) {
+    this.configuration = configuration;
+  }
+
+  @Override
+  protected TableOperations newTableOps(TableIdentifier tableIdentifier) {
+    String dbName = tableIdentifier.namespace().level(0);
+    String tableName = tableIdentifier.name();
+    // instantiate the CustomTableOperations
+    return new CustomTableOperations(configuration, dbName, tableName);
+  }
+
+  @Override
+  protected String defaultWarehouseLocation(TableIdentifier tableIdentifier) {
+
+    // Can choose to use any other configuration name
+    String tableLocation = configuration.get("custom.iceberg.table.location");
 
 Review comment:
   This is a warehouse location, right? If so, it may be more clear to use `"custom.iceberg.warehouse.location"` instead.
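
A sketch of how the renamed key could be used to build per-table locations under a warehouse root (the key name follows the suggestion above; the path layout is illustrative):

```java
@Override
protected String defaultWarehouseLocation(TableIdentifier tableIdentifier) {
  // Illustrative key name per the suggestion; any configuration name works
  String warehouseLocation = configuration.get("custom.iceberg.warehouse.location");
  // Compose a per-table path under the warehouse root, e.g. /warehouse/db.db/table
  return String.format("%s/%s.db/%s",
      warehouseLocation,
      tableIdentifier.namespace().level(0),
      tableIdentifier.name());
}
```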





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317824181
 
 

 ##
 File path: site/docs/custom-catalog.md
 ##
 @@ -0,0 +1,134 @@
+# Custom Catalog Implementation
+
+It's possible to read an Iceberg table either from an HDFS path or from a Hive table. It's also possible to use a custom metastore in place of Hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide an implementation of how to read and write metadata.
+
+Example:
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+  private String dbName;
+  private String tableName;
+  private Configuration conf;
+
+  protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
 
 Review comment:
   This is probably a good thing to note: this is how you can customize table implementations. `FileIO` and `LocationProvider` can be defaulted, or set to new implementations, by returning them from this class.
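
A minimal sketch of that kind of customization inside the `CustomTableOperations` class (the custom location provider is hypothetical):

```java
// Return a custom FileIO instead of the default
@Override
public FileIO io() {
  return new HadoopFileIO(conf);
}

// Return a custom LocationProvider instead of the default
@Override
public LocationProvider locationProvider() {
  return new CustomLocationProvider(current().location());  // hypothetical provider
}
```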





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317823840
 
 

 ##
 File path: site/docs/custom-catalog.md
 ##
 @@ -0,0 +1,134 @@
+# Custom Catalog Implementation
+
+It's possible to read an Iceberg table either from an HDFS path or from a Hive table. It's also possible to use a custom metastore in place of Hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide an implementation of how to read and write metadata.
+
+Example:
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+  private String dbName;
+  private String tableName;
+  private Configuration conf;
+
+  protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
 
 Review comment:
   `BaseMetastoreTableOperations` was recently refactored to avoid passing a Hadoop `Configuration`. So now this also requires implementing `io` to return a `HadoopFileIO`, like this: https://github.com/apache/incubator-iceberg/blob/master/hive/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java#L79-L85
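
Following the linked `HiveTableOperations` pattern, that would look roughly like this: lazily create a `HadoopFileIO` from the Hadoop `Configuration` the first time `io()` is called.

```java
private FileIO fileIO;

@Override
public FileIO io() {
  if (fileIO == null) {
    fileIO = new HadoopFileIO(conf);
  }
  return fileIO;
}
```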





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317822579
 
 

 ##
 File path: site/docs/custom-catalog.md
 ##
 @@ -0,0 +1,134 @@
+# Custom Catalog Implementation
+
+It's possible to read an Iceberg table either from an HDFS path or from a Hive table. It's also possible to use a custom metastore in place of Hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide an implementation of how to read and write metadata.
+
+Example:
+```java
+class CustomTableOperations extends BaseMetastoreTableOperations {
+  private String dbName;
+  private String tableName;
+  private Configuration conf;
+
+  protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
+    super(conf);
+    this.conf = conf;
+    this.dbName = dbName;
+    this.tableName = tableName;
+  }
+
+  // The refresh method should provide an implementation of how to get the metadata location
+  @Override
+  public TableMetadata refresh() {
+
+    // Example custom service which returns the metadata location given a dbName and tableName
+    val metadataLocation = CustomService.getMetadataForTable(conf, dbName, tableName)
+
+    // Use existing method to refresh metadata
+    refreshFromMetadataLocation(metadataLocation);
+
+    // Use existing method to return the table metadata
+    return current();
+  }
+
+  // The commit method should provide an implementation of how to persist the metadata location
+  @Override
+  public void commit(TableMetadata base, TableMetadata metadata) {
+    // if the metadata is already out of date, reject it
+    if (base != current()) {
+      throw new CommitFailedException("Cannot commit: stale table metadata for %s.%s", dbName, tableName);
+    }
+
+    // if the metadata is not changed, return early
+    if (base == metadata) {
+      return;
+    }
+
+    // Write new metadata
+    String newMetadataLocation = writeNewMetadata(metadata, currentVersion() + 1);
+
+    // Example custom service which updates the metadata location for the given db and table
+    CustomService.updateMetadataLocation(dbName, tableName, newMetadataLocation);
 
 Review comment:
   This actually needs to pass the current metadata location to `CustomService`. Otherwise, `CustomService` can't guarantee that the location update is an atomic swap, and the atomic swap is required for consistency. Without it, two processes could each start a change, validate that metadata hasn't changed, and both call this update at the same time. The second one would win, because the update has no way to reject a commit that isn't based on the other process's changes.
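
A hedged sketch of the compare-and-swap shape this implies, reusing the document's hypothetical `CustomService` with an extra expected-location argument:

```java
// Pass the location this commit was based on, so the service can atomically
// swap the pointer only if the stored location still matches it.
String expectedLocation = currentMetadataLocation();
boolean swapped = CustomService.updateMetadataLocation(
    dbName, tableName, expectedLocation, newMetadataLocation);
if (!swapped) {
  throw new CommitFailedException(
      "Cannot commit: metadata location for %s.%s changed concurrently", dbName, tableName);
}
```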





[GitHub] [incubator-iceberg] moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg

2019-08-26 Thread GitBox
moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317815623
 
 

 ##
 File path: site/docs/custom-catalog.md
 ##
 @@ -0,0 +1,138 @@
+# Custom Catalog Implementation
+
+It's possible to read an Iceberg table either from an HDFS path or from a Hive table. It's also possible to use a custom metastore in place of Hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide an implementation of how to read and write metadata.
+
+Example:
+```java
+
+class CustomTableOperations extends BaseMetastoreTableOperations {
+private String dbName;
+private String tableName;
+private Configuration conf;
+
+protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
+super(conf);
+this.conf = conf;
+this.dbName = dbName;
+this.tableName = tableName;
+}
+
+// The refresh method should provide an implementation of how to get the metadata location
+@Override
+public TableMetadata refresh() {
+
+// Example custom service which returns the metadata location given a dbName and tableName
+val metadataLocation = CustomService.getMetadataForTable(conf, dbName, tableName)
 
 Review comment:
   Fixed





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #362: Support create and replace transactions in Catalog

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #362: Support create and replace transactions in Catalog
URL: https://github.com/apache/incubator-iceberg/pull/362#discussion_r317815165
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/BaseMetastoreTableOperations.java
 ##
 @@ -66,6 +67,35 @@ public int currentVersion() {
     return version;
   }
 
+  @Override
+  public TableMetadata refresh() {
 
 Review comment:
   FYI @moulimukherjee and @aokolnychyi: Mouli just added docs for implementing your own `BaseMetastoreTableOperations`. We'll want to update those for this change to override `doRefresh` at some point, depending on when this makes it in.





[GitHub] [incubator-iceberg] moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg

2019-08-26 Thread GitBox
moulimukherjee commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317814151
 
 

 ##
 File path: site/docs/custom-catalog.md
 ##
 @@ -0,0 +1,138 @@
+# Custom Catalog Implementation
+
+It's possible to read an Iceberg table either from an HDFS path or from a Hive table. It's also possible to use a custom metastore in place of Hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide an implementation of how to read and write metadata.
+
+Example:
+```java
+
+class CustomTableOperations extends BaseMetastoreTableOperations {
+private String dbName;
+private String tableName;
+private Configuration conf;
+
+protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
+super(conf);
+this.conf = conf;
+this.dbName = dbName;
+this.tableName = tableName;
+}
+
+// The refresh method should provide an implementation of how to get the metadata location
+@Override
+public TableMetadata refresh() {
+
+// Example custom service which returns the metadata location given a dbName and tableName
+val metadataLocation = CustomService.getMetadataForTable(conf, dbName, tableName)
 
 Review comment:
   ah, correcting it 





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416#discussion_r317813862
 
 

 ##
 File path: site/docs/custom-catalog.md
 ##
 @@ -0,0 +1,138 @@
+# Custom Catalog Implementation
+
+It's possible to read an Iceberg table either from an HDFS path or from a Hive table. It's also possible to use a custom metastore in place of Hive. The steps to do that are as follows.
+
+- [Custom TableOperations](#custom-table-operations-implementation)
+- [Custom Catalog](#custom-table-implementation)
+- [Custom IcebergSource](#custom-icebergsource)
+
+### Custom table operations implementation
+Extend `BaseMetastoreTableOperations` to provide an implementation of how to read and write metadata.
+
+Example:
+```java
+
+class CustomTableOperations extends BaseMetastoreTableOperations {
+private String dbName;
+private String tableName;
+private Configuration conf;
+
+protected CustomTableOperations(Configuration conf, String dbName, String tableName) {
+super(conf);
+this.conf = conf;
+this.dbName = dbName;
+this.tableName = tableName;
+}
+
+// The refresh method should provide an implementation of how to get the metadata location
+@Override
+public TableMetadata refresh() {
+
+// Example custom service which returns the metadata location given a dbName and tableName
+val metadataLocation = CustomService.getMetadataForTable(conf, dbName, tableName)
 
 Review comment:
   Nit: Looks like indentation is off in this file.





[GitHub] [incubator-iceberg] moulimukherjee opened a new pull request #416: Adding docs on how to use custom catalog with iceberg

2019-08-26 Thread GitBox
moulimukherjee opened a new pull request #416: Adding docs on how to use custom catalog with iceberg
URL: https://github.com/apache/incubator-iceberg/pull/416
 
 
   Adding docs on how to use custom catalog with iceberg
   
   r? @rdblue 





[GitHub] [incubator-iceberg] rdblue commented on issue #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-26 Thread GitBox
rdblue commented on issue #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#issuecomment-525039053
 
 
   +1. I'm going to merge this.
   
   Thanks for the thorough unit tests, those look great.





[GitHub] [incubator-iceberg] rdblue merged pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-26 Thread GitBox
rdblue merged pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351
 
 
   





[GitHub] [incubator-iceberg] rdblue closed issue #316: Provide an API to modify records within files

2019-08-26 Thread GitBox
rdblue closed issue #316: Provide an API to modify records within files
URL: https://github.com/apache/incubator-iceberg/issues/316
 
 
   





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r317800614
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() {
       }
     }
 
+    if (conflictDetectionFilter != null) {
+      PartitionSpec spec = writeSpec();
+      Expression inclusiveExpr = Projections.inclusive(spec).project(conflictDetectionFilter);
+      Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr);
+
+      InclusiveMetricsEvaluator metrics = new InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter);
+
+      List<DataFile> newFiles = collectNewFiles(base);
+      for (DataFile newFile : newFiles) {
+        ValidationException.check(
+            !inclusive.eval(newFile.partition()) || !metrics.eval(newFile),
+            "A conflicting file was appended that matches filter '%s': %s",
+            conflictDetectionFilter, newFile.path());
+      }
+    }
+
     return super.apply(base);
   }
+
+  private List<DataFile> collectNewFiles(TableMetadata meta) {
+    List<DataFile> newFiles = new ArrayList<>();
+
+    Long currentSnapshotId = meta.currentSnapshot() == null ? null : meta.currentSnapshot().snapshotId();
+    while (currentSnapshotId != null && !currentSnapshotId.equals(readSnapshotId)) {
 
 Review comment:
   Okay, it should be fine if `readSnapshotId` is always explicitly set.
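   
   For context, here is a hedged sketch of how the history walk cut off in the diff above could continue, assuming `Snapshot#addedFiles()` and `Snapshot#parentId()` from the Iceberg API; this is illustrative, not the merged implementation:
   
   ```java
   // walk back from the current snapshot toward readSnapshotId, collecting every
   // file appended along the way
   private List<DataFile> collectNewFiles(TableMetadata meta) {
     List<DataFile> newFiles = new ArrayList<>();
   
     Long currentSnapshotId = meta.currentSnapshot() == null ? null : meta.currentSnapshot().snapshotId();
     while (currentSnapshotId != null && !currentSnapshotId.equals(readSnapshotId)) {
       Snapshot currentSnapshot = meta.snapshot(currentSnapshotId);
       for (DataFile newFile : currentSnapshot.addedFiles()) {
         newFiles.add(newFile);
       }
       // if readSnapshotId is never found, the walk ends at the first snapshot,
       // whose parentId is null
       currentSnapshotId = currentSnapshot.parentId();
     }
   
     return newFiles;
   }
   ```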





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r317799344
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -88,6 +119,42 @@ public OverwriteFiles validateAddedFiles() {
       }
     }
 
+    if (conflictDetectionFilter != null) {
+      PartitionSpec spec = writeSpec();
+      Expression inclusiveExpr = Projections.inclusive(spec).project(conflictDetectionFilter);
+      Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr);
+
+      InclusiveMetricsEvaluator metrics = new InclusiveMetricsEvaluator(base.schema(), conflictDetectionFilter);
+
+      List<DataFile> newFiles = collectNewFiles(base);
+      for (DataFile newFile : newFiles) {
+        ValidationException.check(
+            !inclusive.eval(newFile.partition()) || !metrics.eval(newFile),
+            "A conflicting file was appended that matches filter '%s': %s",
+            conflictDetectionFilter, newFile.path());
+      }
+    }
+
     return super.apply(base);
   }
+
+  private List<DataFile> collectNewFiles(TableMetadata meta) {
+    List<DataFile> newFiles = new ArrayList<>();
+
+    Long currentSnapshotId = meta.currentSnapshot() == null ? null : meta.currentSnapshot().snapshotId();
 
 Review comment:
   Right, that works.





[GitHub] [incubator-iceberg] aokolnychyi commented on issue #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-26 Thread GitBox
aokolnychyi commented on issue #351: Extend Iceberg with a way to overwrite 
files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#issuecomment-525028721
 
 
   I've updated this PR but there are still a couple of open questions.





[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-26 Thread GitBox
aokolnychyi commented on a change in pull request #351: Extend Iceberg with a 
way to overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r317793406
 
 

 ##
 File path: 
api/src/main/java/org/apache/iceberg/expressions/InclusiveMetricsEvaluator.java
 ##
 @@ -52,7 +52,7 @@ private MetricsEvalVisitor visitor() {
     return visitors.get();
   }
 
-  InclusiveMetricsEvaluator(Schema schema, Expression unbound) {
+  public InclusiveMetricsEvaluator(Schema schema, Expression unbound) {
 
 Review comment:
   Let's address this in #413 together with other places.





[GitHub] [incubator-iceberg] aokolnychyi commented on issue #398: Push down StringStartsWith in Spark IcebergSource

2019-08-26 Thread GitBox
aokolnychyi commented on issue #398: Push down StringStartsWith in Spark 
IcebergSource
URL: https://github.com/apache/incubator-iceberg/pull/398#issuecomment-524985833
 
 
   I've updated this PR to use `toByteBuffer` in `Literal`.





[GitHub] [incubator-iceberg] rdblue merged pull request #414: Add toByteBuffer to Literal

2019-08-26 Thread GitBox
rdblue merged pull request #414: Add toByteBuffer to Literal
URL: https://github.com/apache/incubator-iceberg/pull/414
 
 
   





[GitHub] [incubator-iceberg] rdblue commented on issue #414: Add toByteBuffer to Literal

2019-08-26 Thread GitBox
rdblue commented on issue #414: Add toByteBuffer to Literal
URL: https://github.com/apache/incubator-iceberg/pull/414#issuecomment-524964776
 
 
   +1. I'll merge this when tests pass.





[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #414: Add toByteBuffer to Literal

2019-08-26 Thread GitBox
aokolnychyi commented on a change in pull request #414: Add toByteBuffer to 
Literal
URL: https://github.com/apache/incubator-iceberg/pull/414#discussion_r317723243
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/expressions/Literal.java
 ##
 @@ -102,4 +102,8 @@
    * @return a comparator for T objects
    */
   Comparator<T> comparator();
+
+  default ByteBuffer toByteBuffer() {
 
 Review comment:
   Added. Let me know what you think.





[GitHub] [incubator-iceberg] aokolnychyi commented on a change in pull request #414: Add toByteBuffer to Literal

2019-08-26 Thread GitBox
aokolnychyi commented on a change in pull request #414: Add toByteBuffer to 
Literal
URL: https://github.com/apache/incubator-iceberg/pull/414#discussion_r317713814
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/expressions/Literal.java
 ##
 @@ -102,4 +102,8 @@
    * @return a comparator for T objects
    */
   Comparator<T> comparator();
+
+  default ByteBuffer toByteBuffer() {
+    throw new UnsupportedOperationException("toByteBuffer is not supported");
 
 Review comment:
   It is left unimplemented in `AboveMax` and `BelowMin`.
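   
   To make the pattern concrete, a minimal self-contained sketch (simplified stand-ins, not the actual Iceberg classes): sentinel literals such as `AboveMax`/`BelowMin` inherit the throwing default, while concrete literals override it.
   
   ```java
   import java.nio.ByteBuffer;
   import java.nio.ByteOrder;
   
   interface Literal<T> {
     // the default keeps sentinel implementations small: they simply never serialize
     default ByteBuffer toByteBuffer() {
       throw new UnsupportedOperationException("toByteBuffer is not supported");
     }
   }
   
   class LongLiteral implements Literal<Long> {
     private final long value;
   
     LongLiteral(long value) {
       this.value = value;
     }
   
     @Override
     public ByteBuffer toByteBuffer() {
       // 8-byte little-endian encoding, in the spirit of single-value serialization
       return ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putLong(0, value);
     }
   }
   
   class AboveMax implements Literal<Object> {
     // no override: the sentinel keeps the throwing default
   }
   ```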





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #374: Migrate spark table to 
iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r317712572
 
 

 ##
 File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala
 ##
 @@ -297,5 +302,81 @@ object SparkTableUtil {
       )
     }
   }
+
+  private def buildManifest(table: Table,
+      sparkDataFiles: Seq[SparkDataFile],
+      partitionSpec: PartitionSpec): ManifestFile = {
+    val outputFile = table.io
+      .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString))
 
 Review comment:
   Maybe you should have users pass in a location to use for this?





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #374: Migrate spark table to 
iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r317712345
 
 

 ##
 File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala
 ##
 @@ -297,5 +302,81 @@ object SparkTableUtil {
       )
     }
   }
+
+  private def buildManifest(table: Table,
+      sparkDataFiles: Seq[SparkDataFile],
+      partitionSpec: PartitionSpec): ManifestFile = {
+    val outputFile = table.io
+      .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString))
 
 Review comment:
   This location must be a shared location, either in HDFS or a distributed FS 
like S3. It can't be on the local host.





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #374: Migrate spark table to 
iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r317711705
 
 

 ##
 File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala
 ##
 @@ -297,5 +302,81 @@ object SparkTableUtil {
       )
     }
   }
+
+  private def buildManifest(table: Table,
+      sparkDataFiles: Seq[SparkDataFile],
+      partitionSpec: PartitionSpec): ManifestFile = {
+    val outputFile = table.io
+      .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString))
+    val writer = ManifestWriter.write(partitionSpec, outputFile)
+    try {
+      sparkDataFiles.foreach { file =>
+        writer.add(file.toDataFile(partitionSpec))
+      }
+    } finally {
+      writer.close()
+    }
+
+    writer.toManifestFile
+  }
+
+  /**
+   * Import a Spark table to an Iceberg table.
+   *
+   * The import uses the Spark session to get table metadata. It assumes no
+   * operation is going on the original table and target table and thus is not
+   * thread-safe.
+   *
+   * @param source the database name of the table to be imported
+   * @param location the location used to store table metadata
+   *
+   * @return table the imported table
+   */
+  def importSparkTable(source: TableIdentifier, location: String): Table = {
+    val sparkSession = SparkSession.builder().getOrCreate()
+    import sparkSession.sqlContext.implicits._
+
+    val dbName = source.database.getOrElse("default")
+    val tableName = source.table
+
+    if (!sparkSession.catalog.tableExists(dbName, tableName)) {
+      throw new NoSuchTableException(s"Table $dbName.$tableName does not exist")
+    }
+
+    val partitionSpec = SparkSchemaUtil.specForTable(sparkSession, s"$dbName.$tableName")
+    val conf = sparkSession.sparkContext.hadoopConfiguration
+    val tables = new HadoopTables(conf)
+    val schema = SparkSchemaUtil.schemaForTable(sparkSession, s"$dbName.$tableName")
+    val table = tables.create(schema, partitionSpec, ImmutableMap.of(), location)
+    val appender = table.newAppend()
+
+    if (partitionSpec == PartitionSpec.unpartitioned) {
+      val tableMetadata = sparkSession.sessionState.catalog.getTableMetadata(source)
+      val format = tableMetadata.provider.getOrElse("none")
+
+      if (format != "avro" && format != "parquet" && format != "orc") {
+        throw new UnsupportedOperationException(s"Unsupported format: $format")
+      }
+      listPartition(Map.empty[String, String], tableMetadata.location.toString, format)
+        .foreach { f => appender.appendFile(f.toDataFile(PartitionSpec.unpartitioned)) }
+      appender.commit()
+    } else {
+      val partitions = partitionDF(sparkSession, s"$dbName.$tableName")
+      partitions.flatMap { row =>
+        listPartition(row.getMap[String, String](0).toMap, row.getString(1), row.getString(2))
+      }.coalesce(1).mapPartitions {
 
 Review comment:
   And here's the `Manifest` case class:
   
   ```scala
   private[sql] case class Manifest(location: String, fileLength: Long, specId: Int) {
     def toManifestFile: ManifestFile = new ManifestFile {
       override def path: String = location
   
       override def length: Long = fileLength
   
       override def partitionSpecId: Int = specId
   
       override def snapshotId: java.lang.Long = null
   
       override def addedFilesCount: Integer = null
   
       override def existingFilesCount: Integer = null
   
       override def deletedFilesCount: Integer = null
   
       override def partitions: java.util.List[ManifestFile.PartitionFieldSummary] = null
   
       override def copy: ManifestFile = this
     }
   }
   ```





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #374: Migrate spark table to 
iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r317711446
 
 

 ##
 File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala
 ##
 @@ -297,5 +302,81 @@ object SparkTableUtil {
       )
     }
   }
+
+  private def buildManifest(table: Table,
+      sparkDataFiles: Seq[SparkDataFile],
+      partitionSpec: PartitionSpec): ManifestFile = {
+    val outputFile = table.io
+      .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString))
+    val writer = ManifestWriter.write(partitionSpec, outputFile)
+    try {
+      sparkDataFiles.foreach { file =>
+        writer.add(file.toDataFile(partitionSpec))
+      }
+    } finally {
+      writer.close()
+    }
+
+    writer.toManifestFile
+  }
+
+  /**
+   * Import a Spark table to an Iceberg table.
+   *
+   * The import uses the Spark session to get table metadata. It assumes no
+   * operation is going on the original table and target table and thus is not
+   * thread-safe.
+   *
+   * @param source the database name of the table to be imported
+   * @param location the location used to store table metadata
+   *
+   * @return table the imported table
+   */
+  def importSparkTable(source: TableIdentifier, location: String): Table = {
+    val sparkSession = SparkSession.builder().getOrCreate()
+    import sparkSession.sqlContext.implicits._
+
+    val dbName = source.database.getOrElse("default")
+    val tableName = source.table
+
+    if (!sparkSession.catalog.tableExists(dbName, tableName)) {
+      throw new NoSuchTableException(s"Table $dbName.$tableName does not exist")
+    }
+
+    val partitionSpec = SparkSchemaUtil.specForTable(sparkSession, s"$dbName.$tableName")
+    val conf = sparkSession.sparkContext.hadoopConfiguration
+    val tables = new HadoopTables(conf)
+    val schema = SparkSchemaUtil.schemaForTable(sparkSession, s"$dbName.$tableName")
+    val table = tables.create(schema, partitionSpec, ImmutableMap.of(), location)
+    val appender = table.newAppend()
+
+    if (partitionSpec == PartitionSpec.unpartitioned) {
+      val tableMetadata = sparkSession.sessionState.catalog.getTableMetadata(source)
+      val format = tableMetadata.provider.getOrElse("none")
+
+      if (format != "avro" && format != "parquet" && format != "orc") {
+        throw new UnsupportedOperationException(s"Unsupported format: $format")
+      }
+      listPartition(Map.empty[String, String], tableMetadata.location.toString, format)
+        .foreach { f => appender.appendFile(f.toDataFile(PartitionSpec.unpartitioned)) }
+      appender.commit()
+    } else {
+      val partitions = partitionDF(sparkSession, s"$dbName.$tableName")
+      partitions.flatMap { row =>
+        listPartition(row.getMap[String, String](0).toMap, row.getString(1), row.getString(2))
+      }.coalesce(1).mapPartitions {
 
 Review comment:
   Here's `writeManifest`:
   
   ```scala
   def writeManifest(
       conf: SerializableConfiguration,
       spec: PartitionSpec,
       basePath: String): Iterator[SparkDataFile] => Iterator[Manifest] = {
     files =>
       if (files.hasNext) {
         val ctx = TaskContext.get()
         val manifestLocation = new Path(basePath,
           s"stage-${ctx.stageId()}-task-${ctx.taskAttemptId()}-manifest.avro").toString
         val io = new HadoopFileIO(conf.value)
         val writer = ManifestWriter.write(spec, io.newOutputFile(manifestLocation))
   
         try {
           files.foreach { file =>
             writer.add(file.toDataFile(spec))
           }
         } finally {
           writer.close()
         }
   
         val manifest = writer.toManifestFile
         Seq(Manifest(manifest.path, manifest.length, manifest.partitionSpecId)).iterator
   
       } else {
         Seq.empty.iterator
       }
   }
   ```





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #374: Migrate spark table to 
iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r317710889
 
 

 ##
 File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala
 ##
 @@ -297,5 +302,81 @@ object SparkTableUtil {
       )
     }
   }
+
+  private def buildManifest(table: Table,
+      sparkDataFiles: Seq[SparkDataFile],
+      partitionSpec: PartitionSpec): ManifestFile = {
+    val outputFile = table.io
+      .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString))
+    val writer = ManifestWriter.write(partitionSpec, outputFile)
+    try {
+      sparkDataFiles.foreach { file =>
+        writer.add(file.toDataFile(partitionSpec))
+      }
+    } finally {
+      writer.close()
+    }
+
+    writer.toManifestFile
+  }
+
+  /**
+   * Import a Spark table to an Iceberg table.
+   *
+   * The import uses the Spark session to get table metadata. It assumes no
+   * operation is going on the original table and target table and thus is not
+   * thread-safe.
+   *
+   * @param source the database name of the table to be imported
+   * @param location the location used to store table metadata
+   *
+   * @return table the imported table
+   */
+  def importSparkTable(source: TableIdentifier, location: String): Table = {
+    val sparkSession = SparkSession.builder().getOrCreate()
+    import sparkSession.sqlContext.implicits._
+
+    val dbName = source.database.getOrElse("default")
+    val tableName = source.table
+
+    if (!sparkSession.catalog.tableExists(dbName, tableName)) {
+      throw new NoSuchTableException(s"Table $dbName.$tableName does not exist")
+    }
+
+    val partitionSpec = SparkSchemaUtil.specForTable(sparkSession, s"$dbName.$tableName")
+    val conf = sparkSession.sparkContext.hadoopConfiguration
+    val tables = new HadoopTables(conf)
+    val schema = SparkSchemaUtil.schemaForTable(sparkSession, s"$dbName.$tableName")
+    val table = tables.create(schema, partitionSpec, ImmutableMap.of(), location)
+    val appender = table.newAppend()
+
+    if (partitionSpec == PartitionSpec.unpartitioned) {
+      val tableMetadata = sparkSession.sessionState.catalog.getTableMetadata(source)
+      val format = tableMetadata.provider.getOrElse("none")
+
+      if (format != "avro" && format != "parquet" && format != "orc") {
+        throw new UnsupportedOperationException(s"Unsupported format: $format")
+      }
+      listPartition(Map.empty[String, String], tableMetadata.location.toString, format)
+        .foreach { f => appender.appendFile(f.toDataFile(PartitionSpec.unpartitioned)) }
+      appender.commit()
+    } else {
+      val partitions = partitionDF(sparkSession, s"$dbName.$tableName")
+      partitions.flatMap { row =>
+        listPartition(row.getMap[String, String](0).toMap, row.getString(1), row.getString(2))
+      }.coalesce(1).mapPartitions {
+        files =>
+          val tables = new HadoopTables(new Configuration())
 
 Review comment:
   Can you use `SerializableConfiguration` instead? This should use Spark's 
`hadoopConfiguration`.





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #374: Migrate spark table to 
iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r317711151
 
 

 ##
 File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala
 ##
 @@ -297,5 +302,81 @@ object SparkTableUtil {
       )
     }
   }
+
+  private def buildManifest(table: Table,
+      sparkDataFiles: Seq[SparkDataFile],
+      partitionSpec: PartitionSpec): ManifestFile = {
+    val outputFile = table.io
+      .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString))
+    val writer = ManifestWriter.write(partitionSpec, outputFile)
+    try {
+      sparkDataFiles.foreach { file =>
+        writer.add(file.toDataFile(partitionSpec))
+      }
+    } finally {
+      writer.close()
+    }
+
+    writer.toManifestFile
+  }
+
+  /**
+   * Import a Spark table to an Iceberg table.
+   *
+   * The import uses the Spark session to get table metadata. It assumes no
+   * operation is going on the original table and target table and thus is not
+   * thread-safe.
+   *
+   * @param source the database name of the table to be imported
+   * @param location the location used to store table metadata
+   *
+   * @return table the imported table
+   */
+  def importSparkTable(source: TableIdentifier, location: String): Table = {
+    val sparkSession = SparkSession.builder().getOrCreate()
+    import sparkSession.sqlContext.implicits._
+
+    val dbName = source.database.getOrElse("default")
+    val tableName = source.table
+
+    if (!sparkSession.catalog.tableExists(dbName, tableName)) {
+      throw new NoSuchTableException(s"Table $dbName.$tableName does not exist")
+    }
+
+    val partitionSpec = SparkSchemaUtil.specForTable(sparkSession, s"$dbName.$tableName")
+    val conf = sparkSession.sparkContext.hadoopConfiguration
+    val tables = new HadoopTables(conf)
+    val schema = SparkSchemaUtil.schemaForTable(sparkSession, s"$dbName.$tableName")
+    val table = tables.create(schema, partitionSpec, ImmutableMap.of(), location)
+    val appender = table.newAppend()
+
+    if (partitionSpec == PartitionSpec.unpartitioned) {
+      val tableMetadata = sparkSession.sessionState.catalog.getTableMetadata(source)
+      val format = tableMetadata.provider.getOrElse("none")
+
+      if (format != "avro" && format != "parquet" && format != "orc") {
 
 Review comment:
   This should use the same logic that `partitionDF` uses to detect the format. 
It should check the serde's input format.





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #374: Migrate spark table to 
iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r317710578
 
 

 ##
 File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala
 ##
 @@ -297,5 +302,81 @@ object SparkTableUtil {
       )
     }
   }
+
+  private def buildManifest(table: Table,
+      sparkDataFiles: Seq[SparkDataFile],
+      partitionSpec: PartitionSpec): ManifestFile = {
+    val outputFile = table.io
+      .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString))
+    val writer = ManifestWriter.write(partitionSpec, outputFile)
+    try {
+      sparkDataFiles.foreach { file =>
+        writer.add(file.toDataFile(partitionSpec))
+      }
+    } finally {
+      writer.close()
+    }
+
+    writer.toManifestFile
+  }
+
+  /**
+   * Import a Spark table to an Iceberg table.
+   *
+   * The import uses the Spark session to get table metadata. It assumes no
+   * operation is going on the original table and target table and thus is not
+   * thread-safe.
+   *
+   * @param source the database name of the table to be imported
+   * @param location the location used to store table metadata
+   *
+   * @return table the imported table
+   */
+  def importSparkTable(source: TableIdentifier, location: String): Table = {
+    val sparkSession = SparkSession.builder().getOrCreate()
+    import sparkSession.sqlContext.implicits._
+
+    val dbName = source.database.getOrElse("default")
+    val tableName = source.table
+
+    if (!sparkSession.catalog.tableExists(dbName, tableName)) {
+      throw new NoSuchTableException(s"Table $dbName.$tableName does not exist")
+    }
+
+    val partitionSpec = SparkSchemaUtil.specForTable(sparkSession, s"$dbName.$tableName")
+    val conf = sparkSession.sparkContext.hadoopConfiguration
+    val tables = new HadoopTables(conf)
+    val schema = SparkSchemaUtil.schemaForTable(sparkSession, s"$dbName.$tableName")
+    val table = tables.create(schema, partitionSpec, ImmutableMap.of(), location)
+    val appender = table.newAppend()
+
+    if (partitionSpec == PartitionSpec.unpartitioned) {
+      val tableMetadata = sparkSession.sessionState.catalog.getTableMetadata(source)
+      val format = tableMetadata.provider.getOrElse("none")
+
+      if (format != "avro" && format != "parquet" && format != "orc") {
+        throw new UnsupportedOperationException(s"Unsupported format: $format")
+      }
+      listPartition(Map.empty[String, String], tableMetadata.location.toString, format)
+        .foreach { f => appender.appendFile(f.toDataFile(PartitionSpec.unpartitioned)) }
+      appender.commit()
+    } else {
+      val partitions = partitionDF(sparkSession, s"$dbName.$tableName")
+      partitions.flatMap { row =>
+        listPartition(row.getMap[String, String](0).toMap, row.getString(1), row.getString(2))
+      }.coalesce(1).mapPartitions {
 
 Review comment:
   Why `coalesce(1)` here?
   
   In our version of this, we add a sort by file name and build manifests in 
parallel:
   
   ```scala
   val tempPath = new Path(s"hdfs:/tmp/iceberg-conversions/$applicationId")
   
   val manifests: Seq[ManifestFile] = files
     .repartition(100) // repartition to shuffle the data and not list partitions twice
     .orderBy($"path")
     .mapPartitions(writeManifest) // writes manifests to tempPath
     .collect()
     .map(_.toManifestFile)
   
   val append = table.newAppend
   manifests.foreach(append.appendManifest)
   append.commit()
   ```





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #374: Migrate spark table to 
iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r317708705
 
 

 ##
 File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala
 ##
 @@ -297,5 +301,81 @@ object SparkTableUtil {
       )
     }
   }
+
+  private def buildManifest(table: Table,
+      sparkDataFiles: Seq[SparkDataFile],
+      partitionSpec: PartitionSpec): ManifestFile = {
+    val outputFile = table.io
+      .newOutputFile(FileFormat.AVRO.addExtension("/tmp/" + UUID.randomUUID.toString))
+    val writer = ManifestWriter.write(partitionSpec, outputFile)
+    try {
+      for (file <- sparkDataFiles) {
+        writer.add(file.toDataFile(partitionSpec))
+      }
+    } finally {
+      writer.close()
+    }
+
+    writer.toManifestFile
+  }
+
+  /**
+   * Import a Spark table to an Iceberg table.
+   *
+   * The import uses the Spark session to get table metadata. It assumes no
+   * operation is going on the original table and target table and thus is not
+   * thread-safe.
+   *
+   * @param source the database name of the table to be imported
+   * @param location the location used to store table metadata
+   *
+   * @return table the imported table
+   */
+  def importSparkTable(source: TableIdentifier, location: String): Table = {
+    val sparkSession = SparkSession.builder().getOrCreate()
+    import sparkSession.sqlContext.implicits._
+
+    val dbName = source.database.getOrElse("default")
+    val tableName = source.table
+
+    if (!sparkSession.catalog.tableExists(dbName, tableName)) {
+      throw new NoSuchTableException(s"Table $dbName.$tableName does not exist")
+    }
+
+    val partitionSpec = SparkSchemaUtil.specForTable(sparkSession, s"$dbName.$tableName")
+    val conf = sparkSession.sparkContext.hadoopConfiguration
+    val tables = new HadoopTables(conf)
+    val schema = SparkSchemaUtil.schemaForTable(sparkSession, s"$dbName.$tableName")
+    val table = tables.create(schema, partitionSpec, ImmutableMap.of(), location)
+    val appender = table.newAppend()
+
+    if (partitionSpec == PartitionSpec.unpartitioned) {
 
 Review comment:
   No, that's okay. If the table isn't partitioned, we don't need to get the 
partitions as a dataframe. It is fine to run this on the driver in that case.





[GitHub] [incubator-iceberg] rdblue commented on issue #412: Use null counts in metrics evaluators

2019-08-26 Thread GitBox
rdblue commented on issue #412: Use null counts in metrics evaluators
URL: https://github.com/apache/incubator-iceberg/pull/412#issuecomment-524949101
 
 
   Thanks @aokolnychyi!





[GitHub] [incubator-iceberg] rdblue merged pull request #412: Use null counts in metrics evaluators

2019-08-26 Thread GitBox
rdblue merged pull request #412: Use null counts in metrics evaluators
URL: https://github.com/apache/incubator-iceberg/pull/412
 
 
   





[GitHub] [incubator-iceberg] rdblue closed issue #408: Use null counts in metrics evaluators

2019-08-26 Thread GitBox
rdblue closed issue #408: Use null counts in metrics evaluators
URL: https://github.com/apache/incubator-iceberg/issues/408
 
 
   





[GitHub] [incubator-iceberg] rdblue commented on issue #414: Add toByteBuffer to Literal

2019-08-26 Thread GitBox
rdblue commented on issue #414: Add toByteBuffer to Literal
URL: https://github.com/apache/incubator-iceberg/pull/414#issuecomment-524945345
 
 
   Thanks @aokolnychyi! I just found a couple of minor things to fix.





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #414: Add toByteBuffer to Literal

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #414: Add toByteBuffer to Literal
URL: https://github.com/apache/incubator-iceberg/pull/414#discussion_r317701995
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/expressions/Literal.java
 ##
 @@ -102,4 +102,8 @@
    * @return a comparator for T objects
    */
   Comparator<T> comparator();
+
+  default ByteBuffer toByteBuffer() {
+    throw new UnsupportedOperationException("toByteBuffer is not supported");
 
 Review comment:
   Is this left unimplemented for any type? If it isn't then I think we can 
remove the default implementation.





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #414: Add toByteBuffer to Literal

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #414: Add toByteBuffer to Literal
URL: https://github.com/apache/incubator-iceberg/pull/414#discussion_r317701749
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/expressions/Literal.java
 ##
 @@ -102,4 +102,8 @@
    * @return a comparator for T objects
    */
   Comparator<T> comparator();
+
+  default ByteBuffer toByteBuffer() {
 
 Review comment:
   This needs some documentation to explain how the value is serialized to a byte buffer. I think it uses the [single-value serialization format](http://iceberg.apache.org/spec/#appendix-d-single-value-serialization), so linking to that would be helpful.
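   
   A hedged sketch of what that documentation could look like (wording illustrative, not the final javadoc):
   
   ```java
   /**
    * Serializes the value of this literal using the single-value serialization
    * format described in Appendix D of the Iceberg spec:
    * http://iceberg.apache.org/spec/#appendix-d-single-value-serialization
    *
    * @return a ByteBuffer containing this literal's serialized value
    * @throws UnsupportedOperationException if the literal cannot be serialized
    */
   default ByteBuffer toByteBuffer() {
     throw new UnsupportedOperationException("toByteBuffer is not supported");
   }
   ```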





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #414: Add toByteBuffer to Literal

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #414: Add toByteBuffer to Literal
URL: https://github.com/apache/incubator-iceberg/pull/414#discussion_r317700813
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/expressions/Literals.java
 ##
 @@ -94,6 +95,7 @@ private Literals() {
 
   private abstract static class BaseLiteral<T> implements Literal<T> {
     private final T value;
+    private volatile ByteBuffer byteBuffer;
 
 Review comment:
   Please explicitly initialize variables either here or in constructors.
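   
   For illustration, the lazy-caching idiom the `volatile` field suggests, with the explicit initialization requested above; this is a fragment of the `BaseLiteral` class in the diff, and `serializeToByteBuffer` is a hypothetical helper:
   
   ```java
   private volatile ByteBuffer byteBuffer = null; // initialized explicitly
   
   @Override
   public ByteBuffer toByteBuffer() {
     if (byteBuffer == null) {
       // racy single-check is safe here: a duplicate computation produces an
       // identical buffer, and the volatile write publishes it fully built
       byteBuffer = serializeToByteBuffer(value); // hypothetical helper
     }
     return byteBuffer;
   }
   ```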





[GitHub] [incubator-iceberg] rdblue merged pull request #350: Add dropTable purge option to Catalog API

2019-08-26 Thread GitBox
rdblue merged pull request #350: Add dropTable purge option to Catalog API
URL: https://github.com/apache/incubator-iceberg/pull/350
 
 
   





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #406: Throw an exception when using Iceberg with Spark embedded metastore

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #406: Throw an exception when 
using Iceberg with Spark embedded metastore
URL: https://github.com/apache/incubator-iceberg/pull/406#discussion_r317698991
 
 

 ##
 File path: hive/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
 ##
 @@ -183,6 +183,12 @@ public void commit(TableMetadata base, TableMetadata metadata) {
       }
       threw = false;
     } catch (TException | UnknownHostException e) {
+      if (e.getMessage().contains("Table/View 'HIVE_LOCKS' does not exist")) {
+        LOG.error("Failed to acquire locks from metastore because 'HIVE_LOCKS' doesn't exist, " +
+            "this probably happened when using embedded metastore or doesn't create transactional" +
+            " meta table. Please reconfigure and start the metastore.", e);
 
 Review comment:
   This should also be thrown instead of logged.





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #406: Throw an exception when using Iceberg with Spark embedded metastore

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #406: Throw an exception when 
using Iceberg with Spark embedded metastore
URL: https://github.com/apache/incubator-iceberg/pull/406#discussion_r317698659
 
 

 ##
 File path: hive/src/main/java/org/apache/iceberg/hive/HiveClientPool.java
 ##
 @@ -44,6 +48,14 @@ protected HiveMetaStoreClient newClient() {
       return new HiveMetaStoreClient(hiveConf);
     } catch (MetaException e) {
       throw new RuntimeMetaException(e, "Failed to connect to Hive Metastore");
+    } catch (Throwable t) {
+      if (t.getMessage().contains("Another instance of Derby may have already booted")) {
+        LOG.error("Another instance of Derby may have already booted. This is happened when using" +
+            " embedded metastore, which only supports one client at time. Please change to use " +
+            "other metastores which could support multiple clients, like metastore service.", t);
 
 Review comment:
   Can we change the error message to something a bit shorter? How about: 
"Failed to start an embedded metastore because Derby supports only one client 
at a time. To fix this, use a metastore that supports multiple clients."





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #406: Throw an exception when using Iceberg with Spark embedded metastore

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #406: Throw an exception when 
using Iceberg with Spark embedded metastore
URL: https://github.com/apache/incubator-iceberg/pull/406#discussion_r317698081
 
 

 ##
 File path: hive/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java
 ##
 @@ -183,6 +183,12 @@ public void commit(TableMetadata base, TableMetadata metadata) {
       }
       threw = false;
     } catch (TException | UnknownHostException e) {
+      if (e.getMessage().contains("Table/View 'HIVE_LOCKS' does not exist")) {
+        LOG.error("Failed to acquire locks from metastore because 'HIVE_LOCKS' doesn't exist, " +
+            "this probably happened when using embedded metastore or doesn't create transactional" +
+            " meta table. Please reconfigure and start the metastore.", e);
 
 Review comment:
   This recommendation isn't very helpful because it isn't clear what "the 
metastore" is. How about this instead: "To fix this, use an alternative 
metastore".





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #406: Throw an exception when using Iceberg with Spark embedded metastore

2019-08-26 Thread GitBox
rdblue commented on a change in pull request #406: Throw an exception when 
using Iceberg with Spark embedded metastore
URL: https://github.com/apache/incubator-iceberg/pull/406#discussion_r317697046
 
 

 ##
 File path: hive/src/main/java/org/apache/iceberg/hive/HiveClientPool.java
 ##
 @@ -44,6 +48,14 @@ protected HiveMetaStoreClient newClient() {
       return new HiveMetaStoreClient(hiveConf);
     } catch (MetaException e) {
       throw new RuntimeMetaException(e, "Failed to connect to Hive Metastore");
+    } catch (Throwable t) {
+      if (t.getMessage().contains("Another instance of Derby may have already booted")) {
+        LOG.error("Another instance of Derby may have already booted. This is happened when using" +
+            " embedded metastore, which only supports one client at time. Please change to use " +
+            "other metastores which could support multiple clients, like metastore service.", t);
 
 Review comment:
   It is fine to log this message, but I think the exception that is thrown 
should also contain it.
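   
   Combining this with the shorter wording suggested above, the catch block could look roughly like this (a hedged sketch, not the final patch; the wrapping exception type is an assumption):
   
   ```java
   } catch (Throwable t) {
     if (t.getMessage() != null
         && t.getMessage().contains("Another instance of Derby may have already booted")) {
       String message = "Failed to start an embedded metastore because Derby supports " +
           "only one client at a time. To fix this, use a metastore that supports " +
           "multiple clients.";
       LOG.error(message, t);
       // throw with the same message instead of only logging it
       throw new RuntimeException(message, t); // assumed wrapper type for the sketch
     }
     throw new RuntimeException(t);
   }
   ```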





[GitHub] [incubator-iceberg] rdblue commented on issue #280: Add persistent IDs to partition fields

2019-08-26 Thread GitBox
rdblue commented on issue #280: Add persistent IDs to partition fields
URL: 
https://github.com/apache/incubator-iceberg/issues/280#issuecomment-524938283
 
 
   @manishmalhotrawork, those IDs have different contexts. The source ID in a 
partition field is the ID of the source data column in the table schema. The ID 
added by partitionType is the ID in the manifest file schema.
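   
   To make the two ID spaces concrete, a hedged sketch against the public API (the schema and printed IDs are illustrative):
   
   ```java
   import org.apache.iceberg.PartitionSpec;
   import org.apache.iceberg.Schema;
   import org.apache.iceberg.types.Types;
   
   public class PartitionIds {
     public static void main(String[] args) {
       Schema schema = new Schema(
           Types.NestedField.required(1, "id", Types.LongType.get()),
           Types.NestedField.required(2, "ts", Types.TimestampType.withZone()));
   
       // the partition field's sourceId is 2: the "ts" column in the table schema
       PartitionSpec spec = PartitionSpec.builderFor(schema)
           .day("ts")
           .build();
   
       // partitionType() assigns new IDs (1000+) in the manifest file schema,
       // e.g. struct<1000: ts_day: optional date>
       System.out.println(spec.partitionType());
     }
   }
   ```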





[GitHub] [incubator-iceberg] manishmalhotrawork edited a comment on issue #280: Add persistent IDs to partition fields

2019-08-26 Thread GitBox
manishmalhotrawork edited a comment on issue #280: Add persistent IDs to 
partition fields
URL: 
https://github.com/apache/incubator-iceberg/issues/280#issuecomment-524733183
 
 
   @rdblue thanks for the answer.
   I'm curious about the flow and how fieldIds are maintained in PartitionSpec vs the table schema.
   
   Here [PartitionSpec is initialized](https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L63)
   and fields are copied to the partition spec, which also includes the original IDs assigned to the fields.
   
   And in the [partitionType method](https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L113)
   a new fieldId (1000+) is assigned to the partition fields and a StructType with those fields is returned.
   
   Not sure if I'm missing something.
   
   So, why are the original fieldIds and the IDs assigned in the `PartitionSpec` different?
   How is the `StructType` returned by `partitionType` with IDs of 1000+ used?
   
   





[GitHub] [incubator-iceberg] rdblue commented on issue #412: Use null counts in metrics evaluators

2019-08-26 Thread GitBox
rdblue commented on issue #412: Use null counts in metrics evaluators
URL: https://github.com/apache/incubator-iceberg/pull/412#issuecomment-524914156
 
 
   > Is the current implementation of eq in StrictMetricsEvaluator safe? What 
if we have truncated min/max values?
   
   I think the bounds check is safe with truncated bounds. The check is that 
the literal value is equal to lower and upper bounds, so we would know that 
there is only one value that is equal to the filter's literal. That can only 
happen when the bounds have not been truncated because the truncated upper 
bound is never equal to the truncated lower bound unless the value was already 
smaller than the truncation length.
   
   Here's an example: Say the real min is "aa", the real max is some longer string 
starting with "aa", and we are truncating to 2 codepoints. Then the bounds will 
be ["aa", "ab"]. Because there is no value equal to both bounds, `eq` will never 
return `ROWS_MUST_MATCH`. Alternatively, if there is just one value, "aa", then 
both min/max and truncated bounds will be "aa". In that case, `eq` will detect 
that "aa" is equal to both bounds and return that the predicate is guaranteed to 
be true by the bounds.
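   
   The same arithmetic as a tiny hedged sketch (the concrete strings are invented for illustration; this is not Iceberg's truncation code):
   
   ```java
   public class TruncatedBounds {
     public static void main(String[] args) {
       String min = "aaab"; // real lower bound
       String max = "aabc"; // real upper bound, longer than 2 codepoints
   
       // lower bound: plain truncation to 2 codepoints
       String lower = min.substring(0, 2); // "aa"
   
       // upper bound: truncate, then increment the last codepoint so the bound
       // still covers every value that shares the truncated prefix
       String truncated = max.substring(0, 2); // "aa"
       String upper = truncated.substring(0, 1) + (char) (truncated.charAt(1) + 1); // "ab"
   
       // lower != upper, so eq can never prove ROWS_MUST_MATCH from these bounds
       System.out.println(lower + " / " + upper);
     }
   }
   ```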





[GitHub] [incubator-iceberg] aokolnychyi commented on issue #412: Use null counts in metrics evaluators

2019-08-26 Thread GitBox
aokolnychyi commented on issue #412: Use null counts in metrics evaluators
URL: https://github.com/apache/incubator-iceberg/pull/412#issuecomment-524900800
 
 
   @rdblue not directly related to this PR, but is the current implementation 
of `eq` in `StrictMetricsEvaluator` safe? What if we have truncated min/max 
values? 





[GitHub] [incubator-iceberg] chenjunjiedada opened a new issue #415: supports unpartitioned table in partitionDF

2019-08-26 Thread GitBox
chenjunjiedada opened a new issue #415: supports unpartitioned table in 
partitionDF
URL: https://github.com/apache/incubator-iceberg/issues/415
 
 
   partitionDF throws an exception when executing on an unpartitioned table; see the following exception:
   ```
   org.apache.hadoop.hive.ql.metadata.HiveException: Table item is not a partitioned table
     at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2123)
     at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2156)
     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitions$1.apply(HiveClientImpl.scala:670)
     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitions$1.apply(HiveClientImpl.scala:662)
     at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
     at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
     at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
     at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
     at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitions(HiveClientImpl.scala:662)
     at org.apache.spark.sql.hive.client.HiveClient$class.getPartitions(HiveClient.scala:210)
     at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitions(HiveClientImpl.scala:83)
     at org.apache.iceberg.spark.hacks.Hive.partitions(Hive.java:48)
     at org.apache.iceberg.spark.SparkTableUtil$.partitionDF(SparkTableUtil.scala:56)
   ```





[GitHub] [incubator-iceberg] jerryshao commented on issue #406: Throw an exception when using Iceberg with Spark embedded metastore

2019-08-26 Thread GitBox
jerryshao commented on issue #406: Throw an exception when using Iceberg with 
Spark embedded metastore
URL: https://github.com/apache/incubator-iceberg/pull/406#issuecomment-524843479
 
 
   @rdblue please help to take another look, thanks!





[GitHub] [incubator-iceberg] manishmalhotrawork edited a comment on issue #280: Add persistent IDs to partition fields

2019-08-26 Thread GitBox
manishmalhotrawork edited a comment on issue #280: Add persistent IDs to 
partition fields
URL: 
https://github.com/apache/incubator-iceberg/issues/280#issuecomment-524733183
 
 
   @rdblue thanks for the answer.
   I'm curious about the flow and how fieldIds are maintained in PartitionSpec vs the table schema.
   
   Here [PartitionSpec is initialized](https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L63)
   and fields are copied to the partition spec, which also includes the original IDs assigned to the fields.
   
   And in the [partitionType method](https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L113)
   a new fieldId (1000+) is assigned to the partition fields and a StructType with those fields is returned.
   
   Not sure if I'm missing something.
   
   So, why are the original fieldIds and the IDs assigned in the `PartitionSpec` different?
   How is the `StructType` returned by `partitionType` with IDs of 1000+ used?
   
   
   





[GitHub] [incubator-iceberg] manishmalhotrawork commented on issue #280: Add persistent IDs to partition fields

2019-08-26 Thread GitBox
manishmalhotrawork commented on issue #280: Add persistent IDs to partition 
fields
URL: 
https://github.com/apache/incubator-iceberg/issues/280#issuecomment-524733183
 
 
   @rdblue thanks for the answer.
   I'm curious about the flow and how fieldIds are maintained in PartitionSpec vs the table schema.
   
   Here [PartitionSpec is initialized](https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L63)
   and fields are copied to the partition spec, which also includes the original IDs assigned to the fields.
   
   And in the [partitionType method](https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/PartitionSpec.java#L113)
   a new fieldId (1000+) is assigned to the partition fields and a StructType with those fields is returned.
   
   Not sure if I'm missing something.
   
   So, why are the original fieldIds and the IDs assigned in the `PartitionSpec` different?
   
   How is the `StructType` returned by `partitionType` with IDs of 1000+ used?
   
   
   

