[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #374: Migrate spark table to iceberg table
chenjunjiedada commented on a change in pull request #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315939963

## File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala

```diff
@@ -19,18 +19,22 @@ package org.apache.iceberg.spark

+import com.google.common.collect.ImmutableMap
 import com.google.common.collect.Maps
 import java.nio.ByteBuffer
 import java.util
+import java.util.UUID
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{Path, PathFilter}
-import org.apache.iceberg.{DataFile, DataFiles, Metrics, MetricsConfig, PartitionSpec}
-import org.apache.iceberg.hadoop.HadoopInputFile
+import org.apache.iceberg._
```

Review comment: It looks like we need to import 9 entities, which exceeds the 100-character line length limit. Spark's coding style prefers a wildcard when importing more than 6 entities. Just want to confirm which we prefer: a line break, or following Spark's Scala coding style?

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
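For reference, the line-break alternative under discussion would keep an explicit import list and wrap it across lines; this fragment is only illustrative (the exact set of 9 entities in the PR is not shown here):

```scala
// Explicit imports wrapped instead of a wildcard; the entity list is illustrative.
import org.apache.iceberg.{DataFile, DataFiles, Metrics, MetricsConfig,
  PartitionSpec}
```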
[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #374: Migrate spark table to iceberg table
chenjunjiedada commented on a change in pull request #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315936341

## File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala

```diff
@@ -19,18 +19,22 @@ package org.apache.iceberg.spark

+import com.google.common.collect.ImmutableMap
 import com.google.common.collect.Maps
 import java.nio.ByteBuffer
 import java.util
+import java.util.UUID
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{Path, PathFilter}
-import org.apache.iceberg.{DataFile, DataFiles, Metrics, MetricsConfig, PartitionSpec}
-import org.apache.iceberg.hadoop.HadoopInputFile
+import org.apache.iceberg._
```

Review comment: hmm... IntelliJ IDEA always optimizes the imports when adding a new one via Alt+Enter. Let me update it.
[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #374: Migrate spark table to iceberg table
chenjunjiedada commented on a change in pull request #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315935961

## File path: hive/src/test/java/org/apache/iceberg/hive/TestHiveMetastore.java

```diff
@@ -111,6 +111,7 @@ private HiveConf newHiveConf(int port) {
     HiveConf newHiveConf = new HiveConf(new Configuration(), TestHiveMetastore.class);
     newHiveConf.set(HiveConf.ConfVars.METASTOREURIS.varname, "thrift://localhost:" + port);
     newHiveConf.set(HiveConf.ConfVars.METASTOREWAREHOUSE.varname, "file:" + hiveLocalDir.getAbsolutePath());
+    newHiveConf.set(HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL.varname, "false");
```

Review comment: This is used to eliminate a warning in the unit test.
[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #374: Migrate spark table to iceberg table
chenjunjiedada commented on a change in pull request #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315935720

## File path: hive/src/test/java/org/apache/iceberg/hive/TestHiveMetastore.java

```diff
@@ -102,7 +102,7 @@ private TServer newThriftServer(TServerSocket socket, HiveConf conf) throws Exce
         .transportFactory(new TTransportFactory())
         .protocolFactory(new TBinaryProtocol.Factory())
         .minWorkerThreads(3)
-        .maxWorkerThreads(5);
+        .maxWorkerThreads(8);
```

Review comment: This is needed for the unit test. The Spark job needs at least three Thrift server threads since we have three partitions in the unit test. The maximum of 5 would normally be enough, but threads from the previous unit tests had not been released yet (the default release time is 1800s), which caused failures to connect to the Thrift server.
[GitHub] [incubator-iceberg] aokolnychyi opened a new issue #397: Push down StringStartsWith in Spark IcebergSource
aokolnychyi opened a new issue #397: Push down StringStartsWith in Spark IcebergSource URL: https://github.com/apache/incubator-iceberg/issues/397

Apart from changes in `SparkFilters`, we will need to extend `InclusiveMetricsEvaluator`, `ParquetDictionaryRowGroupFilter`, and `ParquetMetricsRowGroupFilter` with the logic that checks lower/upper bounds in `startsWith`. One way to implement this is to take the same number of bytes from the prefix and the bounds (e.g. slice) and then rely on `UnsignedByteBufComparator` for unsigned lexicographical comparison.
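The bound check sketched in the issue can be illustrated as follows. This is a standalone sketch using UTF-8 bytes and a hand-rolled unsigned comparison; the class and method names are hypothetical and this is not Iceberg's actual `InclusiveMetricsEvaluator` or `UnsignedByteBufComparator` code:

```java
import java.nio.charset.StandardCharsets;

public class StartsWithBounds {
  // Could some value in [lower, upper] start with the given prefix?
  // Compare only the first min(prefix.length, bound.length) bytes of each
  // pair, using unsigned lexicographical order on the raw bytes.
  public static boolean mightMatch(String prefix, String lower, String upper) {
    byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
    byte[] lo = lower.getBytes(StandardCharsets.UTF_8);
    byte[] hi = upper.getBytes(StandardCharsets.UTF_8);
    // The prefix must be >= the truncated lower bound and <= the truncated
    // upper bound; otherwise no value in the range can start with it.
    return unsignedCompareTruncated(p, lo) >= 0 && unsignedCompareTruncated(p, hi) <= 0;
  }

  private static int unsignedCompareTruncated(byte[] a, byte[] b) {
    int len = Math.min(a.length, b.length);
    for (int i = 0; i < len; i++) {
      int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
      if (cmp != 0) {
        return cmp;
      }
    }
    return 0; // equal over the compared slice
  }
}
```

For example, with column bounds lower="apple" and upper="cherry", the prefix "ban" might match, while "zoo" (above the truncated upper bound) and "aa" (below the truncated lower bound) cannot.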
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table
rdblue commented on a change in pull request #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315824567

## File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala

```diff
@@ -19,18 +19,22 @@ package org.apache.iceberg.spark

+import com.google.common.collect.ImmutableMap
 import com.google.common.collect.Maps
 import java.nio.ByteBuffer
 import java.util
+import java.util.UUID
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{Path, PathFilter}
-import org.apache.iceberg.{DataFile, DataFiles, Metrics, MetricsConfig, PartitionSpec}
-import org.apache.iceberg.hadoop.HadoopInputFile
+import org.apache.iceberg._
```

Review comment: Please don't use wildcard imports.
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table
rdblue commented on a change in pull request #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315824445

## File path: hive/src/test/java/org/apache/iceberg/hive/TestHiveMetastore.java

```diff
@@ -111,6 +111,7 @@ private HiveConf newHiveConf(int port) {
     HiveConf newHiveConf = new HiveConf(new Configuration(), TestHiveMetastore.class);
     newHiveConf.set(HiveConf.ConfVars.METASTOREURIS.varname, "thrift://localhost:" + port);
     newHiveConf.set(HiveConf.ConfVars.METASTOREWAREHOUSE.varname, "file:" + hiveLocalDir.getAbsolutePath());
+    newHiveConf.set(HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL.varname, "false");
```

Review comment: What does this do?
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table
rdblue commented on a change in pull request #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315824375

## File path: hive/src/test/java/org/apache/iceberg/hive/TestHiveMetastore.java

```diff
@@ -102,7 +102,7 @@ private TServer newThriftServer(TServerSocket socket, HiveConf conf) throws Exce
         .transportFactory(new TTransportFactory())
         .protocolFactory(new TBinaryProtocol.Factory())
         .minWorkerThreads(3)
-        .maxWorkerThreads(5);
+        .maxWorkerThreads(8);
```

Review comment: Why was this change required?
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #396: Support DS read/write without specifying namespace
rdblue commented on a change in pull request #396: Support DS read/write without specifying namespace URL: https://github.com/apache/incubator-iceberg/pull/396#discussion_r315798731

## File path: spark/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java

```diff
@@ -104,7 +104,9 @@ protected Table findTable(DataSourceOptions options, Configuration conf) {
       return tables.load(path.get());
     } else {
       HiveCatalog hiveCatalog = HiveCatalogs.loadCatalog(conf);
-      TableIdentifier tableIdentifier = TableIdentifier.parse(path.get());
+      TableIdentifier tableIdentifier = path.get().contains(".") ?
+          TableIdentifier.parse(path.get()) :
+          TableIdentifier.of(lazySparkSession().catalog().currentDatabase(), path.get());
```

Review comment: I think the problem with this approach is that there is no guarantee that Spark's session catalog uses the same namespace as the Iceberg catalog. Spark just solved the problem for multiple catalogs: https://github.com/apache/spark/pull/25368

Spark will now only keep a current database (namespace) for the current catalog and will fill in the current namespace before passing the identifier to a catalog. I think that we should go with that approach and let Spark fill in session configuration like the current database. Of course, that means we can't do much to fix this through this path, because Spark can't fill in a namespace when it doesn't know about the catalog, since we are accessing the Iceberg source directly by name.
[GitHub] [incubator-iceberg] rdblue commented on issue #395: Remove Hadoop Configuration dependency in BaseMetastoreTableOperations
rdblue commented on issue #395: Remove Hadoop Configuration dependency in BaseMetastoreTableOperations URL: https://github.com/apache/incubator-iceberg/pull/395#issuecomment-523096730

+1 Thanks for fixing this, @jerryshao!
[GitHub] [incubator-iceberg] rdblue merged pull request #395: Remove Hadoop Configuration dependency in BaseMetastoreTableOperations
rdblue merged pull request #395: Remove Hadoop Configuration dependency in BaseMetastoreTableOperations URL: https://github.com/apache/incubator-iceberg/pull/395
[GitHub] [incubator-iceberg] rdblue commented on issue #327: Introduce startsWith Predicate
rdblue commented on issue #327: Introduce startsWith Predicate URL: https://github.com/apache/incubator-iceberg/pull/327#issuecomment-523096095

That plan sounds good to me!
[GitHub] [incubator-iceberg] aokolnychyi commented on issue #327: Introduce startsWith Predicate
aokolnychyi commented on issue #327: Introduce startsWith Predicate URL: https://github.com/apache/incubator-iceberg/pull/327#issuecomment-523089442

I would like to follow up with a PR to enable pushdown of `startsWith` in Spark. Are there any limitations that will block me?

Apart from changes in `SparkFilters`, we will need to extend `InclusiveMetricsEvaluator`, `ParquetDictionaryRowGroupFilter`, `ParquetMetricsRowGroupFilter` with logic that checks lower/upper bounds in `startsWith`. I think one way to implement this is to take the same number of bytes from prefix and bounds (e.g. slice) and then rely on `UnsignedByteBufComparator` for unsigned lexicographical comparison. Alternatively, we can convert bounds to strings and operate on them.

@sujithjay @rdblue @moulimukherjee Any thoughts?
[GitHub] [incubator-iceberg] jerryshao opened a new pull request #396: Support DS read/write without specifying namespace
jerryshao opened a new pull request #396: Support DS read/write without specifying namespace URL: https://github.com/apache/incubator-iceberg/pull/396

Currently, Iceberg requires a namespace when reading/writing through the DS API, for example `spark.read.format("iceberg").load("default.logs")`; otherwise it throws an exception. We could leverage Spark's catalog information to pick the current namespace when it is not specified, so this change uses the default namespace when none is given.
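The fallback described in this PR can be sketched as a tiny helper. The class and method names here are hypothetical; the real change uses Iceberg's `TableIdentifier.parse`/`TableIdentifier.of` and Spark's `catalog().currentDatabase()`:

```java
public class TableIdentifierDefaults {
  // If the load() path has no namespace (no "."), prepend the session's
  // current database; a qualified "db.table" path is used as-is.
  public static String qualify(String path, String currentDatabase) {
    return path.contains(".") ? path : currentDatabase + "." + path;
  }
}
```

So `qualify("logs", "default")` yields `default.logs`, while `qualify("db.logs", "default")` is left unchanged.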
[GitHub] [incubator-iceberg] chenjunjiedada commented on issue #374: Migrate spark table to iceberg table
chenjunjiedada commented on issue #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#issuecomment-522978032

Figured out the root cause of the failed unit test. The Spark job needs at least three Thrift server threads since we have three partitions in the unit test. The maximum of 5 would normally be enough, but threads from the previous unit tests had not been released yet (the default release time is 1800s).
[GitHub] [incubator-iceberg] chenjunjiedada commented on issue #374: Migrate spark table to iceberg table
chenjunjiedada commented on issue #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#issuecomment-522920854

hmm, it's a warning.

```
WARN org.apache.hadoop.hive.metastore.ObjectStore - Direct SQL failed, falling back to ORM
java.lang.ClassCastException: org.apache.derby.impl.jdbc.EmbedClob cannot be cast to java.lang.String
```
[GitHub] [incubator-iceberg] chenjunjiedada edited a comment on issue #374: Migrate spark table to iceberg table
chenjunjiedada edited a comment on issue #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#issuecomment-522906620

Not sure why unit tests failed.

```
org.apache.iceberg.spark.source.TestIcebergSourceHiveTables > testHiveManifestsTable FAILED
    java.lang.RuntimeException at TestIcebergSourceHiveTables.java:438
        Caused by: org.apache.thrift.transport.TTransportException at TestIcebergSourceHiveTables.java:438
            Caused by: java.net.SocketException at TestIcebergSourceHiveTables.java:438

org.apache.iceberg.spark.source.TestIcebergSourceHiveTables > testHiveHistoryTable FAILED
    java.lang.RuntimeException at TestIcebergSourceHiveTables.java:285
        Caused by: org.apache.thrift.transport.TTransportException at TestIcebergSourceHiveTables.java:285
            Caused by: java.net.SocketException at TestIcebergSourceHiveTables.java:285

org.apache.iceberg.spark.source.TestIcebergSourceHiveTables > testHiveEntriesTable FAILED
    java.lang.RuntimeException at TestIcebergSourceHiveTables.java:144
        Caused by: org.apache.thrift.transport.TTransportException at TestIcebergSourceHiveTables.java:144
            Caused by: java.net.SocketException at TestIcebergSourceHiveTables.java:144

org.apache.iceberg.spark.source.TestSparkTableUtil > testMigrateSparkTable FAILED
    org.apache.hadoop.hive.ql.metadata.HiveException at TestSparkTableUtil.java:134
        Caused by: org.apache.thrift.transport.TTransportException at TestSparkTableUtil.java:134
            Caused by: java.net.SocketException at TestSparkTableUtil.java:134

135 tests completed, 4 failed, 4 skipped

> Task :iceberg-spark:test FAILED

FAILURE: Build failed with an exception.
```
[GitHub] [incubator-iceberg] chenjunjiedada commented on issue #374: Migrate spark table to iceberg table
chenjunjiedada commented on issue #374: Migrate spark table to iceberg table URL: https://github.com/apache/incubator-iceberg/pull/374#issuecomment-522904779

@rdblue @aokolnychyi @jerryshao PR updated. It uses `coalesce(1)` to append all files to one manifest file. For a large table with a huge number of files, the final manifest may be too big. Currently, I'm not sure what the best way is to transfer manifests between the executors and the driver. IIUC, there are two ways if we want to build manifests in parallel and commit once:

1. Store the temporary manifests in some HDFS location and return the location strings to the driver; the driver then reconstructs the manifests, appends them, and commits. However, I can't find an existing API to build manifests from a location (ManifestReader?).
2. Implement `Serializable` for the Manifest or Snapshot, so that we can collect them and append the manifests on the driver side. But this looks a bit complex.

What do you think?
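The first option could be shaped roughly like this sketch. Everything here is hypothetical: the directory layout, the manifest file names, and the plain loop that stands in for Spark executor tasks (in a real job this would run in `mapPartitions` and the strings would be `collect()`-ed back to the driver):

```java
import java.util.ArrayList;
import java.util.List;

public class ManifestCollect {
  // Option 1 sketched: each task writes a temporary manifest to a known
  // location and returns only the location string; the driver collects the
  // strings, rebuilds the manifests, and commits once. The open question in
  // the thread is which Iceberg API can rebuild a manifest from a location.
  public static List<String> collectManifestLocations(int numTasks, String tmpDir) {
    List<String> locations = new ArrayList<>();
    for (int task = 0; task < numTasks; task++) {
      // Stand-in for an executor task that wrote manifest-<task>.avro to tmpDir.
      locations.add(tmpDir + "/manifest-" + task + ".avro");
    }
    return locations;
  }
}
```

The appeal of passing location strings instead of manifest objects is that strings are trivially serializable, which sidesteps the `Serializable` question raised in option 2.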