[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #374: Migrate spark table to iceberg table

2019-08-20 Thread GitBox
chenjunjiedada commented on a change in pull request #374: Migrate spark table 
to iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315939963
 
 

 ##
 File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala
 ##
 @@ -19,18 +19,22 @@
 
 package org.apache.iceberg.spark
 
+import com.google.common.collect.ImmutableMap
 import com.google.common.collect.Maps
 import java.nio.ByteBuffer
 import java.util
+import java.util.UUID
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{Path, PathFilter}
-import org.apache.iceberg.{DataFile, DataFiles, Metrics, MetricsConfig, PartitionSpec}
-import org.apache.iceberg.hadoop.HadoopInputFile
+import org.apache.iceberg._
 
 Review comment:
  It looks like we need to import 9 entities, which exceeds the 100-character
line length limit. Spark's coding style prefers a wildcard when importing more
than 6 entities. Just want to confirm which we prefer: a line break, or
following Spark's coding style for Scala?
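   For illustration, a rough sketch of the two options (only the entities visible in this diff are listed; the remaining imports are omitted):
   ```scala
   // Option A: keep explicit imports and break the line to stay under the 100-character limit
   import org.apache.iceberg.{DataFile, DataFiles, Metrics, MetricsConfig,
     PartitionSpec}

   // Option B: Spark's Scala style, which switches to a wildcard once more than six entities are imported
   import org.apache.iceberg._
   ```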





[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #374: Migrate spark table to iceberg table

2019-08-20 Thread GitBox
chenjunjiedada commented on a change in pull request #374: Migrate spark table 
to iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315936341
 
 

 ##
 File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala
 ##
 @@ -19,18 +19,22 @@
 
 package org.apache.iceberg.spark
 
+import com.google.common.collect.ImmutableMap
 import com.google.common.collect.Maps
 import java.nio.ByteBuffer
 import java.util
+import java.util.UUID
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{Path, PathFilter}
-import org.apache.iceberg.{DataFile, DataFiles, Metrics, MetricsConfig, PartitionSpec}
-import org.apache.iceberg.hadoop.HadoopInputFile
+import org.apache.iceberg._
 
 Review comment:
   Hmm... IntelliJ IDEA always optimizes the imports when adding a new one via
Alt+Enter. Let me update it.





[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #374: Migrate spark table to iceberg table

2019-08-20 Thread GitBox
chenjunjiedada commented on a change in pull request #374: Migrate spark table 
to iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315935961
 
 

 ##
 File path: hive/src/test/java/org/apache/iceberg/hive/TestHiveMetastore.java
 ##
 @@ -111,6 +111,7 @@ private HiveConf newHiveConf(int port) {
 HiveConf newHiveConf = new HiveConf(new Configuration(), TestHiveMetastore.class);
 newHiveConf.set(HiveConf.ConfVars.METASTOREURIS.varname, "thrift://localhost:" + port);
 newHiveConf.set(HiveConf.ConfVars.METASTOREWAREHOUSE.varname, "file:" + hiveLocalDir.getAbsolutePath());
+newHiveConf.set(HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL.varname, "false");
 
 Review comment:
   This is used to eliminate a warning in the unit test.





[GitHub] [incubator-iceberg] chenjunjiedada commented on a change in pull request #374: Migrate spark table to iceberg table

2019-08-20 Thread GitBox
chenjunjiedada commented on a change in pull request #374: Migrate spark table 
to iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315935720
 
 

 ##
 File path: hive/src/test/java/org/apache/iceberg/hive/TestHiveMetastore.java
 ##
 @@ -102,7 +102,7 @@ private TServer newThriftServer(TServerSocket socket, HiveConf conf) throws Exce
 .transportFactory(new TTransportFactory())
 .protocolFactory(new TBinaryProtocol.Factory())
 .minWorkerThreads(3)
-.maxWorkerThreads(5);
+.maxWorkerThreads(8);
 
 Review comment:
   This is used for the unit test. The Spark job needs at least three Thrift
server threads since we have three partitions in the unit test. A maximum of 5
should normally be enough, but threads from the previous unit tests may not
have been released yet (the default release time is 1800s), which caused the
failure to connect to the Thrift server.





[GitHub] [incubator-iceberg] aokolnychyi opened a new issue #397: Push down StringStartsWith in Spark IcebergSource

2019-08-20 Thread GitBox
aokolnychyi opened a new issue #397: Push down StringStartsWith in Spark 
IcebergSource
URL: https://github.com/apache/incubator-iceberg/issues/397
 
 
   Apart from changes in `SparkFilters`, we will need to extend
`InclusiveMetricsEvaluator`, `ParquetDictionaryRowGroupFilter`, and
`ParquetMetricsRowGroupFilter` with logic that checks the lower/upper bounds
against the `startsWith` prefix.
   
   One way to implement this is to take the same number of bytes from the
prefix and the bounds (e.g. slice them) and then rely on
`UnsignedByteBufComparator` for an unsigned lexicographical comparison.
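   A minimal self-contained sketch of that idea (the object and method names below are hypothetical illustrations, and the plain unsigned loop only stands in for `UnsignedByteBufComparator`):
   ```scala
   import java.nio.ByteBuffer
   import java.nio.charset.StandardCharsets

   object StartsWithBoundsSketch {
     // Compare two buffers byte-by-byte as unsigned values; in Iceberg this comparison
     // would be delegated to UnsignedByteBufComparator, a plain loop keeps the sketch standalone.
     private def compareUnsigned(a: ByteBuffer, b: ByteBuffer): Int = {
       val len = math.min(a.remaining(), b.remaining())
       var i = 0
       while (i < len) {
         val cmp = Integer.compare(a.get(a.position() + i) & 0xFF, b.get(b.position() + i) & 0xFF)
         if (cmp != 0) return cmp
         i += 1
       }
       Integer.compare(a.remaining(), b.remaining())
     }

     // Slice the bound and the prefix to the same number of bytes before comparing.
     private def truncatedCompare(bound: ByteBuffer, prefix: ByteBuffer): Int = {
       val len = math.min(bound.remaining(), prefix.remaining())
       val boundSlice = bound.duplicate();  boundSlice.limit(boundSlice.position() + len)
       val prefixSlice = prefix.duplicate(); prefixSlice.limit(prefixSlice.position() + len)
       compareUnsigned(boundSlice, prefixSlice)
     }

     // Rows in a file or row group cannot start with `prefix` if the truncated lower bound
     // is already above the prefix, or the truncated upper bound is below it.
     def mayContainStartsWith(prefix: String, lower: String, upper: String): Boolean = {
       def utf8(s: String) = ByteBuffer.wrap(s.getBytes(StandardCharsets.UTF_8))
       truncatedCompare(utf8(lower), utf8(prefix)) <= 0 &&
         truncatedCompare(utf8(upper), utf8(prefix)) >= 0
     }
   }
   ```
   For example, with column bounds "ica".."ide" the prefix "ice" may still match, while a file whose bounds are "x".."z" can be skipped.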





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table

2019-08-20 Thread GitBox
rdblue commented on a change in pull request #374: Migrate spark table to 
iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315824567
 
 

 ##
 File path: spark/src/main/scala/org/apache/iceberg/spark/SparkTableUtil.scala
 ##
 @@ -19,18 +19,22 @@
 
 package org.apache.iceberg.spark
 
+import com.google.common.collect.ImmutableMap
 import com.google.common.collect.Maps
 import java.nio.ByteBuffer
 import java.util
+import java.util.UUID
 import org.apache.hadoop.conf.Configuration
 import org.apache.hadoop.fs.{Path, PathFilter}
-import org.apache.iceberg.{DataFile, DataFiles, Metrics, MetricsConfig, PartitionSpec}
-import org.apache.iceberg.hadoop.HadoopInputFile
+import org.apache.iceberg._
 
 Review comment:
   Please don't use wildcard imports.





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table

2019-08-20 Thread GitBox
rdblue commented on a change in pull request #374: Migrate spark table to 
iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315824445
 
 

 ##
 File path: hive/src/test/java/org/apache/iceberg/hive/TestHiveMetastore.java
 ##
 @@ -111,6 +111,7 @@ private HiveConf newHiveConf(int port) {
 HiveConf newHiveConf = new HiveConf(new Configuration(), TestHiveMetastore.class);
 newHiveConf.set(HiveConf.ConfVars.METASTOREURIS.varname, "thrift://localhost:" + port);
 newHiveConf.set(HiveConf.ConfVars.METASTOREWAREHOUSE.varname, "file:" + hiveLocalDir.getAbsolutePath());
+newHiveConf.set(HiveConf.ConfVars.METASTORE_TRY_DIRECT_SQL.varname, "false");
 
 Review comment:
   What does this do?





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #374: Migrate spark table to iceberg table

2019-08-20 Thread GitBox
rdblue commented on a change in pull request #374: Migrate spark table to 
iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#discussion_r315824375
 
 

 ##
 File path: hive/src/test/java/org/apache/iceberg/hive/TestHiveMetastore.java
 ##
 @@ -102,7 +102,7 @@ private TServer newThriftServer(TServerSocket socket, HiveConf conf) throws Exce
 .transportFactory(new TTransportFactory())
 .protocolFactory(new TBinaryProtocol.Factory())
 .minWorkerThreads(3)
-.maxWorkerThreads(5);
+.maxWorkerThreads(8);
 
 Review comment:
   Why was this change required?





[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #396: Support DS read/write without specifying namespace

2019-08-20 Thread GitBox
rdblue commented on a change in pull request #396: Support DS read/write 
without specifying namespace
URL: https://github.com/apache/incubator-iceberg/pull/396#discussion_r315798731
 
 

 ##
 File path: 
spark/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java
 ##
 @@ -104,7 +104,9 @@ protected Table findTable(DataSourceOptions options, Configuration conf) {
   return tables.load(path.get());
 } else {
   HiveCatalog hiveCatalog = HiveCatalogs.loadCatalog(conf);
-  TableIdentifier tableIdentifier = TableIdentifier.parse(path.get());
+  TableIdentifier tableIdentifier = path.get().contains(".") ?
+  TableIdentifier.parse(path.get()) :
  TableIdentifier.of(lazySparkSession().catalog().currentDatabase(), path.get());
 
 Review comment:
   I think the problem with this approach is that there is no guarantee that
Spark's session catalog uses the same namespace as the Iceberg catalog.
   
   Spark just solved this problem for multiple catalogs:
https://github.com/apache/spark/pull/25368
   
   Spark will now keep a current database (namespace) only for the current
catalog and will fill in the current namespace before passing the identifier
to a catalog. I think we should go with that approach and let Spark fill in
session configuration like the current database. Of course, that means we
can't do much to fix this through this code path: Spark can't fill in a
namespace when it doesn't know about the catalog, and here we are accessing
the Iceberg source directly by name.





[GitHub] [incubator-iceberg] rdblue commented on issue #395: Remove Hadoop Configuration dependency in BaseMetastoreTableOperations

2019-08-20 Thread GitBox
rdblue commented on issue #395: Remove Hadoop Configuration dependency in 
BaseMetastoreTableOperations
URL: https://github.com/apache/incubator-iceberg/pull/395#issuecomment-523096730
 
 
   +1
   
   Thanks for fixing this, @jerryshao!





[GitHub] [incubator-iceberg] rdblue merged pull request #395: Remove Hadoop Configuration dependency in BaseMetastoreTableOperations

2019-08-20 Thread GitBox
rdblue merged pull request #395: Remove Hadoop Configuration dependency in 
BaseMetastoreTableOperations
URL: https://github.com/apache/incubator-iceberg/pull/395
 
 
   





[GitHub] [incubator-iceberg] rdblue commented on issue #327: Introduce startsWith Predicate

2019-08-20 Thread GitBox
rdblue commented on issue #327: Introduce startsWith Predicate
URL: https://github.com/apache/incubator-iceberg/pull/327#issuecomment-523096095
 
 
   That plan sounds good to me!





[GitHub] [incubator-iceberg] aokolnychyi commented on issue #327: Introduce startsWith Predicate

2019-08-20 Thread GitBox
aokolnychyi commented on issue #327: Introduce startsWith Predicate
URL: https://github.com/apache/incubator-iceberg/pull/327#issuecomment-523089442
 
 
   I would like to follow up with a PR to enable pushdown of `startsWith` in
Spark. Are there any limitations that would block me?
   
   Apart from changes in `SparkFilters`, we will need to extend
`InclusiveMetricsEvaluator`, `ParquetDictionaryRowGroupFilter`, and
`ParquetMetricsRowGroupFilter` with logic that checks the lower/upper bounds
against the `startsWith` prefix.
   
   I think one way to implement this is to take the same number of bytes from
the prefix and the bounds (e.g. slice them) and then rely on
`UnsignedByteBufComparator` for an unsigned lexicographical comparison.
Alternatively, we can convert the bounds to strings and operate on them.
   
   @sujithjay @rdblue @moulimukherjee Any thoughts?





[GitHub] [incubator-iceberg] jerryshao opened a new pull request #396: Support DS read/write without specifying namespace

2019-08-20 Thread GitBox
jerryshao opened a new pull request #396: Support DS read/write without 
specifying namespace
URL: https://github.com/apache/incubator-iceberg/pull/396
 
 
   Currently, Iceberg requires the namespace to be specified when reading or
writing through the DataSource API, for example
`spark.read.format("iceberg").load("default.logs")`; otherwise it throws an
exception.
   
   We could leverage Spark's catalog information to pick the current namespace
when it is not specified, so this change falls back to the session's current
(default) namespace in that case.
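   For example (a usage sketch assuming an active `SparkSession` named `spark`; the second call is what this change is meant to enable):
   ```scala
   // namespace given explicitly, as currently required:
   val logs = spark.read.format("iceberg").load("default.logs")

   // with this change, the namespace may be omitted and the session's
   // current (default) database is used instead:
   val logsFromCurrentDb = spark.read.format("iceberg").load("logs")
   ```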





[GitHub] [incubator-iceberg] chenjunjiedada commented on issue #374: Migrate spark table to iceberg table

2019-08-20 Thread GitBox
chenjunjiedada commented on issue #374: Migrate spark table to iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#issuecomment-522978032
 
 
   Figured out the root cause of the failed unit test. The Spark job needs at
least three Thrift server threads since we have three partitions in the unit
test. A maximum of 5 should normally be enough, but threads from the previous
unit tests may not have been released yet (the default release time is 1800s).





[GitHub] [incubator-iceberg] chenjunjiedada commented on issue #374: Migrate spark table to iceberg table

2019-08-20 Thread GitBox
chenjunjiedada commented on issue #374: Migrate spark table to iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#issuecomment-522920854
 
 
   hmm, it's a warning.
   ```
   WARN org.apache.hadoop.hive.metastore.ObjectStore - Direct SQL failed, falling back to ORM
   java.lang.ClassCastException: org.apache.derby.impl.jdbc.EmbedClob cannot be cast to java.lang.String
   ```





[GitHub] [incubator-iceberg] chenjunjiedada commented on issue #374: Migrate spark table to iceberg table

2019-08-20 Thread GitBox
chenjunjiedada commented on issue #374: Migrate spark table to iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#issuecomment-522906620
 
 
   Not sure why unit tests failed. 
   ```
   org.apache.iceberg.spark.source.TestIcebergSourceHiveTables > testHiveManifestsTable FAILED
   java.lang.RuntimeException at TestIcebergSourceHiveTables.java:438
   Caused by: org.apache.thrift.transport.TTransportException at TestIcebergSourceHiveTables.java:438
   Caused by: java.net.SocketException at TestIcebergSourceHiveTables.java:438
   org.apache.iceberg.spark.source.TestIcebergSourceHiveTables > testHiveHistoryTable FAILED
   java.lang.RuntimeException at TestIcebergSourceHiveTables.java:285
   Caused by: org.apache.thrift.transport.TTransportException at TestIcebergSourceHiveTables.java:285
   Caused by: java.net.SocketException at TestIcebergSourceHiveTables.java:285
   org.apache.iceberg.spark.source.TestIcebergSourceHiveTables > testHiveEntriesTable FAILED
   java.lang.RuntimeException at TestIcebergSourceHiveTables.java:144
   Caused by: org.apache.thrift.transport.TTransportException at TestIcebergSourceHiveTables.java:144
   Caused by: java.net.SocketException at TestIcebergSourceHiveTables.java:144
   org.apache.iceberg.spark.source.TestSparkTableUtil > testMigrateSparkTable FAILED
   org.apache.hadoop.hive.ql.metadata.HiveException at TestSparkTableUtil.java:134
   Caused by: org.apache.thrift.transport.TTransportException at TestSparkTableUtil.java:134
   Caused by: java.net.SocketException at TestSparkTableUtil.java:134
   135 tests completed, 4 failed, 4 skipped
   > Task :iceberg-spark:test FAILED
   FAILURE: Build failed with an exception.
   ```





[GitHub] [incubator-iceberg] chenjunjiedada edited a comment on issue #374: Migrate spark table to iceberg table

2019-08-20 Thread GitBox
chenjunjiedada edited a comment on issue #374: Migrate spark table to iceberg 
table
URL: https://github.com/apache/incubator-iceberg/pull/374#issuecomment-522906620
 
 
   Not sure why unit tests failed. 
   ```
   org.apache.iceberg.spark.source.TestIcebergSourceHiveTables > testHiveManifestsTable FAILED
   java.lang.RuntimeException at TestIcebergSourceHiveTables.java:438
   Caused by: org.apache.thrift.transport.TTransportException at TestIcebergSourceHiveTables.java:438
   Caused by: java.net.SocketException at TestIcebergSourceHiveTables.java:438
   org.apache.iceberg.spark.source.TestIcebergSourceHiveTables > testHiveHistoryTable FAILED
   java.lang.RuntimeException at TestIcebergSourceHiveTables.java:285
   Caused by: org.apache.thrift.transport.TTransportException at TestIcebergSourceHiveTables.java:285
   Caused by: java.net.SocketException at TestIcebergSourceHiveTables.java:285
   org.apache.iceberg.spark.source.TestIcebergSourceHiveTables > testHiveEntriesTable FAILED
   java.lang.RuntimeException at TestIcebergSourceHiveTables.java:144
   Caused by: org.apache.thrift.transport.TTransportException at TestIcebergSourceHiveTables.java:144
   Caused by: java.net.SocketException at TestIcebergSourceHiveTables.java:144
   org.apache.iceberg.spark.source.TestSparkTableUtil > testMigrateSparkTable FAILED
   org.apache.hadoop.hive.ql.metadata.HiveException at TestSparkTableUtil.java:134
   Caused by: org.apache.thrift.transport.TTransportException at TestSparkTableUtil.java:134
   Caused by: java.net.SocketException at TestSparkTableUtil.java:134
   135 tests completed, 4 failed, 4 skipped
   > Task :iceberg-spark:test FAILED
   FAILURE: Build failed with an exception.
   ```





[GitHub] [incubator-iceberg] chenjunjiedada commented on issue #374: Migrate spark table to iceberg table

2019-08-20 Thread GitBox
chenjunjiedada commented on issue #374: Migrate spark table to iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/374#issuecomment-522904779
 
 
   @rdblue, @aokolnychyi @jerryshao PR updated.
   
   It uses `coalesce(1)` to append all files to one manifest file. For a large
table with a huge number of files, the final manifest may be too big.
Currently, I'm not sure what the best way is to transfer manifests between the
executors and the driver. IIUC, there are two ways if we want to build
manifests in parallel and commit once:
   
   1. Store the temporary manifests at some HDFS location and return the
location strings to the driver; the driver then constructs the manifests,
appends them, and commits. However, I don't find an existing API to build
manifests from a location; maybe ManifestReader?
   
   2. Implement `Serializable` for the Manifest or Snapshot, so that we can
collect them and append the manifests on the driver side. This looks a bit
more complex.
   
   What do you think? (A structural sketch of option 1 is below.)
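   To make option 1 concrete, a structural sketch; the manifest-writing and appending helpers and the temporary path below are hypothetical placeholders rather than existing Iceberg APIs, and only the executor/driver split is the point:
   ```scala
   import java.util.UUID
   import org.apache.spark.sql.{Dataset, SparkSession}

   object ParallelManifestSketch {
     // Hypothetical stand-ins so the sketch is self-contained; these are not Iceberg APIs.
     case class DataFileInfo(path: String, partition: String)
     def writeTempManifest(location: String, files: Iterator[DataFileInfo]): String = location
     def appendManifestFrom(location: String): Unit = ()  // e.g. read the manifest back via a reader
     def commitOnce(): Unit = ()                           // single table commit at the end

     // Option 1: each executor writes a temporary manifest to HDFS and returns only its
     // location; the driver appends the manifests and commits exactly once.
     def buildManifestsInParallel(spark: SparkSession, files: Dataset[DataFileInfo]): Unit = {
       import spark.implicits._
       val locations: Array[String] = files
         .mapPartitions { partFiles =>
           val tmp = s"hdfs:///tmp/manifests/${UUID.randomUUID()}.avro"  // hypothetical temp path
           Iterator(writeTempManifest(tmp, partFiles))
         }
         .collect()

       locations.foreach(appendManifestFrom)  // driver-side: append each temporary manifest
       commitOnce()                           // one snapshot for the whole migration
     }
   }
   ```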

