[GitHub] [incubator-iceberg] jerryshao commented on issue #743: [spark-3] Bump Apache spark to 3.0.0-preview2

2020-02-02 Thread GitBox
jerryshao commented on issue #743: [spark-3] Bump Apache spark to 3.0.0-preview2
URL: https://github.com/apache/incubator-iceberg/pull/743#issuecomment-581282605
 
 
   LGTM.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] edgarRd commented on issue #768: Metrics javadoc fixes #767

2020-02-02 Thread GitBox
edgarRd commented on issue #768: Metrics javadoc fixes #767
URL: https://github.com/apache/incubator-iceberg/pull/768#issuecomment-581219056
 
 
   PTAL @rdsr @rdblue  - Thanks!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] edgarRd commented on a change in pull request #768: Metrics javadoc fixes #767

2020-02-02 Thread GitBox
edgarRd commented on a change in pull request #768: Metrics javadoc fixes #767
URL: https://github.com/apache/incubator-iceberg/pull/768#discussion_r373907688
 
 

 ##
 File path: core/src/test/java/org/apache/iceberg/TestMetrics.java
 ##
 @@ -114,6 +114,56 @@ public abstract File writeRecords(Schema schema, 
Map properties,
 
   public abstract int splitCount(File parquetFile) throws IOException;
 
+  @Test
+  public void testMetricsForRepeatedValues() throws IOException {
+Record firstRecord = new 
Record(AvroSchemaUtil.convert(SIMPLE_SCHEMA.asStruct()));
+firstRecord.put("booleanCol", true);
+firstRecord.put("intCol", 3);
+firstRecord.put("longCol", null);
+firstRecord.put("floatCol", 2.0F);
+firstRecord.put("doubleCol", 2.0D);
+firstRecord.put("decimalCol", new BigDecimal("3.50"));
+firstRecord.put("stringCol", "AAA");
+firstRecord.put("dateCol", 1500);
+firstRecord.put("timeCol", 2000L);
+firstRecord.put("timestampCol", 0L);
+firstRecord.put("uuidCol", uuid);
+firstRecord.put("fixedCol", fixed);
+firstRecord.put("binaryCol", "S".getBytes());
+Record secondRecord = new 
Record(AvroSchemaUtil.convert(SIMPLE_SCHEMA.asStruct()));
+secondRecord.put("booleanCol", true);
+secondRecord.put("intCol", 3);
+secondRecord.put("longCol", null);
+secondRecord.put("floatCol", 2.0F);
+secondRecord.put("doubleCol", 2.0D);
+secondRecord.put("decimalCol", new BigDecimal("3.50"));
+secondRecord.put("stringCol", "AAA");
+secondRecord.put("dateCol", 1500);
+secondRecord.put("timeCol", 2000L);
+secondRecord.put("timestampCol", 0L);
+secondRecord.put("uuidCol", uuid);
+secondRecord.put("fixedCol", fixed);
+secondRecord.put("binaryCol", "S".getBytes());
+
+File recordsFile = writeRecords(SIMPLE_SCHEMA, firstRecord, secondRecord);
+
+Metrics metrics = getMetrics(Files.localInput(recordsFile));
+Assert.assertEquals(2L, (long) metrics.recordCount());
+assertCounts(1, 2L, 0L, metrics);
+assertCounts(2, 2L, 0L, metrics);
+assertCounts(3, 2L, 2L, metrics);
 
 Review comment:
   Note that in this case `valueCounts == nullValueCounts` and the count is the 
same as `recordCount`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] edgarRd opened a new pull request #768: Metrics javadoc fixes #767

2020-02-02 Thread GitBox
edgarRd opened a new pull request #768: Metrics javadoc fixes #767
URL: https://github.com/apache/incubator-iceberg/pull/768
 
 
   Adding javadoc to `Metrics` following `Parquet` metrics implementation as 
proposal.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] edgarRd opened a new issue #767: Clarify / Document metrics contract

2020-02-02 Thread GitBox
edgarRd opened a new issue #767: Clarify / Document metrics contract 
URL: https://github.com/apache/incubator-iceberg/issues/767
 
 
   The metrics contract is a bit unclear, from the implementation. Since it's 
not defined in the spec, having the only fully implemented metrics for Parquet, 
and while I'm working on ORC metrics it's not very clear what is the contract 
expected since file formats seem to implement this differently, for instance:
   
   * `Map valueCounts()` - it's not clear whether this method 
includes non-null or repeated values. As per the `TestMetrics` it looks like 
value counts *includes null and repeated values* which would be pretty much the 
same as row count, except for nested structures (e.g. lists, maps) - however 
this is not defined.
   
   This issue is to track the discussion about the expected metrics contract 
and get a clear definition.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] edgarRd commented on issue #213: Incomplete Iceberg to ORC column mapping

2020-02-02 Thread GitBox
edgarRd commented on issue #213: Incomplete Iceberg to ORC column mapping
URL: 
https://github.com/apache/incubator-iceberg/issues/213#issuecomment-581215477
 
 
   Closing this issue since it was fixed in #227 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] edgarRd closed issue #213: Incomplete Iceberg to ORC column mapping

2020-02-02 Thread GitBox
edgarRd closed issue #213: Incomplete Iceberg to ORC column mapping
URL: https://github.com/apache/incubator-iceberg/issues/213
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #749: Convert Spark In filter to iceberg IN Expression

2020-02-02 Thread GitBox
jun-he commented on a change in pull request #749: Convert Spark In filter to 
iceberg IN Expression
URL: https://github.com/apache/incubator-iceberg/pull/749#discussion_r373900954
 
 

 ##
 File path: 
spark/src/test/java/org/apache/iceberg/spark/source/TestFilteredScan.java
 ##
 @@ -543,11 +579,11 @@ private File buildPartitionedTable(String desc, 
PartitionSpec spec, String udf,
 
   private List testRecords(org.apache.avro.Schema avroSchema) {
 return Lists.newArrayList(
-record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294658+00:00"), 
"junction"),
+record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294+00:00"), 
"junction"),
 
 Review comment:
   The test fails because those partitions picked in the tests have only one 
value (equals the lower and higher bound) so the Timestamp must exactly match.
   
   To avoid change those values, I will update the test to use the partition of 
`2017-12-21T15`, which contains two records. So any Timestamp between them will 
match.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #749: Convert Spark In filter to iceberg IN Expression

2020-02-02 Thread GitBox
jun-he commented on a change in pull request #749: Convert Spark In filter to 
iceberg IN Expression
URL: https://github.com/apache/incubator-iceberg/pull/749#discussion_r373900954
 
 

 ##
 File path: 
spark/src/test/java/org/apache/iceberg/spark/source/TestFilteredScan.java
 ##
 @@ -543,11 +579,11 @@ private File buildPartitionedTable(String desc, 
PartitionSpec spec, String udf,
 
   private List testRecords(org.apache.avro.Schema avroSchema) {
 return Lists.newArrayList(
-record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294658+00:00"), 
"junction"),
+record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294+00:00"), 
"junction"),
 
 Review comment:
   The test fails because those partitions picked in the tests have only one 
value (equals the lower and higher bound) so the Timestamp must exactly match.
   
   I will update the test to use the partition of `2017-12-21T15`, which 
contains two records. So any Timestamp between them will match.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #749: Convert Spark In filter to iceberg IN Expression

2020-02-02 Thread GitBox
jun-he commented on a change in pull request #749: Convert Spark In filter to 
iceberg IN Expression
URL: https://github.com/apache/incubator-iceberg/pull/749#discussion_r373900954
 
 

 ##
 File path: 
spark/src/test/java/org/apache/iceberg/spark/source/TestFilteredScan.java
 ##
 @@ -543,11 +579,11 @@ private File buildPartitionedTable(String desc, 
PartitionSpec spec, String udf,
 
   private List testRecords(org.apache.avro.Schema avroSchema) {
 return Lists.newArrayList(
-record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294658+00:00"), 
"junction"),
+record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294+00:00"), 
"junction"),
 
 Review comment:
   The test fails because those partitions picked in the tests have only one 
value (equals the lower and higher bound) so the Timestamp must exactly match.
   
   To avoid changing those values, I will update the test to use the partition 
of `2017-12-21T15`, which contains two records. So any Timestamp between them 
will match.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] yujiantao commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2

2020-02-02 Thread GitBox
yujiantao commented on a change in pull request #743: [spark-3] Bump Apache 
spark to 3.0.0-preview2
URL: https://github.com/apache/incubator-iceberg/pull/743#discussion_r373897701
 
 

 ##
 File path: 
spark/src/test/java/org/apache/iceberg/spark/source/TestForwardCompatibility.java
 ##
 @@ -125,7 +126,7 @@ public void testSparkWriteFailsUnknownTransform() throws 
IOException {
   }
 
   @Test
-  public void testSparkStreamingWriteFailsUnknownTransform() throws 
IOException {
+  public void testSparkStreamingWriteFailsUnknownTransform() throws 
IOException, TimeoutException {
 
 Review comment:
   Ok.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] yujiantao commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2

2020-02-02 Thread GitBox
yujiantao commented on a change in pull request #743: [spark-3] Bump Apache 
spark to 3.0.0-preview2
URL: https://github.com/apache/incubator-iceberg/pull/743#discussion_r373897760
 
 

 ##
 File path: spark/src/test/java/org/apache/iceberg/spark/source/TestTables.java
 ##
 @@ -37,7 +37,7 @@
 import org.apache.iceberg.io.InputFile;
 import org.apache.iceberg.io.LocationProvider;
 import org.apache.iceberg.io.OutputFile;
-import parquet.Preconditions;
+import org.apache.parquet.Preconditions;
 
 Review comment:
   Yes.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] yujiantao commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2

2020-02-02 Thread GitBox
yujiantao commented on a change in pull request #743: [spark-3] Bump Apache 
spark to 3.0.0-preview2
URL: https://github.com/apache/incubator-iceberg/pull/743#discussion_r373897701
 
 

 ##
 File path: 
spark/src/test/java/org/apache/iceberg/spark/source/TestForwardCompatibility.java
 ##
 @@ -125,7 +126,7 @@ public void testSparkWriteFailsUnknownTransform() throws 
IOException {
   }
 
   @Test
-  public void testSparkStreamingWriteFailsUnknownTransform() throws 
IOException {
+  public void testSparkStreamingWriteFailsUnknownTransform() throws 
IOException, TimeoutException {
 
 Review comment:
   Ok


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue edited a comment on issue #764: make org.apache.iceberg. PartitionData public

2020-02-02 Thread GitBox
rdblue edited a comment on issue #764: make org.apache.iceberg. PartitionData 
public
URL: 
https://github.com/apache/incubator-iceberg/issues/764#issuecomment-581183953
 
 
   I don't think that this necessarily needs to be public. It's an internal 
implementation that is not part of the public API. What are you trying to do? 
Can you pass a `StructLike` instead? The Iceberg generic records in 
org.apache.iceberg.data can be passed, so you can build a tuple of data using 
the generic record.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #315: Incremental processing implementation

2020-02-02 Thread GitBox
rdsr commented on a change in pull request #315: Incremental processing 
implementation
URL: https://github.com/apache/incubator-iceberg/pull/315#discussion_r373885093
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
 ##
 @@ -42,6 +42,16 @@ private SnapshotUtil() {
 return ancestorIds(table.currentSnapshot(), table::snapshot);
   }
 
+  /**
+   * @return List of snapshot ids in the range - (fromSnapshotId, toSnapshotId]
+   * This method assumes that fromSnapshotId is an ancestor of toSnapshotId
 
 Review comment:
   Yes. It will be in the follow-up


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #749: Convert Spark In filter to iceberg IN Expression

2020-02-02 Thread GitBox
rdblue commented on a change in pull request #749: Convert Spark In filter to 
iceberg IN Expression
URL: https://github.com/apache/incubator-iceberg/pull/749#discussion_r373880794
 
 

 ##
 File path: 
spark/src/test/java/org/apache/iceberg/spark/source/TestFilteredScan.java
 ##
 @@ -543,11 +579,11 @@ private File buildPartitionedTable(String desc, 
PartitionSpec spec, String udf,
 
   private List testRecords(org.apache.avro.Schema avroSchema) {
 return Lists.newArrayList(
-record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294658+00:00"), 
"junction"),
+record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294+00:00"), 
"junction"),
 
 Review comment:
   These don't use the constructor that only supports milliseconds, so these 
should be precise to microseconds. But the test only uses Timestamp to create a 
filter that gets converted to a partition filter, so the timestamps used to 
create the Spark filter and these timestamps shouldn't need to match. Doesn't 
the test pass if these are unchanged?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on issue #751: Add an option to decide whether to delete data files in Catalog.dropTable()

2020-02-02 Thread GitBox
rdblue commented on issue #751: Add an option to decide whether to delete data 
files in Catalog.dropTable()
URL: 
https://github.com/apache/incubator-iceberg/issues/751#issuecomment-581184344
 
 
   I think you're saying you want a purge option that deletes metadata, but not 
data?
   
   I'm very reluctant to add cases like this to the API. We want to keep the 
API small, and I think this may be specific to your use case. What about doing 
this in the application that creates the staging tables instead? You'd set 
purge=false when dropping the table and then clean up the metadata using your 
own code. Would that work?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #749: Convert Spark In filter to iceberg IN Expression

2020-02-02 Thread GitBox
jun-he commented on a change in pull request #749: Convert Spark In filter to 
iceberg IN Expression
URL: https://github.com/apache/incubator-iceberg/pull/749#discussion_r373880462
 
 

 ##
 File path: 
spark/src/test/java/org/apache/iceberg/spark/source/TestFilteredScan.java
 ##
 @@ -543,11 +579,11 @@ private File buildPartitionedTable(String desc, 
PartitionSpec spec, String udf,
 
   private List testRecords(org.apache.avro.Schema avroSchema) {
 return Lists.newArrayList(
-record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294658+00:00"), 
"junction"),
+record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294+00:00"), 
"junction"),
 
 Review comment:
   @rdblue It is because `java.sql.Timestamp` constructor uses a milliseconds 
time value. 
   There is a deprecated `java.sql.Timestamp` constructor to use year, month, 
date, hour, minute, second, and nano. But we also need take care of timezone 
issue (java timestamp is always UTC). 
   
   So to avoid using deprecated method and make the test straightforward, I 
just update two records to be millisecond scale.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on issue #761: Suggestion for newbie getting started guide

2020-02-02 Thread GitBox
rdblue commented on issue #761: Suggestion for newbie getting started guide
URL: 
https://github.com/apache/incubator-iceberg/issues/761#issuecomment-581184156
 
 
   Hadoop tables cannot be used with a file system that doesn't support atomic 
rename. They should only be used with HDFS or a local FS, not with S3. For S3, 
you should be using a metastore, like Hive.
   
   You're right that IcebergGenerics is the recommended way to read a table 
directly, without using an engine like Spark or Presto.
   
   If you want to write up some documentation, that would be great! Please open 
a PR.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on issue #764: make org.apache.iceberg. PartitionData public

2020-02-02 Thread GitBox
rdblue commented on issue #764: make org.apache.iceberg. PartitionData public
URL: 
https://github.com/apache/incubator-iceberg/issues/764#issuecomment-581183953
 
 
   I don't think that this necessarily needs to be public. It's an internal 
implementation that is not part of the public API. What are you trying to do? 
Can you pass `StructLike` instead?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2

2020-02-02 Thread GitBox
rdblue commented on a change in pull request #743: [spark-3] Bump Apache spark 
to 3.0.0-preview2
URL: https://github.com/apache/incubator-iceberg/pull/743#discussion_r373880002
 
 

 ##
 File path: 
spark/src/test/java/org/apache/iceberg/spark/source/TestForwardCompatibility.java
 ##
 @@ -125,7 +126,7 @@ public void testSparkWriteFailsUnknownTransform() throws 
IOException {
   }
 
   @Test
-  public void testSparkStreamingWriteFailsUnknownTransform() throws 
IOException {
+  public void testSparkStreamingWriteFailsUnknownTransform() throws 
IOException, TimeoutException {
 
 Review comment:
   This should also be `Exception`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on issue #743: [spark-3] Bump Apache spark to 3.0.0-preview2

2020-02-02 Thread GitBox
rdblue commented on issue #743: [spark-3] Bump Apache spark to 3.0.0-preview2
URL: https://github.com/apache/incubator-iceberg/pull/743#issuecomment-581183783
 
 
   Thanks @yujiantao! I caught a couple more minor things. Once those are done, 
we should be good to merge.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2

2020-02-02 Thread GitBox
rdblue commented on a change in pull request #743: [spark-3] Bump Apache spark 
to 3.0.0-preview2
URL: https://github.com/apache/incubator-iceberg/pull/743#discussion_r373879959
 
 

 ##
 File path: spark/src/test/java/org/apache/iceberg/spark/source/TestTables.java
 ##
 @@ -37,7 +37,7 @@
 import org.apache.iceberg.io.InputFile;
 import org.apache.iceberg.io.LocationProvider;
 import org.apache.iceberg.io.OutputFile;
-import parquet.Preconditions;
+import org.apache.parquet.Preconditions;
 
 Review comment:
   This should not be using Preconditions from Parquet. It should be the guava 
one. Can you update it to that one instead?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #766: Support in and notIn in ResidualVisitor

2020-02-02 Thread GitBox
rdblue commented on a change in pull request #766: Support in and notIn in 
ResidualVisitor
URL: https://github.com/apache/incubator-iceberg/pull/766#discussion_r373879733
 
 

 ##
 File path: api/src/test/java/org/apache/iceberg/transforms/TestResiduals.java
 ##
 @@ -156,4 +158,46 @@ public void testUnpartitionedResiduals() {
   expr, residualEvaluator.residualFor(Row.of()));
 }
   }
+
+  @Test
+  public void testIn() {
+Schema schema = new Schema(
+Types.NestedField.optional(50, "dateint", Types.IntegerType.get()),
+Types.NestedField.optional(51, "hour", Types.IntegerType.get())
+);
+
+PartitionSpec spec = PartitionSpec.builderFor(schema)
+.identity("dateint")
+.build();
+
+ResidualEvaluator resEval = ResidualEvaluator.of(spec,
+in("dateint", 20170815, 20170816, 20170817), true);
+
+Expression residual = resEval.residualFor(Row.of(20170815));
+Assert.assertEquals("Residual should be alwaysTrue", alwaysTrue(), 
residual);
+
+residual = resEval.residualFor(Row.of(20180815));
+Assert.assertEquals("Residual should be alwaysFalse", alwaysFalse(), 
residual);
+  }
+
+  @Test
+  public void testNotIn() {
 
 Review comment:
   Can you also add tests for `in` and `notIn` with a timestamp (`ts` column) 
to date transform (`day`)?
   
   For the `in` case, the residual for a date that contains one of the 
timestamps should be the original `in` predicate. If the date doesn't match one 
of the `in` predicate timestamp values, then it should be `alwaysFalse`. For 
`notIn`, if the date matches it should be the original `notIn` predicate and if 
it does not match, it should be `alwaysTrue`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #766: Support in and notIn in ResidualVisitor

2020-02-02 Thread GitBox
rdblue commented on a change in pull request #766: Support in and notIn in 
ResidualVisitor
URL: https://github.com/apache/incubator-iceberg/pull/766#discussion_r373879282
 
 

 ##
 File path: api/src/test/java/org/apache/iceberg/transforms/TestResiduals.java
 ##
 @@ -156,4 +158,46 @@ public void testUnpartitionedResiduals() {
   expr, residualEvaluator.residualFor(Row.of()));
 }
   }
+
+  @Test
 
 Review comment:
   Can you also add cases to `testUnpartitionedResiduals`?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #766: Support in and notIn in ResidualVisitor

2020-02-02 Thread GitBox
rdblue commented on a change in pull request #766: Support in and notIn in 
ResidualVisitor
URL: https://github.com/apache/incubator-iceberg/pull/766#discussion_r373879036
 
 

 ##
 File path: 
api/src/main/java/org/apache/iceberg/expressions/ResidualEvaluator.java
 ##
 @@ -195,6 +196,26 @@ public Expression or(Expression leftResult, Expression 
rightResult) {
   return (cmp.compare(ref.eval(struct), lit.value()) != 0) ? alwaysTrue() 
: alwaysFalse();
 }
 
+@Override
+public  Expression in(BoundReference ref, Set literalSet) {
+  T val = ref.eval(struct);
+  if (literalSet.stream().anyMatch(lit -> ref.comparator().compare(val, 
lit) == 0)) {
 
 Review comment:
   The literal set can handle all types, so this can use `contains` instead of 
checking each value individually.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on issue #315: Incremental processing implementation

2020-02-02 Thread GitBox
rdblue commented on issue #315: Incremental processing implementation
URL: https://github.com/apache/incubator-iceberg/pull/315#issuecomment-581182311
 
 
   @rdsr, I'm merging this. Thanks for all your work on it!
   
   I think we do need to follow up pretty quickly with better validations for 
snapshot IDs passed to appendsBetween and appendsAfter. Right now, I don't 
think there is anything that validates that there is a range of snapshots from 
the start to the end -- it looks like this would use the whole ancestry of the 
"to" snapshot if "from" isn't an ancestor. Let's clean that up with 
validations, but for now I've committed this since it quite a large patch.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue merged pull request #315: Incremental processing implementation

2020-02-02 Thread GitBox
rdblue merged pull request #315: Incremental processing implementation
URL: https://github.com/apache/incubator-iceberg/pull/315
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #315: Incremental processing implementation

2020-02-02 Thread GitBox
rdblue commented on a change in pull request #315: Incremental processing 
implementation
URL: https://github.com/apache/incubator-iceberg/pull/315#discussion_r373878737
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java
 ##
 @@ -42,6 +42,16 @@ private SnapshotUtil() {
 return ancestorIds(table.currentSnapshot(), table::snapshot);
   }
 
+  /**
+   * @return List of snapshot ids in the range - (fromSnapshotId, toSnapshotId]
+   * This method assumes that fromSnapshotId is an ancestor of toSnapshotId
 
 Review comment:
   I don't see any place that checks whether the from snapshot is an ancestor 
of the to snapshot. That seems like a requirement for this to work correctly to 
me.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #315: Incremental processing implementation

2020-02-02 Thread GitBox
rdblue commented on a change in pull request #315: Incremental processing 
implementation
URL: https://github.com/apache/incubator-iceberg/pull/315#discussion_r373878272
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/ManifestGroup.java
 ##
 @@ -103,7 +122,35 @@ ManifestGroup caseSensitive(boolean newCaseSensitive) {
 return this;
   }
 
+  ManifestGroup planWith(ExecutorService newExecutorService) {
+this.executorService = newExecutorService;
+return this;
+  }
+
   /**
+   * Returns a iterable of scan tasks. It is safe to add entries of this 
iterable
+   * to a collection as {@link DataFile} in each {@link FileScanTask} is 
defensively
+   * copied.
+   * @return a {@link CloseableIterable} of {@link FileScanTask}
+   */
+  public CloseableIterable planFiles() {
+Iterable> tasks = entries((manifest, 
entries) -> {
+  PartitionSpec spec = specsById.get(manifest.partitionSpecId());
+  String schemaString = SchemaParser.toJson(spec.schema());
+  String specString = PartitionSpecParser.toJson(spec);
+  ResidualEvaluator residuals = ResidualEvaluator.of(spec, dataFilter, 
caseSensitive);
+  return CloseableIterable.transform(entries, e -> new BaseFileScanTask(
+  e.copy().file(), schemaString, specString, residuals));
 
 Review comment:
   Later, we will probably want to detect whether any stats column was 
projected and use `file.copy()` or `file.copyWithoutStats()` depending on the 
projection.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] jun-he commented on issue #760: Operators In / Not In are not implemented for Residuals

2020-02-02 Thread GitBox
jun-he commented on issue #760: Operators In / Not In are not implemented for 
Residuals
URL: 
https://github.com/apache/incubator-iceberg/issues/760#issuecomment-581180942
 
 
   @arina-ielchiieva I have submitted a PR to address this issue. Thanks.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] yujiantao commented on issue #743: [spark-3] Bump Apache spark to 3.0.0-preview2

2020-02-02 Thread GitBox
yujiantao commented on issue #743: [spark-3] Bump Apache spark to 3.0.0-preview2
URL: https://github.com/apache/incubator-iceberg/pull/743#issuecomment-581109895
 
 
   @rdblue Sorry for late reply as I was on a long vocation. I have refreshed 
the commit.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] yujiantao commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2

2020-02-02 Thread GitBox
yujiantao commented on a change in pull request #743: [spark-3] Bump Apache 
spark to 3.0.0-preview2
URL: https://github.com/apache/incubator-iceberg/pull/743#discussion_r373827428
 
 

 ##
 File path: 
spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchWrite.java
 ##
 @@ -493,6 +494,9 @@ protected void setCurrentKey(PartitionKey currentKey) {
 public void write(InternalRow row) throws IOException {
   writeInternal(row);
 }
+
+@Override
+public void close() {}
 
 Review comment:
   I will add this comment.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] yujiantao commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2

2020-02-02 Thread GitBox
yujiantao commented on a change in pull request #743: [spark-3] Bump Apache 
spark to 3.0.0-preview2
URL: https://github.com/apache/incubator-iceberg/pull/743#discussion_r373827417
 
 

 ##
 File path: 
spark/src/test/java/org/apache/iceberg/spark/source/TestStructuredStreaming.java
 ##
 @@ -79,7 +80,7 @@ public static void stopSpark() {
   }
 
   @Test
-  public void testStreamingWriteAppendMode() throws IOException {
+  public void testStreamingWriteAppendMode() throws IOException, 
TimeoutException {
 
 Review comment:
   Yes, throwing "Exception" is better.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org