[GitHub] [incubator-iceberg] jerryshao commented on issue #743: [spark-3] Bump Apache spark to 3.0.0-preview2
jerryshao commented on issue #743: [spark-3] Bump Apache spark to 3.0.0-preview2 URL: https://github.com/apache/incubator-iceberg/pull/743#issuecomment-581282605 LGTM. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] edgarRd commented on issue #768: Metrics javadoc fixes #767
edgarRd commented on issue #768: Metrics javadoc fixes #767 URL: https://github.com/apache/incubator-iceberg/pull/768#issuecomment-581219056 PTAL @rdsr @rdblue - Thanks! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] edgarRd commented on a change in pull request #768: Metrics javadoc fixes #767
edgarRd commented on a change in pull request #768: Metrics javadoc fixes #767 URL: https://github.com/apache/incubator-iceberg/pull/768#discussion_r373907688 ## File path: core/src/test/java/org/apache/iceberg/TestMetrics.java ## @@ -114,6 +114,56 @@ public abstract File writeRecords(Schema schema, Map properties, public abstract int splitCount(File parquetFile) throws IOException; + @Test + public void testMetricsForRepeatedValues() throws IOException { +Record firstRecord = new Record(AvroSchemaUtil.convert(SIMPLE_SCHEMA.asStruct())); +firstRecord.put("booleanCol", true); +firstRecord.put("intCol", 3); +firstRecord.put("longCol", null); +firstRecord.put("floatCol", 2.0F); +firstRecord.put("doubleCol", 2.0D); +firstRecord.put("decimalCol", new BigDecimal("3.50")); +firstRecord.put("stringCol", "AAA"); +firstRecord.put("dateCol", 1500); +firstRecord.put("timeCol", 2000L); +firstRecord.put("timestampCol", 0L); +firstRecord.put("uuidCol", uuid); +firstRecord.put("fixedCol", fixed); +firstRecord.put("binaryCol", "S".getBytes()); +Record secondRecord = new Record(AvroSchemaUtil.convert(SIMPLE_SCHEMA.asStruct())); +secondRecord.put("booleanCol", true); +secondRecord.put("intCol", 3); +secondRecord.put("longCol", null); +secondRecord.put("floatCol", 2.0F); +secondRecord.put("doubleCol", 2.0D); +secondRecord.put("decimalCol", new BigDecimal("3.50")); +secondRecord.put("stringCol", "AAA"); +secondRecord.put("dateCol", 1500); +secondRecord.put("timeCol", 2000L); +secondRecord.put("timestampCol", 0L); +secondRecord.put("uuidCol", uuid); +secondRecord.put("fixedCol", fixed); +secondRecord.put("binaryCol", "S".getBytes()); + +File recordsFile = writeRecords(SIMPLE_SCHEMA, firstRecord, secondRecord); + +Metrics metrics = getMetrics(Files.localInput(recordsFile)); +Assert.assertEquals(2L, (long) metrics.recordCount()); +assertCounts(1, 2L, 0L, metrics); +assertCounts(2, 2L, 0L, metrics); +assertCounts(3, 2L, 2L, metrics); Review comment: Note that in this case `valueCounts == nullValueCounts` and the count is the same as `recordCount`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] edgarRd opened a new pull request #768: Metrics javadoc fixes #767
edgarRd opened a new pull request #768: Metrics javadoc fixes #767 URL: https://github.com/apache/incubator-iceberg/pull/768 Adding javadoc to `Metrics` following `Parquet` metrics implementation as proposal. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] edgarRd opened a new issue #767: Clarify / Document metrics contract
edgarRd opened a new issue #767: Clarify / Document metrics contract URL: https://github.com/apache/incubator-iceberg/issues/767 The metrics contract is a bit unclear, from the implementation. Since it's not defined in the spec, having the only fully implemented metrics for Parquet, and while I'm working on ORC metrics it's not very clear what is the contract expected since file formats seem to implement this differently, for instance: * `Map valueCounts()` - it's not clear whether this method includes non-null or repeated values. As per the `TestMetrics` it looks like value counts *includes null and repeated values* which would be pretty much the same as row count, except for nested structures (e.g. lists, maps) - however this is not defined. This issue is to track the discussion about the expected metrics contract and get a clear definition. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] edgarRd commented on issue #213: Incomplete Iceberg to ORC column mapping
edgarRd commented on issue #213: Incomplete Iceberg to ORC column mapping URL: https://github.com/apache/incubator-iceberg/issues/213#issuecomment-581215477 Closing this issue since it was fixed in #227 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] edgarRd closed issue #213: Incomplete Iceberg to ORC column mapping
edgarRd closed issue #213: Incomplete Iceberg to ORC column mapping URL: https://github.com/apache/incubator-iceberg/issues/213 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #749: Convert Spark In filter to iceberg IN Expression
jun-he commented on a change in pull request #749: Convert Spark In filter to iceberg IN Expression URL: https://github.com/apache/incubator-iceberg/pull/749#discussion_r373900954 ## File path: spark/src/test/java/org/apache/iceberg/spark/source/TestFilteredScan.java ## @@ -543,11 +579,11 @@ private File buildPartitionedTable(String desc, PartitionSpec spec, String udf, private List testRecords(org.apache.avro.Schema avroSchema) { return Lists.newArrayList( -record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294658+00:00"), "junction"), +record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294+00:00"), "junction"), Review comment: The test fails because those partitions picked in the tests have only one value (equals the lower and higher bound) so the Timestamp must exactly match. To avoid change those values, I will update the test to use the partition of `2017-12-21T15`, which contains two records. So any Timestamp between them will match. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #749: Convert Spark In filter to iceberg IN Expression
jun-he commented on a change in pull request #749: Convert Spark In filter to iceberg IN Expression URL: https://github.com/apache/incubator-iceberg/pull/749#discussion_r373900954 ## File path: spark/src/test/java/org/apache/iceberg/spark/source/TestFilteredScan.java ## @@ -543,11 +579,11 @@ private File buildPartitionedTable(String desc, PartitionSpec spec, String udf, private List testRecords(org.apache.avro.Schema avroSchema) { return Lists.newArrayList( -record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294658+00:00"), "junction"), +record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294+00:00"), "junction"), Review comment: The test fails because those partitions picked in the tests have only one value (equals the lower and higher bound) so the Timestamp must exactly match. I will update the test to use the partition of `2017-12-21T15`, which contains two records. So any Timestamp between them will match. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #749: Convert Spark In filter to iceberg IN Expression
jun-he commented on a change in pull request #749: Convert Spark In filter to iceberg IN Expression URL: https://github.com/apache/incubator-iceberg/pull/749#discussion_r373900954 ## File path: spark/src/test/java/org/apache/iceberg/spark/source/TestFilteredScan.java ## @@ -543,11 +579,11 @@ private File buildPartitionedTable(String desc, PartitionSpec spec, String udf, private List testRecords(org.apache.avro.Schema avroSchema) { return Lists.newArrayList( -record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294658+00:00"), "junction"), +record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294+00:00"), "junction"), Review comment: The test fails because those partitions picked in the tests have only one value (equals the lower and higher bound) so the Timestamp must exactly match. To avoid changing those values, I will update the test to use the partition of `2017-12-21T15`, which contains two records. So any Timestamp between them will match. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] yujiantao commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2
yujiantao commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2 URL: https://github.com/apache/incubator-iceberg/pull/743#discussion_r373897701 ## File path: spark/src/test/java/org/apache/iceberg/spark/source/TestForwardCompatibility.java ## @@ -125,7 +126,7 @@ public void testSparkWriteFailsUnknownTransform() throws IOException { } @Test - public void testSparkStreamingWriteFailsUnknownTransform() throws IOException { + public void testSparkStreamingWriteFailsUnknownTransform() throws IOException, TimeoutException { Review comment: Ok. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] yujiantao commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2
yujiantao commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2 URL: https://github.com/apache/incubator-iceberg/pull/743#discussion_r373897760 ## File path: spark/src/test/java/org/apache/iceberg/spark/source/TestTables.java ## @@ -37,7 +37,7 @@ import org.apache.iceberg.io.InputFile; import org.apache.iceberg.io.LocationProvider; import org.apache.iceberg.io.OutputFile; -import parquet.Preconditions; +import org.apache.parquet.Preconditions; Review comment: Yes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] yujiantao commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2
yujiantao commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2 URL: https://github.com/apache/incubator-iceberg/pull/743#discussion_r373897701 ## File path: spark/src/test/java/org/apache/iceberg/spark/source/TestForwardCompatibility.java ## @@ -125,7 +126,7 @@ public void testSparkWriteFailsUnknownTransform() throws IOException { } @Test - public void testSparkStreamingWriteFailsUnknownTransform() throws IOException { + public void testSparkStreamingWriteFailsUnknownTransform() throws IOException, TimeoutException { Review comment: Ok This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue edited a comment on issue #764: make org.apache.iceberg. PartitionData public
rdblue edited a comment on issue #764: make org.apache.iceberg. PartitionData public URL: https://github.com/apache/incubator-iceberg/issues/764#issuecomment-581183953 I don't think that this necessarily needs to be public. It's an internal implementation that is not part of the public API. What are you trying to do? Can you pass a `StructLike` instead? The Iceberg generic records in org.apache.iceberg.data can be passed, so you can build a tuple of data using the generic record. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #315: Incremental processing implementation
rdsr commented on a change in pull request #315: Incremental processing implementation URL: https://github.com/apache/incubator-iceberg/pull/315#discussion_r373885093 ## File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java ## @@ -42,6 +42,16 @@ private SnapshotUtil() { return ancestorIds(table.currentSnapshot(), table::snapshot); } + /** + * @return List of snapshot ids in the range - (fromSnapshotId, toSnapshotId] + * This method assumes that fromSnapshotId is an ancestor of toSnapshotId Review comment: Yes. It will be in the follow-up This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #749: Convert Spark In filter to iceberg IN Expression
rdblue commented on a change in pull request #749: Convert Spark In filter to iceberg IN Expression URL: https://github.com/apache/incubator-iceberg/pull/749#discussion_r373880794 ## File path: spark/src/test/java/org/apache/iceberg/spark/source/TestFilteredScan.java ## @@ -543,11 +579,11 @@ private File buildPartitionedTable(String desc, PartitionSpec spec, String udf, private List testRecords(org.apache.avro.Schema avroSchema) { return Lists.newArrayList( -record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294658+00:00"), "junction"), +record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294+00:00"), "junction"), Review comment: These don't use the constructor that only supports milliseconds, so these should be precise to microseconds. But the test only uses Timestamp to create a filter that gets converted to a partition filter, so the timestamps used to create the Spark filter and these timestamps shouldn't need to match. Doesn't the test pass if these are unchanged? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on issue #751: Add an option to decide whether to delete data files in Catalog.dropTable()
rdblue commented on issue #751: Add an option to decide whether to delete data files in Catalog.dropTable() URL: https://github.com/apache/incubator-iceberg/issues/751#issuecomment-581184344 I think you're saying you want a purge option that deletes metadata, but not data? I'm very reluctant to add cases like this to the API. We want to keep the API small, and I think this may be specific to your use case. What about doing this in the application that creates the staging tables instead? You'd set purge=false when dropping the table and then clean up the metadata using your own code. Would that work? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #749: Convert Spark In filter to iceberg IN Expression
jun-he commented on a change in pull request #749: Convert Spark In filter to iceberg IN Expression URL: https://github.com/apache/incubator-iceberg/pull/749#discussion_r373880462 ## File path: spark/src/test/java/org/apache/iceberg/spark/source/TestFilteredScan.java ## @@ -543,11 +579,11 @@ private File buildPartitionedTable(String desc, PartitionSpec spec, String udf, private List testRecords(org.apache.avro.Schema avroSchema) { return Lists.newArrayList( -record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294658+00:00"), "junction"), +record(avroSchema, 0L, timestamp("2017-12-22T09:20:44.294+00:00"), "junction"), Review comment: @rdblue It is because `java.sql.Timestamp` constructor uses a milliseconds time value. There is a deprecated `java.sql.Timestamp` constructor to use year, month, date, hour, minute, second, and nano. But we also need take care of timezone issue (java timestamp is always UTC). So to avoid using deprecated method and make the test straightforward, I just update two records to be millisecond scale. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on issue #761: Suggestion for newbie getting started guide
rdblue commented on issue #761: Suggestion for newbie getting started guide URL: https://github.com/apache/incubator-iceberg/issues/761#issuecomment-581184156 Hadoop tables cannot be used with a file system that doesn't support atomic rename. They should only be used with HDFS or a local FS, not with S3. For S3, you should be using a metastore, like Hive. You're right that IcebergGenerics is the recommended way to read a table directly, without using an engine like Spark or Presto. If you want to write up some documentation, that would be great! Please open a PR. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on issue #764: make org.apache.iceberg. PartitionData public
rdblue commented on issue #764: make org.apache.iceberg. PartitionData public URL: https://github.com/apache/incubator-iceberg/issues/764#issuecomment-581183953 I don't think that this necessarily needs to be public. It's an internal implementation that is not part of the public API. What are you trying to do? Can you pass `StructLike` instead? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2
rdblue commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2 URL: https://github.com/apache/incubator-iceberg/pull/743#discussion_r373880002 ## File path: spark/src/test/java/org/apache/iceberg/spark/source/TestForwardCompatibility.java ## @@ -125,7 +126,7 @@ public void testSparkWriteFailsUnknownTransform() throws IOException { } @Test - public void testSparkStreamingWriteFailsUnknownTransform() throws IOException { + public void testSparkStreamingWriteFailsUnknownTransform() throws IOException, TimeoutException { Review comment: This should also be `Exception`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on issue #743: [spark-3] Bump Apache spark to 3.0.0-preview2
rdblue commented on issue #743: [spark-3] Bump Apache spark to 3.0.0-preview2 URL: https://github.com/apache/incubator-iceberg/pull/743#issuecomment-581183783 Thanks @yujiantao! I caught a couple more minor things. Once those are done, we should be good to merge. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2
rdblue commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2 URL: https://github.com/apache/incubator-iceberg/pull/743#discussion_r373879959 ## File path: spark/src/test/java/org/apache/iceberg/spark/source/TestTables.java ## @@ -37,7 +37,7 @@ import org.apache.iceberg.io.InputFile; import org.apache.iceberg.io.LocationProvider; import org.apache.iceberg.io.OutputFile; -import parquet.Preconditions; +import org.apache.parquet.Preconditions; Review comment: This should not be using Preconditions from Parquet. It should be the guava one. Can you update it to that one instead? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #766: Support in and notIn in ResidualVisitor
rdblue commented on a change in pull request #766: Support in and notIn in ResidualVisitor URL: https://github.com/apache/incubator-iceberg/pull/766#discussion_r373879733 ## File path: api/src/test/java/org/apache/iceberg/transforms/TestResiduals.java ## @@ -156,4 +158,46 @@ public void testUnpartitionedResiduals() { expr, residualEvaluator.residualFor(Row.of())); } } + + @Test + public void testIn() { +Schema schema = new Schema( +Types.NestedField.optional(50, "dateint", Types.IntegerType.get()), +Types.NestedField.optional(51, "hour", Types.IntegerType.get()) +); + +PartitionSpec spec = PartitionSpec.builderFor(schema) +.identity("dateint") +.build(); + +ResidualEvaluator resEval = ResidualEvaluator.of(spec, +in("dateint", 20170815, 20170816, 20170817), true); + +Expression residual = resEval.residualFor(Row.of(20170815)); +Assert.assertEquals("Residual should be alwaysTrue", alwaysTrue(), residual); + +residual = resEval.residualFor(Row.of(20180815)); +Assert.assertEquals("Residual should be alwaysFalse", alwaysFalse(), residual); + } + + @Test + public void testNotIn() { Review comment: Can you also add tests for `in` and `notIn` with a timestamp (`ts` column) to date transform (`day`)? For the `in` case, the residual for a date that contains one of the timestamps should be the original `in` predicate. If the date doesn't match one of the `in` predicate timestamp values, then it should be `alwaysFalse`. For `notIn`, if the date matches it should be the original `notIn` predicate and if it does not match, it should be `alwaysTrue`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #766: Support in and notIn in ResidualVisitor
rdblue commented on a change in pull request #766: Support in and notIn in ResidualVisitor URL: https://github.com/apache/incubator-iceberg/pull/766#discussion_r373879282 ## File path: api/src/test/java/org/apache/iceberg/transforms/TestResiduals.java ## @@ -156,4 +158,46 @@ public void testUnpartitionedResiduals() { expr, residualEvaluator.residualFor(Row.of())); } } + + @Test Review comment: Can you also add cases to `testUnpartitionedResiduals`? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #766: Support in and notIn in ResidualVisitor
rdblue commented on a change in pull request #766: Support in and notIn in ResidualVisitor URL: https://github.com/apache/incubator-iceberg/pull/766#discussion_r373879036 ## File path: api/src/main/java/org/apache/iceberg/expressions/ResidualEvaluator.java ## @@ -195,6 +196,26 @@ public Expression or(Expression leftResult, Expression rightResult) { return (cmp.compare(ref.eval(struct), lit.value()) != 0) ? alwaysTrue() : alwaysFalse(); } +@Override +public Expression in(BoundReference ref, Set literalSet) { + T val = ref.eval(struct); + if (literalSet.stream().anyMatch(lit -> ref.comparator().compare(val, lit) == 0)) { Review comment: The literal set can handle all types, so this can use `contains` instead of checking each value individually. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on issue #315: Incremental processing implementation
rdblue commented on issue #315: Incremental processing implementation URL: https://github.com/apache/incubator-iceberg/pull/315#issuecomment-581182311 @rdsr, I'm merging this. Thanks for all your work on it! I think we do need to follow up pretty quickly with better validations for snapshot IDs passed to appendsBetween and appendsAfter. Right now, I don't think there is anything that validates that there is a range of snapshots from the start to the end -- it looks like this would use the whole ancestry of the "to" snapshot if "from" isn't an ancestor. Let's clean that up with validations, but for now I've committed this since it quite a large patch. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue merged pull request #315: Incremental processing implementation
rdblue merged pull request #315: Incremental processing implementation URL: https://github.com/apache/incubator-iceberg/pull/315 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #315: Incremental processing implementation
rdblue commented on a change in pull request #315: Incremental processing implementation URL: https://github.com/apache/incubator-iceberg/pull/315#discussion_r373878737 ## File path: core/src/main/java/org/apache/iceberg/util/SnapshotUtil.java ## @@ -42,6 +42,16 @@ private SnapshotUtil() { return ancestorIds(table.currentSnapshot(), table::snapshot); } + /** + * @return List of snapshot ids in the range - (fromSnapshotId, toSnapshotId] + * This method assumes that fromSnapshotId is an ancestor of toSnapshotId Review comment: I don't see any place that checks whether the from snapshot is an ancestor of the to snapshot. That seems like a requirement for this to work correctly to me. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #315: Incremental processing implementation
rdblue commented on a change in pull request #315: Incremental processing implementation URL: https://github.com/apache/incubator-iceberg/pull/315#discussion_r373878272 ## File path: core/src/main/java/org/apache/iceberg/ManifestGroup.java ## @@ -103,7 +122,35 @@ ManifestGroup caseSensitive(boolean newCaseSensitive) { return this; } + ManifestGroup planWith(ExecutorService newExecutorService) { +this.executorService = newExecutorService; +return this; + } + /** + * Returns a iterable of scan tasks. It is safe to add entries of this iterable + * to a collection as {@link DataFile} in each {@link FileScanTask} is defensively + * copied. + * @return a {@link CloseableIterable} of {@link FileScanTask} + */ + public CloseableIterable planFiles() { +Iterable> tasks = entries((manifest, entries) -> { + PartitionSpec spec = specsById.get(manifest.partitionSpecId()); + String schemaString = SchemaParser.toJson(spec.schema()); + String specString = PartitionSpecParser.toJson(spec); + ResidualEvaluator residuals = ResidualEvaluator.of(spec, dataFilter, caseSensitive); + return CloseableIterable.transform(entries, e -> new BaseFileScanTask( + e.copy().file(), schemaString, specString, residuals)); Review comment: Later, we will probably want to detect whether any stats column was projected and use `file.copy()` or `file.copyWithoutStats()` depending on the projection. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] jun-he commented on issue #760: Operators In / Not In are not implemented for Residuals
jun-he commented on issue #760: Operators In / Not In are not implemented for Residuals URL: https://github.com/apache/incubator-iceberg/issues/760#issuecomment-581180942 @arina-ielchiieva I have submitted a PR to address this issue. Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] yujiantao commented on issue #743: [spark-3] Bump Apache spark to 3.0.0-preview2
yujiantao commented on issue #743: [spark-3] Bump Apache spark to 3.0.0-preview2 URL: https://github.com/apache/incubator-iceberg/pull/743#issuecomment-581109895 @rdblue Sorry for late reply as I was on a long vocation. I have refreshed the commit. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] yujiantao commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2
yujiantao commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2 URL: https://github.com/apache/incubator-iceberg/pull/743#discussion_r373827428 ## File path: spark/src/main/java/org/apache/iceberg/spark/source/SparkBatchWrite.java ## @@ -493,6 +494,9 @@ protected void setCurrentKey(PartitionKey currentKey) { public void write(InternalRow row) throws IOException { writeInternal(row); } + +@Override +public void close() {} Review comment: I will add this comment. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] yujiantao commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2
yujiantao commented on a change in pull request #743: [spark-3] Bump Apache spark to 3.0.0-preview2 URL: https://github.com/apache/incubator-iceberg/pull/743#discussion_r373827417 ## File path: spark/src/test/java/org/apache/iceberg/spark/source/TestStructuredStreaming.java ## @@ -79,7 +80,7 @@ public static void stopSpark() { } @Test - public void testStreamingWriteAppendMode() throws IOException { + public void testStreamingWriteAppendMode() throws IOException, TimeoutException { Review comment: Yes, throwing "Exception" is better. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org