[GitHub] [incubator-iceberg] waterlx edited a comment on issue #279: Transforming timestamp to date should produce date

2019-08-13 Thread GitBox
waterlx edited a comment on issue #279: Transforming timestamp to date should 
produce date
URL: 
https://github.com/apache/incubator-iceberg/issues/279#issuecomment-521086258
 
 
   @rdblue, regarding
   > @manishmalhotrawork, only the `day` transform should return a date. Years, 
months, and hours should continue returning ordinals like before.
   
   so you mean that we are supposed to keep 
`org.apache.iceberg.transforms.Timestamps` as it is (as `Transform`) and keep `org.apache.iceberg.transforms.Dates` as it is (as 
`Transform`), so as not to change the return of 
`Transforms.year(), month() and hour()`? Do you propose to only change the 
output of Transforms.day() to `Transform`? Do I get it right?
   
   @manishmalhotrawork yes, I am working on this issue. But please feel free to 
go ahead to share your thoughts or upload patches, I am new to iceberg 
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] waterlx commented on issue #279: Transforming timestamp to date should produce date

2019-08-13 Thread GitBox
waterlx commented on issue #279: Transforming timestamp to date should produce 
date
URL: 
https://github.com/apache/incubator-iceberg/issues/279#issuecomment-521086258
 
 
   @rdblue 
   > @manishmalhotrawork, only the `day` transform should return a date. Years, 
months, and hours should continue returning ordinals like before.
   
   so you mean that we are supposed to keep 
`org.apache.iceberg.transforms.Timestamps` as it is (as `Transform`) and keep `org.apache.iceberg.transforms.Dates` as it is (as 
`Transform`, but only change the output of Transforms.day() 
to `Transform`? Do I get it right?
   
   @manishmalhotrawork yes, I am working on this issue. But please feel free to 
go ahead to share your thoughts or upload patches, I am new to iceberg 
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] fbocse commented on a change in pull request #244: [WIP] Add new Update schema API to support adding of all new fields detected based on a companion schema

2019-08-13 Thread GitBox
fbocse commented on a change in pull request #244: [WIP] Add new Update schema 
API to support adding of all new fields detected based on a companion schema
URL: https://github.com/apache/incubator-iceberg/pull/244#discussion_r313681588
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
 ##
 @@ -186,6 +188,88 @@ public UpdateSchema updateColumnDoc(String name, String 
doc) {
 return this;
   }
 
+  @Override
+  public UpdateSchema equateNewFields(Schema schema) {
+Preconditions.checkArgument(schema != null, "Cannot add new fields off of 
a null schema");
+List wrappedTypes = 
TypeUtil.CompanionSchemaVisitor.schema(this.schema, schema,
+new CompanionSchemaNewFieldsCollector());
+
+// With Iceberg we can only add to pre-existing fields so we need to 
"hydrate" all nested types.
+wrappedTypes.forEach(
+wrappedType -> {
+  Map.Entry entry = inflate(wrappedType);
+  // Based on being top-level or nested we use the corresponding API 
for adding new columns to the schema
+  Preconditions.checkArgument(entry != null, "Cannot inflate types of 
null predecessor");
+  if (entry.getKey().isEmpty()) {
+addColumn(entry.getValue().name(), entry.getValue().type(), 
entry.getValue().doc());
+  } else {
+addColumn(entry.getKey(), entry.getValue().name(), 
entry.getValue().type(), entry.getValue().doc());
+  }
+}
+);
+
+return this;
+  }
+
+  /**
+   * Given a nested type structure this will provide a fully hydrated {@link 
Types.NestedField} accompanied by
+   * predecessor name as dot separated string.
+   * TODO - any suggestions for modelling a tuple other than 
java.util.Map.Entry?
+   *
+   * @param wrappedType the nested type we want to inflate (will use method 
recursively for non-primitive structures)
+   * @return a tuple having a String representation of the type predecessor as 
key and the nested field as value
+   */
+  private  Map.Entry inflate(T wrappedType) {
 
 Review comment:
   Good point, when writing the code I did kind of ended up with this synthetic 
 `WrappedType` construct but after putting together the code for inflating 
nested types I should just ditch using the `WrappedType` construct altogether.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] fbocse commented on a change in pull request #244: [WIP] Add new Update schema API to support adding of all new fields detected based on a companion schema

2019-08-13 Thread GitBox
fbocse commented on a change in pull request #244: [WIP] Add new Update schema 
API to support adding of all new fields detected based on a companion schema
URL: https://github.com/apache/incubator-iceberg/pull/244#discussion_r313681188
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/SchemaUpdate.java
 ##
 @@ -186,6 +188,88 @@ public UpdateSchema updateColumnDoc(String name, String 
doc) {
 return this;
   }
 
+  @Override
+  public UpdateSchema equateNewFields(Schema schema) {
+Preconditions.checkArgument(schema != null, "Cannot add new fields off of 
a null schema");
+List wrappedTypes = 
TypeUtil.CompanionSchemaVisitor.schema(this.schema, schema,
+new CompanionSchemaNewFieldsCollector());
+
+// With Iceberg we can only add to pre-existing fields so we need to 
"hydrate" all nested types.
+wrappedTypes.forEach(
+wrappedType -> {
+  Map.Entry entry = inflate(wrappedType);
+  // Based on being top-level or nested we use the corresponding API 
for adding new columns to the schema
+  Preconditions.checkArgument(entry != null, "Cannot inflate types of 
null predecessor");
+  if (entry.getKey().isEmpty()) {
+addColumn(entry.getValue().name(), entry.getValue().type(), 
entry.getValue().doc());
+  } else {
+addColumn(entry.getKey(), entry.getValue().name(), 
entry.getValue().type(), entry.getValue().doc());
+  }
+}
+);
+
+return this;
+  }
+
+  /**
+   * Given a nested type structure this will provide a fully hydrated {@link 
Types.NestedField} accompanied by
+   * predecessor name as dot separated string.
+   * TODO - any suggestions for modelling a tuple other than 
java.util.Map.Entry?
+   *
+   * @param wrappedType the nested type we want to inflate (will use method 
recursively for non-primitive structures)
+   * @return a tuple having a String representation of the type predecessor as 
key and the nested field as value
 
 Review comment:
   Yes, will rename to `parent`, in the case of a hierarchy `parent` sounds 
better.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] xabriel commented on a change in pull request #377: Add FindFiles helper API

2019-08-13 Thread GitBox
xabriel commented on a change in pull request #377: Add FindFiles helper API
URL: https://github.com/apache/incubator-iceberg/pull/377#discussion_r313667733
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/FindFiles.java
 ##
 @@ -0,0 +1,195 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.iceberg;
+
+import com.google.common.base.Preconditions;
+import java.time.Instant;
+import java.time.LocalDateTime;
+import java.time.ZoneId;
+import java.time.format.DateTimeFormatter;
+import java.util.Arrays;
+import java.util.List;
+import org.apache.iceberg.expressions.Expression;
+import org.apache.iceberg.expressions.Expressions;
+import org.apache.iceberg.io.CloseableIterable;
+
+public class FindFiles {
+  private static final DateTimeFormatter DATE_FORMAT = 
DateTimeFormatter.ofPattern("-MM-dd HH:mm:ss.SSS");
+
+  public static Builder in(Table table) {
+return new Builder(table);
+  }
+
+  public static class Builder {
+private final Table table;
+private final TableOperations ops;
+private boolean caseSensitive = true;
+private Long snapshotId = null;
+private Expression rowFilter = Expressions.alwaysTrue();
+private Expression fileFilter = Expressions.alwaysTrue();
+private Expression partitionFilter = Expressions.alwaysTrue();
+
+public Builder(Table table) {
+  this.table = table;
+  this.ops = ((HasTableOperations) table).operations();
+}
+
+public Builder caseInsensitive() {
+  this.caseSensitive = false;
+  return this;
+}
+
+public Builder caseSensitive(boolean caseSensitive) {
+  this.caseSensitive = caseSensitive;
+  return this;
+}
+
+/**
+ * Base results on the given snapshot.
+ *
+ * @param snapshotId a snapshot ID
+ * @return this for method chaining
+ */
+public Builder inSnapshot(long snapshotId) {
+  Preconditions.checkArgument(this.snapshotId == null,
+  "Cannot set snapshot multiple times, already set to id=%s", 
snapshotId);
+  Preconditions.checkArgument(table.snapshot(snapshotId) != null,
+  "Cannot find snapshot for id=%s", snapshotId);
+  this.snapshotId = snapshotId;
+  return this;
+}
+
+/**
+ * Base results on files in the snapshot that was current as of a 
timestamp.
+ *
+ * @param timestampMillis a timestamp in milliseconds
+ * @return this for method chaining
+ */
+public Builder asOfTime(long timestampMillis) {
+  Preconditions.checkArgument(this.snapshotId == null,
+  "Cannot set snapshot multiple times, already set to id=%s", 
snapshotId);
+
+  Long lastSnapshotId = null;
+  for (HistoryEntry logEntry : ops.current().snapshotLog()) {
+if (logEntry.timestampMillis() <= timestampMillis) {
+  lastSnapshotId = logEntry.snapshotId();
+}
 
 Review comment:
   I assume `snapshotLog()` is in ascending order. Should we then `break` after 
the condition is not true anymore?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #357: Add in and not in predicates

2019-08-13 Thread GitBox
jun-he commented on a change in pull request #357: Add in and not in predicates
URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313653184
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/expressions/Evaluator.java
 ##
 @@ -134,18 +134,18 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
 }
 
 @Override
-public  Boolean in(BoundReference ref, Literal lit) {
-  throw new UnsupportedOperationException("In is not supported yet");
+public  Boolean in(BoundReference ref, LiteralSet literalSet) {
+  return literalSet.contains(ref.get(struct));
 }
 
 @Override
-public  Boolean notIn(BoundReference ref, Literal lit) {
-  return !in(ref, lit);
+public  Boolean notIn(BoundReference ref, LiteralSet literalSet) {
+  return !in(ref, literalSet);
 }
 
 @Override
 public  Boolean startsWith(BoundReference ref, Literal lit) {
-  return ((String) ref.get(struct)).startsWith((String) lit.value());
+  return ((String) ref.get(struct)).startsWith(lit.value().toString());
 
 Review comment:
   Here, I changed the construction of `StringLiteral` to be
   ```
   StringLiteral(CharSequence value) {
 super(new CharSeqWrapper(value));
   }
   ```
   
   The other way is to only wrap `CharSequence` within LiteralSet. In that way, 
there is no need to change `startsWith`.
   
   Wrapping inside seems cleaner with less side effect and I will update the PR 
accordingly.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on issue #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
rdblue commented on issue #372: Allow read split behavior overrides
URL: https://github.com/apache/incubator-iceberg/pull/372#issuecomment-521056199
 
 
   Merged. Thanks @xabriel!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue merged pull request #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
rdblue merged pull request #372: Allow read split behavior overrides
URL: https://github.com/apache/incubator-iceberg/pull/372
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue opened a new pull request #378: Handle null values in transforms

2019-08-13 Thread GitBox
rdblue opened a new pull request #378: Handle null values in transforms
URL: https://github.com/apache/incubator-iceberg/pull/378
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] xabriel commented on a change in pull request #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
xabriel commented on a change in pull request #372: Allow read split behavior 
overrides
URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313655207
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java
 ##
 @@ -54,22 +56,24 @@
   private final TableOperations ops;
   private final Table table;
   private final Long snapshotId;
+  private final Map options;
   private final Schema schema;
   private final Expression rowFilter;
   private final boolean caseSensitive;
   private final boolean colStats;
   private final Collection selectedColumns;
 
   protected BaseTableScan(TableOperations ops, Table table, Schema schema) {
-this(ops, table, null, schema, Expressions.alwaysTrue(), true, false, 
null);
+this(ops, table, null, null, schema, Expressions.alwaysTrue(), true, 
false, null);
   }
 
-  protected BaseTableScan(TableOperations ops, Table table, Long snapshotId, 
Schema schema,
-Expression rowFilter, boolean caseSensitive, boolean 
colStats,
+  protected BaseTableScan(TableOperations ops, Table table, Long snapshotId, 
Map options,
 
 Review comment:
   Fixed to be added to the end.
   
   The reason it was in the middle before is that it used to be `splitOptions` 
and those were related to `snapshotId` as in that they modify the read behavior.
   
   Now they're generic so agreed it makes sense to keep at end.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] xabriel commented on a change in pull request #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
xabriel commented on a change in pull request #372: Allow read split behavior 
overrides
URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313654409
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java
 ##
 @@ -54,22 +56,24 @@
   private final TableOperations ops;
   private final Table table;
   private final Long snapshotId;
+  private final Map options;
   private final Schema schema;
   private final Expression rowFilter;
   private final boolean caseSensitive;
   private final boolean colStats;
   private final Collection selectedColumns;
 
   protected BaseTableScan(TableOperations ops, Table table, Schema schema) {
-this(ops, table, null, schema, Expressions.alwaysTrue(), true, false, 
null);
+this(ops, table, null, null, schema, Expressions.alwaysTrue(), true, 
false, null);
 
 Review comment:
   Using an `ImmutableMap` all around now.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] xabriel commented on a change in pull request #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
xabriel commented on a change in pull request #372: Allow read split behavior 
overrides
URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313654308
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java
 ##
 @@ -123,31 +128,46 @@ public TableScan asOfTime(long timestampMillis) {
 return useSnapshot(lastSnapshotId);
   }
 
+  @Override
+  public TableScan option(String property, String value) {
+Map localOptions = Maps.newHashMap();
 
 Review comment:
   Agreed. Fixed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #357: Add in and not in predicates

2019-08-13 Thread GitBox
jun-he commented on a change in pull request #357: Add in and not in predicates
URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313653184
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/expressions/Evaluator.java
 ##
 @@ -134,18 +134,18 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
 }
 
 @Override
-public  Boolean in(BoundReference ref, Literal lit) {
-  throw new UnsupportedOperationException("In is not supported yet");
+public  Boolean in(BoundReference ref, LiteralSet literalSet) {
+  return literalSet.contains(ref.get(struct));
 }
 
 @Override
-public  Boolean notIn(BoundReference ref, Literal lit) {
-  return !in(ref, lit);
+public  Boolean notIn(BoundReference ref, LiteralSet literalSet) {
+  return !in(ref, literalSet);
 }
 
 @Override
 public  Boolean startsWith(BoundReference ref, Literal lit) {
-  return ((String) ref.get(struct)).startsWith((String) lit.value());
+  return ((String) ref.get(struct)).startsWith(lit.value().toString());
 
 Review comment:
   Here, I changed the construction of `StringLiteral` to be
   ```
   StringLiteral(CharSequence value) {
 super(new CharSeqWrapper(value));
   }
   ```
   
   The other way is to only wrap `CharSequence` within LiteralSet. In that way, 
there is no need to change `startsWith`.
   
   This seems cleaner with less side effect and I will update the PR 
accordingly.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #357: Add in and not in predicates

2019-08-13 Thread GitBox
jun-he commented on a change in pull request #357: Add in and not in predicates
URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313653184
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/expressions/Evaluator.java
 ##
 @@ -134,18 +134,18 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
 }
 
 @Override
-public  Boolean in(BoundReference ref, Literal lit) {
-  throw new UnsupportedOperationException("In is not supported yet");
+public  Boolean in(BoundReference ref, LiteralSet literalSet) {
+  return literalSet.contains(ref.get(struct));
 }
 
 @Override
-public  Boolean notIn(BoundReference ref, Literal lit) {
-  return !in(ref, lit);
+public  Boolean notIn(BoundReference ref, LiteralSet literalSet) {
+  return !in(ref, literalSet);
 }
 
 @Override
 public  Boolean startsWith(BoundReference ref, Literal lit) {
-  return ((String) ref.get(struct)).startsWith((String) lit.value());
+  return ((String) ref.get(struct)).startsWith(lit.value().toString());
 
 Review comment:
   Here, I changed the construction of `StringLiteral` to be
   ```StringLiteral(CharSequence value) {
 super(new CharSeqWrapper(value));
   }```
   
   The other way is to only wrap `CharSequence` within LiteralSet. In that way, 
there is no need to change `startsWith`.
   
   This seems cleaner with less side effect and I will update the PR 
accordingly.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #357: Add in and not in predicates

2019-08-13 Thread GitBox
jun-he commented on a change in pull request #357: Add in and not in predicates
URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313653184
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/expressions/Evaluator.java
 ##
 @@ -134,18 +134,18 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
 }
 
 @Override
-public  Boolean in(BoundReference ref, Literal lit) {
-  throw new UnsupportedOperationException("In is not supported yet");
+public  Boolean in(BoundReference ref, LiteralSet literalSet) {
+  return literalSet.contains(ref.get(struct));
 }
 
 @Override
-public  Boolean notIn(BoundReference ref, Literal lit) {
-  return !in(ref, lit);
+public  Boolean notIn(BoundReference ref, LiteralSet literalSet) {
+  return !in(ref, literalSet);
 }
 
 @Override
 public  Boolean startsWith(BoundReference ref, Literal lit) {
-  return ((String) ref.get(struct)).startsWith((String) lit.value());
+  return ((String) ref.get(struct)).startsWith(lit.value().toString());
 
 Review comment:
   Here, I changed the construction of `StringLiteral` to be
   ```StringLiteral(CharSequence value) {
 super(new CharSeqWrapper(value));
   }
   ```
   
   The other way is to only wrap `CharSequence` within LiteralSet. In that way, 
there is no need to change `startsWith`.
   
   This seems cleaner with less side effect and I will update the PR 
accordingly.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue opened a new pull request #377: Add FindFiles helper API

2019-08-13 Thread GitBox
rdblue opened a new pull request #377: Add FindFiles helper API
URL: https://github.com/apache/incubator-iceberg/pull/377
 
 
   This adds `FindFiles`, a helper API for working with files in a table. This 
API is an alternative to building a scan and calling `planFiles`. Using a scan 
returns `FileScanTask` instead of `DataFile` instances and send a scan 
notification, which may not be useful when inspecting a table's files.
   
   This is intended to be used for inspecting a table's contents without 
needing a scan. For example, a merge process could use this API to locate files 
under a certain size in a partition to merge.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313647841
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -88,6 +121,40 @@ public OverwriteFiles validateAddedFiles() {
   }
 }
 
+if (conflictingAppendsRowFilter != null) {
+  PartitionSpec spec = writeSpec();
+  Expression inclusiveExpr = 
Projections.inclusive(spec).project(conflictingAppendsRowFilter);
+  Evaluator inclusive = new Evaluator(spec.partitionType(), inclusiveExpr);
+
+  List newFiles = collectNewFiles(base);
+  for (DataFile newFile : newFiles) {
+// we do partition-level conflict resolution right now
+// we can enhance it by leveraging column stats and MetricsEvaluator
+ValidationException.check(
+!inclusive.eval(newFile.partition()),
+"A conflicting file was appended that matches filter '%s': %s",
+conflictingAppendsRowFilter, newFile.path());
+  }
+}
+
 return super.apply(base);
   }
+
+  private List collectNewFiles(TableMetadata meta) {
+List newFiles = new ArrayList<>();
+
+Long currentSnapshotId = meta.currentSnapshot() == null ? null : 
meta.currentSnapshot().snapshotId();
+while (currentSnapshotId != null && 
!currentSnapshotId.equals(readSnapshotId)) {
+  Snapshot currentSnapshot = meta.snapshot(currentSnapshotId);
+
+  if (currentSnapshot == null) {
+throw new ValidationException("Cannot find snapshot %d. Was it 
expired?", currentSnapshotId);
 
 Review comment:
   If `meta.currentSnapshot` is null, this will print "Cannot find snapshot 
null". I think it would be better to start the method with a precondition that 
checks that meta has a current snapshot and fails with a better error message 
if it doesn't.
   
   Also, there are a few reasons why the snapshot can't be found. The base 
snapshot may have been expired, but it also could have been rolled back. I 
think this should just state "Cannot determine history between base snapshot %s 
and current %s".


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313646140
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/OverwriteData.java
 ##
 @@ -27,10 +30,15 @@
 import org.apache.iceberg.expressions.StrictMetricsEvaluator;
 
 public class OverwriteData extends MergingSnapshotProducer 
implements OverwriteFiles {
-  private boolean validateAddedFiles = false;
+  private boolean overwriteByRowFilter;
+  private boolean validateAddedFilesMatchRowFilter;
+  private boolean deleteFilesByPath;
+  private Long readSnapshotId;
+  private Expression conflictingAppendsRowFilter;
 
   protected OverwriteData(TableOperations ops) {
 super(ops);
+failMissingDeletePaths();
 
 Review comment:
   Is this always required? It seems like it might only be needed when 
validating conflicting operations.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313645766
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
 ##
 @@ -74,5 +90,22 @@
*
* @return this for method chaining
*/
-  OverwriteFiles validateAddedFiles();
+  OverwriteFiles validateAddedFilesMatchRowFilter();
+
+  /**
+   * Enables validation of files that are added concurrently.
 
 Review comment:
   I'd like to add something about the type of validation. Does something like 
this make sense?
   
   > Enables validation that files added concurrently do not conflict with this 
commit's operation.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313645460
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
 ##
 @@ -74,5 +90,22 @@
*
* @return this for method chaining
*/
-  OverwriteFiles validateAddedFiles();
+  OverwriteFiles validateAddedFilesMatchRowFilter();
+
+  /**
+   * Enables validation of files that are added concurrently.
+   * 
+   * This method should be called when the table is queried to determine files 
to delete/append.
+   * If a concurrent operation commits a new file after the data was read and 
that file might
+   * contain rows matching the specified row filter, the overwrite operation 
will detect
+   * this during retries and fail.
+   * 
+   * Calling this method with a correct row filter is required to maintain 
serializable isolation.
+   * Otherwise, the isolation level will be snapshot isolation.
+   *
+   * @param readSnapshotId the snapshot id that was used to read the data
+   * @param rowFilter an expression on rows in the table
+   * @return this for method chaining
+   */
+  OverwriteFiles validateNoConflictingAppends(Long readSnapshotId, Expression 
rowFilter);
 
 Review comment:
   Can we rename `rowFilter` to `conflictDetectionFilter`?
   
   Why use `Long` instead of `long`? That would allow passing null. Is that 
intentional? If null, then use the base snapshot ID?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313644916
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
 ##
 @@ -74,5 +90,22 @@
*
* @return this for method chaining
*/
-  OverwriteFiles validateAddedFiles();
+  OverwriteFiles validateAddedFilesMatchRowFilter();
 
 Review comment:
   Since a row filter is passed to `validateNoConflictingAppends`, I think this 
should be `validateAddedFilesMatchOverwriteFilter` instead.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313643617
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
 ##
 @@ -65,6 +70,17 @@
*/
   OverwriteFiles addFile(DataFile file);
 
+  /**
+   * Delete a {@link DataFile} from the table.
+   * 
+   * Note that either this method or {@link 
OverwriteFiles#overwriteByRowFilter(Expression)}
+   * should be used to mark deleted files, not both.
 
 Review comment:
   Like my comment above, I don't think this paragraph is needed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #351: Extend Iceberg with a way to overwrite files for eager updates/deletes

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #351: Extend Iceberg with a way to 
overwrite files for eager updates/deletes
URL: https://github.com/apache/incubator-iceberg/pull/351#discussion_r313643529
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/OverwriteFiles.java
 ##
 @@ -50,6 +49,12 @@
* 
* Files that may contain some rows that match the expression and some rows 
that do not will
* result in a {@link ValidationException}.
+   * 
+   * This has no requirements for the latest snapshot and will not fail based 
on other snapshot
+   * changes.
+   * 
+   * Note that either this method or {@link 
OverwriteFiles#deleteFile(DataFile)}
+   * should be used to mark deleted files, not both.
 
 Review comment:
   This update isn't quite true. There are methods to determine which files to 
delete, and there are separate ways to configure validations.
   
   When this operation is used to produce an idempotent overwrite, where it can 
be completed regardless of what files are added after it starts, it doesn't 
matter that the files were matched by a filter or directly by `deleteFile`. For 
example, I might want to overwrite an old file with new values for a column 
that was added. That overwrite is idempotent, but might use `deleteFile`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #372: Allow read split behavior 
overrides
URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313640101
 
 

 ##
 File path: site/docs/configuration.md
 ##
 @@ -10,7 +10,7 @@ Iceberg tables support table properties to configure table 
behavior, like the de
 |  | -- | 
-- |
 | read.split.target-size   | 134217728 (128 MB) | Target size when 
combining input splits|
 | read.split.planning-lookback | 10 | Number of bins to 
consider when combining input splits |
-
+| read.split.open-file-cost| 4194304 (4 MB) | The estimated cost to 
open a file, used as a minimum weight when combining splits. |
 
 Review comment:
   Thanks for catching this oversight!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #372: Allow read split behavior 
overrides
URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313639861
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java
 ##
 @@ -54,22 +56,24 @@
   private final TableOperations ops;
   private final Table table;
   private final Long snapshotId;
+  private final Map options;
   private final Schema schema;
   private final Expression rowFilter;
   private final boolean caseSensitive;
   private final boolean colStats;
   private final Collection selectedColumns;
 
   protected BaseTableScan(TableOperations ops, Table table, Schema schema) {
-this(ops, table, null, schema, Expressions.alwaysTrue(), true, false, 
null);
+this(ops, table, null, null, schema, Expressions.alwaysTrue(), true, 
false, null);
   }
 
-  protected BaseTableScan(TableOperations ops, Table table, Long snapshotId, 
Schema schema,
-Expression rowFilter, boolean caseSensitive, boolean 
colStats,
+  protected BaseTableScan(TableOperations ops, Table table, Long snapshotId, 
Map options,
 
 Review comment:
   Why add options in the middle instead of at the end? Seems like you're 
changing more lines by displacing the other arguments.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #372: Allow read split behavior 
overrides
URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313639689
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java
 ##
 @@ -54,22 +56,24 @@
   private final TableOperations ops;
   private final Table table;
   private final Long snapshotId;
+  private final Map options;
   private final Schema schema;
   private final Expression rowFilter;
   private final boolean caseSensitive;
   private final boolean colStats;
   private final Collection selectedColumns;
 
   protected BaseTableScan(TableOperations ops, Table table, Schema schema) {
-this(ops, table, null, schema, Expressions.alwaysTrue(), true, false, 
null);
+this(ops, table, null, null, schema, Expressions.alwaysTrue(), true, 
false, null);
 
 Review comment:
   You can avoid null checks by always having a map instance. Using 
`ImmutableMap.of()` would be easier.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #372: Allow read split behavior 
overrides
URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313639543
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java
 ##
 @@ -123,31 +128,46 @@ public TableScan asOfTime(long timestampMillis) {
 return useSnapshot(lastSnapshotId);
   }
 
+  @Override
+  public TableScan option(String property, String value) {
+Map localOptions = Maps.newHashMap();
 
 Review comment:
   As long as this creates a new map each time, I think it makes sense to use 
an ImmutableMap:
   
   ```java
   ImmutableMap.Builder builder = ImmutableMap.builder();
   builder.putAll(options);
   builder.put(property, value);
   return newRefinedScan(..., builder.build());
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #372: Allow read split behavior 
overrides
URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313638753
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java
 ##
 @@ -176,11 +188,26 @@ public Expression filter() {
 
   @Override
   public CloseableIterable planTasks() {
-long splitSize = targetSplitSize(ops);
-int lookback = ops.current().propertyAsInt(
-TableProperties.SPLIT_LOOKBACK, 
TableProperties.SPLIT_LOOKBACK_DEFAULT);
-long openFileCost = ops.current().propertyAsLong(
-TableProperties.SPLIT_OPEN_FILE_COST, 
TableProperties.SPLIT_OPEN_FILE_COST_DEFAULT);
+long splitSize;
+if (splitOptions != null && splitOptions.getSplitSize() != null) {
+  splitSize = splitOptions.getSplitSize();
+} else {
+  splitSize = targetSplitSize(ops);
+}
+int lookback;
+if (splitOptions != null && splitOptions.getSplitLookback() != null) {
+  lookback = splitOptions.getSplitLookback();
+} else {
+  lookback = ops.current().propertyAsInt(
+  TableProperties.SPLIT_LOOKBACK, 
TableProperties.SPLIT_LOOKBACK_DEFAULT);
 
 Review comment:
   I think accepting both is a good idea, but we can add that later.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #372: Allow read split behavior 
overrides
URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313638753
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java
 ##
 @@ -176,11 +188,26 @@ public Expression filter() {
 
   @Override
   public CloseableIterable planTasks() {
-long splitSize = targetSplitSize(ops);
-int lookback = ops.current().propertyAsInt(
-TableProperties.SPLIT_LOOKBACK, 
TableProperties.SPLIT_LOOKBACK_DEFAULT);
-long openFileCost = ops.current().propertyAsLong(
-TableProperties.SPLIT_OPEN_FILE_COST, 
TableProperties.SPLIT_OPEN_FILE_COST_DEFAULT);
+long splitSize;
+if (splitOptions != null && splitOptions.getSplitSize() != null) {
+  splitSize = splitOptions.getSplitSize();
+} else {
+  splitSize = targetSplitSize(ops);
+}
+int lookback;
+if (splitOptions != null && splitOptions.getSplitLookback() != null) {
+  lookback = splitOptions.getSplitLookback();
+} else {
+  lookback = ops.current().propertyAsInt(
+  TableProperties.SPLIT_LOOKBACK, 
TableProperties.SPLIT_LOOKBACK_DEFAULT);
 
 Review comment:
   I think accepting both is a good idea


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #357: Add in and not in predicates

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #357: Add in and not in predicates
URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313633929
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/expressions/Evaluator.java
 ##
 @@ -134,18 +134,18 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
 }
 
 @Override
-public  Boolean in(BoundReference ref, Literal lit) {
-  throw new UnsupportedOperationException("In is not supported yet");
+public  Boolean in(BoundReference ref, LiteralSet literalSet) {
+  return literalSet.contains(ref.get(struct));
 }
 
 @Override
-public  Boolean notIn(BoundReference ref, Literal lit) {
-  return !in(ref, lit);
+public  Boolean notIn(BoundReference ref, LiteralSet literalSet) {
+  return !in(ref, literalSet);
 }
 
 @Override
 public  Boolean startsWith(BoundReference ref, Literal lit) {
-  return ((String) ref.get(struct)).startsWith((String) lit.value());
+  return ((String) ref.get(struct)).startsWith(lit.value().toString());
 
 Review comment:
   Why did the type used by the literal change? I don't think it needs to.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue opened a new pull request #376: Do not use original manifest list when adding filters

2019-08-13 Thread GitBox
rdblue opened a new pull request #376: Do not use original manifest list when 
adding filters
URL: https://github.com/apache/incubator-iceberg/pull/376
 
 
   This fixes a bug in `ManifestGroup` that caused all manifests to be read 
instead of correctly filtering when either `ignoreExisting` or `ignoreDeleted` 
was used because those filters were applied to the wrong variable, `manifests` 
instead of `filteredManifests`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #357: Add in and not in predicates

2019-08-13 Thread GitBox
jun-he commented on a change in pull request #357: Add in and not in predicates
URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313593900
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/expressions/Predicate.java
 ##
 @@ -19,15 +19,44 @@
 
 package org.apache.iceberg.expressions;
 
+import com.google.common.base.Preconditions;
+import com.google.common.collect.ImmutableSet;
+import java.util.Collection;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
 public abstract class Predicate implements Expression {
   private final Operation op;
   private final R ref;
   private final Literal literal;
+  private final Set> literalSet;
 
 Review comment:
   @rdblue Sorry for the confusion.
   
   In this change, only factory methods in `Expressions` and the `bind` in 
`UnboundPredicate` can change the returned predicate.
   
   What I meant is that the factory method `in` will return an `EQ` predicate 
if only a single element is provided because UnboundPredicate factory method 
for IN operation with a single item should return `EQ` predicate.
   
   As the `bind` returns `Expression`, it is straightforward to separated the 
`BoundPredicate`.
   
   But in the case of the factory method, if it also returns `Expression` or a 
new interface class instead of `UnboundPredicate`, then later when it is used, 
it has to be explicitly cast by checking the `operation`. This seems 
over-complicated compared to the current approach.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #357: Add in and not in predicates

2019-08-13 Thread GitBox
jun-he commented on a change in pull request #357: Add in and not in predicates
URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313593900
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/expressions/Predicate.java
 ##
 @@ -19,15 +19,44 @@
 
 package org.apache.iceberg.expressions;
 
+import com.google.common.base.Preconditions;
+import com.google.common.collect.ImmutableSet;
+import java.util.Collection;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
 public abstract class Predicate implements Expression {
   private final Operation op;
   private final R ref;
   private final Literal literal;
+  private final Set> literalSet;
 
 Review comment:
   @rdblue Sorry for the misunderstanding.
   
   In this change, only factory methods in `Expressions` and the `bind` in 
`UnboundPredicate` can change the returned predicate.
   
   What I meant is that the factory method `in` will return an `EQ` predicate 
if only a single element is provided because UnboundPredicate factory method 
for IN operation with a single item should return `EQ` predicate.
   
   As the `bind` returns `Expression`, it is straightforward to separated the 
`BoundPredicate`.
   
   But in the case of the factory method, if it also returns `Expression` or a 
new interface class instead of `UnboundPredicate`, then later when it is used, 
it has to be explicitly cast by checking the `operation`. This seems 
over-complicated compared to the current approach.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] xabriel commented on a change in pull request #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
xabriel commented on a change in pull request #372: Allow read split behavior 
overrides
URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313589124
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java
 ##
 @@ -176,11 +188,26 @@ public Expression filter() {
 
   @Override
   public CloseableIterable planTasks() {
-long splitSize = targetSplitSize(ops);
-int lookback = ops.current().propertyAsInt(
-TableProperties.SPLIT_LOOKBACK, 
TableProperties.SPLIT_LOOKBACK_DEFAULT);
-long openFileCost = ops.current().propertyAsLong(
-TableProperties.SPLIT_OPEN_FILE_COST, 
TableProperties.SPLIT_OPEN_FILE_COST_DEFAULT);
+long splitSize;
+if (splitOptions != null && splitOptions.getSplitSize() != null) {
+  splitSize = splitOptions.getSplitSize();
+} else {
+  splitSize = targetSplitSize(ops);
+}
+int lookback;
+if (splitOptions != null && splitOptions.getSplitLookback() != null) {
+  lookback = splitOptions.getSplitLookback();
+} else {
+  lookback = ops.current().propertyAsInt(
+  TableProperties.SPLIT_LOOKBACK, 
TableProperties.SPLIT_LOOKBACK_DEFAULT);
 
 Review comment:
   Went with short option names for the Spark `Reader`, but full property name 
when passed to the `TableScan` object. Let me know if this matches the 
precedent?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #357: Add in and not in predicates

2019-08-13 Thread GitBox
jun-he commented on a change in pull request #357: Add in and not in predicates
URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313576493
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/expressions/Evaluator.java
 ##
 @@ -134,18 +134,18 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
 }
 
 @Override
-public  Boolean in(BoundReference ref, Literal lit) {
-  throw new UnsupportedOperationException("In is not supported yet");
+public  Boolean in(BoundReference ref, LiteralSet literalSet) {
+  return literalSet.contains(ref.get(struct));
 }
 
 @Override
-public  Boolean notIn(BoundReference ref, Literal lit) {
-  return !in(ref, lit);
+public  Boolean notIn(BoundReference ref, LiteralSet literalSet) {
+  return !in(ref, literalSet);
 }
 
 @Override
 public  Boolean startsWith(BoundReference ref, Literal lit) {
-  return ((String) ref.get(struct)).startsWith((String) lit.value());
+  return ((String) ref.get(struct)).startsWith(lit.value().toString());
 
 Review comment:
   The issue is that `lit.value()` in this case will return a `CharSeqWrapper` 
instead of `String`. It cannot be cast to `String`.
   If the underlying is a `String`, `toString()` will still return it without 
extra costs.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] xabriel commented on a change in pull request #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
xabriel commented on a change in pull request #372: Allow read split behavior 
overrides
URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313525003
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java
 ##
 @@ -176,11 +188,26 @@ public Expression filter() {
 
   @Override
   public CloseableIterable planTasks() {
-long splitSize = targetSplitSize(ops);
-int lookback = ops.current().propertyAsInt(
-TableProperties.SPLIT_LOOKBACK, 
TableProperties.SPLIT_LOOKBACK_DEFAULT);
-long openFileCost = ops.current().propertyAsLong(
-TableProperties.SPLIT_OPEN_FILE_COST, 
TableProperties.SPLIT_OPEN_FILE_COST_DEFAULT);
+long splitSize;
+if (splitOptions != null && splitOptions.getSplitSize() != null) {
+  splitSize = splitOptions.getSplitSize();
+} else {
+  splitSize = targetSplitSize(ops);
+}
+int lookback;
+if (splitOptions != null && splitOptions.getSplitLookback() != null) {
+  lookback = splitOptions.getSplitLookback();
+} else {
+  lookback = ops.current().propertyAsInt(
+  TableProperties.SPLIT_LOOKBACK, 
TableProperties.SPLIT_LOOKBACK_DEFAULT);
 
 Review comment:
   Do you mean to not accept, say, `read.split.target-size` in favor of 
`split-size`, or that  we should accept both?
   
   Re updating the docs, will do.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] xabriel commented on issue #371: Fix reading of old .metadata.json.gz files

2019-08-13 Thread GitBox
xabriel commented on issue #371: Fix reading of old .metadata.json.gz files
URL: https://github.com/apache/incubator-iceberg/pull/371#issuecomment-520931166
 
 
   I think the guarantee should match what we have for table evolution: If 
Iceberg generated it, then Iceberg should be able to read it. And we should 
have the tests to prove it.
   
   Having said that, I do understand that we are still moving fast and that 
there has been no official release yet. So I would be ok if we decide to drop 
this particular backwards compatible code when we cut the first release.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #372: Allow read split behavior 
overrides
URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313520937
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java
 ##
 @@ -176,11 +188,26 @@ public Expression filter() {
 
   @Override
   public CloseableIterable planTasks() {
-long splitSize = targetSplitSize(ops);
-int lookback = ops.current().propertyAsInt(
-TableProperties.SPLIT_LOOKBACK, 
TableProperties.SPLIT_LOOKBACK_DEFAULT);
-long openFileCost = ops.current().propertyAsLong(
-TableProperties.SPLIT_OPEN_FILE_COST, 
TableProperties.SPLIT_OPEN_FILE_COST_DEFAULT);
+long splitSize;
+if (splitOptions != null && splitOptions.getSplitSize() != null) {
+  splitSize = splitOptions.getSplitSize();
+} else {
+  splitSize = targetSplitSize(ops);
+}
+int lookback;
+if (splitOptions != null && splitOptions.getSplitLookback() != null) {
+  lookback = splitOptions.getSplitLookback();
+} else {
+  lookback = ops.current().propertyAsInt(
+  TableProperties.SPLIT_LOOKBACK, 
TableProperties.SPLIT_LOOKBACK_DEFAULT);
 
 Review comment:
   Actually, "snapshot-id" is a read option, not a write option: 
http://iceberg.apache.org/configuration/#read-options
   
   Not sure how I missed that. Anyway, I think we've set a precedent that we 
should use shorter option names for interactive cases like this, instead of 
full config names.
   
   Could you also update the read options table?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on issue #365: test cases for parquetMetrics with multiple rowgroup

2019-08-13 Thread GitBox
rdblue commented on issue #365: test cases for parquetMetrics with multiple 
rowgroup
URL: https://github.com/apache/incubator-iceberg/pull/365#issuecomment-520927967
 
 
   Thanks for contributing this, @manishmalhotrawork! I merged it.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue merged pull request #365: test cases for parquetMetrics with multiple rowgroup

2019-08-13 Thread GitBox
rdblue merged pull request #365: test cases for parquetMetrics with multiple 
rowgroup
URL: https://github.com/apache/incubator-iceberg/pull/365
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #372: Allow read split behavior 
overrides
URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313517386
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/TableScan.java
 ##
 @@ -58,6 +58,15 @@
*/
   TableScan asOfTime(long timestampMillis);
 
+  /**
+   * Create a new {@link TableScan} from this scan's configuration that will 
override the {@link Table}'s read splitting
+   * behavior based on the incoming {@link SplitOptions}.
+   *
+   * @param splitOptions a {@link SplitOptions} containing the splitting 
options
+   * @return a new scan based on this with the splitting options set
+   */
+  TableScan splitOptions(SplitOptions splitOptions);
 
 Review comment:
   This would also make it easier to translate from Spark to Iceberg. You'd 
just need to add all of the options to the scan.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #372: Allow read split behavior 
overrides
URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313516781
 
 

 ##
 File path: core/src/main/java/org/apache/iceberg/BaseTableScan.java
 ##
 @@ -176,11 +188,26 @@ public Expression filter() {
 
   @Override
   public CloseableIterable planTasks() {
-long splitSize = targetSplitSize(ops);
-int lookback = ops.current().propertyAsInt(
-TableProperties.SPLIT_LOOKBACK, 
TableProperties.SPLIT_LOOKBACK_DEFAULT);
-long openFileCost = ops.current().propertyAsLong(
-TableProperties.SPLIT_OPEN_FILE_COST, 
TableProperties.SPLIT_OPEN_FILE_COST_DEFAULT);
+long splitSize;
+if (splitOptions != null && splitOptions.getSplitSize() != null) {
+  splitSize = splitOptions.getSplitSize();
+} else {
+  splitSize = targetSplitSize(ops);
+}
+int lookback;
+if (splitOptions != null && splitOptions.getSplitLookback() != null) {
+  lookback = splitOptions.getSplitLookback();
+} else {
+  lookback = ops.current().propertyAsInt(
+  TableProperties.SPLIT_LOOKBACK, 
TableProperties.SPLIT_LOOKBACK_DEFAULT);
 
 Review comment:
   On the write path, we decided to go with easier to remember option names, 
like "snapshot-id" and "write-format", to be less verbose. What do you think 
about doing that here? This could be "lookback", "split-size", and 
"file-open-cost".


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #372: Allow read split behavior 
overrides
URL: https://github.com/apache/incubator-iceberg/pull/372#discussion_r313515860
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/TableScan.java
 ##
 @@ -58,6 +58,15 @@
*/
   TableScan asOfTime(long timestampMillis);
 
+  /**
+   * Create a new {@link TableScan} from this scan's configuration that will 
override the {@link Table}'s read splitting
+   * behavior based on the incoming {@link SplitOptions}.
+   *
+   * @param splitOptions a {@link SplitOptions} containing the splitting 
options
+   * @return a new scan based on this with the splitting options set
+   */
+  TableScan splitOptions(SplitOptions splitOptions);
 
 Review comment:
   What about using more generic options that are string keys and values? 
Something like this:
   
   ```java
   public TableScan option(String property, String value);
   ```
   
   That would make this more generic so we can use options to control behavior 
other than splits, without adding a lot of methods to the scan API. What do you 
think?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on issue #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
rdblue commented on issue #372: Allow read split behavior overrides
URL: https://github.com/apache/incubator-iceberg/pull/372#issuecomment-520925050
 
 
   I restarted the tests. The Hive tests are a little flaky right now.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue merged pull request #375: Vectorized iceberg reader

2019-08-13 Thread GitBox
rdblue merged pull request #375: Vectorized iceberg reader
URL: https://github.com/apache/incubator-iceberg/pull/375
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue merged pull request #371: Fix reading of old .metadata.json.gz files

2019-08-13 Thread GitBox
rdblue merged pull request #371: Fix reading of old .metadata.json.gz files
URL: https://github.com/apache/incubator-iceberg/pull/371
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on issue #371: Fix reading of old .metadata.json.gz files

2019-08-13 Thread GitBox
rdblue commented on issue #371: Fix reading of old .metadata.json.gz files
URL: https://github.com/apache/incubator-iceberg/pull/371#issuecomment-520924428
 
 
   Thanks for fixing this, @xabriel! I didn't realize that #258 had broken the 
old extension.
   
   I'm a little concerned about needing to check for so many file extensions 
for compatibility; hopefully we can standardize this and eventually stop. Would 
be great to hear what you think the guarantee from Iceberg should be.
   
   FYI @aokolnychyi.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] manishmalhotrawork commented on a change in pull request #365: test cases for parquetMetrics with multiple rowgroup

2019-08-13 Thread GitBox
manishmalhotrawork commented on a change in pull request #365: test cases for 
parquetMetrics with multiple rowgroup
URL: https://github.com/apache/incubator-iceberg/pull/365#discussion_r313506603
 
 

 ##
 File path: core/src/test/java/org/apache/iceberg/TestMetrics.java
 ##
 @@ -270,6 +289,97 @@ public void testMetricsForNullColumns() throws 
IOException {
 assertBounds(1, IntegerType.get(), null, null, metrics);
   }
 
+  @Test
+  public void testMetricsForTopLevelWithMultipleRG() throws Exception {
 
 Review comment:
   thanks @rdblue. please let me know if with changes it looks good.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] manishmalhotrawork commented on a change in pull request #365: test cases for parquetMetrics with multiple rowgroup

2019-08-13 Thread GitBox
manishmalhotrawork commented on a change in pull request #365: test cases for 
parquetMetrics with multiple rowgroup
URL: https://github.com/apache/incubator-iceberg/pull/365#discussion_r313506252
 
 

 ##
 File path: core/src/test/java/org/apache/iceberg/TestMetrics.java
 ##
 @@ -270,6 +289,97 @@ public void testMetricsForNullColumns() throws 
IOException {
 assertBounds(1, IntegerType.get(), null, null, metrics);
   }
 
+  @Test
+  public void testMetricsForTopLevelWithMultipleRG() throws Exception {
+Schema schema = getSimpleSchema();
+int minimumRowGroupRecordCount = 100;
 
 Review comment:
   updated.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] manishmalhotrawork commented on a change in pull request #365: test cases for parquetMetrics with multiple rowgroup

2019-08-13 Thread GitBox
manishmalhotrawork commented on a change in pull request #365: test cases for 
parquetMetrics with multiple rowgroup
URL: https://github.com/apache/incubator-iceberg/pull/365#discussion_r313506183
 
 

 ##
 File path: core/src/test/java/org/apache/iceberg/TestMetrics.java
 ##
 @@ -270,6 +289,97 @@ public void testMetricsForNullColumns() throws 
IOException {
 assertBounds(1, IntegerType.get(), null, null, metrics);
   }
 
+  @Test
+  public void testMetricsForTopLevelWithMultipleRG() throws Exception {
+Schema schema = getSimpleSchema();
+int minimumRowGroupRecordCount = 100;
+List records = new ArrayList<>(201);
+
+for (int i = 0; i < 201; i++) {
+  GenericData.Record newRecord = new 
GenericData.Record(AvroSchemaUtil.convert(schema.asStruct()));
+  newRecord.put("booleanCol", i == 0 ? false : true);
+  newRecord.put("intCol", i + 1);
+  newRecord.put("longCol", i == 0 ? null : i + 1L);
+  newRecord.put("floatCol", i + 1.0F);
+  newRecord.put("doubleCol", i == 0 ? null : i + 1.0D);
+  newRecord.put("decimalCol", i == 0 ? null : new BigDecimal(i + 
"").add(new BigDecimal("1.00")));
+  newRecord.put("stringCol", "AAA");
+  newRecord.put("dateCol", i + 1);
+  newRecord.put("timeCol", i + 1L);
+  newRecord.put("timestampCol", i + 1L);
+  newRecord.put("uuidCol", uuid);
+  newRecord.put("fixedCol", fixed);
+  newRecord.put("binaryCol", "S".getBytes());
+  records.add(newRecord);
+}
+
+// create parquet file with multiple row groups. by using smaller number 
of bytes
+File parquetFile = writeRecords(
+schema,
+ImmutableMap.of(PARQUET_ROW_GROUP_SIZE_BYTES, 
Integer.toString(ROW_GROUP_SIZE)),
+records.toArray(new GenericData.Record[] {}));
+
+Assert.assertNotNull(parquetFile);
+// rowgroup size should be > 1
+Assert.assertEquals(3, splitCount(parquetFile));
+
+Metrics metrics = getMetrics(localInput(parquetFile));
+Assert.assertEquals(201L, (long) metrics.recordCount());
+assertCounts(1, 201L, 0L, metrics);
+assertBounds(1, Types.BooleanType.get(), false, true, metrics);
+assertBounds(2, Types.IntegerType.get(), 1, 201, metrics);
+assertCounts(3, 201L, 1L, metrics);
+assertBounds(3, Types.LongType.get(), 2L, 201L, metrics);
+assertCounts(4, 201L, 0L, metrics);
+assertBounds(4, Types.FloatType.get(), 1.0F, 201.0F, metrics);
+assertCounts(5, 201L, 1L, metrics);
+assertBounds(5, Types.DoubleType.get(), 2.0D, 201.0D, metrics);
+assertCounts(6, 201L, 1L, metrics);
+assertBounds(6, Types.DecimalType.of(10, 2), new BigDecimal("2.00"),
+new BigDecimal("201.00"), metrics);
+  }
+
+  @Test
+  public void testMetricsForNestedStructFieldsWithMultipleRG() throws 
IOException {
+int minimumRowGroupRecordCount = 101;
 
 Review comment:
   updated.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] xabriel commented on issue #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
xabriel commented on issue #372: Allow read split behavior overrides
URL: https://github.com/apache/incubator-iceberg/pull/372#issuecomment-520912826
 
 
   Not sure how to retest. Failed tests do look unrelated, caused by a 
`java.net.SocketException`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #357: Add in and not in predicates

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #357: Add in and not in predicates
URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313499455
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/expressions/Evaluator.java
 ##
 @@ -134,18 +134,18 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
 }
 
 @Override
-public  Boolean in(BoundReference ref, Literal lit) {
-  throw new UnsupportedOperationException("In is not supported yet");
+public  Boolean in(BoundReference ref, LiteralSet literalSet) {
+  return literalSet.contains(ref.get(struct));
 }
 
 @Override
-public  Boolean notIn(BoundReference ref, Literal lit) {
-  return !in(ref, lit);
+public  Boolean notIn(BoundReference ref, LiteralSet literalSet) {
+  return !in(ref, literalSet);
 }
 
 @Override
 public  Boolean startsWith(BoundReference ref, Literal lit) {
-  return ((String) ref.get(struct)).startsWith((String) lit.value());
+  return ((String) ref.get(struct)).startsWith(lit.value().toString());
 
 Review comment:
   This isn't a related change, and the literal is guaranteed to be a string at 
this point. I don't think this should be included in this PR.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #357: Add in and not in predicates

2019-08-13 Thread GitBox
rdblue commented on a change in pull request #357: Add in and not in predicates
URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313498614
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/expressions/Predicate.java
 ##
 @@ -19,15 +19,44 @@
 
 package org.apache.iceberg.expressions;
 
+import com.google.common.base.Preconditions;
+import com.google.common.collect.ImmutableSet;
+import java.util.Collection;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
 public abstract class Predicate implements Expression {
   private final Operation op;
   private final R ref;
   private final Literal literal;
+  private final Set> literalSet;
 
 Review comment:
   > UnboundPredicate for IN predicate with a single item should return an EQ 
predicate. The separation will complicate it.
   
   I disagree because that over-complicates the predicate classes. The factory 
methods in `Expressions` are allowed to change the predicate that is returned; 
for example, `and(true, p)` returns just `p`. That's what we should use to 
return an equality predicate. The predicate class itself should not include 
logic for changing itself.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] anjalinorwood opened a new pull request #375: Vectorized iceberg reader

2019-08-13 Thread GitBox
anjalinorwood opened a new pull request #375: Vectorized iceberg reader
URL: https://github.com/apache/incubator-iceberg/pull/375
 
 
   This commit:
   + implements batched version of iterators (BatchedPageReader,
 ColumnarBatchReader etc) as the value-based iterator abstractions
 were taking up too much CPU.
   + moves the reads close to page level by consulting the definition
 levels and directly reading from values vector.
   
   Co-Authored-By: Samarth Jain 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] xabriel commented on issue #372: Allow read split behavior overrides

2019-08-13 Thread GitBox
xabriel commented on issue #372: Allow read split behavior overrides
URL: https://github.com/apache/incubator-iceberg/pull/372#issuecomment-520893129
 
 
   retest


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #357: Add in and not in predicates

2019-08-13 Thread GitBox
jun-he commented on a change in pull request #357: Add in and not in predicates
URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313463995
 
 

 ##
 File path: 
api/src/main/java/org/apache/iceberg/expressions/ExpressionVisitors.java
 ##
 @@ -89,11 +91,11 @@ public R or(R leftResult, R rightResult) {
   return null;
 }
 
-public  R in(BoundReference ref, Literal lit) {
+public  R in(BoundReference ref, Set> lits) {
   return null;
 
 Review comment:
    


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] chenjunjiedada opened a new pull request #374: Migrate spark table to iceberg table

2019-08-13 Thread GitBox
chenjunjiedada opened a new pull request #374: Migrate spark table to iceberg 
table
URL: https://github.com/apache/incubator-iceberg/pull/374
 
 
   This PR converts a spark table to iceberg table. It assumes no parallel 
operation is going on the original table and target table and thus is not 
thread-safe.
   
   PR is still working in progress. Unit tests will be added when APIs is 
settled down.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] chenjunjiedada closed pull request #373: Migrate spark table to iceberg table

2019-08-13 Thread GitBox
chenjunjiedada closed pull request #373: Migrate spark table to iceberg table
URL: https://github.com/apache/incubator-iceberg/pull/373
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] chenjunjiedada opened a new pull request #373: Migrate spark table to iceberg table

2019-08-13 Thread GitBox
chenjunjiedada opened a new pull request #373: Migrate spark table to iceberg 
table
URL: https://github.com/apache/incubator-iceberg/pull/373
 
 
   This PR converts a spark table to iceberg table. It assumes no parallel 
operation is going on the original table and target table and thus is not 
thread-safe. 
   
   PR is still working in progress. Unit tests will be added when APIs is 
settled down.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] jun-he commented on a change in pull request #357: Add in and not in predicates

2019-08-13 Thread GitBox
jun-he commented on a change in pull request #357: Add in and not in predicates
URL: https://github.com/apache/incubator-iceberg/pull/357#discussion_r313259192
 
 

 ##
 File path: api/src/main/java/org/apache/iceberg/expressions/Evaluator.java
 ##
 @@ -134,13 +135,13 @@ public Boolean or(Boolean leftResult, Boolean 
rightResult) {
 }
 
 @Override
-public  Boolean in(BoundReference ref, Literal lit) {
-  throw new UnsupportedOperationException("In is not supported yet");
+public  Boolean in(BoundReference ref, Set> lits) {
+  return lits.contains(Literals.from(ref.get(struct)));
 
 Review comment:
   Oops, I missed your comment. 
   Yep, you are right. I found this issue and came up with a similar solution.
   Thanks.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] manishmalhotrawork commented on a change in pull request #365: test cases for parquetMetrics with multiple rowgroup

2019-08-13 Thread GitBox
manishmalhotrawork commented on a change in pull request #365: test cases for 
parquetMetrics with multiple rowgroup
URL: https://github.com/apache/incubator-iceberg/pull/365#discussion_r313238208
 
 

 ##
 File path: core/src/test/java/org/apache/iceberg/TestMetrics.java
 ##
 @@ -63,32 +69,39 @@
  */
 public abstract class TestMetrics {
 
+  public static final int ROW_GROUP_SIZE = 1600;
   private final UUID uuid = UUID.randomUUID();
   private final GenericFixed fixed = new GenericData.Fixed(
   org.apache.avro.Schema.createFixed("fixedCol", null, null, 4),
   "abcd".getBytes(StandardCharsets.UTF_8));
 
+  private static StructType leafStructType = StructType.of(
 
 Review comment:
   updated.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] manishmalhotrawork commented on a change in pull request #365: test cases for parquetMetrics with multiple rowgroup

2019-08-13 Thread GitBox
manishmalhotrawork commented on a change in pull request #365: test cases for 
parquetMetrics with multiple rowgroup
URL: https://github.com/apache/incubator-iceberg/pull/365#discussion_r313238331
 
 

 ##
 File path: core/src/test/java/org/apache/iceberg/TestMetrics.java
 ##
 @@ -153,6 +166,24 @@ public void testMetricsForTopLevelFields() throws 
IOException {
 ByteBuffer.wrap("S".getBytes()), ByteBuffer.wrap("W".getBytes()), 
metrics);
   }
 
+  private Schema getSimpleSchema() {
 
 Review comment:
   sure, done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[GitHub] [incubator-iceberg] manishmalhotrawork commented on a change in pull request #365: test cases for parquetMetrics with multiple rowgroup

2019-08-13 Thread GitBox
manishmalhotrawork commented on a change in pull request #365: test cases for 
parquetMetrics with multiple rowgroup
URL: https://github.com/apache/incubator-iceberg/pull/365#discussion_r313238232
 
 

 ##
 File path: core/src/test/java/org/apache/iceberg/TestMetrics.java
 ##
 @@ -270,6 +289,97 @@ public void testMetricsForNullColumns() throws 
IOException {
 assertBounds(1, IntegerType.get(), null, null, metrics);
   }
 
+  @Test
+  public void testMetricsForTopLevelWithMultipleRG() throws Exception {
 
 Review comment:
   updated.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org