[GitHub] [incubator-iceberg] rdsr opened a new pull request #220: Support parsing of special characters in TypeToSchema visitor
rdsr opened a new pull request #220: Support parsing of special characters in TypeToSchema visitor URL: https://github.com/apache/incubator-iceberg/pull/220 This addresses #216 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] edgarRd commented on issue #213: Incomplete Iceberg to ORC column mapping
edgarRd commented on issue #213: Incomplete Iceberg to ORC column mapping URL: https://github.com/apache/incubator-iceberg/issues/213#issuecomment-502318792 I have a preliminary fix for this issue in branch: https://github.com/apache/incubator-iceberg/compare/master...edgarRd:orc-column-map-fix This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on a change in pull request #219: Fix concurrent access to lazy PartitionSpec methods
rdblue commented on a change in pull request #219: Fix concurrent access to lazy PartitionSpec methods URL: https://github.com/apache/incubator-iceberg/pull/219#discussion_r294026110 ## File path: api/src/main/java/org/apache/iceberg/PartitionSpec.java ## @@ -118,12 +118,18 @@ public int specId() { public Class[] javaClasses() { if (lazyJavaClasses == null) { - this.lazyJavaClasses = new Class[fields.length]; - for (int i = 0; i < fields.length; i += 1) { -PartitionField field = fields[i]; -Type sourceType = schema.findType(field.sourceId()); -Type result = field.transform().getResultType(sourceType); -lazyJavaClasses[i] = result.typeId().javaClass(); + synchronized (this) { +if (lazyJavaClasses == null) { Review comment: You're right. It's still safe, but two threads may initialize the variable. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #219: Fix concurrent access to lazy PartitionSpec methods
rdsr commented on a change in pull request #219: Fix concurrent access to lazy PartitionSpec methods URL: https://github.com/apache/incubator-iceberg/pull/219#discussion_r294024626 ## File path: api/src/main/java/org/apache/iceberg/PartitionSpec.java ## @@ -118,12 +118,18 @@ public int specId() { public Class[] javaClasses() { if (lazyJavaClasses == null) { - this.lazyJavaClasses = new Class[fields.length]; - for (int i = 0; i < fields.length; i += 1) { -PartitionField field = fields[i]; -Type sourceType = schema.findType(field.sourceId()); -Type result = field.transform().getResultType(sourceType); -lazyJavaClasses[i] = result.typeId().javaClass(); + synchronized (this) { +if (lazyJavaClasses == null) { Review comment: I believe for the double-check idiom to reliably work these fields need to be `volatile` http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue opened a new pull request #219: Fix concurrent access to lazy PartitionSpec methods
rdblue opened a new pull request #219: Fix concurrent access to lazy PartitionSpec methods URL: https://github.com/apache/incubator-iceberg/pull/219 This was causing `NullPointerException` in `PartitionData.get`. When two Spark tasks started at the same time on an executor, both would load the same partition spec instance because of the spec parser's cache. One thread would create `lazyJavaClasses` and another would access it before it was fully initialized, causing Spark's converter to partition to cache null classes for some partitions, leading to the NPE from `PartitionData.get`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue opened a new pull request #218: Fix uncommitted file clean-up in transactions
rdblue opened a new pull request #218: Fix uncommitted file clean-up in transactions URL: https://github.com/apache/incubator-iceberg/pull/218 This adds a callback for deleting files to `SnapshotProducer` and uses that callback in `Transaction` to track deletes instead of running them. When a transaction is committed, the last set of deletes are run so that the actual deletes are for the last commit of each operation in the transaction. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdsr edited a comment on issue #216: PartitionData cannot handle nested field names in PartitionSpec schema
rdsr edited a comment on issue #216: PartitionData cannot handle nested field names in PartitionSpec schema URL: https://github.com/apache/incubator-iceberg/issues/216#issuecomment-502308596 The special char can only be DOT though right? Also it seems like you are suggesting that we should handle the sanitization in TypeToSchema visitor. That makes sense to me. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on issue #216: PartitionData cannot handle nested field names in PartitionSpec schema
rdblue commented on issue #216: PartitionData cannot handle nested field names in PartitionSpec schema URL: https://github.com/apache/incubator-iceberg/issues/216#issuecomment-502309454 I think it could be anything since we allow users to pass strings into `UpdateSchema`. We should probably sanitize any character that isn't allowed by Avro, assuming that they might creep in. By the way, because we project based on ID and not by name, the names used in Avro files don't really matter. So this should be safe to change. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdsr edited a comment on issue #216: PartitionData cannot handle nested field names in PartitionSpec schema
rdsr edited a comment on issue #216: PartitionData cannot handle nested field names in PartitionSpec schema URL: https://github.com/apache/incubator-iceberg/issues/216#issuecomment-502308596 The special char can only be DOT though right? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdsr commented on issue #216: PartitionData cannot handle nested field names in PartitionSpec schema
rdsr commented on issue #216: PartitionData cannot handle nested field names in PartitionSpec schema URL: https://github.com/apache/incubator-iceberg/issues/216#issuecomment-502308596 Today the special char can only be DOT though right? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdsr edited a comment on issue #216: PartitionData cannot handle nested field names in PartitionSpec schema
rdsr edited a comment on issue #216: PartitionData cannot handle nested field names in PartitionSpec schema URL: https://github.com/apache/incubator-iceberg/issues/216#issuecomment-502306413 @rdblue , as per the spec > The source column, selected by id, must be a primitive type and cannot be contained in a map or list, but may be nested in a struct So we can possibly have 2 ways to address this. 1. Automatically replace the dot with underscore 2. Allow user to specify the target column name in the partition field This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdsr edited a comment on issue #216: PartitionData cannot handle nested field names in PartitionSpec schema
rdsr edited a comment on issue #216: PartitionData cannot handle nested field names in PartitionSpec schema URL: https://github.com/apache/incubator-iceberg/issues/216#issuecomment-502306413 @rdblue , as per the spec > The source column, selected by id, must be a primitive type and cannot be contained in a map or list, but may be nested in a struct So we can possibly have 2 ways to address this. 1. Automatically replace the dot with underscore 2. Allow user to specify the target column name in the partition field This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on issue #216: PartitionData cannot handle nested field names in PartitionSpec schema
rdblue commented on issue #216: PartitionData cannot handle nested field names in PartitionSpec schema URL: https://github.com/apache/incubator-iceberg/issues/216#issuecomment-502307787 Iceberg supports special characters in names, looks like the failure is in Avro. I think Iceberg should sanitize special characters from the names when creating an Avro schema. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdsr opened a new issue #217: IcebergRuntime should not bundle hadoop-hdfs
rdsr opened a new issue #217: IcebergRuntime should not bundle hadoop-hdfs URL: https://github.com/apache/incubator-iceberg/issues/217 The dep chain is iceberg-runtime -> iceberg-spark -> iceberg-orc -> orc -> hadoop-hdfs ORC is bringing in an old version (2.2) of hadoop-hdfs This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdsr commented on issue #216: PartitionData cannot handle nested field names in PartitionSpec schema
rdsr commented on issue #216: PartitionData cannot handle nested field names in PartitionSpec schema URL: https://github.com/apache/incubator-iceberg/issues/216#issuecomment-502306413 @rdblue , as per the spec > The source column, selected by id, must be a primitive type and cannot be contained in a map or list, but may be nested in a struct So we can possibly have 2 ways to address this. 1. Automatically replace the dot with underscore 2. Allow user to specify the target column name in the partition field This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdsr opened a new issue #216: `PartitionData` cannot handle nested field names in `PartitionSpec` schema
rdsr opened a new issue #216: `PartitionData` cannot handle nested field names in `PartitionSpec` schema URL: https://github.com/apache/incubator-iceberg/issues/216 The following test will throw an exception when parsing the partition's Iceberg schema to Avro schema Test: https://github.com/rdsr/incubator-iceberg/commit/95b95b0eed3b7fa074a65422aa26ab70c7e22444 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on issue #215: Remove uncessary precondition checks
rdblue commented on issue #215: Remove uncessary precondition checks URL: https://github.com/apache/incubator-iceberg/pull/215#issuecomment-502303647 Here's the problem: ``` [ant:checkstyle] [ERROR] /home/travis/build/apache/incubator-iceberg/core/src/main/java/org/apache/iceberg/avro/AvroFileAppender.java:22:8: Unused import - com.google.common.base.Preconditions. [UnusedImports] ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] danielcweeks commented on issue #215: Remove uncessary precondition checks
danielcweeks commented on issue #215: Remove uncessary precondition checks URL: https://github.com/apache/incubator-iceberg/pull/215#issuecomment-502251379 > I think this also needs to update the `length` javadoc because the requirement that the appender has been closed is no longer needed. Good spot. I've updated the javadoc. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdblue commented on issue #215: Remove uncessary precondition checks
rdblue commented on issue #215: Remove uncessary precondition checks URL: https://github.com/apache/incubator-iceberg/pull/215#issuecomment-502230983 I think this also needs to update the `length` javadoc because the requirement that the appender has been closed is no longer needed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] danielcweeks opened a new pull request #215: Remove uncessary precondition checks
danielcweeks opened a new pull request #215: Remove uncessary precondition checks URL: https://github.com/apache/incubator-iceberg/pull/215 These checks unnecessarily prevent checking how much data has been written. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] edgarRd commented on a change in pull request #199: ORC metrics
edgarRd commented on a change in pull request #199: ORC metrics URL: https://github.com/apache/incubator-iceberg/pull/199#discussion_r293859195 ## File path: orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java ## @@ -45,16 +61,82 @@ public static Metrics fromInputFile(InputFile file, Configuration config) { try { final Reader orcReader = OrcFile.createReader(new Path(file.location()), OrcFile.readerOptions(config)); + final Schema schema = TypeConversion.fromOrc(orcReader.getSchema()); + + ColumnStatistics[] colStats = orcReader.getStatistics(); + Map columSizes = Maps.newHashMapWithExpectedSize(colStats.length); + Map valueCounts = Maps.newHashMapWithExpectedSize(colStats.length); + Map lowerBounds = Maps.newHashMap(); + Map upperBounds = Maps.newHashMap(); + + for(Types.NestedField col : schema.columns()) { +final int i = col.fieldId(); +columSizes.put(i, colStats[i].getBytesOnDisk()); +valueCounts.put(i, colStats[i].getNumberOfValues()); + +Optional orcMin = fromOrcMin(col, colStats[i]); +orcMin.ifPresent(byteBuffer -> lowerBounds.put(i, byteBuffer)); +Optional orcMax = fromOrcMax(col, colStats[i]); +orcMax.ifPresent(byteBuffer -> upperBounds.put(i, byteBuffer)); + } - // TODO: implement rest of the methods for ORC metrics return new Metrics(orcReader.getNumberOfRows(), - null, - null, + columSizes, + valueCounts, Collections.emptyMap(), - null, - null); + lowerBounds, + upperBounds); } catch (IOException ioe) { throw new RuntimeIOException(ioe, "Failed to read footer of file: %s", file); } } + + private static Optional fromOrcMin(Types.NestedField column, + ColumnStatistics columnStats) { +ByteBuffer min = null; +if (columnStats instanceof IntegerColumnStatistics) { + IntegerColumnStatistics intColStats = (IntegerColumnStatistics) columnStats; + if (column.type().typeId() == Type.TypeID.INTEGER) { +min = toByteBuffer(column.type(), (int) intColStats.getMinimum()); + } else { +min = toByteBuffer(column.type(), intColStats.getMinimum()); + } +} else if (columnStats instanceof DoubleColumnStatistics) { + min = toByteBuffer(column.type(), ((DoubleColumnStatistics) columnStats).getMinimum()); +} else if (columnStats instanceof StringColumnStatistics) { + min = toByteBuffer(column.type(), ((StringColumnStatistics) columnStats).getMinimum()); Review comment: Yes, I’ll push an update for this in a bit. Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org
[GitHub] [incubator-iceberg] rdsr commented on a change in pull request #199: ORC metrics
rdsr commented on a change in pull request #199: ORC metrics URL: https://github.com/apache/incubator-iceberg/pull/199#discussion_r293854867 ## File path: orc/src/main/java/org/apache/iceberg/orc/OrcMetrics.java ## @@ -45,16 +61,82 @@ public static Metrics fromInputFile(InputFile file, Configuration config) { try { final Reader orcReader = OrcFile.createReader(new Path(file.location()), OrcFile.readerOptions(config)); + final Schema schema = TypeConversion.fromOrc(orcReader.getSchema()); + + ColumnStatistics[] colStats = orcReader.getStatistics(); + Map columSizes = Maps.newHashMapWithExpectedSize(colStats.length); + Map valueCounts = Maps.newHashMapWithExpectedSize(colStats.length); + Map lowerBounds = Maps.newHashMap(); + Map upperBounds = Maps.newHashMap(); + + for(Types.NestedField col : schema.columns()) { +final int i = col.fieldId(); +columSizes.put(i, colStats[i].getBytesOnDisk()); +valueCounts.put(i, colStats[i].getNumberOfValues()); + +Optional orcMin = fromOrcMin(col, colStats[i]); +orcMin.ifPresent(byteBuffer -> lowerBounds.put(i, byteBuffer)); +Optional orcMax = fromOrcMax(col, colStats[i]); +orcMax.ifPresent(byteBuffer -> upperBounds.put(i, byteBuffer)); + } - // TODO: implement rest of the methods for ORC metrics return new Metrics(orcReader.getNumberOfRows(), - null, - null, + columSizes, + valueCounts, Collections.emptyMap(), - null, - null); + lowerBounds, + upperBounds); } catch (IOException ioe) { throw new RuntimeIOException(ioe, "Failed to read footer of file: %s", file); } } + + private static Optional fromOrcMin(Types.NestedField column, + ColumnStatistics columnStats) { +ByteBuffer min = null; +if (columnStats instanceof IntegerColumnStatistics) { + IntegerColumnStatistics intColStats = (IntegerColumnStatistics) columnStats; + if (column.type().typeId() == Type.TypeID.INTEGER) { +min = toByteBuffer(column.type(), (int) intColStats.getMinimum()); + } else { +min = toByteBuffer(column.type(), intColStats.getMinimum()); + } +} else if (columnStats instanceof DoubleColumnStatistics) { + min = toByteBuffer(column.type(), ((DoubleColumnStatistics) columnStats).getMinimum()); +} else if (columnStats instanceof StringColumnStatistics) { + min = toByteBuffer(column.type(), ((StringColumnStatistics) columnStats).getMinimum()); Review comment: Seems like DecimalColumnStatistics, DateColumnStatistics and TimestampColumnStatistics all can return null for min/max This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org