[jira] [Commented] (PARQUET-1926) Add LogicalType support to ThriftType.I64Type
[ https://issues.apache.org/jira/browse/PARQUET-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217010#comment-17217010 ]

ASF GitHub Bot commented on PARQUET-1926:
-----------------------------------------

jmartone opened a new pull request #832:
URL: https://github.com/apache/parquet-mr/pull/832

Make sure you have checked _all_ steps below.

### Jira

- [ ] My PR addresses the following [Parquet Jira](https://issues.apache.org/jira/browse/PARQUET-1926/) issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
  - https://issues.apache.org/jira/browse/PARQUET-1926
  - In case you are adding a dependency, check if the license complies with the [ASF 3rd Party License Policy](https://www.apache.org/legal/resolved.html#category-x).

### Tests

- [ ] My PR adds the following unit tests __OR__ does not need testing for this extremely good reason:

### Commits

- [ ] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "[How to write a good git commit message](http://chris.beams.io/posts/git-commit/)":
  1. Subject is separated from body by a blank line
  2. Subject is limited to 50 characters (not including Jira issue reference)
  3. Subject does not end with a period
  4. Subject uses the imperative mood ("add", not "adding")
  5. Body wraps at 72 characters
  6. Body explains "what" and "why", not "how"

### Documentation

- [ ] In case of new functionality, my PR adds documentation that describes how to use it.
  - All the public functions and the classes in the PR contain Javadoc that explain what they do.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Add LogicalType support to ThriftType.I64Type
> ---------------------------------------------
>
> Key: PARQUET-1926
> URL: https://issues.apache.org/jira/browse/PARQUET-1926
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-thrift
> Reporter: Joshua Martone
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Adds a LogicalTypeAnnotation to the I64Type.
> This allows you to serialize timestamps and times.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [parquet-mr] jmartone opened a new pull request #832: PARQUET-1926: Add LogicalType support to I64Type
jmartone opened a new pull request #832:
URL: https://github.com/apache/parquet-mr/pull/832
[GitHub] [parquet-mr] jmartone closed pull request #830: [Parquet-1926] Add LogicalType support to ThriftType.I64Type
jmartone closed pull request #830:
URL: https://github.com/apache/parquet-mr/pull/830
[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped
[ https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216849#comment-17216849 ]

Xinli Shang commented on PARQUET-1927:
--------------------------------------

[~gszadovszky], the Iceberg Parquet reader's iterator relies on the check 'valuesRead < totalValues'. When integrating ColumnIndex, we replace readNextRowGroup() with readNextFilteredRowGroup(). Because readNextFilteredRowGroup() will skip some records, we change the check to 'valuesRead + skippedValues < totalValues', where skippedValues is calculated as 'blockRowCount - count_returned_from_readNextFilteredRowGroup'. This works great. But when the whole row group is skipped, readNextFilteredRowGroup() advances to the next row group internally without Iceberg's knowledge, so Iceberg doesn't know how to calculate skippedValues. If readNextFilteredRowGroup() could return how many records it skipped, or report the index of the row group the returned pages came from, Iceberg could calculate skippedValues.

> ColumnIndex should provide number of records skipped
> ----------------------------------------------------
>
> Key: PARQUET-1927
> URL: https://issues.apache.org/jira/browse/PARQUET-1927
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Affects Versions: 1.11.0
> Reporter: Xinli Shang
> Priority: Major
> Fix For: 1.12.0
>
> When integrating Parquet ColumnIndex, I found we need to know from Parquet how many records were skipped due to ColumnIndex filtering. When rowCount is 0, readNextFilteredRowGroup() just advances to the next row group without telling the caller. See the code here:
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969
>
> Iceberg reads Parquet records with an iterator. Its hasNext() performs the following check:
> valuesRead + skippedValues < totalValues
> See https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115
>
> So without knowing the skipped values, it is hard to determine hasNext().
>
> Currently, we can work around this with a flag: when readNextFilteredRowGroup() returns null, we consider the whole file done and hasNext() just returns false.
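The bookkeeping described in this issue can be sketched in plain Java. This is a minimal illustration with hypothetical names, not Iceberg's actual reader code:

```java
/** Minimal sketch of an iterator over column-index-filtered row groups
 *  (hypothetical names, not Iceberg's actual reader). */
class FilteredRecordIterator {
    private final long totalValues;   // total record count from the file footer
    private long valuesRead = 0;      // records actually returned to the caller
    private long skippedValues = 0;   // records dropped by column-index filtering

    FilteredRecordIterator(long totalValues) {
        this.totalValues = totalValues;
    }

    /** Called after reading a filtered row group: blockRowCount is the row
     *  group's unfiltered row count, returnedRowCount is what the filter
     *  let through. */
    void onRowGroupRead(long blockRowCount, long returnedRowCount) {
        skippedValues += blockRowCount - returnedRowCount;
    }

    void onRecordRead() {
        valuesRead++;
    }

    /** The check from the issue: every record is either read or skipped. */
    boolean hasNext() {
        return valuesRead + skippedValues < totalValues;
    }
}
```

The failure mode the comment describes is that when a row group is skipped entirely inside readNextFilteredRowGroup(), the caller never gets a chance to call something like onRowGroupRead() for it, so skippedValues undercounts and hasNext() stays true past the end of the file.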
[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped
[ https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216534#comment-17216534 ]

Gabor Szadovszky commented on PARQUET-1927:
-------------------------------------------

[~shangxinli], I am not sure I get the problem. If rowCount is 0 after column-index filtering, we just skip the whole row group, similarly to the row-group-level filters (dictionary/statistics or bloom). You don't know the number of rows skipped in the case of row-group-level filters either.
[jira] [Commented] (PARQUET-1928) Interpret Parquet INT96 type as FIXED[12] AVRO Schema
[ https://issues.apache.org/jira/browse/PARQUET-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216502#comment-17216502 ]

ASF GitHub Bot commented on PARQUET-1928:
-----------------------------------------

anantdamle opened a new pull request #831:
URL: https://github.com/apache/parquet-mr/pull/831

Make sure you have checked _all_ steps below.

Reading Parquet files in Apache Beam using ParquetIO uses `AvroParquetReader`, causing it to throw `IllegalArgumentException("INT96 not implemented and is deprecated")`. Customers have large datasets which can't be reprocessed to convert them into a supported type. An easier approach is to convert the value into a byte array of 12 bytes, which the developer can then interpret in any way they want.

This patch interprets the Parquet INT96 type as a 12-byte array; the developer/user can then handle it appropriately, interpreting it as a timestamp or simply as bytes.

- [x] My PR adds the following unit tests: `testParquetInt96AsFixed12AvroType`

> Interpret Parquet INT96 type as FIXED[12] AVRO Schema
> -----------------------------------------------------
>
> Key: PARQUET-1928
> URL: https://issues.apache.org/jira/browse/PARQUET-1928
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro
> Reporter: Anant Damle
> Priority: Minor
> Labels: patch
>
> Reading Parquet files in Apache Beam using ParquetIO uses `AvroParquetReader`, causing it to throw `IllegalArgumentException("INT96 not implemented and is deprecated")`.
> Customers have large datasets which can't be reprocessed to convert them into a supported type. An easier approach is to convert the value into a byte array of 12 bytes, which the developer can then interpret in any way they want.
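For context, a consumer that receives the proposed FIXED[12] bytes could interpret them as a timestamp along these lines. This is a sketch assuming the common Hive/Spark INT96 layout (8 little-endian bytes of nanoseconds-of-day followed by a 4-byte little-endian Julian day number); the class name is hypothetical and not part of parquet-avro:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

/** Hypothetical helper, not parquet-avro API: decodes a 12-byte INT96
 *  timestamp value using the layout Hive and Spark write. */
class Int96 {
    // Julian day number of the Unix epoch, 1970-01-01
    private static final long UNIX_EPOCH_JULIAN_DAY = 2_440_588L;

    static Instant toInstant(byte[] fixed12) {
        ByteBuffer buf = ByteBuffer.wrap(fixed12).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong(); // bytes 0-7: nanoseconds within the day
        int julianDay = buf.getInt();    // bytes 8-11: Julian day number
        long epochSecond = (julianDay - UNIX_EPOCH_JULIAN_DAY) * 86_400L
                + nanosOfDay / 1_000_000_000L;
        return Instant.ofEpochSecond(epochSecond, nanosOfDay % 1_000_000_000L);
    }
}
```

Keeping the Avro-side type an opaque 12-byte fixed, as the PR proposes, leaves this interpretation step entirely to the application, which matters because nothing in the Parquet file itself records which convention the writer used.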
[GitHub] [parquet-mr] anantdamle opened a new pull request #831: PARQUET-1928: Interpret Parquet INT96 type as FIXED[12] AVRO Schema
anantdamle opened a new pull request #831:
URL: https://github.com/apache/parquet-mr/pull/831
[jira] [Created] (PARQUET-1928) Interpret Parquet INT96 type as FIXED[12] AVRO Schema
Anant Damle created PARQUET-1928:
---------------------------------

Summary: Interpret Parquet INT96 type as FIXED[12] AVRO Schema
Key: PARQUET-1928
URL: https://issues.apache.org/jira/browse/PARQUET-1928
Project: Parquet
Issue Type: Bug
Components: parquet-avro
Reporter: Anant Damle
[jira] [Commented] (PARQUET-1883) int96 support in parquet-avro
[ https://issues.apache.org/jira/browse/PARQUET-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216489#comment-17216489 ]

ASF GitHub Bot commented on PARQUET-1883:
-----------------------------------------

anantdamle closed pull request #821:
URL: https://github.com/apache/parquet-mr/pull/821

> int96 support in parquet-avro
> -----------------------------
>
> Key: PARQUET-1883
> URL: https://issues.apache.org/jira/browse/PARQUET-1883
> Project: Parquet
> Issue Type: Bug
> Components: parquet-avro
> Affects Versions: 1.10.1
> Reporter: satish
> Priority: Major
>
> Hi,
> It looks like 'timestamp' is being converted to the 'int64' primitive type in parquet-avro. This is incompatible with Hive 2. Hive throws the error below:
> {code:java}
> Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.hive.serde2.io.TimestampWritable (state=,code=0)
> {code}
> What does it take to write the timestamp field as 'int96'? Hive seems to write timestamp fields as int96. See the example below:
> {code:java}
> $ hadoop jar parquet-tools-1.9.0.jar meta hdfs://timestamp_test/00_0
> creator: parquet-mr version 1.10.6 (build 098c6199a821edd3d6af56b962fd0f1558af849b)
> file schema: hive_schema
> ts: OPTIONAL INT96 R:0 D:1
> row group 1: RC:4 TS:88 OFFSET:4
> ts: INT96 UNCOMPRESSED DO:0 FPO:4 SZ:88/88/1.00 VC:4 ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
> {code}
> Writing a Spark dataframe into Parquet format (without using Avro) also uses int96:
> {code:java}
> scala> testDS.printSchema()
> root
> |-- ts: timestamp (nullable = true)
> scala> testDS.write.mode(Overwrite).save("/tmp/x");
> $ parquet-tools meta /tmp/x/part-0-99720ebd-0aea-45ac-9b8c-0eb7ad6f4e3c-c000.gz.parquet
> file: file:/tmp/x/part-0-99720ebd-0aea-45ac-9b8c-0eb7ad6f4e3c-c000.gz.parquet
> creator: parquet-mr version 1.10.1 (build a89df8f9932b6ef6633d06069e50c9b7970bebd1)
> extra: org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}
> file schema: spark_schema
> ts: OPTIONAL INT96 R:0 D:1
> row group 1: RC:4 TS:93 OFFSET:4
> ts: INT96 GZIP DO:0 FPO:4 SZ:130/93/0.72 VC:4 ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED ST:[no stats for this column]
> {code}
> I saw some explanation for deprecating int96 [support here|https://issues.apache.org/jira/browse/PARQUET-1870?focusedCommentId=17127963=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17127963] from [~gszadovszky]. But given that Hive and serialization in other Parquet modules (non-avro) support int96, I'm trying to understand the reasoning for not implementing it in parquet-avro.
>
> A bit more context: we are trying to migrate some of our data to the [hudi format|https://hudi.apache.org/]. Hudi adds a lot of efficiency for our use cases. But when we write data using Hudi, Hudi uses parquet-avro and timestamps are converted to int64. As mentioned earlier, this breaks compatibility with Hive. A lot of columns in our tables have 'timestamp' as the type in the Hive DDL. It is almost impossible to change the DDL to long, as there are a large number of tables and columns.
>
> We are happy to contribute if there is a clear path forward to supporting int96 in parquet-avro. Please also let me know if you are aware of a workaround in Hive that can read int64 correctly as timestamp.
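For reference, the INT96 timestamp layout that Hive and Spark write (shown in the parquet-tools output above) is 12 bytes: nanoseconds within the day in the first 8 bytes and the Julian day number in the last 4, both little-endian. A minimal sketch of producing that layout from a Java Instant (hypothetical helper, not a parquet-mr or Hudi API):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

/** Hypothetical sketch (not parquet-mr API): encode a timestamp into the
 *  12-byte INT96 layout that Hive and Spark write. */
class Int96Writer {
    // Julian day number of the Unix epoch, 1970-01-01
    private static final long UNIX_EPOCH_JULIAN_DAY = 2_440_588L;

    static byte[] toInt96(Instant ts) {
        // floorDiv/floorMod keep pre-1970 timestamps correct
        long day = Math.floorDiv(ts.getEpochSecond(), 86_400L);
        long nanosOfDay = Math.floorMod(ts.getEpochSecond(), 86_400L) * 1_000_000_000L
                + ts.getNano();
        return ByteBuffer.allocate(12)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(nanosOfDay)                          // bytes 0-7
                .putInt((int) (day + UNIX_EPOCH_JULIAN_DAY))  // bytes 8-11
                .array();
    }
}
```

This is also why INT96 was deprecated in favor of int64 with a TIMESTAMP logical annotation: the Julian-day convention is a writer-side habit rather than anything the format's metadata declares.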
[GitHub] [parquet-mr] anantdamle closed pull request #821: PARQUET-1883 Interpret deprecated INT96 as FIXED[12] in Avro conversion
anantdamle closed pull request #821:
URL: https://github.com/apache/parquet-mr/pull/821