[jira] [Commented] (PARQUET-1926) Add LogicalType support to ThriftType.I64Type

2020-10-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17217010#comment-17217010
 ] 

ASF GitHub Bot commented on PARQUET-1926:
-----------------------------------------

jmartone opened a new pull request #832:
URL: https://github.com/apache/parquet-mr/pull/832


   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Parquet 
Jira](https://issues.apache.org/jira/browse/PARQUET-1926/) issues and 
references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
 - https://issues.apache.org/jira/browse/PARQUET-1926
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines. In 
addition, my commits follow the guidelines from "[How to write a good git 
commit message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain Javadoc that 
explains what they do
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add LogicalType support to ThriftType.I64Type
> ---------------------------------------------
>
> Key: PARQUET-1926
> URL: https://issues.apache.org/jira/browse/PARQUET-1926
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-thrift
>Reporter: Joshua Martone
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Adds a LogicalTypeAnnotation to the I64Type.
> This allows you to serialize timestamps and times.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [parquet-mr] jmartone opened a new pull request #832: PARQUET-1926: Add LogicalType support to I64Type

2020-10-19 Thread GitBox


jmartone opened a new pull request #832:
URL: https://github.com/apache/parquet-mr/pull/832





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [parquet-mr] jmartone closed pull request #830: [Parquet-1926] Add LogicalType support to ThriftType.I64Type

2020-10-19 Thread GitBox


jmartone closed pull request #830:
URL: https://github.com/apache/parquet-mr/pull/830


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

2020-10-19 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216849#comment-17216849
 ] 

Xinli Shang commented on PARQUET-1927:
--------------------------------------

[~gszadovszky], the way the Iceberg Parquet reader iterator works is that 
it relies on the check 'valuesRead < totalValues'. When integrating 
ColumnIndex, we replace readNextRowGroup() with readNextFilteredRowGroup(). 
Because readNextFilteredRowGroup() will skip some records, we change the check 
to 'valuesRead + skippedValues < totalValues'. The skippedValues is calculated 
as 'blockRowCount - counts_Returned_from_readNextFilteredRowGroup'. This works 
great. But when the whole row group is skipped, readNextFilteredRowGroup() 
advances to the next row group internally without Iceberg's knowledge, so 
Iceberg doesn't know how to calculate the skippedValues. 

So if readNextFilteredRowGroup() could return how many records it skipped, or 
tell the index of the row group the returned pages come from, Iceberg could 
calculate the skippedValues. 
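The bookkeeping described above can be sketched as follows. This is an illustrative Java sketch, not actual Iceberg code; the names (valuesRead, skippedValues, totalValues, blockRowCount) follow the comment's pseudocode:

```java
// Illustrative sketch of the iterator bookkeeping described above.
// This is NOT the Iceberg reader; names follow the comment's pseudocode.
class FilteredRowGroupIterator {
  private final long totalValues;   // total rows in the file
  private long valuesRead = 0;      // rows actually returned to the caller
  private long skippedValues = 0;   // rows dropped by column-index filtering

  FilteredRowGroupIterator(long totalValues) {
    this.totalValues = totalValues;
  }

  // After readNextFilteredRowGroup(): the rows skipped in this row group are
  // its full row count minus the rows the filtered read actually returned.
  void accountRowGroup(long blockRowCount, long returnedRowCount) {
    skippedValues += blockRowCount - returnedRowCount;
  }

  void recordRead() {
    valuesRead++;
  }

  // The check from the comment: there is more to read only while
  // reads plus skips do not yet cover every row in the file.
  boolean hasNext() {
    return valuesRead + skippedValues < totalValues;
  }
}
```

The problem reported in the thread is that when an entire row group is filtered out internally, the caller never observes a blockRowCount/returned pair for it, so the accounting step above can never run for that group.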

> ColumnIndex should provide number of records skipped 
> ----------------------------------------------------
>
> Key: PARQUET-1927
> URL: https://issues.apache.org/jira/browse/PARQUET-1927
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found we need to know from Parquet 
> how many records were skipped due to ColumnIndex filtering. When 
> rowCount is 0, readNextFilteredRowGroup() just advances to the next row group 
> without telling the caller. See the code here: 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> In Iceberg, Parquet records are read with an iterator. Its hasNext() uses the 
> following check:
> valuesRead + skippedValues < totalValues
> See 
> [https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115]
>  
> So without knowing the skipped values, it is hard to determine whether 
> hasNext() should be true. 
>  
> Currently, we can work around this with a flag: when readNextFilteredRowGroup() 
> returns null, we consider the whole file done and hasNext() just 
> returns false. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

2020-10-19 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216534#comment-17216534
 ] 

Gabor Szadovszky commented on PARQUET-1927:
-------------------------------------------

[~shangxinli], I am not sure I get the problem. If rowCount is 0 after 
column-index filtering we just skip the whole row-group similarly to the 
row-group level filters (dictionary/statistics or bloom). You don't know the 
number of rows skipped in case of row-group level filters either.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1928) Interpret Parquet INT96 type as FIXED[12] AVRO Schema

2020-10-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216502#comment-17216502
 ] 

ASF GitHub Bot commented on PARQUET-1928:
-----------------------------------------

anantdamle opened a new pull request #831:
URL: https://github.com/apache/parquet-mr/pull/831


   Make sure you have checked _all_ steps below.
   
   Reading Parquet files in Apache Beam using ParquetIO uses 
`AvroParquetReader` causing it to throw `IllegalArgumentException("INT96 not 
implemented and is deprecated")`
   
   Customers have large datasets which can't be reprocessed again to convert 
into a supported type. An easier approach would be to convert into a byte array 
of 12 bytes, that can then be interpreted by the developer in any way they want 
to interpret it.
   
   This patch interprets the INT96 Parquet type as a 12-byte array; 
the developer/user can then handle it as appropriate, interpreting it as a 
timestamp or simply as bytes.
   
   - [x] My PR adds the following unit test: 
`testParquetInt96AsFixed12AvroType`
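For illustration, here is one way a caller might interpret the 12-byte value surfaced by this approach, assuming the common Impala/Hive INT96 timestamp layout (8 bytes of little-endian nanoseconds-of-day followed by a 4-byte little-endian Julian day). The class name `Int96Decoder` is hypothetical and the layout is an assumption to verify against your data:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

// Hypothetical helper, not part of parquet-avro. Assumes the common
// Impala/Hive INT96 layout: 8 bytes LE nanos-of-day, then 4 bytes LE Julian day.
public final class Int96Decoder {
  private static final long JULIAN_EPOCH_DAY = 2440588L; // Julian day of 1970-01-01

  public static Instant toInstant(byte[] int96) {
    ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong();                    // bytes 0-7
    long julianDay = Integer.toUnsignedLong(buf.getInt()); // bytes 8-11
    long epochDay = julianDay - JULIAN_EPOCH_DAY;
    return Instant.ofEpochSecond(
        epochDay * 86400L + nanosOfDay / 1_000_000_000L,
        nanosOfDay % 1_000_000_000L);
  }
}
```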
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Interpret Parquet INT96 type as FIXED[12] AVRO Schema
> -----------------------------------------------------
>
> Key: PARQUET-1928
> URL: https://issues.apache.org/jira/browse/PARQUET-1928
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Reporter: Anant Damle
>Priority: Minor
>  Labels: patch
>
> Reading Parquet files in Apache Beam using ParquetIO uses `AvroParquetReader` 
> causing it to throw `IllegalArgumentException("INT96 not implemented and is 
> deprecated")`
> Customers have large datasets which can't be reprocessed again to convert 
> into a supported type. An easier approach would be to convert into a byte 
> array of 12 bytes, that can then be interpreted by the developer in any way 
> they want to interpret it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [parquet-mr] anantdamle opened a new pull request #831: PARQUET-1928: Interpret Parquet INT96 type as FIXED[12] AVRO Schema

2020-10-19 Thread GitBox


anantdamle opened a new pull request #831:
URL: https://github.com/apache/parquet-mr/pull/831





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Created] (PARQUET-1928) Interpret Parquet INT96 type as FIXED[12] AVRO Schema

2020-10-19 Thread Anant Damle (Jira)
Anant Damle created PARQUET-1928:


 Summary: Interpret Parquet INT96 type as FIXED[12] AVRO Schema
 Key: PARQUET-1928
 URL: https://issues.apache.org/jira/browse/PARQUET-1928
 Project: Parquet
  Issue Type: Bug
  Components: parquet-avro
Reporter: Anant Damle


Reading Parquet files in Apache Beam using ParquetIO uses `AvroParquetReader` 
causing it to throw `IllegalArgumentException("INT96 not implemented and is 
deprecated")`

Customers have large datasets which can't be reprocessed again to convert into 
a supported type. An easier approach would be to convert into a byte array of 
12 bytes, that can then be interpreted by the developer in any way they want to 
interpret it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1883) int96 support in parquet-avro

2020-10-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17216489#comment-17216489
 ] 

ASF GitHub Bot commented on PARQUET-1883:
-----------------------------------------

anantdamle closed pull request #821:
URL: https://github.com/apache/parquet-mr/pull/821


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> int96 support in parquet-avro
> -----------------------------
>
> Key: PARQUET-1883
> URL: https://issues.apache.org/jira/browse/PARQUET-1883
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.10.1
>Reporter: satish
>Priority: Major
>
> Hi
> It looks like 'timestamp' is being converted to the 'int64' primitive type in 
> parquet-avro. This is incompatible with Hive 2; Hive throws the error below: 
> {code:java}
> Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be 
> cast to org.apache.hadoop.hive.serde2.io.TimestampWritable (state=,code=0)
> {code}
> What does it take to write the timestamp field as 'int96'? 
> Hive seems to write the timestamp field as int96. See the example below:
> {code:java}
> $ hadoop jar parquet-tools-1.9.0.jar meta hdfs://timestamp_test/00_0
> creator: parquet-mr version 1.10.6 (build 
> 098c6199a821edd3d6af56b962fd0f1558af849b)
> file schema: hive_schema
> 
> ts:  OPTIONAL INT96 R:0 D:1
> row group 1: RC:4 TS:88 OFFSET:4
> 
> ts:   INT96 UNCOMPRESSED DO:0 FPO:4 SZ:88/88/1.00 VC:4 
> ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
> {code}
> Writing a Spark dataframe in Parquet format (without using Avro) also 
> uses int96.
> {code:java}
> scala> testDS.printSchema()
> root
>  |-- ts: timestamp (nullable = true)
> scala> testDS.write.mode(Overwrite).save("/tmp/x");
> $ parquet-tools meta 
> /tmp/x/part-0-99720ebd-0aea-45ac-9b8c-0eb7ad6f4e3c-c000.gz.parquet 
> file:
> file:/tmp/x/part-0-99720ebd-0aea-45ac-9b8c-0eb7ad6f4e3c-c000.gz.parquet 
> creator: parquet-mr version 1.10.1 (build 
> a89df8f9932b6ef6633d06069e50c9b7970bebd1) 
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}
>  
> file schema: spark_schema 
> 
> ts:  OPTIONAL INT96 R:0 D:1
> row group 1: RC:4 TS:93 OFFSET:4 
> 
> ts:   INT96 GZIP DO:0 FPO:4 SZ:130/93/0.72 VC:4 
> ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED ST:[no stats for this column]
> {code}
> I saw some explanation for deprecating int96 [support 
> here|https://issues.apache.org/jira/browse/PARQUET-1870?focusedCommentId=17127963&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17127963]
>  from [~gszadovszky]. But given that Hive and serialization in other Parquet 
> modules (non-avro) support int96, I'm trying to understand the reasoning for 
> not implementing it in parquet-avro.
> A bit more context: we are trying to migrate some of our data to [hudi 
> format|https://hudi.apache.org/]. Hudi adds a lot of efficiency for our use 
> cases. But when we write data using Hudi, Hudi uses parquet-avro and 
> timestamp is converted to int64. As mentioned earlier, this breaks 
> compatibility with Hive. A lot of columns in our tables have 'timestamp' as the 
> type in their Hive DDL. It is almost impossible to change the DDL to long, as 
> there are a large number of tables and columns. 
> We are happy to contribute if there is a clear path forward to support int96 
> in parquet-avro. Please also let me know if you are aware of a workaround in 
> Hive that can read int64 correctly as timestamp.
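As an aside on what "writing a timestamp field as int96" amounts to at the byte level, here is an illustrative encoder for the layout Hive and Impala are commonly described as using (8 bytes of little-endian nanoseconds-of-day, then a 4-byte little-endian Julian day). This is a sketch under that assumed layout, not parquet-mr or Hudi code, and the class name is hypothetical:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

// Hypothetical helper showing the assumed Hive/Impala INT96 byte layout.
public final class Int96Encoder {
  private static final long JULIAN_EPOCH_DAY = 2440588L; // Julian day of 1970-01-01

  public static byte[] fromInstant(Instant ts) {
    long epochDay = Math.floorDiv(ts.getEpochSecond(), 86400L);
    long nanosOfDay =
        Math.floorMod(ts.getEpochSecond(), 86400L) * 1_000_000_000L + ts.getNano();
    return ByteBuffer.allocate(12).order(ByteOrder.LITTLE_ENDIAN)
        .putLong(nanosOfDay)                    // bytes 0-7: nanos within the day
        .putInt((int) (epochDay + JULIAN_EPOCH_DAY)) // bytes 8-11: Julian day
        .array();
  }
}
```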



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [parquet-mr] anantdamle closed pull request #821: PARQUET-1883 Interpret deprecated INT96 as FIXED[12] in Avro conversion

2020-10-19 Thread GitBox


anantdamle closed pull request #821:
URL: https://github.com/apache/parquet-mr/pull/821


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org