[jira] [Created] (PARQUET-2183) Fix statistics issue of Column Encryptor

2022-09-02 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2183:


 Summary: Fix statistics issue of Column Encryptor
 Key: PARQUET-2183
 URL: https://issues.apache.org/jira/browse/PARQUET-2183
 Project: Parquet
  Issue Type: Improvement
Reporter: Xinli Shang
Assignee: Xinli Shang


There is an issue where column statistics are missing if that column is 
re-encrypted. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

2022-04-08 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17519686#comment-17519686
 ] 

Xinli Shang commented on PARQUET-1681:
--

[~theosib-amazon] It seems different. 

> Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
> -
>
> Key: PARQUET-1681
> URL: https://issues.apache.org/jira/browse/PARQUET-1681
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.10.0, 1.9.1, 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Critical
>
> When using the Avro schema below to write a parquet (1.8.1) file and then 
> reading it back with parquet 1.10.1 without passing any schema, the read throws 
> an exception "XXX is not a group". Reading with parquet 1.8.1 is fine. 
> {
>   "name": "phones",
>   "type": [
>     "null",
>     {
>       "type": "array",
>       "items": {
>         "type": "record",
>         "name": "phones_items",
>         "fields": [
>           {
>             "name": "phone_number",
>             "type": ["null", "string"],
>             "default": null
>           }
>         ]
>       }
>     }
>   ],
>   "default": null
> }
> The code to read is as below: 
> val reader = AvroParquetReader.builder[SomeRecordType](parquetPath).withConf(new Configuration).build()
> reader.read()
> PARQUET-651 changed the method isElementType() to rely on Avro's 
> checkReaderWriterCompatibility() to check compatibility. However, 
> checkReaderWriterCompatibility() considers the Parquet schema and the Avro 
> schema (converted from the file schema) incompatible, because the name in the 
> Avro schema is ‘phones_items’ while it is ‘array’ in the Parquet schema. It 
> therefore returns false, which caused the “phone_number” field in the above 
> schema to be treated as a group type, which it is not. The exception is then 
> thrown at .asGroupType(). 
> I didn’t verify whether writing via parquet 1.10.1 reproduces the same 
> problem, but it could, because the translation of the Avro schema to the 
> Parquet schema has not changed (not verified yet). 
> I hesitate to revert PARQUET-651 because it solved several problems. I would 
> like to hear the community's thoughts on it. 
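The name-mismatch mechanism described above can be illustrated without the Avro library: Avro's checkReaderWriterCompatibility() reports two record schemas with different full names as incompatible even when their fields line up. The sketch below models only that name-equality step; it is an assumption about the relevant behavior, and NameCompatSketch/RecordSchema are illustrative names, not Avro or parquet-mr code.

```java
import java.util.List;

public class NameCompatSketch {
    // A toy stand-in for an Avro record schema: full name plus field names.
    record RecordSchema(String fullName, List<String> fieldNames) {}

    // Mirrors only the name-equality step of the compatibility check:
    // differing record names alone yield "incompatible".
    static boolean compatible(RecordSchema reader, RecordSchema writer) {
        if (!reader.fullName().equals(writer.fullName())) return false;
        return reader.fieldNames().equals(writer.fieldNames());
    }

    public static void main(String[] args) {
        var avroSide    = new RecordSchema("phones_items", List.of("phone_number"));
        var parquetSide = new RecordSchema("array",        List.of("phone_number"));
        // Same fields, different record name: isElementType() gets 'false' and
        // wrongly falls back to treating the element as a group.
        System.out.println(compatible(avroSide, parquetSide)); // false
    }
}
```

With identical field lists, the sole difference is the record name, which is exactly the 'phones_items' vs 'array' mismatch reported above.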



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-1595) Parquet proto writer de-nest Protobuf wrapper classes

2022-03-20 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17509500#comment-17509500
 ] 

Xinli Shang commented on PARQUET-1595:
--

Is 'Int32Value -> int64' a typo (should it map to int32)?



> Parquet proto writer de-nest Protobuf wrapper classes
> -
>
> Key: PARQUET-1595
> URL: https://issues.apache.org/jira/browse/PARQUET-1595
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Ying Xu
>Priority: Major
>
> The existing Parquet protobuf writer preserves the structure of any Protobuf 
> Message object.  This works well in most cases. However, when dealing with 
> [Protobuf wrapper 
> messages|https://github.com/protocolbuffers/protobuf/blob/master/src/google/protobuf/wrappers.proto],
>  users may prefer writing the de-nested value directly into the Parquet 
> files, for ease of querying them directly (in query engines such as 
> Hive/Presto). 
> Proposal: 
>  * Implement a control flag, e.g., enableDenestingWrappers, to control 
> whether or not to de-nest Protobuf wrapper classes. 
>  * When this flag is set to true, write the Protobuf wrapper classes as 
> single primitive fields, based on the type of the wrapped *value* field.
>   
> ||Protobuf Type||Parquet Type||
> |BoolValue|boolean|
> |BytesValue|binary|
> |DoubleValue|double|
> |FloatValue|float|
> |Int32Value|int64 (32-bit, signed)|
> |Int64Value|int64 (64-bit, signed)|
> |StringValue|binary (string)|
> |UInt32Value|int64 (32-bit, unsigned)|
> |UInt64Value|int64 (64-bit, unsigned)|
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-2116) Cell Level Encryption

2022-03-12 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-2116:
-
External issue URL: 
https://docs.google.com/document/d/1PUonl9i_fVlRhUmqEmWBQJ8zesX7mlvnu3ubemT11rk/edit#heading=h.kkuoyw5u0ywe
  (was: 
https://docs.google.com/document/d/1Q-d98Os_aJahUynznPrWvXwWQeN0aFDRhZj3hXt_JOM/edit#)

> Cell Level Encryption 
> --
>
> Key: PARQUET-2116
> URL: https://issues.apache.org/jira/browse/PARQUET-2116
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> Cell-level encryption provides finer-grained encryption than modular 
> encryption (PARQUET-1178) or file encryption. The idea is that only some 
> fields inside a column are encrypted, based on a filter expression. For 
> example, given a table with columns a, b, c.x, c.y, d, we can encrypt columns 
> a and c.x where d == 5 and c.y > 0.
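The filter idea can be pictured with a small sketch: rows matching d == 5 and c.y > 0 get their a and c.x values encrypted, other rows stay in plaintext. This is illustrative only; Row, protect(), and the Base64 "cipher" are hypothetical stand-ins, and real Parquet cell encryption would operate on pages and values, not POJOs.

```java
import java.util.Base64;

public class CellFilterSketch {
    // Toy row with columns a, c.x, c.y, d (column b omitted for brevity).
    record Row(String a, String cx, double cy, int d) {}

    static String encryptCell(String v) {        // placeholder for a real cipher
        return "enc:" + Base64.getEncoder().encodeToString(v.getBytes());
    }

    // Apply the filter expression "d == 5 and c.y > 0" per row.
    static Row protect(Row r) {
        if (r.d() == 5 && r.cy() > 0) {
            return new Row(encryptCell(r.a()), encryptCell(r.cx()), r.cy(), r.d());
        }
        return r;                                // unmatched row: left in plaintext
    }

    public static void main(String[] args) {
        System.out.println(protect(new Row("ssn-1", "x1", 1.0, 5)).a()); // encrypted
        System.out.println(protect(new Row("ssn-2", "x2", 1.0, 3)).a()); // ssn-2
    }
}
```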



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (PARQUET-2127) Security risk in latest parquet-jackson-1.12.2.jar

2022-02-17 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17494321#comment-17494321
 ] 

Xinli Shang edited comment on PARQUET-2127 at 2/18/22, 2:23 AM:


Thanks for reporting [~phoebemaomao]! Will you be able to come up with the fix? 
I will be happy to review and merge.


was (Author: sha...@uber.com):
Thanks for reporting [~phoebemaomao]! Will you be able to come up with the fix? 
I will be happy to review and merge.. 

> Security risk in latest parquet-jackson-1.12.2.jar
> --
>
> Key: PARQUET-2127
> URL: https://issues.apache.org/jira/browse/PARQUET-2127
> Project: Parquet
>  Issue Type: Improvement
>Reporter: phoebe chen
>Priority: Major
>
> The embedded jackson-databind 2.11.4 has a security risk: a possible DoS when 
> using JDK serialization to serialize a JsonNode 
> ([https://github.com/FasterXML/jackson-databind/issues/3328]). Upgrading to 
> 2.13.1 fixes this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-2127) Security risk in latest parquet-jackson-1.12.2.jar

2022-02-17 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2127?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17494321#comment-17494321
 ] 

Xinli Shang commented on PARQUET-2127:
--

Thanks for reporting [~phoebemaomao]! Will you be able to come up with the fix? 
I will be happy to review and merge.. 

> Security risk in latest parquet-jackson-1.12.2.jar
> --
>
> Key: PARQUET-2127
> URL: https://issues.apache.org/jira/browse/PARQUET-2127
> Project: Parquet
>  Issue Type: Improvement
>Reporter: phoebe chen
>Priority: Major
>
> The embedded jackson-databind 2.11.4 has a security risk: a possible DoS when 
> using JDK serialization to serialize a JsonNode 
> ([https://github.com/FasterXML/jackson-databind/issues/3328]). Upgrading to 
> 2.13.1 fixes this.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700

2022-02-14 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492099#comment-17492099
 ] 

Xinli Shang edited comment on PARQUET-2122 at 2/14/22, 4:56 PM:


[~junjie] Do you know why? 


was (Author: sha...@uber.com):
[~junjie]Do you know why? 

> Adding Bloom filter to small Parquet file bloats in size X1700
> --
>
> Key: PARQUET-2122
> URL: https://issues.apache.org/jira/browse/PARQUET-2122
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli, parquet-mr
>Affects Versions: 1.13.0
>Reporter: Ze'ev Maor
>Priority: Critical
> Attachments: data.csv, data_index_bloom.parquet
>
>
> Converting a small csv file (14 rows, 1 string column) to Parquet without a 
> bloom filter yields a 600 B file; adding '.withBloomFilterEnabled(true)' to 
> the ParquetWriter then yields a 1049197 B file.
> It isn't clear what the extra space is used for.
> The csv and the bloated Parquet file are attached.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-2122) Adding Bloom filter to small Parquet file bloats in size X1700

2022-02-14 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492099#comment-17492099
 ] 

Xinli Shang commented on PARQUET-2122:
--

[~junjie]Do you know why? 

> Adding Bloom filter to small Parquet file bloats in size X1700
> --
>
> Key: PARQUET-2122
> URL: https://issues.apache.org/jira/browse/PARQUET-2122
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli, parquet-mr
>Affects Versions: 1.13.0
>Reporter: Ze'ev Maor
>Priority: Critical
> Attachments: data.csv, data_index_bloom.parquet
>
>
> Converting a small csv file (14 rows, 1 string column) to Parquet without a 
> bloom filter yields a 600 B file; adding '.withBloomFilterEnabled(true)' to 
> the ParquetWriter then yields a 1049197 B file.
> It isn't clear what the extra space is used for.
> The csv and the bloated Parquet file are attached.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-2117) Add rowPosition API in parquet record readers

2022-02-02 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17485949#comment-17485949
 ] 

Xinli Shang commented on PARQUET-2117:
--

Thanks for opening this Jira! Looking forward to the PR.

> Add rowPosition API in parquet record readers
> -
>
> Key: PARQUET-2117
> URL: https://issues.apache.org/jira/browse/PARQUET-2117
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Prakhar Jain
>Priority: Major
> Fix For: 1.13.0
>
>
> Currently the parquet-mr RecordReader/ParquetFileReader exposes APIs to read 
> a parquet file in a columnar fashion or record by record.
> It would be great to extend them to also support a rowPosition API which can 
> tell the position of the current record in the parquet file.
> The rowPosition can be used as a unique row identifier to mark a row. This 
> can be useful for creating an index (e.g. a B+ tree) over a parquet 
> file/parquet table (e.g. in Spark/Hive).
> There are multiple projects in the parquet ecosystem which can benefit from 
> such functionality: 
>  # Apache Iceberg needs this functionality. It has this implementation 
> already, as it relies on low-level parquet APIs -  
> [Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
>  
> [Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
>  # Apache Spark can use this functionality - SPARK-37980
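One shape such an API could take is a reader wrapper that tracks the index of the record it just returned. PositionedReader below is a hypothetical sketch, not parquet-mr API; a real implementation would also need to account for skipped row groups and filtered pages so positions stay file-absolute.

```java
import java.util.Iterator;
import java.util.List;

public class PositionedReader<T> implements Iterator<T> {
    private final Iterator<T> delegate;
    private long rowPosition = -1; // position of the last record returned

    public PositionedReader(Iterator<T> delegate) { this.delegate = delegate; }

    @Override public boolean hasNext() { return delegate.hasNext(); }

    @Override public T next() {
        T record = delegate.next();
        rowPosition++;             // monotonically increasing, unique per row
        return record;
    }

    /** Position (0-based) of the record most recently returned by next(). */
    public long getCurrentRowPosition() { return rowPosition; }

    public static void main(String[] args) {
        var r = new PositionedReader<>(List.of("a", "b", "c").iterator());
        r.next();
        r.next();
        System.out.println(r.getCurrentRowPosition()); // 1
    }
}
```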



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-2116) Cell Level Encryption

2022-01-27 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-2116:
-
External issue URL: 
https://docs.google.com/document/d/1Q-d98Os_aJahUynznPrWvXwWQeN0aFDRhZj3hXt_JOM/edit#

> Cell Level Encryption 
> --
>
> Key: PARQUET-2116
> URL: https://issues.apache.org/jira/browse/PARQUET-2116
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> Cell-level encryption provides finer-grained encryption than modular 
> encryption (PARQUET-1178) or file encryption. The idea is that only some 
> fields inside a column are encrypted, based on a filter expression. For 
> example, given a table with columns a, b, c.x, c.y, d, we can encrypt columns 
> a and c.x where d == 5 and c.y > 0.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (PARQUET-2116) Cell Level Encryption

2022-01-27 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2116:


 Summary: Cell Level Encryption 
 Key: PARQUET-2116
 URL: https://issues.apache.org/jira/browse/PARQUET-2116
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Xinli Shang
Assignee: Xinli Shang


Cell-level encryption provides finer-grained encryption than modular 
encryption (PARQUET-1178) or file encryption. The idea is that only some 
fields inside a column are encrypted, based on a filter expression. For 
example, given a table with columns a, b, c.x, c.y, d, we can encrypt columns 
a and c.x where d == 5 and c.y > 0.




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (PARQUET-2091) Fix release build error introduced by PARQUET-2043

2022-01-27 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang resolved PARQUET-2091.
--
Resolution: Won't Fix

> Fix release build error introduced by PARQUET-2043
> --
>
> Key: PARQUET-2091
> URL: https://issues.apache.org/jira/browse/PARQUET-2091
> Project: Parquet
>  Issue Type: Task
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> After PARQUET-2043, when building a release like 1.12.1, there is a build 
> error complaining about a 'used undeclared dependency'. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-2098) Add more methods into interface of BlockCipher

2022-01-27 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17483225#comment-17483225
 ] 

Xinli Shang commented on PARQUET-2098:
--

[~gershinsky] Do you have time to work on this, as we discussed, so we can 
release the new version?

> Add more methods into interface of BlockCipher
> --
>
> Key: PARQUET-2098
> URL: https://issues.apache.org/jira/browse/PARQUET-2098
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> Currently the BlockCipher interface has methods that don't let the caller 
> specify a length/offset. In some use cases, such as Presto, a byte array is 
> passed in and the data to be decrypted occupies only part of the array. So we 
> need to add a new method, something like the one below, for decrypt. A 
> similar method might be needed for encrypt. 
> byte[] decrypt(byte[] ciphertext, int cipherTextOffset, int cipherTextLength, 
> byte[] aad);
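A hedged sketch of what the offset/length variant looks like, using the JDK's AES/GCM cipher directly. The real parquet-mr BlockCipher framing (nonce layout, module AAD computation) is not reproduced here; OffsetDecryptSketch and its helper are illustrative names, not parquet-mr API. The key point is that Cipher.doFinal(buf, off, len) processes only the requested slice of a larger caller-owned array, avoiding a copy.

```java
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class OffsetDecryptSketch {
    // Decrypt only buf[off, off+len): the caller's array may be much larger.
    static byte[] decrypt(SecretKey key, byte[] iv, byte[] buf, int off, int len, byte[] aad)
            throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key, new GCMParameterSpec(128, iv));
        c.updateAAD(aad);
        return c.doFinal(buf, off, len);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        SecretKey key = kg.generateKey();
        byte[] iv = new byte[12];
        new SecureRandom().nextBytes(iv);
        byte[] aad = "module-aad".getBytes();

        Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        enc.updateAAD(aad);
        byte[] ct = enc.doFinal("page bytes".getBytes());

        // Embed the ciphertext inside a larger buffer, as a Presto-style caller would.
        byte[] big = new byte[ct.length + 32];
        System.arraycopy(ct, 0, big, 16, ct.length);

        System.out.println(new String(decrypt(key, iv, big, 16, ct.length, aad))); // page bytes
    }
}
```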



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (PARQUET-2112) Fix typo in MessageColumnIO

2022-01-27 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang resolved PARQUET-2112.
--
Resolution: Fixed

> Fix typo in MessageColumnIO
> ---
>
> Key: PARQUET-2112
> URL: https://issues.apache.org/jira/browse/PARQUET-2112
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.13.0
>
>
> The variable 'BitSet vistedIndexes' contains a typo; rename it to 'visitedIndexes'.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (PARQUET-2112) Fix typo in MessageColumnIO

2022-01-22 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2112:


 Summary: Fix typo in MessageColumnIO
 Key: PARQUET-2112
 URL: https://issues.apache.org/jira/browse/PARQUET-2112
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.12.2
Reporter: Xinli Shang
Assignee: Xinli Shang
 Fix For: 1.13.0


The variable 'BitSet vistedIndexes' contains a typo; rename it to 'visitedIndexes'.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-2111) Support limit push down and stop early for RecordReader

2022-01-21 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17480128#comment-17480128
 ] 

Xinli Shang commented on PARQUET-2111:
--

Looking forward to the PR.

> Support limit push down and stop early for RecordReader
> ---
>
> Key: PARQUET-2111
> URL: https://issues.apache.org/jira/browse/PARQUET-2111
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Jackey Lee
>Priority: Major
>
> With limit push-down, scanning the parquet file can stop early, reducing 
> network and disk IO.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (PARQUET-2071) Encryption translation tool

2022-01-14 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang resolved PARQUET-2071.
--
Resolution: Fixed

> Encryption translation tool 
> 
>
> Key: PARQUET-2071
> URL: https://issues.apache.org/jira/browse/PARQUET-2071
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> When translating existing data to an encrypted state, we could develop a tool 
> like TransCompression that translates the data at the page level without 
> reading records and rewriting them. This will speed up the process a lot. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (PARQUET-1872) Add TransCompression Feature

2022-01-14 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang resolved PARQUET-1872.
--
Resolution: Fixed

> Add TransCompression Feature 
> -
>
> Key: PARQUET-1872
> URL: https://issues.apache.org/jira/browse/PARQUET-1872
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> As ZSTD becomes more popular, there is a need to translate existing data to 
> ZSTD compression, which can achieve a higher compression ratio. It would be 
> useful to have a tool that converts a Parquet file directly, by just 
> decompressing/compressing each page without decoding/encoding or assembling 
> records, because that is much faster. Initial results show it is ~5 times 
> faster. 
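The page-level idea can be sketched with JDK-only codecs: each compressed page payload is decompressed and recompressed directly, with no record decoding. Since the JDK ships no ZSTD codec, raw DEFLATE stands in for the target codec here; page/column framing and footer rewriting, which the real tool must handle, are omitted, and TransCompressSketch is an illustrative name.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

public class TransCompressSketch {
    // Translate one compressed page payload: gunzip, then re-deflate.
    // No values are decoded and no records are assembled.
    static byte[] transcompressPage(byte[] gzipPage) throws Exception {
        byte[] raw;
        try (GZIPInputStream g = new GZIPInputStream(new ByteArrayInputStream(gzipPage))) {
            raw = g.readAllBytes();
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (DeflaterOutputStream d = new DeflaterOutputStream(bos,
                new Deflater(Deflater.BEST_COMPRESSION))) {
            d.write(raw);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] page = "payload payload payload payload".getBytes();
        ByteArrayOutputStream gz = new ByteArrayOutputStream();
        try (GZIPOutputStream g = new GZIPOutputStream(gz)) { g.write(page); }

        byte[] translated = transcompressPage(gz.toByteArray());
        // Round-trip check: the translated page inflates back to the original bytes.
        try (InflaterInputStream i = new InflaterInputStream(new ByteArrayInputStream(translated))) {
            System.out.println(java.util.Arrays.equals(page, i.readAllBytes())); // true
        }
    }
}
```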



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (PARQUET-2105) Refactor the test code of creating the test file

2022-01-14 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang resolved PARQUET-2105.
--
Resolution: Fixed

> Refactor the test code of creating the test file 
> -
>
> Key: PARQUET-2105
> URL: https://issues.apache.org/jira/browse/PARQUET-2105
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> In the tests, there are many places that need to create a test parquet file 
> with different settings. Currently, each test writes its own file-creation 
> code. It would be better to have a test-file builder for this. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-1889) Register a MIME type for the Parquet format.

2022-01-11 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17473147#comment-17473147
 ] 

Xinli Shang commented on PARQUET-1889:
--

+1 on [~westonpace]'s point 

> Register a MIME type for the Parquet format.
> 
>
> Key: PARQUET-1889
> URL: https://issues.apache.org/jira/browse/PARQUET-1889
> Project: Parquet
>  Issue Type: Wish
>  Components: parquet-format
>Affects Versions: format-2.7.0
>Reporter: Mark Wood
>Priority: Major
>
> There is currently  no MIME type registered for Parquet.  Perhaps this is 
> intentional.
> If it is not intentional, I suggest steps be taken to register a MIME type 
> with IANA.
>  
> [https://www.iana.org/assignments/media-types/media-types.xhtml]
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-1911) Add way to disables statistics on a per column basis

2022-01-04 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17468759#comment-17468759
 ] 

Xinli Shang commented on PARQUET-1911:
--

[~panthony] Thanks for working on this! Just FYI that there was an effort to 
truncate the min/max https://issues.apache.org/jira/browse/PARQUET-1685. It can 
be enabled with a flag. With that said, your changes are still welcome. Feel 
free to create a PR if you haven't and I will review it. 

> Add way to disables statistics on a per column basis
> 
>
> Key: PARQUET-1911
> URL: https://issues.apache.org/jira/browse/PARQUET-1911
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Anthony Pessy
>Priority: Major
> Attachments: NoOpStatistics.java, 
> add_config_to_opt-out_of_a_column's_statistics.patch
>
>
> When you write a dataset with BINARY columns that can be fairly large 
> (several MBs), you can often end up with an OutOfMemory error where you have 
> to either:
>  - Throw more RAM at it
>  - Increase the number of output files
>  - Play with the block size
> Using a fork with an increased check frequency for the row group size helps, 
> but it is not enough. (PR: [https://github.com/apache/parquet-mr/pull/470])
> The OutOfMemory error is caused by the accumulation of min/max values for 
> those columns in each BlockMetaData.
> The "parquet.statistics.truncate.length" configuration is of no help because 
> it is applied during footer serialization, whereas the OOM occurs before 
> that.
> I think it would be nice to have, as for the dictionary or bloom filter, a 
> way to disable statistics on a per-column basis.
> This could be very useful to lower memory consumption when stats of huge 
> binary columns are unnecessary.
>  
>  
>  
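One way to picture the proposal is a per-column statistics collector whose update discards min/max, so large BINARY values never accumulate in block metadata. ColumnStats, MinMaxStats, and NoOpStats below are illustrative stand-ins, not the parquet-mr Statistics API or the attached patch.

```java
import java.util.Arrays;

public class StatsSketch {
    interface ColumnStats { void update(byte[] value); long retainedBytes(); }

    // Normal behavior: lexicographic min/max, each of which may retain a copy
    // of a multi-MB binary value per block.
    static final class MinMaxStats implements ColumnStats {
        private byte[] min, max;
        public void update(byte[] v) {
            if (min == null || Arrays.compare(v, min) < 0) min = Arrays.copyOf(v, v.length);
            if (max == null || Arrays.compare(v, max) > 0) max = Arrays.copyOf(v, v.length);
        }
        public long retainedBytes() {
            return (min == null ? 0 : min.length) + (max == null ? 0 : max.length);
        }
    }

    // Per-column opt-out: nothing is retained, so nothing accumulates.
    static final class NoOpStats implements ColumnStats {
        public void update(byte[] v) { /* statistics disabled for this column */ }
        public long retainedBytes() { return 0; }
    }

    public static void main(String[] args) {
        ColumnStats on = new MinMaxStats(), off = new NoOpStats();
        byte[] blob = new byte[1 << 20];          // a "several MB"-style value
        on.update(blob);
        off.update(blob);
        System.out.println(on.retainedBytes());   // 2097152
        System.out.println(off.retainedBytes());  // 0
    }
}
```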



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (PARQUET-1874) Add to parquet-cli

2021-12-03 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang resolved PARQUET-1874.
--
Resolution: Fixed

> Add to parquet-cli
> --
>
> Key: PARQUET-1874
> URL: https://issues.apache.org/jira/browse/PARQUET-1874
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (PARQUET-1873) Add to Parquet-tools

2021-12-03 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang resolved PARQUET-1873.
--
Resolution: Fixed

> Add to Parquet-tools 
> -
>
> Key: PARQUET-1873
> URL: https://issues.apache.org/jira/browse/PARQUET-1873
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-1396) EncryptionPropertiesFactory and DecryptionPropertiesFactory

2021-12-03 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-1396:
-
Summary: EncryptionPropertiesFactory and DecryptionPropertiesFactory  (was: 
Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory)

> EncryptionPropertiesFactory and DecryptionPropertiesFactory
> ---
>
> Key: PARQUET-1396
> URL: https://issues.apache.org/jira/browse/PARQUET-1396
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.10.0, 1.10.1
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> This JIRA is an extension to the Parquet Modular Encryption Jira 
> (PARQUET-1178) that will provide the basic building blocks and APIs for 
> encryption support. 
> This JIRA provides a crypto data interface for schema activation of Parquet 
> encryption and serves as a high-level layer on top of PARQUET-1178 to make 
> its adoption easier, with a pluggable key access module and without a need to 
> use the low-level encryption APIs. This feature will also enable seamless 
> integration with existing clients.
> No change to specifications (parquet-format), no new Parquet APIs, and no 
> changes to existing Parquet APIs. All current applications, tests, etc. will 
> continue to work.
> From a developer's perspective, they can just implement the interface in a 
> plugin which can be attached to any Parquet application like Hive/Spark etc. 
> This decouples the complexity of dealing with KMS and schemas from Parquet 
> applications. A large organization may have hundreds or even thousands of 
> Parquet applications and pipelines; the decoupling makes Parquet encryption 
> easier to adopt.  
> From an end user's (for example, a data owner's) perspective, if they 
> consider a column sensitive, they can just mark that column's schema as 
> sensitive and the Parquet application encrypts that column automatically. 
> This makes it easy for end users to manage the encryption of their columns.  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-1872) Add TransCompression Feature

2021-12-03 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-1872:
-
Summary: Add TransCompression Feature   (was: Add TransCompression command )

> Add TransCompression Feature 
> -
>
> Key: PARQUET-1872
> URL: https://issues.apache.org/jira/browse/PARQUET-1872
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> As ZSTD becomes more popular, there is a need to translate existing data to 
> ZSTD compression, which can achieve a higher compression ratio. It would be 
> useful to have a tool that converts a Parquet file directly, by just 
> decompressing/compressing each page without decoding/encoding or assembling 
> records, because that is much faster. Initial results show it is ~5 times 
> faster. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (PARQUET-2105) Refactor the test code of creating the test file

2021-11-30 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2105:


 Summary: Refactor the test code of creating the test file 
 Key: PARQUET-2105
 URL: https://issues.apache.org/jira/browse/PARQUET-2105
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Xinli Shang
Assignee: Xinli Shang


In the tests, there are many places that need to create a test parquet file 
with different settings. Currently, each test writes its own file-creation 
code. It would be better to have a test-file builder for this. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (PARQUET-2098) Add more methods into interface of BlockCipher

2021-09-29 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2098:


 Summary: Add more methods into interface of BlockCipher
 Key: PARQUET-2098
 URL: https://issues.apache.org/jira/browse/PARQUET-2098
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Xinli Shang
Assignee: Xinli Shang


Currently the BlockCipher interface has methods that don't let the caller 
specify a length/offset. In some use cases, such as Presto, a byte array is 
passed in and the data to be decrypted occupies only part of the array. So we 
need to add a new method, something like the one below, for decrypt. A similar 
method might be needed for encrypt. 

byte[] decrypt(byte[] ciphertext, int cipherTextOffset, int cipherTextLength, 
byte[] aad);



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (PARQUET-2027) Merging parquet files created in 1.11.1 not possible using 1.12.0

2021-09-27 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang closed PARQUET-2027.


> Merging parquet files created in 1.11.1 not possible using 1.12.0 
> --
>
> Key: PARQUET-2027
> URL: https://issues.apache.org/jira/browse/PARQUET-2027
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Matthew M
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.12.1
>
>
> I have parquet files created using 1.11.1. In the process I join two files 
> (with the same schema) into one output file. I create a Hadoop writer:
> {code:scala}
> val hadoopWriter = new ParquetFileWriter(
>   HadoopOutputFile.fromPath(
> new Path(outputPath.toString),
> new Configuration()
>   ), outputSchema, Mode.OVERWRITE,
>   8 * 1024 * 1024,
>   2097152,
>   DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH,
>   DEFAULT_STATISTICS_TRUNCATE_LENGTH,
>   DEFAULT_PAGE_WRITE_CHECKSUM_ENABLED
> )
> hadoopWriter.start()
> {code}
> and try to append one file into another:
> {code:scala}
> hadoopWriter.appendFile(HadoopInputFile.fromPath(new Path(file), new 
> Configuration()))
> {code}
> Everything works on 1.11.1. But when I switch to 1.12.0 it fails with this 
> error:
> {code:scala}
> STDERR: Exception in thread "main" java.io.IOException: can not read class 
> org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' 
> was not found in serialized data! Struct: 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@b91d8c4
>  at org.apache.parquet.format.Util.read(Util.java:365)
>  at org.apache.parquet.format.Util.readPageHeader(Util.java:132)
>  at org.apache.parquet.format.Util.readPageHeader(Util.java:127)
>  at org.apache.parquet.hadoop.Offsets.readDictionaryPageSize(Offsets.java:75)
>  at org.apache.parquet.hadoop.Offsets.getOffsets(Offsets.java:58)
>  at 
> org.apache.parquet.hadoop.ParquetFileWriter.appendRowGroup(ParquetFileWriter.java:998)
>  at 
> org.apache.parquet.hadoop.ParquetFileWriter.appendRowGroups(ParquetFileWriter.java:918)
>  at 
> org.apache.parquet.hadoop.ParquetFileReader.appendTo(ParquetFileReader.java:888)
>  at 
> org.apache.parquet.hadoop.ParquetFileWriter.appendFile(ParquetFileWriter.java:895)
>  at [...]
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: 
> Required field 'uncompressed_page_size' was not found in serialized data! 
> Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@b91d8c4
>  at 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1108)
>  at 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1019)
>  at org.apache.parquet.format.PageHeader.read(PageHeader.java:896)
>  at org.apache.parquet.format.Util.read(Util.java:362)
>  ... 14 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version

2021-09-27 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang closed PARQUET-2078.


> Failed to read parquet file after writing with the same parquet version
> ---
>
> Key: PARQUET-2078
> URL: https://issues.apache.org/jira/browse/PARQUET-2078
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Nemon Lou
>Assignee: Nemon Lou
>Priority: Critical
> Fix For: 1.13.0, 1.12.1
>
> Attachments: 
> PARQUET_2078_how_to_fix_rowgroup_fileoffset_for_branch_1.12.x.patch, 
> tpcds_customer_footer.json
>
>
> Writing parquet  file with version 1.12.0 in Apache Hive, then read that 
> file, returns the following error:
> {noformat}
> Caused by: java.lang.IllegalStateException: All of the offsets in the split 
> should be found in the file. expected: [4, 133961161] found: 
> [BlockMetaData{1530100, 133961157 [ColumnMetaData{UNCOMPRESSED 
> [c_customer_sk] optional int64 c_customer_sk  [PLAIN, RLE, BIT_PACKED], 4}, 
> ColumnMetaData{UNCOMPRESSED [c_customer_id] optional binary c_customer_id 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 12243647}, ColumnMetaData{UNCOMPRESSED 
> [c_current_cdemo_sk] optional int64 c_current_cdemo_sk  [PLAIN, RLE, 
> BIT_PACKED], 42848491}, ColumnMetaData{UNCOMPRESSED [c_current_hdemo_sk] 
> optional int64 c_current_hdemo_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 54868535}, ColumnMetaData{UNCOMPRESSED [c_current_addr_sk] optional int64 
> c_current_addr_sk  [PLAIN, RLE, BIT_PACKED], 57421932}, 
> ColumnMetaData{UNCOMPRESSED [c_first_shipto_date_sk] optional int64 
> c_first_shipto_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 69694809}, 
> ColumnMetaData{UNCOMPRESSED [c_first_sales_date_sk] optional int64 
> c_first_sales_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 72093040}, 
> ColumnMetaData{UNCOMPRESSED [c_salutation] optional binary c_salutation 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 74461508}, 
> ColumnMetaData{UNCOMPRESSED [c_first_name] optional binary c_first_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 75092758}, 
> ColumnMetaData{UNCOMPRESSED [c_last_name] optional binary c_last_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 77626525}, 
> ColumnMetaData{UNCOMPRESSED [c_preferred_cust_flag] optional binary 
> c_preferred_cust_flag (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 80116456}, ColumnMetaData{UNCOMPRESSED [c_birth_day] optional int32 
> c_birth_day  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 80505351}, 
> ColumnMetaData{UNCOMPRESSED [c_birth_month] optional int32 c_birth_month  
> [RLE, PLAIN_DICTIONARY, BIT_PACKED], 81581772}, ColumnMetaData{UNCOMPRESSED 
> [c_birth_year] optional int32 c_birth_year  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 82473740}, ColumnMetaData{UNCOMPRESSED [c_birth_country] 
> optional binary c_birth_country (STRING)  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 83921564}, ColumnMetaData{UNCOMPRESSED [c_login] optional binary 
> c_login (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 85457674}, 
> ColumnMetaData{UNCOMPRESSED [c_email_address] optional binary c_email_address 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 85460523}, ColumnMetaData{UNCOMPRESSED 
> [c_last_review_date_sk] optional int64 c_last_review_date_sk  [RLE, 
> PLAIN_DICTIONARY, BIT_PACKED], 132146109}]}]
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:172)
>  ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0]
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>  ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:95)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:60)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:89)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.(CombineHiveRecordReader.java:96)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) ~[?:1.8.0_292]
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  ~[?:1.8.0_292]
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ~[?:1.8.0_292]
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 
> ~[?:1.8.0_292]
>   at 
> 

[jira] [Created] (PARQUET-2093) Add rewriter version to Parquet footer

2021-09-20 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2093:


 Summary: Add rewriter version to Parquet footer 
 Key: PARQUET-2093
 URL: https://issues.apache.org/jira/browse/PARQUET-2093
 Project: Parquet
  Issue Type: Improvement
Affects Versions: 1.13.0
Reporter: Xinli Shang
Assignee: Xinli Shang


The Parquet footer records the writer's version in the 'created_by' field. As we 
introduce several rewriters, a new file may be written partially by a rewriter. 
In this case, we need to record the rewriter's version as well. 

Some questions (about a common rewriter) we need to answer before stepping 
forward:

Where should the rewriter versions live? (A new dedicated field or key-value 
metadata? Which key shall we use?)
Shall we also record what the rewriter has done? How?
At what level shall we copy the original created_by field, and at what level 
shall we write the rewriter's version to that field instead? (What different 
levels are possible?)
Once this rewriter field is introduced, any writer-version-dependent fix needs 
to check this field as well, not only created_by.
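One of the options raised above is key-value metadata. A minimal sketch of what recording a rewriter version there could look like, with the footer's key-value pairs represented as a plain Java Map; the key name "parquet.rewriter.version" is an assumption for illustration, not an agreed-upon spec:

```java
import java.util.HashMap;
import java.util.Map;

public class RewriterMetadata {
    // Hypothetical key; the actual key is one of the open questions above.
    static final String REWRITER_KEY = "parquet.rewriter.version";

    // Append this rewriter's version to any versions recorded by earlier
    // rewrites, leaving the original created_by entry untouched.
    static Map<String, String> recordRewriter(Map<String, String> keyValueMetadata,
                                              String rewriterVersion) {
        Map<String, String> updated = new HashMap<>(keyValueMetadata);
        updated.merge(REWRITER_KEY, rewriterVersion, (old, add) -> old + "," + add);
        return updated;
    }
}
```

Appending rather than overwriting preserves the trail when a file is rewritten more than once, while created_by keeps recording the original writer.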



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-2075) Unified Rewriter Tool

2021-09-17 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-2075:
-
External issue URL: 
https://docs.google.com/document/d/1Ryt5uXnp-YwOrsnIDrGdTMoFfbOaTM6X39pBLHOa_50

> Unified Rewriter Tool  
> ---
>
> Key: PARQUET-2075
> URL: https://issues.apache.org/jira/browse/PARQUET-2075
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> During the discussion of PARQUET-2071, we came up with the idea of a 
> universal tool that translates an existing file to a different state while 
> skipping some processing steps, like encoding/decoding, to gain speed. For 
> example, only decompress pages and then compress them directly. For 
> PARQUET-2071, only decrypt and then encrypt directly. This will be useful 
> for onboarding existing data to Parquet features like column encryption, 
> zstd, etc. 
> We already have tools like trans-compression, column pruning, etc. We will 
> consolidate all these tools into this universal tool. 
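The skip-the-inner-steps idea can be sketched as a chain of byte-level passes over each page. This is a conceptual illustration only, not the parquet-mr API; PagePass and rewritePage are invented names:

```java
import java.util.List;
import java.util.function.UnaryOperator;

public class PageRewriter {
    // One page-level pass, e.g. decompress+recompress or decrypt+re-encrypt.
    interface PagePass extends UnaryOperator<byte[]> {}

    // Apply each pass to the raw page bytes; values are never decoded,
    // which is where the speedup over record-level rewriting comes from.
    static byte[] rewritePage(byte[] page, List<PagePass> passes) {
        for (PagePass pass : passes) {
            page = pass.apply(page);
        }
        return page;
    }
}
```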



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-2075) Unified Rewriter Tool

2021-09-17 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-2075:
-
Summary: Unified Rewriter Tool(was: Unified translation tool  )

> Unified Rewriter Tool  
> ---
>
> Key: PARQUET-2075
> URL: https://issues.apache.org/jira/browse/PARQUET-2075
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> During the discussion of PARQUET-2071, we came up with the idea of a 
> universal tool that translates an existing file to a different state while 
> skipping some processing steps, like encoding/decoding, to gain speed. For 
> example, only decompress pages and then compress them directly. For 
> PARQUET-2071, only decrypt and then encrypt directly. This will be useful 
> for onboarding existing data to Parquet features like column encryption, 
> zstd, etc. 
> We already have tools like trans-compression, column pruning, etc. We will 
> consolidate all these tools into this universal tool. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2087) Release parquet-mr 1.12.1

2021-09-17 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang resolved PARQUET-2087.
--
Resolution: Fixed

> Release parquet-mr 1.12.1
> -
>
> Key: PARQUET-2087
> URL: https://issues.apache.org/jira/browse/PARQUET-2087
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2091) Fix release build error introduced by PARQUET-2043

2021-09-17 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416806#comment-17416806
 ] 

Xinli Shang commented on PARQUET-2091:
--

There are no issues in a normal build, but the error shows up when running the 
release command. 

> Fix release build error introduced by PARQUET-2043
> --
>
> Key: PARQUET-2091
> URL: https://issues.apache.org/jira/browse/PARQUET-2091
> Project: Parquet
>  Issue Type: Task
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> After PARQUET-2043, when building for a release like 1.12.1, there is a build 
> error complaining about a 'used undeclared dependency'. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2091) Fix release build error introduced by PARQUET-2043

2021-09-13 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2091:


 Summary: Fix release build error introduced by PARQUET-2043
 Key: PARQUET-2091
 URL: https://issues.apache.org/jira/browse/PARQUET-2091
 Project: Parquet
  Issue Type: Task
Reporter: Xinli Shang
Assignee: Xinli Shang


After PARQUET-2043, when building for a release like 1.12.1, there is a build 
error complaining about a 'used undeclared dependency'. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2087) Release parquet-mr 1.12.0

2021-09-09 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2087:


 Summary: Release parquet-mr 1.12.0
 Key: PARQUET-2087
 URL: https://issues.apache.org/jira/browse/PARQUET-2087
 Project: Parquet
  Issue Type: Task
  Components: parquet-mr
Reporter: Xinli Shang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-2087) Release parquet-mr 1.12.1

2021-09-09 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang reassigned PARQUET-2087:


Assignee: Xinli Shang
Due Date: 18/Sep/21

> Release parquet-mr 1.12.1
> -
>
> Key: PARQUET-2087
> URL: https://issues.apache.org/jira/browse/PARQUET-2087
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-2087) Release parquet-mr 1.12.1

2021-09-09 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-2087:
-
Summary: Release parquet-mr 1.12.1  (was: Release parquet-mr 1.12.0)

> Release parquet-mr 1.12.1
> -
>
> Key: PARQUET-2087
> URL: https://issues.apache.org/jira/browse/PARQUET-2087
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Reporter: Xinli Shang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2082) Encryption translation tool - Parquet-cli

2021-08-30 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2082:


 Summary: Encryption translation tool - Parquet-cli
 Key: PARQUET-2082
 URL: https://issues.apache.org/jira/browse/PARQUET-2082
 Project: Parquet
  Issue Type: Task
Reporter: Xinli Shang


This is to implement the parquet-cli part of the encryption translation tool. 
It integrates with key tools to build the encryption properties, handles the 
parameters, and calls the parquet-hadoop API to encrypt. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2081) Encryption translation tool - Parquet-hadoop

2021-08-30 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2081:


 Summary: Encryption translation tool - Parquet-hadoop
 Key: PARQUET-2081
 URL: https://issues.apache.org/jira/browse/PARQUET-2081
 Project: Parquet
  Issue Type: Task
  Components: parquet-mr
Reporter: Xinli Shang
 Fix For: 1.13.0


This implements the core part of the encryption translation tool in 
parquet-hadoop. After this, we will have another Jira/PR for parquet-cli to 
integrate with key tools for encryption properties. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (PARQUET-2071) Encryption translation tool

2021-08-21 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402670#comment-17402670
 ] 

Xinli Shang edited comment on PARQUET-2071 at 8/21/21, 5:40 PM:


I just drafted the tool and asked [~gershinsky] to have an early look (thanks 
Gidon!). It is working now, and I compared it with a regular tool (I simply 
wrote a tool that reads each record and writes it back immediately; the code 
example is in the 
[doc|https://docs.google.com/document/d/1-XdE8-QyDHnBsYrClwNsR8X3ks0JmKJ1-rXq7_th0hc/edit]). 
The result is promising: it is 20X faster than the regular tool. 

[~gszadovszky] Are you open to having the tool merged in first, and then we 
refactor all the existing similar tools onto the universal tool? If yes, I am 
going to make a PR shortly. 


was (Author: sha...@uber.com):
I just drafted the tool and had [~gershinsky] to have an earlier look(Thanks 
Gidon!). It is working now and I just had a comparison with a regular tool(I 
simply write a tool that read each record and write it back immediately). The 
result is promising that it is 20X faster than the regular tool. 

[~gszadovszky] Are you open to having the tool merge in first and then we 
refactor all the existing similar tools to have the universal tool? If yes, I 
am going to make a PR shortly. 

> Encryption translation tool 
> 
>
> Key: PARQUET-2071
> URL: https://issues.apache.org/jira/browse/PARQUET-2071
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> When translating existing data to an encrypted state, we could develop a tool 
> like TransCompression that translates the data at page level without reading 
> it back to records and rewriting. This will speed up the process a lot. 
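For contrast, a minimal sketch of the record-by-record baseline described in the comment above (the actual code example lives in the linked design doc; Reader and Writer here are generic stand-ins for AvroParquetReader/AvroParquetWriter):

```java
public class RegularRewrite {
    interface Reader<T> { T read(); }          // returns null at end of file
    interface Writer<T> { void write(T rec); }

    // Copy every record from reader to writer, one at a time. Each value is
    // decoded and re-encoded, which is exactly what the page-level tool skips.
    static <T> long copy(Reader<T> reader, Writer<T> writer) {
        long count = 0;
        for (T rec = reader.read(); rec != null; rec = reader.read()) {
            writer.write(rec);
            count++;
        }
        return count;
    }
}
```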



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2071) Encryption translation tool

2021-08-21 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17402670#comment-17402670
 ] 

Xinli Shang commented on PARQUET-2071:
--

I just drafted the tool and asked [~gershinsky] to have an early look (thanks 
Gidon!). It is working now, and I compared it with a regular tool (I simply 
wrote a tool that reads each record and writes it back immediately). The 
result is promising: it is 20X faster than the regular tool. 

[~gszadovszky] Are you open to having the tool merged in first, and then we 
refactor all the existing similar tools onto the universal tool? If yes, I am 
going to make a PR shortly. 

> Encryption translation tool 
> 
>
> Key: PARQUET-2071
> URL: https://issues.apache.org/jira/browse/PARQUET-2071
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> When translating existing data to an encrypted state, we could develop a tool 
> like TransCompression that translates the data at page level without reading 
> it back to records and rewriting. This will speed up the process a lot. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-2071) Encryption translation tool

2021-08-05 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-2071:
-
External issue ID: https://issues.apache.org/jira/browse/PARQUET-2075

> Encryption translation tool 
> 
>
> Key: PARQUET-2071
> URL: https://issues.apache.org/jira/browse/PARQUET-2071
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> When translating existing data to an encrypted state, we could develop a tool 
> like TransCompression that translates the data at page level without reading 
> it back to records and rewriting. This will speed up the process a lot. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-2075) Unified translation tool

2021-08-05 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-2075:
-
External issue ID: https://issues.apache.org/jira/browse/PARQUET-2071

> Unified translation tool  
> --
>
> Key: PARQUET-2075
> URL: https://issues.apache.org/jira/browse/PARQUET-2075
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> During the discussion of PARQUET-2071, we came up with the idea of a 
> universal tool that translates an existing file to a different state while 
> skipping some processing steps, like encoding/decoding, to gain speed. For 
> example, only decompress pages and then compress them directly. For 
> PARQUET-2071, only decrypt and then encrypt directly. This will be useful 
> for onboarding existing data to Parquet features like column encryption, 
> zstd, etc. 
> We already have tools like trans-compression, column pruning, etc. We will 
> consolidate all these tools into this universal tool. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2071) Encryption translation tool

2021-08-05 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394098#comment-17394098
 ] 

Xinli Shang commented on PARQUET-2071:
--

Thanks, Gabor and Gidon! I think a 'universal tool' that can be loaded for 
different use cases is a good idea. I opened 
https://issues.apache.org/jira/browse/PARQUET-2075 for it. 

> Encryption translation tool 
> 
>
> Key: PARQUET-2071
> URL: https://issues.apache.org/jira/browse/PARQUET-2071
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> When translating existing data to an encrypted state, we could develop a tool 
> like TransCompression that translates the data at page level without reading 
> it back to records and rewriting. This will speed up the process a lot. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2075) Unified translation tool

2021-08-05 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2075:


 Summary: Unified translation tool  
 Key: PARQUET-2075
 URL: https://issues.apache.org/jira/browse/PARQUET-2075
 Project: Parquet
  Issue Type: New Feature
Reporter: Xinli Shang
Assignee: Xinli Shang


During the discussion of PARQUET-2071, we came up with the idea of a universal 
tool that translates an existing file to a different state while skipping some 
processing steps, like encoding/decoding, to gain speed. For example, only 
decompress pages and then compress them directly. For PARQUET-2071, only 
decrypt and then encrypt directly. This will be useful for onboarding existing 
data to Parquet features like column encryption, zstd, etc. 

We already have tools like trans-compression, column pruning, etc. We will 
consolidate all these tools into this universal tool. 





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-2071) Encryption translation tool

2021-08-04 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-2071:
-
External issue URL: 
https://docs.google.com/document/d/1-XdE8-QyDHnBsYrClwNsR8X3ks0JmKJ1-rXq7_th0hc/edit#

> Encryption translation tool 
> 
>
> Key: PARQUET-2071
> URL: https://issues.apache.org/jira/browse/PARQUET-2071
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> When translating existing data to an encrypted state, we could develop a tool 
> like TransCompression that translates the data at page level without reading 
> it back to records and rewriting. This will speed up the process a lot. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2071) Encryption translation tool

2021-08-04 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2071:


 Summary: Encryption translation tool 
 Key: PARQUET-2071
 URL: https://issues.apache.org/jira/browse/PARQUET-2071
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-mr
Reporter: Xinli Shang
Assignee: Xinli Shang


When translating existing data to an encrypted state, we could develop a tool 
like TransCompression that translates the data at page level without reading 
it back to records and rewriting. This will speed up the process a lot. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2064) Make Range public accessible in RowRanges

2021-07-12 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17379217#comment-17379217
 ] 

Xinli Shang commented on PARQUET-2064:
--

[~gszadovszky], do you have suggestions on how to proceed? The reality is that 
Spark/Hive use lower-level APIs that were not designed for this, and it is now 
a blocker for the column index rollout.

> Make Range public accessible in RowRanges
> -
>
> Key: PARQUET-2064
> URL: https://issues.apache.org/jira/browse/PARQUET-2064
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> When rolling out to Presto, I found we need to know the boundaries of each 
> Range in RowRanges. It is still doable with the Iterator, but Presto has a 
> batch reader, so we cannot use an iterator for each row. 
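A sketch of why exposed boundaries help a batch reader: with only a row iterator, membership must be tested row by row, whereas public from/to lets the reader clip whole batches. RowRange here is an illustrative stand-in, not the actual parquet-mr Range class:

```java
public class RowRange {
    public final long from; // first matching row, inclusive
    public final long to;   // last matching row, inclusive

    public RowRange(long from, long to) {
        this.from = from;
        this.to = to;
    }

    // Number of rows of the batch [batchStart, batchStart + batchSize)
    // that fall inside this range, computed without any per-row iteration.
    public long overlap(long batchStart, long batchSize) {
        long lo = Math.max(from, batchStart);
        long hi = Math.min(to, batchStart + batchSize - 1);
        return Math.max(0, hi - lo + 1);
    }
}
```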



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2064) Make Range public accessible in RowRanges

2021-07-09 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2064:


 Summary: Make Range public accessible in RowRanges
 Key: PARQUET-2064
 URL: https://issues.apache.org/jira/browse/PARQUET-2064
 Project: Parquet
  Issue Type: New Feature
Reporter: Xinli Shang
Assignee: Xinli Shang


When rolling out to Presto, I found we need to know the boundaries of each 
Range in RowRanges. It is still doable with the Iterator, but Presto has a 
batch reader, so we cannot use an iterator for each row. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2062) Data masking(null) for column encryption

2021-07-05 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374892#comment-17374892
 ] 

Xinli Shang commented on PARQUET-2062:
--

Great idea!



> Data masking(null) for column encryption 
> -
>
> Key: PARQUET-2062
> URL: https://issues.apache.org/jira/browse/PARQUET-2062
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> When a user doesn't have permission on a column that is encrypted by the 
> column encryption feature (PARQUET-1178), returning a masked value could 
> avoid an exception and let the call succeed. 
> We would like to introduce data masking with null values. The idea is that 
> when the user gets key access denied and can accept null (via a reading 
> option flag), we return null for the encrypted columns. This solution 
> doesn't need to store extra columns for masked values and doesn't need to 
> translate existing data. 
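The null-masking idea above can be sketched as follows; KeyAccessDeniedException and the nullOnDenied flag are stand-ins for the real parquet-mr exception type and reading option:

```java
import java.util.function.Supplier;

public class NullMasking {
    // Stand-in for the exception raised when the column key cannot be accessed.
    static class KeyAccessDeniedException extends RuntimeException {}

    // Return the decrypted value; on key-access denial, return null if the
    // reader opted in, otherwise surface the error as before.
    static <T> T readColumnValue(Supplier<T> decryptingReader, boolean nullOnDenied) {
        try {
            return decryptingReader.get();
        } catch (KeyAccessDeniedException e) {
            if (nullOnDenied) {
                return null; // mask with null per the reading option flag
            }
            throw e;         // default behavior: the call still fails
        }
    }
}
```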



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli

2021-07-01 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-1792:
-
Fix Version/s: 1.12.0

> Add 'mask' command to parquet-tools/parquet-cli
> ---
>
> Key: PARQUET-1792
> URL: https://issues.apache.org/jira/browse/PARQUET-1792
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> Some personal data columns need to be masked instead of being pruned 
> (PARQUET-1791). We need a tool to replace the raw data columns with masked 
> values. The masked value could be a hash, null, a redacted value, etc. The 
> unchanged columns should be copied as a whole, like the 'merge' and 'prune' 
> commands in parquet-tools do. 
>  
> Implementing this feature at the file-format level is 10X faster than doing 
> it by rewriting the table data in the query engine. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1681) Avro's isElementType() change breaks the reading of some parquet(1.8.1) files

2021-07-01 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372862#comment-17372862
 ] 

Xinli Shang commented on PARQUET-1681:
--

We chose to revert the behavior back to 1.8.1's. It has been running fine for 
a year or so. We will port the changes soon. 

> Avro's isElementType() change breaks the reading of some parquet(1.8.1) files
> -
>
> Key: PARQUET-1681
> URL: https://issues.apache.org/jira/browse/PARQUET-1681
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Affects Versions: 1.10.0, 1.9.1, 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Critical
>
> When using the Avro schema below to write a parquet(1.8.1) file and then read 
> it back using parquet 1.10.1 without passing any schema, the reading throws 
> an exception "XXX is not a group". Reading through parquet 1.8.1 is fine. 
> {
>   "name": "phones",
>   "type": [
>     "null",
>     {
>       "type": "array",
>       "items": {
>         "type": "record",
>         "name": "phones_items",
>         "fields": [
>           {
>             "name": "phone_number",
>             "type": ["null", "string"],
>             "default": null
>           }
>         ]
>       }
>     }
>   ],
>   "default": null
> }
> The code to read is as below:
> val reader = AvroParquetReader.builder[SomeRecordType](parquetPath)
>   .withConf(new Configuration)
>   .build()
> reader.read()
> PARQUET-651 changed the method isElementType() to rely on Avro's 
> checkReaderWriterCompatibility() for the compatibility check. However, 
> checkReaderWriterCompatibility() considers the Parquet schema and the Avro 
> schema (converted from the file schema) not compatible (the name in the Avro 
> schema is 'phones_items', but the name is 'array' in the Parquet schema, 
> hence not compatible). So it returns false, which causes the "phone_number" 
> field in the above schema to be treated as a group type, which it is not. 
> Then the exception is thrown from .asGroupType(). 
> I didn't verify whether writing via parquet 1.10.1 reproduces the same 
> problem, but it could, because the translation of the Avro schema to the 
> Parquet schema has not changed (not verified yet). 
> I hesitate to revert PARQUET-651 because it solved several problems. I would 
> like to hear the community's thoughts on it. 
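A simplified stand-in for the failing check described above: Avro's compatibility logic compares record names, so "phones_items" (from the Avro schema) vs "array" (from the Parquet-converted schema) reads as incompatible, and the repeated type is then wrongly treated as a group wrapper. This is an illustration of the mechanism, not Avro's actual implementation:

```java
public class ElementTypeCheck {
    // Record names must match for Avro schema compatibility;
    // "phones_items" vs "array" fails this check.
    static boolean namesCompatible(String avroName, String parquetName) {
        return avroName.equals(parquetName);
    }

    // PARQUET-651-style decision: when the compatibility check fails, the
    // repeated type is assumed to be a 3-level list wrapper (a group),
    // which is the wrong conclusion for the schema above.
    static boolean treatAsGroup(String avroName, String parquetName) {
        return !namesCompatible(avroName, parquetName);
    }
}
```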



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2062) Data masking(null) for column encryption

2021-06-30 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2062:


 Summary: Data masking(null) for column encryption 
 Key: PARQUET-2062
 URL: https://issues.apache.org/jira/browse/PARQUET-2062
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-mr
Reporter: Xinli Shang
Assignee: Xinli Shang


When a user doesn't have permission on a column that is encrypted by the column 
encryption feature (PARQUET-1178), returning a masked value could avoid an 
exception and let the call succeed. 

We would like to introduce data masking with null values. The idea is that when 
the user gets key access denied and can accept null (via a reading option 
flag), we return null for the encrypted columns. This solution doesn't need to 
store extra columns for masked values and doesn't need to translate existing 
data. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2054) TCP connection leaking when calling appendFile()

2021-06-01 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2054:


 Summary: TCP connection leaking when calling appendFile()
 Key: PARQUET-2054
 URL: https://issues.apache.org/jira/browse/PARQUET-2054
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-mr
Reporter: Xinli Shang


When appendFile() is called, the input file's reader is opened but not closed. 
This causes many TCP connections to be leaked. 
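The usual fix pattern for such a leak is to scope the opened reader with try-with-resources so it is closed even on failure. A sketch with an invented PageSource interface standing in for the input-file reader, not the actual ParquetFileWriter internals:

```java
public class AppendFix {
    // Stand-in for the reader opened over the input file during append.
    interface PageSource extends AutoCloseable {
        void copyTo(StringBuilder sink);
        @Override
        void close(); // no checked exception, for brevity
    }

    // try-with-resources guarantees close() runs whether or not the copy
    // succeeds, preventing the connection leak described above.
    static void appendFile(PageSource source, StringBuilder sink) {
        try (PageSource s = source) {
            s.copyTo(sink);
        }
    }
}
```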



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-05-25 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17351199#comment-17351199
 ] 

Xinli Shang commented on PARQUET-1968:
--

Go ahead to work on it. Thanks Huaxin!

> FilterApi support In predicate
> --
>
> Key: PARQUET-1968
> URL: https://issues.apache.org/jira/browse/PARQUET-1968
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Yuming Wang
>Priority: Major
>
> FilterApi should support native In predicate.
> Spark:
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605
> Impala:
> https://issues.apache.org/jira/browse/IMPALA-3654
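For statistics-based pruning, a native in predicate can drop a row group only when none of the candidate values falls within the column's [min, max]. A sketch of that containment test with illustrative names, not the FilterApi implementation:

```java
import java.util.Set;

public class InPredicate {
    // true if the row group might contain any of the candidate values,
    // i.e. it cannot be pruned using min/max statistics alone.
    static boolean canMatch(Set<Long> values, long min, long max) {
        for (long v : values) {
            if (v >= min && v <= max) {
                return true;
            }
        }
        return false;
    }
}
```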



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1827) UUID type currently not supported by parquet-mr

2021-04-01 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17313524#comment-17313524
 ] 

Xinli Shang commented on PARQUET-1827:
--

It seems the storage size is reduced by ~8%  for the UUID column. 

> UUID type currently not supported by parquet-mr
> ---
>
> Key: PARQUET-1827
> URL: https://issues.apache.org/jira/browse/PARQUET-1827
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Brad Smith
>Assignee: Gabor Szadovszky
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.12.0
>
>
> The parquet-format project introduced a new UUID logical type in version 2.4:
> [https://github.com/apache/parquet-format/blob/master/CHANGES.md]
> This would be a useful type to have available in some circumstances, but it 
> currently isn't supported in the parquet-mr library. Hopefully this feature 
> can be implemented at some point.





[jira] [Created] (PARQUET-2006) Column resolution by ID

2021-03-23 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-2006:


 Summary: Column resolution by ID
 Key: PARQUET-2006
 URL: https://issues.apache.org/jira/browse/PARQUET-2006
 Project: Parquet
  Issue Type: New Feature
  Components: parquet-mr
Reporter: Xinli Shang
Assignee: Xinli Shang


Parquet resolves columns by name. In many usages, e.g. schema resolution, this 
is a problem. Iceberg uses field IDs and stores ID/name mappings. 

This Jira is to add column resolution by ID. 
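A toy illustration of the difference (not parquet-mr API): resolving a column through a stored id-to-name mapping, the way Iceberg does, survives a rename, while name-based resolution would break. All names here are made up:

```java
import java.util.HashMap;
import java.util.Map;

public class IdResolutionSketch {
    // Resolve the current column name for a field ID via a stored mapping;
    // returns null when the ID is unknown.
    static String resolveById(Map<Integer, String> idToName, int fieldId) {
        return idToName.get(fieldId);
    }

    public static void main(String[] args) {
        Map<Integer, String> idToName = new HashMap<>();
        idToName.put(1, "user_id");
        idToName.put(2, "email");
        // schema evolution: column 2 is renamed; the ID stays stable
        idToName.put(2, "email_address");
        System.out.println(resolveById(idToName, 2)); // email_address
        // a name-based lookup for the old name "email" would now find nothing
    }
}
```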





[jira] [Commented] (PARQUET-1992) Cannot build from tarball because of git submodules

2021-03-05 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17296124#comment-17296124
 ] 

Xinli Shang commented on PARQUET-1992:
--

I think 'mvn package/install' or 'mvn verify' shouldn't fail for developers who 
haven't made any changes. So I like the idea of downloading directly. I will 
review the code once it passes the build. 

> Cannot build from tarball because of git submodules
> ---
>
> Key: PARQUET-1992
> URL: https://issues.apache.org/jira/browse/PARQUET-1992
> Project: Parquet
>  Issue Type: Bug
>Reporter: Gabor Szadovszky
>Priority: Blocker
>
> Because we use git submodules (to get test parquet files) a simple "mvn clean 
> install" fails from the unpacked tarball due to "not a git repository".
> I think we would have 2 options to solve this situation:
> * Include all the required files (even only for testing) in the tarball and 
> somehow avoid the git submodule update in case it is executed in a non-git 
> environment
> * Make the downloading of the parquet files and the related tests optional so 
> it won't fail the build from the tarball





[jira] [Commented] (PARQUET-1948) TransCompressionCommand Inoperable

2021-02-18 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17286825#comment-17286825
 ] 

Xinli Shang commented on PARQUET-1948:
--

[~vanhooser], glad to see you are interested in this tool. We have been 
using it to translate GZIP to ZSTD for existing Parquet files. Let me know if 
you hit any issues. 

> TransCompressionCommand Inoperable
> --
>
> Key: PARQUET-1948
> URL: https://issues.apache.org/jira/browse/PARQUET-1948
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.1
> Environment: I am using parquet-tools 1.11.1 on a Mac machine running 
> Catalina, and my parquet-tools jar was downloaded from Maven Central. 
>Reporter: Shelby Vanhooser
>Priority: Blocker
>  Labels: parquet-tools
>
> {{TransCompressionCommand}} in parquet-tools is intended to allow translation 
> of compression types in parquet files.  We are intending to use this 
> functionality to debug a corrupted file, but this command fails to run at the 
> moment entirely. 
> Running the following command (on the uncorrupted file):
> {code:java}
> java -jar ./parquet-tools-1.11.1.jar trans-compression 
> ~/Downloads/part-00048-69f65188-94b5-4772-8906-5c78989240b5_00048.c000.snappy.parquet{code}
> This results in 
>  
> {code:java}
> Unknown command: trans-compression{code}
>  
> I believe this is due to the Registry class [silently catching any errors to 
> initialize|https://github.com/apache/parquet-mr/blob/master/parquet-tools/src/main/java/org/apache/parquet/tools/command/Registry.java#L65]
>  which subsequently is [misinterpreted as an unknown 
> command|https://github.com/apache/parquet-mr/blob/master/parquet-tools/src/main/java/org/apache/parquet/tools/Main.java#L200].
> We need to: 
>  # Write a test for the TransCompressionCommand to figure out why it's 
> showing up as unknown command
>  # Probably expand these tests to cover all the other commands
>  
> This will then unblock our debugging work on the suspect file. 





[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-02-01 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276664#comment-17276664
 ] 

Xinli Shang commented on PARQUET-1968:
--

Sure, will connect with you shortly. 

> FilterApi support In predicate
> --
>
> Key: PARQUET-1968
> URL: https://issues.apache.org/jira/browse/PARQUET-1968
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Yuming Wang
>Priority: Major
>
> FilterApi should support native In predicate.
> Spark:
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605
> Impala:
> https://issues.apache.org/jira/browse/IMPALA-3654





[jira] [Commented] (PARQUET-1968) FilterApi support In predicate

2021-02-01 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17276533#comment-17276533
 ] 

Xinli Shang commented on PARQUET-1968:
--

Hi [~rdblue]. We didn't discuss it in last week's Parquet sync meeting since 
you were not there.  The next Parquet sync is Feb 23rd at 9:00am. I just added 
you explicitly with your Netflix email account. Hopefully, you can join. 

> FilterApi support In predicate
> --
>
> Key: PARQUET-1968
> URL: https://issues.apache.org/jira/browse/PARQUET-1968
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Yuming Wang
>Priority: Major
>
> FilterApi should support native In predicate.
> Spark:
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605
> Impala:
> https://issues.apache.org/jira/browse/IMPALA-3654





[jira] [Updated] (PARQUET-1949) Mark Parquet-1872 with not support bloom filter yet

2021-01-10 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-1949:
-
Summary: Mark Parquet-1872 with not support bloom filter yet   (was: Mark 
Parquet-1872 with note support bloom filter yet )

> Mark Parquet-1872 with not support bloom filter yet 
> 
>
> Key: PARQUET-1949
> URL: https://issues.apache.org/jira/browse/PARQUET-1949
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> To unblock the release of 1.12.0, we need to add comments in the 
> trans-compression command to indicate that bloom filters are not supported yet.





[jira] [Commented] (PARQUET-1872) Add TransCompression command

2020-12-04 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244334#comment-17244334
 ] 

Xinli Shang commented on PARQUET-1872:
--

Thanks [~gszadovszky] for working on this! I just created the PR to add 
comments in the command. Once you review it and we merge, I will resolve this 
Jira. 

> Add TransCompression command 
> -
>
> Key: PARQUET-1872
> URL: https://issues.apache.org/jira/browse/PARQUET-1872
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> As ZSTD becomes more popular, there is a need to translate existing data to 
> ZSTD compression, which can achieve a higher compression ratio. It would be 
> useful to have a tool that converts a Parquet file directly by just 
> decompressing/recompressing each page, without decoding/encoding or assembling 
> records, because it is much faster. Initial results show it is ~5 times 
> faster.
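The page-level transcode idea can be sketched with stdlib codecs (GZIP to Deflate stands in for GZIP to ZSTD; this is not the parquet-mr tool itself): the page bytes are decompressed and recompressed, but the encoded values inside are never decoded, which is why the conversion is fast.

```java
import java.io.*;
import java.util.zip.*;

public class TranscodePage {
    static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) { gz.write(raw); }
        return bos.toByteArray();
    }

    static byte[] gunzip(byte[] page) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(page))) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) > 0) bos.write(buf, 0, n);
            return bos.toByteArray();
        }
    }

    static byte[] deflate(byte[] raw) {
        Deflater d = new Deflater();
        d.setInput(raw);
        d.finish();
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) bos.write(buf, 0, d.deflate(buf));
        return bos.toByteArray();
    }

    // Transcode one "page": decompress, then recompress with the new codec.
    // The encoded column values inside `raw` are never touched.
    static byte[] transcode(byte[] gzipPage) throws IOException {
        return deflate(gunzip(gzipPage));
    }

    public static void main(String[] args) throws Exception {
        byte[] values = "encoded column values".getBytes("UTF-8");
        byte[] out = transcode(gzip(values));
        // round-trip check: inflating the new page yields the original bytes
        Inflater inf = new Inflater();
        inf.setInput(out);
        byte[] back = new byte[values.length];
        inf.inflate(back);
        System.out.println(new String(back, "UTF-8"));
    }
}
```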





[jira] [Created] (PARQUET-1949) Mark Parquet-1872 with note support bloom filter yet

2020-12-04 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-1949:


 Summary: Mark Parquet-1872 with note support bloom filter yet 
 Key: PARQUET-1949
 URL: https://issues.apache.org/jira/browse/PARQUET-1949
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.12.0
Reporter: Xinli Shang
Assignee: Xinli Shang
 Fix For: 1.12.0


To unblock the release of 1.12.0, we need to add comments in the 
trans-compression command to indicate that bloom filters are not supported yet.





[jira] [Commented] (PARQUET-1901) Add filter null check for ColumnIndex

2020-12-02 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242634#comment-17242634
 ] 

Xinli Shang commented on PARQUET-1901:
--

For now, I think we can move it to the next release. 

> Add filter null check for ColumnIndex  
> ---
>
> Key: PARQUET-1901
> URL: https://issues.apache.org/jira/browse/PARQUET-1901
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> This Jira is opened to discuss whether we should add a null check for the 
> filter when ColumnIndex is enabled. 
> In the ColumnIndexFilter#calculateRowRanges() method, the input parameter 
> 'filter' is assumed to be non-null without checking. It throws NPE when 
> ColumnIndex is enabled(by default) but there is no filter set in the 
> ParquetReadOptions. The call stack is as below. 
> java.lang.NullPointerException
> at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:961)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:891)
> If we don't add it, the user might need to choose between calling 
> readNextRowGroup() and readNextFilteredRowGroup() based on whether a filter exists. 
> Thoughts?  
>   
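A minimal model of the proposed guard (illustrative names, not the actual parquet-mr signatures): when no filter was set, fall back to "all rows" instead of dereferencing a null filter, avoiding the NPE in the stack trace above.

```java
public class NullFilterGuardSketch {
    // Stand-in for a row filter; the real code deals with RowRanges and
    // ColumnIndexFilter.calculateRowRanges().
    interface RowFilter { boolean keep(long row); }

    // With the guard, a null filter means "no filtering": every row survives.
    static long matchingRows(RowFilter filter, long rowCount) {
        if (filter == null) {
            return rowCount;
        }
        long kept = 0;
        for (long r = 0; r < rowCount; r++) {
            if (filter.keep(r)) kept++;
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(matchingRows(null, 100));            // 100
        System.out.println(matchingRows(r -> r % 2 == 0, 100)); // 50
    }
}
```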





[jira] [Comment Edited] (PARQUET-1927) ColumnIndex should provide number of records skipped

2020-12-02 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242631#comment-17242631
 ] 

Xinli Shang edited comment on PARQUET-1927 at 12/2/20, 7:05 PM:


It is still undecided as of the last Iceberg meeting. But if adding the 
'skipped number of records' is a minimal change for us, I think we can go ahead 
and add it. Otherwise, we can release without it. 

Add [~rdblue] for FYI


was (Author: sha...@uber.com):
It is still not decided yet in the last Iceberg meeting. But I think if adding 
the 'skipped number of records' is minimal for us,  we can go ahead just to add 
it. Otherwise, we can release without this. 

> ColumnIndex should provide number of records skipped 
> -
>
> Key: PARQUET-1927
> URL: https://issues.apache.org/jira/browse/PARQUET-1927
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found we need to know from Parquet 
> that how many records that we skipped due to ColumnIndex filtering. When 
> rowCount is 0, readNextFilteredRowGroup() just advance to next without 
> telling the caller. See code here 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> In Iceberg, it reads Parquet record with an iterator. The hasNext() has the 
> following code():
> valuesRead + skippedValues < totalValues
> See 
> ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).]
>  
> So without knowing the skipped values, it is hard to determine hasNext() or 
> not. 
>  
> Currently, we can work around this by using a flag. When readNextFilteredRowGroup() 
> returns null, we consider the whole file done. Then hasNext() just 
> returns false. 
>  
>  
>  
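The termination test quoted above depends on a skipped-record count; a toy model (not Iceberg's actual class) shows why omitting it breaks hasNext():

```java
public class HasNextSketch {
    // Iceberg-style termination test: the iterator is exhausted only when
    // read + skipped records account for every record in the file.
    static boolean hasNext(long valuesRead, long skippedValues, long totalValues) {
        return valuesRead + skippedValues < totalValues;
    }

    public static void main(String[] args) {
        long total = 1000, read = 400, skipped = 600; // 600 skipped by ColumnIndex
        System.out.println(hasNext(read, skipped, total)); // false: correctly done
        // if the reader never reports the skipped records:
        System.out.println(hasNext(read, 0, total));       // true: caller never stops
    }
}
```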





[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

2020-12-02 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242631#comment-17242631
 ] 

Xinli Shang commented on PARQUET-1927:
--

It is still undecided as of the last Iceberg meeting. But if adding the 
'skipped number of records' is a minimal change for us, I think we can go ahead 
and add it. Otherwise, we can release without it. 

> ColumnIndex should provide number of records skipped 
> -
>
> Key: PARQUET-1927
> URL: https://issues.apache.org/jira/browse/PARQUET-1927
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found we need to know from Parquet 
> that how many records that we skipped due to ColumnIndex filtering. When 
> rowCount is 0, readNextFilteredRowGroup() just advance to next without 
> telling the caller. See code here 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> In Iceberg, it reads Parquet record with an iterator. The hasNext() has the 
> following code():
> valuesRead + skippedValues < totalValues
> See 
> ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).]
>  
> So without knowing the skipped values, it is hard to determine hasNext() or 
> not. 
>  
> Currently, we can work around this by using a flag. When readNextFilteredRowGroup() 
> returns null, we consider the whole file done. Then hasNext() just 
> returns false. 
>  
>  
>  





[jira] [Commented] (PARQUET-1666) Remove Unused Modules

2020-12-02 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242625#comment-17242625
 ] 

Xinli Shang commented on PARQUET-1666:
--

I think adding "-deprecated" is a good idea. 

[~zhenxiao], can you help us find out whether dropping the parquet-scrooge 
module from the parquet-mr repo is OK for Twitter's usage? 

> Remove Unused Modules 
> --
>
> Key: PARQUET-1666
> URL: https://issues.apache.org/jira/browse/PARQUET-1666
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> In the last two meetings, Ryan Blue proposed to remove some unused Parquet 
> modules. This is to open a task to track it. 
> Here are the related meeting notes for the discussion on this. 
> Remove old Parquet modules
> Hive modules - sounds good
> Scrooge - Julien will reach out to Twitter
> Tools - undecided - Cloudera may still use the parquet-tools according to 
> Gabor.
> Cascading - undecided
> We can mark the module as deprecated in its description.





[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

2020-11-04 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226089#comment-17226089
 ] 

Xinli Shang commented on PARQUET-1927:
--

[~gszadovszky], I just realized the RowGroupFilter only applies the stats from 
ColumnChunkMetaData instead of page-level stats.  There is a chance that the 
ColumnChunkMetaData stats say yes, but the page-level stats say no. In that 
case, readNextFilteredRowGroup() can still skip the block. 

> ColumnIndex should provide number of records skipped 
> -
>
> Key: PARQUET-1927
> URL: https://issues.apache.org/jira/browse/PARQUET-1927
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found we need to know from Parquet 
> that how many records that we skipped due to ColumnIndex filtering. When 
> rowCount is 0, readNextFilteredRowGroup() just advance to next without 
> telling the caller. See code here 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> In Iceberg, it reads Parquet record with an iterator. The hasNext() has the 
> following code():
> valuesRead + skippedValues < totalValues
> See 
> ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).]
>  
> So without knowing the skipped values, it is hard to determine hasNext() or 
> not. 
>  
> Currently, we can work around this by using a flag. When readNextFilteredRowGroup() 
> returns null, we consider the whole file done. Then hasNext() just 
> returns false. 
>  
>  
>  





[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

2020-10-27 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221481#comment-17221481
 ] 

Xinli Shang commented on PARQUET-1927:
--

Thanks [~gszadovszky] for the explanation. I see it now. The confusing part is 
that Iceberg creates the ParquetFileReader object without passing in the 
filter. Instead, it reimplements RowGroup and Dictionary filtering. 

Hi [~rdblue], do you know why Iceberg reimplements the RowGroup and Dictionary 
filtering? From what Gabor mentioned above, if we pass the filter to the 
ParquetFileReader constructor, all the row groups that we need to deal with 
later are already filtered. When we upgrade to 1.12.0, bloom filters will be 
automatically applied to those row groups. 



> ColumnIndex should provide number of records skipped 
> -
>
> Key: PARQUET-1927
> URL: https://issues.apache.org/jira/browse/PARQUET-1927
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found we need to know from Parquet 
> that how many records that we skipped due to ColumnIndex filtering. When 
> rowCount is 0, readNextFilteredRowGroup() just advance to next without 
> telling the caller. See code here 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> In Iceberg, it reads Parquet record with an iterator. The hasNext() has the 
> following code():
> valuesRead + skippedValues < totalValues
> See 
> ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).]
>  
> So without knowing the skipped values, it is hard to determine hasNext() or 
> not. 
>  
> Currently, we can work around this by using a flag. When readNextFilteredRowGroup() 
> returns null, we consider the whole file done. Then hasNext() just 
> returns false. 
>  
>  
>  





[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

2020-10-26 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17221036#comment-17221036
 ] 

Xinli Shang commented on PARQUET-1927:
--

ParquetFileReader.getFilteredRecordCount() cannot be used because Iceberg also 
applies the RowGroup stats filter and the Dictionary filter itself.

I think what we can do is make getRowRanges() public. Iceberg can then call 
getRowRanges() to calculate the filteredRecordCount for each row group that is 
determined (by the RowGroup stats and Dictionary filters) to be read. 
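If getRowRanges() were public, the filtered record count for a row group could be derived by summing the surviving row ranges. A toy sketch, with inclusive long[]{from, to} pairs standing in for parquet-mr's RowRanges type:

```java
import java.util.Arrays;
import java.util.List;

public class RowRangesSketch {
    // Sum the sizes of the surviving row ranges of one row group; each range
    // is an inclusive [from, to] pair.
    static long filteredRecordCount(List<long[]> ranges) {
        long count = 0;
        for (long[] r : ranges) {
            count += r[1] - r[0] + 1;
        }
        return count;
    }

    public static void main(String[] args) {
        // rows 0-99 and 500-549 survive the column-index filter
        List<long[]> ranges = Arrays.asList(new long[]{0, 99}, new long[]{500, 549});
        System.out.println(filteredRecordCount(ranges)); // 150
    }
}
```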

> ColumnIndex should provide number of records skipped 
> -
>
> Key: PARQUET-1927
> URL: https://issues.apache.org/jira/browse/PARQUET-1927
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found we need to know from Parquet 
> that how many records that we skipped due to ColumnIndex filtering. When 
> rowCount is 0, readNextFilteredRowGroup() just advance to next without 
> telling the caller. See code here 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> In Iceberg, it reads Parquet record with an iterator. The hasNext() has the 
> following code():
> valuesRead + skippedValues < totalValues
> See 
> ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).]
>  
> So without knowing the skipped values, it is hard to determine hasNext() or 
> not. 
>  
> Currently, we can work around this by using a flag. When readNextFilteredRowGroup() 
> returns null, we consider the whole file done. Then hasNext() just 
> returns false. 
>  
>  
>  





[jira] [Assigned] (PARQUET-1927) ColumnIndex should provide number of records skipped

2020-10-23 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang reassigned PARQUET-1927:


Assignee: Xinli Shang

> ColumnIndex should provide number of records skipped 
> -
>
> Key: PARQUET-1927
> URL: https://issues.apache.org/jira/browse/PARQUET-1927
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found we need to know from Parquet 
> that how many records that we skipped due to ColumnIndex filtering. When 
> rowCount is 0, readNextFilteredRowGroup() just advance to next without 
> telling the caller. See code here 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> In Iceberg, it reads Parquet record with an iterator. The hasNext() has the 
> following code():
> valuesRead + skippedValues < totalValues
> See 
> ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).]
>  
> So without knowing the skipped values, it is hard to determine hasNext() or 
> not. 
>  
> Currently, we can work around this by using a flag. When readNextFilteredRowGroup() 
> returns null, we consider the whole file done. Then hasNext() just 
> returns false. 
>  
>  
>  





[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

2020-10-22 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17219293#comment-17219293
 ] 

Xinli Shang commented on PARQUET-1927:
--

[~gszadovszky], the problem is that when rowCount is 0 (line 966, 
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L966),
 readNextFilteredRowGroup() just calls advanceToNextBlock() and then 
recurses to the next row group. In that case, the count returned by 
[PageReadStore.getRowCount()|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/page/PageReadStore.java#L44]
 will be the filtered count of the next row group. Iceberg has no way to know 
which row group these row counts come from; it has to assume they are from the 
previous group. The result is a wrong count, and the Iceberg iterator keeps 
returning true from hasNext() even after all the records are read. 

The fix could simply be to add a skipped-record count that also includes the 
records skipped as whole row groups. 

 

> ColumnIndex should provide number of records skipped 
> -
>
> Key: PARQUET-1927
> URL: https://issues.apache.org/jira/browse/PARQUET-1927
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found we need to know from Parquet 
> that how many records that we skipped due to ColumnIndex filtering. When 
> rowCount is 0, readNextFilteredRowGroup() just advance to next without 
> telling the caller. See code here 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> In Iceberg, it reads Parquet record with an iterator. The hasNext() has the 
> following code():
> valuesRead + skippedValues < totalValues
> See 
> ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).]
>  
> So without knowing the skipped values, it is hard to determine hasNext() or 
> not. 
>  
> Currently, we can work around this by using a flag. When readNextFilteredRowGroup() 
> returns null, we consider the whole file done. Then hasNext() just 
> returns false. 
>  
>  
>  





[jira] [Commented] (PARQUET-1396) Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory

2020-10-21 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218717#comment-17218717
 ] 

Xinli Shang commented on PARQUET-1396:
--

Most of the functionality of this Jira has been addressed by "PARQUET-1817: 
Crypto Properties Factory". Hence, the name of this Jira is changed to 'Example 
of using EncryptionPropertiesFactory and DecryptionPropertiesFactory'. 

> Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory
> 
>
> Key: PARQUET-1396
> URL: https://issues.apache.org/jira/browse/PARQUET-1396
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.10.0, 1.10.1
>Reporter: Xinli Shang
>Priority: Major
>  Labels: pull-request-available
>
> This JIRA is an extension to Parquet Modular Encryption Jira(PARQUET-1178) 
> that will provide the basic building blocks and APIs for the encryption 
> support. 
> This JIRA provides a crypto data interface for schema activation of Parquet 
> encryption and serves as a high-level layer on top of PARQUET-1178 to make 
> the adoption of Parquet-1178 easier, with pluggable key access module, 
> without a need to use the low-level encryption APIs. Also, this feature will 
> enable seamless integration with existing clients.
> No change to specifications (Parquet-format), no new Parquet APIs, and no 
> changes in existing Parquet APIs. All current applications, tests, etc, will 
> work.
> From the developer's perspective, they can just implement the interface in a 
> plugin that can be attached to any Parquet application like Hive/Spark. 
> This decouples the complexity of dealing with the KMS and schemas from Parquet 
> applications. A large organization may have hundreds or even thousands 
> of Parquet applications and pipelines. The decoupling makes Parquet 
> encryption easier to adopt. 
> From the end user's (for example, a data owner's) perspective, if they think a 
> column is sensitive, they can just mark that column's schema as sensitive and 
> the Parquet application encrypts that column automatically. This makes it easy 
> for end users to manage the encryption of their columns. 





[jira] [Updated] (PARQUET-1396) Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory

2020-10-21 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-1396:
-
Summary: Example of using EncryptionPropertiesFactory and 
DecryptionPropertiesFactory  (was: Cryptodata Interface for Schema Activation 
of Parquet Encryption)

> Example of using EncryptionPropertiesFactory and DecryptionPropertiesFactory
> 
>
> Key: PARQUET-1396
> URL: https://issues.apache.org/jira/browse/PARQUET-1396
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.10.0, 1.10.1
>Reporter: Xinli Shang
>Priority: Major
>  Labels: pull-request-available
>
> This JIRA is an extension to Parquet Modular Encryption Jira(PARQUET-1178) 
> that will provide the basic building blocks and APIs for the encryption 
> support. 
> This JIRA provides a crypto data interface for schema activation of Parquet 
> encryption and serves as a high-level layer on top of PARQUET-1178 to make 
> the adoption of Parquet-1178 easier, with pluggable key access module, 
> without a need to use the low-level encryption APIs. Also, this feature will 
> enable seamless integration with existing clients.
> No change to specifications (Parquet-format), no new Parquet APIs, and no 
> changes in existing Parquet APIs. All current applications, tests, etc, will 
> work.
> From the developer's perspective, they can implement the interface in a 
> plugin that can be attached to any Parquet application like Hive/Spark, etc. 
> This decouples the complexity of dealing with KMS and schema from Parquet 
> applications. A large organization may have hundreds or even thousands 
> of Parquet applications and pipelines; the decoupling makes Parquet 
> encryption easier to adopt.  
> From the end user's (for example, the data owner's) perspective, if they 
> think a column is sensitive, they can just mark that column's schema as 
> sensitive and the Parquet application then encrypts that column 
> automatically. This makes it easy for end users to manage the encryption of 
> their columns.  
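The schema-activation idea above can be sketched as a toy column picker. This is a hedged model only: the class name SensitiveColumnPicker and the "sensitive" metadata marker are made up for illustration and are not the real EncryptionPropertiesFactory API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class SensitiveColumnPicker {
    // Given each column's schema metadata, return the columns a plugin would
    // encrypt. The "sensitive" -> "true" marker is a stand-in for whatever
    // annotation the organization's schema system actually uses.
    static List<String> columnsToEncrypt(Map<String, Map<String, String>> columnMeta) {
        List<String> encrypt = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> e : columnMeta.entrySet()) {
            if ("true".equals(e.getValue().get("sensitive"))) {
                encrypt.add(e.getKey());
            }
        }
        return encrypt;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> meta = Map.of(
            "ssn", Map.of("sensitive", "true"),
            "city", Map.of());
        System.out.println(columnsToEncrypt(meta)); // [ssn]
    }
}
```

The point of the design is that only the schema carries the sensitivity decision; the application code never touches encryption APIs directly.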



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

2020-10-21 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218325#comment-17218325
 ] 

Xinli Shang commented on PARQUET-1927:
--

The workaround I can think of is to apply the ColumnIndex to each row group, 
something like (columnIndex, rowGroup) => recordCount, before calling 
readNextFilteredRowGroup() in Iceberg. If recordCount is 0, we skip calling 
readNextFilteredRowGroup() for that row group. Done this way, it is ensured 
that readNextFilteredRowGroup() will never advance to the next row group 
without Iceberg's knowledge. But this workaround has several issues. 1) It is 
not a trivial implementation, because we would need to implement all types of 
filters against the ColumnIndex, which pretty much duplicates the 
implementation in Parquet. 2) The two implementations (in Parquet and in 
Iceberg) have to stay consistent; if one has issues, it will leave Iceberg in 
an unknown state. 3) It requires other adopters like Hive and Spark to 
reimplement it too.  

This is not a regression because ColumnIndex is a new feature in 1.11.x. But I 
think releasing 1.11.2 would be better because it helps the adoption of 
1.11.x, as ColumnIndex is one of its major features. 
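The row-group pre-filtering workaround can be modeled as below. This is a sketch under an assumption: filteredCounts[g] stands for the record count a hypothetical Iceberg-side ColumnIndex evaluation predicts for row group g; the real evaluation would duplicate Parquet's ColumnIndexFilter logic.

```java
import java.util.ArrayList;
import java.util.List;

public class RowGroupPrefilter {
    // Row groups predicted empty are skipped up front, so
    // readNextFilteredRowGroup() is never called for them and the reader
    // can never silently advance past a row group.
    static List<Integer> rowGroupsToRead(long[] filteredCounts) {
        List<Integer> toRead = new ArrayList<>();
        for (int g = 0; g < filteredCounts.length; g++) {
            if (filteredCounts[g] > 0) {
                toRead.add(g);
            }
        }
        return toRead;
    }

    public static void main(String[] args) {
        // Row group 1 matches no rows under the filter, so only 0 and 2 are read.
        System.out.println(rowGroupsToRead(new long[] {120, 0, 87})); // [0, 2]
    }
}
```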

 

> ColumnIndex should provide number of records skipped 
> -
>
> Key: PARQUET-1927
> URL: https://issues.apache.org/jira/browse/PARQUET-1927
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found we need to know from Parquet 
> how many records were skipped due to ColumnIndex filtering. When rowCount is 
> 0, readNextFilteredRowGroup() just advances to the next row group without 
> telling the caller. See the code here 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> Iceberg reads Parquet records with an iterator. Its hasNext() has the 
> following check:
> valuesRead + skippedValues < totalValues
> See 
> ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).]
>  
> So without knowing the skipped values, it is hard to determine hasNext() 
> correctly. 
>  
> Currently, we can work around this with a flag: when 
> readNextFilteredRowGroup() returns null, we consider the whole file done, 
> and hasNext() just returns false. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

2020-10-20 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217827#comment-17217827
 ] 

Xinli Shang commented on PARQUET-1927:
--

Adding [~rdblue], [~shardulm] as FYI.

> ColumnIndex should provide number of records skipped 
> -
>
> Key: PARQUET-1927
> URL: https://issues.apache.org/jira/browse/PARQUET-1927
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found we need to know from Parquet 
> how many records were skipped due to ColumnIndex filtering. When rowCount is 
> 0, readNextFilteredRowGroup() just advances to the next row group without 
> telling the caller. See the code here 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> Iceberg reads Parquet records with an iterator. Its hasNext() has the 
> following check:
> valuesRead + skippedValues < totalValues
> See 
> ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).]
>  
> So without knowing the skipped values, it is hard to determine hasNext() 
> correctly. 
>  
> Currently, we can work around this with a flag: when 
> readNextFilteredRowGroup() returns null, we consider the whole file done, 
> and hasNext() just returns false. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

2020-10-20 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17217774#comment-17217774
 ] 

Xinli Shang commented on PARQUET-1927:
--

That is correct, [~gszadovszky]! We need a finer-grained filtered-record count 
at the row group level for the iterator to use. Do you think it makes sense to 
add an API for that? 

If yes, do you think we can release a 1.11.2 version? I see there is usually 
no further release after 1.xx.1. 

 

> ColumnIndex should provide number of records skipped 
> -
>
> Key: PARQUET-1927
> URL: https://issues.apache.org/jira/browse/PARQUET-1927
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found we need to know from Parquet 
> how many records were skipped due to ColumnIndex filtering. When rowCount is 
> 0, readNextFilteredRowGroup() just advances to the next row group without 
> telling the caller. See the code here 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> Iceberg reads Parquet records with an iterator. Its hasNext() has the 
> following check:
> valuesRead + skippedValues < totalValues
> See 
> ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).]
>  
> So without knowing the skipped values, it is hard to determine hasNext() 
> correctly. 
>  
> Currently, we can work around this with a flag: when 
> readNextFilteredRowGroup() returns null, we consider the whole file done, 
> and hasNext() just returns false. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1927) ColumnIndex should provide number of records skipped

2020-10-19 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17216849#comment-17216849
 ] 

Xinli Shang commented on PARQUET-1927:
--

[~gszadovszky], the Iceberg Parquet reader iterator is implemented so that it 
relies on the check 'valuesRead < totalValues'. When integrating ColumnIndex, 
we replace readNextRowGroup() with readNextFilteredRowGroup(). Because 
readNextFilteredRowGroup() will skip some records, we change the check to 
'valuesRead + skippedValues < totalValues'. The skippedValues is calculated 
as 'blockRowCount - counts_Returned_from_readNextFilteredRowGroup'. This works 
great. But when a whole row group is skipped, readNextFilteredRowGroup() 
advances to the next row group internally without Iceberg's knowledge. Hence 
Iceberg doesn't know how to calculate the skippedValues. 

So if readNextFilteredRowGroup() could return how many records it skipped, or 
tell the index of the row group the returned pages came from, Iceberg could 
calculate the skippedValues. 
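The accounting described above can be sketched as a small state machine. This is a hedged model of the Iceberg-side bookkeeping, not actual Iceberg code: onRowGroupRead() stands in for whatever callback would report per-row-group counts, and the second call shows how the arithmetic closes out when a fully skipped row group IS reported (the bug is precisely that today it is not).

```java
public class FilteredIteratorSketch {
    final long totalValues;
    long valuesRead;
    long skippedValues;

    FilteredIteratorSketch(long totalValues) {
        this.totalValues = totalValues;
    }

    // Called after each filtered row-group read: blockRowCount is the row
    // group's full row count, returnedCount is how many records the filtered
    // read actually kept.
    void onRowGroupRead(long blockRowCount, long returnedCount) {
        skippedValues += blockRowCount - returnedCount;
    }

    boolean hasNext() {
        return valuesRead + skippedValues < totalValues;
    }

    public static void main(String[] args) {
        FilteredIteratorSketch it = new FilteredIteratorSketch(100);
        it.onRowGroupRead(60, 40);        // 20 rows skipped by ColumnIndex
        it.valuesRead = 40;               // all returned rows consumed
        System.out.println(it.hasNext()); // true: 40 + 20 < 100
        it.onRowGroupRead(40, 0);         // whole second row group skipped
        System.out.println(it.hasNext()); // false: 40 + 60 == 100
    }
}
```

If the second onRowGroupRead() never fires because the reader advanced silently, hasNext() stays true forever, which is the failure mode this issue asks Parquet to fix.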

> ColumnIndex should provide number of records skipped 
> -
>
> Key: PARQUET-1927
> URL: https://issues.apache.org/jira/browse/PARQUET-1927
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> When integrating Parquet ColumnIndex, I found we need to know from Parquet 
> how many records were skipped due to ColumnIndex filtering. When rowCount is 
> 0, readNextFilteredRowGroup() just advances to the next row group without 
> telling the caller. See the code here 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]
>  
> Iceberg reads Parquet records with an iterator. Its hasNext() has the 
> following check:
> valuesRead + skippedValues < totalValues
> See 
> ([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).]
>  
> So without knowing the skipped values, it is hard to determine hasNext() 
> correctly. 
>  
> Currently, we can work around this with a flag: when 
> readNextFilteredRowGroup() returns null, we consider the whole file done, 
> and hasNext() just returns false. 
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1927) ColumnIndex should provide number of records skipped

2020-10-17 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-1927:


 Summary: ColumnIndex should provide number of records skipped 
 Key: PARQUET-1927
 URL: https://issues.apache.org/jira/browse/PARQUET-1927
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Affects Versions: 1.11.0
Reporter: Xinli Shang
 Fix For: 1.12.0


When integrating Parquet ColumnIndex, I found we need to know from Parquet how 
many records were skipped due to ColumnIndex filtering. When rowCount is 0, 
readNextFilteredRowGroup() just advances to the next row group without telling 
the caller. See the code here 
[https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L969]

 

Iceberg reads Parquet records with an iterator. Its hasNext() has the 
following check:

valuesRead + skippedValues < totalValues

See 
([https://github.com/apache/iceberg/pull/1566/commits/cd70cac279d3f14ba61f0143f9988d4cc9413651#diff-d80c15b3e5376265436aeab8b79d5a92fb629c6b81f58ad10a11b9b9d3bfcffcR115).]
 

So without knowing the skipped values, it is hard to determine hasNext() 
correctly. 

 

Currently, we can work around this with a flag: when 
readNextFilteredRowGroup() returns null, we consider the whole file done, and 
hasNext() just returns false. 

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1916) Add hash functionality

2020-09-23 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-1916:


 Summary: Add hash functionality 
 Key: PARQUET-1916
 URL: https://issues.apache.org/jira/browse/PARQUET-1916
 Project: Parquet
  Issue Type: Sub-task
Reporter: Xinli Shang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1915) Add null command

2020-09-23 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang reassigned PARQUET-1915:


Assignee: Xinli Shang

> Add null command 
> -
>
> Key: PARQUET-1915
> URL: https://issues.apache.org/jira/browse/PARQUET-1915
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1915) Add null command

2020-09-23 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-1915:


 Summary: Add null command 
 Key: PARQUET-1915
 URL: https://issues.apache.org/jira/browse/PARQUET-1915
 Project: Parquet
  Issue Type: Sub-task
Reporter: Xinli Shang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (PARQUET-1901) Add filter null check for ColumnIndex

2020-08-27 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17186211#comment-17186211
 ] 

Xinli Shang edited comment on PARQUET-1901 at 8/28/20, 2:23 AM:


I have the initial version of the Iceberg integration working in my private repo 
https://github.com/shangxinli/iceberg/commit/4cc9351f8a511a3179cb3ac857541f9116dd8661.
 It can now skip pages based on the column index. But it is a very early 
version and I haven't finalized it yet; no tests are added either. I also 
haven't gotten to addressing your feedback on the idtoAlias comments yet. But 
I hope it gives you an idea of what the integration looks like. 


was (Author: sha...@uber.com):
I have the initial version of Iceberg integration working in my private repo 
https://github.com/shangxinli/iceberg/commit/4cc9351f8a511a3179cb3ac857541f9116dd8661.
 It can skip the pages now based on the column index. But it is very initial 
version and I didn't finalize it yet, also no tests are added. I also didn't 
get time to address your feedback to idtoAlias comments yet. But I hope it can 
give you AN idea ON what the integration looks like. 

> Add filter null check for ColumnIndex  
> ---
>
> Key: PARQUET-1901
> URL: https://issues.apache.org/jira/browse/PARQUET-1901
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> This Jira is opened to discuss whether we should add null checking for the 
> filter when ColumnIndex is enabled. 
> In the ColumnIndexFilter#calculateRowRanges() method, the input parameter 
> 'filter' is assumed to be non-null without checking. It throws an NPE when 
> ColumnIndex is enabled (the default) but no filter is set in the 
> ParquetReadOptions. The call stack is below. 
> java.lang.NullPointerException
> at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:961)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:891)
> If we don't add it, the user needs to choose between calling 
> readNextRowGroup() and readNextFilteredRowGroup() based on whether a filter 
> exists. 
> Thoughts?  
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1901) Add filter null check for ColumnIndex

2020-08-27 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17186211#comment-17186211
 ] 

Xinli Shang commented on PARQUET-1901:
--

I have the initial version of the Iceberg integration working in my private repo 
https://github.com/shangxinli/iceberg/commit/4cc9351f8a511a3179cb3ac857541f9116dd8661.
 It can now skip pages based on the column index. But it is a very early 
version and I haven't finalized it yet; no tests are added either. I also 
haven't gotten to addressing your feedback on the idtoAlias comments yet. But 
I hope it gives you an idea of what the integration looks like. 

> Add filter null check for ColumnIndex  
> ---
>
> Key: PARQUET-1901
> URL: https://issues.apache.org/jira/browse/PARQUET-1901
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> This Jira is opened to discuss whether we should add null checking for the 
> filter when ColumnIndex is enabled. 
> In the ColumnIndexFilter#calculateRowRanges() method, the input parameter 
> 'filter' is assumed to be non-null without checking. It throws an NPE when 
> ColumnIndex is enabled (the default) but no filter is set in the 
> ParquetReadOptions. The call stack is below. 
> java.lang.NullPointerException
> at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:961)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:891)
> If we don't add it, the user needs to choose between calling 
> readNextRowGroup() and readNextFilteredRowGroup() based on whether a filter 
> exists. 
> Thoughts?  
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1901) Add filter null check for ColumnIndex

2020-08-24 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183352#comment-17183352
 ] 

Xinli Shang commented on PARQUET-1901:
--

Hi [~rdblue], please comment on this if you have a different opinion. This was 
found during the ColumnIndex integration into Iceberg. We would need to handle 
the null checking in Iceberg anyway before Parquet 1.12.0.  

> Add filter null check for ColumnIndex  
> ---
>
> Key: PARQUET-1901
> URL: https://issues.apache.org/jira/browse/PARQUET-1901
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> This Jira is opened to discuss whether we should add null checking for the 
> filter when ColumnIndex is enabled. 
> In the ColumnIndexFilter#calculateRowRanges() method, the input parameter 
> 'filter' is assumed to be non-null without checking. It throws an NPE when 
> ColumnIndex is enabled (the default) but no filter is set in the 
> ParquetReadOptions. The call stack is below. 
> java.lang.NullPointerException
> at 
> org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:961)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:891)
> If we don't add it, the user needs to choose between calling 
> readNextRowGroup() and readNextFilteredRowGroup() based on whether a filter 
> exists. 
> Thoughts?  
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1901) Add filter null check for ColumnIndex

2020-08-22 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-1901:


 Summary: Add filter null check for ColumnIndex  
 Key: PARQUET-1901
 URL: https://issues.apache.org/jira/browse/PARQUET-1901
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.11.0
Reporter: Xinli Shang
Assignee: Xinli Shang
 Fix For: 1.12.0


This Jira is opened to discuss whether we should add null checking for the 
filter when ColumnIndex is enabled. 

In the ColumnIndexFilter#calculateRowRanges() method, the input parameter 
'filter' is assumed to be non-null without checking. It throws an NPE when 
ColumnIndex is enabled (the default) but no filter is set in the 
ParquetReadOptions. The call stack is below. 
java.lang.NullPointerException
at 
org.apache.parquet.internal.filter2.columnindex.ColumnIndexFilter.calculateRowRanges(ColumnIndexFilter.java:81)
at 
org.apache.parquet.hadoop.ParquetFileReader.getRowRanges(ParquetFileReader.java:961)
at 
org.apache.parquet.hadoop.ParquetFileReader.readNextFilteredRowGroup(ParquetFileReader.java:891)

If we don't add it, the user needs to choose between calling 
readNextRowGroup() and readNextFilteredRowGroup() based on whether a filter 
exists. 

Thoughts?  
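The proposed guard can be sketched on a toy model. This is not parquet-mr code: the range representation and the method below are simplified stand-ins for ColumnIndexFilter#calculateRowRanges(), showing only the null-handling behavior under discussion.

```java
import java.util.function.LongPredicate;

public class NullFilterGuard {
    // Toy model: a row range is a [start, end] pair, inclusive.
    // The proposed fix: a null filter means "keep every row" rather than NPE.
    static long[] calculateRowRange(LongPredicate filter, long rowCount) {
        if (filter == null) {
            return new long[] {0, rowCount - 1};
        }
        long first = -1, last = -1;
        for (long i = 0; i < rowCount; i++) {
            if (filter.test(i)) {
                if (first < 0) first = i;
                last = i;
            }
        }
        return new long[] {first, last};
    }

    public static void main(String[] args) {
        long[] r = calculateRowRange(null, 10);
        System.out.println(r[0] + ".." + r[1]); // 0..9, instead of a NullPointerException
    }
}
```

With the guard in place the caller can always use the filtered read path, whether or not a filter was configured.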




  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1801) Add column index support for 'prune' command in Parquet-tools/cli

2020-08-13 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176961#comment-17176961
 ] 

Xinli Shang commented on PARQUET-1801:
--

I will try to do it in 1.12.0.  

The feature works great! We removed columns from many huge tables and saved 
significant storage space. I will give a talk at ApacheCon 2020 presenting 
this topic.  

> Add column index support for 'prune' command in Parquet-tools/cli
> -
>
> Key: PARQUET-1801
> URL: https://issues.apache.org/jira/browse/PARQUET-1801
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli, parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1792) Add 'mask' command to parquet-tools/parquet-cli

2020-08-13 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17176959#comment-17176959
 ] 

Xinli Shang commented on PARQUET-1792:
--

We might want to push it to the next release. 

> Add 'mask' command to parquet-tools/parquet-cli
> ---
>
> Key: PARQUET-1792
> URL: https://issues.apache.org/jira/browse/PARQUET-1792
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
> Fix For: 1.12.0
>
>
> Some personal data columns need to be masked instead of being 
> pruned(Parquet-1791). We need a tool to replace the raw data columns with 
> masked value. The masked value could be hash, null, redact etc.  For the 
> unchanged columns, they should be moved as a whole like 'merge', 'prune' 
> command in Parquet-tools. 
>  
> Implementing this feature in file format is 10X faster than doing it by 
> rewriting the table data in the query engine. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1893) H2SeekableInputStream readFully() doesn't respect start and len

2020-07-29 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-1893:


 Summary: H2SeekableInputStream readFully() doesn't respect start 
and len 
 Key: PARQUET-1893
 URL: https://issues.apache.org/jira/browse/PARQUET-1893
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Reporter: Xinli Shang
Assignee: Xinli Shang


The readFully() implementation throws away the parameters 'start' and 'len', 
as shown below. 

public void readFully(byte[] bytes, int start, int len) throws IOException {
  stream.readFully(bytes);
}

It should be corrected as below. 

public void readFully(byte[] bytes, int start, int len) throws IOException {
  stream.readFully(bytes, start, len);
}

H1SeekableInputStream has already been fixed. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1830) Vectorized API to support Column Index in Apache Spark

2020-07-20 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161573#comment-17161573
 ] 

Xinli Shang commented on PARQUET-1830:
--

[~FelixKJose] Do we have a Spark task created for implementing the short-term 
solution? 

> Vectorized API to support Column Index in Apache Spark
> --
>
> Key: PARQUET-1830
> URL: https://issues.apache.org/jira/browse/PARQUET-1830
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> As per the comment on https://issues.apache.org/jira/browse/SPARK-26345. Its 
> seems like Apache Spark doesn't support Column Index until we disable 
> vectorizedReader in Spark - which will have other performance implications. 
> As per [~zi] , parquet-mr should implement a Vectorized API. Is it already 
> implemented or any pull request for the same?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1739) Make Spark SQL support Column indexes

2020-07-20 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17161566#comment-17161566
 ] 

Xinli Shang commented on PARQUET-1739:
--

[~yumwang], can you share whether the implementation to skip Parquet pages is 
done in Spark, as [~gszadovszky] asked in SPARK-26346? If it isn't, I will 
start looking into it. 

> Make Spark SQL support Column indexes
> -
>
> Key: PARQUET-1739
> URL: https://issues.apache.org/jira/browse/PARQUET-1739
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.11.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> Make Spark SQL support Column indexes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1883) int96 support in parquet-avro

2020-07-09 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17154888#comment-17154888
 ] 

Xinli Shang commented on PARQUET-1883:
--

[~gszadovszky], do you still have the links about INT96 being deprecated? And 
do you have a suggested workaround for this case? 

> int96 support in parquet-avro
> -
>
> Key: PARQUET-1883
> URL: https://issues.apache.org/jira/browse/PARQUET-1883
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.10.1
>Reporter: satish
>Priority: Major
>
> Hi
> It looks like 'timestamp' is being converted to 'int64' primitive type in 
> parquet-avro. This is incompatible with hive2. Hive throws below error 
> {code:java}
> Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be 
> cast to org.apache.hadoop.hive.serde2.io.TimestampWritable (state=,code=0)
> {code}
> What does it take to write timestamp field as 'int96'? 
> Hive seems to write timestamp field as int96.  See example below
> {code:java}
> $ hadoop jar parquet-tools-1.9.0.jar meta hdfs://timestamp_test/00_0
> creator: parquet-mr version 1.10.6 (build 
> 098c6199a821edd3d6af56b962fd0f1558af849b)
> file schema: hive_schema
> 
> ts:  OPTIONAL INT96 R:0 D:1
> row group 1: RC:4 TS:88 OFFSET:4
> 
> ts:   INT96 UNCOMPRESSED DO:0 FPO:4 SZ:88/88/1.00 VC:4 
> ENC:BIT_PACKED,RLE,PLAIN_DICTIONARY
> {code}
> Writing a spark dataframe into parquet format (without using avro) is also 
> using int96.
> {code:java}
> scala> testDS.printSchema()
> root
>  |-- ts: timestamp (nullable = true)
> scala> testDS.write.mode(Overwrite).save("/tmp/x");
> $ parquet-tools meta 
> /tmp/x/part-0-99720ebd-0aea-45ac-9b8c-0eb7ad6f4e3c-c000.gz.parquet 
> file:
> file:/tmp/x/part-0-99720ebd-0aea-45ac-9b8c-0eb7ad6f4e3c-c000.gz.parquet 
> creator: parquet-mr version 1.10.1 (build 
> a89df8f9932b6ef6633d06069e50c9b7970bebd1) 
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"ts","type":"timestamp","nullable":true,"metadata":{}}]}
>  
> file schema: spark_schema 
> 
> ts:  OPTIONAL INT96 R:0 D:1
> row group 1: RC:4 TS:93 OFFSET:4 
> 
> ts:   INT96 GZIP DO:0 FPO:4 SZ:130/93/0.72 VC:4 
> ENC:RLE,PLAIN_DICTIONARY,BIT_PACKED ST:[no stats for this column]
> {code}
> I saw some explanation for deprecating int96 [support 
> here|https://issues.apache.org/jira/browse/PARQUET-1870?focusedCommentId=17127963=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17127963]
>  from [~gszadovszky]. But given hive and serialization in other parquet 
> modules (non-avro) support int96, I'm trying to understand the reasoning for 
> not implementing it in parquet-avro.
> A bit more context: we are trying to migrate some of our data to [hudi 
> format|https://hudi.apache.org/]. Hudi adds a lot of efficiency for our use 
> cases. But, when we write data using hudi, hudi uses parquet-avro and 
> timestamp is being converted to int64. As mentioned earlier, this breaks 
> compatibility with hive. A lot of columns in our tables have 'timestamp' as 
> type in hive DDL.  It is almost impossible to change DDL to long as there are 
> large number of tables and columns. 
> We are happy to contribute if there is a clear path forward to support int96 
> in parquet-avro. Please also let me know if you are aware of a workaround in 
> hive that can read int64 correctly as timestamp.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1872) Add TransCompression command

2020-06-17 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138607#comment-17138607
 ] 

Xinli Shang commented on PARQUET-1872:
--

That is the correct understanding, [~gszadovszky]. 

> Add TransCompression command 
> -
>
> Key: PARQUET-1872
> URL: https://issues.apache.org/jira/browse/PARQUET-1872
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> As ZSTD becomes more popular, there is a need to translate existing data to 
> ZSTD-compressed files, which can achieve a higher compression ratio. It 
> would be useful to have a tool that converts a Parquet file directly by just 
> decompressing/compressing each page, without decoding/encoding or assembling 
> the records, because it is much faster. Initial results show it is ~5 times 
> faster. 
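The page-level transcode idea can be illustrated with standard codecs. This is a toy sketch using the JDK's Inflater/Deflater, not Parquet's actual codec plumbing or the TransCompression command: it shows only why the approach is fast, namely that pages are re-compressed as raw bytes with no value decoding or record assembly.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PageTranscodeSketch {
    // Re-compress one "page" under a new codec setting, touching only bytes.
    static byte[] transcode(byte[] compressedPage, int newLevel) {
        // Step 1: decompress the page payload.
        Inflater inflater = new Inflater();
        inflater.setInput(compressedPage);
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && inflater.needsInput()) break;
                raw.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt page", e);
        }
        // Step 2: recompress with the new setting; values are never decoded.
        Deflater deflater = new Deflater(newLevel);
        deflater.setInput(raw.toByteArray());
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Build a sample "page" compressed at the fastest level.
        byte[] page = "page data page data page data".getBytes();
        Deflater d = new Deflater(Deflater.BEST_SPEED);
        d.setInput(page);
        d.finish();
        ByteArrayOutputStream c = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!d.finished()) c.write(buf, 0, d.deflate(buf));
        byte[] recompressed = transcode(c.toByteArray(), Deflater.BEST_COMPRESSION);
        System.out.println(recompressed.length > 0); // true
    }
}
```

In the real tool the source and target codecs differ (e.g. GZIP to ZSTD) and the page headers and column metadata are rewritten accordingly, but the per-page byte-level loop is the same shape.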



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1872) Add TransCompression command

2020-06-16 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137967#comment-17137967
 ] 

Xinli Shang commented on PARQUET-1872:
--

[~gszadovszky] Thanks for the reply! I just manually linked the PR. 

For the subtask, I was thinking of getting the review and changes done first 
in parquet-tools, and then adding it to parquet-cli, instead of changing both 
at the same time. But it is also fine for me to have both changes in the same 
PR; I just added parquet-cli in the newest PR. 

For ColumnIndex and OffsetIndex, they are taken care of in my PR. I also added 
tests for both ColumnIndex and OffsetIndex validation. 

For the bloom filter, I will work on the subtask when this PR is done. That 
will require copying over the existing bloom filters to the new files.

Xinli 

> Add TransCompression command 
> -
>
> Key: PARQUET-1872
> URL: https://issues.apache.org/jira/browse/PARQUET-1872
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> As ZSTD becomes more popular, there is a need to translate existing data to 
> ZSTD-compressed files, which can achieve a higher compression ratio. It 
> would be useful to have a tool that converts a Parquet file directly by just 
> decompressing/compressing each page, without decoding/encoding or assembling 
> the records, because it is much faster. Initial results show it is ~5 times 
> faster. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1874) Add to parquet-cli

2020-06-16 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang reassigned PARQUET-1874:


Assignee: Xinli Shang

> Add to parquet-cli
> --
>
> Key: PARQUET-1874
> URL: https://issues.apache.org/jira/browse/PARQUET-1874
> Project: Parquet
>  Issue Type: Sub-task
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1876) Port ZSTD-JNI support to 1.10.x branch

2020-06-14 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-1876:


 Summary: Port ZSTD-JNI support to 1.10.x branch
 Key: PARQUET-1876
 URL: https://issues.apache.org/jira/browse/PARQUET-1876
 Project: Parquet
  Issue Type: Bug
  Components: parquet-mr
Affects Versions: 1.10.2
Reporter: Xinli Shang
Assignee: Xinli Shang
 Fix For: 1.10.2


I hear there is a need to port the zstd-jni support to 1.10.x because it makes 
ZSTD easier to use. 

cc [~dbtsai]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-1872) Add TransCompression command

2020-06-12 Thread Xinli Shang (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinli Shang updated PARQUET-1872:
-
Description: 
As ZSTD becomes more popular, there is a need to convert existing data to 
ZSTD, which can achieve a higher compression ratio. It would be useful to have 
a tool that converts a Parquet file directly by just 
decompressing/recompressing each page, without decoding/encoding or assembling 
records, because that is much faster. Initial results show it is ~5 times 
faster. 



  was:
When ZSTD becomes more popular, there is a need to translate existing data ZSTD 
compressed which can achieve a higher compression ratio. It would be useful if 
we can have a tool to convert a Parquet file directly by just 
decompressing/compressing each page without decoding/encoding or assembling the 
record because it is much faster. The initial result shows it is ~5 times 
faster. 




> Add TransCompression command 
> -
>
> Key: PARQUET-1872
> URL: https://issues.apache.org/jira/browse/PARQUET-1872
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> As ZSTD becomes more popular, there is a need to convert existing data to 
> ZSTD, which can achieve a higher compression ratio. It would be useful to 
> have a tool that converts a Parquet file directly by just 
> decompressing/recompressing each page, without decoding/encoding or 
> assembling records, because that is much faster. Initial results show it is 
> ~5 times faster. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-1875) Add bloom filter support

2020-06-11 Thread Xinli Shang (Jira)
Xinli Shang created PARQUET-1875:


 Summary: Add bloom filter support 
 Key: PARQUET-1875
 URL: https://issues.apache.org/jira/browse/PARQUET-1875
 Project: Parquet
  Issue Type: Sub-task
Reporter: Xinli Shang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

