[jira] [Updated] (PARQUET-2094) Handle negative values in page headers

2021-12-20 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-2094:
--
 External issue ID: CVE-2021-41561
External issue URL: 
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-41561

> Handle negative values in page headers
> --
>
> Key: PARQUET-2094
> URL: https://issues.apache.org/jira/browse/PARQUET-2094
> Project: Parquet
>  Issue Type: Bug
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.11.2, 1.12.2
>
>
> There are integer values in the page headers that should always be positive 
> (e.g. length). I am not sure whether we properly handle the cases where they 
> are not positive.
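
A minimal sketch of the kind of guard this Jira asks for, assuming the Thrift page header fields are validated right after being read; the helper name, exception type and call site are illustrative, not the actual fix:

{code:java}
// Illustrative guard only (not the merged fix): reject non-positive sizes read
// from a page header before they are used for buffer allocation or skipping.
static int requirePositivePageSize(int size, String field) {
  if (size <= 0) {
    throw new IllegalArgumentException(
        "Invalid page header: " + field + " must be positive but was " + size);
  }
  return size;
}

// Hypothetical call site, e.g.:
// int compressed = requirePositivePageSize(pageHeader.getCompressed_page_size(), "compressed_page_size");
{code}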



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (PARQUET-2106) BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path

2021-12-09 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-2106:
--
Issue Type: Improvement  (was: Task)

> BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path
> ---
>
> Key: PARQUET-2106
> URL: https://issues.apache.org/jira/browse/PARQUET-2106
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
> Attachments: Screen Shot 2021-12-03 at 3.26.31 PM.png, 
> profile_48449_alloc_1638494450_sort_by.html
>
>
> *Background*
> While writing out large Parquet tables using Spark, we've noticed that 
> BinaryComparator is the source of substantial churn of extremely short-lived 
> `HeapByteBuffer` objects – it accounts for up to *16%* of all allocations in 
> our benchmarks, putting substantial pressure on the Garbage 
> Collector:
> !Screen Shot 2021-12-03 at 3.26.31 PM.png|width=828,height=521!
> [^profile_48449_alloc_1638494450_sort_by.html]
>  
> *Proposal*
> We're proposing to adjust lexicographical comparison (at least) to avoid 
> doing any allocations, since this code lies on the hot-path of every Parquet 
> write, therefore causing substantial churn amplification.
>  
>  
>  
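
A minimal sketch of the allocation-free comparison the proposal describes, assuming the comparator already has access to the backing arrays and offsets; this is not the actual parquet-mr change:

{code:java}
// Compare two byte ranges lexicographically as unsigned bytes without wrapping
// them in ByteBuffers, so no per-comparison allocation happens on the hot-path.
static int compareUnsignedLexicographically(
    byte[] a, int aOff, int aLen, byte[] b, int bOff, int bLen) {
  int len = Math.min(aLen, bLen);
  for (int i = 0; i < len; i++) {
    int cmp = (a[aOff + i] & 0xFF) - (b[bOff + i] & 0xFF);
    if (cmp != 0) {
      return cmp;
    }
  }
  return aLen - bLen; // shorter prefix sorts first
}
{code}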



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Assigned] (PARQUET-2106) BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path

2021-12-09 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2106:
-

Assignee: Alexey Kudinkin

> BinaryComparator should avoid doing ByteBuffer.wrap in the hot-path
> ---
>
> Key: PARQUET-2106
> URL: https://issues.apache.org/jira/browse/PARQUET-2106
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Alexey Kudinkin
>Assignee: Alexey Kudinkin
>Priority: Major
> Attachments: Screen Shot 2021-12-03 at 3.26.31 PM.png, 
> profile_48449_alloc_1638494450_sort_by.html
>
>
> *Background*
> While writing out large Parquet tables using Spark, we've noticed that 
> BinaryComparator is the source of substantial churn of extremely short-lived 
> `HeapByteBuffer` objects – it accounts for up to *16%* of all allocations in 
> our benchmarks, putting substantial pressure on the Garbage 
> Collector:
> !Screen Shot 2021-12-03 at 3.26.31 PM.png|width=828,height=521!
> [^profile_48449_alloc_1638494450_sort_by.html]
>  
> *Proposal*
> We're proposing to adjust lexicographical comparison (at least) to avoid 
> doing any allocations, since this code lies on the hot-path of every Parquet 
> write, therefore causing substantial churn amplification.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Resolved] (PARQUET-2107) Travis failures

2021-12-08 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2107.
---
Resolution: Fixed

> Travis failures
> ---
>
> Key: PARQUET-2107
> URL: https://issues.apache.org/jira/browse/PARQUET-2107
> Project: Parquet
>  Issue Type: Bug
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> There have been Travis failures in our PRs for a while. See e.g. 
> https://app.travis-ci.com/github/apache/parquet-mr/jobs/550598285 or 
> https://app.travis-ci.com/github/apache/parquet-mr/jobs/550598286



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (PARQUET-2107) Travis failures

2021-12-07 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-2107:
-

 Summary: Travis failures
 Key: PARQUET-2107
 URL: https://issues.apache.org/jira/browse/PARQUET-2107
 Project: Parquet
  Issue Type: Bug
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky


There have been Travis failures in our PRs for a while. See e.g. 
https://app.travis-ci.com/github/apache/parquet-mr/jobs/550598285 or 
https://app.travis-ci.com/github/apache/parquet-mr/jobs/550598286



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-2104) parquet-cli broken in master

2021-11-24 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448483#comment-17448483
 ] 

Gabor Szadovszky commented on PARQUET-2104:
---

[~gamaken], I am not sure about a workaround. I've tried this on master as well 
as on the tags of the 1.12.2 and 1.11.2 releases. All behave the same way. :(

One idea is to use parquet-tools instead of parquet-cli. It has similar 
functionality. However, parquet-tools has been deprecated in 1.12.0 and removed 
in the current master. You may want to try it with an older tag (e.g. 
apache-parquet-1.11.2).

> parquet-cli broken in master
> 
>
> Key: PARQUET-2104
> URL: https://issues.apache.org/jira/browse/PARQUET-2104
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Affects Versions: 1.12.2
> Environment: ubuntu 18.04 and ubuntu 20.04
>Reporter: Balaji K
>Priority: Major
>
> Creating a Jira per this thread:
> [https://lists.apache.org/thread/k233838g010lvbp81s99floqjmm7nnvs]
>  # clone parquet-mr and build the repo locally
>  # run parquet-cli without Hadoop (according to this ReadMe 
> <[https://github.com/apache/parquet-mr/tree/master/parquet-cli#running-without-hadoop]>
>  )
>  # try a command that deserializes data such as cat or head
>  # observe NoSuchMethodError being thrown
> *Error stack:* ~/repos/parquet-mr/parquet-cli$ parquet cat 
> ../../testdata/dictionaryEncodingSample.parquet WARNING: An illegal 
> reflective access operation has occurred .. 
> Exception in thread "main" java.lang.NoSuchMethodError: 
> 'org.apache.avro.Schema 
> org.apache.parquet.avro.AvroSchemaConverter.convert(org.apache.parquet.schema.MessageType)'
>  at org.apache.parquet.cli.util.Schemas.fromParquet(Schemas.java:89) at 
> org.apache.parquet.cli.BaseCommand.getAvroSchema(BaseCommand.java:405) at 
> org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:66) at 
> org.apache.parquet.cli.Main.run(Main.java:157) at 
> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) at 
> org.apache.parquet.cli.Main.main(Main.java:187)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (PARQUET-2103) crypto exception in print toPrettyJSON

2021-11-22 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17447455#comment-17447455
 ] 

Gabor Szadovszky commented on PARQUET-2103:
---

I think we need to update 
[ParquetMetadata.toJSON|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ParquetMetadata.java#L67-L71].
 Jackson should be configurable to look at the private fields instead of the 
getter methods, but I am not sure whether that is a good idea or whether it 
works in every environment. Another option would be to refactor 
EncryptedColumnChunkMetaData so that a getter does not call "decrypt", but that 
might not be worth the effort. The easiest way would be to simply detect 
whether the metadata contains encrypted data and not log anything in that case. 
I don't know how important it is to be able to log the metadata when debugging.
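
A sketch of the first option mentioned above (field-based serialization instead of getters), under the assumption that Jackson's visibility configuration is acceptable here; whether this works in every environment is exactly the open question. Plain (unshaded) Jackson imports are used for readability:

{code:java}
import com.fasterxml.jackson.annotation.JsonAutoDetect.Visibility;
import com.fasterxml.jackson.annotation.PropertyAccessor;
import com.fasterxml.jackson.databind.ObjectMapper;

public class FieldOnlyJsonSketch {
  // Build an ObjectMapper that serializes private fields and never calls
  // getters, so a getter like EncryptedColumnChunkMetaData.getEncodingStats()
  // (which triggers decryption) would not be invoked while producing JSON.
  public static ObjectMapper fieldOnlyMapper() {
    ObjectMapper mapper = new ObjectMapper();
    mapper.setVisibility(PropertyAccessor.FIELD, Visibility.ANY);
    mapper.setVisibility(PropertyAccessor.GETTER, Visibility.NONE);
    mapper.setVisibility(PropertyAccessor.IS_GETTER, Visibility.NONE);
    return mapper;
  }
}
{code}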

> crypto exception in print toPrettyJSON
> --
>
> Key: PARQUET-2103
> URL: https://issues.apache.org/jira/browse/PARQUET-2103
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0, 1.12.1, 1.12.2
>Reporter: Gidon Gershinsky
>Priority: Major
>
> In debug mode, this code 
> {{if (LOG.isDebugEnabled()) {}}
> {{  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}}
> {{}}}
> called in 
> {{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}
>  
> _*for unencrypted files*_ 
> triggers an exception:
>  
> {{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. 
> Null File Decryptor     }}
> {{    at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at 
> org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:?]}}
> {{    at 
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  ~[?:?]}}
> {{    at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:?]}}
> {{    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  

[jira] [Resolved] (PARQUET-2101) Fix wrong descriptions about the default block size

2021-11-02 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2101.
---
Resolution: Fixed

> Fix wrong descriptions about the default block size
> ---
>
> Key: PARQUET-2101
> URL: https://issues.apache.org/jira/browse/PARQUET-2101
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro, parquet-mr, parquet-protobuf
>Reporter: Kengo Seki
>Assignee: Kengo Seki
>Priority: Trivial
>
> https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L90
> https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L240
> https://github.com/apache/parquet-mr/blob/master/parquet-protobuf/src/main/java/org/apache/parquet/proto/ProtoParquetWriter.java#L80
> These javadocs say the default block size is 50 MB but it's actually 128MB.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-2094) Handle negative values in page headers

2021-09-30 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-2094:
--
Fix Version/s: 1.12.2
   1.11.2

> Handle negative values in page headers
> --
>
> Key: PARQUET-2094
> URL: https://issues.apache.org/jira/browse/PARQUET-2094
> Project: Parquet
>  Issue Type: Bug
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.11.2, 1.12.2
>
>
> There are integer values in the page headers that should always be positive 
> (e.g. length). I am not sure whether we properly handle the cases where they 
> are not positive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2094) Handle negative values in page headers

2021-09-30 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2094.
---
Resolution: Fixed

> Handle negative values in page headers
> --
>
> Key: PARQUET-2094
> URL: https://issues.apache.org/jira/browse/PARQUET-2094
> Project: Parquet
>  Issue Type: Bug
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> There are integer values in the page headers that should always be positive 
> (e.g. length). I am not sure whether we properly handle the cases where they 
> are not positive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1968) FilterApi support In predicate

2021-09-30 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1968.
---
Resolution: Fixed

> FilterApi support In predicate
> --
>
> Key: PARQUET-1968
> URL: https://issues.apache.org/jira/browse/PARQUET-1968
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Yuming Wang
>Assignee: Huaxin Gao
>Priority: Major
>
> FilterApi should support native In predicate.
> Spark:
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605
> Impala:
> https://issues.apache.org/jira/browse/IMPALA-3654
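
Until a native In predicate exists, the closest approximation with the current FilterApi is an or() chain of eq() predicates; the column name and values below are made up for illustration:

{code:java}
import static org.apache.parquet.filter2.predicate.FilterApi.eq;
import static org.apache.parquet.filter2.predicate.FilterApi.intColumn;
import static org.apache.parquet.filter2.predicate.FilterApi.or;

import org.apache.parquet.filter2.predicate.FilterPredicate;
import org.apache.parquet.filter2.predicate.Operators.IntColumn;

public class InPredicateSketch {
  // Emulate "id IN (1, 2, 3)" with today's API. A native In predicate would
  // avoid building this (potentially deep) tree for large value sets.
  public static FilterPredicate idIn123() {
    IntColumn id = intColumn("id"); // illustrative column name
    return or(eq(id, 1), or(eq(id, 2), eq(id, 3)));
  }
}
{code}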



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1968) FilterApi support In predicate

2021-09-30 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1968:
-

Assignee: Huaxin Gao

> FilterApi support In predicate
> --
>
> Key: PARQUET-1968
> URL: https://issues.apache.org/jira/browse/PARQUET-1968
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Yuming Wang
>Assignee: Huaxin Gao
>Priority: Major
>
> FilterApi should support native In predicate.
> Spark:
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605
> Impala:
> https://issues.apache.org/jira/browse/IMPALA-3654



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2096) Upgrade Thrift to 0.15.0

2021-09-28 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2096.
---
Resolution: Fixed

> Upgrade Thrift to 0.15.0
> 
>
> Key: PARQUET-2096
> URL: https://issues.apache.org/jira/browse/PARQUET-2096
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Vinoo Ganesh
>Assignee: Vinoo Ganesh
>Priority: Minor
>
> Thrift 0.15.0 is currently the default in brew: 
> [https://github.com/Homebrew/homebrew-core/blob/82d03f657371e1541a9a5e5de57c5e1aa00acd45/Formula/thrift.rb#L4.|https://github.com/Homebrew/homebrew-core/blob/master/Formula/thrift.rb#L4.]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-2096) Upgrade Thrift to 0.15.0

2021-09-28 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2096:
-

Assignee: Vinoo Ganesh

> Upgrade Thrift to 0.15.0
> 
>
> Key: PARQUET-2096
> URL: https://issues.apache.org/jira/browse/PARQUET-2096
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Vinoo Ganesh
>Assignee: Vinoo Ganesh
>Priority: Minor
>
> Thrift 0.15.0 is currently the default in brew: 
> [https://github.com/Homebrew/homebrew-core/blob/82d03f657371e1541a9a5e5de57c5e1aa00acd45/Formula/thrift.rb#L4.|https://github.com/Homebrew/homebrew-core/blob/master/Formula/thrift.rb#L4.]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2080) Deprecate RowGroup.file_offset

2021-09-28 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17421270#comment-17421270
 ] 

Gabor Szadovszky commented on PARQUET-2080:
---

[~gershinsky], could you make the doc available for comments?

> Deprecate RowGroup.file_offset
> --
>
> Key: PARQUET-2080
> URL: https://issues.apache.org/jira/browse/PARQUET-2080
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gidon Gershinsky
>Priority: Major
>
> Due to PARQUET-2078 RowGroup.file_offset is not reliable.
> This field is also wrongly calculated in the C++ oss parquet implementation 
> PARQUET-2089



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2094) Handle negative values in page headers

2021-09-22 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-2094:
-

 Summary: Handle negative values in page headers
 Key: PARQUET-2094
 URL: https://issues.apache.org/jira/browse/PARQUET-2094
 Project: Parquet
  Issue Type: Bug
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky


There are integer values in the page headers that should always be positive 
(e.g. length). I am not sure whether we properly handle the cases where they 
are not positive.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-118) Provide option to use on-heap buffers for Snappy compression/decompression

2021-09-21 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17418202#comment-17418202
 ] 

Gabor Szadovszky commented on PARQUET-118:
--

[~MasterDDT], Unfortunately I can only say something similar to what Julien 
added in the first comment. I'm happy to review any PRs on this topic. :)

> Provide option to use on-heap buffers for Snappy compression/decompression
> --
>
> Key: PARQUET-118
> URL: https://issues.apache.org/jira/browse/PARQUET-118
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.6.0
>Reporter: Patrick Wendell
>Priority: Major
>
> The current code uses direct off-heap buffers for decompression. If many 
> decompressors are instantiated across multiple threads, and/or the objects 
> being decompressed are large, this can lead to a huge amount of off-heap 
> allocation by the JVM. This can be exacerbated if overall, there is not heap 
> contention, since no GC will be performed to reclaim the space used by these 
> buffers.
> It would be nice if there was a flag we could use to simply allocate on-heap 
> buffers here:
> https://github.com/apache/incubator-parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/codec/SnappyDecompressor.java#L28
> We ran into an issue today where these buffers totaled a very large amount of 
> storage and caused our Java processes (running within containers) to be 
> terminated by the kernel OOM-killer.
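
A minimal sketch of the flag idea from the description; the option and helper names are assumptions, not an existing parquet-mr configuration:

{code:java}
import java.nio.ByteBuffer;

public class SnappyBufferAllocation {
  // Hypothetical switch: let configuration decide whether the codec's scratch
  // buffers live on the Java heap (GC-managed) or in direct, off-heap memory.
  static ByteBuffer allocateScratchBuffer(int size, boolean useOnHeapBuffers) {
    return useOnHeapBuffers
        ? ByteBuffer.allocate(size)        // on-heap, reclaimed by normal GC
        : ByteBuffer.allocateDirect(size); // off-heap, current behaviour
  }
}
{code}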



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2091) Fix release build error introduced by PARQUET-2043

2021-09-20 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17417540#comment-17417540
 ] 

Gabor Szadovszky commented on PARQUET-2091:
---

This is strange to me because the release command should not do anything more 
(related to dependencies) than a {{mvn verify}} does.
Is it possible that this issue occurs only on the 1.12.x branch and that master 
does not have it?

> Fix release build error introduced by PARQUET-2043
> --
>
> Key: PARQUET-2091
> URL: https://issues.apache.org/jira/browse/PARQUET-2091
> Project: Parquet
>  Issue Type: Task
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> After PARQUET-2043 when building for a release like 1.12.1, there is build 
> error complaining 'used undeclared dependency'. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2088) Different created_by field values for application and library

2021-09-15 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17415378#comment-17415378
 ] 

Gabor Szadovszky commented on PARQUET-2088:
---

parquet-mr automatically fills the {{created_by}} field using FULL_VERSION. 
The components using it (Hive/Spark) do not have to populate anything, so 
whenever parquet-mr writes a file, the proper full version string of parquet-mr 
is written to the field.

You are right that there is no separate field for the version of the "higher 
level" application. (I remember some discussions about this topic but could not 
find them in the Jiras :( ) The issue here is which application version we 
should store. For example, there may be customer code that uses a tool written 
for Spark, which in turn writes the parquet file. We can make mistakes at any 
level that may cause values that are invalid from a certain point of view. So 
how should we handle this, how can we formalize it, and how can we enforce that 
client code fills these fields?
Anyway, if you have a proposal, feel free to write to the dev list.
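
For context, the created_by value written by parquet-mr can be inspected from the file footer; a minimal sketch, with the file path as an assumption:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class CreatedByExample {
  // Read the created_by string (e.g. the parquet-mr full version) from a footer.
  public static String readCreatedBy(String file) throws Exception {
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path(file), new Configuration()))) {
      return reader.getFooter().getFileMetaData().getCreatedBy();
    }
  }
}
{code}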

> Different created_by field values for application and library
> -
>
> Key: PARQUET-2088
> URL: https://issues.apache.org/jira/browse/PARQUET-2088
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: format-2.9.0
>Reporter: Joshua Howard
>Priority: Minor
>
> There seems to be a discrepancy in the Parquet format created_by field 
> regarding how it should be filled out. The parquet-mr library uses this value 
> to enable/disable features based on the parquet-mr version 
> [here|https://github.com/apache/parquet-mr/blob/5f403501e9de05b6aa48f028191b4e78bb97fb12/parquet-column/src/main/java/org/apache/parquet/CorruptDeltaByteArrays.java#L64-L68].
>  Meanwhile, users are encouraged to make use of the application version 
> [here|https://www.javadoc.io/doc/org.apache.parquet/parquet-format/latest/org/apache/parquet/format/FileMetaData.html].
>  It seems like there are multiple fields needed for an application and 
> library version. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2091) Fix release build error introduced by PARQUET-2043

2021-09-14 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414886#comment-17414886
 ] 

Gabor Szadovszky commented on PARQUET-2091:
---

[~sha...@uber.com], do you have issues with building on master? I have just 
checked and it works fine in my environment. (It also seems to be working in 
the PR checks.)

> Fix release build error introduced by PARQUET-2043
> --
>
> Key: PARQUET-2091
> URL: https://issues.apache.org/jira/browse/PARQUET-2091
> Project: Parquet
>  Issue Type: Task
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> After PARQUET-2043 when building for a release like 1.12.1, there is build 
> error complaining 'used undeclared dependency'. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2084) Upgrade Thrift to 0.14.2

2021-09-14 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2084.
---
Resolution: Fixed

> Upgrade Thrift to 0.14.2
> 
>
> Key: PARQUET-2084
> URL: https://issues.apache.org/jira/browse/PARQUET-2084
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2083) Expose getFieldPath from ColumnIO

2021-09-14 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2083.
---
Resolution: Fixed

> Expose getFieldPath from ColumnIO
> -
>
> Key: PARQUET-2083
> URL: https://issues.apache.org/jira/browse/PARQUET-2083
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
>
> Similar to PARUQET-2050, this exposes {{getFieldPath}} from {{ColumnIO}} so 
> downstream apps such as Spark can use it to assemble nested records.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2088) Different created_by field values for application and library

2021-09-14 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414829#comment-17414829
 ] 

Gabor Szadovszky commented on PARQUET-2088:
---

Ah, I see. So that code part is not about a feature but a bug fix. It is a pain 
in file format implementations that you not only have to fix issues in the code 
but also have to deal with invalid files written by that faulty code (if it was 
released). In this case we had to implement a workaround for invalid files 
written by parquet-mr releases before 1.8.0.
I am not sure how the Impala reader/writer works; I work on parquet-mr, and 
Impala is not tightly part of the Parquet community. It is rather an example 
that the created_by field has to be filled by the application that actually 
implements the writing of the parquet files. So e.g. Hive, Spark etc. will 
never be listed here, as they use parquet-mr to write/read the files, while 
Impala has its own writer/reader implementation.

> Different created_by field values for application and library
> -
>
> Key: PARQUET-2088
> URL: https://issues.apache.org/jira/browse/PARQUET-2088
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: format-2.9.0
>Reporter: Joshua Howard
>Priority: Minor
>
> There seems to be a discrepancy in the Parquet format created_by field 
> regarding how it should be filled out. The parquet-mr library uses this value 
> to enable/disable features based on the parquet-mr version 
> [here|https://github.com/apache/parquet-mr/blob/5f403501e9de05b6aa48f028191b4e78bb97fb12/parquet-column/src/main/java/org/apache/parquet/CorruptDeltaByteArrays.java#L64-L68].
>  Meanwhile, users are encouraged to make use of the application version 
> [here|https://www.javadoc.io/doc/org.apache.parquet/parquet-format/latest/org/apache/parquet/format/FileMetaData.html].
>  It seems like there are multiple fields needed for an application and 
> library version. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2085) Formatting is broken for description of BIT_PACKED

2021-09-14 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414823#comment-17414823
 ] 

Gabor Szadovszky commented on PARQUET-2085:
---

[~alexott], I got it now. You are talking about the [Parquet 
site|http://parquet.apache.org/documentation/latest/]. I was confused because 
the PR is in the parquet-format repo. The official site has a separate 
repository: https://github.com/apache/parquet-site. It is a bit tricky to 
update (you need to install old Ruby libs and generate the HTMLs manually), but 
if you would like to give it a try, feel free to create a new PR on the site 
repo.

> Formatting is broken for description of BIT_PACKED
> --
>
> Key: PARQUET-2085
> URL: https://issues.apache.org/jira/browse/PARQUET-2085
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Alex Ott
>Priority: Minor
>
> The Nested Encoding section of documentation doesn't escape the {{_}} 
> character, so it looks as following:
> Two encodings for the levels are supported BIT_PACKED and RLE. Only RLE is 
> now used as it supersedes BIT_PACKED.
> instead of
> Two encodings for the levels are supported BIT_PACKED and RLE. Only RLE is 
> now used as it supersedes BIT_PACKED.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version

2021-09-13 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2078.
---
Resolution: Fixed

Since the PR has been merged, I am resolving this.

> Failed to read parquet file after writing with the same parquet version
> ---
>
> Key: PARQUET-2078
> URL: https://issues.apache.org/jira/browse/PARQUET-2078
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Nemon Lou
>Assignee: Nemon Lou
>Priority: Critical
> Fix For: 1.13.0, 1.12.1
>
> Attachments: 
> PARQUET_2078_how_to_fix_rowgroup_fileoffset_for_branch_1.12.x.patch, 
> tpcds_customer_footer.json
>
>
> Writing parquet  file with version 1.12.0 in Apache Hive, then read that 
> file, returns the following error:
> {noformat}
> Caused by: java.lang.IllegalStateException: All of the offsets in the split 
> should be found in the file. expected: [4, 133961161] found: 
> [BlockMetaData{1530100, 133961157 [ColumnMetaData{UNCOMPRESSED 
> [c_customer_sk] optional int64 c_customer_sk  [PLAIN, RLE, BIT_PACKED], 4}, 
> ColumnMetaData{UNCOMPRESSED [c_customer_id] optional binary c_customer_id 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 12243647}, ColumnMetaData{UNCOMPRESSED 
> [c_current_cdemo_sk] optional int64 c_current_cdemo_sk  [PLAIN, RLE, 
> BIT_PACKED], 42848491}, ColumnMetaData{UNCOMPRESSED [c_current_hdemo_sk] 
> optional int64 c_current_hdemo_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 54868535}, ColumnMetaData{UNCOMPRESSED [c_current_addr_sk] optional int64 
> c_current_addr_sk  [PLAIN, RLE, BIT_PACKED], 57421932}, 
> ColumnMetaData{UNCOMPRESSED [c_first_shipto_date_sk] optional int64 
> c_first_shipto_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 69694809}, 
> ColumnMetaData{UNCOMPRESSED [c_first_sales_date_sk] optional int64 
> c_first_sales_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 72093040}, 
> ColumnMetaData{UNCOMPRESSED [c_salutation] optional binary c_salutation 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 74461508}, 
> ColumnMetaData{UNCOMPRESSED [c_first_name] optional binary c_first_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 75092758}, 
> ColumnMetaData{UNCOMPRESSED [c_last_name] optional binary c_last_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 77626525}, 
> ColumnMetaData{UNCOMPRESSED [c_preferred_cust_flag] optional binary 
> c_preferred_cust_flag (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 80116456}, ColumnMetaData{UNCOMPRESSED [c_birth_day] optional int32 
> c_birth_day  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 80505351}, 
> ColumnMetaData{UNCOMPRESSED [c_birth_month] optional int32 c_birth_month  
> [RLE, PLAIN_DICTIONARY, BIT_PACKED], 81581772}, ColumnMetaData{UNCOMPRESSED 
> [c_birth_year] optional int32 c_birth_year  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 82473740}, ColumnMetaData{UNCOMPRESSED [c_birth_country] 
> optional binary c_birth_country (STRING)  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 83921564}, ColumnMetaData{UNCOMPRESSED [c_login] optional binary 
> c_login (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 85457674}, 
> ColumnMetaData{UNCOMPRESSED [c_email_address] optional binary c_email_address 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 85460523}, ColumnMetaData{UNCOMPRESSED 
> [c_last_review_date_sk] optional int64 c_last_review_date_sk  [RLE, 
> PLAIN_DICTIONARY, BIT_PACKED], 132146109}]}]
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:172)
>  ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0]
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>  ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:95)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:60)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:89)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.(CombineHiveRecordReader.java:96)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) ~[?:1.8.0_292]
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  ~[?:1.8.0_292]
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ~[?:1.8.0_292]
>   at 

[jira] [Commented] (PARQUET-2088) Different created_by field values for application and library

2021-09-13 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414092#comment-17414092
 ] 

Gabor Szadovszky commented on PARQUET-2088:
---

Could you please list which exact features you think parquet-mr 
enables/disables based on {{created_by}}? This field is used by the actual 
writer implementations (e.g. Impala, parquet-mr, parquet-cpp etc.). The example 
already explains how to use it: {{impala version 1.0 (build 
6cf94d29b2b7115df4de2c06e2ab4326d721eb55)}}

> Different created_by field values for application and library
> -
>
> Key: PARQUET-2088
> URL: https://issues.apache.org/jira/browse/PARQUET-2088
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: format-2.9.0
>Reporter: Joshua Howard
>Priority: Minor
>
> There seems to be a discrepancy in the Parquet format created_by field 
> regarding how it should be filled out. The parquet-mr library uses this value 
> to enable/disable features based on the parquet-mr version 
> [here|https://github.com/apache/parquet-mr/blob/5f403501e9de05b6aa48f028191b4e78bb97fb12/parquet-column/src/main/java/org/apache/parquet/CorruptDeltaByteArrays.java#L64-L68].
> Meanwhile, users are encouraged to make use of the application version 
> [here|https://www.javadoc.io/doc/org.apache.parquet/parquet-format/latest/org/apache/parquet/format/FileMetaData.html].
>  It seems like there are multiple fields needed for an application and 
> library version. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2080) Deprecate RowGroup.file_offset

2021-09-13 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17414053#comment-17414053
 ] 

Gabor Szadovszky commented on PARQUET-2080:
---

[~gershinsky], although the original topic of this Jira is invalid, we still 
need to add proper comments to {{RowGroup.file_offset}} describing the 
situation of PARQUET-2078 and helping implementations handle the potentially 
wrong value. Would you like to handle this?

> Deprecate RowGroup.file_offset
> --
>
> Key: PARQUET-2080
> URL: https://issues.apache.org/jira/browse/PARQUET-2080
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Due to PARQUET-2078 RowGroup.file_offset is not reliable. We shall deprecate 
> the field and add suggestions how to calculate the value.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2080) Deprecate RowGroup.file_offset

2021-08-30 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-2080:
-

 Summary: Deprecate RowGroup.file_offset
 Key: PARQUET-2080
 URL: https://issues.apache.org/jira/browse/PARQUET-2080
 Project: Parquet
  Issue Type: Bug
  Components: parquet-format
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky


Due to PARQUET-2078 RowGroup.file_offset is not reliable. We shall deprecate 
the field and add suggestions how to calculate the value.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version

2021-08-30 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2078:
-

Assignee: Nemon Lou

> Failed to read parquet file after writing with the same parquet version
> ---
>
> Key: PARQUET-2078
> URL: https://issues.apache.org/jira/browse/PARQUET-2078
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Nemon Lou
>Assignee: Nemon Lou
>Priority: Critical
> Fix For: 1.13.0, 1.12.1
>
> Attachments: 
> PARQUET_2078_how_to_fix_rowgroup_fileoffset_for_branch_1.12.x.patch, 
> tpcds_customer_footer.json
>
>
> Writing parquet  file with version 1.12.0 in Apache Hive, then read that 
> file, returns the following error:
> {noformat}
> Caused by: java.lang.IllegalStateException: All of the offsets in the split 
> should be found in the file. expected: [4, 133961161] found: 
> [BlockMetaData{1530100, 133961157 [ColumnMetaData{UNCOMPRESSED 
> [c_customer_sk] optional int64 c_customer_sk  [PLAIN, RLE, BIT_PACKED], 4}, 
> ColumnMetaData{UNCOMPRESSED [c_customer_id] optional binary c_customer_id 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 12243647}, ColumnMetaData{UNCOMPRESSED 
> [c_current_cdemo_sk] optional int64 c_current_cdemo_sk  [PLAIN, RLE, 
> BIT_PACKED], 42848491}, ColumnMetaData{UNCOMPRESSED [c_current_hdemo_sk] 
> optional int64 c_current_hdemo_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 54868535}, ColumnMetaData{UNCOMPRESSED [c_current_addr_sk] optional int64 
> c_current_addr_sk  [PLAIN, RLE, BIT_PACKED], 57421932}, 
> ColumnMetaData{UNCOMPRESSED [c_first_shipto_date_sk] optional int64 
> c_first_shipto_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 69694809}, 
> ColumnMetaData{UNCOMPRESSED [c_first_sales_date_sk] optional int64 
> c_first_sales_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 72093040}, 
> ColumnMetaData{UNCOMPRESSED [c_salutation] optional binary c_salutation 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 74461508}, 
> ColumnMetaData{UNCOMPRESSED [c_first_name] optional binary c_first_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 75092758}, 
> ColumnMetaData{UNCOMPRESSED [c_last_name] optional binary c_last_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 77626525}, 
> ColumnMetaData{UNCOMPRESSED [c_preferred_cust_flag] optional binary 
> c_preferred_cust_flag (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 80116456}, ColumnMetaData{UNCOMPRESSED [c_birth_day] optional int32 
> c_birth_day  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 80505351}, 
> ColumnMetaData{UNCOMPRESSED [c_birth_month] optional int32 c_birth_month  
> [RLE, PLAIN_DICTIONARY, BIT_PACKED], 81581772}, ColumnMetaData{UNCOMPRESSED 
> [c_birth_year] optional int32 c_birth_year  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 82473740}, ColumnMetaData{UNCOMPRESSED [c_birth_country] 
> optional binary c_birth_country (STRING)  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 83921564}, ColumnMetaData{UNCOMPRESSED [c_login] optional binary 
> c_login (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 85457674}, 
> ColumnMetaData{UNCOMPRESSED [c_email_address] optional binary c_email_address 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 85460523}, ColumnMetaData{UNCOMPRESSED 
> [c_last_review_date_sk] optional int64 c_last_review_date_sk  [RLE, 
> PLAIN_DICTIONARY, BIT_PACKED], 132146109}]}]
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:172)
>  ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0]
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>  ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:95)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:60)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:89)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.(CombineHiveRecordReader.java:96)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) ~[?:1.8.0_292]
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  ~[?:1.8.0_292]
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ~[?:1.8.0_292]
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 
> 

[jira] [Commented] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version

2021-08-30 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17406722#comment-17406722
 ] 

Gabor Szadovszky commented on PARQUET-2078:
---

[~nemon], you are right, so {{dictionaryPageOffset}} is not impacted. Great 
news. 

On second look, it is not required that the first column of the row group 
before the invalid one is dictionary encoded. It is enough that there are 
dictionary encoded column chunks in the previous row groups and that the first 
column chunk of the invalid row group is not dictionary encoded. So, [~nemon], 
you are also right with your PR. 

> Failed to read parquet file after writing with the same parquet version
> ---
>
> Key: PARQUET-2078
> URL: https://issues.apache.org/jira/browse/PARQUET-2078
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Nemon Lou
>Priority: Critical
> Fix For: 1.13.0, 1.12.1
>
> Attachments: 
> PARQUET_2078_how_to_fix_rowgroup_fileoffset_for_branch_1.12.x.patch, 
> tpcds_customer_footer.json
>
>
> Writing parquet  file with version 1.12.0 in Apache Hive, then read that 
> file, returns the following error:
> {noformat}
> Caused by: java.lang.IllegalStateException: All of the offsets in the split 
> should be found in the file. expected: [4, 133961161] found: 
> [BlockMetaData{1530100, 133961157 [ColumnMetaData{UNCOMPRESSED 
> [c_customer_sk] optional int64 c_customer_sk  [PLAIN, RLE, BIT_PACKED], 4}, 
> ColumnMetaData{UNCOMPRESSED [c_customer_id] optional binary c_customer_id 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 12243647}, ColumnMetaData{UNCOMPRESSED 
> [c_current_cdemo_sk] optional int64 c_current_cdemo_sk  [PLAIN, RLE, 
> BIT_PACKED], 42848491}, ColumnMetaData{UNCOMPRESSED [c_current_hdemo_sk] 
> optional int64 c_current_hdemo_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 54868535}, ColumnMetaData{UNCOMPRESSED [c_current_addr_sk] optional int64 
> c_current_addr_sk  [PLAIN, RLE, BIT_PACKED], 57421932}, 
> ColumnMetaData{UNCOMPRESSED [c_first_shipto_date_sk] optional int64 
> c_first_shipto_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 69694809}, 
> ColumnMetaData{UNCOMPRESSED [c_first_sales_date_sk] optional int64 
> c_first_sales_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 72093040}, 
> ColumnMetaData{UNCOMPRESSED [c_salutation] optional binary c_salutation 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 74461508}, 
> ColumnMetaData{UNCOMPRESSED [c_first_name] optional binary c_first_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 75092758}, 
> ColumnMetaData{UNCOMPRESSED [c_last_name] optional binary c_last_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 77626525}, 
> ColumnMetaData{UNCOMPRESSED [c_preferred_cust_flag] optional binary 
> c_preferred_cust_flag (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 80116456}, ColumnMetaData{UNCOMPRESSED [c_birth_day] optional int32 
> c_birth_day  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 80505351}, 
> ColumnMetaData{UNCOMPRESSED [c_birth_month] optional int32 c_birth_month  
> [RLE, PLAIN_DICTIONARY, BIT_PACKED], 81581772}, ColumnMetaData{UNCOMPRESSED 
> [c_birth_year] optional int32 c_birth_year  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 82473740}, ColumnMetaData{UNCOMPRESSED [c_birth_country] 
> optional binary c_birth_country (STRING)  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 83921564}, ColumnMetaData{UNCOMPRESSED [c_login] optional binary 
> c_login (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 85457674}, 
> ColumnMetaData{UNCOMPRESSED [c_email_address] optional binary c_email_address 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 85460523}, ColumnMetaData{UNCOMPRESSED 
> [c_last_review_date_sk] optional int64 c_last_review_date_sk  [RLE, 
> PLAIN_DICTIONARY, BIT_PACKED], 132146109}]}]
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:172)
>  ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0]
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>  ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:95)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:60)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:89)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.(CombineHiveRecordReader.java:96)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 

[jira] [Commented] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version

2021-08-30 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17406621#comment-17406621
 ] 

Gabor Szadovszky commented on PARQUET-2078:
---

[~nemon], I am not sure how that would be possible. RowGroup.file_offset is set 
using the dictionary page offset of the first column chunk (if there is one):
 * 
[rowGroup.setFile_offset(block.getStartingPos())|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L580]
 * 
[BlockMetaData.getStartingPos()|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/BlockMetaData.java#L102-L104]
 * 
[ColumnChunkMetaData.getStartingPos()|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/metadata/ColumnChunkMetaData.java#L184-L193]

As I understand it, {{rowGroup(n)}} (in a file with {{k}} columns) gets the 
wrong offsets when:
* {{columnChunk(n-1, 1)}} (the first column chunk of {{rowGroup(n-1)}}) is 
dictionary encoded, and so is {{columnChunk(n-1, k)}}
* {{columnChunk(n, 1)}} is not dictionary encoded
In this case {{fileOffset(n) = dictionaryOffset(n, 1) = dictionaryOffset(n-1, k)}}.

To catch this issue we should check whether a column chunk is actually 
dictionary encoded before using its dictionary offset. Unfortunately, we have 
to do the same before using the file offset of a row group, or simply ignore 
that value and use the offsets of the first column chunk together with the 
check.
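
A sketch of that check, under the assumption that a fix would look roughly like this (it is not the merged patch): only trust the dictionary page offset as the chunk's starting position when it is set and lies before the chunk's first data page, otherwise fall back to the first data page offset.

{code:java}
import org.apache.parquet.hadoop.metadata.ColumnChunkMetaData;

public class StartingPosSketch {
  // Use the dictionary page offset only when it plausibly belongs to this
  // chunk; otherwise use the first data page offset, which is always written.
  static long safeStartingPos(ColumnChunkMetaData chunk) {
    long dictOffset = chunk.getDictionaryPageOffset();
    long firstDataPageOffset = chunk.getFirstDataPageOffset();
    if (dictOffset > 0 && dictOffset < firstDataPageOffset) {
      return dictOffset;
    }
    return firstDataPageOffset;
  }
}
{code}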

> Failed to read parquet file after writing with the same parquet version
> ---
>
> Key: PARQUET-2078
> URL: https://issues.apache.org/jira/browse/PARQUET-2078
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Nemon Lou
>Priority: Critical
> Fix For: 1.13.0, 1.12.1
>
> Attachments: 
> PARQUET_2078_how_to_fix_rowgroup_fileoffset_for_branch_1.12.x.patch, 
> tpcds_customer_footer.json
>
>
> Writing parquet  file with version 1.12.0 in Apache Hive, then read that 
> file, returns the following error:
> {noformat}
> Caused by: java.lang.IllegalStateException: All of the offsets in the split 
> should be found in the file. expected: [4, 133961161] found: 
> [BlockMetaData{1530100, 133961157 [ColumnMetaData{UNCOMPRESSED 
> [c_customer_sk] optional int64 c_customer_sk  [PLAIN, RLE, BIT_PACKED], 4}, 
> ColumnMetaData{UNCOMPRESSED [c_customer_id] optional binary c_customer_id 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 12243647}, ColumnMetaData{UNCOMPRESSED 
> [c_current_cdemo_sk] optional int64 c_current_cdemo_sk  [PLAIN, RLE, 
> BIT_PACKED], 42848491}, ColumnMetaData{UNCOMPRESSED [c_current_hdemo_sk] 
> optional int64 c_current_hdemo_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 54868535}, ColumnMetaData{UNCOMPRESSED [c_current_addr_sk] optional int64 
> c_current_addr_sk  [PLAIN, RLE, BIT_PACKED], 57421932}, 
> ColumnMetaData{UNCOMPRESSED [c_first_shipto_date_sk] optional int64 
> c_first_shipto_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 69694809}, 
> ColumnMetaData{UNCOMPRESSED [c_first_sales_date_sk] optional int64 
> c_first_sales_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 72093040}, 
> ColumnMetaData{UNCOMPRESSED [c_salutation] optional binary c_salutation 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 74461508}, 
> ColumnMetaData{UNCOMPRESSED [c_first_name] optional binary c_first_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 75092758}, 
> ColumnMetaData{UNCOMPRESSED [c_last_name] optional binary c_last_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 77626525}, 
> ColumnMetaData{UNCOMPRESSED [c_preferred_cust_flag] optional binary 
> c_preferred_cust_flag (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 80116456}, ColumnMetaData{UNCOMPRESSED [c_birth_day] optional int32 
> c_birth_day  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 80505351}, 
> ColumnMetaData{UNCOMPRESSED [c_birth_month] optional int32 c_birth_month  
> [RLE, PLAIN_DICTIONARY, BIT_PACKED], 81581772}, ColumnMetaData{UNCOMPRESSED 
> [c_birth_year] optional int32 c_birth_year  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 82473740}, ColumnMetaData{UNCOMPRESSED [c_birth_country] 
> optional binary c_birth_country (STRING)  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 83921564}, ColumnMetaData{UNCOMPRESSED [c_login] optional binary 
> c_login (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 85457674}, 
> ColumnMetaData{UNCOMPRESSED [c_email_address] optional binary c_email_address 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 85460523}, ColumnMetaData{UNCOMPRESSED 
> [c_last_review_date_sk] optional int64 c_last_review_date_sk  [RLE, 
> PLAIN_DICTIONARY, BIT_PACKED], 

[jira] [Commented] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version

2021-08-27 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405698#comment-17405698
 ] 

Gabor Szadovszky commented on PARQUET-2078:
---

Added the dev list thread link here to keep both sides in the loop.

> Failed to read parquet file after writing with the same parquet version
> ---
>
> Key: PARQUET-2078
> URL: https://issues.apache.org/jira/browse/PARQUET-2078
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Nemon Lou
>Priority: Critical
> Fix For: 1.13.0, 1.12.1
>
> Attachments: 
> PARQUET_2078_how_to_fix_rowgroup_fileoffset_for_branch_1.12.x.patch, 
> tpcds_customer_footer.json
>
>
> Writing parquet  file with version 1.12.0 in Apache Hive, then read that 
> file, returns the following error:
> {noformat}
> Caused by: java.lang.IllegalStateException: All of the offsets in the split 
> should be found in the file. expected: [4, 133961161] found: 
> [BlockMetaData{1530100, 133961157 [ColumnMetaData{UNCOMPRESSED 
> [c_customer_sk] optional int64 c_customer_sk  [PLAIN, RLE, BIT_PACKED], 4}, 
> ColumnMetaData{UNCOMPRESSED [c_customer_id] optional binary c_customer_id 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 12243647}, ColumnMetaData{UNCOMPRESSED 
> [c_current_cdemo_sk] optional int64 c_current_cdemo_sk  [PLAIN, RLE, 
> BIT_PACKED], 42848491}, ColumnMetaData{UNCOMPRESSED [c_current_hdemo_sk] 
> optional int64 c_current_hdemo_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 54868535}, ColumnMetaData{UNCOMPRESSED [c_current_addr_sk] optional int64 
> c_current_addr_sk  [PLAIN, RLE, BIT_PACKED], 57421932}, 
> ColumnMetaData{UNCOMPRESSED [c_first_shipto_date_sk] optional int64 
> c_first_shipto_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 69694809}, 
> ColumnMetaData{UNCOMPRESSED [c_first_sales_date_sk] optional int64 
> c_first_sales_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 72093040}, 
> ColumnMetaData{UNCOMPRESSED [c_salutation] optional binary c_salutation 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 74461508}, 
> ColumnMetaData{UNCOMPRESSED [c_first_name] optional binary c_first_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 75092758}, 
> ColumnMetaData{UNCOMPRESSED [c_last_name] optional binary c_last_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 77626525}, 
> ColumnMetaData{UNCOMPRESSED [c_preferred_cust_flag] optional binary 
> c_preferred_cust_flag (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 80116456}, ColumnMetaData{UNCOMPRESSED [c_birth_day] optional int32 
> c_birth_day  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 80505351}, 
> ColumnMetaData{UNCOMPRESSED [c_birth_month] optional int32 c_birth_month  
> [RLE, PLAIN_DICTIONARY, BIT_PACKED], 81581772}, ColumnMetaData{UNCOMPRESSED 
> [c_birth_year] optional int32 c_birth_year  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 82473740}, ColumnMetaData{UNCOMPRESSED [c_birth_country] 
> optional binary c_birth_country (STRING)  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 83921564}, ColumnMetaData{UNCOMPRESSED [c_login] optional binary 
> c_login (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 85457674}, 
> ColumnMetaData{UNCOMPRESSED [c_email_address] optional binary c_email_address 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 85460523}, ColumnMetaData{UNCOMPRESSED 
> [c_last_review_date_sk] optional int64 c_last_review_date_sk  [RLE, 
> PLAIN_DICTIONARY, BIT_PACKED], 132146109}]}]
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:172)
>  ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0]
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>  ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:95)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:60)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:89)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.(CombineHiveRecordReader.java:96)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) ~[?:1.8.0_292]
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  ~[?:1.8.0_292]
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ~[?:1.8.0_292]
>   at 

[jira] [Comment Edited] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version

2021-08-27 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405677#comment-17405677
 ] 

Gabor Szadovszky edited comment on PARQUET-2078 at 8/27/21, 8:50 AM:
-

[~nemon], thanks a lot for the detailed explanation and the patch! So what I 
have written before still stands. Before 1.12.0 we did not write the dictionary 
offset to the column chunk metadata (see PARQUET-1850), even though the 
calculation has been wrong since the beginning. Since 1.12.0 has already been 
released, we have to be prepared for invalid dictionary offset values.

What we need to handle in a fix:
* Fix the calculation issue (see the attached patch)
* Add a unit test for this issue to ensure it works properly and won't happen 
again
* Investigate all code parts where the dictionary offset and file offset are 
used and be prepared for invalid values

[~nemon], would you like to work on this by opening a PR on github?


was (Author: gszadovszky):
[~nemon], thanks a lot for the detailed explanation and the patch! So what I 
have written before still stands. Before 1.12.0 we did not write the dictionary 
offset to the column chunk metadata (see PARQUET-1850), even though the 
calculation has been wrong since the beginning. Since 1.12.0 has already been 
released, we have to be prepared for invalid dictionary offset values.

What we need to handle in a fix:
* Fix the calculation issue (see the attached patch)
* Add a unit test for this issue to ensure it works properly and won't happen 
again
* Investigate all code parts where the dictionary offset is used and be 
prepared for invalid values

[~nemon], would you like to work on this by opening a PR on github?

> Failed to read parquet file after writing with the same parquet version
> ---
>
> Key: PARQUET-2078
> URL: https://issues.apache.org/jira/browse/PARQUET-2078
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Nemon Lou
>Priority: Critical
> Fix For: 1.13.0, 1.12.1
>
> Attachments: 
> PARQUET_2078_how_to_fix_rowgroup_fileoffset_for_branch_1.12.x.patch, 
> tpcds_customer_footer.json
>
>
> Writing a parquet file with version 1.12.0 in Apache Hive, then reading that 
> file back, returns the following error:
> {noformat}
> Caused by: java.lang.IllegalStateException: All of the offsets in the split 
> should be found in the file. expected: [4, 133961161] found: 
> [BlockMetaData{1530100, 133961157 [ColumnMetaData{UNCOMPRESSED 
> [c_customer_sk] optional int64 c_customer_sk  [PLAIN, RLE, BIT_PACKED], 4}, 
> ColumnMetaData{UNCOMPRESSED [c_customer_id] optional binary c_customer_id 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 12243647}, ColumnMetaData{UNCOMPRESSED 
> [c_current_cdemo_sk] optional int64 c_current_cdemo_sk  [PLAIN, RLE, 
> BIT_PACKED], 42848491}, ColumnMetaData{UNCOMPRESSED [c_current_hdemo_sk] 
> optional int64 c_current_hdemo_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 54868535}, ColumnMetaData{UNCOMPRESSED [c_current_addr_sk] optional int64 
> c_current_addr_sk  [PLAIN, RLE, BIT_PACKED], 57421932}, 
> ColumnMetaData{UNCOMPRESSED [c_first_shipto_date_sk] optional int64 
> c_first_shipto_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 69694809}, 
> ColumnMetaData{UNCOMPRESSED [c_first_sales_date_sk] optional int64 
> c_first_sales_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 72093040}, 
> ColumnMetaData{UNCOMPRESSED [c_salutation] optional binary c_salutation 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 74461508}, 
> ColumnMetaData{UNCOMPRESSED [c_first_name] optional binary c_first_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 75092758}, 
> ColumnMetaData{UNCOMPRESSED [c_last_name] optional binary c_last_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 77626525}, 
> ColumnMetaData{UNCOMPRESSED [c_preferred_cust_flag] optional binary 
> c_preferred_cust_flag (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 80116456}, ColumnMetaData{UNCOMPRESSED [c_birth_day] optional int32 
> c_birth_day  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 80505351}, 
> ColumnMetaData{UNCOMPRESSED [c_birth_month] optional int32 c_birth_month  
> [RLE, PLAIN_DICTIONARY, BIT_PACKED], 81581772}, ColumnMetaData{UNCOMPRESSED 
> [c_birth_year] optional int32 c_birth_year  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 82473740}, ColumnMetaData{UNCOMPRESSED [c_birth_country] 
> optional binary c_birth_country (STRING)  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 83921564}, ColumnMetaData{UNCOMPRESSED [c_login] optional binary 
> c_login (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 85457674}, 
> ColumnMetaData{UNCOMPRESSED [c_email_address] optional binary c_email_address 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 85460523}, ColumnMetaData{UNCOMPRESSED 
> [c_last_review_date_sk] optional 

[jira] [Commented] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version

2021-08-27 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405677#comment-17405677
 ] 

Gabor Szadovszky commented on PARQUET-2078:
---

[~nemon], thanks a lot for the detailed explanation and the patch! So what I 
have written before still stands. Before 1.12.0 we did not write the dictionary 
offset to the column chunk metadata (see PARQUET-1850), even though the 
calculation has been wrong since the beginning. Since 1.12.0 has already been 
released, we have to be prepared for invalid dictionary offset values.

What we need to handle in a fix:
* Fix the calculation issue (see the attached patch)
* Add a unit test for this issue to ensure it works properly and won't happen 
again
* Investigate all code parts where the dictionary offset is used and be 
prepared for invalid values

[~nemon], would you like to work on this by opening a PR on github?

> Failed to read parquet file after writing with the same parquet version
> ---
>
> Key: PARQUET-2078
> URL: https://issues.apache.org/jira/browse/PARQUET-2078
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Nemon Lou
>Priority: Critical
> Fix For: 1.13.0, 1.12.1
>
> Attachments: 
> PARQUET_2078_how_to_fix_rowgroup_fileoffset_for_branch_1.12.x.patch, 
> tpcds_customer_footer.json
>
>
> Writing a parquet file with version 1.12.0 in Apache Hive, then reading that 
> file back, returns the following error:
> {noformat}
> Caused by: java.lang.IllegalStateException: All of the offsets in the split 
> should be found in the file. expected: [4, 133961161] found: 
> [BlockMetaData{1530100, 133961157 [ColumnMetaData{UNCOMPRESSED 
> [c_customer_sk] optional int64 c_customer_sk  [PLAIN, RLE, BIT_PACKED], 4}, 
> ColumnMetaData{UNCOMPRESSED [c_customer_id] optional binary c_customer_id 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 12243647}, ColumnMetaData{UNCOMPRESSED 
> [c_current_cdemo_sk] optional int64 c_current_cdemo_sk  [PLAIN, RLE, 
> BIT_PACKED], 42848491}, ColumnMetaData{UNCOMPRESSED [c_current_hdemo_sk] 
> optional int64 c_current_hdemo_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 54868535}, ColumnMetaData{UNCOMPRESSED [c_current_addr_sk] optional int64 
> c_current_addr_sk  [PLAIN, RLE, BIT_PACKED], 57421932}, 
> ColumnMetaData{UNCOMPRESSED [c_first_shipto_date_sk] optional int64 
> c_first_shipto_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 69694809}, 
> ColumnMetaData{UNCOMPRESSED [c_first_sales_date_sk] optional int64 
> c_first_sales_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 72093040}, 
> ColumnMetaData{UNCOMPRESSED [c_salutation] optional binary c_salutation 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 74461508}, 
> ColumnMetaData{UNCOMPRESSED [c_first_name] optional binary c_first_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 75092758}, 
> ColumnMetaData{UNCOMPRESSED [c_last_name] optional binary c_last_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 77626525}, 
> ColumnMetaData{UNCOMPRESSED [c_preferred_cust_flag] optional binary 
> c_preferred_cust_flag (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 80116456}, ColumnMetaData{UNCOMPRESSED [c_birth_day] optional int32 
> c_birth_day  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 80505351}, 
> ColumnMetaData{UNCOMPRESSED [c_birth_month] optional int32 c_birth_month  
> [RLE, PLAIN_DICTIONARY, BIT_PACKED], 81581772}, ColumnMetaData{UNCOMPRESSED 
> [c_birth_year] optional int32 c_birth_year  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 82473740}, ColumnMetaData{UNCOMPRESSED [c_birth_country] 
> optional binary c_birth_country (STRING)  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 83921564}, ColumnMetaData{UNCOMPRESSED [c_login] optional binary 
> c_login (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 85457674}, 
> ColumnMetaData{UNCOMPRESSED [c_email_address] optional binary c_email_address 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 85460523}, ColumnMetaData{UNCOMPRESSED 
> [c_last_review_date_sk] optional int64 c_last_review_date_sk  [RLE, 
> PLAIN_DICTIONARY, BIT_PACKED], 132146109}]}]
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:172)
>  ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0]
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>  ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:95)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:60)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> 

[jira] [Commented] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version

2021-08-26 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17405227#comment-17405227
 ] 

Gabor Szadovszky commented on PARQUET-2078:
---

[~nemon], thanks a lot for the investigation. What is not clear to me is how we 
could set a wrong value for {{RowGroup.file_offset}}. Based on the code in 
[ParquetMetadataConverter|https://github.com/apache/parquet-mr/blob/apache-parquet-1.12.0/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L580]
 we use the starting position of the first column chunk of the actual row 
group. The starting position of the column chunk is the dictionary page offset 
or the first data page offset, whichever is smaller (because the dictionary 
page is always at the starting position of the column chunk). If the dictionary 
page offset or the first data page offset were wrong, we should see other 
issues as well. Can you read the file content without using InputSplits (e.g. 
parquet-tools, parquet-cli or java code that reads the whole file)? There is a 
new parquet-cli command called footer that can list the raw footer of the file. 
It would be interesting to see its output on the related parquet file. 
Unfortunately, this feature is not released yet, so it has to be built from 
master. If you are interested in doing so, please check the 
[readme|https://github.com/apache/parquet-mr/blob/master/parquet-cli/README.md] 
for details.

If you are right and we have been writing invalid offsets to the file since 
1.12.0, it is a serious issue. We not only have to fix the write path but the 
read path as well, since files written by 1.12.0 already exist.

> Failed to read parquet file after writing with the same parquet version
> ---
>
> Key: PARQUET-2078
> URL: https://issues.apache.org/jira/browse/PARQUET-2078
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Nemon Lou
>Priority: Critical
>
> Writing a parquet file with version 1.12.0 in Apache Hive, then reading that 
> file back, returns the following error:
> {noformat}
> Caused by: java.lang.IllegalStateException: All of the offsets in the split 
> should be found in the file. expected: [4, 133961161] found: 
> [BlockMetaData{1530100, 133961157 [ColumnMetaData{UNCOMPRESSED 
> [c_customer_sk] optional int64 c_customer_sk  [PLAIN, RLE, BIT_PACKED], 4}, 
> ColumnMetaData{UNCOMPRESSED [c_customer_id] optional binary c_customer_id 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 12243647}, ColumnMetaData{UNCOMPRESSED 
> [c_current_cdemo_sk] optional int64 c_current_cdemo_sk  [PLAIN, RLE, 
> BIT_PACKED], 42848491}, ColumnMetaData{UNCOMPRESSED [c_current_hdemo_sk] 
> optional int64 c_current_hdemo_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 54868535}, ColumnMetaData{UNCOMPRESSED [c_current_addr_sk] optional int64 
> c_current_addr_sk  [PLAIN, RLE, BIT_PACKED], 57421932}, 
> ColumnMetaData{UNCOMPRESSED [c_first_shipto_date_sk] optional int64 
> c_first_shipto_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 69694809}, 
> ColumnMetaData{UNCOMPRESSED [c_first_sales_date_sk] optional int64 
> c_first_sales_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 72093040}, 
> ColumnMetaData{UNCOMPRESSED [c_salutation] optional binary c_salutation 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 74461508}, 
> ColumnMetaData{UNCOMPRESSED [c_first_name] optional binary c_first_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 75092758}, 
> ColumnMetaData{UNCOMPRESSED [c_last_name] optional binary c_last_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 77626525}, 
> ColumnMetaData{UNCOMPRESSED [c_preferred_cust_flag] optional binary 
> c_preferred_cust_flag (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 80116456}, ColumnMetaData{UNCOMPRESSED [c_birth_day] optional int32 
> c_birth_day  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 80505351}, 
> ColumnMetaData{UNCOMPRESSED [c_birth_month] optional int32 c_birth_month  
> [RLE, PLAIN_DICTIONARY, BIT_PACKED], 81581772}, ColumnMetaData{UNCOMPRESSED 
> [c_birth_year] optional int32 c_birth_year  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 82473740}, ColumnMetaData{UNCOMPRESSED [c_birth_country] 
> optional binary c_birth_country (STRING)  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 83921564}, ColumnMetaData{UNCOMPRESSED [c_login] optional binary 
> c_login (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 85457674}, 
> ColumnMetaData{UNCOMPRESSED [c_email_address] optional binary c_email_address 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 85460523}, ColumnMetaData{UNCOMPRESSED 
> [c_last_review_date_sk] optional int64 c_last_review_date_sk  [RLE, 
> PLAIN_DICTIONARY, BIT_PACKED], 132146109}]}]
>   at 
> 

[jira] [Updated] (PARQUET-2078) Failed to read parquet file after writing with the same parquet version

2021-08-26 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-2078:
--
Fix Version/s: 1.12.1
   1.13.0

> Failed to read parquet file after writing with the same parquet version
> ---
>
> Key: PARQUET-2078
> URL: https://issues.apache.org/jira/browse/PARQUET-2078
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Nemon Lou
>Priority: Critical
> Fix For: 1.13.0, 1.12.1
>
>
> Writing a parquet file with version 1.12.0 in Apache Hive, then reading that 
> file back, returns the following error:
> {noformat}
> Caused by: java.lang.IllegalStateException: All of the offsets in the split 
> should be found in the file. expected: [4, 133961161] found: 
> [BlockMetaData{1530100, 133961157 [ColumnMetaData{UNCOMPRESSED 
> [c_customer_sk] optional int64 c_customer_sk  [PLAIN, RLE, BIT_PACKED], 4}, 
> ColumnMetaData{UNCOMPRESSED [c_customer_id] optional binary c_customer_id 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 12243647}, ColumnMetaData{UNCOMPRESSED 
> [c_current_cdemo_sk] optional int64 c_current_cdemo_sk  [PLAIN, RLE, 
> BIT_PACKED], 42848491}, ColumnMetaData{UNCOMPRESSED [c_current_hdemo_sk] 
> optional int64 c_current_hdemo_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 54868535}, ColumnMetaData{UNCOMPRESSED [c_current_addr_sk] optional int64 
> c_current_addr_sk  [PLAIN, RLE, BIT_PACKED], 57421932}, 
> ColumnMetaData{UNCOMPRESSED [c_first_shipto_date_sk] optional int64 
> c_first_shipto_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 69694809}, 
> ColumnMetaData{UNCOMPRESSED [c_first_sales_date_sk] optional int64 
> c_first_sales_date_sk  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 72093040}, 
> ColumnMetaData{UNCOMPRESSED [c_salutation] optional binary c_salutation 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 74461508}, 
> ColumnMetaData{UNCOMPRESSED [c_first_name] optional binary c_first_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 75092758}, 
> ColumnMetaData{UNCOMPRESSED [c_last_name] optional binary c_last_name 
> (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 77626525}, 
> ColumnMetaData{UNCOMPRESSED [c_preferred_cust_flag] optional binary 
> c_preferred_cust_flag (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 
> 80116456}, ColumnMetaData{UNCOMPRESSED [c_birth_day] optional int32 
> c_birth_day  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 80505351}, 
> ColumnMetaData{UNCOMPRESSED [c_birth_month] optional int32 c_birth_month  
> [RLE, PLAIN_DICTIONARY, BIT_PACKED], 81581772}, ColumnMetaData{UNCOMPRESSED 
> [c_birth_year] optional int32 c_birth_year  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 82473740}, ColumnMetaData{UNCOMPRESSED [c_birth_country] 
> optional binary c_birth_country (STRING)  [RLE, PLAIN_DICTIONARY, 
> BIT_PACKED], 83921564}, ColumnMetaData{UNCOMPRESSED [c_login] optional binary 
> c_login (STRING)  [RLE, PLAIN_DICTIONARY, BIT_PACKED], 85457674}, 
> ColumnMetaData{UNCOMPRESSED [c_email_address] optional binary c_email_address 
> (STRING)  [PLAIN, RLE, BIT_PACKED], 85460523}, ColumnMetaData{UNCOMPRESSED 
> [c_last_review_date_sk] optional int64 c_last_review_date_sk  [RLE, 
> PLAIN_DICTIONARY, BIT_PACKED], 132146109}]}]
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:172)
>  ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0]
>   at 
> org.apache.parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:140)
>  ~[parquet-hadoop-bundle-1.12.0.jar:1.12.0]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:95)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.read.ParquetRecordReaderWrapper.(ParquetRecordReaderWrapper.java:60)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:89)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at 
> org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.(CombineHiveRecordReader.java:96)
>  ~[hive-exec-4.0.0-SNAPSHOT.jar:4.0.0-SNAPSHOT]
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method) ~[?:1.8.0_292]
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>  ~[?:1.8.0_292]
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>  ~[?:1.8.0_292]
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423) 
> ~[?:1.8.0_292]
>   at 
> 

[jira] [Commented] (PARQUET-2071) Encryption translation tool

2021-08-23 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17403039#comment-17403039
 ] 

Gabor Szadovszky commented on PARQUET-2071:
---

[~sha...@uber.com], sure, I am fine with having the "universal tool" and the 
required refactors handled in a separate jira.

> Encryption translation tool 
> 
>
> Key: PARQUET-2071
> URL: https://issues.apache.org/jira/browse/PARQUET-2071
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> When translating existing data to encryption state, we could develop a tool 
> like TransCompression to translate the data at page level to encryption state 
> without reading to record and rewrite. This will speed up the process a lot. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2064) Make Range public accessible in RowRanges

2021-08-16 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2064.
---
Resolution: Fixed

> Make Range public accessible in RowRanges
> -
>
> Key: PARQUET-2064
> URL: https://issues.apache.org/jira/browse/PARQUET-2064
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> When rolling out to Presto, I found we need to know the boundaries of each 
> Range in RowRanges. It is still doable with the Iterator, but since Presto 
> has a batch reader, we cannot use an iterator for each row.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2073) Is there something wrong calculate usedMem in ColumnWriteStoreBase.java

2021-08-16 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2073.
---
Resolution: Fixed

> Is there something wrong calculate usedMem in ColumnWriteStoreBase.java
> ---
>
> Key: PARQUET-2073
> URL: https://issues.apache.org/jira/browse/PARQUET-2073
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: JiangYang
>Assignee: JiangYang
>Priority: Critical
> Attachments: image-2021-08-05-14-37-51-299.png
>
>
> !image-2021-08-05-14-37-51-299.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2059) Tests require too much memory

2021-08-16 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2059.
---
Resolution: Fixed

> Tests require too much memory
> -
>
> Key: PARQUET-2059
> URL: https://issues.apache.org/jira/browse/PARQUET-2059
> Project: Parquet
>  Issue Type: Test
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> For testing the solution of PARQUET-1633 we require ~3GB memory that is not 
> always available. To solve this issue we temporarily disabled the implemented 
> unit test.
> We need to ensure somehow that [this 
> test|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestLargeColumnChunk.java]
>  (and maybe some other similar ones) are executed regularly. Some options we 
> might have:
> * Execute this test separately with a maven profile. I am not sure if the CI 
> allows allocating such a large amount of memory, but with Xmx options we 
> might give it a try and create a separate check for this test only.
> * Similar to the previous with the profile but not executing in the CI ever. 
> Instead, we add some comments to the release doc so this test will be 
> executed at least once per release.
> * Configuring the CI profile to skip this test but have it in the normal 
> scenario, meaning the devs will execute it locally. There are a couple of 
> cons though. There is no guarantee that devs execute all the tests, including 
> this one. It can also cause issues if the dev doesn't have enough memory and 
> doesn't know that the test failure is not related to the current change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2043) Fail build for used but not declared direct dependencies

2021-08-16 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2043.
---
Resolution: Fixed

> Fail build for used but not declared direct dependencies
> 
>
> Key: PARQUET-2043
> URL: https://issues.apache.org/jira/browse/PARQUET-2043
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> It is always a good practice to specify all the dependencies directly used 
> (classes are imported from) by our modules. We have a couple of issues where 
> classes are imported from transitive dependencies. It makes it hard to 
> validate the actual dependency tree and also may result in using wrong 
> versions of classes (see PARQUET-2038 for example).
> It would be good to enforce to reference such dependencies directly in the 
> module poms. The [maven-dependency-plugin analyze-only 
> goal|http://maven.apache.org/plugins/maven-dependency-plugin/analyze-only-mojo.html]
>  can be used for this purpose.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2063) Remove Compile Warnings from MemoryManager

2021-08-10 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2063.
---
Resolution: Fixed

> Remove Compile Warnings from MemoryManager
> --
>
> Key: PARQUET-2063
> URL: https://issues.apache.org/jira/browse/PARQUET-2063
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2074) Upgrade to JDK 9+

2021-08-09 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17396113#comment-17396113
 ] 

Gabor Szadovszky commented on PARQUET-2074:
---

[~belugabehr], it sounds good to me, but also keep in mind that switching to 
JDK9 and using its new capabilities would make parquet-mr incompatible with 
certain environments. Also, this would require a community agreement.

I would suggest bringing up this topic in the next parquet sync (Aug 24) and/or 
starting a formal vote on the dev list.

> Upgrade to JDK 9+
> -
>
> Key: PARQUET-2074
> URL: https://issues.apache.org/jira/browse/PARQUET-2074
> Project: Parquet
>  Issue Type: Improvement
>Reporter: David Mollitor
>Priority: Major
>
> Moving to JDK 9 will provide a plethora of new compares/equals capabilities 
> on arrays that are all based on vectorization and implement 
> {{\@IntrinsicCandidate}}
> https://docs.oracle.com/javase/9/docs/api/java/util/Arrays.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-2073) Is there something wrong calculate usedMem in ColumnWriteStoreBase.java

2021-08-09 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2073:
-

Assignee: JiangYang

> Is there something wrong calculate usedMem in ColumnWriteStoreBase.java
> ---
>
> Key: PARQUET-2073
> URL: https://issues.apache.org/jira/browse/PARQUET-2073
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: JiangYang
>Assignee: JiangYang
>Priority: Critical
> Attachments: image-2021-08-05-14-37-51-299.png
>
>
> !image-2021-08-05-14-37-51-299.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2072) Do Not Determine Both Min/Max for Binary Stats

2021-08-09 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2072.
---
Resolution: Fixed

> Do Not Determine Both Min/Max for Binary Stats
> --
>
> Key: PARQUET-2072
> URL: https://issues.apache.org/jira/browse/PARQUET-2072
> Project: Parquet
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>
> I'm looking at some benchmarking code of Apache ORC vs. Apache Parquet and 
> see that Parquet is quite a bit slower for writes (reads TBD). Based on my 
> investigation, I have noticed a significant amount of time spent in 
> determining min/max for binary types.
> One quick improvement is to bypass the "max" value determination if the value 
> has already been determined to be a "min".
> While I'm at it, remove calls to deprecated functions.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2073) Is there something wrong calculate usedMem in ColumnWriteStoreBase.java

2021-08-06 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17394605#comment-17394605
 ] 

Gabor Szadovszky commented on PARQUET-2073:
---

[~JiangYang], you're right, {{rowsToFillPage}} will always be zero. It means 
(because of [line 
256|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreBase.java#L256])
 that we never use this estimate correctly, so the next row count check will 
always step by {{props.getMinRowCountForPageSizeCheck()}}. Funny that it has 
worked this way ever since we introduced this estimation logic; strange that no 
one has ever noticed.

As for fixing this issue, we can get proper results without casting:
{code:java}
rows * remainingMem / usedMem
{code}
Meanwhile, this form is a bit misleading, so we need a comment explaining that 
we are estimating the number of rows that can still be written to the page 
based on the average size of the rows already written, e.g. as in the sketch 
below.
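
A minimal sketch (the class and method names are mine, not the actual 
ColumnWriteStoreBase code) of what that commented calculation could look like:
{code:java}
// Minimal sketch; names and placement are illustrative, not the actual
// ColumnWriteStoreBase code.
final class PageSizeEstimate {
  static long estimateRowsToFillPage(long rows, long usedMem, long remainingMem) {
    // Estimate how many more rows fit into the page based on the average size of
    // the rows written so far. Multiplying before dividing keeps the all-long
    // arithmetic from truncating to zero when usedMem (bytes) exceeds rows.
    return rows * remainingMem / usedMem;
  }
}
{code}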

The tricky part is how to test it. This will be a new behavior of the page 
writing that we have never tested properly. (Otherwise, we would have caught 
this issue.) Whether this approach works well depends heavily on the 
characteristics of the values. (For example, small values at the beginning and 
large ones later can cause this logic to overrun the maximum page size. 
However, the same can happen if wrong values are used for 
{{min/maxRowCountForPageSizeCheck}}.)

Sure, please, create a PR. I am happy to review.

> Is there something wrong calculate usedMem in ColumnWriteStoreBase.java
> ---
>
> Key: PARQUET-2073
> URL: https://issues.apache.org/jira/browse/PARQUET-2073
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: JiangYang
>Priority: Critical
> Attachments: image-2021-08-05-14-37-51-299.png
>
>
> !image-2021-08-05-14-37-51-299.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2073) Is there something wrong calculate usedMem in ColumnWriteStoreBase.java

2021-08-05 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393807#comment-17393807
 ] 

Gabor Szadovszky commented on PARQUET-2073:
---

So, we are talking about [this 
line|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriteStoreBase.java#L243].
 
The original line was
{code:java}
(long) ((float) rows) / usedMem * remainingMem
{code}
Here both casts apply to {{rows}}, so it is completely fine to remove the 
{{(float)}} cast. Even the {{(long)}} cast can be removed since all three 
values are {{long}}. I can see only one case where the two forms can differ: 
when the value in {{rows}} cannot be represented exactly after the downcast to 
{{float}}. Could you please list the exact numbers where you got different 
results?

Separately, the very original code should have been
{code:java}
(long) ((double) rows / usedMem * remainingMem )
{code}
This way you would get more accurate numbers.
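
A quick illustration with made-up numbers (not taken from this jira) of how the 
as-written form and the corrected form diverge:
{code:java}
// Made-up values for illustration only: rows is a row count, the memory values
// are byte counts.
public class CastDemo {
  public static void main(String[] args) {
    long rows = 10_000L, usedMem = 1_048_576L, remainingMem = 9_437_184L;

    // As written: both casts bind to `rows` alone, so the division is long division
    // and truncates to 0 whenever rows < usedMem, zeroing the whole estimate.
    long original = (long) ((float) rows) / usedMem * remainingMem;

    // With the division kept in floating point and the cast applied to the result.
    long corrected = (long) ((double) rows / usedMem * remainingMem);

    System.out.println(original + " vs " + corrected); // prints "0 vs 90000"
  }
}
{code}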

> Is there something wrong calculate usedMem in ColumnWriteStoreBase.java
> ---
>
> Key: PARQUET-2073
> URL: https://issues.apache.org/jira/browse/PARQUET-2073
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: JiangYang
>Priority: Critical
> Attachments: image-2021-08-05-14-37-51-299.png
>
>
> !image-2021-08-05-14-37-51-299.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2071) Encryption translation tool

2021-08-05 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17393788#comment-17393788
 ] 

Gabor Szadovszky commented on PARQUET-2071:
---

I think it is a great idea to skip unnecessary deserialization/serialization 
steps in such cases. Meanwhile, we already have some tools with a similar 
approach, like trans-compression or prune columns. What do you think of 
implementing a more universal tool where you can configure the projection 
schema and the configuration of the target file? The tool could then decide 
which level of deserialization/serialization is required. For example, for 
trans-compression you need to decompress the pages, while for encryption you 
don't.

> Encryption translation tool 
> 
>
> Key: PARQUET-2071
> URL: https://issues.apache.org/jira/browse/PARQUET-2071
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> When translating existing data to encryption state, we could develop a tool 
> like TransCompression to translate the data at page level to encryption state 
> without reading to record and rewrite. This will speed up the process a lot. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-2070) Replace deprecated syntax in protobuf support

2021-08-04 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2070:
-

Assignee: Svend Vanderveken

> Replace deprecated syntax in protobuf support
> -
>
> Key: PARQUET-2070
> URL: https://issues.apache.org/jira/browse/PARQUET-2070
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Svend Vanderveken
>Assignee: Svend Vanderveken
>Priority: Minor
>
> This is a trivial change: at the moment ProtoWriteSupport.java is producing 
> a human-readable JSON output of the protobuf schema with the following 
> deprecated syntax:
>  
> {code:java}
> TextFormat.printToString(asProto){code}
>  
> Also, the method where this code is present performs one reflection 
> invocation to get the protobuf descriptor, which is unnecessary since the 
> context from where it's called already has this descriptor.
> => all minor and trivial stuff, just housekeeping I guess :)
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2070) Replace deprecated syntax in protobuf support

2021-08-04 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2070.
---
Resolution: Fixed

> Replace deprecated syntax in protobuf support
> -
>
> Key: PARQUET-2070
> URL: https://issues.apache.org/jira/browse/PARQUET-2070
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Svend Vanderveken
>Assignee: Svend Vanderveken
>Priority: Minor
>
> This is a trivial change: at the moment ProtoWriteSupport.java is producing 
> a human-readable JSON output of the protobuf schema with the following 
> deprecated syntax:
>  
> {code:java}
> TextFormat.printToString(asProto){code}
>  
> Also, the method where this code is present performs one reflection 
> invocation to get the protobuf descriptor, which is unnecessary since the 
> context from where it's called already has this descriptor.
> => all minor and trivial stuff, just housekeeping I guess :)
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2065) parquet-cli not working in release 1.12.0

2021-07-16 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17381910#comment-17381910
 ] 

Gabor Szadovszky commented on PARQUET-2065:
---

I've checked this with 1.11.0 and it is reproducible there as well, so it is 
not a regression in 1.12.0.

The problem is that multiple parquet-cli jars are generated in target. One is a 
slim jar (parquet-cli-1.12.0.jar) and another one is a fat jar 
(parquet-cli-1.12.0-runtime.jar) that contains the avro dependency shaded. If 
all of these jars are put on the classpath (target/*), things can get mixed up. 
So, I would suggest using one specific jar from the listed ones instead of 
putting all jars from target on the classpath. The other dependency jars are 
still required.
For example:
{code}
java -cp target/parquet-cli-1.12.0.jar:target/dependency/* 
org.apache.parquet.cli.Main head 
{code}

> parquet-cli not working in release 1.12.0
> -
>
> Key: PARQUET-2065
> URL: https://issues.apache.org/jira/browse/PARQUET-2065
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Affects Versions: 1.12.0
>Reporter: Akshay Sundarraj
>Priority: Major
>
> When I run parquet-cli I get a java.lang.NoSuchMethodError.
> Steps to reproduce:
>  # Download parquet-mr 1.12.0 from 
> [https://github.com/apache/parquet-mr/archive/refs/tags/apache-parquet-1.12.0.tar.gz]
>  # Build and install using mvn clean install
>  # cd parquet-cli
>  # {{mvn dependency:copy-dependencies}}
>  # {{java -cp 'target/*:target/dependency/*' org.apache.parquet.cli.Main head 
> }}
>  # Got below exception
> WARNING: An illegal reflective access operation has occurred
>  WARNING: Illegal reflective access by 
> org.apache.hadoop.security.authentication.util.KerberosUtil 
> ([file:/home/amsundar/hgroot/parquet-mr-apache-parquet-1.12.0/parquet-cli/target/dependency/hadoop-auth-2.10.1.jar|file://home/amsundar/hgroot/parquet-mr-apache-parquet-1.12.0/parquet-cli/target/dependency/hadoop-auth-2.10.1.jar])
>  to method sun.security.krb5.Config.getInstance()
>  WARNING: Please consider reporting this to the maintainers of 
> org.apache.hadoop.security.authentication.util.KerberosUtil
>  WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
>  WARNING: All illegal access operations will be denied in a future release
>  Exception in thread "main" java.lang.NoSuchMethodError: 
> org.apache.parquet.avro.AvroSchemaConverter.convert(Lorg/apache/parquet/schema/MessageType;)Lorg/apache/avro/Schema;
>  at org.apache.parquet.cli.util.Schemas.fromParquet(Schemas.java:89)
>  at org.apache.parquet.cli.BaseCommand.getAvroSchema(BaseCommand.java:405)
>  at org.apache.parquet.cli.commands.CatCommand.run(CatCommand.java:66)
>  at org.apache.parquet.cli.Main.run(Main.java:155)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>  at org.apache.parquet.cli.Main.main(Main.java:185)
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2064) Make Range public accessible in RowRanges

2021-07-12 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17379219#comment-17379219
 ] 

Gabor Szadovszky commented on PARQUET-2064:
---

[~sha...@uber.com], sorry if I was misleading. I do agree to make the required 
classes/methods public if it makes our clients' lives easier.

> Make Range public accessible in RowRanges
> -
>
> Key: PARQUET-2064
> URL: https://issues.apache.org/jira/browse/PARQUET-2064
> Project: Parquet
>  Issue Type: New Feature
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> When rolling out to Presto, I found we need to know the boundaries of each 
> Range in RowRanges. It is still doable with the Iterator, but since Presto 
> has a batch reader, we cannot use an iterator for each row.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-2059) Tests require too much memory

2021-07-05 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-2059:
--
Summary: Tests require too much memory  (was: Tests require to much memory)

> Tests require too much memory
> -
>
> Key: PARQUET-2059
> URL: https://issues.apache.org/jira/browse/PARQUET-2059
> Project: Parquet
>  Issue Type: Test
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> For testing the solution of PARQUET-1633 we require ~3GB memory that is not 
> always available. To solve this issue we temporarily disabled the implemented 
> unit test.
> We need to ensure somehow that [this 
> test|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestLargeColumnChunk.java]
>  (and maybe some other similar ones) are executed regularly. Some options we 
> might have:
> * Execute this test separately with a maven profile. I am not sure if the CI 
> allows allocating such a large amount of memory, but with Xmx options we 
> might give it a try and create a separate check for this test only.
> * Similar to the previous with the profile but not executing in the CI ever. 
> Instead, we add some comments to the release doc so this test will be 
> executed at least once per release.
> * Configuring the CI profile to skip this test but have it in the normal 
> scenario, meaning the devs will execute it locally. There are a couple of 
> cons though. There is no guarantee that devs execute all the tests, including 
> this one. It can also cause issues if the dev doesn't have enough memory and 
> doesn't know that the test failure is not related to the current change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2062) Data masking(null) for column encryption

2021-07-05 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17374642#comment-17374642
 ] 

Gabor Szadovszky commented on PARQUET-2062:
---

If we allow the user to set a default value we can provide similar support for 
non-optional columns as well.

> Data masking(null) for column encryption 
> -
>
> Key: PARQUET-2062
> URL: https://issues.apache.org/jira/browse/PARQUET-2062
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Xinli Shang
>Priority: Major
>
> When a user doesn't have permission on a column that is encrypted by the 
> column encryption feature (PARQUET-1178), returning a masked value could 
> avoid an exception and let the call succeed.
> We would like to introduce data masking with null values. The idea is that 
> when the user gets key access denied and can accept null (via a reading 
> option flag), we would return null for the encrypted columns. This solution 
> doesn't need to save extra columns for masked values and doesn't need to 
> translate existing data.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2060) Parquet corruption can cause infinite loop with Snappy

2021-06-24 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368819#comment-17368819
 ] 

Gabor Szadovszky commented on PARQUET-2060:
---

[~mmeimaris], what do you think about simply returning the zero-length 
BytesInput object (just like in the case where the codec is null)? This way we 
would catch the error at the same place if the data stream is empty. (We should 
handle this case for uncompressed data as well.)
Are you willing to implement this in a PR? I'm happy to help/review.
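
A minimal sketch of what that guard could look like (the class, method name, 
and delegate interface below are illustrative only, not the actual CodecFactory 
code):
{code:java}
import java.io.IOException;
import org.apache.parquet.bytes.BytesInput;

// Minimal sketch; the class, method, and delegate interface are illustrative.
final class EmptyPageGuard {
  // Stand-in for the real decompressor; only here to keep the sketch self-contained.
  interface Decompressor {
    BytesInput decompress(BytesInput bytes, int uncompressedSize) throws IOException;
  }

  static BytesInput decompressOrEmpty(BytesInput compressed, Decompressor delegate,
      int uncompressedSize) throws IOException {
    // An empty compressed stream can never be inflated; return an empty BytesInput
    // (mirroring the null-codec path) instead of letting readFully() spin forever.
    if (compressed.size() == 0) {
      return BytesInput.empty();
    }
    return delegate.decompress(compressed, uncompressedSize);
  }
}
{code}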

> Parquet corruption can cause infinite loop with Snappy
> --
>
> Key: PARQUET-2060
> URL: https://issues.apache.org/jira/browse/PARQUET-2060
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Marios Meimaris
>Priority: Major
> Attachments: datapage_v2.snappy.parquet, 
> datapage_v2.snappy.parquet1383
>
>
> I am attaching a valid and a corrupt parquet file (datapageV2) that differ 
> in one byte.
> We hit an infinite loop when trying to read the corrupt file in 
> [https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnReaderBase.java#L698]
>  and specifically in the `page.getData().toInputStream()` call.  
> Stack trace of infinite loop:
> java.io.DataInputStream.readFully(DataInputStream.java:195)
>  java.io.DataInputStream.readFully(DataInputStream.java:169)
>  
> org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:287)
>  org.apache.parquet.bytes.BytesInput.toByteBuffer(BytesInput.java:237)
>  org.apache.parquet.bytes.BytesInput.toInputStream(BytesInput.java:246)
>  
> org.apache.parquet.column.impl.ColumnReaderBase.readPageV2(ColumnReaderBase.java:698)
>  
> org.apache.parquet.column.impl.ColumnReaderBase.access$400(ColumnReaderBase.java:57)
>  
> org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:628)
>  
> org.apache.parquet.column.impl.ColumnReaderBase$3.visit(ColumnReaderBase.java:620)
>  org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:192)
>  
> org.apache.parquet.column.impl.ColumnReaderBase.readPage(ColumnReaderBase.java:620)
>  
> org.apache.parquet.column.impl.ColumnReaderBase.checkRead(ColumnReaderBase.java:594)
>  
> The call to `readFully` will underneath go through 
> `NonBlockedDecompressorStream` which will always hit this path: 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/NonBlockedDecompressorStream.java#L45].
>  This will cause `setInput` to not be called on the decompressor, and the 
> subsequent calls to `decompress` will always hit this condition: 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/codec/SnappyDecompressor.java#L54].
>  Therefore, the 0 value will be returned by the read method, which will cause 
> an infinite loop in 
> [https://github.com/openjdk-mirror/jdk7u-jdk/blob/master/src/share/classes/java/io/DataInputStream.java#L198]
>  
>  This originates from the corruption, which causes the input stream of the 
> data page to be of size 0, which makes `getCompressedData` always return -1. 
> I am wondering whether this can be caught earlier so that the read fails in 
> case of such corruptions. 
> Since this happens in `BytesInput.toInputStream`, I don't think it's only 
> relevant to DataPageV2. 
>  
> In 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/CodecFactory.java#L111,]
>  if we call `bytes.toByteArray` and log its length, it is 0 in the case of 
> the corrupt file, and 6 in the case of the valid file. 
> A potential fix is to check the array size there and fail early, but I am not 
> sure if a zero-length byte array can ever be expected in the case of valid 
> files.
>  
> Attached:
> Valid file: `datapage_v2_snappy.parquet`
> Corrupt file: `datapage_v2_snappy.parquet1383`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2034) Document dictionary page position

2021-06-24 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2034.
---
Resolution: Fixed

> Document dictionary page position
> -
>
> Key: PARQUET-2034
> URL: https://issues.apache.org/jira/browse/PARQUET-2034
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Dictionary page shall be always written to the first position of the column 
> chunk. Unfortunately, we only have one statement about this "hidden" at the 
> [encodings 
> doc|https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8]:
> {quote}The dictionary page is written first, before the data pages of the 
> column chunk.{quote}
> This statement is not emphasized enough and not prepared for the potential 
> extension of the available page types. It also should be placed to a more 
> central place of the specification and also in the thrift file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-2043) Fail build for used but not declared direct dependencies

2021-06-22 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2043:
-

Assignee: Gabor Szadovszky

> Fail build for used but not declared direct dependencies
> 
>
> Key: PARQUET-2043
> URL: https://issues.apache.org/jira/browse/PARQUET-2043
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> It is always a good practice to specify all the dependencies directly used 
> (classes are imported from) by our modules. We have a couple of issues where 
> classes are imported from transitive dependencies. It makes it hard to 
> validate the actual dependency tree and also may result in using wrong 
> versions of classes (see PARQUET-2038 for example).
> It would be good to enforce to reference such dependencies directly in the 
> module poms. The [maven-dependency-plugin analyze-only 
> goal|http://maven.apache.org/plugins/maven-dependency-plugin/analyze-only-mojo.html]
>  can be used for this purpose.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-2034) Document dictionary page position

2021-06-22 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2034:
-

Assignee: Gabor Szadovszky

> Document dictionary page position
> -
>
> Key: PARQUET-2034
> URL: https://issues.apache.org/jira/browse/PARQUET-2034
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> Dictionary page shall be always written to the first position of the column 
> chunk. Unfortunately, we only have one statement about this "hidden" at the 
> [encodings 
> doc|https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8]:
> {quote}The dictionary page is written first, before the data pages of the 
> column chunk.{quote}
> This statement is not emphasized enough and not prepared for the potential 
> extension of the available page types. It also should be placed to a more 
> central place of the specification and also in the thrift file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2009) Remove deprecated modules

2021-06-22 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2009.
---
Resolution: Duplicate

> Remove deprecated modules
> -
>
> Key: PARQUET-2009
> URL: https://issues.apache.org/jira/browse/PARQUET-2009
> Project: Parquet
>  Issue Type: New Feature
>Affects Versions: 1.13.0
>Reporter: Gabor Szadovszky
>Priority: Major
>
> We have deprecated a couple of modules. They were renamed to 
> \{{*-deprecated}}. These modules shall be removed for the next release if 
> there are no objections come up in the community. (We might wait for the 
> removal until the preparation of the next release.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (PARQUET-2059) Tests require to much memory

2021-06-22 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reopened PARQUET-2059:
---
  Assignee: Gabor Szadovszky  (was: Edward Wright)

Sorry, I mixed up the two jiras somehow. Re-opening and re-assigning.

> Tests require to much memory
> 
>
> Key: PARQUET-2059
> URL: https://issues.apache.org/jira/browse/PARQUET-2059
> Project: Parquet
>  Issue Type: Test
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> For testing the solution of PARQUET-1633 we require ~3GB memory that is not 
> always available. To solve this issue we temporarily disabled the implemented 
> unit test.
> We need to ensure somehow that [this 
> test|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestLargeColumnChunk.java]
>  (and maybe some other similar ones) are executed regularly. Some options we 
> might have:
> * Execute this test separately with a maven profile. I am not sure if the CI 
> allows allocating such a large amount of memory, but with Xmx options we 
> might give it a try and create a separate check for this test only.
> * Similar to the previous with the profile but not executing in the CI ever. 
> Instead, we add some comments to the release doc so this test will be 
> executed at least once per release.
> * Configuring the CI profile to skip this test but have it in the normal 
> scenario, meaning the devs will execute it locally. There are a couple of 
> cons though. There is no guarantee that devs execute all the tests, including 
> this one. It can also cause issues if the dev doesn't have enough memory and 
> doesn't know that the test failure is not related to the current change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList

2021-06-22 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1633.
---
Resolution: Fixed

> Integer overflow in ParquetFileReader.ConsecutiveChunkList
> --
>
> Key: PARQUET-1633
> URL: https://issues.apache.org/jira/browse/PARQUET-1633
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Ivan Sadikov
>Assignee: Edward Wright
>Priority: Major
>
> When reading a large Parquet file (2.8GB), I encounter the following 
> exception:
> {code:java}
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
> at 0 in block -1 in file 
> dbfs:/user/hive/warehouse/demo.db/test_table/part-00014-tid-1888470069989036737-593c82a4-528b-4975-8de0-5bcbc5e9827d-10856-1-c000.snappy.parquet
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:40)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:228)
> ... 14 more
> Caused by: java.lang.IllegalArgumentException: Illegal Capacity: -212
> at java.util.ArrayList.<init>(ArrayList.java:157)
> at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1169){code}
>  
> The file metadata is:
>  * block 1 (3 columns)
>  ** rowCount: 110,100
>  ** totalByteSize: 348,492,072
>  ** compressedSize: 165,689,649
>  * block 2 (3 columns)
>  ** rowCount: 90,054
>  ** totalByteSize: 3,243,165,541
>  ** compressedSize: 2,509,579,966
>  * block 3 (3 columns)
>  ** rowCount: 105,119
>  ** totalByteSize: 350,901,693
>  ** compressedSize: 144,952,177
>  * block 4 (3 columns)
>  ** rowCount: 48,741
>  ** totalByteSize: 1,275,995
>  ** compressedSize: 914,205
> I don't have the code to reproduce the issue, unfortunately; however, I 
> looked at the code and it seems that integer {{length}} field in 
> ConsecutiveChunkList overflows, which results in negative capacity for array 
> list in {{readAll}} method:
> {code:java}
> int fullAllocations = length / options.getMaxAllocationSize();
> int lastAllocationSize = length % options.getMaxAllocationSize();
>   
> int numAllocations = fullAllocations + (lastAllocationSize > 0 ? 1 : 0);
> List<ByteBuffer> buffers = new ArrayList<>(numAllocations);{code}
>  
> This is caused by cast to integer in {{readNextRowGroup}} method in 
> ParquetFileReader:
> {code:java}
> currentChunks.addChunk(new ChunkDescriptor(columnDescriptor, mc, startingPos, 
> (int)mc.getTotalSize()));
> {code}
> which overflows when total size of the column is larger than 
> Integer.MAX_VALUE.
> I would appreciate it if you could help address the issue. Thanks!
>  
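
The failure mode is a plain Java int overflow: the column chunk size is a long, and casting it to int for a chunk larger than Integer.MAX_VALUE wraps to a negative number, which later surfaces as the negative ArrayList capacity above. A minimal illustration (the actual fix in parquet-mr may look different; this only demonstrates the wrap-around and a fail-fast alternative):

{code:java}
public class ChunkSizeOverflowDemo {
  public static void main(String[] args) {
    long totalSize = 2_509_579_966L;   // block 2's compressedSize from the report
    int truncated = (int) totalSize;   // silently wraps around
    System.out.println(truncated);     // prints -1785387330

    // Keeping the size as a long (or failing fast instead of wrapping) avoids
    // propagating a negative length into the allocation logic:
    int checked = Math.toIntExact(totalSize); // throws ArithmeticException here
  }
}
{code}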



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-1633) Integer overflow in ParquetFileReader.ConsecutiveChunkList

2021-06-22 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-1633:
-

Assignee: Edward Wright

> Integer overflow in ParquetFileReader.ConsecutiveChunkList
> --
>
> Key: PARQUET-1633
> URL: https://issues.apache.org/jira/browse/PARQUET-1633
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.1
>Reporter: Ivan Sadikov
>Assignee: Edward Wright
>Priority: Major
>
> When reading a large Parquet file (2.8GB), I encounter the following 
> exception:
> {code:java}
> Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value 
> at 0 in block -1 in file 
> dbfs:/user/hive/warehouse/demo.db/test_table/part-00014-tid-1888470069989036737-593c82a4-528b-4975-8de0-5bcbc5e9827d-10856-1-c000.snappy.parquet
> at 
> org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
> at 
> org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:207)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:40)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:228)
> ... 14 more
> Caused by: java.lang.IllegalArgumentException: Illegal Capacity: -212
> at java.util.ArrayList.<init>(ArrayList.java:157)
> at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1169){code}
>  
> The file metadata is:
>  * block 1 (3 columns)
>  ** rowCount: 110,100
>  ** totalByteSize: 348,492,072
>  ** compressedSize: 165,689,649
>  * block 2 (3 columns)
>  ** rowCount: 90,054
>  ** totalByteSize: 3,243,165,541
>  ** compressedSize: 2,509,579,966
>  * block 3 (3 columns)
>  ** rowCount: 105,119
>  ** totalByteSize: 350,901,693
>  ** compressedSize: 144,952,177
>  * block 4 (3 columns)
>  ** rowCount: 48,741
>  ** totalByteSize: 1,275,995
>  ** compressedSize: 914,205
> I don't have the code to reproduce the issue, unfortunately; however, I 
> looked at the code and it seems that integer {{length}} field in 
> ConsecutiveChunkList overflows, which results in negative capacity for array 
> list in {{readAll}} method:
> {code:java}
> int fullAllocations = length / options.getMaxAllocationSize();
> int lastAllocationSize = length % options.getMaxAllocationSize();
>   
> int numAllocations = fullAllocations + (lastAllocationSize > 0 ? 1 : 0);
> List<ByteBuffer> buffers = new ArrayList<>(numAllocations);{code}
>  
> This is caused by cast to integer in {{readNextRowGroup}} method in 
> ParquetFileReader:
> {code:java}
> currentChunks.addChunk(new ChunkDescriptor(columnDescriptor, mc, startingPos, 
> (int)mc.getTotalSize()));
> {code}
> which overflows when total size of the column is larger than 
> Integer.MAX_VALUE.
> I would appreciate it if you could help address the issue. Thanks!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2054) TCP connection leaking when calling appendFile()

2021-06-22 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2054.
---
Resolution: Fixed

> TCP connection leaking when calling appendFile()
> 
>
> Key: PARQUET-2054
> URL: https://issues.apache.org/jira/browse/PARQUET-2054
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Kai Jiang
>Priority: Major
>
> When appendFile() is called, the reader for the file path is opened but never 
> closed. This caused many leaked TCP connections. 
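
The essence of the fix is standard resource management around the reader opened for the appended file, e.g. with try-with-resources. The sketch below only illustrates that pattern; the real change lives inside ParquetFileWriter.appendFile(), and the row-group copying is elided:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

public class AppendWithClose {
  public static void main(String[] args) throws Exception {
    Path source = new Path("part-00000.parquet"); // placeholder path
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(source, new Configuration()))) {
      // ... copy the row groups of `reader` into the target ParquetFileWriter ...
    } // the reader (and its underlying connection) is closed even on failure
  }
}
{code}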



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2059) Tests require too much memory

2021-06-11 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2059.
---
Resolution: Fixed

> Tests require too much memory
> 
>
> Key: PARQUET-2059
> URL: https://issues.apache.org/jira/browse/PARQUET-2059
> Project: Parquet
>  Issue Type: Test
>Reporter: Gabor Szadovszky
>Assignee: Edward Wright
>Priority: Major
>
> For testing the solution of PARQUET-1633 we require ~3GB of memory, which is not 
> always available. To solve this issue we temporarily disabled the implemented 
> unit test.
> We need to ensure somehow that [this 
> test|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestLargeColumnChunk.java]
>  (and maybe some other similar ones) are executed regularly. Some options we 
> might have:
> * Execute this test separately with a maven profile. I am not sure if the CI 
> allows allocating that much memory, but with Xmx options we might give it a try 
> and create a separate check for this test only.
> * Similar to the previous option with the profile, but never executing it in the 
> CI. Instead, we add some comments to the release doc so this test will be 
> executed at least once per release.
> * Configuring the CI profile to skip this test but having it in the normal 
> scenario, meaning the devs will execute it locally. There are a couple of cons 
> though. There is no guarantee that devs execute all the tests, including this 
> one. It can also cause issues if the dev doesn't have enough memory and doesn't 
> know that the test failure is not related to the current change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-2059) Tests require too much memory

2021-06-11 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2059:
-

Assignee: Edward Wright

> Tests require too much memory
> 
>
> Key: PARQUET-2059
> URL: https://issues.apache.org/jira/browse/PARQUET-2059
> Project: Parquet
>  Issue Type: Test
>Reporter: Gabor Szadovszky
>Assignee: Edward Wright
>Priority: Major
>
> For testing the solution of PARQUET-1633 we require ~3GB of memory, which is not 
> always available. To solve this issue we temporarily disabled the implemented 
> unit test.
> We need to ensure somehow that [this 
> test|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestLargeColumnChunk.java]
>  (and maybe some other similar ones) are executed regularly. Some options we 
> might have:
> * Execute this test separately with a maven profile. I am not sure if the CI 
> allows allocating that much memory, but with Xmx options we might give it a try 
> and create a separate check for this test only.
> * Similar to the previous option with the profile, but never executing it in the 
> CI. Instead, we add some comments to the release doc so this test will be 
> executed at least once per release.
> * Configuring the CI profile to skip this test but having it in the normal 
> scenario, meaning the devs will execute it locally. There are a couple of cons 
> though. There is no guarantee that devs execute all the tests, including this 
> one. It can also cause issues if the dev doesn't have enough memory and doesn't 
> know that the test failure is not related to the current change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-2059) Tests require too much memory

2021-06-11 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-2059:
--
Description: 
For testing the solution of PARQUET-1633 we require ~3GB of memory, which is not 
always available. To solve this issue we temporarily disabled the implemented 
unit test.
We need to ensure somehow that [this 
test|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestLargeColumnChunk.java]
 (and maybe some other similar ones) are executed regularly. Some options we 
might have:
* Execute this test separately with a maven profile. I am not sure if the CI 
allows allocating that much memory, but with Xmx options we might give it a try 
and create a separate check for this test only.
* Similar to the previous option with the profile, but never executing it in the 
CI. Instead, we add some comments to the release doc so this test will be executed 
at least once per release.
* Configuring the CI profile to skip this test but having it in the normal 
scenario, meaning the devs will execute it locally. There are a couple of cons 
though. There is no guarantee that devs execute all the tests, including this 
one. It can also cause issues if the dev doesn't have enough memory and doesn't 
know that the test failure is not related to the current change.

  was:
For testing the solution of PARQUET-1633 we require ~3GB of memory, which is not 
always available. To solve this issue we temporarily disabled the implemented 
unit test.
We need to ensure somehow that this test (and maybe some other similar ones) 
are executed regularly. Some options we might have:
* Execute this test separately with a maven profile. I am not sure if the CI 
allows allocating that much memory, but with Xmx options we might give it a try 
and create a separate check for this test only.
* Similar to the previous option with the profile, but never executing it in the 
CI. Instead, we add some comments to the release doc so this test will be executed 
at least once per release.
* Configuring the CI profile to skip this test but having it in the normal 
scenario, meaning the devs will execute it locally. There are a couple of cons 
though. There is no guarantee that devs execute all the tests, including this 
one. It can also cause issues if the dev doesn't have enough memory and doesn't 
know that the test failure is not related to the current change.


> Tests require too much memory
> 
>
> Key: PARQUET-2059
> URL: https://issues.apache.org/jira/browse/PARQUET-2059
> Project: Parquet
>  Issue Type: Test
>Reporter: Gabor Szadovszky
>Priority: Major
>
> For testing the solution of PARQUET-1633 we require ~3GB of memory, which is not 
> always available. To solve this issue we temporarily disabled the implemented 
> unit test.
> We need to ensure somehow that [this 
> test|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestLargeColumnChunk.java]
>  (and maybe some other similar ones) are executed regularly. Some options we 
> might have:
> * Execute this test separately with a maven profile. I am not sure if the CI 
> allows allocating that much memory, but with Xmx options we might give it a try 
> and create a separate check for this test only.
> * Similar to the previous option with the profile, but never executing it in the 
> CI. Instead, we add some comments to the release doc so this test will be 
> executed at least once per release.
> * Configuring the CI profile to skip this test but having it in the normal 
> scenario, meaning the devs will execute it locally. There are a couple of cons 
> though. There is no guarantee that devs execute all the tests, including this 
> one. It can also cause issues if the dev doesn't have enough memory and doesn't 
> know that the test failure is not related to the current change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2059) Tests require too much memory

2021-06-10 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-2059:
-

 Summary: Tests require too much memory
 Key: PARQUET-2059
 URL: https://issues.apache.org/jira/browse/PARQUET-2059
 Project: Parquet
  Issue Type: Test
Reporter: Gabor Szadovszky


For testing the solution of PARQUET-1633 we require ~3GB of memory, which is not 
always available. To solve this issue we temporarily disabled the implemented 
unit test.
We need to ensure somehow that this test (and maybe some other similar ones) 
are executed regularly. Some options we might have:
* Execute this test separately with a maven profile. I am not sure if the CI 
allows allocating that much memory, but with Xmx options we might give it a try 
and create a separate check for this test only.
* Similar to the previous option with the profile, but never executing it in the 
CI. Instead, we add some comments to the release doc so this test will be executed 
at least once per release.
* Configuring the CI profile to skip this test but having it in the normal 
scenario, meaning the devs will execute it locally. There are a couple of cons 
though. There is no guarantee that devs execute all the tests, including this 
one. It can also cause issues if the dev doesn't have enough memory and doesn't 
know that the test failure is not related to the current change.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2057) Upgrade ZSTD-JNI to 1.5.0-1

2021-06-10 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2057.
---
Resolution: Fixed

> Upgrade ZSTD-JNI to 1.5.0-1
> ---
>
> Key: PARQUET-2057
> URL: https://issues.apache.org/jira/browse/PARQUET-2057
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: David Christle
>Assignee: David Christle
>Priority: Major
>
> This issue tracks upgrading the zstd-jni dependency to version 1.5.0-1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-2057) Upgrade ZSTD-JNI to 1.5.0-1

2021-06-10 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2057:
-

Assignee: David Christle

> Upgrade ZSTD-JNI to 1.5.0-1
> ---
>
> Key: PARQUET-2057
> URL: https://issues.apache.org/jira/browse/PARQUET-2057
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: David Christle
>Assignee: David Christle
>Priority: Major
>
> This issue tracks upgrading the zstd-jni dependency to version 1.5.0-1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2058) Parquet-tools is affected by multiple CVEs

2021-06-09 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17359819#comment-17359819
 ] 

Gabor Szadovszky commented on PARQUET-2058:
---

Since parquet-tools is deprecated in 1.12.0 and already removed in master, I 
don't think it makes sense to work on this. (I don't think it is worth doing 
patch releases to address these issues.)
Could you please check whether parquet-cli fits your needs and whether it has 
any vulnerabilities that need to be fixed?

> Parquet-tools is affected by multiple CVEs
> --
>
> Key: PARQUET-2058
> URL: https://issues.apache.org/jira/browse/PARQUET-2058
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0, 1.10.1, 1.11.1
>Reporter: Tony Liu
>Priority: Blocker
>  Labels: security
>
> The parquet-tools library is affected by multiple CVEs.
>  
> |CVE-2018-10237|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-10237|Unbounded
>  memory allocation in Google Guava 11.0 through 24.x before 24.1.1 allows 
> remote attackers to conduct denial of service attacks against servers that 
> depend on this library and deserialize attacker-provided data, because the 
> AtomicDoubleArray class (when serialized with Java serialization) and the 
> CompoundOrdering class (when serialized with GWT serialization) perform eager 
> allocation without appropriate checks on what a client has sent and whether 
> the data size is reasonable.|
> |CVE-2020-8908|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2020-8908|A
>  temp directory creation vulnerability exists in all versions of Guava, 
> allowing an attacker with access to the machine to potentially access data in 
> a temporary directory created by the Guava API 
> com.google.common.io.Files.createTempDir(). By default, on unix-like systems, 
> the created directory is world-readable (readable by an attacker with access 
> to the system). The method in question has been marked @Deprecated in 
> versions 30.0 and later and should not be used. For Android developers, we 
> recommend choosing a temporary directory API provided by Android, such as 
> context.getCacheDir(). For other Java developers, we recommend migrating to 
> the Java 7 API java.nio.file.Files.createTempDirectory() which explicitly 
> configures permissions of 700, or configuring the Java runtime's 
> java.io.tmpdir system property to point to a location whose permissions are 
> appropriately configured.|
> |CVE-2019-17571|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2019-17571|Included
>  in Log4j 1.2 is a SocketServer class that is vulnerable to deserialization 
> of untrusted data which can be exploited to remotely execute arbitrary code 
> when combined with a deserialization gadget when listening to untrusted 
> network traffic for log data. This affects Log4j versions up to 1.2 up to 
> 1.2.17.|
> |CVE-2020-9488|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2020-9488|Improper
>  validation of certificate with host mismatch in Apache Log4j SMTP appender. 
> This could allow an SMTPS connection to be intercepted by a man-in-the-middle 
> attack which could leak any log messages sent through that appender.|
>  
>  
> Is it possible to upgrade the POM files to reference the latest version of 
> log4j and guava library?
>  
> Thanks
> Tony
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (PARQUET-2058) Parquet-tools is affected by multiple CVEs

2021-06-09 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky updated PARQUET-2058:
--
Component/s: (was: parquet-format)
 parquet-mr

> Parquet-tools is affected by multiple CVEs
> --
>
> Key: PARQUET-2058
> URL: https://issues.apache.org/jira/browse/PARQUET-2058
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.10.0, 1.10.1, 1.11.1
>Reporter: Tony Liu
>Priority: Blocker
>  Labels: security
>
> The parquet-tools library is affected by multiple CVEs.
>  
> |CVE-2018-10237|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2018-10237|Unbounded
>  memory allocation in Google Guava 11.0 through 24.x before 24.1.1 allows 
> remote attackers to conduct denial of service attacks against servers that 
> depend on this library and deserialize attacker-provided data, because the 
> AtomicDoubleArray class (when serialized with Java serialization) and the 
> CompoundOrdering class (when serialized with GWT serialization) perform eager 
> allocation without appropriate checks on what a client has sent and whether 
> the data size is reasonable.|
> |CVE-2020-8908|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2020-8908|A
>  temp directory creation vulnerability exists in all versions of Guava, 
> allowing an attacker with access to the machine to potentially access data in 
> a temporary directory created by the Guava API 
> com.google.common.io.Files.createTempDir(). By default, on unix-like systems, 
> the created directory is world-readable (readable by an attacker with access 
> to the system). The method in question has been marked @Deprecated in 
> versions 30.0 and later and should not be used. For Android developers, we 
> recommend choosing a temporary directory API provided by Android, such as 
> context.getCacheDir(). For other Java developers, we recommend migrating to 
> the Java 7 API java.nio.file.Files.createTempDirectory() which explicitly 
> configures permissions of 700, or configuring the Java runtime's 
> java.io.tmpdir system property to point to a location whose permissions are 
> appropriately configured.|
> |CVE-2019-17571|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2019-17571|Included
>  in Log4j 1.2 is a SocketServer class that is vulnerable to deserialization 
> of untrusted data which can be exploited to remotely execute arbitrary code 
> when combined with a deserialization gadget when listening to untrusted 
> network traffic for log data. This affects Log4j versions up to 1.2 up to 
> 1.2.17.|
> |CVE-2020-9488|https://web.nvd.nist.gov/view/vuln/detail?vulnId=CVE-2020-9488|Improper
>  validation of certificate with host mismatch in Apache Log4j SMTP appender. 
> This could allow an SMTPS connection to be intercepted by a man-in-the-middle 
> attack which could leak any log messages sent through that appender.|
>  
>  
> Is it possible to upgrade the POM files to reference the latest version of 
> log4j and guava library?
>  
> Thanks
> Tony
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2055) Schema mismatch for reading Avro from parquet file with old schema version?

2021-06-07 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358478#comment-17358478
 ] 

Gabor Szadovszky commented on PARQUET-2055:
---

[~philipwilcox], There is no schema evolution for Parquet, neither in the spec 
nor in the Java implementation (parquet-mr). The support for projection is not 
really a schema evolution feature but a practical property of column-oriented 
formats: they can skip columns without any effort (unlike row-oriented formats).
Since bindings like parquet-avro are about "simply" converting the schemas 
and the data if required, I am not sure how easy it would be to support Avro 
schema evolution in Parquet. Since PARQUET-465 is related and has not been solved 
for more than 5 years, I would be skeptical about leaving such a feature request 
to the community. So, if you can invest in implementing it, it would be very welcome.

> Schema mismatch for reading Avro from parquet file with old schema version?
> ---
>
> Key: PARQUET-2055
> URL: https://issues.apache.org/jira/browse/PARQUET-2055
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.11.0
> Environment: Linux, Apache Beam 2.28.0, Java 11
>Reporter: Philip Wilcox
>Priority: Minor
>
> I ran into what looks like a bug in the Parquet Avro reading code, around 
> trying to read a file written with a previous version of a schema with a new, 
> evolved version of the schema.
> I'm using Apache Beam's ParquetIO library, which supports passing in schemas 
> to use for "projection" and I was investigating if that would work for me 
> here. However, it didn't work, complaining that my new reader schema had a 
> field that wasn't in the writer schema.
>  
> I traced this through to a couple places in the parquet-avro code that don't 
> look right to me:
>  
> First, in `prepareForRead` here: 
> [https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java#L116]
> The `parquetSchema` var comes from `parquetSchema = 
> readContext.getRequestedSchema();` while the `avroSchema` var comes from the 
> parquet file itself with `avroSchema = new 
> Schema.Parser().parse(keyValueMetaData.get(AVRO_SCHEMA_METADATA_KEY));`
> I can verify that `parquetSchema` is the schema I'm requesting it be 
> projected to and that `avroSchema` is the schema from the file, but the 
> naming looks backward, shouldn't `parquetSchema` be the one from the parquet 
> file?
> Following the stack down, I was hitting this line: 
> https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroIndexedRecordConverter.java#L91
> here it was failing because the `avroSchema` didn't have a field that was in 
> the `parquetSchema`, with the variables assigned in the same way as above. 
> That's the case I was hoping to use this projection for, though - to get the 
> record read with the new reader schema, using the default value from the new 
> schema for the new field. In fact, the comment on line 101 "store defaults 
> for any new Avro fields from avroSchema that are not in the writer schema 
> (parquetSchema)" suggests that the intent was for this to work, but the 
> actual code has the writer schema in avroSchema and the reader schema in 
> parquetSchema.
> (Additionally, I'd want this to support schema evolution both for adding an 
> optional field and also removing an old field - so just flipping the names 
> around would result in this still breaking if the reader schema dropped a 
> field from the writer schema...)
> Looking to understand if I'm interpreting this correctly, or if there's 
> another path that's intended to be used.
> Thank you!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-2054) TCP connection leaking when calling appendFile()

2021-06-07 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2054:
-

Assignee: Kai Jiang

> TCP connection leaking when calling appendFile()
> 
>
> Key: PARQUET-2054
> URL: https://issues.apache.org/jira/browse/PARQUET-2054
> Project: Parquet
>  Issue Type: New Feature
>  Components: parquet-mr
>Reporter: Xinli Shang
>Assignee: Kai Jiang
>Priority: Major
>
> When appendFile() is called, the reader for the file path is opened but never 
> closed. This caused many leaked TCP connections. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2055) Schema mismatch for reading Avro from parquet file with old schema version?

2021-06-04 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17357419#comment-17357419
 ] 

Gabor Szadovszky commented on PARQUET-2055:
---

[~philipwilcox], sorry if I was misleading. I wanted to say that parquet-mr 
only supports projection, so you cannot use other "schema evolution 
capabilities". However, I do not have too much experience in parquet-avro and 
found a discussion in another jira: PARQUET-465. It's quite old but you may 
check if Ryan's answers help in your case.

> Schema mismatch for reading Avro from parquet file with old schema version?
> ---
>
> Key: PARQUET-2055
> URL: https://issues.apache.org/jira/browse/PARQUET-2055
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.11.0
> Environment: Linux, Apache Beam 2.28.0, Java 11
>Reporter: Philip Wilcox
>Priority: Minor
>
> I ran into what looks like a bug in the Parquet Avro reading code, around 
> trying to read a file written with a previous version of a schema with a new, 
> evolved version of the schema.
> I'm using Apache Beam's ParquetIO library, which supports passing in schemas 
> to use for "projection" and I was investigating if that would work for me 
> here. However, it didn't work, complaining that my new reader schema had a 
> field that wasn't in the writer schema.
>  
> I traced this through to a couple places in the parquet-avro code that don't 
> look right to me:
>  
> First, in `prepareForRead` here: 
> [https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java#L116]
> The `parquetSchema` var comes from `parquetSchema = 
> readContext.getRequestedSchema();` while the `avroSchema` var comes from the 
> parquet file itself with `avroSchema = new 
> Schema.Parser().parse(keyValueMetaData.get(AVRO_SCHEMA_METADATA_KEY));`
> I can verify that `parquetSchema` is the schema I'm requesting it be 
> projected to and that `avroSchema` is the schema from the file, but the 
> naming looks backward, shouldn't `parquetSchema` be the one from the parquet 
> file?
> Following the stack down, I was hitting this line: 
> https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroIndexedRecordConverter.java#L91
> here it was failing because the `avroSchema` didn't have a field that was in 
> the `parquetSchema`, with the variables assigned in the same way as above. 
> That's the case I was hoping to use this projection for, though - to get the 
> record read with the new reader schema, using the default value from the new 
> schema for the new field. In fact, the comment on line 101 "store defaults 
> for any new Avro fields from avroSchema that are not in the writer schema 
> (parquetSchema)" suggests that the intent was for this to work, but the 
> actual code has the writer schema in avroSchema and the reader schema in 
> parquetSchema.
> (Additionally, I'd want this to support schema evolution both for adding an 
> optional field and also removing an old field - so just flipping the names 
> around would result in this still breaking if the reader schema dropped a 
> field from the writer schema...)
> Looking to understand if I'm interpreting this correctly, or if there's 
> another path that's intended to be used.
> Thank you!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2055) Schema mismatch for reading Avro from parquet file with old schema version?

2021-06-04 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17357121#comment-17357121
 ] 

Gabor Szadovszky commented on PARQUET-2055:
---

[~philipwilcox], I think the main misunderstanding comes from the fact that 
Avro supports schema evolution while Parquet does not. What happens in the 
parquet-avro binding is basically the conversion of the schema (if required, 
because the Avro schema is neither set nor saved in the file) and the conversion 
of the values. You may set the projection Avro schema via the hadoop conf 
{{parquet.avro.projection}}, but it cannot support Avro schema evolution since 
Parquet has no such concepts (like default values). So currently you are not 
able to read columns that are not in the file (by using default values or 
nulls). The file schema has to contain the projection schema.
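
A sketch of the projection mechanism mentioned above (the record and field names are made up; every projected field must already exist in the file schema):

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.avro.AvroReadSupport;

public class ProjectionExample {
  public static void main(String[] args) {
    // Only fields present in the file schema may appear here; parquet-avro
    // does not fill missing fields from Avro default values (see above).
    Schema projection = SchemaBuilder.record("Event").fields()
        .requiredString("id")
        .endRecord();

    Configuration conf = new Configuration();
    AvroReadSupport.setRequestedProjection(conf, projection);
    // equivalently, the raw property quoted above:
    // conf.set("parquet.avro.projection", projection.toString());
  }
}
{code}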

> Schema mismatch for reading Avro from parquet file with old schema version?
> ---
>
> Key: PARQUET-2055
> URL: https://issues.apache.org/jira/browse/PARQUET-2055
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.11.0
> Environment: Linux, Apache Beam 2.28.0, Java 11
>Reporter: Philip Wilcox
>Priority: Minor
>
> I ran into what looks like a bug in the Parquet Avro reading code, around 
> trying to read a file written with a previous version of a schema with a new, 
> evolved version of the schema.
> I'm using Apache Beam's ParquetIO library, which supports passing in schemas 
> to use for "projection" and I was investigating if that would work for me 
> here. However, it didn't work, complaining that my new reader schema had a 
> field that wasn't in the writer schema.
>  
> I traced this through to a couple places in the parquet-avro code that don't 
> look right to me:
>  
> First, in `prepareForRead` here: 
> [https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroReadSupport.java#L116]
> The `parquetSchema` var comes from `parquetSchema = 
> readContext.getRequestedSchema();` while the `avroSchema` var comes from the 
> parquet file itself with `avroSchema = new 
> Schema.Parser().parse(keyValueMetaData.get(AVRO_SCHEMA_METADATA_KEY));`
> I can verify that `parquetSchema` is the schema I'm requesting it be 
> projected to and that `avroSchema` is the schema from the file, but the 
> naming looks backward, shouldn't `parquetSchema` be the one from the parquet 
> file?
> Following the stack down, I was hitting this line: 
> https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroIndexedRecordConverter.java#L91
> here it was failing because the `avroSchema` didn't have a field that was in 
> the `parquetSchema`, with the variables assigned in the same way as above. 
> That's the case I was hoping to use this projection for, though - to get the 
> record read with the new reader schema, using the default value from the new 
> schema for the new field. In fact, the comment on line 101 "store defaults 
> for any new Avro fields from avroSchema that are not in the writer schema 
> (parquetSchema)" suggests that the intent was for this to work, but the 
> actual code has the writer schema in avroSchema and the reader schema in 
> parquetSchema.
> (Additionally, I'd want this to support schema evolution both for adding an 
> optional field and also removing an old field - so just flipping the names 
> around would result in this still breaking if the reader schema dropped a 
> field from the writer schema...)
> Looking to understand if I'm interpreting this correctly, or if there's 
> another path that's intended to be used.
> Thank you!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2052) Integer overflow when writing huge binary using dictionary encoding

2021-05-26 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2052.
---
Resolution: Fixed

> Integer overflow when writing huge binary using dictionary encoding
> ---
>
> Key: PARQUET-2052
> URL: https://issues.apache.org/jira/browse/PARQUET-2052
> Project: Parquet
>  Issue Type: Bug
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> To check whether it should fall back to plain encoding, 
> {{DictionaryValuesWriter}} currently uses two variables: 
> {{dictionaryByteSize}} and {{maxDictionaryByteSize}}, both of which are 
> integers. This causes an issue when one first writes a relatively small 
> binary within the threshold and then writes a huge string, which makes 
> {{dictionaryByteSize}} overflow and become negative.
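
A minimal illustration of the wrap-around (variable names follow the description; the real accounting lives in DictionaryValuesWriter and its fallback check):

{code:java}
public class DictionarySizeOverflowDemo {
  public static void main(String[] args) {
    int maxDictionaryByteSize = 1024 * 1024;       // e.g. a 1MB threshold
    int dictionaryByteSize = 100;                  // a small binary already stored
    int hugeBinaryLength = Integer.MAX_VALUE - 50; // a huge value being added

    dictionaryByteSize += hugeBinaryLength;        // int overflow: becomes negative
    // a "fall back to plain?" style check now stays false:
    System.out.println(dictionaryByteSize > maxDictionaryByteSize); // false

    // accounting with a long avoids the wrap-around:
    long safeSize = 100L + hugeBinaryLength;
    System.out.println(safeSize > maxDictionaryByteSize);           // true
  }
}
{code}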



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2050) Expose repetition & definition level from ColumnIO

2021-05-19 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2050.
---
Resolution: Fixed

> Expose repetition & definition level from ColumnIO
> --
>
> Key: PARQUET-2050
> URL: https://issues.apache.org/jira/browse/PARQUET-2050
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
>
> {{ColumnIO}} is pretty useful for obtaining repetition and definition level 
> info for constructing nested records (the {{ColumnDescriptor}} only exposes 
> the info for leaf nodes). However, currently {{getDefinitionLevel}} and 
> {{getRepetitionLevel}} are both package-private, and other applications that 
> depend on Parquet have to find workarounds for this (e.g., [ColumnIOUtil used by 
> Presto|https://github.com/prestodb/presto-hive-apache/blob/master/src/main/java/org/apache/parquet/io/ColumnIOUtil.java]).
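
A sketch of the usage this change enables, assuming the two accessors become public on {{ColumnIO}} as proposed (the schema string is made up):

{code:java}
import java.util.Arrays;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.PrimitiveColumnIO;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class ColumnLevelsExample {
  public static void main(String[] args) {
    MessageType schema = MessageTypeParser.parseMessageType(
        "message m { optional group tags { repeated binary value (UTF8); } }");
    MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
    for (PrimitiveColumnIO leaf : columnIO.getLeaves()) {
      // getRepetitionLevel()/getDefinitionLevel() were package-private before
      System.out.println(Arrays.toString(leaf.getColumnDescriptor().getPath())
          + " r=" + leaf.getRepetitionLevel()
          + " d=" + leaf.getDefinitionLevel());
    }
  }
}
{code}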



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1761) Lower Logging Level in ParquetOutputFormat

2021-05-18 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1761.
---
Resolution: Fixed

> Lower Logging Level in ParquetOutputFormat
> --
>
> Key: PARQUET-1761
> URL: https://issues.apache.org/jira/browse/PARQUET-1761
> Project: Parquet
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-2051) AvroWriteSupport does not pass Configuration to AvroSchemaConverter on Creation

2021-05-17 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2051:
-

Assignee: Andreas Hailu

> AvroWriteSupport does not pass Configuration to AvroSchemaConverter on 
> Creation
> ---
>
> Key: PARQUET-2051
> URL: https://issues.apache.org/jira/browse/PARQUET-2051
> Project: Parquet
>  Issue Type: Bug
>Reporter: Andreas Hailu
>Assignee: Andreas Hailu
>Priority: Major
>
> Because of this, we're unable to fully leverage the ThreeLevelListWriter 
> functionality when trying to write Avro lists out using Parquet through the 
> AvroParquetOutputFormat.
> The following record is used for testing:
>  Schema:
> { "type": "record", "name": "NullLists", "namespace": "com.test", "fields": [ 
> \{ "name": "KeyID", "type": "string" }, \{ "name": "NullableList", "type": [ 
> "null", { "type": "array", "items": [ "null", "string" ] } ], "default": null 
> } ] }
> Record (using basic JSON just for display purposes):
> { "KeyID": "0", "NullableList": [ "foo", null, "baz" ] }
> During testing, we see the following exception:
> {quote}{{Caused by: java.lang.ClassCastException: repeated binary array 
> (STRING) is not a group}}
>  \{{ at org.apache.parquet.schema.Type.asGroupType(Type.java:250)}}
>  \{{ at 
> org.apache.parquet.avro.AvroWriteSupport$ThreeLevelListWriter.writeCollection(AvroWriteSupport.java:612)}}
>  \{{ at 
> org.apache.parquet.avro.AvroWriteSupport$ListWriter.writeList(AvroWriteSupport.java:397)}}
>  \{{ at 
> org.apache.parquet.avro.AvroWriteSupport.writeValueWithoutConversion(AvroWriteSupport.java:355)}}
>  \{{ at 
> org.apache.parquet.avro.AvroWriteSupport.writeValue(AvroWriteSupport.java:278)}}
>  \{{ at 
> org.apache.parquet.avro.AvroWriteSupport.writeRecordFields(AvroWriteSupport.java:191)}}
>  \{{ at 
> org.apache.parquet.avro.AvroWriteSupport.write(AvroWriteSupport.java:165)}}
>  \{{ at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:128}}
> {quote}
> Upon review, it was found that the configuration option that was set in 
> AvroWriteSupport for the ThreeLevelListWriter, 
> parquet.avro.write-old-list-structure being set to false, was never shared 
> with the AvroSchemaConverter.
> Once we made this change and tested locally, we observed the record with nulls 
> in the array being successfully written by AvroParquetOutputFormat. 
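
For reference, the caller-side configuration discussed above looks roughly like the sketch below (the output path is a placeholder and the schema is the NullLists schema from the description; the property name is quoted from the report):

{code:java}
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class ThreeLevelListWrite {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"NullLists\",\"namespace\":\"com.test\",\"fields\":["
            + "{\"name\":\"KeyID\",\"type\":\"string\"},"
            + "{\"name\":\"NullableList\",\"type\":[\"null\","
            + "{\"type\":\"array\",\"items\":[\"null\",\"string\"]}],\"default\":null}]}");

    Configuration conf = new Configuration();
    // false enables the three-level list layout discussed above
    conf.setBoolean("parquet.avro.write-old-list-structure", false);

    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("/tmp/null-lists.parquet"))
        .withSchema(schema)
        .withConf(conf)
        .build()) {
      // writer.write(record); // a record with nulls inside NullableList
    }
  }
}
{code}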



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2046) Upgrade Apache POM to 23

2021-05-17 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2046.
---
Resolution: Fixed

> Upgrade Apache POM to 23
> 
>
> Key: PARQUET-2046
> URL: https://issues.apache.org/jira/browse/PARQUET-2046
> Project: Parquet
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2048) Deprecate BaseRecordReader

2021-05-17 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2048.
---
Resolution: Fixed

> Deprecate BaseRecordReader
> --
>
> Key: PARQUET-2048
> URL: https://issues.apache.org/jira/browse/PARQUET-2048
> Project: Parquet
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Minor
>
> No longer used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-1922) Deprecate IOExceptionUtils

2021-05-14 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-1922.
---
Resolution: Fixed

> Deprecate IOExceptionUtils
> --
>
> Key: PARQUET-1922
> URL: https://issues.apache.org/jira/browse/PARQUET-1922
> Project: Parquet
>  Issue Type: Improvement
>Reporter: David Mollitor
>Assignee: David Mollitor
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2037) Write INT96 with parquet-avro

2021-05-12 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2037.
---
Resolution: Fixed

> Write INT96 with parquet-avro
> -
>
> Key: PARQUET-2037
> URL: https://issues.apache.org/jira/browse/PARQUET-2037
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro, parquet-mr
>Reporter: Gabor Szadovszky
>Assignee: Gabor Szadovszky
>Priority: Major
>
> This jira is about the write path of PARQUET-1928. 
> The issue here is how to identify an Avro FIXED type that was an INT96 
> before. Of course, this feature would be behind a configuration flag 
> similarly to PARQUET-1928. But even with this flag it is not obvious to 
> differentiate a "simple" FIXED[12] byte from one that was an INT96 before.
> Two options to solve this issue:
> * Write in the doc field of the Avro schema that the FIXED value was an INT96.
> * Instead of implementing a configuration flag, let the user specify the names 
> of the columns to be converted to INT96 via the configuration.
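
From the user's side, the second option could look something like the snippet below; the property name used here is purely hypothetical and is not an existing parquet-avro setting:

{code:java}
import org.apache.hadoop.conf.Configuration;

public class Int96WriteConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // HYPOTHETICAL property name, only to illustrate the second option above
    conf.set("parquet.avro.write-fixed-as-int96", "event_time,ingest_time");
  }
}
{code}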



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (PARQUET-2044) Enable ZSTD buffer pool by default

2021-05-10 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky reassigned PARQUET-2044:
-

Assignee: Dongjoon Hyun

> Enable ZSTD buffer pool by default
> --
>
> Key: PARQUET-2044
> URL: https://issues.apache.org/jira/browse/PARQUET-2044
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2038) Upgrade Jackson version used in parquet encryption

2021-05-04 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2038.
---
Resolution: Fixed

> Upgrade Jackson version used in parquet encryption
> --
>
> Key: PARQUET-2038
> URL: https://issues.apache.org/jira/browse/PARQUET-2038
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Maya Anderson
>Assignee: Maya Anderson
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2043) Fail build for used but not declared direct dependencies

2021-05-04 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-2043:
-

 Summary: Fail build for used but not declared direct dependencies
 Key: PARQUET-2043
 URL: https://issues.apache.org/jira/browse/PARQUET-2043
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-mr
Reporter: Gabor Szadovszky


It is always good practice to specify all the dependencies our modules directly 
use (i.e. import classes from). We have a couple of issues where classes are 
imported from transitive dependencies. This makes it hard to validate the 
actual dependency tree and may also result in using wrong versions of classes 
(see PARQUET-2038 for example).

It would be good to enforce to reference such dependencies directly in the 
module poms. The [maven-dependency-plugin analyze-only 
goal|http://maven.apache.org/plugins/maven-dependency-plugin/analyze-only-mojo.html]
 can be used for this purpose.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2039) AvroReadSupport.setRequestedProjection in 1.11.1+ not backwards compatible with MAPS

2021-04-29 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17335226#comment-17335226
 ] 

Gabor Szadovszky commented on PARQUET-2039:
---

I am a bit confused about the version numbers. Are you comparing 1.11.1 and 
1.10.0 or 1.11.1 and 1.11.0? Anyway, PARQUET-1879 seems to be related. It is 
around the same area and got into 1.11.1.
You are invoking {{setRequestedProjection}}. Could you please list the exact 
avro schema you are using for the projection?

[~maccamlc], since your modification potentially caused this regression(?) you 
may want to check this out.

> AvroReadSupport.setRequestedProjection in 1.11.1+ not backwards compatible 
> with MAPS
> 
>
> Key: PARQUET-2039
> URL: https://issues.apache.org/jira/browse/PARQUET-2039
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro, parquet-mr
>Affects Versions: 1.11.0, 1.11.1
>Reporter: Bil Bingham
>Priority: Minor
> Attachments: PQMapTest.java
>
>
>  
> using AvroReadSupport.setRequestedProjection in 1.11.1 reading a 1.10.0 
> generated parquet file sets MAP fields to null (and vice versa 
> 1.10.0 reader against a 1.11.1 generated file) 
> - Not using a projected schema works, the map fields are converted correctly.
> In my case. Parquet file is generated by hive 
> {code:java}
> CREATE TABLE parquetmaptest (
>  a string,
>  b MAP<string,string>
> )
> STORED AS PARQUET 
> tblproperties(
>  "parquet.compression"="SNAPPY"
> );
> insert into parquetmaptest select "a",map("k1","v1","k2","v2");{code}
> Using parquet-avro 1.11.1 (and appropriate dependencies) results in field "b" 
> being null. 
> {code:java}
> data:null
> row:{"a": "a", "b": null}{code}
> Using parquet-avro 1.11.0 (and appropriate dependencies) results in field "b" 
> being  the right map value.  
> {code:java}
> data:{k1=v1, k2=v2}
> row:{"a": "a", "b": {"k1": "v1", "k2": "v2"}}{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2026) Allow empty row in parquet file

2021-04-27 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1779#comment-1779
 ] 

Gabor Szadovszky commented on PARQUET-2026:
---

[~vitalii], Based on the discussions at the recent Parquet sync meeting, the 
community is not against allowing the creation of empty parquet files. 
Meanwhile, we do not have the bandwidth to invest in this feature. 
Feel free to contribute and I am happy to help/review.

> Allow empty row in parquet file
> ---
>
> Key: PARQUET-2026
> URL: https://issues.apache.org/jira/browse/PARQUET-2026
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Vitalii Diravka
>Priority: Major
>  Labels: Drill, empty-file
> Fix For: 1.13.0
>
> Attachments: Screenshot from 2021-04-13 08-52-56.png
>
>
> PARQUET-1851 stopped allowing parquet files with a schema (meta 
> information) but 0 rows, aka empty files, to be written.
> As a result it prevents storing empty tables in DRILL as parquet files, 
> for example:
> {code:java}
> CREATE TABLE dfs.tmp.%s AS SELECT * FROM cp.`employee.json` WHERE 1=0{code}
> {code:java}
> CREATE TABLE dfs.tmp.%s AS select * from 
> dfs.`parquet/alltypes_required.parquet` where `col_int` = 0{code}
> {code:java}
> create table dfs.tmp.%s as select * from 
> dfs.`parquet/empty/complex/empty_complex.parquet`{code}
> So PARQUET-1851 breaks the following test cases:
> {code:java}
> TestUntypedNull.testParquetTableCreation   
> TestParquetWriterEmptyFiles.testComplexEmptyFileSchema   
> TestParquetWriterEmptyFiles.testWriteEmptyFile   
> TestParquetWriterEmptyFiles.testWriteEmptyFileWithSchema   
> TestParquetWriterEmptyFiles.testWriteEmptySchemaChange 
> TestMetastoreCommands.testAnalyzeEmptyRequiredParquetTable  
> TestMetastoreCommands.testSelectEmptyRequiredParquetTable{code}
> I suggest emitting a warning in the process of creating empty parquet files, or 
> creating an alternative _endBlock_ for backward compatibility with other tools:
> !Screenshot from 2021-04-13 08-52-56.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (PARQUET-2037) Write INT96 with parquet-avro

2021-04-27 Thread Gabor Szadovszky (Jira)
Gabor Szadovszky created PARQUET-2037:
-

 Summary: Write INT96 with parquet-avro
 Key: PARQUET-2037
 URL: https://issues.apache.org/jira/browse/PARQUET-2037
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-avro, parquet-mr
Reporter: Gabor Szadovszky
Assignee: Gabor Szadovszky


This jira is about the write path of PARQUET-1928. 

The issue here is how to identify an Avro FIXED type that was an INT96 before. 
Of course, this feature would be behind a configuration flag similarly to 
PARQUET-1928. But even with this flag it is not obvious to differentiate a 
"simple" FIXED[12] byte from one that was an INT96 before.

Two options to solve this issue:
* Write in the doc field of the Avro schema that the FIXED value was an INT96.
* Instead of implementing a configuration flag, let the user specify the names 
of the columns to be converted to INT96 via the configuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2035) Java module import error due to shaded package shaded.parquet.it.unimi.dsi.fastutil

2021-04-27 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333154#comment-17333154
 ] 

Gabor Szadovszky commented on PARQUET-2035:
---

Got it, thanks for explaining. I guess we don't have and don't need such 
constructs that would break the boundaries in the case of fastutil. Meanwhile, I 
am not sure we can add proper checks to avoid them.
However, it would be great if we could implement some tests that would ensure 
parquet-mr works fine in a modularized Java environment. Do you have any idea 
how to implement such a thing?

> Java module import error due to shaded package 
> shaded.parquet.it.unimi.dsi.fastutil
> ---
>
> Key: PARQUET-2035
> URL: https://issues.apache.org/jira/browse/PARQUET-2035
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.12.0, 1.11.1
>Reporter: Maxim Kolesnikov
>Priority: Major
> Attachments: parquet-example.zip
>
>
> *Description:*
> Due to collision of shaded packages 
> {code:java}
> shaded.parquet.it.unimi.dsi.fastutil{code}
> in 
> {code:java}
> org.apache.parquet:parquet-avro{code}
> and 
> {code:java}
> org.apache.parquet:parquet-column{code}
> it is not possible to use both these dependencies within a modularized java 
> project at the same time.
>  
>  
> *How to reproduce:*
>  * create a maven project with dependency 
> org.apache.parquet:parquet-avro:1.11.1
>  * declare java module that requires both parquet.avro and parquet.column
>  * run
> {code:java}
> mvn compile{code}
>  
> *Expected behaviour:*
> Project should compile without errors.
>  
> *Actual behaviour:*
> Project fails with compilation errors:
>  
> {code:java}
> [ERROR] the unnamed module reads package shaded.parquet.it.unimi.dsi.fastutil 
> from both parquet.column and parquet.avro
> ...{code}
>  
>  
> *Reproducible example* (same code as in the attached zip file): 
> https://github.com/xCASx/parquet-example
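
A minimal module descriptor matching the reproduction steps above (the module name is made up; the automatic module names parquet.avro and parquet.column are taken from the compiler error quoted in the description):

{code:java}
// src/main/java/module-info.java
module example.parquetuser {
  // requiring both automatic modules triggers the error above, because both
  // jars contain the shaded.parquet.it.unimi.dsi.fastutil package
  requires parquet.avro;
  requires parquet.column;
}
{code}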



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2035) Java module import error due to shaded package shaded.parquet.it.unimi.dsi.fastutil

2021-04-27 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333092#comment-17333092
 ] 

Gabor Szadovszky commented on PARQUET-2035:
---

The issue that [~fokko] reported was solved by PARQUET-1853. The fix was simply 
adding `true` to the pom. The issue with the central 
shading is that this option would not be available, so we would re-introduce the 
problem. It is independent of whether we are using one dependency module or 
separate ones. Maybe option 1 would be better for fastutil. It is even easier to 
implement.

Could you please try out one of the solutions in your environment so we know 
this is the only issue we have with java modules?

> Java module import error due to shaded package 
> shaded.parquet.it.unimi.dsi.fastutil
> ---
>
> Key: PARQUET-2035
> URL: https://issues.apache.org/jira/browse/PARQUET-2035
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.12.0, 1.11.1
>Reporter: Maxim Kolesnikov
>Priority: Major
> Attachments: parquet-example.zip
>
>
> *Description:*
> Due to collision of shaded packages 
> {code:java}
> shaded.parquet.it.unimi.dsi.fastutil{code}
> in 
> {code:java}
> org.apache.parquet:parquet-avro{code}
> and 
> {code:java}
> org.apache.parquet:parquet-column{code}
> it is not possible to use both these dependencies within a modularized java 
> project at the same time.
>  
>  
> *How to reproduce:*
>  * create a maven project with dependency 
> org.apache.parquet:parquet-avro:1.11.1
>  * declare java module that requires both parquet.avro and parquet.column
>  * run
> {code:java}
> mvn compile{code}
>  
> *Expected behaviour:*
> Project should compile without errors.
>  
> *Actual behaviour:*
> Project fails with compilation errors:
>  
> {code:java}
> [ERROR] the unnamed module reads package shaded.parquet.it.unimi.dsi.fastutil 
> from both parquet.column and parquet.avro
> ...{code}
>  
>  
> *Reproducible example* (same code as in the attached zip file): 
> https://github.com/xCASx/parquet-example



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2035) Java module import error due to shaded package shaded.parquet.it.unimi.dsi.fastutil

2021-04-27 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17333072#comment-17333072
 ] 

Gabor Szadovszky commented on PARQUET-2035:
---

bq. Use unique names for shaded packages, e.g. 
shaded.parquet.avro.it.unimi.dsi.fastutil, 
shaded.parquet.column.it.unimi.dsi.fastutil. Probably not an option, as it may 
break application logic due to the presence of multiple instances of the same 
classes.
I think it should work; however, it would not be a nice solution. The classes 
wouldn't actually be the same, because their packages would be different.
bq. Get rid of the dependency on fastutil in any two out of the three modules 
that are currently shading it. That may require significant code refactoring 
and may not be feasible. At least in parquet-hadoop it seems to be used in a 
single place for some performance optimisation.
As you've said, we require fastutil for performance. Dropping fastutil from any 
place we use it would result in a performance drawback.
bq. Achieve consistency of fastutil across hadoop projects. Would be ideal, 
but probably an even less feasible solution.
I would agree this is not possible. There are a lot of components in the 
ecosystem, and much smaller efforts on similar issues have already died.
bq. Create a new module, e.g. parquet-fastutil, that would contain only the 
shaded library. Add this module to the transitive non-shaded dependencies that 
depend on fastutil: parquet-avro, parquet-column, parquet-hadoop.
I think this is our best option. We already have a separate module for jackson 
for the same purpose. What do you think about renaming the existing 
parquet-jackson module to parquet-3rdparty (or a better name) instead of 
creating a new module for fastutil, so that it would contain all the 
dependencies we would like to shade? The only issue with this approach is that 
we cannot minimize the jar during shading, and fastutil is quite big (19M).

> Java module import error due to shaded package 
> shaded.parquet.it.unimi.dsi.fastutil
> ---
>
> Key: PARQUET-2035
> URL: https://issues.apache.org/jira/browse/PARQUET-2035
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.12.0, 1.11.1
>Reporter: Maxim Kolesnikov
>Priority: Major
> Attachments: parquet-example.zip
>
>
> *Description:*
> Due to collision of shaded packages 
> {code:java}
> shaded.parquet.it.unimi.dsi.fastutil{code}
> in 
> {code:java}
> org.apache.parquet:parquet-avro{code}
> and 
> {code:java}
> org.apache.parquet:parquet-column{code}
> it is not possible to use both these dependencies within a modularized java 
> project at the same time.
>  
>  
> *How to reproduce:*
>  * create a maven project with dependency 
> org.apache.parquet:parquet-avro:1.11.1
>  * declare java module that requires both parquet.avro and parquet.column
>  * run
> {code:java}
> mvn compile{code}
>  
> *Expected behaviour:*
> Project should compile without errors.
>  
> *Actual behaviour:*
> Project fails with compilation errors:
>  
> {code:java}
> [ERROR] the unnamed module reads package shaded.parquet.it.unimi.dsi.fastutil 
> from both parquet.column and parquet.avro
> ...{code}
>  
>  
> *Reproducible example* (same code as in the attached zip file): 
> https://github.com/xCASx/parquet-example



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2035) Java module import error due to shaded package shaded.parquet.it.unimi.dsi.fastutil

2021-04-26 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332566#comment-17332566
 ] 

Gabor Szadovszky commented on PARQUET-2035:
---

[~cas], I am not sure about the original purpose of the shading. Usually, 
shading is implemented when a component relies on a specific version of a 
dependency and does not want to introduce conflicts in a large ecosystem like 
Hadoop.

Even though java8 is already EOL, it does not mean we cannot keep our source 
compatible with java8 and still build for it. Most of the environments where 
parquet-mr is used are already on java11, but only at runtime. AFAIK, the whole 
Hadoop ecosystem is still stuck on java8 at the source level. So this is not 
about supporting java9 or java11 but about supporting the module feature of 
java.
As I've said, I don't have any experience with java modules, so I am not sure 
about the actual problem and what the best fix for it would be.

Based on the original commit, we have been shading fastutil for a while. 
PARQUET-1529 was only about keeping it shaded in all of our modules where it is 
used. By reverting PARQUET-1529 we would end up conflicting with other fastutil 
versions in the Hadoop ecosystem.

We need someone who has experience with java modules to contribute here. It 
also seems a tough issue, as we can hardly implement unit tests for such 
changes.

> Java module import error due to shaded package 
> shaded.parquet.it.unimi.dsi.fastutil
> ---
>
> Key: PARQUET-2035
> URL: https://issues.apache.org/jira/browse/PARQUET-2035
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.12.0, 1.11.1
>Reporter: Maxim Kolesnikov
>Priority: Major
> Attachments: parquet-example.zip
>
>
> *Description:*
> Due to collision of shaded packages 
> {code:java}
> shaded.parquet.it.unimi.dsi.fastutil{code}
> in 
> {code:java}
> org.apache.parquet:parquet-avro{code}
> and 
> {code:java}
> org.apache.parquet:parquet-column{code}
> it is not possible to use both these dependencies within a modularized java 
> project at the same time.
>  
>  
> *How to reproduce:*
>  * create a maven project with dependency 
> org.apache.parquet:parquet-avro:1.11.1
>  * declare java module that requires both parquet.avro and parquet.column
>  * run
> {code:java}
> mvn compile{code}
>  
> *Expected behaviour:*
> Project should compile without errors.
>  
> *Actual behaviour:*
> Project fails with compilation errors:
>  
> {code:java}
> [ERROR] the unnamed module reads package shaded.parquet.it.unimi.dsi.fastutil 
> from both parquet.column and parquet.avro
> ...{code}
>  
>  
> *Reproducible example* (same code as in the attached zip file): 
> https://github.com/xCASx/parquet-example



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2036) implicitly defining DEBUG mode in MessageColumnIO causes 80% performance overhead

2021-04-26 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332454#comment-17332454
 ] 

Gabor Szadovszky commented on PARQUET-2036:
---

[~elad_yosifon], thanks for reporting this.

I am not sure I understand the actual circumstances that lead to the DEBUG log 
level being enabled without an explicit configuration.

The "magic behavior" was implemented to allow the JIT to remove the logging 
parts from the compiled code so it runs faster. Without this static flag, the 
code would at least check the log level at each value read/write, even if the 
log level is much higher than DEBUG.

Regarding your tips for preventing the issue: do you think printing to STDOUT 
would help? I think STDOUT is only checked when there is an error. If you do 
not realize you have a performance impact, you would not notice the message on 
STDOUT either. Meanwhile, if you start checking the logs, it would be clear 
that the log level is DEBUG.
What do you mean by "waiting for explicit configuration"? I think 
{{isDebugEnabled}} should return the explicitly configured log level. 
parquet-mr uses SLF4J precisely to allow the users (other components) to 
specify a logging framework and configuration.
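
For illustration only, here is a minimal sketch of the guarded-logging pattern 
discussed above (the class name and methods below are made up, not parquet-mr 
code). With a static final flag evaluated once at class initialization, the JIT 
can treat the debug branch as dead code and drop it from the hot path, whereas 
calling isDebugEnabled() per record keeps a check on every write:

{code:java}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class GuardedLoggingSketch {
  private static final Logger LOG = LoggerFactory.getLogger(GuardedLoggingSketch.class);

  // Evaluated once when the class is initialized. If the level is above DEBUG,
  // the JIT can constant-fold this static final field and eliminate the whole
  // 'if (DEBUG)' branch from the compiled code.
  private static final boolean DEBUG = LOG.isDebugEnabled();

  void writeValue(Object value) {
    if (DEBUG) {
      LOG.debug("writing value {}", value);
    }
    // ... hot write path ...
  }

  // Alternative: checking the level on every call keeps a check (and a
  // virtual call) on the hot path even when DEBUG logging is disabled.
  void writeValueCheckedPerCall(Object value) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("writing value {}", value);
    }
    // ... hot write path ...
  }
}
{code}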

> implicitly defining DEBUG mode in MessageColumnIO causes 80% performance 
> overhead
> -
>
> Key: PARQUET-2036
> URL: https://issues.apache.org/jira/browse/PARQUET-2036
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.0, 1.10.1, 1.12.0
>Reporter: Elad Yosifon
>Priority: Critical
>
> The *parquet-column* jar leverages +slf4j with log4j as the default logger+; 
> neglecting to define a log4j configuration makes it default to the *DEBUG* 
> log level.
>  
> {code:java}
> public class MessageColumnIO extends GroupColumnIO {
>   private static final Logger LOG = 
> LoggerFactory.getLogger(MessageColumnIO.class);
>   private static final boolean DEBUG = LOG.isDebugEnabled(); // <--
> }
> {code}
>  
> this "magic behavior" defaults parquet library to be in DEBUG mode, without 
> any notification or warnings. Unfortunately, the 
> *RecordConsumerLoggingWrapper* implementation generates 5x performance 
> overhead in comparison to the *MessageColumnIORecordConsumer* implementation, 
> causing a massive hit in performance and wasteful server utilization.
>  
> +IMHO there are two things that could prevent such an issue:+
>  * printing a message to STDOUT notifying that DEBUG mode is active.
>  * defaulting to the *MessageColumnIORecordConsumer* implementation, and only 
> using *RecordConsumerLoggingWrapper* when an explicit configuration enables 
> DEBUG mode.
>  
> In the past 2 years, this issue probably cost my company $50,000 in excessive 
> cloud costs!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2035) Java module import error due to shaded package shaded.parquet.it.unimi.dsi.fastutil

2021-04-26 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332440#comment-17332440
 ] 

Gabor Szadovszky commented on PARQUET-2035:
---

[~cas], thanks for reporting this issue.

I don't have any experience with java11 modules. Since parquet-mr still 
targets java8 and a couple of other projects use it without any issue 
(probably not in a java11 modularized environment), I would not call this a 
bug. I would expect there to be workarounds for java11 modularized 
environments, since it works without modules.

Meanwhile, I am happy to help with and review any contribution to parquet-mr 
that makes it work properly with java11 modules.
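
For illustration, here is a minimal sketch of the modularized setup described 
in the reproduction steps (the consumer module name com.example.parquetapp is 
hypothetical; parquet.avro and parquet.column are the automatic module names 
mentioned in the report). Because both jars contain the shaded 
shaded.parquet.it.unimi.dsi.fastutil package, the module system rejects this 
as a split package:

{code:java}
// module-info.java -- hypothetical consumer module reproducing the conflict
module com.example.parquetapp {
    // Automatic modules derived from the parquet-avro and parquet-column jars;
    // both jars contain the same shaded fastutil package, which javac rejects.
    requires parquet.avro;
    requires parquet.column;
}
{code}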

> Java module import error due to shaded package 
> shaded.parquet.it.unimi.dsi.fastutil
> ---
>
> Key: PARQUET-2035
> URL: https://issues.apache.org/jira/browse/PARQUET-2035
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-avro
>Affects Versions: 1.12.0, 1.11.1
>Reporter: Maxim Kolesnikov
>Priority: Major
> Attachments: parquet-example.zip
>
>
> *Description:*
> Due to collision of shaded packages 
> {code:java}
> shaded.parquet.it.unimi.dsi.fastutil{code}
> in 
> {code:java}
> org.apache.parquet:parquet-avro{code}
> and 
> {code:java}
> org.apache.parquet:parquet-column{code}
> it is not possible to use both these dependencies within a modularized java 
> project at the same time.
>  
>  
> *How to reproduce:*
>  * create a maven project with dependency 
> org.apache.parquet:parquet-avro:1.11.1
>  * declare java module that requires both parquet.avro and parquet.column
>  * run
> {code:java}
> mvn compile{code}
>  
> *Expected behaviour:*
> Project should compile without errors.
>  
> *Actual behaviour:*
> Project fails with compilation errors:
>  
> {code:java}
> [ERROR] the unnamed module reads package shaded.parquet.it.unimi.dsi.fastutil 
> from both parquet.column and parquet.avro
> ...{code}
>  
>  
> *Reproducible example* (same code as in the attached zip file): 
> https://github.com/xCASx/parquet-example



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1006) ColumnChunkPageWriter uses only heap memory.

2021-04-26 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332432#comment-17332432
 ] 

Gabor Szadovszky commented on PARQUET-1006:
---

[~vitalii],

If you are interested in the thread on the dev list you may either subscribe to 
the list or check the 
[list|https://lists.apache.org/list.html?dev@parquet.apache.org] or the 
[thread|https://lists.apache.org/thread.html/r0a82d1bf3f7a850da1f1e4d952721a81f624adb68250de191d90aa08%40%3Cdev.parquet.apache.org%3E]
 itself on [ponymail|https://lists.apache.org].

{{CapacityByteArrayOutputStream}} (the name is misleading) requires an initial 
slab size, a max capacity hint and an allocator. If you want to use it, you 
have to come up with a good hint independently of which allocator you use. I 
think it is a good idea to use {{CapacityByteArrayOutputStream}} instead of the 
currently used {{ByteBufferOutputStream}}, but you need to implement it in a 
way that does not hurt performance for either direct or heap allocations.
{{ConcatenatingByteArrayCollector}} serves a different purpose; it is not even 
an OutputStream.
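
As a sketch only (not a proposed patch), wiring {{CapacityByteArrayOutputStream}} 
to a direct allocator might look roughly like the following. The slab size and 
capacity hint are made-up placeholder values, and picking a good hint is 
exactly the open question above; the class names are the parquet-common ones 
referenced in this discussion, assuming the three-argument constructor 
described:

{code:java}
import org.apache.parquet.bytes.ByteBufferAllocator;
import org.apache.parquet.bytes.CapacityByteArrayOutputStream;
import org.apache.parquet.bytes.DirectByteBufferAllocator;

public class DirectBufferSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder numbers: 64 KiB initial slab, 1 MiB max capacity hint.
    int initialSlabSize = 64 * 1024;
    int maxCapacityHint = 1024 * 1024;
    ByteBufferAllocator allocator = new DirectByteBufferAllocator();

    try (CapacityByteArrayOutputStream out =
             new CapacityByteArrayOutputStream(initialSlabSize, maxCapacityHint, allocator)) {
      // Data written here is buffered in ByteBuffers obtained from the
      // allocator (direct/off-heap in this case) instead of heap byte[].
      out.write(new byte[] {1, 2, 3});
    }
  }
}
{code}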

> ColumnChunkPageWriter uses only heap memory.
> 
>
> Key: PARQUET-1006
> URL: https://issues.apache.org/jira/browse/PARQUET-1006
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.0, 1.12.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
>Priority: Major
> Fix For: 1.13.0
>
>
> After PARQUET-160 was resolved, ColumnChunkPageWriter started using 
> ConcatenatingByteArrayCollector. There, all data is collected in a List of 
> byte[] before writing the page, with no way to use direct memory for 
> allocating buffers. ByteBufferAllocator is present in the 
> [ColumnChunkPageWriter|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L73]
>  class, but it is never used.
> Using Java heap space can in some cases cause OOM exceptions or GC overhead.
> ByteBufferAllocator should be used in the ConcatenatingByteArrayCollector or 
> OutputStream classes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2027) Merging parquet files created in 1.11.1 not possible using 1.12.0

2021-04-26 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332386#comment-17332386
 ] 

Gabor Szadovszky commented on PARQUET-2027:
---

I've created the branch parquet-1.12.x and backported this change. Until we 
create the 1.12.1 release, you may build and use your own version.

> Merging parquet files created in 1.11.1 not possible using 1.12.0 
> --
>
> Key: PARQUET-2027
> URL: https://issues.apache.org/jira/browse/PARQUET-2027
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Matthew M
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.12.1
>
>
> I have parquet files created using 1.11.1. In the process I join two files 
> (with the same schema) into one output file. I create a Hadoop writer:
> {code:scala}
> val hadoopWriter = new ParquetFileWriter(
>   HadoopOutputFile.fromPath(
> new Path(outputPath.toString),
> new Configuration()
>   ), outputSchema, Mode.OVERWRITE,
>   8 * 1024 * 1024,
>   2097152,
>   DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH,
>   DEFAULT_STATISTICS_TRUNCATE_LENGTH,
>   DEFAULT_PAGE_WRITE_CHECKSUM_ENABLED
> )
> hadoopWriter.start()
> {code}
> and try to append one file into another:
> {code:scala}
> hadoopWriter.appendFile(HadoopInputFile.fromPath(new Path(file), new 
> Configuration()))
> {code}
> Everything works on 1.11.1, but after switching to 1.12.0 it fails with this 
> error:
> {code:scala}
> STDERR: Exception in thread "main" java.io.IOException: can not read class 
> org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' 
> was not found in serialized data! Struct: 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@b91d8c4
>  at org.apache.parquet.format.Util.read(Util.java:365)
>  at org.apache.parquet.format.Util.readPageHeader(Util.java:132)
>  at org.apache.parquet.format.Util.readPageHeader(Util.java:127)
>  at org.apache.parquet.hadoop.Offsets.readDictionaryPageSize(Offsets.java:75)
>  at org.apache.parquet.hadoop.Offsets.getOffsets(Offsets.java:58)
>  at 
> org.apache.parquet.hadoop.ParquetFileWriter.appendRowGroup(ParquetFileWriter.java:998)
>  at 
> org.apache.parquet.hadoop.ParquetFileWriter.appendRowGroups(ParquetFileWriter.java:918)
>  at 
> org.apache.parquet.hadoop.ParquetFileReader.appendTo(ParquetFileReader.java:888)
>  at 
> org.apache.parquet.hadoop.ParquetFileWriter.appendFile(ParquetFileWriter.java:895)
>  at [...]
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: 
> Required field 'uncompressed_page_size' was not found in serialized data! 
> Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@b91d8c4
>  at 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1108)
>  at 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1019)
>  at org.apache.parquet.format.PageHeader.read(PageHeader.java:896)
>  at org.apache.parquet.format.Util.read(Util.java:362)
>  ... 14 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-2027) Merging parquet files created in 1.11.1 not possible using 1.12.0

2021-04-26 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17332052#comment-17332052
 ] 

Gabor Szadovszky commented on PARQUET-2027:
---

[~eltherion], currently I don't have the bandwidth to create another release, 
and there might be other candidates as well. But I'll create a branch and 
backport this one, so if we create a release it'll be part of it.

> Merging parquet files created in 1.11.1 not possible using 1.12.0 
> --
>
> Key: PARQUET-2027
> URL: https://issues.apache.org/jira/browse/PARQUET-2027
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Matthew M
>Assignee: Gabor Szadovszky
>Priority: Major
>
> I have parquet files created using 1.11.1. In the process I join two files 
> (with the same schema) into one output file. I create a Hadoop writer:
> {code:scala}
> val hadoopWriter = new ParquetFileWriter(
>   HadoopOutputFile.fromPath(
> new Path(outputPath.toString),
> new Configuration()
>   ), outputSchema, Mode.OVERWRITE,
>   8 * 1024 * 1024,
>   2097152,
>   DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH,
>   DEFAULT_STATISTICS_TRUNCATE_LENGTH,
>   DEFAULT_PAGE_WRITE_CHECKSUM_ENABLED
> )
> hadoopWriter.start()
> {code}
> and try to append one file into another:
> {code:scala}
> hadoopWriter.appendFile(HadoopInputFile.fromPath(new Path(file), new 
> Configuration()))
> {code}
> Everything works on 1.11.1, but after switching to 1.12.0 it fails with this 
> error:
> {code:scala}
> STDERR: Exception in thread "main" java.io.IOException: can not read class 
> org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' 
> was not found in serialized data! Struct: 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@b91d8c4
>  at org.apache.parquet.format.Util.read(Util.java:365)
>  at org.apache.parquet.format.Util.readPageHeader(Util.java:132)
>  at org.apache.parquet.format.Util.readPageHeader(Util.java:127)
>  at org.apache.parquet.hadoop.Offsets.readDictionaryPageSize(Offsets.java:75)
>  at org.apache.parquet.hadoop.Offsets.getOffsets(Offsets.java:58)
>  at 
> org.apache.parquet.hadoop.ParquetFileWriter.appendRowGroup(ParquetFileWriter.java:998)
>  at 
> org.apache.parquet.hadoop.ParquetFileWriter.appendRowGroups(ParquetFileWriter.java:918)
>  at 
> org.apache.parquet.hadoop.ParquetFileReader.appendTo(ParquetFileReader.java:888)
>  at 
> org.apache.parquet.hadoop.ParquetFileWriter.appendFile(ParquetFileWriter.java:895)
>  at [...]
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: 
> Required field 'uncompressed_page_size' was not found in serialized data! 
> Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@b91d8c4
>  at 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1108)
>  at 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1019)
>  at org.apache.parquet.format.PageHeader.read(PageHeader.java:896)
>  at org.apache.parquet.format.Util.read(Util.java:362)
>  ... 14 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (PARQUET-2027) Merging parquet files created in 1.11.1 not possible using 1.12.0

2021-04-26 Thread Gabor Szadovszky (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Szadovszky resolved PARQUET-2027.
---
Fix Version/s: 1.12.1
   Resolution: Fixed

> Merging parquet files created in 1.11.1 not possible using 1.12.0 
> --
>
> Key: PARQUET-2027
> URL: https://issues.apache.org/jira/browse/PARQUET-2027
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0
>Reporter: Matthew M
>Assignee: Gabor Szadovszky
>Priority: Major
> Fix For: 1.12.1
>
>
> I have parquet files created using 1.11.1. In the process I join two files 
> (with the same schema) into one output file. I create a Hadoop writer:
> {code:scala}
> val hadoopWriter = new ParquetFileWriter(
>   HadoopOutputFile.fromPath(
> new Path(outputPath.toString),
> new Configuration()
>   ), outputSchema, Mode.OVERWRITE,
>   8 * 1024 * 1024,
>   2097152,
>   DEFAULT_COLUMN_INDEX_TRUNCATE_LENGTH,
>   DEFAULT_STATISTICS_TRUNCATE_LENGTH,
>   DEFAULT_PAGE_WRITE_CHECKSUM_ENABLED
> )
> hadoopWriter.start()
> {code}
> and try to append one file into another:
> {code:scala}
> hadoopWriter.appendFile(HadoopInputFile.fromPath(new Path(file), new 
> Configuration()))
> {code}
> Everything works on 1.11.1, but after switching to 1.12.0 it fails with this 
> error:
> {code:scala}
> STDERR: Exception in thread "main" java.io.IOException: can not read class 
> org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' 
> was not found in serialized data! Struct: 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@b91d8c4
>  at org.apache.parquet.format.Util.read(Util.java:365)
>  at org.apache.parquet.format.Util.readPageHeader(Util.java:132)
>  at org.apache.parquet.format.Util.readPageHeader(Util.java:127)
>  at org.apache.parquet.hadoop.Offsets.readDictionaryPageSize(Offsets.java:75)
>  at org.apache.parquet.hadoop.Offsets.getOffsets(Offsets.java:58)
>  at 
> org.apache.parquet.hadoop.ParquetFileWriter.appendRowGroup(ParquetFileWriter.java:998)
>  at 
> org.apache.parquet.hadoop.ParquetFileWriter.appendRowGroups(ParquetFileWriter.java:918)
>  at 
> org.apache.parquet.hadoop.ParquetFileReader.appendTo(ParquetFileReader.java:888)
>  at 
> org.apache.parquet.hadoop.ParquetFileWriter.appendFile(ParquetFileWriter.java:895)
>  at [...]
> Caused by: shaded.parquet.org.apache.thrift.protocol.TProtocolException: 
> Required field 'uncompressed_page_size' was not found in serialized data! 
> Struct: org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@b91d8c4
>  at 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1108)
>  at 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme.read(PageHeader.java:1019)
>  at org.apache.parquet.format.PageHeader.read(PageHeader.java:896)
>  at org.apache.parquet.format.Util.read(Util.java:362)
>  ... 14 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1006) ColumnChunkPageWriter uses only heap memory.

2021-04-23 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17330425#comment-17330425
 ] 

Gabor Szadovszky commented on PARQUET-1006:
---

[~vitalii], it is fine to have 1.8.0 in the affected versions field. I just 
wanted to highlight that we won't release a potential fix on that branch but 
only in the next release based on master.

I've written my answer to the dev list. Let me put it here as well so one 
doesn't have to search the archives later.
{quote}CapacityByteArrayOutputStream is not only about the selectable allocator 
but about the growing mechanism as well. Based on its documentation you will 
need a good maxCapacityHint, which I am not sure you have in the case of a 
column chunk. We have size limits/hints for pages and row groups but don't have 
such things for column chunks. If you set the hint too high you may end up 
allocating too much space; however, it should not be worse than the existing 
ByteArrayOutputStream. If you set it too low, you might end up with too many 
allocations while growing, which could hit performance.
 If you can come up with a good maxCapacityHint and prove with performance 
tests that the change is not slower than the original, I am fine with this 
update.
 About the API: ColumnChunkPageWriteStore is not part of the public API of 
parquet-mr. I know it is public from the Java point of view, but it was never 
meant to be used directly. It is neither a pro nor a con to add a new public 
method; it is just good to know what we are extending. I think, if the 
performance tests approve, it would be cleaner to simply change the 
ByteArrayOutputStream to CapacityByteArrayOutputStream without adding any new 
API.
{quote}

> ColumnChunkPageWriter uses only heap memory.
> 
>
> Key: PARQUET-1006
> URL: https://issues.apache.org/jira/browse/PARQUET-1006
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.0, 1.12.0
>Reporter: Vitalii Diravka
>Assignee: Vitalii Diravka
>Priority: Major
>
> After PARQUET-160 was resolved, ColumnChunkPageWriter started using 
> ConcatenatingByteArrayCollector. There, all data is collected in a List of 
> byte[] before writing the page, with no way to use direct memory for 
> allocating buffers. ByteBufferAllocator is present in the 
> [ColumnChunkPageWriter|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ColumnChunkPageWriteStore.java#L73]
>  class, but it is never used.
> Using Java heap space can in some cases cause OOM exceptions or GC overhead.
> ByteBufferAllocator should be used in the ConcatenatingByteArrayCollector or 
> OutputStream classes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

