[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705032#comment-17705032
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148487083


##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;
+   /** 
+ * When present there is expected to be one element corresponding to each 
repetition (i.e. size=max repetition_leve) 
+ * where each element represens the count of the number of times that 
level occurs in the page/column chunk.
+ */
+   8: optional list<i64> repetition_level_histogram;

Review Comment:
   Oh, sorry for misunderstanding this. How can we make full use of this 
histogram?





> [Format] Add statistics that reflect decoded size to metadata
> -
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: Initial proposal for unencoded/uncompressed statistics

2023-03-25 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148487083


##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;
+   /** 
+ * When present there is expected to be one element corresponding to each 
repetition (i.e. size=max repetition_leve) 
+ * where each element represens the count of the number of times that 
level occurs in the page/column chunk.
+ */
+   8: optional list<i64> repetition_level_histogram;

Review Comment:
   Oh, sorry for misunderstanding this. How can we make full use of this 
histogram?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Resolved] (PARQUET-2154) ParquetFileReader should close its input stream when `filterRowGroups` throw Exception in constructor

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2154.
--
Resolution: Fixed

> ParquetFileReader should close its input stream when `filterRowGroups` throw 
> Exception in constructor
> -
>
> Key: PARQUET-2154
> URL: https://issues.apache.org/jira/browse/PARQUET-2154
> Project: Parquet
>  Issue Type: Bug
>Reporter: Yang Jie
>Priority: Major
>
>  
> {code:java}
> public ParquetFileReader(InputFile file, ParquetReadOptions options) throws 
> IOException {
>   this.converter = new ParquetMetadataConverter(options);
>   this.file = file;
>   this.f = file.newStream();
>   this.options = options;
>   try {
> this.footer = readFooter(file, options, f, converter);
>   } catch (Exception e) {
> // In case that reading footer throws an exception in the constructor, 
> the new stream
> // should be closed. Otherwise, there's no way to close this outside.
> f.close();
> throw e;
>   }
>   this.fileMetaData = footer.getFileMetaData();
>   this.fileDecryptor = fileMetaData.getFileDecryptor(); // must be called 
> before filterRowGroups!
>   if (null != fileDecryptor && fileDecryptor.plaintextFile()) {
> this.fileDecryptor = null; // Plaintext file. No need in decryptor
>   }
>   this.blocks = filterRowGroups(footer.getBlocks());
>   this.blockIndexStores = listWithNulls(this.blocks.size());
>   this.blockRowRanges = listWithNulls(this.blocks.size());
>   for (ColumnDescriptor col : 
> footer.getFileMetaData().getSchema().getColumns()) {
> paths.put(ColumnPath.get(col.getPath()), col);
>   }
>   this.crc = options.usePageChecksumVerification() ? new CRC32() : null;
> } {code}
> During the construction of ParquetFileReader, if the `filterRowGroups` method 
> throws an exception, it causes a resource leak: when 
> `filterRowGroups(footer.getBlocks())` throws, the stream opened by 
> `this.f = file.newStream()` cannot be closed from outside.
>  
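> A minimal sketch of the kind of fix this implies (the shape is hypothetical, 
> not the actual parquet-mr patch): widen the try block so that any exception 
> thrown during construction, including from `filterRowGroups`, closes the newly 
> opened stream before rethrowing.
> {code:java}
> try {
>   this.footer = readFooter(file, options, f, converter);
>   this.fileMetaData = footer.getFileMetaData();
>   // ... decryptor setup elided ...
>   this.blocks = filterRowGroups(footer.getBlocks());
> } catch (Exception e) {
>   // Close the stream we just opened; nothing outside the constructor can reach it yet.
>   f.close();
>   throw e;
> }
> {code}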



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2161) Row positions are computed incorrectly when range or offset metadata filter is used

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2161.
--
Resolution: Fixed

> Row positions are computed incorrectly when range or offset metadata filter 
> is used
> ---
>
> Key: PARQUET-2161
> URL: https://issues.apache.org/jira/browse/PARQUET-2161
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.3
>Reporter: Ala Luszczak
>Priority: Major
>
> The row indexes introduced in PARQUET-2117 are not computed correctly when
> (1) range or offset metadata filter is applied, and
> (2) the first row group was eliminated by the filter
> For example, if a file has two row groups with 10 rows each, and we attempt 
> to read only the 2nd row group, we are going to produce row indexes 0, 1, 2, 
> ..., 9 instead of the expected 10, 11, ..., 19.
> This happens because functions `filterFileMetaDataByStart` (used here: 
> https://github.com/apache/parquet-mr/blob/e06384455567c56d5906fc3a152ab00fd8dfdf33/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1453)
>  and `filterFileMetaDataByMidpoint` (used here: 
> https://github.com/apache/parquet-mr/blob/e06384455567c56d5906fc3a152ab00fd8dfdf33/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L1460)
>  modify their input `FileMetaData`. To address the issue we need to 
> `generateRowGroupOffsets` before these filters are applied.
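> A rough sketch of that ordering (accessor names are hypothetical; this is not 
> the actual patch): compute the first row index of every row group from the 
> complete, unfiltered metadata, and only then apply the range/midpoint filters, 
> since they mutate their input.
> {code:java}
> // rowGroups: the full, unfiltered list of row groups
> long nextRow = 0;
> long[] firstRowIndex = new long[rowGroups.size()];
> for (int i = 0; i < rowGroups.size(); i++) {
>   firstRowIndex[i] = nextRow;      // 0 and 10 in the two-row-group example above
>   nextRow += rowGroups.get(i).getRowCount();
> }
> // filterFileMetaDataByStart / filterFileMetaDataByMidpoint may run only after this
> {code}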



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2138) Add ShowBloomFilterCommand to parquet-cli

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2138.
--
Resolution: Fixed

> Add ShowBloomFilterCommand to parquet-cli
> -
>
> Key: PARQUET-2138
> URL: https://issues.apache.org/jira/browse/PARQUET-2138
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli
>Reporter: EdisonWang
>Priority: Minor
>
> Add ShowBloomFilterCommand to parquet-cli, which can check whether given 
> values of a column match the bloom filter



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2134) Incorrect type checking in HadoopStreams.wrap

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2134.
--
Resolution: Fixed

> Incorrect type checking in HadoopStreams.wrap
> -
>
> Key: PARQUET-2134
> URL: https://issues.apache.org/jira/browse/PARQUET-2134
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.8.3, 1.10.1, 1.11.2, 1.12.2
>Reporter: Todd Gao
>Priority: Minor
>
> The method 
> [HadoopStreams.wrap|https://github.com/apache/parquet-mr/blob/4d062dc37577e719dcecc666f8e837843e44a9be/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/util/HadoopStreams.java#L51]
>  wraps an FSDataInputStream to a SeekableInputStream. 
> It checks whether the underlying stream of the passed FSDataInputStream 
> implements ByteBufferReadable: if so, it wraps the FSDataInputStream in an 
> H2SeekableInputStream; otherwise, in an H1SeekableInputStream.
> In some cases, we may add another wrapper over FSDataInputStream. For 
> example, 
> {code:java}
> class CustomDataInputStream extends FSDataInputStream {
> public CustomDataInputStream(FSDataInputStream original) {
> super(original);
> }
> }
> {code}
> When we create an FSDataInputStream whose underlying stream does not 
> implement ByteBufferReadable, and then create a CustomDataInputStream from 
> it, using HadoopStreams.wrap to create a SeekableInputStream may produce an 
> error like 
> {quote}java.lang.UnsupportedOperationException: Byte-buffer read unsupported 
> by input stream{quote}
> We can fix this by recursively checking the underlying stream of the 
> FSDataInputStream.
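> A minimal sketch of such a recursive check (the helper name is assumed and 
> not necessarily the shape of the final fix):
> {code:java}
> // Unwrap nested FSDataInputStream layers and test the innermost stream.
> private static boolean isByteBufferReadable(FSDataInputStream stream) {
>   InputStream wrapped = stream.getWrappedStream();
>   if (wrapped instanceof FSDataInputStream) {
>     return isByteBufferReadable((FSDataInputStream) wrapped);
>   }
>   return wrapped instanceof ByteBufferReadable;
> }
> {code}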



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2155) Upgrade protobuf version to 3.17.3

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2155.
--
  Assignee: Chao Sun
Resolution: Fixed

> Upgrade protobuf version to 3.17.3
> --
>
> Key: PARQUET-2155
> URL: https://issues.apache.org/jira/browse/PARQUET-2155
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2167) CLI show footer command fails if Parquet file contains date fields

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2167.
--
Resolution: Fixed

> CLI show footer command fails if Parquet file contains date fields
> --
>
> Key: PARQUET-2167
> URL: https://issues.apache.org/jira/browse/PARQUET-2167
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cli
>Affects Versions: 1.12.2
>Reporter: Bryan Keller
>Priority: Minor
> Attachments: sample.parquet
>
>
> The show footer command in the CLI fails with the following error if run 
> against a file with date fields:
> com.fasterxml.jackson.databind.exc.InvalidDefinitionException: Java 8 
> date/time type `java.time.ZoneOffset` not supported by default: add Module 
> "com.fasterxml.jackson.datatype:jackson-datatype-jsr310" to enable handling 
> (through reference chain: 
> org.apache.parquet.hadoop.metadata.ParquetMetadata["blocks"]->java.util.ArrayList[0]->org.apache.parquet.hadoop.metadata.BlockMetaData["columns"]->java.util.ArrayList[2]->org.apache.parquet.hadoop.metadata.IntColumnChunkMetaData["statistics"]->org.apache.parquet.column.statistics.IntStatistics["stringifier"]->org.apache.parquet.schema.PrimitiveStringifier$5["formatter"]->java.time.format.DateTimeFormatter["zone"])
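> For illustration only (this is not the parquet-cli patch itself), the 
> Jackson-side remedy the message points at is registering the JSR-310 module 
> before serializing metadata that contains java.time values:
> {code:java}
> ObjectMapper mapper = new ObjectMapper();
> mapper.registerModule(new JavaTimeModule()); // from jackson-datatype-jsr310
> String json = mapper.writerWithDefaultPrettyPrinter().writeValueAsString(parquetMetadata);
> {code}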



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2169) Upgrade Avro to version 1.11.1

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2169.
--
Resolution: Fixed

> Upgrade Avro to version 1.11.1
> --
>
> Key: PARQUET-2169
> URL: https://issues.apache.org/jira/browse/PARQUET-2169
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-avro
>Reporter: Ismaël Mejía
>Assignee: Ismaël Mejía
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2191) Upgrade Scala to 2.12.17

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2191.
--
Resolution: Fixed

> Upgrade Scala to 2.12.17
> 
>
> Key: PARQUET-2191
> URL: https://issues.apache.org/jira/browse/PARQUET-2191
> Project: Parquet
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2192) Add Java 17 build test to GitHub action

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2192.
--
Resolution: Fixed

> Add Java 17 build test to GitHub action
> ---
>
> Key: PARQUET-2192
> URL: https://issues.apache.org/jira/browse/PARQUET-2192
> Project: Parquet
>  Issue Type: Test
>  Components: parquet-testing
>Affects Versions: 1.13.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2185) ParquetReader constructed using builder fails to read encrypted files

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2185.
--
Resolution: Fixed

> ParquetReader constructed using builder fails to read encrypted files
> -
>
> Key: PARQUET-2185
> URL: https://issues.apache.org/jira/browse/PARQUET-2185
> Project: Parquet
>  Issue Type: Bug
>Reporter: Atul Mohan
>Priority: Minor
>
> ParquetReader objects can be constructed using the builder as follows:
>  {code:java}
> ParquetReader<Group> builderReader = ParquetReader.builder(new 
> GroupReadSupport(), new Path("path/to/c000.snappy.parquet"))
> .withConf(conf)
> .build();
> {code}
> This parquetReader object cannot be used to read encrypted files as 
> {noformat}
> builderReader.read(){noformat}
>  fails with the following exception:
>  
> {code:java}
> java.lang.NullPointerException at 
> org.apache.parquet.crypto.keytools.FileKeyUnwrapper.getKey(FileKeyUnwrapper.java:87)
>   {code}
> It seems like the reason is that the _withConf_ method within the 
> ParquetReader builder [clears the optionsBuilder set 
> earlier|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetReader.java#L231].
> Here is a sample test showcasing the issue: 
> [https://gist.github.com/a2l007/3d813cc5e44c45100dda169dc6245ae4]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2197) Document uniform encryption

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2197.
--
Resolution: Fixed

> Document uniform encryption
> ---
>
> Key: PARQUET-2197
> URL: https://issues.apache.org/jira/browse/PARQUET-2197
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.3
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Minor
>
> Document the hadoop parameter for uniform encryption



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2176) Parquet writers should allow for configurable index/statistics truncation

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2176.
--
Resolution: Fixed

> Parquet writers should allow for configurable index/statistics truncation
> -
>
> Key: PARQUET-2176
> URL: https://issues.apache.org/jira/browse/PARQUET-2176
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.3
>Reporter: patchwork01
>Priority: Major
>
> ParquetWriter does not expose any way to set the properties for column index 
> or statistics truncation.
> With ParquetOutputFormat those can be set with 
> parquet.columnindex.truncate.length and parquet.statistics.truncate.length. 
> These are not applied for ParquetWriter.
> These properties are documented here: 
> [https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md]
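> For reference, a minimal sketch of how those documented keys are applied on 
> the Hadoop configuration path (the ask here is an equivalent knob on 
> ParquetWriter itself); the truncation lengths below are arbitrary:
> {code:java}
> Configuration conf = new Configuration();
> conf.setInt("parquet.columnindex.truncate.length", 64);
> conf.setInt("parquet.statistics.truncate.length", 64);
> {code}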



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-1711) [parquet-protobuf] stack overflow when work with well known json type

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-1711.
--
Resolution: Fixed

> [parquet-protobuf] stack overflow when work with well known json type
> -
>
> Key: PARQUET-1711
> URL: https://issues.apache.org/jira/browse/PARQUET-1711
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.10.1
>Reporter: Lawrence He
>Priority: Major
>
> Writing the following protobuf message as a parquet file is not possible: 
> {code:java}
> syntax = "proto3";
> import "google/protobuf/struct.proto";
> package test;
> option java_outer_classname = "CustomMessage";
> message TestMessage {
> map<string, google.protobuf.ListValue> data = 1;
> } {code}
> Protobuf introduced "well known" JSON types such as 
> [ListValue|https://developers.google.com/protocol-buffers/docs/reference/google.protobuf#listvalue]
>  to work around JSON schema conversion. 
> However, writing the above message traps the parquet writer in an infinite 
> loop due to the "general type" support in protobuf. The current implementation 
> keeps expanding the 6 possible value types defined in protobuf (null, bool, 
> number, string, struct, list) and enters an infinite loop when it reaches 
> "struct".
> {code:java}
> java.lang.StackOverflowErrorjava.lang.StackOverflowError at 
> java.base/java.util.Arrays$ArrayItr.(Arrays.java:4418) at 
> java.base/java.util.Arrays$ArrayList.iterator(Arrays.java:4410) at 
> java.base/java.util.Collections$UnmodifiableCollection$1.(Collections.java:1044)
>  at 
> java.base/java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1043)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:64)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.convertFields(ProtoSchemaConverter.java:66)
>  at 
> org.apache.parquet.proto.ProtoSchemaConverter.addField(ProtoSchemaConverter.java:96)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (PARQUET-2195) Add scan command to parquet-cli

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu reassigned PARQUET-2195:


Assignee: Gang Wu

> Add scan command to parquet-cli
> ---
>
> Key: PARQUET-2195
> URL: https://issues.apache.org/jira/browse/PARQUET-2195
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> parquet-cli has *cat* and *head* commands to print the records but it does 
> not have the capability to *scan* (w/o printing) all records to check if the 
> file is corrupted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2195) Add scan command to parquet-cli

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2195.
--
Resolution: Fixed

> Add scan command to parquet-cli
> ---
>
> Key: PARQUET-2195
> URL: https://issues.apache.org/jira/browse/PARQUET-2195
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cli
>Reporter: Gang Wu
>Assignee: Gang Wu
>Priority: Major
>
> parquet-cli has *cat* and *head* commands to print the records but it does 
> not have the capability to *scan* (w/o printing) all records to check if the 
> file is corrupted.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2177) Fix parquet-cli not to fail showing descriptions

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2177.
--
Resolution: Fixed

> Fix parquet-cli not to fail showing descriptions
> 
>
> Key: PARQUET-2177
> URL: https://issues.apache.org/jira/browse/PARQUET-2177
> Project: Parquet
>  Issue Type: Bug
>Reporter: Kengo Seki
>Assignee: Kengo Seki
>Priority: Minor
>
> Currently, trying to show the descriptions of the 'prune' and 'masking' 
> subcommands leads to NPE as follows.
> {code}
> $ java -cp 'target/parquet-cli-1.13.0-SNAPSHOT.jar:target/dependency/*' 
> org.apache.parquet.cli.Main help prune
> Exception in thread "main" java.lang.NullPointerException
>   at 
> com.beust.jcommander.JCommander$MainParameter.access$900(JCommander.java:64)
>   at 
> com.beust.jcommander.JCommander.getMainParameterDescription(JCommander.java:965)
>   at org.apache.parquet.cli.Help.run(Help.java:65)
>   at org.apache.parquet.cli.Main.run(Main.java:146)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.parquet.cli.Main.main(Main.java:189)
> {code}
> {code}
> $ java -cp 'target/parquet-cli-1.13.0-SNAPSHOT.jar:target/dependency/*' 
> org.apache.parquet.cli.Main help masking
> Exception in thread "main" java.lang.NullPointerException
>   at 
> com.beust.jcommander.JCommander$MainParameter.access$900(JCommander.java:64)
>   at 
> com.beust.jcommander.JCommander.getMainParameterDescription(JCommander.java:965)
>   at org.apache.parquet.cli.Help.run(Help.java:65)
>   at org.apache.parquet.cli.Main.run(Main.java:146)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>   at org.apache.parquet.cli.Main.main(Main.java:189)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2198) Vulnerabilities in jackson-databind

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2198.
--
Resolution: Fixed

> Vulnerabilities in jackson-databind
> ---
>
> Key: PARQUET-2198
> URL: https://issues.apache.org/jira/browse/PARQUET-2198
> Project: Parquet
>  Issue Type: Bug
>Affects Versions: 1.12.3
>Reporter: Łukasz Dziedziul
>Priority: Major
>  Labels: jackson-databind, security, vulnerabilities
>
> Update jackson-databind to mitigate CVEs:
>  * [CVE-2022-42003|https://github.com/advisories/GHSA-jjjh-jjxp-wpff] - 
> [https://nvd.nist.gov/vuln/detail/CVE-2022-42003]
>  * [CVE-2022-42004|https://github.com/advisories/GHSA-rgv9-q543-rqg4] - 
> [https://nvd.nist.gov/vuln/detail/CVE-2022-42004 (fixed in  
> 2.13.4)|https://nvd.nist.gov/vuln/detail/CVE-2022-42004]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2208) Add details to nested column encryption config doc and exception text

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2208.
--
  Assignee: Gidon Gershinsky
Resolution: Fixed

> Add details to nested column encryption config doc and exception text
> -
>
> Key: PARQUET-2208
> URL: https://issues.apache.org/jira/browse/PARQUET-2208
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.12.3
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Minor
>
> Parquet columnar encryption requires an explicit full path for each column to 
> be encrypted. If a partial path is configured, the thrown exception is not 
> informative enough and doesn't help much in correcting the parameters.
> The goal is to make the exception print something like:
> _Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: 
> Encrypted column [rider] not in file schema column list: [foo] , 
> [rider.list.element.foo] , [rider.list.element.bar] , [ts] , [uuid]_
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2224) Publish SBOM artifacts

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2224.
--
Resolution: Fixed

> Publish SBOM artifacts
> --
>
> Key: PARQUET-2224
> URL: https://issues.apache.org/jira/browse/PARQUET-2224
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705031#comment-17705031
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148481734


##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;
+   /** 
+ * When present there is expected to be one element corresponding to each 
repetition (i.e. size=max repetition_leve) 
+ * where each element represens the count of the number of times that 
level occurs in the page/column chunk.
+ */
+   8: optional list<i64> repetition_level_histogram;

Review Comment:
   I'm not sure I understand; I thought statistics can be on either a 
[page](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L530)
 or a [column 
chunk](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L747).
 The histograms would vary per column chunk, I think, but I might be 
misunderstanding your suggestion.





> [Format] Add statistics that reflect decoded size to metadata
> -
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: Initial proposal for unencoded/uncompressed statistics

2023-03-25 Thread via GitHub


emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148481734


##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;
+   /** 
+ * When present there is expected to be one element corresponding to each 
repetition (i.e. size=max repetition_leve) 
+ * where each element represens the count of the number of times that 
level occurs in the page/column chunk.
+ */
+   8: optional list<i64> repetition_level_histogram;

Review Comment:
   I'm not sure I understand; I thought statistics can be on either a 
[page](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L530)
 or a [column 
chunk](https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L747).
 The histograms would vary per column chunk, I think, but I might be 
misunderstanding your suggestion.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705030#comment-17705030
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148481707


##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;

Review Comment:
   Yes, I meant @wgtmac's interpretation. I'm open to either approach. IIUC the 
suggestion here is to change the name to something like:
   ```
   /** Optionally set.  But only  set for byte array columns to help 
applications determine total unencoded/uncompressed size of the page.
  * This is equivalent to PlainEncoding(values) - (num_values_encoded * 4) 
(i.e. it doesn't include the size
  * needed to record the lengths of the bytes) nor does it include any size 
to account for nulls.
  */
   encoded_byte_array_data_bytes
   ```
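   For illustration (numbers invented): a page of 1,000 non-null BYTE_ARRAY 
values averaging 20 bytes each would take 1,000 × (4 + 20) = 24,000 bytes under 
PLAIN encoding, so this field would record 24,000 − 1,000 × 4 = 20,000 bytes, 
i.e. only the data bytes without the 4-byte length prefixes.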





> [Format] Add statistics that reflect decoded size to metadata
> -
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: Initial proposal for unencoded/uncompressed statistics

2023-03-25 Thread via GitHub


emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148481707


##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;

Review Comment:
   Yes, I meant @wgtmac's interpretation. I'm open to either approach. IIUC the 
suggestion here is to change the name to something like:
   ```
   /** Optionally set.  But only  set for byte array columns to help 
applications determine total unencoded/uncompressed size of the page.
  * This is equivalent to PlainEncoding(values) - (num_values_encoded * 4) 
(i.e. it doesn't include the size
  * needed to record the lengths of the bytes) nor does it include any size 
to account for nulls.
  */
   encoded_byte_array_data_bytes
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705028#comment-17705028
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148481707


##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;

Review Comment:
   I'm open to either approach. IIUC the suggestion here is to change the name 
to something like:
   ```
   /** Optionally set.  But only  set for byte array columns to help 
applications determine total unencoded/uncompressed size of the page.
  * This is equivalent to PlainEncoding(values) - (num_values_encoded * 4) 
(i.e. it doesn't include the size
  * needed to record the lengths of the bytes) nor does it include any size 
to account for nulls.
  */
   encoded_byte_array_data_bytes
   ```





> [Format] Add statistics that reflect decoded size to metadata
> -
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705029#comment-17705029
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148481734


##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;
+   /** 
+ * When present there is expected to be one element corresponding to each 
repetition (i.e. size=max repetition_leve) 
+ * where each element represens the count of the number of times that 
level occurs in the page/column chunk.
+ */
+   8: optional list<i64> repetition_level_histogram;

Review Comment:
   I'm not sure I understand; I thought statistics can be on either a page or a 
row group.





> [Format] Add statistics that reflect decoded size to metadata
> -
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: Initial proposal for unencoded/uncompressed statistics

2023-03-25 Thread via GitHub


emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148481734


##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;
+   /** 
+ * When present there is expected to be one element corresponding to each 
repetition (i.e. size=max repetition_leve) 
+ * where each element represens the count of the number of times that 
level occurs in the page/column chunk.
+ */
+   8: optional list<i64> repetition_level_histogram;

Review Comment:
   I'm not sure I understand; I thought statistics can be on either a page or a 
row group.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [parquet-format] emkornfield commented on a diff in pull request #197: PARQUET-2261: Initial proposal for unencoded/uncompressed statistics

2023-03-25 Thread via GitHub


emkornfield commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148481707


##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;

Review Comment:
   I'm open to either approach. IIUC the suggestion here is to change the name 
to something like:
   ```
   /** Optionally set.  But only  set for byte array columns to help 
applications determine total unencoded/uncompressed size of the page.
  * This is equivalent to PlainEncoding(values) - (num_values_encoded * 4) 
(i.e. it doesn't include the size
  * needed to record the lengths of the bytes) nor does it include any size 
to account for nulls.
  */
   encoded_byte_array_data_bytes
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Resolved] (PARQUET-2103) crypto exception in print toPrettyJSON

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2103.
--
Resolution: Fixed

> crypto exception in print toPrettyJSON
> --
>
> Key: PARQUET-2103
> URL: https://issues.apache.org/jira/browse/PARQUET-2103
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.0, 1.12.1, 1.12.2, 1.12.3
>Reporter: Gidon Gershinsky
>Assignee: Gidon Gershinsky
>Priority: Minor
>
> In debug mode, this code 
> {{if (LOG.isDebugEnabled()) {}}
> {{  LOG.debug(ParquetMetadata.toPrettyJSON(parquetMetadata));}}
> {{}}}
> called in 
> {{org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata()}}
>  
> _*in encrypted files with plaintext footer*_ 
> triggers an exception:
>  
> {{Caused by: org.apache.parquet.crypto.ParquetCryptoRuntimeException: [id]. 
> Null File Decryptor     }}
> {{    at 
> org.apache.parquet.hadoop.metadata.EncryptedColumnChunkMetaData.decryptIfNeeded(ColumnChunkMetaData.java:602)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at 
> org.apache.parquet.hadoop.metadata.ColumnChunkMetaData.getEncodingStats(ColumnChunkMetaData.java:353)
>  ~[parquet-hadoop-1.12.0jar:1.12.0]}}
> {{    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:?]}}
> {{    at 
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  ~[?:?]}}
> {{    at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:?]}}
> {{    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:689)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serializeContents(IndexedListSerializer.java:119)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:79)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.impl.IndexedListSerializer.serialize(IndexedListSerializer.java:18)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanPropertyWriter.serializeAsField(BeanPropertyWriter.java:728)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.std.BeanSerializerBase.serializeFields(BeanSerializerBase.java:755)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.BeanSerializer.serialize(BeanSerializer.java:178)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:480)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter$Prefetch.serialize(ObjectWriter.java:1516)
>  ~[parquet-jackson-1.12.0jar:1.12.0]}}
> {{    at 
> shaded.parquet.com.fasterxml.jackson.databind.ObjectWriter._writeValueAndClose(ObjectWriter.java:1217)
>  ~[p

[jira] [Resolved] (PARQUET-2159) Parquet bit-packing de/encode optimization

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2159.
--
Fix Version/s: (was: 1.13.0)
   Resolution: Fixed

> Parquet bit-packing de/encode optimization
> --
>
> Key: PARQUET-2159
> URL: https://issues.apache.org/jira/browse/PARQUET-2159
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Affects Versions: 1.13.0
>Reporter: Fang-Xie
>Assignee: Fang-Xie
>Priority: Major
> Attachments: image-2022-06-15-22-56-08-396.png, 
> image-2022-06-15-22-57-15-964.png, image-2022-06-15-22-58-01-442.png, 
> image-2022-06-15-22-58-40-704.png
>
>
> Spark currently uses parquet-mr as its parquet reader/writer library, but the 
> built-in bit-packing encode/decode is not efficient enough. 
> Our optimization of Parquet bit-packing encode/decode with jdk.incubator.vector 
> in OpenJDK 18 brings a prominent performance improvement.
> Because the Vector API has been part of OpenJDK since version 16, this 
> optimization requires JDK 16 or higher.
> *Below are our test results*
> The functional test is based on the open-source parquet-mr bit-pack decoding 
> function *_public final void unpack8Values(final byte[] in, final int inPos, 
> final int[] out, final int outPos)_*
> compared with our Vector API implementation *_public final void 
> unpack8Values_vec(final byte[] in, final int inPos, final int[] out, final 
> int outPos)_*.
> We tested 10 pairs (open-source parquet bit unpacking vs. our optimized 
> vectorized SIMD implementation) of decode functions with bit 
> width = {1,2,3,4,5,6,7,8,9,10}; below are the test results:
> !image-2022-06-15-22-56-08-396.png|width=437,height=223!
> We integrated our bit-packing decode implementation into parquet-mr and 
> tested the parquet batch reader through Spark's VectorizedParquetRecordReader, 
> which reads parquet column data in batches. We constructed parquet files 
> with different row and column counts; the column data type is Int32 and the 
> maximum int value is 127, which satisfies bit-pack encoding with bit width = 7. 
> The row count ranges from 10k to 100 million and the column count 
> from 1 to 4.
> !image-2022-06-15-22-57-15-964.png|width=453,height=229!
> !image-2022-06-15-22-58-01-442.png|width=439,height=217!
> !image-2022-06-15-22-58-40-704.png|width=415,height=208!
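> For readers unfamiliar with the scalar baseline being replaced, here is a 
> simplified, illustrative version of what an unpack8Values-style routine does 
> (not the parquet-mr implementation, and ignoring its per-bit-width 
> specializations): it extracts 8 consecutive values of `bitWidth` bits each, 
> packed LSB-first, into ints. The vectorized variant performs the same 
> extraction with SIMD instructions via jdk.incubator.vector.
> {code:java}
> static void unpack8Values(byte[] in, int inPos, int[] out, int outPos, int bitWidth) {
>   long buffer = 0;          // bits read from the input but not yet consumed
>   int bitsInBuffer = 0;
>   int byteIndex = inPos;
>   long mask = (1L << bitWidth) - 1;
>   for (int i = 0; i < 8; i++) {
>     while (bitsInBuffer < bitWidth) {     // refill from the packed input
>       buffer |= (long) (in[byteIndex++] & 0xFF) << bitsInBuffer;
>       bitsInBuffer += 8;
>     }
>     out[outPos + i] = (int) (buffer & mask);
>     buffer >>>= bitWidth;
>     bitsInBuffer -= bitWidth;
>   }
> }
> {code}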



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2252) Make some methods public to allow external projects to implement page skipping

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu updated PARQUET-2252:
-
Issue Type: Improvement  (was: New Feature)

> Make some methods public to allow external projects to implement page skipping
> --
>
> Key: PARQUET-2252
> URL: https://issues.apache.org/jira/browse/PARQUET-2252
> Project: Parquet
>  Issue Type: Improvement
>Reporter: Yujiang Zhong
>Assignee: Yujiang Zhong
>Priority: Major
>
> Iceberg hopes to implement the column index filter based on Iceberg's own 
> expressions. We would like to be able to use some of the methods in the 
> Parquet repo, for example methods in `RowRanges` and `IndexIterator`; however, 
> these are currently not public, so we can only rely on reflection to use 
> them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2164) CapacityByteArrayOutputStream overflow while writing causes negative row group sizes to be written

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2164.
--
Fix Version/s: (was: 1.12.3)
   Resolution: Fixed

> CapacityByteArrayOutputStream overflow while writing causes negative row 
> group sizes to be written
> --
>
> Key: PARQUET-2164
> URL: https://issues.apache.org/jira/browse/PARQUET-2164
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.2
>Reporter: Parth Chandra
>Priority: Major
> Attachments: TestLargeDictionaryWriteParquet.java
>
>
> It is possible, while writing a parquet file, to cause 
> {{CapacityByteArrayOutputStream}} to overflow.
> This is an extreme case but it has been observed in a real world data set.
> The attached Spark program manages to reproduce the issue.
> Short summary of how this happens - 
> 1. After many small records possibly including nulls, the dictionary page 
> fills up and subsequent pages are written using plain encoding
> 2. The estimate of when to perform the page size check is based on the number 
> of values observed per page so far. Let's say this is about 100K
> 3. A sequence of very large records shows up. Let's say each of these record 
> is 200K. 
> 4. After 11K of these records the size of the page has gone up beyond 2GB.
> 5. {{CapacityByteArrayOutputStream}} is capable of holding more than 2GB of 
> data, but it holds the size of the data in an int, which overflows.
> There are a couple of things to fix here -
> 1. The check for page size should check both the number of values added and 
> the buffered size of the data
> 2. {{CapacityByteArrayOutputStream}} should throw an exception if the data 
> size increases beyond 2GB ({{java.io.ByteArrayOutputStream}} does exactly 
> that).
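> As a sketch of the second fix (names assumed, not the actual patch), the size 
> accounting can fail fast instead of silently wrapping around:
> {code:java}
> private void addToBytesUsed(int additionalBytes) {
>   if (additionalBytes > Integer.MAX_VALUE - bytesUsed) {
>     throw new IllegalStateException(
>         "Buffered data would exceed 2GB: " + bytesUsed + " + " + additionalBytes);
>   }
>   bytesUsed += additionalBytes;
> }
> {code}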



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (PARQUET-2202) Redundant String allocation on the hot path in CapacityByteArrayOutputStream.setByte

2023-03-25 Thread Gang Wu (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gang Wu resolved PARQUET-2202.
--
Resolution: Fixed

> Redundant String allocation on the hot path in 
> CapacityByteArrayOutputStream.setByte
> 
>
> Key: PARQUET-2202
> URL: https://issues.apache.org/jira/browse/PARQUET-2202
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-mr
>Affects Versions: 1.12.3
>Reporter: Andrei Pangin
>Priority: Major
>  Labels: performance
> Attachments: profile-alloc.png, profile-cpu.png
>
>
> Profiling of a Spark application revealed a performance issue in production:
> {{CapacityByteArrayOutputStream.setByte}} consumed 2.2% of total CPU time and 
> made up 4.6% of total allocations. However, in the normal case, this method 
> should allocate nothing at all.
> Here is an excerpt from async-profiler report.
> CPU profile:
> !profile-cpu.png|width=560!
> Allocation profile:
> !profile-alloc.png|width=560!
> The reason is a {{checkArgument()}} call with an unconditionally constructed 
> dynamic String:
> [https://github.com/apache/parquet-mr/blob/62b774cd0f0c60cfbe540bbfa60bee15929af5d4/parquet-common/src/main/java/org/apache/parquet/bytes/CapacityByteArrayOutputStream.java#L303]
> The suggested fix is to move String construction under the condition:
> {code:java}
> if (index >= bytesUsed) {
>   throw new IllegalArgumentException("Index: " + index +
>   " is >= the current size of: " + bytesUsed);
> }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705027#comment-17705027
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148479992


##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;

Review Comment:
   Can we only count the variable-length bytes in "Plain" here? Other info can 
be deduced from Type.





> [Format] Add statistics that reflect decoded size to metadata
> -
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: Initial proposal for unencoded/uncompressed statistics

2023-03-25 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148479992


##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;

Review Comment:
   Can we only count the variable-length bytes in "Plain" here? Other info can 
be deduced from Type.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705026#comment-17705026
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

wgtmac commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148479555


##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;

Review Comment:
   IIUC, it means the total number of bytes as if the data were plain-encoded.



##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;
+   /** 
+ * When present there is expected to be one element corresponding to each 
repetition (i.e. size=max repetition_leve) 

Review Comment:
   ```suggestion
* When present there is expected to be one element corresponding to 
each repetition (i.e. size=max repetition_level) 
   ```





> [Format] Add statistics that reflect decoded size to metadata
> -
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-format] wgtmac commented on a diff in pull request #197: PARQUET-2261: Initial proposal for unencoded/uncompressed statistics

2023-03-25 Thread via GitHub


wgtmac commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148479555


##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;

Review Comment:
   IIUC, it means the total number of bytes as if the data were plain-encoded.



##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;
+   /** 
+ * When present there is expected to be one element corresponding to each 
repetition (i.e. size=max repetition_leve) 

Review Comment:
   ```suggestion
* When present there is expected to be one element corresponding to 
each repetition (i.e. size=max repetition_level) 
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17705025#comment-17705025
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-

mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148478782


##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;

Review Comment:
   Seems that only Plain encoding will be accounted for here? Would non-plain 
encodings have the same statistic here (like `non-null-size * type->size()`)?



##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;
+   /** 
+ * When present there is expected to be one element corresponding to each 
repetition (i.e. size=max repetition_leve) 
+ * where each element represens the count of the number of times that 
level occurs in the page/column chunk.
+ */
+   8: optional list repetition_level_histogram;

Review Comment:
   Seems it's a per-row-group statistic?





> [Format] Add statistics that reflect decoded size to metadata
> -
>
> Key: PARQUET-2261
> URL: https://issues.apache.org/jira/browse/PARQUET-2261
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Micah Kornfield
>Assignee: Micah Kornfield
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-format] mapleFU commented on a diff in pull request #197: PARQUET-2261: Initial proposal for unencoded/uncompressed statistics

2023-03-25 Thread via GitHub


mapleFU commented on code in PR #197:
URL: https://github.com/apache/parquet-format/pull/197#discussion_r1148478782


##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;

Review Comment:
   Seems that only Plain Encoding will be accounted for here? Would non-plain 
encodings have the same statistics here (like `non-null-size * type->size()`)?



##
src/main/thrift/parquet.thrift:
##
@@ -223,6 +223,17 @@ struct Statistics {
 */
5: optional binary max_value;
6: optional binary min_value;
+   /** The number of bytes the row/group or page would take if encoded with 
plain-encoding */
+   7: optional i64 plain_encoded_bytes;
+   /** 
+ * When present there is expected to be one element corresponding to each 
repetition (i.e. size=max repetition_leve) 
+ * where each element represens the count of the number of times that 
level occurs in the page/column chunk.
+ */
+   8: optional list repetition_level_histogram;

Review Comment:
   Seems it's a per-row-group statistic?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread Micah Kornfield
I put together a draft PR:
https://github.com/apache/parquet-format/pull/197/files

Thinking about the nulls and nesting level a bit more, I think keeping a
histogram of repetition and definition levels probably strikes the right
balance between simplicity and accuracy, but it would be great to hear if there
are other approaches.
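
To make the intent concrete, here is a minimal sketch (mine, not from the
draft PR) of how a reader could consume such a histogram; it assumes
histogram[i] is the number of times repetition level i occurs in the
page/column chunk, and the names are illustrative only:

```java
// Minimal sketch, not from the PR: interpreting a repetition-level histogram.
// Assumption: histogram[i] == number of values whose repetition level is i.
final class RepetitionHistogramEstimates {

  /** Total number of level entries (non-null values plus nulls) in the page/chunk. */
  static long totalLeafSlots(long[] histogram) {
    long total = 0;
    for (long count : histogram) {
      total += count;
    }
    return total;
  }

  /** Repetition level 0 marks the start of a new record. */
  static long recordsStarted(long[] histogram) {
    return histogram.length > 0 ? histogram[0] : 0;
  }

  /**
   * Every value whose repetition level is below the maximum starts a new
   * innermost list, so summing all but the last bucket gives a rough list
   * count; totalLeafSlots / innermostListsStarted is an average list length.
   */
  static long innermostListsStarted(long[] histogram) {
    long starts = 0;
    for (int level = 0; level + 1 < histogram.length; level++) {
      starts += histogram[level];
    }
    return starts;
  }
}
```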

On Sat, Mar 25, 2023 at 5:37 PM Micah Kornfield 
wrote:

> 2.  For repeated values, I think it is sufficient to get a reasonable
>> estimate to know the number of start arrays (this includes nested arrays)
>> contained in a page/column chunk and we can add a new field to record this
>> separately.
>
>
> Apologies for replying to myself but one more thought, I don't think any
> set of simple statistics is going to be able to give accurate memory
> estimates for combinations of repeated and nested fields and the intent of
> my proposal is more to get meaningful estimates rather than exact
> measures.  A more exact approach for deeply nested and repeated fields
> could be a repeated struct of {rep-level, number of array starts,  total
> number of elements contained in arrays at this level}.  This would still be
> reasonably compact (1 per rep-level) but seems too complex to me.  If
> people think the complexity isn't there or provides meaningful value, it
> seems like a plausible approach; there might be others.
>
> On Sat, Mar 25, 2023 at 4:59 PM Micah Kornfield 
> wrote:
>
>> 1. How primitive types are computed? Should we simply compute the raw size
>>> by assuming the data is plain-encoded?
>>> For example, does INT16 use the same bit-width as INT32?
>>> What about BYTE_ARRAY/FIXED_SIZE_BYTE_ARRAY? Should we add an extra
>>> sizeof(int32) for its length?
>>
>> Yes, my suggestion is the raw size assuming it is plain-encoded.  INT16 has
>> the same size as int32.  In general, for fixed-width types it is easy to back
>> out the actual byte size in memory given the number of values stored and the
>> number of null values.  For Byte Array this means we store 4 bytes for every
>> non-null value.  For FIXED_SIZE_BYTE_ARRAY plain encoding would have 4 bytes
>> for every value, so my suggestion is yes, we add in the size overhead.
>> Again, the size overhead can be backed out given the number of nulls and the
>> number of values.  Given this for
>>
>>
>>> 2. How do we take care of null values? Should we add the size of validity
>>> bitmap or null buffer to the raw size?
>>
>> No, I think this can be inferred from metadata and consumers can
>> calculate the space they think this would take in their memory
>> representation.   Open to thoughts here but it seems standardizing on plain
>> encoding is the easiest for people to understand and adapt the estimate to
>> what they actually care about, but I can see the other side of simplifying
>> the computation for systems consuming parquet, so happy to go either way.
>>
>>
>>> 3. What about complex types?
>>> Actually only leaf columns have data in the Parquet file. Should we
>>> use
>>> the sum of all sub columns to be the raw size of a nested column?
>>
>>
>> My preference would keep this on leaf columns.  This leads to two
>> complications:
>> 1.  Accurately estimating the cost of group values (structs).  This should be
>> reverse engineerable if the number of records in the page/column chunk is
>> known (i.e. size estimates do not account for group values, and the reader
>> would calculate based on the number of rows).  This might be a good reason
>> to try to get data page v2 into shape or back-port the number of started
>> records into a data page v1.
>>
>> 2.  For repeated values, I think it is sufficient to get a reasonable
>> estimate to know the number of start arrays (this includes nested arrays)
>> contained in a page/column chunk and we can add a new field to record this
>> separately.
>>
>> Thoughts?
>>
>> 4. Where to store these raw sizes?
>>
>>> Add it to the PageHeader? Or should we aggregate it in the
>>> ColumnChunkMetaData?
>>
>> I would suggest adding it to both (IIUC we store uncompressed size and
>> other values in both as well).
>>
>> Thanks,
>> Micah
>>
>> On Sat, Mar 25, 2023 at 7:23 AM Gang Wu  wrote:
>>
>>> +1 for adding the raw size of each column into the Parquet specs.
>>>
>>> I used to work around these by adding similar but hidden fields to the
>>> file
>>> formats.
>>>
>>> Let me bring some detailed questions to the table.
>>> 1. How primitive types are computed? Should we simply compute the raw
>>> size
>>> by assuming the data is plain-encoded?
>>> For example, does INT16 use the same bit-width as INT32?
>>> What about BYTE_ARRAY/FIXED_SIZE_BYTE_ARRAY? Should we add an extra
>>> sizeof(int32) for its length?
>>> 2. How do we take care of null values? Should we add the size of validity
>>> bitmap or null buffer to the raw size?
>>> 3. What about complex types?
>>> Actually only leaf columns have data in the Parquet file. Should we
>>> use
>>> the sum of all sub columns to be the raw size of a nested column?
>>> 4. Where to stor

[jira] [Created] (PARQUET-2261) [Format] Add statistics that reflect decoded size to metadata

2023-03-25 Thread Micah Kornfield (Jira)
Micah Kornfield created PARQUET-2261:


 Summary: [Format] Add statistics that reflect decoded size to 
metadata
 Key: PARQUET-2261
 URL: https://issues.apache.org/jira/browse/PARQUET-2261
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-format
Reporter: Micah Kornfield
Assignee: Micah Kornfield






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread Micah Kornfield
>
> 2.  For repeated values, I think it is sufficient to get a reasonable
> estimate to know the number of start arrays (this includes nested arrays)
> contained in a page/column chunk and we can add a new field to record this
> separately.


Apologies for replying to myself but one more thought, I don't think any
set of simple statistics is going to be able to give accurate memory
estimates for combinations of repeated and nested fields and the intent of
my proposal is more to get meaningful estimates rather than exact
measures.  A more exact approach for deeply nested and repeated fields
could be a repeated struct of {rep-level, number of array starts,  total
number of elements contained in arrays at this level}.  This would still be
reasonably compact (1 per rep-level) but seems too complex to me.  If
people think the complexity isn't there or provides meaningful value, it
seems like a plausible approach; there might be others.
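
Purely to illustrate the shape of that alternative (the names are mine and
nothing like this is in the PR), it would amount to one entry per repetition
level:

```java
// Illustrative only: the "more exact" per-repetition-level alternative described above.
record RepeatedFieldSizeStats(
    int repetitionLevel,       // the nesting level this entry describes
    long numberOfArrayStarts,  // how many arrays start at this level
    long totalElements) {      // total elements contained in arrays at this level

  /** Average array length at this level, which is what a memory estimator would use. */
  double averageArrayLength() {
    return numberOfArrayStarts == 0 ? 0.0 : (double) totalElements / numberOfArrayStarts;
  }
}
```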

On Sat, Mar 25, 2023 at 4:59 PM Micah Kornfield 
wrote:

> 1. How primitive types are computed? Should we simply compute the raw size
>> by assuming the data is plain-encoded?
>> For example, does INT16 use the same bit-width as INT32?
>> What about BYTE_ARRAY/FIXED_SIZE_BYTE_ARRAY? Should we add an extra
>> sizeof(int32) for its length?
>
> Yes, my suggestion is the raw size assuming it is plain-encoded.  INT16 has
> the same size as int32.  In general, for fixed-width types it is easy to back
> out the actual byte size in memory given the number of values stored and the
> number of null values.  For Byte Array this means we store 4 bytes for every
> non-null value.  For FIXED_SIZE_BYTE_ARRAY plain encoding would have 4 bytes
> for every value, so my suggestion is yes, we add in the size overhead.
> Again, the size overhead can be backed out given the number of nulls and the
> number of values.  Given this for
>
>
>> 2. How do we take care of null values? Should we add the size of validity
>> bitmap or null buffer to the raw size?
>
> No, I think this can be inferred from metadata and consumers can calculate
> the space they think this would take in their memory representation.   Open
> to thoughts here but it seems standardizing on plain encoding is the
> easiest for people to understand and adapt the estimate to what they
> actually care about, but I can see the other side of simplifying the
> computation for systems consuming parquet, so happy to go either way.
>
>
>> 3. What about complex types?
>> Actually only leaf columns have data in the Parquet file. Should we
>> use
>> the sum of all sub columns to be the raw size of a nested column?
>
>
> My preference would keep this on leaf columns.  This leads to two
> complications:
> 1.  Accurately estimating the cost of group values (structs).  This should be
> reverse engineerable if the number of records in the page/column chunk is
> known (i.e. size estimates do not account for group values, and the reader
> would calculate based on the number of rows).  This might be a good reason
> to try to get data page v2 into shape or back-port the number of started
> records into a data page v1.
>
> 2.  For repeated values, I think it is sufficient to get a reasonable
> estimate to know the number of start arrays (this includes nested arrays)
> contained in a page/column chunk and we can add a new field to record this
> separately.
>
> Thoughts?
>
> 4. Where to store these raw sizes?
>
>> Add it to the PageHeader? Or should we aggregate it in the
>> ColumnChunkMetaData?
>
> I would suggest adding it to both (IIUC we store uncompressed size and
> other values in both as well).
>
> Thanks,
> Micah
>
> On Sat, Mar 25, 2023 at 7:23 AM Gang Wu  wrote:
>
>> +1 for adding the raw size of each column into the Parquet specs.
>>
>> I used to work around these by adding similar but hidden fields to the
>> file
>> formats.
>>
>> Let me bring some detailed questions to the table.
>> 1. How primitive types are computed? Should we simply compute the raw size
>> by assuming the data is plain-encoded?
>> For example, does INT16 use the same bit-width as INT32?
>> What about BYTE_ARRAY/FIXED_SIZE_BYTE_ARRAY? Should we add an extra
>> sizeof(int32) for its length?
>> 2. How do we take care of null values? Should we add the size of validity
>> bitmap or null buffer to the raw size?
>> 3. What about complex types?
>> Actually only leaf columns have data in the Parquet file. Should we
>> use
>> the sum of all sub columns to be the raw size of a nested column?
>> 4. Where to store these raw sizes?
>> Add it to the PageHeader? Or should we aggregate it in the
>> ColumnChunkMetaData?
>>
>> Best,
>> Gang
>>
>> On Sat, Mar 25, 2023 at 12:59 AM Will Jones 
>> wrote:
>>
>> > Hi Micah,
>> >
>> > We were just discussing in the Arrow repo how useful it would be to have
>> > utilities that could accurately estimate the deserialized size of a
>> Parquet
>> > file. [1] So I would be very supportive of this.
>> >
>> > IIUC the implementation of this should be trivial for many fi

Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread Micah Kornfield
>
> 1. How primitive types are computed? Should we simply compute the raw size
> by assuming the data is plain-encoded?
> For example, does INT16 use the same bit-width as INT32?
> What about BYTE_ARRAY/FIXED_SIZE_BYTE_ARRAY? Should we add an extra
> sizeof(int32) for its length?

Yes, my suggestion is the raw size assuming it is plain-encoded.  INT16 has
the same size as int32.  In general, for fixed-width types it is easy to back
out the actual byte size in memory given the number of values stored and the
number of null values.  For Byte Array this means we store 4 bytes for every
non-null value.  For FIXED_SIZE_BYTE_ARRAY plain encoding would have 4 bytes
for every value, so my suggestion is yes, we add in the size overhead.
Again, the size overhead can be backed out given the number of nulls and the
number of values.  Given this for
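
To spell those rules out as arithmetic, a rough sketch (my own, not taken from
any implementation; the exact treatment of FIXED_LEN_BYTE_ARRAY is still being
discussed in this thread):

```java
// Rough sketch of the plain-encoded size rules discussed above; not from parquet-mr.
final class PlainEncodedSizeEstimate {

  /** Fixed-width primitives: INT32/FLOAT take 4 bytes, INT64/DOUBLE take 8; INT16 is stored as INT32. */
  static long fixedWidth(long nonNullValues, int bytesPerValue) {
    return nonNullValues * bytesPerValue;
  }

  /** BYTE_ARRAY under PLAIN: a 4-byte length prefix per non-null value plus the raw bytes. */
  static long byteArray(long nonNullValues, long totalVariableBytes) {
    return nonNullValues * 4L + totalVariableBytes;
  }

  /** FIXED_LEN_BYTE_ARRAY: type_length bytes per non-null value (whether to also count a length prefix is debated above). */
  static long fixedLenByteArray(long nonNullValues, int typeLength) {
    return nonNullValues * (long) typeLength;
  }
}
```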


> 2. How do we take care of null values? Should we add the size of validity
> bitmap or null buffer to the raw size?

No, I think this can be inferred from metadata and consumers can calculate
the space they think this would take in their memory representation.   Open
to thoughts here but it seems standardizing on plain encoding is the
easiest for people to understand and adapt the estimate to what they
actually care about, but I can see the other side of simplifying the
computation for systems consuming parquet, so happy to go either way.
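
For example, a consumer that tracks nulls with an Arrow-style one-bit-per-slot
validity bitmap could add that cost back itself from metadata it already has
(a sketch under that assumption, not part of the proposal):

```java
// Sketch: add a validity-bitmap cost on top of the plain-encoded size.
// num_values and null_count are already available from page headers / statistics.
final class NullAwareSizeEstimate {
  static long estimateInMemoryBytes(long plainEncodedBytes, long numValues, long nullCount) {
    long validityBitmapBytes = nullCount > 0 ? (numValues + 7) / 8 : 0; // one bit per slot
    return plainEncodedBytes + validityBitmapBytes; // a rough estimate only
  }
}
```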


> 3. What about complex types?
> Actually only leaf columns have data in the Parquet file. Should we use
> the sum of all sub columns to be the raw size of a nested column?


My preference would keep this on leaf columns.  This leads to two
complications:
1.  Accurately estimating the cost of group values (structs).  This should be
reverse engineerable if the number of records in the page/column chunk is
known (i.e. size estimates do not account for group values, and the reader
would calculate based on the number of rows).  This might be a good reason
to try to get data page v2 into shape or back-port the number of started
records into a data page v1.

2.  For repeated values, I think it is sufficient to get a reasonable
estimate to know the number of start arrays (this includes nested arrays)
contained in a page/column chunk and we can add a new field to record this
separately.

Thoughts?

4. Where to store these raw sizes?

> Add it to the PageHeader? Or should we aggregate it in the
> ColumnChunkMetaData?

I would suggest adding it to both (IIUC we store uncompressed size and
other values in both as well).

Thanks,
Micah

On Sat, Mar 25, 2023 at 7:23 AM Gang Wu  wrote:

> +1 for adding the raw size of each column into the Parquet specs.
>
> I used to work around these by adding similar but hidden fields to the file
> formats.
>
> Let me bring some detailed questions to the table.
> 1. How primitive types are computed? Should we simply compute the raw size
> by assuming the data is plain-encoded?
> For example, does INT16 use the same bit-width as INT32?
> What about BYTE_ARRAY/FIXED_SIZE_BYTE_ARRAY? Should we add an extra
> sizeof(int32) for its length?
> 2. How do we take care of null values? Should we add the size of validity
> bitmap or null buffer to the raw size?
> 3. What about complex types?
> Actually only leaf columns have data in the Parquet file. Should we use
> the sum of all sub columns to be the raw size of a nested column?
> 4. Where to store these raw sizes?
> Add it to the PageHeader? Or should we aggregate it in the
> ColumnChunkMetaData?
>
> Best,
> Gang
>
> On Sat, Mar 25, 2023 at 12:59 AM Will Jones 
> wrote:
>
> > Hi Micah,
> >
> > We were just discussing in the Arrow repo how useful it would be to have
> > utilities that could accurately estimate the deserialized size of a
> Parquet
> > file. [1] So I would be very supportive of this.
> >
> > IIUC the implementation of this should be trivial for many fixed-size
> > types, although there may be cases that are more complex to track. I'd
> > definitely be interested to hear from folks who have worked on the
> > implementations for the other size fields what the level of difficulty is
> > to implement such a field.
> >
> > Best,
> >
> > Will Jones
> >
> >  [1] https://github.com/apache/arrow/issues/34712
> >
> > On Fri, Mar 24, 2023 at 9:27 AM Micah Kornfield 
> > wrote:
> >
> > > Parquet metadata currently tracks uncompressed and compressed
> page/column
> > > sizes [1][2].  Uncompressed size here corresponds to encoded size which
> > can
> > > differ substantially from the plain encoding size due to RLE/Dictionary
> > > encoding.
> > >
> > > When doing query planning/execution it can be useful to understand the
> > > total raw size of bytes (e.g. whether to do a broad-cast join).
> > >
> > > Would people be open to adding an optional field that records the
> > estimated
> > > (or exact) size of the column if plain encoding had been used?
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1]
> > >
> > >
> >
> https://github.com/apache/parquet-format/blob/master/src/main/thrif

Re: Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread Micah Kornfield
>
> 1. Null variables. In Arrow Array, null-value should occupy some place, but
> field-raw size cannot represent that value.

This is a good point.  The number of nulls can be inferred from statistics
or is included in data-page v2 [1].  I'd rather not bake in assumptions
about size of nulls as different systems can represent them differently and
I would prefer to keep this memory-representation agnostic.  I'm open to
thoughts here.


> 2. Size of FLBA/ByteArray. Its size should be variable-size-summary or
> variable-size-summary + sizeof(ByteArray) * value-count

My suggestion here is the size of the plain-encoded values because the
encoding is already well defined in Parquet.  I think for FLBA this ends up
being equal to ["size of column" * number of non-null values].  For byte
array, the formula listed here is what plain encoding would work out to,
where sizeof(ByteArray) = 4.  I think this is preferable but let me know
if this doesn't cover your use-case.


> 3. Sometimes Arrow data is not equal to Parquet data, like Decimal stored
> as int32 or int64.
> Hope that helps.

Yes, my intent would be to keep this agnostic from other systems but I
think the information allows other systems to use the estimate reasonably
well or back out their own computation.  The size of the Decimal values in
Arrow can be determined by the precision and scale of the column, the chosen
Arrow decimal width, and the number of values.
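
As a small illustration (the width cut-off is my assumption, not an Arrow or
Parquet rule): once the consumer picks a decimal width from the declared
precision, the size falls out directly:

```java
// Sketch: estimated bytes for a decimal column once an Arrow decimal width is chosen.
final class DecimalSizeEstimate {
  static long arrowDecimalBytes(long numValues, int precision) {
    int bytesPerValue = precision <= 38 ? 16 : 32; // assume 128-bit decimals up to precision 38, 256-bit beyond
    return numValues * (long) bytesPerValue;
  }
}
```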

> Best, Xuwei Fu

[1]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L563

On Sat, Mar 25, 2023 at 9:36 AM wish maple  wrote:

> +1 For uncompressed size for the field. However, it's a bit tricky here.
> I've
> implemented a similar size hint in our system; here are some problems I met:
> 1. Null variables. In Arrow Array, null-value should occupy some place, but
> field-raw size cannot represent that value.
> 2. Size of FLBA/ByteArray. Its size should be variable-size-summary or
> variable-size-summary + sizeof(ByteArray) * value-count
> 3. Sometimes Arrow data is not equal to Parquet data, like Decimal stored
> as int32 or int64.
> Hope that helps.
>
> Best, Xuwei Fu
>
> On 2023/03/24 16:59:31 Will Jones wrote:
> > Hi Micah,
> >
> > We were just discussing in the Arrow repo how useful it would be to have
> > utilities that could accurately estimate the deserialized size of a
> Parquet
> > file. [1] So I would be very supportive of this.
> >
> > IIUC the implementation of this should be trivial for many fixed-size
> > types, although there may be cases that are more complex to track. I'd
> > definitely be interested to hear from folks who have worked on the
> > implementations for the other size fields what the level of difficulty is
> > to implement such a field.
> >
> > Best,
> >
> > Will Jones
> >
> >  [1] https://github.com/apache/arrow/issues/34712
> >
> > On Fri, Mar 24, 2023 at 9:27 AM Micah Kornfield 
> > wrote:
> >
> > > Parquet metadata currently tracks uncompressed and compressed
> page/column
> > > sizes [1][2].  Uncompressed size here corresponds to encoded size which
> can
> > > differ substantially from the plain encoding size due to RLE/Dictionary
> > > encoding.
> > >
> > > When doing query planning/execution it can be useful to understand the
> > > total raw size of bytes (e.g. whether to do a broad-cast join).
> > >
> > > Would people be open to adding an optional field that records the
> estimated
> > > (or exact) size of the column if plain encoding had been used?
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1]
> > >
> > >
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728
> > > [2]
> > >
> > >
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637
> > >
> >
>


[jira] [Commented] (PARQUET-2149) Implement async IO for Parquet file reader

2023-03-25 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-2149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704971#comment-17704971
 ] 

ASF GitHub Bot commented on PARQUET-2149:
-

parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1483892749

   > @parthchandra Do you have time to resolve the conflicts? I think it would 
be nice to be included in the next release.
   
   Done
   




> Implement async IO for Parquet file reader
> --
>
> Key: PARQUET-2149
> URL: https://issues.apache.org/jira/browse/PARQUET-2149
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-mr
>Reporter: Parth Chandra
>Priority: Major
>
> ParquetFileReader's implementation has the following flow (simplified) - 
>       - For every column -> Read from storage in 8MB blocks -> Read all 
> uncompressed pages into output queue 
>       - From output queues -> (downstream ) decompression + decoding
> This flow is serialized, which means that downstream threads are blocked 
> until the data has been read. Because a large part of the time spent is 
> waiting for data from storage, threads are idle and CPU utilization is really 
> low.
> There is no reason why this cannot be made asynchronous _and_ parallel. So 
> For Column _i_ -> reading one chunk until end, from storage -> intermediate 
> output queue -> read one uncompressed page until end -> output queue -> 
> (downstream ) decompression + decoding
> Note that this can be made completely self contained in ParquetFileReader and 
> downstream implementations like Iceberg and Spark will automatically be able 
> to take advantage without code change as long as the ParquetFileReader apis 
> are not changed. 
> In past work with async io  [Drill - async page reader 
> |https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/columnreaders/AsyncPageReader.java]
>  , I have seen 2x-3x improvement in reading speed for Parquet files.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[GitHub] [parquet-mr] parthchandra commented on pull request #968: PARQUET-2149: Async IO implementation for ParquetFileReader

2023-03-25 Thread via GitHub


parthchandra commented on PR #968:
URL: https://github.com/apache/parquet-mr/pull/968#issuecomment-1483892749

   > @parthchandra Do you have time to resolve the conflicts? I think it would 
be nice to be included in the next release.
   
   Done
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



RE: Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread wish maple
+1 For uncompressed size for the field. However, it's a bit tricky here.
I've
implemented a similar size hint in our system; here are some problems I met:
1. Null variables. In Arrow Array, null-value should occupy some place, but
field-raw size cannot represent that value.
2. Size of FLBA/ByteArray. Its size should be variable-size-summary or
variable-size-summary + sizeof(ByteArray) * value-count
3. Sometimes Arrow data is not equal to Parquet data, like Decimal stored
as int32 or int64.
Hope that helps.

Best, Xuwei Fu

On 2023/03/24 16:59:31 Will Jones wrote:
> Hi Micah,
>
> We were just discussing in the Arrow repo how useful it would be to have
> utilities that could accurately estimate the deserialized size of a
Parquet
> file. [1] So I would be very supportive of this.
>
> IIUC the implementation of this should be trivial for many fixed-size
> types, although there may be cases that are more complex to track. I'd
> definitely be interested to hear from folks who have worked on the
> implementations for the other size fields what the level of difficulty is
> to implement such a field.
>
> Best,
>
> Will Jones
>
>  [1] https://github.com/apache/arrow/issues/34712
>
> On Fri, Mar 24, 2023 at 9:27 AM Micah Kornfield 
> wrote:
>
> > Parquet metadata currently tracks uncompressed and compressed
page/column
> > sizes [1][2].  Uncompressed size here corresponds to encoded size which
can
> > differ substantially from the plain encoding size due to RLE/Dictionary
> > encoding.
> >
> > When doing query planning/execution it can be useful to understand the
> > total raw size of bytes (e.g. whether to do a broad-cast join).
> >
> > Would people be open to adding an optional field that records the
estimated
> > (or exact) size of the column if plain encoding had been used?
> >
> > Thanks,
> > Micah
> >
> > [1]
> >
> >
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728
> > [2]
> >
> >
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637
> >
>


Re: [DISCUSS] Add a Plain Encoding Size Bytes to Parquet Metadata

2023-03-25 Thread Gang Wu
+1 for adding the raw size of each column into the Parquet specs.

I used to work around these by adding similar but hidden fields to the file
formats.

Let me bring some detailed questions to the table.
1. How primitive types are computed? Should we simply compute the raw size
by assuming the data is plain-encoded?
For example, does INT16 use the same bit-width as INT32?
What about BYTE_ARRAY/FIXED_SIZE_BYTE_ARRAY? Should we add an extra
sizeof(int32) for its length?
2. How do we take care of null values? Should we add the size of validity
bitmap or null buffer to the raw size?
3. What about complex types?
Actually only leaf columns have data in the Parquet file. Should we use
the sum of all sub columns to be the raw size of a nested column?
4. Where to store these raw sizes?
Add it to the PageHeader? Or should we aggregate it in the
ColumnChunkMetaData?

Best,
Gang

On Sat, Mar 25, 2023 at 12:59 AM Will Jones  wrote:

> Hi Micah,
>
> We were just discussing in the Arrow repo how useful it would be to have
> utilities that could accurately estimate the deserialized size of a Parquet
> file. [1] So I would be very supportive of this.
>
> IIUC the implementation of this should be trivial for many fixed-size
> types, although there may be cases that are more complex to track. I'd
> definitely be interested to hear from folks who have worked on the
> implementations for the other size fields what the level of difficulty is
> to implement such a field.
>
> Best,
>
> Will Jones
>
>  [1] https://github.com/apache/arrow/issues/34712
>
> On Fri, Mar 24, 2023 at 9:27 AM Micah Kornfield 
> wrote:
>
> > Parquet metadata currently tracks uncompressed and compressed page/column
> > sizes [1][2].  Uncompressed size here corresponds to encoded size which
> can
> > differ substantially from the plain encoding size due to RLE/Dictionary
> > encoding.
> >
> > When doing query planning/execution it can be useful to understand the
> > total raw size of bytes (e.g. whether to do a broad-cast join).
> >
> > Would people be open to adding an optional field that records the
> estimated
> > (or exact) size of the column if plain encoding had been used?
> >
> > Thanks,
> > Micah
> >
> > [1]
> >
> >
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L728
> > [2]
> >
> >
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L637
> >
>