[jira] [Updated] (PARQUET-2232) Add an api to ColumnChunkMetaData to indicate if the column chunk uses a bloom filter
[ https://issues.apache.org/jira/browse/PARQUET-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jinpeng Zhou updated PARQUET-2232:
--
Description: Although bloom filter is not fully supported in parquet-cpp for now, it can be useful to provide an api that tells if a column chunk is using bloom filters. This would lead to better understanding of file characteristics. (was: Although bloom filter is not fully supported in parquet-cpp, it can be useful to provide an api that tells if a column chunk is using the bloom filter)
Summary: Add an api to ColumnChunkMetaData to indicate if the column chunk uses a bloom filter (was: It can be useful to provide an api that tells if a column chunk is using the bloom filter)

> Add an api to ColumnChunkMetaData to indicate if the column chunk uses a
> bloom filter
> --
>
> Key: PARQUET-2232
> URL: https://issues.apache.org/jira/browse/PARQUET-2232
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Jinpeng Zhou
> Assignee: Jinpeng Zhou
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Although bloom filter is not fully supported in parquet-cpp for now, it can
> be useful to provide an api that tells if a column chunk is using bloom
> filters. This would lead to better understanding of file characteristics.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
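To make the request above concrete, here is a minimal sketch of the kind of check being asked for. It is not the parquet-cpp API. It assumes only that the Parquet Thrift `ColumnMetaData` struct carries an optional `bloom_filter_offset` field, so that the presence of an offset can stand in for "this column chunk has a bloom filter". The class `ColumnChunkMetaDataSketch`, the method `has_bloom_filter`, and the helper `columns_with_bloom_filter` are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ColumnChunkMetaDataSketch:
    """Hypothetical stand-in for column chunk metadata (not the parquet-cpp class)."""

    path_in_schema: str
    # Assumption: mirrors the optional bloom_filter_offset field of the Thrift
    # ColumnMetaData struct; None means no bloom filter offset was recorded.
    bloom_filter_offset: Optional[int] = None

    def has_bloom_filter(self) -> bool:
        # Test for presence, not truthiness, so an offset of 0 still counts.
        return self.bloom_filter_offset is not None


def columns_with_bloom_filter(chunks: List[ColumnChunkMetaDataSketch]) -> List[str]:
    """Return the column paths whose chunks record a bloom filter offset."""
    return [c.path_in_schema for c in chunks if c.has_bloom_filter()]


chunks = [
    ColumnChunkMetaDataSketch("id", bloom_filter_offset=4096),
    ColumnChunkMetaDataSketch("payload"),
]
print(columns_with_bloom_filter(chunks))
```

A check of this shape only reports that a filter exists. It does not read or evaluate the filter, which matches the issue's note that bloom filters are not fully supported in parquet-cpp.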
[jira] [Updated] (PARQUET-2232) It can be useful to provide an api that tells if a column chunk is using the bloom filter
[ https://issues.apache.org/jira/browse/PARQUET-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated PARQUET-2232:
Labels: pull-request-available (was: )

> It can be useful to provide an api that tells if a column chunk is using the
> bloom filter
> -
>
> Key: PARQUET-2232
> URL: https://issues.apache.org/jira/browse/PARQUET-2232
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Jinpeng Zhou
> Assignee: Jinpeng Zhou
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Although bloom filter is not fully supported in parquet-cpp, it can be useful
> to provide an api that tells if a column chunk is using the bloom filter
[jira] [Updated] (PARQUET-2232) It can be useful to provide an api that tells if a column chunk is using the bloom filter
[ https://issues.apache.org/jira/browse/PARQUET-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jinpeng Zhou updated PARQUET-2232:
--
Description: Although bloom filter is not fully supported in parquet-cpp, it can be useful to provide an api that tells if a column chunk is using the bloom filter
Summary: It can be useful to provide an api that tells if a column chunk is using the bloom filter (was: Although bloom filter is not fully supported in parquet-cpp, it can be useful to provide an api that tells if a column chunk is using the bloom filter)

> It can be useful to provide an api that tells if a column chunk is using the
> bloom filter
> -
>
> Key: PARQUET-2232
> URL: https://issues.apache.org/jira/browse/PARQUET-2232
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Jinpeng Zhou
> Assignee: Jinpeng Zhou
> Priority: Minor
>
> Although bloom filter is not fully supported in parquet-cpp, it can be useful
> to provide an api that tells if a column chunk is using the bloom filter
[jira] [Created] (PARQUET-2232) Although bloom filter is not fully supported in parquet-cpp, it can be useful to provide an api that tells if a column chunk is using the bloom filter
Jinpeng Zhou created PARQUET-2232:
-
Summary: Although bloom filter is not fully supported in parquet-cpp, it can be useful to provide an api that tells if a column chunk is using the bloom filter
Key: PARQUET-2232
URL: https://issues.apache.org/jira/browse/PARQUET-2232
Project: Parquet
Issue Type: Improvement
Components: parquet-cpp
Reporter: Jinpeng Zhou
Assignee: Jinpeng Zhou
[jira] [Comment Edited] (PARQUET-1622) Add BYTE_STREAM_SPLIT encoding
[ https://issues.apache.org/jira/browse/PARQUET-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678056#comment-17678056 ]

Gang Wu edited comment on PARQUET-1622 at 1/18/23 3:05 AM:
---
The issue raised by [~mwish] above may also exist in the parquet-mr. cc [~xinlishang] [~gershinsky]

was (Author: wgtmac):
The issue raised by [~mwish] above may also exist in the parquet-mr. cc [~xinlishang]] [~gershinsky]

> Add BYTE_STREAM_SPLIT encoding
> --
>
> Key: PARQUET-1622
> URL: https://issues.apache.org/jira/browse/PARQUET-1622
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp, parquet-format, parquet-mr, parquet-thrift
> Reporter: Martin Radev
> Assignee: Martin Radev
> Priority: Minor
> Labels: features, pull-request-available
> Fix For: 1.12.0, format-2.8.0
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> Apache Parquet does not have any encodings suitable for FP data and the
> available text compressors (zstd, gzip, etc) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream
> splitting". Such could be "byte stream splitting" which creates K streams of
> length N where K is the number of bytes in the data type (4 for floats, 8 for
> doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the
> original data and for some cases there is a performance improvement in
> compression and decompression speed.
> You can read a more detailed report here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
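The "byte stream splitting" transformation described in the quoted issue can be sketched as follows. This is an illustrative sketch, not the parquet-cpp or parquet-mr implementation. It assumes values are serialized as little-endian IEEE 754 via Python's `struct` module, and the function names `byte_stream_split_encode` and `byte_stream_split_decode` are hypothetical. Stream k collects byte k of every value, giving K streams of length N that are concatenated in order.

```python
import struct
from typing import List, Sequence


def byte_stream_split_encode(values: Sequence[float], fmt: str = "f") -> bytes:
    """Split N values of K bytes each into K streams of N bytes, concatenated.

    fmt is a struct type code: "f" (K = 4) or "d" (K = 8).
    """
    width = struct.calcsize("<" + fmt)  # K: bytes per value
    raw = struct.pack("<%d%s" % (len(values), fmt), *values)
    # Stream k holds byte k of every value, in value order.
    return b"".join(raw[k::width] for k in range(width))


def byte_stream_split_decode(data: bytes, fmt: str = "f") -> List[float]:
    """Inverse of byte_stream_split_encode: re-interleave the K streams."""
    width = struct.calcsize("<" + fmt)
    count = len(data) // width  # N: number of values
    raw = bytearray(len(data))
    for k in range(width):
        # Scatter stream k back to byte position k of each value.
        raw[k::width] = data[k * count:(k + 1) * count]
    return list(struct.unpack("<%d%s" % (count, fmt), bytes(raw)))


values = [1.0, 2.0, -0.5]
encoded = byte_stream_split_encode(values)
print(encoded.hex())
print(byte_stream_split_decode(encoded))
```

The transformation does not shrink the data by itself. It only regroups bytes so that similar bytes (for example sign and exponent bytes) sit next to each other, which is what gives a general-purpose compressor applied afterwards a better chance.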
[jira] [Commented] (PARQUET-1622) Add BYTE_STREAM_SPLIT encoding
[ https://issues.apache.org/jira/browse/PARQUET-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678056#comment-17678056 ]

Gang Wu commented on PARQUET-1622:
--
The issue raised by [~mwish] above may also exist in the parquet-mr. cc [~xinlishang]] [~gershinsky]

> Add BYTE_STREAM_SPLIT encoding
> --
>
> Key: PARQUET-1622
> URL: https://issues.apache.org/jira/browse/PARQUET-1622
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp, parquet-format, parquet-mr, parquet-thrift
> Reporter: Martin Radev
> Assignee: Martin Radev
> Priority: Minor
> Labels: features, pull-request-available
> Fix For: 1.12.0, format-2.8.0
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> Apache Parquet does not have any encodings suitable for FP data and the
> available text compressors (zstd, gzip, etc) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream
> splitting". Such could be "byte stream splitting" which creates K streams of
> length N where K is the number of bytes in the data type (4 for floats, 8 for
> doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the
> original data and for some cases there is a performance improvement in
> compression and decompression speed.
> You can read a more detailed report here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
[jira] [Commented] (PARQUET-1622) Add BYTE_STREAM_SPLIT encoding
[ https://issues.apache.org/jira/browse/PARQUET-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678055#comment-17678055 ]

Xuwei Fu commented on PARQUET-1622:
---
[~gszadovszky] [~martinradev] Hi all, I ran into a problem here: [https://github.com/apache/arrow/issues/15173] Would you mind taking a look? It seems we don't have a "non-null value count" here.

> Add BYTE_STREAM_SPLIT encoding
> --
>
> Key: PARQUET-1622
> URL: https://issues.apache.org/jira/browse/PARQUET-1622
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp, parquet-format, parquet-mr, parquet-thrift
> Reporter: Martin Radev
> Assignee: Martin Radev
> Priority: Minor
> Labels: features, pull-request-available
> Fix For: 1.12.0, format-2.8.0
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> Apache Parquet does not have any encodings suitable for FP data and the
> available text compressors (zstd, gzip, etc) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream
> splitting". Such could be "byte stream splitting" which creates K streams of
> length N where K is the number of bytes in the data type (4 for floats, 8 for
> doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the
> original data and for some cases there is a performance improvement in
> compression and decompression speed.
> You can read a more detailed report here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
[GitHub] [parquet-mr] vectorijk commented on pull request #1015: add support re-encryption in ColumnEncryptor
vectorijk commented on PR #1015:
URL: https://github.com/apache/parquet-mr/pull/1015#issuecomment-1385727824

@wgtmac thanks for the review! I will coordinate with https://github.com/apache/parquet-mr/pull/1014 and address the comments

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org