[jira] [Updated] (PARQUET-2232) Add an api to ColumnChunkMetaData to indicate if the column chunk uses a bloom filter

2023-01-17 Thread Jinpeng Zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinpeng Zhou updated PARQUET-2232:
--
Description: Although bloom filter is not fully supported in parquet-cpp 
for now, it can be useful to provide an api that tells if a column chunk is 
using bloom filters. This would lead to better understanding of file 
characteristics.  (was: Although bloom filter is not fully supported in 
parquet-cpp, it can be useful to provide an api that tells if a column chunk is 
using the bloom filter)
Summary: Add an api to ColumnChunkMetaData to indicate if the column 
chunk uses a bloom filter   (was: It can be useful to provide an api that tells 
if a column chunk is using the bloom filter)

> Add an api to ColumnChunkMetaData to indicate if the column chunk uses a 
> bloom filter 
> --
>
> Key: PARQUET-2232
> URL: https://issues.apache.org/jira/browse/PARQUET-2232
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Jinpeng Zhou
> Assignee: Jinpeng Zhou
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> Although bloom filter is not fully supported in parquet-cpp for now, it can 
> be useful to provide an api that tells if a column chunk is using bloom 
> filters. This would lead to better understanding of file characteristics.
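To illustrate the kind of check the issue asks for, here is a minimal sketch. It is not the parquet-cpp API: the class `ColumnChunkMeta` and the method `has_bloom_filter` are hypothetical names made up for this example. The sketch assumes that a bloom filter is signalled by the presence of the optional `bloom_filter_offset` field that the Parquet Thrift column metadata defines.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnChunkMeta:
    """Hypothetical stand-in for a column chunk's metadata (not the parquet-cpp class)."""

    path_in_schema: str
    # Assumption: mirrors the optional bloom_filter_offset field of the
    # Parquet Thrift column metadata; None means the field is not set.
    bloom_filter_offset: Optional[int] = None

    def has_bloom_filter(self) -> bool:
        # The chunk is treated as using a bloom filter iff the offset is set.
        return self.bloom_filter_offset is not None


with_filter = ColumnChunkMeta("user_id", bloom_filter_offset=4096)
without_filter = ColumnChunkMeta("comment")
print(with_filter.has_bloom_filter(), without_filter.has_bloom_filter())
```

A presence check like this only needs the file footer, so it can describe a file's characteristics even when the reader cannot yet evaluate the filter itself.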



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (PARQUET-2232) It can be useful to provide an api that tells if a column chunk is using the bloom filter

2023-01-17 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated PARQUET-2232:

Labels: pull-request-available  (was: )

> It can be useful to provide an api that tells if a column chunk is using the 
> bloom filter
> -
>
> Key: PARQUET-2232
> URL: https://issues.apache.org/jira/browse/PARQUET-2232
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Jinpeng Zhou
> Assignee: Jinpeng Zhou
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Although bloom filter is not fully supported in parquet-cpp, it can be useful 
> to provide an api that tells if a column chunk is using the bloom filter





[jira] [Updated] (PARQUET-2232) It can be useful to provide an api that tells if a column chunk is using the bloom filter

2023-01-17 Thread Jinpeng Zhou (Jira)


 [ 
https://issues.apache.org/jira/browse/PARQUET-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jinpeng Zhou updated PARQUET-2232:
--
Description: Although bloom filter is not fully supported in parquet-cpp, 
it can be useful to provide an api that tells if a column chunk is using the 
bloom filter
Summary: It can be useful to provide an api that tells if a column 
chunk is using the bloom filter  (was: Although bloom filter is not fully 
supported in parquet-cpp, it can be useful to provide an api that tells if a 
column chunk is using the bloom filter)

> It can be useful to provide an api that tells if a column chunk is using the 
> bloom filter
> -
>
> Key: PARQUET-2232
> URL: https://issues.apache.org/jira/browse/PARQUET-2232
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Jinpeng Zhou
> Assignee: Jinpeng Zhou
> Priority: Minor
>
> Although bloom filter is not fully supported in parquet-cpp, it can be useful 
> to provide an api that tells if a column chunk is using the bloom filter





[jira] [Created] (PARQUET-2232) Although bloom filter is not fully supported in parquet-cpp, it can be useful to provide an api that tells if a column chunk is using the bloom filter

2023-01-17 Thread Jinpeng Zhou (Jira)
Jinpeng Zhou created PARQUET-2232:
-

 Summary: Although bloom filter is not fully supported in 
parquet-cpp, it can be useful to provide an api that tells if a column chunk is 
using the bloom filter
 Key: PARQUET-2232
 URL: https://issues.apache.org/jira/browse/PARQUET-2232
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Jinpeng Zhou
Assignee: Jinpeng Zhou








[jira] [Comment Edited] (PARQUET-1622) Add BYTE_STREAM_SPLIT encoding

2023-01-17 Thread Gang Wu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678056#comment-17678056
 ] 

Gang Wu edited comment on PARQUET-1622 at 1/18/23 3:05 AM:
---

The issue raised by [~mwish] above may also exist in the parquet-mr.

cc [~xinlishang] [~gershinsky]


was (Author: wgtmac):
The issue raised by [~mwish] above may also exist in the parquet-mr.

cc [~xinlishang]] [~gershinsky]

> Add BYTE_STREAM_SPLIT encoding
> --
>
> Key: PARQUET-1622
> URL: https://issues.apache.org/jira/browse/PARQUET-1622
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp, parquet-format, parquet-mr, parquet-thrift
> Reporter: Martin Radev
> Assignee: Martin Radev
> Priority: Minor
> Labels: features, pull-request-available
> Fix For: 1.12.0, format-2.8.0
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> Apache Parquet does not have any encodings suited to floating-point (FP) data, and the 
> available text compressors (zstd, gzip, etc.) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream 
> splitting". One such transformation is "byte stream splitting", which creates K streams of 
> length N, where K is the number of bytes in the data type (4 for floats, 8 for 
> doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the 
> original data, and in some cases compression and decompression are also 
> faster.
> You can read a more detailed report here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view
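The transformation described in the quoted issue can be sketched in a few lines. This is only an illustration of the idea (stream k collects byte k of every value, and the K streams are then concatenated), not the parquet-cpp or parquet-mr implementation. The function names are made up for this example, and little-endian IEEE 754 values are assumed.

```python
import struct


def byte_stream_split_encode(values, fmt="<f"):
    """Scatter the bytes of each value into K streams and concatenate them.

    fmt is a struct format: "<f" for 4-byte floats, "<d" for 8-byte doubles.
    """
    width = struct.calcsize(fmt)  # K: number of bytes per value
    raw = b"".join(struct.pack(fmt, v) for v in values)
    # Stream k holds byte k of every value, so each stream has length N.
    return b"".join(raw[k::width] for k in range(width))


def byte_stream_split_decode(data, fmt="<f"):
    """Inverse of byte_stream_split_encode."""
    width = struct.calcsize(fmt)
    n = len(data) // width  # N: number of values
    streams = [data[k * n:(k + 1) * n] for k in range(width)]
    # Interleave the streams again to rebuild each value's bytes.
    raw = bytes(b for group in zip(*streams) for b in group)
    return [struct.unpack_from(fmt, raw, i * width)[0] for i in range(n)]


values = [1.0, 2.0, 3.5]
encoded = byte_stream_split_encode(values)
print(encoded.hex())
print(byte_stream_split_decode(encoded))
```

The encoded buffer has the same size as the plain one. Any gain comes from a later compressor, because bytes that play the same role in each value (for example sign and exponent bytes) end up next to each other.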





[jira] [Commented] (PARQUET-1622) Add BYTE_STREAM_SPLIT encoding

2023-01-17 Thread Gang Wu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678056#comment-17678056
 ] 

Gang Wu commented on PARQUET-1622:
--

The issue raised by [~mwish] above may also exist in the parquet-mr.

cc [~xinlishang] [~gershinsky]

> Add BYTE_STREAM_SPLIT encoding
> --
>
> Key: PARQUET-1622
> URL: https://issues.apache.org/jira/browse/PARQUET-1622
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp, parquet-format, parquet-mr, parquet-thrift
> Reporter: Martin Radev
> Assignee: Martin Radev
> Priority: Minor
> Labels: features, pull-request-available
> Fix For: 1.12.0, format-2.8.0
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> Apache Parquet does not have any encodings suited to floating-point (FP) data, and the 
> available text compressors (zstd, gzip, etc.) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream 
> splitting". One such transformation is "byte stream splitting", which creates K streams of 
> length N, where K is the number of bytes in the data type (4 for floats, 8 for 
> doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the 
> original data, and in some cases compression and decompression are also 
> faster.
> You can read a more detailed report here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view





[jira] [Commented] (PARQUET-1622) Add BYTE_STREAM_SPLIT encoding

2023-01-17 Thread Xuwei Fu (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678055#comment-17678055
 ] 

Xuwei Fu commented on PARQUET-1622:
---

[~gszadovszky] [~martinradev] 

Hi all, I ran into a problem here: [https://github.com/apache/arrow/issues/15173]

Would you mind taking a look? It seems we don't have a "non-null value count" here.

> Add BYTE_STREAM_SPLIT encoding
> --
>
> Key: PARQUET-1622
> URL: https://issues.apache.org/jira/browse/PARQUET-1622
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-cpp, parquet-format, parquet-mr, parquet-thrift
> Reporter: Martin Radev
> Assignee: Martin Radev
> Priority: Minor
> Labels: features, pull-request-available
> Fix For: 1.12.0, format-2.8.0
>
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> Apache Parquet does not have any encodings suited to floating-point (FP) data, and the 
> available text compressors (zstd, gzip, etc.) do not handle FP data very well.
> It is possible to apply a simple data transformation named "stream 
> splitting". One such transformation is "byte stream splitting", which creates K streams of 
> length N, where K is the number of bytes in the data type (4 for floats, 8 for 
> doubles) and N is the number of elements in the sequence.
> The transformed data compresses significantly better on average than the 
> original data, and in some cases compression and decompression are also 
> faster.
> You can read a more detailed report here:
> https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view





[GitHub] [parquet-mr] vectorijk commented on pull request #1015: add support re-encryption in ColumnEncryptor

2023-01-17 Thread GitBox


vectorijk commented on PR #1015:
URL: https://github.com/apache/parquet-mr/pull/1015#issuecomment-1385727824

   @wgtmac thanks for the review! I will coordinate with 
https://github.com/apache/parquet-mr/pull/1014 and address the comments


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org