[jira] [Commented] (PARQUET-1872) Add TransCompression command
[ https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244334#comment-17244334 ]

Xinli Shang commented on PARQUET-1872:
--------------------------------------

Thanks [~gszadovszky] for working on this! I just created the PR to add comments in the command. Once you review it and we merge, I will resolve this Jira.

> Add TransCompression command
> ----------------------------
>
>                 Key: PARQUET-1872
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1872
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.12.0
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>
> When ZSTD becomes more popular, there is a need to translate existing data to
> ZSTD compressed which can achieve a higher compression ratio. It would be
> useful if we can have a tool to convert a Parquet file directly by just
> decompressing/compressing each page without decoding/encoding or assembling
> the record because it is much faster. The initial result shows it is ~5 times
> faster.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
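The idea in the issue description above (decompress and recompress each page without decoding values or assembling records) can be illustrated with a minimal sketch. This is not the parquet-mr implementation. It assumes each page is available as an opaque compressed byte string, and it uses Python's standard-library `zlib` and `lzma` as stand-ins for the old codec and for ZSTD. The function names `transcompress_page` and `transcompress_pages` are hypothetical.

```python
import lzma
import zlib


def transcompress_page(page_bytes: bytes) -> bytes:
    """Recompress one page: old codec in, new codec out.

    The page payload is treated as opaque bytes. Nothing is decoded into
    values or records, which is where the speed-up is expected to come from.
    zlib stands in for the old codec, lzma stands in for ZSTD (assumption).
    """
    raw = zlib.decompress(page_bytes)
    return lzma.compress(raw)


def transcompress_pages(pages):
    """Apply the page-level recompression to every page, keeping page order."""
    return [transcompress_page(page) for page in pages]


# Two fake "pages" holding already-encoded bytes (illustrative data only).
original_payloads = [b"alpha" * 200, b"beta" * 300]
old_pages = [zlib.compress(payload) for payload in original_payloads]
new_pages = transcompress_pages(old_pages)
```

The encoded payload inside each page is byte-for-byte identical before and after. Only the compression wrapper changes, so the compressed size of each page generally changes too.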
[jira] [Commented] (PARQUET-1872) Add TransCompression command
[ https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243881#comment-17243881 ]

Gabor Szadovszky commented on PARQUET-1872:
-------------------------------------------

[~sha...@uber.com], based on the git logs this feature is more or less complete. Could you please resolve the jiras accordingly?

Also, bloom filter support is missing from the feature. Do you have bandwidth to implement it for 1.12.0? If not do you think it would be a good idea to show in the command that bloom filters are not supported (yet)?
[jira] [Commented] (PARQUET-1872) Add TransCompression command
[ https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138607#comment-17138607 ]

Xinli Shang commented on PARQUET-1872:
--------------------------------------

That is a correct understanding, [~gszadovszky].
[jira] [Commented] (PARQUET-1872) Add TransCompression command
[ https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138248#comment-17138248 ]

Gabor Szadovszky commented on PARQUET-1872:
-------------------------------------------

[~sha...@uber.com], if I understand correctly this PR does not handle bloom filters meaning it won't create any bloom filter in the new file. If that's true, I'm fine with implementing the bloom filter support separately because it means that this PR does not hurt correctness.
[jira] [Commented] (PARQUET-1872) Add TransCompression command
[ https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137967#comment-17137967 ]

Xinli Shang commented on PARQUET-1872:
--------------------------------------

[~gszadovszky] Thanks for the reply! I just manually linked the PR.

For the subtask, I was thinking of having a review and changes first with parquet-tools, and then adding it to parquet-cli, instead of changing both at the same time. But it is also fine with me to change both places in the same PR. I just added it to parquet-cli in the newest PR.

ColumnIndex and OffsetIndex are taken care of in my PR. I also added tests for both ColumnIndex and OffsetIndex validation.

For bloom filter, I will work on the subtask when this PR is done. That would require copying over the existing bloom filters to the new files.

Xinli
[jira] [Commented] (PARQUET-1872) Add TransCompression command
[ https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134030#comment-17134030 ]

Gabor Szadovszky commented on PARQUET-1872:
-------------------------------------------

[~sha...@uber.com], I don't know why the PR was not linked here automatically. Please add it manually to have the reference.

I don't get the sub-tasks. In the PR #796 header you reference this jira while you already resolve the parquet-tools related sub-task in it. I think adding this functionality to {{parquet-cli}} shouldn't be a big enough deal to separate into another task. (I would suggest implementing the functionality in one place and invoking it from {{parquet-tools}} and {{parquet-cli}}.)

What is the bloom filter support about? I am not sure about the bloom filters, but offset indexes surely have to be updated as the page offsets will change. Without it the feature is incorrect, so I would not merge a PR to master without implementing it.
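The point in the comment above, that page offsets change once pages are recompressed, can be sketched as follows. This is only an illustration under stated assumptions, not the code from the PR: `PageLocation` here is a simplified, hypothetical stand-in for an offset index entry, pages are assumed to be written back to back starting at `start_offset`, and each size is assumed to already include the page header.

```python
from dataclasses import dataclass


@dataclass
class PageLocation:
    """Simplified offset index entry (hypothetical stand-in)."""
    offset: int            # byte position of the page in the file
    compressed_size: int   # page size in bytes, header assumed included
    first_row_index: int   # index of the first row stored in the page


def rebuild_offset_index(old_locations, new_sizes, start_offset):
    """Recompute page offsets after every page has been recompressed.

    Row indexes are carried over unchanged because the rows in each page
    are the same. Only sizes and offsets are recalculated.
    """
    if len(old_locations) != len(new_sizes):
        raise ValueError("expected exactly one new size per existing page")
    rebuilt = []
    offset = start_offset
    for location, size in zip(old_locations, new_sizes):
        rebuilt.append(PageLocation(offset, size, location.first_row_index))
        offset += size  # the next page starts right after this one
    return rebuilt


# Illustrative numbers only: two pages that shrink after recompression.
old_index = [PageLocation(4, 100, 0), PageLocation(104, 120, 1000)]
new_index = rebuild_offset_index(old_index, [60, 75], start_offset=4)
```

Reusing the old offsets with the new page sizes would point readers at the wrong bytes, which is why skipping this step would make the rewritten file incorrect rather than merely less optimized.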