[jira] [Commented] (PARQUET-1872) Add TransCompression command

2020-12-04 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17244334#comment-17244334
 ] 

Xinli Shang commented on PARQUET-1872:
--------------------------------------

Thanks [~gszadovszky] for working on this! I just created the PR to add 
comments in the command. Once you review it and we merge, I will resolve this 
Jira. 

> Add TransCompression command
> ----------------------------
>
>                 Key: PARQUET-1872
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1872
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.12.0
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>
> As ZSTD becomes more popular, there is a need to convert existing data to
> ZSTD compression, which can achieve a higher compression ratio. It would be
> useful to have a tool that converts a Parquet file directly by just
> decompressing and recompressing each page, without decoding/encoding the
> values or assembling records, because that is much faster. Initial results
> show it is ~5 times faster.
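
The page-level idea in the description above can be sketched roughly as
follows. This is only an illustration under stated assumptions, not the
parquet-mr implementation: the function names are hypothetical, pages are
modeled as plain byte strings, and zlib and lzma stand in for the source and
target codecs because ZSTD is not assumed to be available in the Python
standard library.

```python
import lzma
import zlib

# Hypothetical stand-in codecs (assumption): zlib plays the "old" codec and
# lzma plays the role that ZSTD has in the issue description.
CODECS = {
    "ZLIB": (zlib.compress, zlib.decompress),
    "LZMA": (lzma.compress, lzma.decompress),
}


def transcompress_page(page, src, dst):
    """Recompress one page's bytes without interpreting the encoded values."""
    _, decompress = CODECS[src]
    compress, _ = CODECS[dst]
    # The decompressed bytes are treated as opaque: no value decoding and no
    # record assembly takes place, which is where the speed-up comes from.
    return compress(decompress(page))


def transcompress_pages(pages, src, dst):
    """Apply transcompress_page to every page, preserving page order."""
    return [transcompress_page(p, src, dst) for p in pages]


if __name__ == "__main__":
    raw_pages = [bytes([i % 7]) * 4096 for i in range(3)]
    old_pages = [zlib.compress(p) for p in raw_pages]
    new_pages = transcompress_pages(old_pages, "ZLIB", "LZMA")
    # The payload of every page is byte-for-byte identical after conversion.
    assert [lzma.decompress(p) for p in new_pages] == raw_pages
```

Because the payload is never decoded, the encoded values in the output are
identical to the input; only the compressed representation changes.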



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (PARQUET-1872) Add TransCompression command

2020-12-04 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243881#comment-17243881
 ] 

Gabor Szadovszky commented on PARQUET-1872:
-------------------------------------------

[~sha...@uber.com], based on the git logs, this feature is more or less
complete. Could you please resolve the Jiras accordingly?
Also, bloom filter support is missing from the feature. Do you have the
bandwidth to implement it for 1.12.0? If not, do you think it would be a good
idea to indicate in the command that bloom filters are not supported (yet)?



[jira] [Commented] (PARQUET-1872) Add TransCompression command

2020-06-17 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138607#comment-17138607
 ] 

Xinli Shang commented on PARQUET-1872:
--------------------------------------

That is a correct understanding, [~gszadovszky].



[jira] [Commented] (PARQUET-1872) Add TransCompression command

2020-06-17 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138248#comment-17138248
 ] 

Gabor Szadovszky commented on PARQUET-1872:
-------------------------------------------

[~sha...@uber.com], if I understand correctly, this PR does not handle bloom
filters, meaning it won't create any bloom filter in the new file. If that's
true, I'm fine with implementing the bloom filter support separately, because
it means that this PR does not hurt correctness.



[jira] [Commented] (PARQUET-1872) Add TransCompression command

2020-06-16 Thread Xinli Shang (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137967#comment-17137967
 ] 

Xinli Shang commented on PARQUET-1872:
--------------------------------------

[~gszadovszky] Thanks for the reply! I just manually linked the PR.

For the subtask, I was planning to get the parquet-tools changes reviewed
first and then add the command to parquet-cli, instead of changing both at the
same time. But I am also fine with making both changes in the same PR. I have
just added it to parquet-cli in the newest PR.

For ColumnIndex and OffsetIndex, both are taken care of in my PR. I also added
tests for both ColumnIndex and OffsetIndex validation.

For bloom filters, I will work on the subtask when this PR is done. That would
require copying the existing bloom filters over to the new files.

Xinli 



[jira] [Commented] (PARQUET-1872) Add TransCompression command

2020-06-12 Thread Gabor Szadovszky (Jira)


[ 
https://issues.apache.org/jira/browse/PARQUET-1872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17134030#comment-17134030
 ] 

Gabor Szadovszky commented on PARQUET-1872:
-------------------------------------------

[~sha...@uber.com], I don't know why the PR was not linked here automatically.
Please add it manually to have the reference.
I don't get the sub-tasks. In the header of PR #796 you reference this Jira,
while you already resolve the parquet-tools related sub-task in it. I think
adding this functionality to {{parquet-cli}} is not a big enough change to
separate into another task. (I would suggest implementing the functionality in
one place and invoking it from both {{parquet-tools}} and {{parquet-cli}}.)
What is the bloom filter support about? I am not sure about the bloom filters,
but the offset indexes surely have to be updated, as the page offsets will
change. Without that the feature is incorrect, so I would not merge a PR to
master without implementing it.
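
To illustrate why the offset indexes have to be rewritten, here is a minimal
sketch under stated assumptions; it is not the code from the PR. The
PageLocation class is a simplified model loosely based on the page location
entries of a Parquet offset index, page headers are ignored, the helper name
is hypothetical, and zlib and lzma are stand-ins for the real codecs.

```python
import lzma
import zlib
from dataclasses import dataclass


@dataclass
class PageLocation:
    """Simplified, hypothetical model of one offset index entry."""
    offset: int           # byte position of the page in the file
    compressed_size: int  # size of the page as stored
    first_row_index: int  # index of the first row held by the page


def rewrite_with_offsets(pages, old_index, start_offset, decompress, compress):
    """Recompress pages and rebuild their locations.

    Returns (new_pages, new_index).
    """
    new_pages, new_index = [], []
    offset = start_offset
    for page, loc in zip(pages, old_index):
        data = compress(decompress(page))
        new_pages.append(data)
        # Row positions are unchanged; only byte offsets and sizes move,
        # because every recompressed page has a different length.
        new_index.append(PageLocation(offset, len(data), loc.first_row_index))
        offset += len(data)
    return new_pages, new_index


if __name__ == "__main__":
    raw = [b"a" * 1000, b"b" * 2000]
    pages = [zlib.compress(p) for p in raw]
    old_index = [
        PageLocation(4, len(pages[0]), 0),
        PageLocation(4 + len(pages[0]), len(pages[1]), 1000),
    ]
    new_pages, new_index = rewrite_with_offsets(
        pages, old_index, 4, zlib.decompress, lzma.compress
    )
    # Each new entry points exactly at the start of its recompressed page.
    assert new_index[1].offset == 4 + len(new_pages[0])
```

Reusing the old index entries unchanged would leave every offset after the
first page pointing into the middle of some other page, which is the
correctness problem raised in the comment above.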
