[ 
https://issues.apache.org/jira/browse/PARQUET-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zoltan Ivanfi reassigned PARQUET-1337:
--------------------------------------

    Assignee: Zoltan Ivanfi

> Implement better estimate of page size for RLE+bitpacking
> ---------------------------------------------------------
>
>                 Key: PARQUET-1337
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1337
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Gabor Szadovszky
>            Assignee: Zoltan Ivanfi
>            Priority: Major
>
> If there are many columns with RLE+bitpacking encoding (e.g. dictionary 
> encoding) where the value variance is low, the estimated size of the 
> open pages (which are not encoded yet) is much larger than the final page 
> size. Because of that, parquet-mr fails to create row groups whose size is 
> close to {{parquet.block.size}}, which causes performance issues while reading.
> A hint from Ryan to solve this issue:
> {quote}
> We could probably get a better estimate by using the amount of buffered
> data and how large other pages in a column were after fully encoding and
> compressing. So if you have 5 pages compressed and buffered, and another
> 1000 values, use the compression ratio of the 5 pages to estimate the final
> size. We'd probably want to use some overhead value for the header. And,
> we'd want to separate the amount of buffered data from our row group size
> estimate, which are currently the same thing.
> {quote}
> (So this is not only about RLE+bitpacking but about any kind of encoding 
> that is applied only after "closing" a page.)
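
The hint quoted above could be sketched roughly as follows. This is only an illustration of the idea, not parquet-mr code: the class PageSizeEstimator, its methods, and the fixed per-page header overhead are all hypothetical names and assumptions. It keeps the raw buffered bytes separate from the row group size estimate and scales the open page by the ratio observed on pages that were already encoded and compressed.

```java
// Hypothetical sketch; none of these names exist in parquet-mr.
final class PageSizeEstimator {
    private final long headerOverheadBytes; // assumed fixed per-page header cost
    private long closedRawBytes = 0;        // raw size of closed pages before encoding
    private long closedFinalBytes = 0;      // size of closed pages after encoding + compression
    private int closedPages = 0;

    PageSizeEstimator(long headerOverheadBytes) {
        this.headerOverheadBytes = headerOverheadBytes;
    }

    // Record a page once it has been fully encoded and compressed.
    void pageClosed(long rawBytes, long finalBytes) {
        closedRawBytes += rawBytes;
        closedFinalBytes += finalBytes;
        closedPages++;
    }

    // Estimate the final size of the open page from its raw buffered bytes.
    long estimateOpenPage(long bufferedRawBytes) {
        if (bufferedRawBytes == 0) {
            return 0;
        }
        if (closedRawBytes == 0) {
            // No history yet: fall back to the raw buffered size.
            return bufferedRawBytes + headerOverheadBytes;
        }
        double ratio = (double) closedFinalBytes / closedRawBytes;
        return Math.round(bufferedRawBytes * ratio) + headerOverheadBytes;
    }

    // Row group estimate: closed pages at their real size plus the scaled open page.
    long estimateRowGroupSize(long bufferedRawBytes) {
        return closedFinalBytes
                + closedPages * headerOverheadBytes
                + estimateOpenPage(bufferedRawBytes);
    }
}
```

With this shape, a writer would compare estimateRowGroupSize(...) against {{parquet.block.size}} when deciding whether to flush, while still using the raw buffered size for memory accounting. A real implementation would need this per column and would probably want a smarter fallback before the first page is closed.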



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
