Hi Shawn,

I suggest you open a ticket on JIRA to keep track of this.

Regards

Antoine.


On Mon, 7 Mar 2022 17:33:53 +0800
Shawn Zeng <[email protected]> wrote:
> Although the Python Parquet API is a wrapper around the C++ one, some
> tuning knobs are not exposed in Python, for example
> dictionary_pagesize_limit_. The dictionary page size can easily
> exceed the limit when one or more of the following hold:
> 1. The row_group_size is relatively large, e.g. the default is 64M.
> 2. The size per entry is large, e.g. a large string column.
> 3. The data contains few repeated values.
> If this parameter cannot be tuned, dictionary encoding may not be
> fully utilized. In C++, however, the parameter can be set to a
> suitable value.
> 
> Other parameters are not exposed in Python either, for example
> max_statistics_size.
> 
> There are also some inconsistencies between C++ and Python. For example,
> the "chunk_size" parameter of the C++ WriteTable function is actually the
> row group size. Calling it "chunk_size" without any explanation in the C++
> docs is really confusing. The Cython code actually passes
> "row_group_size" as "chunk_size".
> 
> Thanks in advance,
> Shawn Zeng
> 
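To make the three conditions listed above concrete, here is a rough pure-Python sketch that estimates how large a dictionary page would get for a string column. It is only an approximation under stated assumptions: the 1 MiB limit is assumed to match the C++ default, each distinct value is assumed to be stored once with a 4-byte length prefix, and the helper names (make_column, estimate_dictionary_bytes) are made up for this example and are not part of any Arrow or Parquet API.

```python
# Assumed dictionary page size limit (1 MiB); check the C++ writer
# properties of your Arrow version for the actual default.
ASSUMED_DICT_PAGE_LIMIT = 1024 * 1024


def make_column(num_rows, entry_len, num_distinct):
    """Build a synthetic string column (hypothetical helper).

    Values are zero-padded integers of length `entry_len`, cycling
    through `num_distinct` different strings.
    """
    pool = [f"{i:0{entry_len}d}" for i in range(num_distinct)]
    return [pool[i % num_distinct] for i in range(num_rows)]


def estimate_dictionary_bytes(values):
    """Estimate the dictionary page size for one row group.

    Assumes every distinct value is stored once, as a 4-byte length
    prefix followed by its UTF-8 bytes.
    """
    return sum(4 + len(v.encode("utf-8")) for v in set(values))


if __name__ == "__main__":
    cases = [
        ("small rows, high repetition", 10_000, 16, 100),
        ("large entries, low repetition", 100_000, 64, 50_000),
    ]
    for label, rows, entry_len, distinct in cases:
        size = estimate_dictionary_bytes(make_column(rows, entry_len, distinct))
        verdict = "exceeds" if size > ASSUMED_DICT_PAGE_LIMIT else "fits within"
        print(f"{label}: ~{size} bytes, {verdict} the assumed limit")
```

In this model the estimate grows with the number of distinct values per row group and with the size of each entry, which is why a larger row group, larger entries, or less repetitive data all push the dictionary towards the limit.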


