My bad: I meant "suggest" not "suggested" :-)
On Mon, 7 Mar 2022 11:26:43 +0100
Antoine Pitrou <[email protected]> wrote:
> Hi Shawn,
>
> I suggested you open a ticket on JIRA to keep track of this.
>
> Regards
>
> Antoine.
>
>
> On Mon, 7 Mar 2022 17:33:53 +0800
> Shawn Zeng <[email protected]> wrote:
> > Although the python Parquet api is a wrapper of c++, there are some
> > tuning knobs not included in python. For example,
> > dictionary_pagesize_limit_. The dictionary page size will easily
> > exceed the limit when any or many of the following happen: 1. The
> > row_group_size is relatively large e.g. the default is 64M. 2. The
> > size per entry is large e.g large string column 3. the repeatability
> > of data is not so high. This may result in the dictionary encoding not
> > being fully utilized if this parameter cannot be tuned. In C++,
> > however, this parameter can be tuned to the optimized setting.
> >
> > There are also other parameters not exposed in python, for example,
> > max_statistics_size.
> >
> > There are also some unalignment between C++ and Python. For example,
> > the parameter "chunk_size" in C++ WriteTable function is actually row
> > group size. Calling it "chunk_size" without any explanation in the C++
> > doc is really confusing. In the Cython code it actually passes
> > "row_group_size" to "chunk_size".
> >
> > Thanks in advance,
> > Shawn Zeng
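
For the ticket, the three conditions in the quoted message can be illustrated with a rough, self-contained Python sketch. It does not call Arrow or Parquet at all. The helpers `estimate_dictionary_bytes` and `fits_in_dictionary` are hypothetical, the 1 MiB limit is assumed to be the default for dictionary_pagesize_limit_, and the 4-byte length prefix per entry is an assumption about how string entries are stored in the dictionary page.

```python
# Back-of-the-envelope model only: no Arrow/Parquet calls are made here.
ASSUMED_LIMIT_BYTES = 1024 * 1024  # assumption: default dictionary page limit (1 MiB)
LENGTH_PREFIX_BYTES = 4            # assumption: per-entry length prefix


def estimate_dictionary_bytes(values):
    """Rough size of a dictionary page that stores each distinct string once."""
    distinct = set(values)
    return sum(LENGTH_PREFIX_BYTES + len(v.encode("utf-8")) for v in distinct)


def fits_in_dictionary(values, limit=ASSUMED_LIMIT_BYTES):
    """True if the estimated dictionary stays within the (assumed) limit."""
    return estimate_dictionary_bytes(values) <= limit


# Short, highly repetitive strings: only 100 distinct entries.
repetitive = [f"city-{i % 100}" for i in range(100_000)]

# Long, all-distinct strings in one large row group: 200 bytes per entry.
unique_long = [f"{i:08d}" + "x" * 192 for i in range(100_000)]

print(estimate_dictionary_bytes(repetitive), fits_in_dictionary(repetitive))
print(estimate_dictionary_bytes(unique_long), fits_in_dictionary(unique_long))
```

Under these assumptions the second column overshoots the limit by a wide margin, which is the case where a writer would presumably stop using dictionary encoding unless the limit can be raised.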
