On Mon, 7 Mar 2022 19:09:35 +0800
Shawn Zeng <[email protected]> wrote:
> I created https://issues.apache.org/jira/browse/ARROW-15855 for the
> "dictionary_pagesize_limit" issue.
> 
> For the naming of "chunk_size" in the C++ WriteTable function, I can
> also create another ticket if you feel it is necessary, and I am
> willing to try the fix.

Yes, you can create another ticket for "chunk_size".

Regards

Antoine.



> 
> On Mon, Mar 7, 2022 at 6:30 PM Antoine Pitrou <[email protected]> wrote:
> >
> >
> > My bad: I meant "suggest" not "suggested" :-)
> >
> >
> >
> > On Mon, 7 Mar 2022 11:26:43 +0100
> > Antoine Pitrou <[email protected]> wrote:  
> > > Hi Shawn,
> > >
> > > I suggested you open a ticket on JIRA to keep track of this.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > > On Mon, 7 Mar 2022 17:33:53 +0800
> > > Shawn Zeng <[email protected]> wrote:  
> > > > Although the Python Parquet API is a wrapper around the C++ one,
> > > > some tuning knobs are not exposed in Python. One example is
> > > > dictionary_pagesize_limit_. The dictionary page size easily
> > > > exceeds the limit when any of the following hold:
> > > > 1. row_group_size is relatively large (the default is 64M rows);
> > > > 2. the size per entry is large (e.g. a large string column);
> > > > 3. the data has low repeatability.
> > > > If this parameter cannot be tuned, dictionary encoding may not be
> > > > fully utilized. In C++, however, it can be tuned to an optimal
> > > > setting.
> > > >
> > > > There are also other parameters not exposed in Python, for
> > > > example max_statistics_size.
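> > > >
> > > > For illustration, both knobs can be set through the C++
> > > > WriterProperties builder. A minimal untested sketch (the concrete
> > > > byte values below are placeholders, not recommendations):
> > > >
> > > > #include <parquet/properties.h>
> > > >
> > > > // dictionary_pagesize_limit defaults to 1 MiB; raising it keeps
> > > > // dictionary encoding in use for larger dictionaries.
> > > > std::shared_ptr<parquet::WriterProperties> props =
> > > >     parquet::WriterProperties::Builder()
> > > >         .enable_dictionary()
> > > >         ->dictionary_pagesize_limit(16 * 1024 * 1024)  // bytes
> > > >         ->max_statistics_size(64 * 1024)               // bytes
> > > >         ->build();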
> > > >
> > > > There is also some misalignment between C++ and Python. For
> > > > example, the "chunk_size" parameter of the C++ WriteTable function
> > > > is actually the row group size. Calling it "chunk_size" without
> > > > any explanation in the C++ docs is really confusing; the Cython
> > > > code in fact passes "row_group_size" through as "chunk_size".
> > > >
> > > > Thanks in advance,
> > > > Shawn Zeng
> > > >  
> > >
> > >
> > >
> > >  
> >
> >
> >  
> 


