> we may need to consider c++ implementations to get full benefit. The way that you are executing your expression is not the way that we intend users to write performant code.
The intended way is to build an expression and then execute the expression — the current expression evaluator is not very efficient (it does not reuse temporary allocations yet), but that is where the optimization work should happen rather than in custom C++ code. I'm struggling to find documentation for this right now (expressions aren't discussed in https://arrow.apache.org/docs/python/compute.html) but there are many examples in the unit tests: https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_compute.py On Tue, May 31, 2022 at 11:25 PM Cedric Yau <[email protected]> wrote: > > I have code like below to range partition a file. It looks like each of the > pc.less, pc.cast, and pc.add allocates new arrays. So the code appears to be > spending more time performing memory allocations than it is doing the > comparisons. The performance is still pretty good (and faster than the > alternatives), but it does make me think as we start shifting more > calculations to arrow, we may need to consider c++ implementations to get > full benefit. > > Thanks, > Cedric > > import pyarrow.parquet as pq > import pyarrow.compute as pc > > t = pq.read_table('/path/to/file', columns=['username']) > ta = t.column('username') > > output_file = pc.cast( pc.less(ta,'Bil'), 'int8') > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Cou'), 'int8')) > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Eve'), 'int8')) > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Ish'), 'int8')) > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Kib'), 'int8')) > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Mat'), 'int8')) > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Pat'), 'int8')) > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Sco'), 'int8')) > output_file = pc.add(output_file, pc.cast( pc.less(ta,'Tok'), 'int8')) > output_file = pc.coalesce(output_file,9) > > On Tue, May 31, 2022 at 2:25 PM Weston Pace <[email protected]> wrote: >> >> I'd be more interested in some kind of buffer / array pool plus the >> ability to specify an output buffer for a kernel function. I think it >> would achieve the same goal (avoiding allocation) with more >> flexibility (e.g. you wouldn't have to overwrite your input buffer). >> >> At the moment though I wonder if this is a concern. Jemalloc should >> do some level of memory reuse. Is there a specific performance issue >> you are encountering? >> >> On Tue, May 31, 2022 at 11:45 AM Wes McKinney <[email protected]> wrote: >> > >> > *In principle*, it would be possible to provide mutable output buffers >> > for a kernel's execution, so that input and output buffers could be >> > the same (essentially exposing the lower-level kernel execution >> > interface that underlies arrow::compute::CallFunction). But this would >> > be a fair amount of development work to achieve. If there are others >> > interested in exploring an implementation, we could create a Jira >> > issue. >> > >> > On Sun, May 29, 2022 at 3:04 PM Micah Kornfield <[email protected]> >> > wrote: >> > > >> > > I think even in cython this might be difficult as Array data structures >> > > are generally considered immutable, so this is inherently unsafe, and >> > > requires doing with care. >> > > >> > > On Sun, May 29, 2022 at 11:21 AM Cedric Yau <[email protected]> wrote: >> > >> >> > >> Suppose I have an array with 1MM integers and I add 1 to them with >> > >> pyarrow.compute.add. It looks like a new array is assigned. >> > >> >> > >> Is there a way to do this inplace? It looks like a new array is >> > >> allocated. Would cython be required at this point? >> > >> >> > >> ``` >> > >> import pyarrow as pa >> > >> import pyarrow.compute as pc >> > >> >> > >> a = pa.array(range(1000000)) >> > >> print(id(a)) >> > >> a = pc.add(a,1) >> > >> print(id(a)) >> > >> >> > >> # output >> > >> # 139634974909024 >> > >> # 139633492705920 >> > >> ``` >> > >> >> > >> Thanks, >> > >> Cedric > > >
