I have code like the example below to range-partition a file. It looks like
each call to pc.less, pc.cast, and pc.add allocates a new array, so the code
appears to spend more time on memory allocation than on the comparisons
themselves. The performance is still pretty good (and faster than the
alternatives), but it makes me think that as we shift more calculations to
Arrow, we may need to consider C++ implementations to get the full benefit.

Thanks,
Cedric

import pyarrow.parquet as pq
import pyarrow.compute as pc

t = pq.read_table('/path/to/file', columns=['username'])
ta = t.column('username')

# Count, per row, how many boundaries the username sorts below.
# Each pc.less, pc.cast, and pc.add call returns a newly allocated array.
output_file = pc.cast(pc.less(ta, 'Bil'), 'int8')
output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Cou'), 'int8'))
output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Eve'), 'int8'))
output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Ish'), 'int8'))
output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Kib'), 'int8'))
output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Mat'), 'int8'))
output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Pat'), 'int8'))
output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Sco'), 'int8'))
output_file = pc.add(output_file, pc.cast(pc.less(ta, 'Tok'), 'int8'))
# Null usernames produce null counts; map them to 9.
output_file = pc.coalesce(output_file, 9)

On Tue, May 31, 2022 at 2:25 PM Weston Pace <[email protected]> wrote:
> I'd be more interested in some kind of buffer / array pool plus the
> ability to specify an output buffer for a kernel function. I think it
> would achieve the same goal (avoiding allocation) with more
> flexibility (e.g. you wouldn't have to overwrite your input buffer).
>
> At the moment though I wonder if this is a concern. Jemalloc should
> do some level of memory reuse. Is there a specific performance issue
> you are encountering?
>
> On Tue, May 31, 2022 at 11:45 AM Wes McKinney <[email protected]> wrote:
> >
> > *In principle*, it would be possible to provide mutable output buffers
> > for a kernel's execution, so that input and output buffers could be
> > the same (essentially exposing the lower-level kernel execution
> > interface that underlies arrow::compute::CallFunction). But this would
> > be a fair amount of development work to achieve. If there are others
> > interested in exploring an implementation, we could create a Jira
> > issue.
> >
> > > On Sun, May 29, 2022 at 3:04 PM Micah Kornfield <[email protected]> wrote:
> > >
> > > > I think even in Cython this might be difficult, as Array data
> > > > structures are generally considered immutable, so this is inherently
> > > > unsafe and must be done with care.
> > >
> > > > On Sun, May 29, 2022 at 11:21 AM Cedric Yau <[email protected]> wrote:
> > >>
> > > >> Suppose I have an array with 1MM integers and I add 1 to them with
> > > >> pyarrow.compute.add. It looks like a new array is allocated.
> > > >>
> > > >> Is there a way to do this in place? Would Cython be required at this
> > > >> point?
> > >>
> > >> ```
> > >> import pyarrow as pa
> > >> import pyarrow.compute as pc
> > >>
> > >> a = pa.array(range(1000000))
> > >> print(id(a))
> > >> a = pc.add(a,1)
> > >> print(id(a))
> > >>
> > >> # output
> > >> # 139634974909024
> > >> # 139633492705920
> > >> ```
> > >>
> > >> Thanks,
> > >> Cedric
>