> How many rows match your timestamp criteria?

Usually between 1,000 and 5,000. I agree that the filtering could well be
the more costly step - it probably is. I focused on the expression because
it is the more complex part and seemed worth explaining in more detail.

> Acero will not "fuse" the kernel and has no expression optimization.  If
your expression has 20 compute calls in it then Acero will make 20 passes
over the data.  In fact, these won't even be in-place (e.g. 20 arrays will
be initialized).

Thanks - that was my misunderstanding, and your explanation clears it up. I
thought Acero did some optimization of the compute graph to reduce
unnecessary materializations. I think the misunderstanding comes from the
line that opens the Acero documentation
(https://arrow.apache.org/docs/cpp/streaming_execution.html):

 > For many complex computations, successive direct invocation of compute
functions is not feasible in either memory or computation time. Doing so
causes all intermediate data to be fully materialized. To facilitate
arbitrarily large inputs and more efficient resource usage, the Arrow C++
implementation also provides Acero...

I think I read this the wrong way, and your explanation makes good sense.
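
For what it's worth - and in case it helps with Chak-Pong's request below
for something reproducible - this is roughly the shape of the plan I'm
running, with synthetic stand-in data, placeholder values for the
timestamp, reference point, and threshold, and the degrees-vs-radians
handling glossed over (it also assumes a pyarrow build that exposes the
pyarrow.acero bindings):

```
import numpy as np
import pyarrow as pa
import pyarrow.acero as acero
import pyarrow.compute as pc

# Synthetic stand-in for the real 13M-row table of sky positions and times.
n = 13_000_000
rng = np.random.default_rng(0)
table = pa.table({
    "timestamp": rng.integers(0, 2_000, n),
    "RA_deg": rng.uniform(0.0, 360.0, n),
    "Dec_deg": rng.uniform(-90.0, 90.0, n),
})
ts, ra0, dec0, threshold = 42, 168.98, 21.5, 0.01

# Angular-separation expression built with the pc_angular_separation helper
# from my original message below, applied to the RA/Dec columns.
sep = pc_angular_separation(ra0, dec0, pc.field("RA_deg"), pc.field("Dec_deg"))

# Source -> exact-match timestamp filter -> distance-threshold filter.
plan = acero.Declaration.from_sequence([
    acero.Declaration("table_source", acero.TableSourceNodeOptions(table)),
    acero.Declaration("filter", acero.FilterNodeOptions(pc.field("timestamp") == ts)),
    acero.Declaration("filter", acero.FilterNodeOptions(sep < threshold)),
])
result = plan.to_table()
```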

On Tue, Aug 22, 2023 at 8:08 AM Weston Pace <[email protected]> wrote:

> How many rows match your timestamp criteria? In other words, how many rows
> are you applying the function to?  If there is an earlier exact match
> filter on a timestamp that only matches one (or a few) rows, then are you
> sure the expression evaluation (and not the filtering) is the costly spot?
>
> > But I expected that Acero would need to only visit columnar values once
>
> I'm not sure what this means.  There has been very little work on
> optimizing expression evaluation (most, if any, optimization work has
> focused on optimizing the individual kernels themselves).  Acero will not
> "fuse" the kernel and has no expression optimization.  If your expression
> has 20 compute calls in it then Acero will make 20 passes over the data.
> In fact, these won't even be in-place (e.g. 20 arrays will be initialized).
>
> > Should I instead think of Acero as mainly about working on very large
> datasets?
>
> Yes.  At the moment I would expect that pyarrow compute kernels are more
> or less as fast as the numpy variants (there are some exceptions, string
> functions tend to be faster, some are slower, no one has done an exhaustive
> survey).  Running these through Acero should have some overhead and give
> you the ability to run on larger-than-memory data.  There is potential for
> Acero to implement some clever tricks (like those I described earlier)
> which might make it faster (instead of adding overhead).  However, I do not
> know if anyone is working on these.
>
> On Mon, Aug 21, 2023 at 6:35 PM Chak-Pong Chung <[email protected]>
> wrote:
>
>> Could you provide a script with which people can reproduce the problem
>> for the performance comparison? That way we can take a closer look.
>>
>> On Mon, Aug 21, 2023 at 8:42 PM Spencer Nelson <[email protected]> wrote:
>>
>>> I'd like some help calibrating my expectations regarding acero
>>> performance. I'm finding that some pretty naive numpy is about 10x faster
>>> than acero for my use case.
>>>
>>> I'm working with a table with 13,000,000 values. The values are angular
>>> positions on the sky and times. I'd like to filter to a specific one of the
>>> times, and to values within a calculated great-circle distance on the sky.
>>>
>>> I've implemented the Vincenty formula (
>>> https://en.wikipedia.org/wiki/Great-circle_distance)
>>> for this:
>>>
>>> ```
>>> import pyarrow.compute as pc
>>>
>>> def pc_angular_separation(lon1, lat1, lon2, lat2):
>>>     # Great-circle angular separation (Vincenty formula), built out of
>>>     # pyarrow.compute calls so it works on arrays and expressions alike.
>>>     sdlon = pc.sin(pc.subtract(lon2, lon1))
>>>     cdlon = pc.cos(pc.subtract(lon2, lon1))
>>>     slat1 = pc.sin(lat1)
>>>     slat2 = pc.sin(lat2)
>>>     clat1 = pc.cos(lat1)
>>>     clat2 = pc.cos(lat2)
>>>
>>>     num1 = pc.multiply(clat2, sdlon)
>>>     num2 = pc.subtract(pc.multiply(slat2, clat1),
>>>                        pc.multiply(pc.multiply(clat2, slat1), cdlon))
>>>     denominator = pc.add(pc.multiply(slat2, slat1),
>>>                          pc.multiply(pc.multiply(clat2, clat1), cdlon))
>>>     hypot = pc.sqrt(pc.add(pc.multiply(num1, num1),
>>>                            pc.multiply(num2, num2)))
>>>     return pc.atan2(hypot, denominator)
>>> ```
>>>
>>> The resulting pyarrow.compute.Expression is fairly monstrous:
>>>
>>> <pyarrow.compute.Expression
>>> atan2(sqrt(add(multiply(multiply(cos(Dec_deg), sin(subtract(RA_deg,
>>> 168.9776949652776))), multiply(cos(Dec_deg), sin(subtract(RA_deg,
>>> 168.9776949652776)))), multiply(subtract(multiply(sin(Dec_deg),
>>> -0.9304510671785976), multiply(multiply(cos(Dec_deg), 0.3664161726591893),
>>> cos(subtract(RA_deg, 168.9776949652776)))), subtract(multiply(sin(Dec_deg),
>>> -0.9304510671785976), multiply(multiply(cos(Dec_deg), 0.3664161726591893),
>>> cos(subtract(RA_deg, 168.9776949652776))))))), add(multiply(sin(Dec_deg),
>>> 0.3664161726591893), multiply(multiply(cos(Dec_deg), -0.9304510671785976),
>>> cos(subtract(RA_deg, 168.9776949652776)))))>
>>>
>>> Then my Acero graph is very simple. Just a table source node, then a
>>> filter node on the timestamp (for exact match), and then another filter
>>> node for a computed value of that expression under a threshold.
>>>
>>> For 13 million observations, this takes about 15ms on my laptop using
>>> Acero.
>>>
>>> But the same computation done with totally naive numpy is about 3ms.
>>>
>>> The numpy version has no fanciness, just calling numpy trigonometric
>>> functions and materializing all the intermediate results like you might
>>> imagine, then eventually coming up with a boolean mask over everything and
>>> calling `table.filter(mask)`.
>>>
>>> So finally, my question: is this about what I should expect? I know
>>> Acero has an advantage that it *would* work if my data were larger than
>>> fits in memory, which is not true of my numpy approach. But I expected that
>>> Acero would need to only visit columnar values once, so it should be able
>>> to outpace the numpy approach. Should I instead think of Acero as mainly
>>> about working on very large datasets?
>>>
>>> -Spencer
>>>
>>
>>
>> --
>> Regards,
>> Chak-Pong
>>
>
