There is an existing enhancement request for this feature at
https://github.com/apache/arrow/issues/29537 but I don't think there
is any work planned on this in the near future, so the workaround Neal
suggested is the way to go for now.

Ian

On Fri, May 19, 2023 at 3:52 PM Neal Richardson
<[email protected]> wrote:
>
> max is an aggregation, so it requires scanning all of the data. Filtering is 
> a scalar (row by row operation), so to evaluate mpg > max(mpg), you have to 
> pass over all of the data to compute the max, then pass through the data 
> again to filter. This is trivial for data frames like mtcars, but imagine a 
> dataset that can't be held in memory.
>
> One way query engines handle this is with window functions. If you use 
> dbplyr, you get SQL with a window function:
>
> > tbl(con, "mtcars") |> filter(mpg > max(mpg)) |> show_query()
> <SQL>
> SELECT mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
> FROM (
>   SELECT *, MAX(mpg) OVER () AS q03
>   FROM mtcars
> ) q01
> WHERE (mpg > q03)
>
> Acero, the query engine in Arrow, does not currently support window 
> functions. The easiest way for you to handle this today is probably to 
> evaluate the max first, then pass that in to the filter:
>
> max_mpg <- tbl |> summarize(max(mpg)) |> collect() |> pull()
> tbl |> filter(mpg == max_mpg) |> collect()
>
> Neal
>
>
> On Thu, May 18, 2023 at 10:08 PM David Greiss <[email protected]> wrote:
>>
>> Hi
>>
>> The base R max function is not supported when used within `filter`:
>>
>> library(arrow)
>> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>>
>> tbl |>
>> filter(mpg > max(mpg)) |>
>> collect()
>> Warning: Expression mpg > max(mpg) not supported in Arrow; pulling data into 
>> R
>>
>> but this works:
>>
>> tbl |>
>> summarize(x = max(mpg))
>>
>> Should this be supported or am I missing something
>>
>> Thanks for the help
>> David

Reply via email to