max is an aggregation, so it requires scanning all of the data. Filtering
is a scalar (row by row operation), so to evaluate mpg > max(mpg), you have
to pass over all of the data to compute the max, then pass through the data
again to filter. This is trivial for data frames like mtcars, but imagine a
dataset that can't be held in memory.

One way query engines handle this is with window functions. If you use
dbplyr, you get SQL with a window function:

> tbl(con, "mtcars") |> filter(mpg > max(mpg)) |> show_query()
<SQL>
SELECT mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
FROM (
  SELECT *, MAX(mpg) OVER () AS q03
  FROM mtcars
) q01
WHERE (mpg > q03)

Acero, the query engine in Arrow, does not currently support window
functions. The easiest way for you to handle this today is probably to
evaluate the max first, then pass that in to the filter:

max_mpg <- tbl |> summarize(max(mpg)) |> collect() |> pull()
tbl |> filter(mpg == max_mpg) |> collect()

Neal


On Thu, May 18, 2023 at 10:08 PM David Greiss <[email protected]>
wrote:

> Hi
>
> The base R max function is not supported when used within `filter`:
>
> library(arrow)
> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>
> tbl |>
> filter(mpg > max(mpg)) |>
> collect()
> Warning: Expression mpg > max(mpg) not supported in Arrow; pulling data
> into R
>
> but this works:
>
> tbl |>
> summarize(x = max(mpg))
>
> Should this be supported or am I missing something
>
> Thanks for the help
> David
>

Reply via email to