max is an aggregation, so it requires scanning all of the data. Filtering is a scalar (row by row operation), so to evaluate mpg > max(mpg), you have to pass over all of the data to compute the max, then pass through the data again to filter. This is trivial for data frames like mtcars, but imagine a dataset that can't be held in memory.
One way query engines handle this is with window functions. If you use dbplyr, you get SQL with a window function: > tbl(con, "mtcars") |> filter(mpg > max(mpg)) |> show_query() <SQL> SELECT mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb FROM ( SELECT *, MAX(mpg) OVER () AS q03 FROM mtcars ) q01 WHERE (mpg > q03) Acero, the query engine in Arrow, does not currently support window functions. The easiest way for you to handle this today is probably to evaluate the max first, then pass that in to the filter: max_mpg <- tbl |> summarize(max(mpg)) |> collect() |> pull() tbl |> filter(mpg == max_mpg) |> collect() Neal On Thu, May 18, 2023 at 10:08 PM David Greiss <[email protected]> wrote: > Hi > > The base R max function is not supported when used within `filter`: > > library(arrow) > tbl <- arrow_table(name = rownames(mtcars), mtcars) > > tbl |> > filter(mpg > max(mpg)) |> > collect() > Warning: Expression mpg > max(mpg) not supported in Arrow; pulling data > into R > > but this works: > > tbl |> > summarize(x = max(mpg)) > > Should this be supported or am I missing something > > Thanks for the help > David >
