Thanks for the insight and the suggested workaround. My example was a bit contrived but I am looking to filter on a grouped dataframe more analogous to this:
tbl <- arrow_table(name = rownames(mtcars), mtcars) tbl |> group_by(cyl) |> filter(mpg == max(mpg)) |> collect() The issue that Ian referenced suggests a workaround using left_join which did the trick for me: tbl <- arrow_table(name = rownames(mtcars), mtcars) tbl |> left_join(tbl |> group_by(cyl) |> summarize(max_mpg = max(mpg)) ) |> filter(mpg == max_mpg) |> select(-max_mpg) |> collect() If there's any interest, I'd be happy to submit a PR to document these workarounds. Thanks again for the help and work on the package. David On Fri, May 19, 2023 at 3:58 PM Ian Cook <[email protected]> wrote: > There is an existing enhancement request for this feature at > https://github.com/apache/arrow/issues/29537 but I don't think there > is any work planned on this in the near future, so the workaround Neal > suggested is the way to go for now. > > Ian > > On Fri, May 19, 2023 at 3:52 PM Neal Richardson > <[email protected]> wrote: > > > > max is an aggregation, so it requires scanning all of the data. > Filtering is a scalar (row by row operation), so to evaluate mpg > > max(mpg), you have to pass over all of the data to compute the max, then > pass through the data again to filter. This is trivial for data frames like > mtcars, but imagine a dataset that can't be held in memory. > > > > One way query engines handle this is with window functions. If you use > dbplyr, you get SQL with a window function: > > > > > tbl(con, "mtcars") |> filter(mpg > max(mpg)) |> show_query() > > <SQL> > > SELECT mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb > > FROM ( > > SELECT *, MAX(mpg) OVER () AS q03 > > FROM mtcars > > ) q01 > > WHERE (mpg > q03) > > > > Acero, the query engine in Arrow, does not currently support window > functions. The easiest way for you to handle this today is probably to > evaluate the max first, then pass that in to the filter: > > > > max_mpg <- tbl |> summarize(max(mpg)) |> collect() |> pull() > > tbl |> filter(mpg == max_mpg) |> collect() > > > > Neal > > > > > > On Thu, May 18, 2023 at 10:08 PM David Greiss <[email protected]> > wrote: > >> > >> Hi > >> > >> The base R max function is not supported when used within `filter`: > >> > >> library(arrow) > >> tbl <- arrow_table(name = rownames(mtcars), mtcars) > >> > >> tbl |> > >> filter(mpg > max(mpg)) |> > >> collect() > >> Warning: Expression mpg > max(mpg) not supported in Arrow; pulling data > into R > >> > >> but this works: > >> > >> tbl |> > >> summarize(x = max(mpg)) > >> > >> Should this be supported or am I missing something > >> > >> Thanks for the help > >> David >
