Sure, if you wanted to add to the docs, somewhere in this section is probably the best place: https://github.com/apache/arrow/blob/main/r/vignettes/data_wrangling.Rmd#L124
On Fri, May 19, 2023 at 5:34 PM David Greiss <[email protected]> wrote: > Thanks for the insight and the suggested workaround. My example was a bit > contrived but I am looking to filter on a grouped dataframe more analogous > to this: > > tbl <- arrow_table(name = rownames(mtcars), mtcars) > > tbl |> > group_by(cyl) |> > filter(mpg == max(mpg)) |> > collect() > > The issue that Ian referenced suggests a workaround using left_join which > did the trick for me: > > tbl <- arrow_table(name = rownames(mtcars), mtcars) > > tbl |> > left_join(tbl |> > group_by(cyl) |> > summarize(max_mpg = max(mpg)) > ) |> > filter(mpg == max_mpg) |> > select(-max_mpg) |> > collect() > > If there's any interest, I'd be happy to submit a PR to document these > workarounds. > > Thanks again for the help and work on the package. > > David > > > On Fri, May 19, 2023 at 3:58 PM Ian Cook <[email protected]> wrote: > >> There is an existing enhancement request for this feature at >> https://github.com/apache/arrow/issues/29537 but I don't think there >> is any work planned on this in the near future, so the workaround Neal >> suggested is the way to go for now. >> >> Ian >> >> On Fri, May 19, 2023 at 3:52 PM Neal Richardson >> <[email protected]> wrote: >> > >> > max is an aggregation, so it requires scanning all of the data. >> Filtering is a scalar (row by row operation), so to evaluate mpg > >> max(mpg), you have to pass over all of the data to compute the max, then >> pass through the data again to filter. This is trivial for data frames like >> mtcars, but imagine a dataset that can't be held in memory. >> > >> > One way query engines handle this is with window functions. If you use >> dbplyr, you get SQL with a window function: >> > >> > > tbl(con, "mtcars") |> filter(mpg > max(mpg)) |> show_query() >> > <SQL> >> > SELECT mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb >> > FROM ( >> > SELECT *, MAX(mpg) OVER () AS q03 >> > FROM mtcars >> > ) q01 >> > WHERE (mpg > q03) >> > >> > Acero, the query engine in Arrow, does not currently support window >> functions. The easiest way for you to handle this today is probably to >> evaluate the max first, then pass that in to the filter: >> > >> > max_mpg <- tbl |> summarize(max(mpg)) |> collect() |> pull() >> > tbl |> filter(mpg == max_mpg) |> collect() >> > >> > Neal >> > >> > >> > On Thu, May 18, 2023 at 10:08 PM David Greiss <[email protected]> >> wrote: >> >> >> >> Hi >> >> >> >> The base R max function is not supported when used within `filter`: >> >> >> >> library(arrow) >> >> tbl <- arrow_table(name = rownames(mtcars), mtcars) >> >> >> >> tbl |> >> >> filter(mpg > max(mpg)) |> >> >> collect() >> >> Warning: Expression mpg > max(mpg) not supported in Arrow; pulling >> data into R >> >> >> >> but this works: >> >> >> >> tbl |> >> >> summarize(x = max(mpg)) >> >> >> >> Should this be supported or am I missing something >> >> >> >> Thanks for the help >> >> David >> >
