There's also the cookbook (https://arrow.apache.org/cookbook/r/), which might be a good place for how-to examples like these. The repository for contributing to it is https://github.com/apache/arrow-cookbook
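
For what it's worth, a cookbook-style version of the join workaround quoted in this thread might look something like the sketch below. It is only a sketch: it assumes dplyr is attached alongside arrow, it spells out by = "cyl" instead of relying on the natural join, and the name max_by_cyl is purely illustrative (it does not come from the thread or from the package).

```r
library(arrow)
library(dplyr)

# Same example table as in the quoted messages.
tbl <- arrow_table(name = rownames(mtcars), mtcars)

# Step 1: aggregate once per group to get each group's maximum mpg.
# max_by_cyl is an illustrative name, not part of any API.
max_by_cyl <- tbl |>
  group_by(cyl) |>
  summarize(max_mpg = max(mpg))

# Step 2: join the per-group maxima back onto the full table, keep the rows
# that match their group's maximum, then drop the helper column.
result <- tbl |>
  left_join(max_by_cyl, by = "cyl") |>
  filter(mpg == max_mpg) |>
  select(-max_mpg) |>
  collect()

print(result)
```

Splitting the aggregation from the filter mirrors the two passes over the data described further down the thread: one to compute the maxima, one to filter against them.
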
-Jon

On Sat, May 20, 2023 at 7:31 AM Neal Richardson <[email protected]> wrote:

> Sure, if you wanted to add to the docs, somewhere in this section is
> probably the best place:
> https://github.com/apache/arrow/blob/main/r/vignettes/data_wrangling.Rmd#L124
>
> On Fri, May 19, 2023 at 5:34 PM David Greiss <[email protected]>
> wrote:
>
>> Thanks for the insight and the suggested workaround. My example was a bit
>> contrived but I am looking to filter on a grouped dataframe more analogous
>> to this:
>>
>> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>>
>> tbl |>
>>   group_by(cyl) |>
>>   filter(mpg == max(mpg)) |>
>>   collect()
>>
>> The issue that Ian referenced suggests a workaround using left_join which
>> did the trick for me:
>>
>> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>>
>> tbl |>
>>   left_join(tbl |>
>>     group_by(cyl) |>
>>     summarize(max_mpg = max(mpg))
>>   ) |>
>>   filter(mpg == max_mpg) |>
>>   select(-max_mpg) |>
>>   collect()
>>
>> If there's any interest, I'd be happy to submit a PR to document these
>> workarounds.
>>
>> Thanks again for the help and work on the package.
>>
>> David
>>
>>
>> On Fri, May 19, 2023 at 3:58 PM Ian Cook <[email protected]> wrote:
>>
>>> There is an existing enhancement request for this feature at
>>> https://github.com/apache/arrow/issues/29537 but I don't think there
>>> is any work planned on this in the near future, so the workaround Neal
>>> suggested is the way to go for now.
>>>
>>> Ian
>>>
>>> On Fri, May 19, 2023 at 3:52 PM Neal Richardson
>>> <[email protected]> wrote:
>>> >
>>> > max is an aggregation, so it requires scanning all of the data.
>>> > Filtering is a scalar (row by row operation), so to evaluate mpg >
>>> > max(mpg), you have to pass over all of the data to compute the max, then
>>> > pass through the data again to filter. This is trivial for data frames like
>>> > mtcars, but imagine a dataset that can't be held in memory.
>>> >
>>> > One way query engines handle this is with window functions. If you use
>>> > dbplyr, you get SQL with a window function:
>>> >
>>> > > tbl(con, "mtcars") |> filter(mpg > max(mpg)) |> show_query()
>>> > <SQL>
>>> > SELECT mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
>>> > FROM (
>>> >   SELECT *, MAX(mpg) OVER () AS q03
>>> >   FROM mtcars
>>> > ) q01
>>> > WHERE (mpg > q03)
>>> >
>>> > Acero, the query engine in Arrow, does not currently support window
>>> > functions. The easiest way for you to handle this today is probably to
>>> > evaluate the max first, then pass that in to the filter:
>>> >
>>> > max_mpg <- tbl |> summarize(max(mpg)) |> collect() |> pull()
>>> > tbl |> filter(mpg == max_mpg) |> collect()
>>> >
>>> > Neal
>>> >
>>> >
>>> > On Thu, May 18, 2023 at 10:08 PM David Greiss <[email protected]>
>>> > wrote:
>>> >>
>>> >> Hi
>>> >>
>>> >> The base R max function is not supported when used within `filter`:
>>> >>
>>> >> library(arrow)
>>> >> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>>> >>
>>> >> tbl |>
>>> >>   filter(mpg > max(mpg)) |>
>>> >>   collect()
>>> >> Warning: Expression mpg > max(mpg) not supported in Arrow; pulling
>>> >> data into R
>>> >>
>>> >> but this works:
>>> >>
>>> >> tbl |>
>>> >>   summarize(x = max(mpg))
>>> >>
>>> >> Should this be supported or am I missing something
>>> >>
>>> >> Thanks for the help
>>> >> David
>>> >>
