Sure, if you wanted to add to the docs, somewhere in this section is
probably the best place:
https://github.com/apache/arrow/blob/main/r/vignettes/data_wrangling.Rmd#L124

On Fri, May 19, 2023 at 5:34 PM David Greiss <[email protected]>
wrote:

> Thanks for the insight and the suggested workaround. My example was a bit
> contrived but I am looking to filter on a grouped dataframe more analogous
> to this:
>
> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>
> tbl |>
> group_by(cyl) |>
> filter(mpg == max(mpg)) |>
> collect()
>
> The issue that Ian referenced suggests a workaround using left_join which
> did the trick for me:
>
> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>
> tbl |>
> left_join(tbl |>
>   group_by(cyl) |>
>   summarize(max_mpg = max(mpg))
> ) |>
> filter(mpg == max_mpg) |>
> select(-max_mpg) |>
> collect()
>
> If there's any interest, I'd be happy to submit a PR to document these
> workarounds.
>
> Thanks again for the help and work on the package.
>
> David
>
>
> On Fri, May 19, 2023 at 3:58 PM Ian Cook <[email protected]> wrote:
>
>> There is an existing enhancement request for this feature at
>> https://github.com/apache/arrow/issues/29537 but I don't think there
>> is any work planned on this in the near future, so the workaround Neal
>> suggested is the way to go for now.
>>
>> Ian
>>
>> On Fri, May 19, 2023 at 3:52 PM Neal Richardson
>> <[email protected]> wrote:
>> >
>> > max is an aggregation, so it requires scanning all of the data.
>> Filtering is a scalar (row by row operation), so to evaluate mpg >
>> max(mpg), you have to pass over all of the data to compute the max, then
>> pass through the data again to filter. This is trivial for data frames like
>> mtcars, but imagine a dataset that can't be held in memory.
>> >
>> > One way query engines handle this is with window functions. If you use
>> dbplyr, you get SQL with a window function:
>> >
>> > > tbl(con, "mtcars") |> filter(mpg > max(mpg)) |> show_query()
>> > <SQL>
>> > SELECT mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
>> > FROM (
>> >   SELECT *, MAX(mpg) OVER () AS q03
>> >   FROM mtcars
>> > ) q01
>> > WHERE (mpg > q03)
>> >
>> > Acero, the query engine in Arrow, does not currently support window
>> functions. The easiest way for you to handle this today is probably to
>> evaluate the max first, then pass that in to the filter:
>> >
>> > max_mpg <- tbl |> summarize(max(mpg)) |> collect() |> pull()
>> > tbl |> filter(mpg == max_mpg) |> collect()
>> >
>> > Neal
>> >
>> >
>> > On Thu, May 18, 2023 at 10:08 PM David Greiss <[email protected]>
>> wrote:
>> >>
>> >> Hi
>> >>
>> >> The base R max function is not supported when used within `filter`:
>> >>
>> >> library(arrow)
>> >> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>> >>
>> >> tbl |>
>> >> filter(mpg > max(mpg)) |>
>> >> collect()
>> >> Warning: Expression mpg > max(mpg) not supported in Arrow; pulling
>> data into R
>> >>
>> >> but this works:
>> >>
>> >> tbl |>
>> >> summarize(x = max(mpg))
>> >>
>> >> Should this be supported or am I missing something
>> >>
>> >> Thanks for the help
>> >> David
>>
>

Reply via email to