Thanks for the insight and the suggested workaround. My example was a bit
contrived but I am looking to filter on a grouped dataframe more analogous
to this:

tbl <- arrow_table(name = rownames(mtcars), mtcars)

tbl |>
group_by(cyl) |>
filter(mpg == max(mpg)) |>
collect()

The issue that Ian referenced suggests a workaround using left_join which
did the trick for me:

tbl <- arrow_table(name = rownames(mtcars), mtcars)

tbl |>
left_join(tbl |>
  group_by(cyl) |>
  summarize(max_mpg = max(mpg))
) |>
filter(mpg == max_mpg) |>
select(-max_mpg) |>
collect()

If there's any interest, I'd be happy to submit a PR to document these
workarounds.

Thanks again for the help and work on the package.

David


On Fri, May 19, 2023 at 3:58 PM Ian Cook <[email protected]> wrote:

> There is an existing enhancement request for this feature at
> https://github.com/apache/arrow/issues/29537 but I don't think there
> is any work planned on this in the near future, so the workaround Neal
> suggested is the way to go for now.
>
> Ian
>
> On Fri, May 19, 2023 at 3:52 PM Neal Richardson
> <[email protected]> wrote:
> >
> > max is an aggregation, so it requires scanning all of the data.
> Filtering is a scalar (row by row operation), so to evaluate mpg >
> max(mpg), you have to pass over all of the data to compute the max, then
> pass through the data again to filter. This is trivial for data frames like
> mtcars, but imagine a dataset that can't be held in memory.
> >
> > One way query engines handle this is with window functions. If you use
> dbplyr, you get SQL with a window function:
> >
> > > tbl(con, "mtcars") |> filter(mpg > max(mpg)) |> show_query()
> > <SQL>
> > SELECT mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
> > FROM (
> >   SELECT *, MAX(mpg) OVER () AS q03
> >   FROM mtcars
> > ) q01
> > WHERE (mpg > q03)
> >
> > Acero, the query engine in Arrow, does not currently support window
> functions. The easiest way for you to handle this today is probably to
> evaluate the max first, then pass that in to the filter:
> >
> > max_mpg <- tbl |> summarize(max(mpg)) |> collect() |> pull()
> > tbl |> filter(mpg == max_mpg) |> collect()
> >
> > Neal
> >
> >
> > On Thu, May 18, 2023 at 10:08 PM David Greiss <[email protected]>
> wrote:
> >>
> >> Hi
> >>
> >> The base R max function is not supported when used within `filter`:
> >>
> >> library(arrow)
> >> tbl <- arrow_table(name = rownames(mtcars), mtcars)
> >>
> >> tbl |>
> >> filter(mpg > max(mpg)) |>
> >> collect()
> >> Warning: Expression mpg > max(mpg) not supported in Arrow; pulling data
> into R
> >>
> >> but this works:
> >>
> >> tbl |>
> >> summarize(x = max(mpg))
> >>
> >> Should this be supported or am I missing something
> >>
> >> Thanks for the help
> >> David
>

Reply via email to