There's also the cookbook https://arrow.apache.org/cookbook/r/ which might
also be a good place for how-to examples like these. The repository to
contribute to those is https://github.com/apache/arrow-cookbook

-Jon


On Sat, May 20, 2023 at 7:31 AM Neal Richardson <[email protected]>
wrote:

> Sure, if you wanted to add to the docs, somewhere in this section is
> probably the best place:
> https://github.com/apache/arrow/blob/main/r/vignettes/data_wrangling.Rmd#L124
>
> On Fri, May 19, 2023 at 5:34 PM David Greiss <[email protected]>
> wrote:
>
>> Thanks for the insight and the suggested workaround. My example was a bit
>> contrived but I am looking to filter on a grouped dataframe more analogous
>> to this:
>>
>> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>>
>> tbl |>
>> group_by(cyl) |>
>> filter(mpg == max(mpg)) |>
>> collect()
>>
>> The issue that Ian referenced suggests a workaround using left_join which
>> did the trick for me:
>>
>> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>>
>> tbl |>
>> left_join(tbl |>
>>   group_by(cyl) |>
>>   summarize(max_mpg = max(mpg))
>> ) |>
>> filter(mpg == max_mpg) |>
>> select(-max_mpg) |>
>> collect()
>>
>> If there's any interest, I'd be happy to submit a PR to document these
>> workarounds.
>>
>> Thanks again for the help and work on the package.
>>
>> David
>>
>>
>> On Fri, May 19, 2023 at 3:58 PM Ian Cook <[email protected]> wrote:
>>
>>> There is an existing enhancement request for this feature at
>>> https://github.com/apache/arrow/issues/29537 but I don't think there
>>> is any work planned on this in the near future, so the workaround Neal
>>> suggested is the way to go for now.
>>>
>>> Ian
>>>
>>> On Fri, May 19, 2023 at 3:52 PM Neal Richardson
>>> <[email protected]> wrote:
>>> >
>>> > max is an aggregation, so it requires scanning all of the data.
>>> Filtering is a scalar (row by row operation), so to evaluate mpg >
>>> max(mpg), you have to pass over all of the data to compute the max, then
>>> pass through the data again to filter. This is trivial for data frames like
>>> mtcars, but imagine a dataset that can't be held in memory.
>>> >
>>> > One way query engines handle this is with window functions. If you use
>>> dbplyr, you get SQL with a window function:
>>> >
>>> > > tbl(con, "mtcars") |> filter(mpg > max(mpg)) |> show_query()
>>> > <SQL>
>>> > SELECT mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
>>> > FROM (
>>> >   SELECT *, MAX(mpg) OVER () AS q03
>>> >   FROM mtcars
>>> > ) q01
>>> > WHERE (mpg > q03)
>>> >
>>> > Acero, the query engine in Arrow, does not currently support window
>>> functions. The easiest way for you to handle this today is probably to
>>> evaluate the max first, then pass that in to the filter:
>>> >
>>> > max_mpg <- tbl |> summarize(max(mpg)) |> collect() |> pull()
>>> > tbl |> filter(mpg == max_mpg) |> collect()
>>> >
>>> > Neal
>>> >
>>> >
>>> > On Thu, May 18, 2023 at 10:08 PM David Greiss <[email protected]>
>>> wrote:
>>> >>
>>> >> Hi
>>> >>
>>> >> The base R max function is not supported when used within `filter`:
>>> >>
>>> >> library(arrow)
>>> >> tbl <- arrow_table(name = rownames(mtcars), mtcars)
>>> >>
>>> >> tbl |>
>>> >> filter(mpg > max(mpg)) |>
>>> >> collect()
>>> >> Warning: Expression mpg > max(mpg) not supported in Arrow; pulling
>>> data into R
>>> >>
>>> >> but this works:
>>> >>
>>> >> tbl |>
>>> >> summarize(x = max(mpg))
>>> >>
>>> >> Should this be supported or am I missing something
>>> >>
>>> >> Thanks for the help
>>> >> David
>>>
>>

Reply via email to