Neal Richardson created ARROW-17462: ---------------------------------------
Summary: [R] Cast scalars to type of field in Expression building Key: ARROW-17462 URL: https://issues.apache.org/jira/browse/ARROW-17462 Project: Apache Arrow Issue Type: Improvement Components: R Reporter: Neal Richardson Assignee: Neal Richardson Fix For: 10.0.0 After looking at the ExecPlan output of some queries, it jumped out at me how we translate {{ int_field == 5 }} in R as {{ cast(int_field, float64) == 5 }} because 5 is a double in R. This extra work has a noticeable performance impact. Here's a simple query on the taxi dataset, filtering down to 54 out of 1.5 billion rows and selecting a single column. My idea was to make a query that does not much other than evaluate the filter. {code} > system.time(ds |> select(passenger_count) |> filter(passenger_count > 10) |> > compute()) user system elapsed 0.391 0.024 0.362 > system.time(ds |> select(passenger_count) |> filter(passenger_count > > Scalar$create(10, type = int8())) |> compute()) user system elapsed 0.206 0.025 0.179 {code} You can see the difference in the query plans too: {code} > ds |> select(passenger_count) |> filter(passenger_count > 10) |> explain() ExecPlan with 4 nodes: 3:SinkNode{} 2:ProjectNode{projection=[passenger_count]} 1:FilterNode{filter=(cast(passenger_count, {to_type=double, allow_int_overflow=false, allow_time_truncate=false, allow_time_overflow=false, allow_decimal_truncate=false, allow_float_truncate=false, allow_invalid_utf8=false}) > 10)} 0:SourceNode{} > ds |> select(passenger_count) |> filter(passenger_count > Scalar$create(10, > type = int8())) |> explain() ExecPlan with 4 nodes: 3:SinkNode{} 2:ProjectNode{projection=[passenger_count]} 1:FilterNode{filter=(passenger_count > 10)} 0:SourceNode{} {code} Ideally Acero would do this more intelligently (cf. ARROW-11402), but we should also be able to do smarter things when assembling the Expression in R. -- This message was sent by Atlassian Jira (v8.20.10#820010)