Neal Richardson created ARROW-17437:
---------------------------------------

             Summary: [R][C++] Scalar UDFs don't actually deal with scalars
                 Key: ARROW-17437
                 URL: https://issues.apache.org/jira/browse/ARROW-17437
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++, R
            Reporter: Neal Richardson


Noted while testing out UDFs in R. I was wrapping a {{system()}} call in a UDF 
to shell out and capture the stdout for each value in the data, but I got the 
same result for every row. After some exploration, I found that the data 
passed to the UDF is actually a vector (the whole chunk), so unless the R 
function is properly vectorized, you get unexpected results. 

Here's an example that illustrates the problem:

{code}
library(arrow)
library(dplyr)

register_scalar_function(
  "test", 
  function(context, x) paste(x, collapse=","), 
  utf8(), 
  utf8(), 
  auto_convert=TRUE
)

Table$create(x = c("a", "b", "c")) |>
  transmute(test(x)) |>
  collect()

# # A tibble: 3 × 1
#   `test(x)`
#   <chr>    
# 1 a,b,c    
# 2 a,b,c    
# 3 a,b,c    
{code}

Basically, the UDF receives the whole chunk of data at once and returns a 
single Scalar, which is then recycled across all rows.
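
To make the difference concrete, here is a minimal base-R sketch that does not use arrow. It assumes that with {{auto_convert=TRUE}} the argument arrives as an ordinary R character vector. The names {{chunk}}, {{collapsing}} and {{per_value}} are hypothetical and exist only for illustration. It contrasts the function body from the example with a variant that handles each element separately and returns one value per input row:

```r
# The UDF is handed the whole chunk as one vector, not one value at a time.
chunk <- c("a", "b", "c")

# Body from the report: collapses the entire chunk into a single string.
collapsing <- function(context, x) paste(x, collapse = ",")

# Vectorized variant (hypothetical): applies the per-value logic to each
# element and returns a character vector as long as the input.
per_value <- function(context, x) {
  vapply(
    x,
    function(v) paste(v, collapse = ","),
    character(1),
    USE.NAMES = FALSE
  )
}

length(collapsing(NULL, chunk))  # one value for the whole chunk
length(per_value(NULL, chunk))   # one value per input element
```

A function shaped like {{per_value}} could presumably be passed to {{register_scalar_function()}} in place of the original one to get per-row results, although that does not change the underlying behavior reported here.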




