nealrichardson commented on pull request #9621: URL: https://github.com/apache/arrow/pull/9621#issuecomment-790998553
With an assist from @bkietz, I've written a very basic R wrapper that exercises this in https://github.com/apache/arrow/commit/aa530cb586462bee98390d129575fe1622ffb222. It's enough to expose some issues to address, to say nothing of the interface questions.

```r
library(arrow)
library(dplyr)

# The commit uses this option to switch to use the group_by compute function
options(arrow.summarize = TRUE)
# If the Arrow aggregation function isn't implemented, or if the Arrow call errors,
# it falls back to pulling the data in R and evaluating in R.

# mtcars is a standard dataset that ships with R
mt <- Table$create(mtcars)
mt %>%
  group_by(cyl) %>%
  summarize(total_hp = sum(hp))
# Warning: Error : NotImplemented: Key of typedouble
# ../src/arrow/compute/function.cc:178  kernel_ctx.status()
# ; pulling data into R
# # A tibble: 3 x 2
#     cyl total_hp
# * <dbl>    <dbl>
# 1     4      909
# 2     6      856
# 3     8     2929

# That's unfortunate. R blurs the distinction for users between integer and double,
# so it's not uncommon to have integer data stored as a float.
# (Also, the error message is missing some whitespace.)

# We can cast that to an integer and try again
mt$cyl <- mt$cyl$cast(int32())
unique(mt$cyl)
# Array
# <int32>
# [
#   6,
#   4,
#   8
# ]
mt %>%
  group_by(cyl) %>%
  summarize(total_hp = sum(hp))
# StructArray
# <struct<: double, : int32>>
# -- is_valid: all not null
# -- child 0 type: double
#   [
#     856,
#     909,
#     2929
#   ]
# -- child 1 type: int64
#   [
#     17179869190,
#     8,
#     0
#   ]

# Alright, it computed and got the same numbers, but the StructArray
# is not valid. Type says int32 but data says int64 and we have misplaced bits

# Let's try a different stock dataset
ir <- Table$create(iris)
ir %>%
  group_by(Species) %>%
  summarize(total_length = sum(Sepal.Length))
# Warning: Error : NotImplemented: Key of typedictionary<values=string, indices=int8, ordered=0>
# ../src/arrow/compute/function.cc:178  kernel_ctx.status()
# ; pulling data into R
# # A tibble: 3 x 2
#   Species    total_length
# * <fct>             <dbl>
# 1 setosa             250.
# 2 versicolor         297.
# 3 virginica          329.

# Hmm. dictionary types really need to be supported.
# Let's work around and cast it to string
ir$Species <- ir$Species$cast(utf8())
unique(ir$Species)
# Array
# <string>
# [
#   "setosa",
#   "versicolor",
#   "virginica"
# ]
ir %>%
  group_by(Species) %>%
  summarize(total_length = sum(Sepal.Length))
# Warning: Error : Invalid: Negative buffer resize: -219443965
# ../src/arrow/buffer.cc:262  buffer->Resize(size)
# ../src/arrow/compute/kernels/aggregate_basic.cc:1005  (_error_or_value9).status()
# ../src/arrow/compute/function.cc:193  executor->Execute(implicitly_cast_args, listener.get())
# ; pulling data into R
# # A tibble: 3 x 2
#   Species    total_length
# * <chr>             <dbl>
# 1 setosa             250.
# 2 versicolor         297.
# 3 virginica          329.
```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
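
A note on the "misplaced bits" in the `child 1` output of the second `summarize()` call: the printed int64 values look like the three int32 keys (6, 4, 8) read two at a time as little-endian 64-bit integers. The sketch below only checks that arithmetic in base R. It does not call Arrow, and the trailing zeros are an assumption about what follows the keys in the buffer, inferred from the printout rather than confirmed in the source.

```r
# The int32 keys in the order unique(mt$cyl) printed them, followed by
# zero padding (assumed; inferred from the int64 printout, not confirmed).
keys_int32 <- c(6L, 4L, 8L, 0L, 0L, 0L)

# Pair up consecutive 32-bit values: each column is one 64-bit slot,
# row 1 is the low half and row 2 is the high half (little-endian).
halves <- matrix(keys_int32, nrow = 2)

# Base R has no int64 type, so combine the halves as doubles,
# which are exact for integers below 2^53.
as_int64 <- halves[1, ] + halves[2, ] * 2^32

# These match the values printed under "child 1 type: int64"
stopifnot(identical(as_int64, c(17179869190, 8, 0)))
```

If that reading is right, the key data was written at 4 bytes per value while the array was read at 8 bytes per value, which would match the mismatch between the declared int32 type and the int64 child.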