[GitHub] [arrow] nealrichardson commented on pull request #9621: ARROW-11591: [C++][Compute] Grouped aggregation

GitBox Thu, 04 Mar 2021 14:45:19 -0800


nealrichardson commented on pull request #9621:
URL: https://github.com/apache/arrow/pull/9621#issuecomment-790998553



   With an assist from @bkietz, I've written a very basic R wrapper that exercises this in https://github.com/apache/arrow/commit/aa530cb586462bee98390d129575fe1622ffb222. It's enough to expose some issues to address, to say nothing of the interface questions.
   
   ```r
   library(arrow)
   library(dplyr)
   
   # The commit uses this option to switch to use the group_by compute function
   options(arrow.summarize = TRUE)
   # If the Arrow aggregation function isn't implemented, or if the Arrow call errors,
   # it falls back to pulling the data into R and evaluating it there.
   
   # mtcars is a standard dataset that ships with R
   mt <- Table$create(mtcars)
   mt %>%
     group_by(cyl) %>%
     summarize(total_hp = sum(hp))
   # Warning: Error : NotImplemented: Key of typedouble
   # ../src/arrow/compute/function.cc:178  kernel_ctx.status()
   # ; pulling data into R
   # # A tibble: 3 x 2
   #     cyl total_hp
   # * <dbl>    <dbl>
   # 1     4      909
   # 2     6      856
   # 3     8     2929
   
   # That's unfortunate. R blurs the distinction for users between integer and double,
   # so it's not uncommon to have integer data stored as a float.
   # (Also, the error message is missing some whitespace.)
   
   # We can cast that to an integer and try again
   
   mt$cyl <- mt$cyl$cast(int32())
   unique(mt$cyl)
   # Array
   # <int32>
   # [
   #   6,
   #   4,
   #   8
   # ]
   
   mt %>%
     group_by(cyl) %>%
     summarize(total_hp = sum(hp))
   # StructArray
   # <struct<: double, : int32>>
   # -- is_valid: all not null
   # -- child 0 type: double
   #   [
   #     856,
   #     909,
   #     2929
   #   ]
   # -- child 1 type: int64
   #   [
   #     17179869190,
   #     8,
   #     0
   #   ]
   
   # Alright, it computed and got the same numbers, but the StructArray
   # is not valid. Type says int32 but data says int64 and we have misplaced bits.
   
   # Let's try a different stock dataset
   ir <- Table$create(iris)
   ir %>%
     group_by(Species) %>%
     summarize(total_length = sum(Sepal.Length))
   # Warning: Error : NotImplemented: Key of typedictionary<values=string, indices=int8, ordered=0>
   # ../src/arrow/compute/function.cc:178  kernel_ctx.status()
   # ; pulling data into R
   # # A tibble: 3 x 2
   #   Species    total_length
   # * <fct>             <dbl>
   # 1 setosa             250.
   # 2 versicolor         297.
   # 3 virginica          329.
   
   # Hmm. Dictionary types really need to be supported.
   # Let's work around that and cast it to string
   
   ir$Species <- ir$Species$cast(utf8())
   unique(ir$Species)
   # Array
   # <string>
   # [
   #   "setosa",
   #   "versicolor",
   #   "virginica"
   # ]
   ir %>%
     group_by(Species) %>%
     summarize(total_length = sum(Sepal.Length))
   # Warning: Error : Invalid: Negative buffer resize: -219443965
   # ../src/arrow/buffer.cc:262  buffer->Resize(size)
   # ../src/arrow/compute/kernels/aggregate_basic.cc:1005  (_error_or_value9).status()
   # ../src/arrow/compute/function.cc:193  executor->Execute(implicitly_cast_args, listener.get())
   # ; pulling data into R
   # # A tibble: 3 x 2
   #   Species    total_length
   # * <chr>             <dbl>
   # 1 setosa             250.
   # 2 versicolor         297.
   # 3 virginica          329.
   ```
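
To make the "misplaced bits" in the second `mtcars` result concrete, here is a small base-R sketch that splits the printed int64 values back into 32-bit halves. It assumes a little-endian layout and that the key buffer actually holds int32 values being read as int64. That is my reading of the output, not something confirmed against the C++ code. The variable names are illustrative only.

```r
# Values printed for "child 1" of the StructArray above
printed <- c(17179869190, 8, 0)

# Split each 64-bit value into its low and high 32-bit halves.
# Doubles represent these integers exactly, so plain arithmetic is enough.
low  <- printed %% 2^32
high <- printed %/% 2^32

# Interleave the halves in memory order (low half first on little-endian)
as_int32 <- as.vector(rbind(low, high))
as_int32
# [1] 6 4 8 0 0 0
```

Under that assumption the first three values are 6, 4, 8, which matches the order of `unique(mt$cyl)` and lines up with the sums 856, 909, 2929. So the keys themselves look right and only the child array's declared width appears to be off.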


