Jameel Alsalam created ARROW-15926:
--------------------------------------

             Summary: [R] CsvConvertOptions include_columns bug in open_dataset vs. read_csv_arrow
                 Key: ARROW-15926
                 URL: https://issues.apache.org/jira/browse/ARROW-15926
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 7.0.0
         Environment: Windows 10
            Reporter: Jameel Alsalam
I think there is a bug when reading a csv dataset where you don't want to read in all columns. As shown below, the identical code works in read_csv_arrow but errors in open_dataset. This can be worked around by reading in all columns and then selecting afterwards, but I am not sure if there is any performance advantage to omitting columns at the reading step.

``` r
library(tidyverse)
library(arrow)
#>
#> Attaching package: 'arrow'

tmpf <- tempfile()

dat <- tribble(
  ~key, ~val1,
  "A", "1",
  "B", "2",
)

write_csv(dat, tmpf)

# works in read_csv_arrow, errors in open_dataset:
read_csv_arrow(
  tmpf,
  convert_options = CsvConvertOptions$create(
    include_columns = "key"
  ))
#> # A tibble: 2 x 1
#>   key
#>   <chr>
#> 1 A
#> 2 B

open_dataset(
  tmpf,
  format = "csv",
  convert_options = CsvConvertOptions$create(
    include_columns = "key"
  )) %>%
  collect()
#> Error in `handle_csv_read_error()`:
#> ! Invalid: Multiple matches for FieldRef.Name(key) in key: [
#>     "A",
#>     "B"
#>   ]
#> key: [
#>     "A",
#>     "B"
#>   ]

# Note that it does work to select after open_dataset, thus not a blocking issue:
open_dataset(tmpf, format = "csv") %>%
  select(key) %>%
  collect()
#> # A tibble: 2 x 1
#>   key
#>   <chr>
#> 1 A
#> 2 B
```

<sup>Created on 2022-03-12 by the [reprex package](https://reprex.tidyverse.org) (v2.0.1)</sup>

I have tried this both with CRAN version 7 and the nightly version.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)