[jira] [Created] (ARROW-15926) [R] CsvConvertOptions include_columns bug in open_dataset vs. read_csv_arrow

Jameel Alsalam (Jira) Sat, 12 Mar 2022 15:35:07 -0800

Jameel Alsalam created ARROW-15926:
--------------------------------------

             Summary: [R] CsvConvertOptions include_columns bug in open_dataset vs. read_csv_arrow
                 Key: ARROW-15926
                 URL: https://issues.apache.org/jira/browse/ARROW-15926
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 7.0.0
         Environment: Windows 10
            Reporter: Jameel Alsalam



I think there is a bug when reading a CSV dataset with only a subset of its 
columns. As shown below, the same convert_options work in read_csv_arrow but 
raise an error in open_dataset. This can be worked around by opening the 
dataset with all columns and selecting afterwards, but I am not sure whether 
omitting columns at the reading step has any performance advantage.

 

``` r
library(tidyverse)
library(arrow)
#> 
#> Attaching package: 'arrow'


tmpf <- tempfile()

dat <- tribble(
  ~key, ~val1,
  "A", "1",
  "B", "2",
)

write_csv(dat, tmpf)


# works in read_csv_arrow, errors in open_dataset:

read_csv_arrow(
  tmpf,
  convert_options = CsvConvertOptions$create(
    include_columns = "key"
  ))
#> # A tibble: 2 x 1
#>   key  
#>   <chr>
#> 1 A    
#> 2 B

open_dataset(
  tmpf, format = "csv",
  convert_options = CsvConvertOptions$create(
    include_columns = "key"
  )) %>% collect()
#> Error in `handle_csv_read_error()`:
#> ! Invalid: Multiple matches for FieldRef.Name(key) in key:   [
#>     "A",
#>     "B"
#>   ]
#> key:   [
#>     "A",
#>     "B"
#>   ]


# Note that selecting after open_dataset does work, so this is not a
# blocking issue:

open_dataset(tmpf, format = "csv") %>%
  select(key) %>%
  collect()
#> # A tibble: 2 x 1
#>   key  
#>   <chr>
#> 1 A    
#> 2 B
```

<sup>Created on 2022-03-12 by the [reprex 
package](https://reprex.tidyverse.org) (v2.0.1)</sup>

 

I have reproduced this with both the CRAN release (7.0.0) and a nightly build.
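
On the performance question, one way to check would be a quick timing comparison between selecting before `collect()` and collecting everything first. This is only a sketch: the generated file, its size, and the extra columns `val2` and `val3` are assumptions made for illustration, and no timings are reported here.

``` r
library(arrow)
library(dplyr)

# Hypothetical test data: one column we want, several we do not
n <- 1e5
big <- data.frame(
  key  = sample(LETTERS, n, replace = TRUE),
  val1 = as.character(seq_len(n)),
  val2 = runif(n),
  val3 = runif(n)
)
big_csv <- tempfile(fileext = ".csv")
write_csv_arrow(big, big_csv)

# Select before collect(): the column selection is part of the dataset query
t_select_first <- system.time(
  res_a <- open_dataset(big_csv, format = "csv") %>%
    select(key) %>%
    collect()
)

# Collect every column into R, then drop the unwanted ones
t_collect_first <- system.time(
  res_b <- open_dataset(big_csv, format = "csv") %>%
    collect() %>%
    select(key)
)

# Compare elapsed times; results will vary by machine and file size
print(rbind(select_first = t_select_first, collect_first = t_collect_first))
```

If the two timings are close on realistic data, the select-after-open workaround costs little; a large gap would suggest that column selection at the reading step matters.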



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
