Zsolt Kegyes-Brassai created ARROW-16833:
--------------------------------------------

             Summary: [R] how to enforce type conversion in open_dataset()
                 Key: ARROW-16833
                 URL: https://issues.apache.org/jira/browse/ARROW-16833
             Project: Apache Arrow
          Issue Type: Improvement
    Affects Versions: 8.0.0
            Reporter: Zsolt Kegyes-Brassai


Here is a small example:

{code:java}
library(arrow)

df_numbers <- tibble::tibble(number = c(1,2,3,"error", 4, 5, NA, 6))
str(df_numbers)
#> tibble [8 x 1] (S3: tbl_df/tbl/data.frame)
#>  $ number: chr [1:8] "1" "2" "3" "error" ...

write_parquet(df_numbers, "numbers.parquet")

open_dataset("numbers.parquet")
#> FileSystemDataset with 1 Parquet file
#> number: string

open_dataset("numbers.parquet", schema(number = int8())) |> dplyr::collect()
#> Error in `dplyr::collect()`:
#> ! Invalid: Failed to parse string: 'error' as a scalar of type int8
{code}

The expected result is an integer column in which the non-integer values are converted to NA.

How can this type conversion be enforced using a schema definition in {{open_dataset()}}?

Rationale: I would like to include this in a code chunk that imports a CSV dataset and saves it as a Parquet dataset (open_dataset -> write_dataset), with the type conversion based on a preset schema done in the same step, and all of this without loading all the data into memory.
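
For illustration, here is a minimal sketch of the open_dataset -> write_dataset step with the conversion done in the query rather than in the schema. This is not a confirmed answer to the question. It assumes that the arrow dplyr bindings for grepl(), if_else() and as.integer() are available in the installed version, and the regular expression and temporary paths are made up for the example. Non-numeric strings are replaced with NA before the cast, so the cast no longer hits 'error'. Note that as.integer() yields an int32 column rather than int8.

```r
library(arrow)
library(dplyr)

# Same data as in the example above: a character column with one bad value.
df_numbers <- tibble::tibble(number = c(1, 2, 3, "error", 4, 5, NA, 6))

# Hypothetical locations, used only for this sketch.
src <- file.path(tempdir(), "numbers.parquet")
out_dir <- file.path(tempdir(), "numbers_clean")
write_parquet(df_numbers, src)

open_dataset(src) |>
  # Replace anything that is not an optionally signed run of digits with NA,
  # then cast the remaining strings to integer.
  mutate(
    number = as.integer(
      if_else(grepl("^-?[0-9]+$", number), number, NA_character_)
    )
  ) |>
  # The query is evaluated while writing; collect() is not called on the input.
  write_dataset(out_dir, format = "parquet")

# Read the small result back only to inspect it.
result <- open_dataset(out_dir) |> collect()
```

The same mutate() step should apply to a dataset opened with format = "csv", as long as the column is read as a string first.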