Sam Albers created ARROW-8216:
---------------------------------

             Summary: filter method for Dataset doesn't distinguish between empty strings and NAs
                 Key: ARROW-8216
                 URL: https://issues.apache.org/jira/browse/ARROW-8216
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
    Affects Versions: 0.16.0
         Environment: R 3.6.3, Windows 10
            Reporter: Sam Albers

I have just noticed some slightly odd behaviour with the filter method for Dataset.

{code:java}
library(arrow)
library(dplyr)
packageVersion("arrow")
#> [1] '0.16.0.20200323'

## Make sample parquet
starwars$hair_color[starwars$hair_color == "brown"] <- ""
dir <- tempdir()
fpath <- file.path(dir, 'data.parquet')
write_parquet(starwars, fpath)

## df in memory
df_mem <- starwars %>%
  filter(hair_color == "")

## reading from the parquet
df_parquet <- read_parquet(fpath) %>%
  filter(hair_color == "")

## using open_dataset
df_dataset <- open_dataset(dir) %>%
  filter(hair_color == "") %>%
  collect()
{code}

I'm pretty sure all these should return the same data.frame. Am I missing something?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)