[ https://issues.apache.org/jira/browse/ARROW-8216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sam Albers updated ARROW-8216:
------------------------------
    Description: 
I have just noticed some slightly odd behaviour with the filter method for Dataset.

{code:java}
library(arrow)
library(dplyr)
packageVersion("arrow")
#> [1] '0.16.0.20200323'

## Make sample parquet
starwars$hair_color[starwars$hair_color == "brown"] <- ""
dir <- tempdir()
fpath <- file.path(dir, "data.parquet")
write_parquet(starwars, fpath)

## df in memory
df_mem <- starwars %>%
  filter(hair_color == "")

## reading from the parquet
df_parquet <- read_parquet(fpath) %>%
  filter(hair_color == "")

## using open_dataset
df_dataset <- open_dataset(dir) %>%
  filter(hair_color == "") %>%
  collect()

identical(df_mem, df_parquet)
#> [1] TRUE
identical(df_mem, df_dataset)
#> [1] FALSE
{code}

I'm pretty sure all these should return the same data.frame. Am I missing something?

  was:
I have just noticed some slightly odd behaviour with the filter method for Dataset.

{code:java}
library(arrow)
library(dplyr)
packageVersion("arrow")
#> [1] '0.16.0.20200323'

## Make sample parquet
starwars$hair_color[starwars$hair_color == "brown"] <- ""
dir <- tempdir()
fpath <- file.path(dir, 'data.parquet')
write_parquet(starwars, fpath)

## df in memory
df_mem <- starwars %>%
  filter(hair_color == "")

## reading from the parquet
df_parquet <- read_parquet(fpath) %>%
  filter(hair_color == "")

## using open_dataset
df_dataset <- open_dataset(dir) %>%
  filter(hair_color == "") %>%
  collect()
{code}

I'm pretty sure all these should return the same data.frame. Am I missing something?
> filter method for Dataset doesn't distinguish between empty strings and NAs
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-8216
>                 URL: https://issues.apache.org/jira/browse/ARROW-8216
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 0.16.0
>         Environment: R 3.6.3, Windows 10
>            Reporter: Sam Albers
>            Priority: Minor