[ https://issues.apache.org/jira/browse/ARROW-8216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sam Albers updated ARROW-8216:
------------------------------
    Description: 

I have just noticed some slightly odd behaviour with the filter method for 
Dataset: it does not seem to distinguish between empty strings and NAs.

{code:java}
library(arrow)
library(dplyr)
packageVersion("arrow")
#> [1] '0.16.0.20200323'
## Make sample parquet
starwars$hair_color[starwars$hair_color == "brown"] <- ""
dir <- tempdir()
fpath <- file.path(dir, "data.parquet")
write_parquet(starwars, fpath)
## df in memory
df_mem <- starwars %>%
 filter(hair_color == "")
## reading from the parquet
df_parquet <- read_parquet(fpath) %>%
 filter(hair_color == "")
## using open_dataset
df_dataset <- open_dataset(dir) %>%
 filter(hair_color == "") %>%
 collect()
identical(df_mem, df_parquet)
#> [1] TRUE
identical(df_mem, df_dataset)
#> [1] FALSE
{code}

I'm pretty sure all three approaches should return the same data.frame. Am I 
missing something?

  was:
 

I have just noticed some slightly odd behaviour with the filter method for 
Dataset. 
{code:java}
library(arrow)
library(dplyr)
packageVersion("arrow")
#> [1] '0.16.0.20200323'
## Make sample parquet
starwars$hair_color[starwars$hair_color == "brown"] <- ""
dir <- tempdir()
fpath <- file.path(dir, 'data.parquet')
write_parquet(starwars, fpath)
## df in memory
df_mem <- starwars %>% 
 filter(hair_color == "")
## reading from the parquet
df_parquet <- read_parquet(fpath) %>% 
 filter(hair_color == "")
## using open_dataset
df_dataset <- open_dataset(dir) %>% 
 filter(hair_color == "") %>% 
 collect()
{code}
I'm pretty sure all these should return the same data.frame. Am I missing 
something?

 


> filter method for Dataset doesn't distinguish between empty strings and NAs
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-8216
>                 URL: https://issues.apache.org/jira/browse/ARROW-8216
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 0.16.0
>         Environment: R 3.6.3, Windows 10
>            Reporter: Sam Albers
>            Priority: Minor
>



