GitHub user thisisnic edited a comment on the discussion: how to debug 
arrow/dplyr to consider a bug report?

One useful thing to try at this point is working out whether the discrepancy 
lives in the R bindings to the Arrow C++ library or in the Arrow C++ library 
itself.  In the former case, I'll dig into it more myself; in the latter, I 
might ask someone more familiar with the C++ code to help.  One way to work 
this out is to run the equivalent PyArrow code. Both R and Python provide 
bindings to the same C++ library, so if the results differ, the issue is most 
likely in the R bindings, and if they match, it points to the shared C++ layer.


I asked ChatGPT for the Python equivalent of the snippet:

```r
full_papers <- open_dataset(
  'data/softcite-extractions-oa-data/p01_one_percent_random_subset/papers.parquet',
  format = 'parquet'
)

full_papers |>
  filter(published_year < 1990) |>
  collect() |>
  nrow()
```

and got this:

```py
import pyarrow.dataset as ds

# Load dataset
full_papers = ds.dataset(
    'data/softcite-extractions-oa-data/p01_one_percent_random_subset/papers.parquet',
    format='parquet',
)

# Filter and count rows
full_papers.to_table(filter=ds.field("published_year") < 1990).num_rows
```

which gave me the result:

```
0
```

And just to check things looked the same, I also tried the following Python:

```py
full_papers.to_table(filter=ds.field("published_year") >= 1990).num_rows
```

which returned

```
62421
```

Given that this matches what you found in R, it looks like this is happening 
at the C++ level rather than in the R bindings.

GitHub link: 
https://github.com/apache/arrow/discussions/46383#discussioncomment-13119345
