GitHub user thisisnic edited a comment on the discussion: how to debug arrow/dplyr to consider a bug report?

One useful thing to try at this point is working out whether the discrepancy lives in the R bindings to the Arrow C++ library or in the Arrow C++ library itself. In the case of the former, I'll dig into it more myself, but in the case of the latter, I might choose to ask someone more familiar with it to help.

One way to work this out is to test the equivalent PyArrow code. Both R and Python provide bindings to the same C++ library, so if they give different results, the issue is most likely in the R bindings; if they give the same results, it points to the C++ library.

I asked ChatGPT for the Python equivalent of the snippet:

```r
full_papers <- open_dataset('data/softcite-extractions-oa-data/p01_one_percent_random_subset/papers.parquet', format = 'parquet')

full_papers |>
  filter(published_year < 1990) |>
  collect() |>
  nrow()
```

and got this:

```py
import pyarrow.dataset as ds

# Load dataset
full_papers = ds.dataset('data/softcite-extractions-oa-data/p01_one_percent_random_subset/papers.parquet', format='parquet')

# Filter and count rows
full_papers.to_table(filter=ds.field("published_year") < 1990).num_rows
```

which gave me the result:

```
0
```

And just to check things looked the same, I also tried the following Python:

```py
full_papers.to_table(filter=ds.field("published_year") >= 1990).num_rows
```

which returned

```
62421
```

Given that this matches what you found in R, it looks like this is happening at the C++ level.

GitHub link: https://github.com/apache/arrow/discussions/46383#discussioncomment-13119345

----
This is an automatically sent email for user@arrow.apache.org.
To unsubscribe, please send an email to: user-unsubscr...@arrow.apache.org