GitHub user thisisnic added a comment to the discussion: how to debug
arrow/dplyr to consider a bug report?
The first thing I tried was inspecting the Parquet file to see if there was
anything unusual about it, and then comparing it with a version I'd written
from Arrow C++.
I added a few extra parameters to try to match the original file as closely as
possible:
```
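library(arrow)
library(dplyr)
tf <- tempfile()  # presumably a temp path; parquet-tools is later run on /tmp/Rtmp...
# The pipe must end the line, otherwise R stops parsing after open_dataset().
open_dataset('data/softcite-extractions-oa-data/p01_one_percent_random_subset/papers.parquet') %>%
  write_dataset(tf, compression = "gzip", min_rows_per_group = 100000)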
```
Then I ran parquet-tools on each file to compare them, e.g.
```
parquet-tools inspect "/tmp/RtmpfoyxmB/file18fa6b312836/part-0.parquet"
```
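For anyone who wants to repeat the comparison locally rather than through a diff website, here is a minimal sketch that saves the `parquet-tools inspect` output for two files and diffs them. The function name `compare_parquet_meta` and the `INSPECT_CMD` override are hypothetical helpers made up for this illustration; only `parquet-tools inspect` comes from the steps above, and the file names in the usage comment are placeholders.

```shell
# Sketch: dump the metadata of two Parquet files and diff the dumps.
# INSPECT_CMD is a hypothetical override; it defaults to the command used above.
compare_parquet_meta() {
  inspect_cmd="${INSPECT_CMD:-parquet-tools inspect}"
  a_out=$(mktemp) || return 2
  b_out=$(mktemp) || return 2
  # Left unquoted on purpose so "parquet-tools inspect" splits into command + argument.
  $inspect_cmd "$1" > "$a_out"
  $inspect_cmd "$2" > "$b_out"
  # diff exits 0 when the dumps are identical and 1 when they differ.
  diff -u "$a_out" "$b_out"
  status=$?
  rm -f "$a_out" "$b_out"
  return $status
}

# Usage (placeholder paths):
#   compare_parquet_meta original.parquet rewritten/part-0.parquet
```

An exit status of 0 means the two metadata dumps are identical; 1 means they differ, with the differences printed in unified diff format.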
There's a diff here comparing the original file with the one written by Arrow
C++, which seems to work fine: https://www.diffchecker.com/OE6AnZgn/
Super weird - the two files are pretty similar but give different results.
Unsure how to proceed right now but I'll have a think and get back to you!
GitHub link:
https://github.com/apache/arrow/discussions/46383#discussioncomment-13119639