GitHub user thisisnic added a comment to the discussion: how to debug 
arrow/dplyr to consider a bug report?

The first thing I tried was inspecting the Parquet file to see if there was 
anything unusual about it, and then comparing it with a version I'd written 
from Arrow C++.

I added a few extra parameters to try to match as closely as possible:

```
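library(arrow)
library(dplyr)

# tf is assumed to be a temporary output directory, e.g. from tempfile()
tf <- tempfile()
open_dataset('data/softcite-extractions-oa-data/p01_one_percent_random_subset/papers.parquet') %>%
  write_dataset(tf, compression = "gzip", min_rows_per_group = 100000)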
```

Then I ran parquet-tools to compare them, e.g.:

```
parquet-tools inspect "/tmp/RtmpfoyxmB/file18fa6b312836/part-0.parquet"
```

There's a diff here comparing the original file with the one written by Arrow 
C++ (the rewritten one seems to be working fine): https://www.diffchecker.com/OE6AnZgn/
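
For anyone who wants to repeat the comparison locally rather than through the online diff, something like the following might work. It's only a rough sketch, assuming `parquet-tools` is on the PATH; the function name `compare_parquet_metadata` and the `INSPECT_CMD` override are made up for illustration and are not part of parquet-tools.

```shell
#!/bin/sh
# Sketch: diff the `parquet-tools inspect` output of two Parquet files.
# INSPECT_CMD is a hypothetical override (not a parquet-tools feature);
# it defaults to the same command used above.
compare_parquet_metadata() {
  original="$1"
  rewritten="$2"
  inspect_cmd="${INSPECT_CMD:-parquet-tools inspect}"

  tmpdir="$(mktemp -d)" || return 2

  # $inspect_cmd is deliberately unquoted so "parquet-tools inspect"
  # splits into a command name and its subcommand.
  $inspect_cmd "$original" > "$tmpdir/original.txt" || { rm -rf "$tmpdir"; return 2; }
  $inspect_cmd "$rewritten" > "$tmpdir/rewritten.txt" || { rm -rf "$tmpdir"; return 2; }

  # diff exits 0 when the outputs are identical and 1 when they differ.
  diff -u "$tmpdir/original.txt" "$tmpdir/rewritten.txt"
  status=$?

  rm -rf "$tmpdir"
  return "$status"
}
```

Calling `compare_parquet_metadata original.parquet rewritten.parquet` (both file names are placeholders) prints a unified diff of the two metadata dumps and returns a non-zero status if they differ.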

Super weird - they're pretty similar but give different results. I'm not sure 
how to proceed right now, but I'll have a think and get back to you!

GitHub link: 
https://github.com/apache/arrow/discussions/46383#discussioncomment-13119639
