GitHub user thisisnic added a comment to the discussion: how to debug 
arrow/dplyr to consider a bug report?

I'm curious if there's anything special about the parquet file itself too, so I 
installed parquet-tools and took a look:

```
nic@xps-15:~/arrow$ parquet-tools inspect ../Downloads/papers.parquet 

############ file meta data ############
created_by: parquet-go version 18.0.0-SNAPSHOT
num_columns: 13
num_rows: 64141
num_row_groups: 1
format_version: 2.6
serialized_size: 1819


############ Columns ############
paper_id
softcite_id
title
published_year
published_date
publication_venue
publisher_name
doi
pmcid
pmid
genre
license_type
has_mentions

############ Column(paper_id) ############
name: paper_id
path: paper_id
max_definition_level: 0
max_repetition_level: 0
physical_type: INT32
logical_type: Int(bitWidth=32, isSigned=false)
converted_type (legacy): UINT_32
compression: GZIP (space_saved: 13%)

############ Column(softcite_id) ############
name: softcite_id
path: softcite_id
max_definition_level: 0
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 50%)

############ Column(title) ############
name: title
path: title
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 56%)

############ Column(published_year) ############
name: published_year
path: published_year
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Int(bitWidth=16, isSigned=false)
converted_type (legacy): UINT_16
compression: GZIP (space_saved: 18%)

############ Column(published_date) ############
name: published_date
path: published_date
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Date
converted_type (legacy): DATE
compression: GZIP (space_saved: 10%)

############ Column(publication_venue) ############
name: publication_venue
path: publication_venue
max_definition_level: 0
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 59%)

############ Column(publisher_name) ############
name: publisher_name
path: publisher_name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 49%)

############ Column(doi) ############
name: doi
path: doi
max_definition_level: 0
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 61%)

############ Column(pmcid) ############
name: pmcid
path: pmcid
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 63%)

############ Column(pmid) ############
name: pmid
path: pmid
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 57%)

############ Column(genre) ############
name: genre
path: genre
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 60%)

############ Column(license_type) ############
name: license_type
path: license_type
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 49%)

############ Column(has_mentions) ############
name: has_mentions
path: has_mentions
max_definition_level: 0
max_repetition_level: 0
physical_type: BOOLEAN
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 99%)

```

Nothing too out of the ordinary here. I'll note the file was written with a 
snapshot (i.e. dev) version of parquet-go, though I don't *think* this should 
be an issue. It's Parquet format version 2.6, which is good, as that's one of 
the later versions. The column in question (`published_year`) is a uint16 type, 
but this shouldn't be an issue.
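
For anyone who'd rather not eyeball a dump like this, here's a rough sketch of how the same check could be scripted. It assumes the text layout matches the `parquet-tools inspect` output above; the function names (`parse_columns`, `unsigned_columns`, `nullable_columns`) are made up for illustration and are not part of parquet-tools. It also assumes that, for flat columns like these, a `max_definition_level` of 1 means the column is optional (can hold nulls).

```python
import re

# Excerpt copied from the parquet-tools output above (3 of the 13 columns).
INSPECT_EXCERPT = """\
############ Column(paper_id) ############
name: paper_id
path: paper_id
max_definition_level: 0
max_repetition_level: 0
physical_type: INT32
logical_type: Int(bitWidth=32, isSigned=false)
converted_type (legacy): UINT_32
compression: GZIP (space_saved: 13%)

############ Column(title) ############
name: title
path: title
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 56%)

############ Column(published_year) ############
name: published_year
path: published_year
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Int(bitWidth=16, isSigned=false)
converted_type (legacy): UINT_16
compression: GZIP (space_saved: 18%)
"""

# Matches section headers such as "############ Column(title) ############".
HEADER = re.compile(r"^#+ Column\((?P<name>[^)]+)\) #+$")


def parse_columns(text):
    """Return {column_name: {field: value}} parsed from inspect-style text."""
    columns = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            # Start collecting fields for a new column section.
            current = columns.setdefault(match.group("name"), {})
        elif current is not None and ": " in line:
            # Split on the first ": " only, so values keep their own colons.
            key, value = line.split(": ", 1)
            current[key] = value
    return columns


def unsigned_columns(columns):
    """Names of columns whose logical type is an unsigned integer."""
    return [name for name, fields in columns.items()
            if "isSigned=false" in fields.get("logical_type", "")]


def nullable_columns(columns):
    """Names of flat columns with a non-zero max definition level (assumed optional)."""
    return [name for name, fields in columns.items()
            if fields.get("max_definition_level") != "0"]


columns = parse_columns(INSPECT_EXCERPT)
print("unsigned:", unsigned_columns(columns))
print("nullable:", nullable_columns(columns))
```

On the excerpt, this flags `paper_id` and `published_year` as unsigned, and `title` and `published_year` as nullable, so `published_year` is the one column in the excerpt that is both.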

Next thing I'm gonna do is try to rule out any issues with working with this 
column type and work out whether it's something up with this file or with Arrow 
itself.

GitHub link: 
https://github.com/apache/arrow/discussions/46383#discussioncomment-13119429

----
This is an automatically sent email for user@arrow.apache.org.
To unsubscribe, please send an email to: user-unsubscr...@arrow.apache.org
