GitHub user thisisnic added a comment to the discussion: how to debug arrow/dplyr to consider a bug report?
I'm curious if there's anything special about the parquet file itself too, so I installed parquet-tools and took a look:

```
nic@xps-15:~/arrow$ parquet-tools inspect ../Downloads/papers.parquet

############ file meta data ############
created_by: parquet-go version 18.0.0-SNAPSHOT
num_columns: 13
num_rows: 64141
num_row_groups: 1
format_version: 2.6
serialized_size: 1819

############ Columns ############
paper_id
softcite_id
title
published_year
published_date
publication_venue
publisher_name
doi
pmcid
pmid
genre
license_type
has_mentions

############ Column(paper_id) ############
name: paper_id
path: paper_id
max_definition_level: 0
max_repetition_level: 0
physical_type: INT32
logical_type: Int(bitWidth=32, isSigned=false)
converted_type (legacy): UINT_32
compression: GZIP (space_saved: 13%)

############ Column(softcite_id) ############
name: softcite_id
path: softcite_id
max_definition_level: 0
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 50%)

############ Column(title) ############
name: title
path: title
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 56%)

############ Column(published_year) ############
name: published_year
path: published_year
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Int(bitWidth=16, isSigned=false)
converted_type (legacy): UINT_16
compression: GZIP (space_saved: 18%)

############ Column(published_date) ############
name: published_date
path: published_date
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Date
converted_type (legacy): DATE
compression: GZIP (space_saved: 10%)

############ Column(publication_venue) ############
name: publication_venue
path: publication_venue
max_definition_level: 0
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 59%)

############ Column(publisher_name) ############
name: publisher_name
path: publisher_name
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 49%)

############ Column(doi) ############
name: doi
path: doi
max_definition_level: 0
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 61%)

############ Column(pmcid) ############
name: pmcid
path: pmcid
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 63%)

############ Column(pmid) ############
name: pmid
path: pmid
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 57%)

############ Column(genre) ############
name: genre
path: genre
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 60%)

############ Column(license_type) ############
name: license_type
path: license_type
max_definition_level: 1
max_repetition_level: 0
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: GZIP (space_saved: 49%)

############ Column(has_mentions) ############
name: has_mentions
path: has_mentions
max_definition_level: 0
max_repetition_level: 0
physical_type: BOOLEAN
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 99%)
```

Nothing too out of the ordinary here, though I'll note the file was written with a snapshot (i.e. dev) version of parquet-go; I don't *think* this should be an issue. It's Parquet format version 2.6, which is good, as that's one of the later versions. The column in question is a uint16 type, but this shouldn't be an issue.
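To make the type information in that output easier to scan, here is a minimal Python sketch that boils each `Column(...)` section down to its physical type, logical type, and nullability. The helper names (`parse_columns`, `summarize`) are hypothetical and not part of parquet-tools, the sample text is a two-column excerpt copied from the output above, and treating `max_definition_level: 0` as "required" assumes a flat, non-nested schema like this one.

```python
import re

# Two-column excerpt of the `parquet-tools inspect` output shown above.
SAMPLE = """\
############ Column(paper_id) ############
name: paper_id
path: paper_id
max_definition_level: 0
max_repetition_level: 0
physical_type: INT32
logical_type: Int(bitWidth=32, isSigned=false)
converted_type (legacy): UINT_32
compression: GZIP (space_saved: 13%)

############ Column(published_year) ############
name: published_year
path: published_year
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: Int(bitWidth=16, isSigned=false)
converted_type (legacy): UINT_16
compression: GZIP (space_saved: 18%)
"""

# Matches section headers such as "############ Column(paper_id) ############".
HEADER = re.compile(r"^#+ Column\((?P<name>.+)\) #+$")


def parse_columns(text):
    """Return {column_name: {field: value}} for every Column(...) section."""
    columns = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = HEADER.match(line)
        if match:
            current = columns.setdefault(match.group("name"), {})
        elif current is not None and ": " in line:
            # Split on the first ": " only, so values containing ": " stay whole.
            key, value = line.split(": ", 1)
            current[key] = value
    return columns


def summarize(columns):
    """Return (name, physical_type, logical_type, nullable) per column.

    Assumption: for a flat schema, max_definition_level 0 means the column
    is required and 1 means it is optional (may contain nulls).
    """
    rows = []
    for name, fields in columns.items():
        nullable = fields.get("max_definition_level") != "0"
        rows.append(
            (name, fields.get("physical_type"), fields.get("logical_type"), nullable)
        )
    return rows


columns = parse_columns(SAMPLE)
for row in summarize(columns):
    print(row)
```

Run over the full output, this makes it quick to see that `published_year` is the only 16-bit unsigned column in the file: it is stored physically as INT32, carries the `Int(bitWidth=16, isSigned=false)` annotation, and is nullable.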
Next thing I'm gonna do is try to rule out any issues with working with this column type and work out whether it's something up with this file or with Arrow itself.

GitHub link: https://github.com/apache/arrow/discussions/46383#discussioncomment-13119429