The number of rows is the only item causing problems on my end, because 
reading large files just to count their rows is a performance bottleneck. 
The others would be nice-to-haves, but aren’t blocking in any way.

I opened up a Jira issue here: 
https://issues.apache.org/jira/browse/ARROW-14072. Thank you.

Best,
Ben McDonald

From: Sutou Kouhei <[email protected]>
Date: Tuesday, September 21, 2021 at 5:20 PM
To: [email protected] <[email protected]>
Subject: Re: Getting Parquet File Metadata in C/GLib interface

Hi,

Unfortunately, Apache Arrow GLib doesn't provide an API to
get the number of rows in Parquet without reading all row
groups yet.

Could you open a JIRA issue that requests this feature?
  https://issues.apache.org/jira

I'll implement it by the next release.

We can get the number of columns from the schema returned by
the following API:

GArrowSchema *
gparquet_arrow_file_reader_get_schema(GParquetArrowFileReader *reader,
                                      GError **error);

We can get the number of row groups by the following API:

gint
gparquet_arrow_file_reader_get_n_row_groups(GParquetArrowFileReader *reader);


We can't get "created_by", "format_version" and
"serialized_size" yet. Do you want to get all of them?

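As a possible stopgap for "serialized_size" only: a Parquet file
ends with the Thrift-encoded footer, then a 4-byte little-endian
footer length, then the magic bytes "PAR1". The sketch below reads
that length with plain C stdio. It assumes pyarrow's
"serialized_size" corresponds to this footer length, and
parquet_footer_size() is a hypothetical helper name, not an Apache
Arrow GLib API. It does not decode the footer itself, so it can't
give the number of rows.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Apache Arrow GLib).
 * Returns the length in bytes of the footer of the Parquet file at
 * `path`, or -1 if the file can't be read or doesn't end with "PAR1". */
int64_t
parquet_footer_size(const char *path)
{
  unsigned char tail[8];
  int64_t size = -1;
  FILE *file = fopen(path, "rb");

  if (!file) {
    return -1;
  }
  /* File tail layout: <4-byte little-endian footer length> "PAR1" */
  if (fseek(file, -8L, SEEK_END) == 0 &&
      fread(tail, 1, sizeof(tail), file) == sizeof(tail) &&
      memcmp(tail + 4, "PAR1", 4) == 0) {
    size = (int64_t)((uint32_t)tail[0] |
                     ((uint32_t)tail[1] << 8) |
                     ((uint32_t)tail[2] << 16) |
                     ((uint32_t)tail[3] << 24));
  }
  fclose(file);
  return size;
}
```

If the assumption above holds, calling parquet_footer_size() on the
file from the quoted message should agree with the value pyarrow
prints there. Files with an encrypted footer use a different magic
and are reported as -1 by this sketch.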

Thanks,
--
kou

In 
<df4pr8401mb0364865470943123c6f4c04e8d...@df4pr8401mb0364.namprd84.prod.outlook.com>
  "Getting Parquet File Metadata in C/GLib interface" on Tue, 21 Sep 2021 
22:13:09 +0000,
  "McDonald, Ben" <[email protected]> wrote:

> Hello,
>
> I am working with the C/GLib Arrow interface to read Parquet files and I am 
> having trouble accessing all of the file metadata.
>
> Reading my file into Python and printing the metadata like this:
> ```
> pq.ParquetFile('f1.parquet').metadata
> ```
>
> Results in this metadata:
> ```
> <pyarrow._parquet.FileMetaData object at 0x1176e8ea0>
>   created_by: parquet-cpp-arrow version 5.0.0
>   num_columns: 3
>   num_rows: 10
>   num_row_groups: 1
>   format_version: 1.0
>   serialized_size: 420
> ```
>
> But reading the same file into the C/GLib interface and printing the metadata 
> from this call (where the schema is from the same file):
> ```
> garrow_schema_to_string_metadata(schema, TRUE)
> ```
>
> Results in this metadata, which is only the schema and doesn’t include any of 
> the above metadata:
> ```
> first-int-col: int64
> str-col: string
> second-int-col: int64
> ```
>
> My specific question is: is it possible to easily get the number of rows of 
> a Parquet file in the C/GLib Arrow library (i.e., without having to read in 
> the whole table)? I would also be interested in getting the rest of the 
> metadata that is shown in pyarrow. I wasn’t able to find a way to do this in 
> the C/GLib documentation, but I feel like I must be missing something. Thank 
> you.
>
> Best,
> Ben McDonald
