The number of rows is the only one causing hiccups on my end, because of the performance bottleneck with large files. The others would be nice-to-haves but aren’t blocking in any way.
I opened up a Jira issue here: https://issues.apache.org/jira/browse/ARROW-14072.

Thank you.

Best,
Ben McDonald

From: Sutou Kouhei <[email protected]>
Date: Tuesday, September 21, 2021 at 5:20 PM
To: [email protected] <[email protected]>
Subject: Re: Getting Parquet File Metadata in C/GLib interface

Hi,

Unfortunately, Apache Arrow GLib doesn't provide an API to get
the number of rows in Parquet without reading all row groups
yet. Could you open a JIRA issue that requests this feature?
https://issues.apache.org/jira
I'll implement it until the next release.

We can get the number of columns from schema got by the
following API:

  GArrowSchema *
  gparquet_arrow_file_reader_get_schema(GParquetArrowFileReader *reader,
                                        GError **error);

We can get the number of row groups by the following API:

  gint
  gparquet_arrow_file_reader_get_n_row_groups(GParquetArrowFileReader *reader);

We can't get "created_by", "format_version" and
"serialized_size" yet. Do you want to get all of them?

Thanks,
--
kou

In <df4pr8401mb0364865470943123c6f4c04e8d...@df4pr8401mb0364.namprd84.prod.outlook.com>
  "Getting Parquet File Metadata in C/GLib interface" on Tue, 21 Sep 2021 22:13:09 +0000,
  "McDonald, Ben" <[email protected]> wrote:

> Hello,
>
> I am working with the C/GLib Arrow interface to read Parquet files and I am
> having trouble accessing all of the file metadata.
>
> Reading my file into Python and printing the metadata like this:
> ```
> pq.ParquetFile('f1.parquet').metadata
> ```
>
> Results in this metadata:
> ```
> <pyarrow._parquet.FileMetaData object at 0x1176e8ea0>
> created_by: parquet-cpp-arrow version 5.0.0
> num_columns: 3
> num_rows: 10
> num_row_groups: 1
> format_version: 1.0
> serialized_size: 420
> ```
>
> But reading the same file into the C/GLib interface and printing the metadata
> from this call (where the schema is from the same file):
> ```
> garrow_schema_to_string_metadata(schema, trueGbooleanValue)
> ```
>
> Results in this metadata, which is only the schema and doesn’t include any of
> the above metadata:
> ```
> first-int-col: int64
> str-col: string
> second-int-col: int64
> ```
>
> My specific question is: is it possible to easily get the number of rows of a
> Parquet file in the C/GLib Arrow library? (i.e., without having to read in
> the whole table), but I would also be interested in getting the rest of the
> metadata that is shown in pyarrow. I wasn’t able to find a way to do this in
> the C/GLib documentation, but feel like I must be missing something. Thank
> you.
>
> Best,
> Ben McDonald
