Hi,

Thanks. I've implemented:
https://github.com/apache/arrow/pull/11215

--
kou

In <df4pr8401mb03646d43254e2da4941501d48d...@df4pr8401mb0364.namprd84.prod.outlook.com>
  "Re: Getting Parquet File Metadata in C/GLib interface" on Wed, 22 Sep 2021 15:41:19 +0000,
  "McDonald, Ben" <[email protected]> wrote:

> The number of rows is the only one that is causing some hiccups on my end due
> to the performance bottleneck with large files; the others would be
> nice-to-haves, but aren’t blocking in any way.
>
> I opened up a Jira issue here:
> https://issues.apache.org/jira/browse/ARROW-14072
>
> Thank you.
>
> Best,
> Ben McDonald
>
> From: Sutou Kouhei <[email protected]>
> Date: Tuesday, September 21, 2021 at 5:20 PM
> To: [email protected] <[email protected]>
> Subject: Re: Getting Parquet File Metadata in C/GLib interface
>
> Hi,
>
> Unfortunately, Apache Arrow GLib doesn't yet provide an API to
> get the number of rows in a Parquet file without reading all
> row groups.
>
> Could you open a JIRA issue that requests this feature?
> https://issues.apache.org/jira
>
> I'll implement it by the next release.
>
> We can get the number of columns from the schema returned by
> the following API:
>
>   GArrowSchema *
>   gparquet_arrow_file_reader_get_schema(GParquetArrowFileReader *reader,
>                                         GError **error);
>
> We can get the number of row groups with the following API:
>
>   gint
>   gparquet_arrow_file_reader_get_n_row_groups(GParquetArrowFileReader *reader);
>
> We can't get "created_by", "format_version" and
> "serialized_size" yet. Do you want to get all of them?
>
>
> Thanks,
> --
> kou
>
> In <df4pr8401mb0364865470943123c6f4c04e8d...@df4pr8401mb0364.namprd84.prod.outlook.com>
>   "Getting Parquet File Metadata in C/GLib interface" on Tue, 21 Sep 2021 22:13:09 +0000,
>   "McDonald, Ben" <[email protected]> wrote:
>
>> Hello,
>>
>> I am working with the C/GLib Arrow interface to read Parquet files and I am
>> having trouble accessing all of the file metadata.
>>
>> Reading my file into Python and printing the metadata like this:
>> ```
>> pq.ParquetFile('f1.parquet').metadata
>> ```
>>
>> Results in this metadata:
>> ```
>> <pyarrow._parquet.FileMetaData object at 0x1176e8ea0>
>>   created_by: parquet-cpp-arrow version 5.0.0
>>   num_columns: 3
>>   num_rows: 10
>>   num_row_groups: 1
>>   format_version: 1.0
>>   serialized_size: 420
>> ```
>>
>> But reading the same file into the C/GLib interface and printing the
>> metadata from this call (where the schema is from the same file):
>> ```
>> garrow_schema_to_string_metadata(schema, trueGbooleanValue)
>> ```
>>
>> Results in this metadata, which is only the schema and doesn’t include any
>> of the above metadata:
>> ```
>> first-int-col: int64
>> str-col: string
>> second-int-col: int64
>> ```
>>
>> My specific question is: is it possible to easily get the number of rows of
>> a Parquet file in the C/GLib Arrow library (i.e., without having to read in
>> the whole table)? I would also be interested in getting the rest of the
>> metadata that is shown in pyarrow. I wasn’t able to find a way to do this in
>> the C/GLib documentation, but feel like I must be missing something.
>>
>> Thank you.
>>
>> Best,
>> Ben McDonald
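
As background to the question above about reading file metadata without loading the whole table: the fields pyarrow prints all live in the Parquet footer, which sits at the end of the file. The following is only a minimal sketch, not anything taken from Arrow GLib or pyarrow. It assumes the unencrypted Parquet layout, in which the file ends with a 4-byte little-endian footer length followed by the magic bytes "PAR1". The function name read_footer_length is hypothetical, and decoding the footer itself (for example to get num_rows) is left out because it requires a Thrift parser.

```python
import struct

# Magic bytes that close an unencrypted Parquet file (assumption: files with an
# encrypted footer end differently and are not handled by this sketch).
PARQUET_MAGIC = b"PAR1"


def read_footer_length(path):
    """Return the size in bytes of the serialized file metadata (the footer).

    Hypothetical helper: only the last 8 bytes of the file are read,
    so the cost does not depend on the number or size of row groups.
    """
    with open(path, "rb") as f:
        f.seek(0, 2)  # jump to the end of the file to learn its size
        size = f.tell()
        # Smallest plausible file: leading magic (4) + length (4) + trailing magic (4).
        if size < 12:
            raise ValueError("file is too small to be a Parquet file")
        f.seek(size - 8)
        tail = f.read(8)

    if tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic at end of file")

    # The 4 bytes before the magic hold the footer length, little-endian.
    (footer_length,) = struct.unpack("<I", tail[:4])
    return footer_length
```

Calling read_footer_length('f1.parquet') on the file from the original message would be expected to return the same value that pyarrow reports as serialized_size, assuming that field is the length of the Thrift-encoded footer. The footer itself occupies the footer_length bytes immediately before those last 8 bytes.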
