Re: Getting Parquet File Metadata in C/GLib interface

Sutou Kouhei Fri, 24 Sep 2021 13:06:43 -0700

Hi,

Thanks. I've implemented: https://github.com/apache/arrow/pull/11215


-- 
kou

In 
<df4pr8401mb03646d43254e2da4941501d48d...@df4pr8401mb0364.namprd84.prod.outlook.com>
  "Re: Getting Parquet File Metadata in C/GLib interface" on Wed, 22 Sep 2021 
15:41:19 +0000,
  "McDonald, Ben" <[email protected]> wrote:

> The number of rows is the only one causing hiccups on my end, because of the 
> performance bottleneck with large files. The others would be nice-to-haves 
> but aren’t blocking in any way.
> 
> I opened up a Jira issue here: 
> https://issues.apache.org/jira/browse/ARROW-14072. Thank you.
> 
> Best,
> Ben McDonald
> 
> From: Sutou Kouhei <[email protected]>
> Date: Tuesday, September 21, 2021 at 5:20 PM
> To: [email protected] <[email protected]>
> Subject: Re: Getting Parquet File Metadata in C/GLib interface
> Hi,
> 
> Unfortunately, Apache Arrow GLib doesn't provide an API to
> get the number of rows in Parquet without reading all row
> groups yet.
> 
> Could you open a JIRA issue that requests this feature?
>   https://issues.apache.org/jira
> 
> I'll implement it by the next release.
> 
> We can get the number of columns from the schema returned by
> the following API:
> 
> GArrowSchema *
> gparquet_arrow_file_reader_get_schema(GParquetArrowFileReader *reader,
>                                       GError **error);
> 
> We can get the number of row groups by the following API:
> 
> gint
> gparquet_arrow_file_reader_get_n_row_groups(GParquetArrowFileReader *reader);
> 
> 
> We can't get "created_by", "format_version" and
> "serialized_size" yet. Do you want to get all of them?
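
For context on why these values should not require reading any row groups: a Parquet file keeps its file-level metadata in a Thrift-encoded footer at the end of the file, followed by a 4-byte little-endian footer length and the magic bytes "PAR1". Below is a minimal sketch in plain C, independent of Arrow GLib, that reads only that footer length. The helper name `parquet_footer_length` is hypothetical, and the sketch assumes an unencrypted footer. The value it returns should correspond to what pyarrow reports as `serialized_size`, but this has not been checked against f1.parquet.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Arrow GLib): returns the length in bytes
 * of the Thrift-encoded footer metadata of a Parquet file, or -1 on error. */
static int64_t
parquet_footer_length(const char *path)
{
  unsigned char tail[8];
  FILE *file = fopen(path, "rb");
  if (!file) {
    return -1;
  }
  /* The last 8 bytes are: 4-byte little-endian footer length + "PAR1". */
  if (fseek(file, -8L, SEEK_END) != 0 ||
      fread(tail, 1, sizeof(tail), file) != sizeof(tail)) {
    fclose(file);
    return -1;
  }
  fclose(file);
  if (memcmp(tail + 4, "PAR1", 4) != 0) {
    return -1; /* Not a Parquet file, or it uses an encrypted footer. */
  }
  /* Assemble the little-endian length byte by byte (endian-independent). */
  return (int64_t)tail[0] |
         ((int64_t)tail[1] << 8) |
         ((int64_t)tail[2] << 16) |
         ((int64_t)tail[3] << 24);
}
```

Only the last 8 bytes of the file are read here, so the cost does not grow with the file size. Decoding the row count or "created_by" would additionally need a Thrift decoder for the footer itself, which this sketch does not attempt.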
> 
> 
> Thanks,
> --
> kou
> 
> In 
> <df4pr8401mb0364865470943123c6f4c04e8d...@df4pr8401mb0364.namprd84.prod.outlook.com>
>   "Getting Parquet File Metadata in C/GLib interface" on Tue, 21 Sep 2021 
> 22:13:09 +0000,
>   "McDonald, Ben" <[email protected]> wrote:
> 
>> Hello,
>>
>> I am working with the C/GLib Arrow interface to read Parquet files and I am 
>> having trouble accessing all of the file metadata.
>>
>> Reading my file into Python and printing the metadata like this:
>> ```
>> pq.ParquetFile('f1.parquet').metadata
>> ```
>>
>> Results in this metadata:
>> ```
>> <pyarrow._parquet.FileMetaData object at 0x1176e8ea0>
>>   created_by: parquet-cpp-arrow version 5.0.0
>>   num_columns: 3
>>   num_rows: 10
>>   num_row_groups: 1
>>   format_version: 1.0
>>   serialized_size: 420
>> ```
>>
>> But reading the same file into the C/GLib interface and printing the 
>> metadata from this call (where the schema is from the same file):
>> ```
>> garrow_schema_to_string_metadata(schema, trueGbooleanValue)
>> ```
>>
>> Results in this metadata, which is only the schema and doesn’t include any 
>> of the above metadata:
>> ```
>> first-int-col: int64
>> str-col: string
>> second-int-col: int64
>> ```
>>
>> My specific question is: is it possible to easily get the number of rows of 
>> a Parquet file in the C/GLib Arrow library (i.e., without having to read in 
>> the whole table)? I would also be interested in getting the rest of the 
>> metadata that is shown in pyarrow. I wasn’t able to find a way to do this in 
>> the C/GLib documentation, but feel like I must be missing something. Thank 
>> you.
>>
>> Best,
>> Ben McDonald
