[C++][Parquet] Field selection of complex field types

Louis C Mon, 05 Dec 2022 06:11:25 -0800

Hello,
I use

parquet::arrow::FileReader::ReadRowGroups(const std::vector<int>& row_groups,
                                          const std::vector<int>& column_indices,
                                          std::shared_ptr<::arrow::Table>* out)

for importing Parquet data. However, with the table 
blogs.parquet<https://github.com/apache/arrow-testing/blob/master/data/parquet/generated_simple_numerics/blogs.parquet>
I ran into a problem: the table reports 2 fields (when querying the import 
object), but when I tried to import both fields (passing column_indices as 
{0,1} in C++), only the first field was returned. The reason seems to be that 
the first field is a struct with 2 sub-elements, and the Parquet reader counts 
the sub-elements of each field when it chooses which columns to output.
For reference, here is the structure of the table that pyarrow returns:
pyarrow.Table
reply: struct<reply_id: int32 not null, next_id: int32>
  child 0, reply_id: int32 not null
  child 1, next_id: int32
blog_id: int64
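
To make the suspected behaviour concrete, here is a minimal sketch of the index mapping I have in mind. It does not use the Arrow or Parquet API at all: Field, CountLeaves, LeafIndicesForFields and BlogsSchema are hypothetical names made up for illustration, and the sketch assumes that the reader numbers leaf (primitive) columns in depth-first order, which is my guess from the observed result rather than something I have confirmed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model of a schema field (hypothetical, not an Arrow type).
// A field with no children is a leaf, i.e. one physical column.
struct Field {
  std::string name;
  std::vector<Field> children;
};

// Count the leaf columns stored under a field.
int CountLeaves(const Field& f) {
  if (f.children.empty()) return 1;
  int n = 0;
  for (const Field& c : f.children) n += CountLeaves(c);
  return n;
}

// Expand top-level field indices into leaf column indices,
// assuming leaves are numbered depth-first across the schema.
std::vector<int> LeafIndicesForFields(const std::vector<Field>& schema,
                                      const std::vector<int>& field_indices) {
  // Index of the first leaf belonging to each top-level field.
  std::vector<int> first_leaf(schema.size());
  int next = 0;
  for (std::size_t i = 0; i < schema.size(); ++i) {
    first_leaf[i] = next;
    next += CountLeaves(schema[i]);
  }
  std::vector<int> out;
  for (int fi : field_indices) {
    assert(fi >= 0 && static_cast<std::size_t>(fi) < schema.size());
    const int n = CountLeaves(schema[fi]);
    for (int k = 0; k < n; ++k) out.push_back(first_leaf[fi] + k);
  }
  return out;
}

// The structure of blogs.parquet as printed by pyarrow above.
std::vector<Field> BlogsSchema() {
  return {Field{"reply", {Field{"reply_id", {}}, Field{"next_id", {}}}},
          Field{"blog_id", {}}};
}
```

Under that assumption, the top-level fields {0,1} correspond to the leaf indices {0,1,2}, while the leaf indices {0,1} cover only reply.reply_id and reply.next_id. That would explain why passing {0,1} returned only the reply field.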

So my question is:
Is this the intended behaviour (the Parquet reader treating column_indices as 
referring to sub-fields)? If so, I think it is inconsistent with what is done 
with

Result<std::shared_ptr<RecordBatch>> SelectColumns(
      const std::vector<int>& indices) const;


from the RecordBatch class.
In the code we also see (parquet/arrow/reader.h line 208):

/// The indicated column indices are relative to the schema


which would suggest that this is not the intended behaviour.
So is this expected, and how can I import only certain top-level fields (not 
sub-fields)?

Best regards,
Louis Calot