RE: [C++][Parquet] Field selection of complex field types

Louis C Wed, 04 Jan 2023 01:58:21 -0800

This would be a great contribution. If you are willing to take a stab at it, I 
am happy to review.


I finally found the time to make small modifications to the comments in the 
code.
Here is the PR: https://github.com/apache/arrow/pull/15184
I feel the comments now better reflect the actual behaviour of the code.
I would be glad if you could review it.

Thanks
Louis
________________________________
From: Micah Kornfield <[email protected]>
Sent: Wednesday, December 14, 2022 07:13
To: [email protected] <[email protected]>
Subject: Re: [C++][Parquet] Field selection of complex field types

I would say that indeed the documentation and parameter names could be 
clearer, because it is quite hard to know which level the column indices refer 
to.

This would be a great contribution. If you are willing to take a stab at it, I 
am happy to review.

Thanks,
Micah


On Thu, Dec 8, 2022 at 12:43 AM Louis C <[email protected]> wrote:
Hello Micah,
Thanks for your answer. I ended up using method 1, and my code now runs 
correctly (that was not too hard to do using the schema_fields member).
I would say that indeed the documentation and parameter names could be 
clearer, because it is quite hard to know which level the column indices refer 
to.

Cheers,
Louis Calot
________________________________
From: Micah Kornfield <[email protected]>
Sent: Wednesday, December 7, 2022 04:59
To: [email protected] <[email protected]>
Subject: Re: [C++][Parquet] Field selection of complex field types

Hi Louis,
In the code we also see (parquet/arrow/reader.h line 208):
/// The indicated column indices are relative to the schema
which would mean that this is not the intended behaviour.

I think the documentation and parameter names could be clearer, as the 
definitions of indices are not well defined and differ by method call.  
column_indices for ReadRowGroups takes leaf Parquet column indices as the 
columns it selects, which is why you are seeing that behavior.  Ultimately, 
these get translated to top-level indices via Schema.GetFieldIndices [1]


So is that normal, and how can I import only certain fields (at the top level, 
not sub-fields)?
Unfortunately, as far as I know this would be DIY, in one of the following 
ways:
1.  Do a traversal of root elements in the schema [2] and retrieve all the leaf 
indices.
2.  Use the GetColumn API calls, which I believe take top-level fields for 
reading, and piece together a Table in your code. [3]
3.  Contribute a patch which can take top-level field indices.  I think the 
main challenge here is naming/distinguishing this from existing APIs.  Given 
the proliferation of APIs, I'm not sure adding a new one is a great idea because 
it adds to the confusion (maybe contributing a utility method to do the 
traversal mentioned in 1 is more practical).
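
The traversal in option 1 could be sketched roughly as follows. This is only an 
illustration under stated assumptions: FieldNode is a simplified stand-in for 
parquet::arrow::SchemaField (keeping just a name, the children, and a 
column_index that is -1 for non-leaf nodes), and Leaf, Group, BlogsSchema and 
LeafIndicesForTopLevel are hypothetical helper names that are not part of 
Arrow. With the real library the tree would presumably come from the 
schema_fields member mentioned elsewhere in this thread.

```cpp
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for parquet::arrow::SchemaField (assumption, not the
// real type): a node is a leaf when column_index != -1.
struct FieldNode {
  std::string name;
  std::vector<FieldNode> children;
  int column_index;
};

// Hypothetical helpers to build the stand-in tree.
FieldNode Leaf(const std::string& name, int column_index) {
  return FieldNode{name, {}, column_index};
}
FieldNode Group(const std::string& name, std::vector<FieldNode> children) {
  return FieldNode{name, std::move(children), -1};
}

// Depth-first walk that appends every leaf column index below `node`.
void CollectLeafIndices(const FieldNode& node, std::vector<int>* out) {
  if (node.column_index != -1) {
    out->push_back(node.column_index);
    return;
  }
  for (const FieldNode& child : node.children) {
    CollectLeafIndices(child, out);
  }
}

// Expands top-level field indices into the leaf column indices they cover.
// Throws std::out_of_range (via at()) for an invalid top-level index.
std::vector<int> LeafIndicesForTopLevel(
    const std::vector<FieldNode>& schema_fields,
    const std::vector<int>& top_level_indices) {
  std::vector<int> leaves;
  for (int i : top_level_indices) {
    CollectLeafIndices(schema_fields.at(i), &leaves);
  }
  return leaves;
}

// The blogs.parquet schema as printed by pyarrow in this thread, with leaf
// columns numbered depth-first.
std::vector<FieldNode> BlogsSchema() {
  return {Group("reply", {Leaf("reply_id", 0), Leaf("next_id", 1)}),
          Leaf("blog_id", 2)};
}
```

For the blogs.parquet schema, asking for top-level fields {0, 1} yields leaf 
indices {0, 1, 2}, which is the list one would then pass as column_indices to 
ReadRowGroups.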

Cheers,
Micah

[1] 
https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/arrow/schema.h#L158
[2] 
https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/arrow/schema.h#L115
[3] 
https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/arrow/reader.h#L127


On Mon, Dec 5, 2022 at 6:11 AM Louis C <[email protected]> wrote:
Hello,
I use the

parquet::arrow::FileReader::ReadRowGroups(const std::vector<int>& row_groups,
                                          const std::vector<int>& column_indices,
                                          std::shared_ptr<::arrow::Table>* out)

for importing Parquet data. However, when dealing with the file blogs.parquet 
(https://github.com/apache/arrow-testing/blob/master/data/parquet/generated_simple_numerics/blogs.parquet)
I came across a problem: the table has 2 fields (when querying the import 
object), but when I tried to import both fields (passing column_indices as 
{0,1} in C++), it only returned the first field. The reason seems to be that 
the first field is a struct with 2 sub-elements, and the Parquet reader counts 
the sub-elements of the fields when it chooses the fields to output.
For reference, here is the structure of the table that pyarrow returns:
pyarrow.Table
reply: struct<reply_id: int32 not null, next_id: int32>
  child 0, reply_id: int32 not null
  child 1, next_id: int32
blog_id: int64
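
To make the observed behaviour concrete, here is a small sketch of how leaf 
column indices map back to top-level fields when leaves are numbered 
depth-first. It is a model of the behaviour described in this thread, not 
Arrow's implementation; TopLevelFieldsForLeaves and the leaf_counts parameter 
are hypothetical names introduced only for this illustration.

```cpp
#include <set>
#include <vector>

// leaf_counts[i] is the number of leaf columns under top-level field i.
// Returns the sorted, distinct top-level field indices that contain at least
// one of the requested leaf columns. Throws std::out_of_range (via at()) if a
// leaf index does not exist.
std::vector<int> TopLevelFieldsForLeaves(const std::vector<int>& leaf_counts,
                                         const std::vector<int>& leaf_indices) {
  // owner[leaf] is the top-level field that contains that leaf column.
  std::vector<int> owner;
  for (int field = 0; field < static_cast<int>(leaf_counts.size()); ++field) {
    for (int k = 0; k < leaf_counts[field]; ++k) {
      owner.push_back(field);
    }
  }
  std::set<int> fields;
  for (int leaf : leaf_indices) {
    fields.insert(owner.at(leaf));
  }
  return std::vector<int>(fields.begin(), fields.end());
}
```

With the structure above, leaf_counts would be {2, 1}: reply holds leaves 0 
and 1, and blog_id is leaf 2. Requesting {0, 1} therefore covers only 
top-level field 0 (reply), while {0, 1, 2} is needed to get both fields.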

So my question is:
Is that the intended behaviour (the Parquet reader treating column_indices as 
referring to sub-fields)? In this case I think it is somewhat inconsistent with 
what is done with

Result<std::shared_ptr<RecordBatch>> SelectColumns(
      const std::vector<int>& indices) const;


from the RecordBatch class.
In the code we also see (parquet/arrow/reader.h line 208):

/// The indicated column indices are relative to the schema


which would mean that this is not the intended behaviour.
So is that normal, and how can I import only certain fields (at the top level, 
not sub-fields)?

Best regards,
Louis Calot




