Hi Louis,

> In the code we also see (parquet/arrow/reader.h line 208):
> /// The indicated column indices are relative to the schema
> which would mean that this is not the intended behaviour.


I think the documentation and the parameter name could both be clearer,
as the meaning of "indices" is not well defined and differs by method
call.  column_indices for ReadRowGroups takes leaf Parquet column indices as
the columns it selects, which is why you are seeing that behavior.
Ultimately, these get translated to top-level field indices via
Schema.GetFieldIndices [1].



> So is that normal and how to import only certain fields (at the higher
> level, not sub fields) ?

Unfortunately, as far as I know this would be a DIY in one of three ways:
1.  Do a traversal of root elements in the schema [2] and retrieve all the
leaf indices
2.  Use the GetColumn API calls, which I believe take top level fields for
reading, and piece together a Table in your code. [3]
3.  Contribute a patch which can take top level field indices.  I think the
main challenge here is naming/distinguishing this from existing APIs.
Given the proliferation of APIs, I'm not sure adding a new one is a great
idea because it adds to the confusion (maybe contributing a utility method
to do the traversal mentioned in 1 is more practical).
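
For what it's worth, the traversal in option 1 could look roughly like the
sketch below. To keep it self-contained it does not use the Parquet C++
classes at all: Node, BlogsSchema and LeafIndicesForTopLevelFields are
hypothetical names made up for illustration, and Node only stands in for
whatever schema tree the reader exposes. It also assumes that leaf columns
are numbered depth-first in schema order, which matches the behavior you
observed with blogs.parquet.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a schema node (not a Parquet C++ type).
// Assumption: a node without children is a leaf (primitive) column,
// anything else is a group such as a struct.
struct Node {
  std::string name;
  std::vector<Node> children;
};

// Depth-first walk over `node`. Every leaf advances the running counter
// `next_leaf`; its index is recorded only when `out` is non-null.
void CollectLeaves(const Node& node, int* next_leaf, std::vector<int>* out) {
  if (node.children.empty()) {
    if (out != nullptr) out->push_back(*next_leaf);
    ++*next_leaf;
    return;
  }
  for (const Node& child : node.children) {
    CollectLeaves(child, next_leaf, out);
  }
}

// Expands top-level field indices into the leaf column indices below them.
// The result is in schema order; duplicate requests are ignored.
std::vector<int> LeafIndicesForTopLevelFields(const Node& root,
                                              const std::vector<int>& top_level) {
  const int n = static_cast<int>(root.children.size());
  for (int idx : top_level) {
    assert(idx >= 0 && idx < n);  // precondition: index names a root field
    (void)idx;
  }
  std::vector<int> leaves;
  int next_leaf = 0;
  for (int i = 0; i < n; ++i) {
    const bool selected =
        std::find(top_level.begin(), top_level.end(), i) != top_level.end();
    // Unselected fields are still walked so the leaf counter stays correct.
    CollectLeaves(root.children[i], &next_leaf, selected ? &leaves : nullptr);
  }
  return leaves;
}

// The blogs.parquet layout from the original message, rebuilt by hand:
// reply: struct<reply_id, next_id>, blog_id.
Node BlogsSchema() {
  return Node{"schema",
              {Node{"reply", {Node{"reply_id", {}}, Node{"next_id", {}}}},
               Node{"blog_id", {}}}};
}
```

With that layout, asking for top-level fields {0, 1} expands to leaf indices
{0, 1, 2}, and {1} alone expands to {2}. The expanded list is what would then
be passed as column_indices to ReadRowGroups.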

Cheers,
Micah

[1]
https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/arrow/schema.h#L158
[2]
https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/arrow/schema.h#L115
[3]
https://github.com/apache/arrow/blob/8e43f23dcc6a9e630516228f110c48b64d13cec6/cpp/src/parquet/arrow/reader.h#L127


On Mon, Dec 5, 2022 at 6:11 AM Louis C <[email protected]> wrote:

> Hello,
> I use the
>
> parquet::arrow::FileReader::ReadRowGroups(const std::vector<int>& row_groups,
>                                           const std::vector<int>& column_indices,
>                                           std::shared_ptr<::arrow::Table>* out)
>
>
> for importing Parquet data. However, when dealing with the table
> blogs.parquet
> <https://github.com/apache/arrow-testing/blob/master/data/parquet/generated_simple_numerics/blogs.parquet>
> I came across a problem: the number of fields of the table (when querying
> the import object) was 2, but when I tried to import the 2 fields (putting
> column_indices as {0,1} in C++), it only returned the first field. The
> reason seems to be that the first field is a struct with 2 sub elements,
> and the parquet reader takes into account the sub elements of the fields
> when it chooses the fields to output.
> For reference, here is the structure of the table that pyarrow returns :
> pyarrow.Table
> reply: struct<reply_id: int32 not null, next_id: int32>
>   child 0, reply_id: int32 not null
>   child 1, next_id: int32
> blog_id: int64
>
> So my question is:
> Is that the intended behaviour (the Parquet reader treating column_indices
> as referring to sub fields)? In that case I think it is a bit
> inconsistent with what is done with
>
> Result<std::shared_ptr<RecordBatch>> SelectColumns(
>       const std::vector<int>& indices) const;
>
> from the RecordBatch class.
> In the code we also see (parquet/arrow/reader.h line 208):
>
> /// The indicated column indices are relative to the schema
>
> which would mean that this is not the intended behaviour.
> So is that normal and how to import only certain fields (at the higher
> level, not sub fields) ?
>
> Best regards,
> Louis Calot
>
