Hello Arthur. I am using something like this:
    auto chunked_column = table->GetColumnByName(col_name);
    // I have only one chunk, but this is not a problem
    auto listArray = std::static_pointer_cast<arrow::LargeListArray>(chunked_column->chunk(0));
    auto array = std::static_pointer_cast<arrow::FloatArray>(listArray->values());
In this example I am using LargeListArray, but it is similar for ListArray.
Not related to your issue, but it is necessary to customize the
ArrowWriterProperties to save all the type information, for instance:
    parquet::ArrowWriterProperties::Builder builder;
    builder.store_schema();
Without this, the Parquet file created by the Arrow library has a ListArray
instead of a LargeListArray on these columns.
On Wednesday, 16 November 2022 at 10:39:02 GMT-3, Arthur Passos
<[email protected]> wrote:
Hi Niranda
Yes, one of the columns (there are over 50 columns in this file) is of type
List<Int64>. You can see that in the parquet-tools inspect output below:
arthur@arthur:~/parquet-validation$ parquet-tools inspect ~/Downloads/test_file.parquet | grep test_array_column -A 10
path: test_array_column.list.element
max_definition_level: 2
max_repetition_level: 1
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 56%)
As far as I know, the arrow lib represents List columns with an array of
offsets and one or more chunks of memory storing the nested column data. On
my side, I have a very similar structure, so I would like to extract both the
array of offsets and the nested column data with the least amount of copying
possible.
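The offsets-plus-values layout described above can be sketched in plain C++ without Arrow. This is only a model to make the idea concrete: `ListColumn`, `ListView`, `num_lists`, `list_at` and `make_example_column` are hypothetical names, not Arrow types or functions, and the 32-bit offsets are an assumption borrowed from the regular (non-large) list type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a List<Int64> column. NOT an Arrow type.
struct ListColumn {
    std::vector<int32_t> offsets;  // length = number of lists + 1
    std::vector<int64_t> values;   // all nested elements, stored back to back
};

// Non-owning view of one list: a pointer into `values` plus a length.
struct ListView {
    const int64_t* data;
    std::size_t size;
};

// Number of lists stored in the column.
inline std::size_t num_lists(const ListColumn& col) {
    return col.offsets.empty() ? 0 : col.offsets.size() - 1;
}

// Return list `i` without copying any element.
inline ListView list_at(const ListColumn& col, std::size_t i) {
    assert(i < num_lists(col));
    const int32_t begin = col.offsets[i];
    const int32_t end = col.offsets[i + 1];
    return ListView{col.values.data() + begin,
                    static_cast<std::size_t>(end - begin)};
}

// Example column holding [[1, 2], [], [3, 4, 5]].
inline ListColumn make_example_column() {
    return ListColumn{{0, 2, 2, 5}, {1, 2, 3, 4, 5}};
}
```

List `i` covers `values[offsets[i]]` up to but not including `values[offsets[i + 1]]`, so an empty list is simply two equal consecutive offsets, and reading a list needs no copy.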
Best,
Arthur
From: Niranda Perera <[email protected]>
Sent: Wednesday, 16 November 2022 10:28
To: [email protected] <[email protected]>
Subject: Re: [C++] Need an example on how to extract data from a column of type
Array(int64) with multiple chunks
Hi Arthur,
I'm not very clear about the use case here. Just to clarify, in your original
parquet file, do you have List<int64> typed columns?
On Wed, Nov 16, 2022 at 8:02 AM Arthur Passos <[email protected]> wrote:
Hi
I am reading a parquet file with arrow::RecordBatchReader, and the arrow::Table
returned contains columns with two chunks (column->num_chunks() == 2). The
column in question is of type Array(Int64), although the issue is not limited
to it.
I want to extract the data (nested column data) as well as the offsets from
that column. I have found only one example of Array columns and it assumes the
nested type is known at compile time AND the column has only one chunk.
I have tried to loop over the Array(Int64) column chunks and grab the
`values()` member, but for some reason, for that specific Parquet file, the
`values()` members of all chunks point to the same memory location. Therefore,
if I do something like the below, I end up with duplicated data:
static std::shared_ptr<arrow::ChunkedArray>
getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    arrow::ArrayVector array_vector;
    array_vector.reserve(arrow_column->num_chunks());
    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
    {
        arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        std::shared_ptr<arrow::Array> chunk = list_chunk.values();
        array_vector.emplace_back(std::move(chunk));
    }
    return std::make_shared<arrow::ChunkedArray>(array_vector);
}
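One possible explanation for the duplication (an assumption, not something confirmed in this thread) is that both chunks are slices of a single parent list array. In that case every chunk would refer to the same full values buffer, and only the offsets would say which part belongs to which chunk. The sketch below models that situation in plain C++; `ListChunk`, `collect_naive`, `collect_by_offsets` and `make_example_chunks` are hypothetical names and none of this is Arrow API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical model of a list chunk that is a slice of a parent array.
// NOT an Arrow type; the names are for illustration only.
struct ListChunk {
    std::shared_ptr<std::vector<int32_t>> offsets;  // shared by all slices
    std::shared_ptr<std::vector<int64_t>> values;   // shared by all slices
    std::size_t first_list;  // index of the first list in this chunk
    std::size_t num_lists;   // number of lists in this chunk
};

// Naive approach: append the whole shared values buffer once per chunk.
inline std::vector<int64_t> collect_naive(const std::vector<ListChunk>& chunks) {
    std::vector<int64_t> out;
    for (const ListChunk& c : chunks)
        out.insert(out.end(), c.values->begin(), c.values->end());
    return out;
}

// Offset-aware approach: append only the range this chunk's offsets cover.
inline std::vector<int64_t> collect_by_offsets(const std::vector<ListChunk>& chunks) {
    std::vector<int64_t> out;
    for (const ListChunk& c : chunks) {
        assert(c.first_list + c.num_lists < c.offsets->size());
        const int32_t begin = (*c.offsets)[c.first_list];
        const int32_t end = (*c.offsets)[c.first_list + c.num_lists];
        out.insert(out.end(), c.values->begin() + begin, c.values->begin() + end);
    }
    return out;
}

// Two chunks slicing the parent [[1, 2], [3], [4, 5, 6]] after the first list.
inline std::vector<ListChunk> make_example_chunks() {
    auto offsets = std::make_shared<std::vector<int32_t>>(
        std::vector<int32_t>{0, 2, 3, 6});
    auto values = std::make_shared<std::vector<int64_t>>(
        std::vector<int64_t>{1, 2, 3, 4, 5, 6});
    return {ListChunk{offsets, values, 0, 1}, ListChunk{offsets, values, 1, 2}};
}
```

With the example chunks, `collect_naive` yields twelve elements (the six values twice), while `collect_by_offsets` yields each value once. In Arrow itself, `ListArray::value_offset(i)` and `ListArray::Flatten()` look like the relevant calls for the same idea, but please check the API documentation for your version before relying on that.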
I can provide more info, but to keep the initial request short and simple, I'll
leave it at that.
Thanks in advance,
Arthur
--
Niranda Perera
https://niranda.dev/@n1r44