Hi Niranda

Yes, one of the columns (there are over 50 in this file) is of type
List<Int64>. You can see that in the parquet-tools inspect output below:

arthur@arthur:~/parquet-validation$ parquet-tools inspect ~/Downloads/test_file.parquet | grep test_array_column -A 10
path: test_array_column.list.element
max_definition_level: 2
max_repetition_level: 1
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 56%)

As far as I know, the Arrow library represents List columns with an array of
offsets and one or more chunks of memory storing the nested column data. On
my side, I have a very similar structure, so I would like to extract both the
array of offsets and the nested column data with the least amount of copying
possible.
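
To make the layout I am describing concrete, here is a minimal sketch in
standard C++ only. ListColumn, RowView, num_rows and row are hypothetical
names for illustration, not Arrow types. Row i of the list column is the
range values[offsets[i]] .. values[offsets[i + 1]], so a row can be read
through a pointer and a length without copying anything:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one List<Int64> column chunk (NOT an Arrow type).
struct ListColumn {
    std::vector<std::int64_t> offsets;  // number of rows + 1 entries
    std::vector<std::int64_t> values;   // flattened nested column data
};

// Non-owning view of the elements of a single row: pointer + length, no copy.
struct RowView {
    const std::int64_t* data;
    std::size_t size;
};

// Number of rows described by the offsets array.
inline std::size_t num_rows(const ListColumn& col) {
    return col.offsets.empty() ? 0 : col.offsets.size() - 1;
}

// Returns a view of row i: the half-open range [offsets[i], offsets[i + 1]).
inline RowView row(const ListColumn& col, std::size_t i) {
    assert(i < num_rows(col));
    const auto begin = static_cast<std::size_t>(col.offsets[i]);
    const auto end = static_cast<std::size_t>(col.offsets[i + 1]);
    return RowView{col.values.data() + begin, end - begin};
}
```

With offsets {0, 2, 2, 5}, the column has three rows of lengths 2, 0 and 3.
As far as I can tell, the comparable accessors in Arrow C++ are
ListArray::value_offset(i) and ListArray::values(), but the sketch above
does not depend on them.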

Best,
Arthur

________________________________
From: Niranda Perera <[email protected]>
Sent: Wednesday, November 16, 2022 10:28
To: [email protected] <[email protected]>
Subject: Re: [C++] Need an example on how to extract data from a column of type Array(int64) with multiple chunks

Hi Arthur,

I'm not very clear about the usecase here. Just to clarify, in your original 
parquet file, do you have List<int64> typed columns?

On Wed, Nov 16, 2022 at 8:02 AM Arthur Passos <[email protected]> wrote:
Hi

I am reading a Parquet file with arrow::RecordBatchReader, and the returned
arrow::Table contains columns with two chunks (column->num_chunks() == 2). The
column in question is of type Array(Int64), although the problem is not
limited to that type.

I want to extract the data (nested column data) as well as the offsets from 
that column. I have found only one 
example<https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121>
 of Array columns and it assumes the nested type is known at compile time AND 
the column has only one chunk.

I have tried to loop over the Array(Int64) column chunks and grab the
`values()` member of each one, but for some reason, for that specific Parquet
file, the `values()` member of every chunk points to the same memory location.
Therefore, if I do something like the below, I end up with duplicated data:


static std::shared_ptr<arrow::ChunkedArray>
getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    arrow::ArrayVector array_vector;
    array_vector.reserve(arrow_column->num_chunks());
    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
    {
        arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        std::shared_ptr<arrow::Array> chunk = list_chunk.values();
        array_vector.emplace_back(std::move(chunk));
    }
    return std::make_shared<arrow::ChunkedArray>(array_vector);
}
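
One possible explanation, offered as an assumption and not a confirmed
diagnosis: if both chunks are slices of a single underlying list array, each
chunk would expose the same full child array and differ only in its offsets.
The sketch below models that situation in standard C++ only. ListChunk,
Flattened and flatten are hypothetical names, not Arrow types. For each chunk
it takes only the value range between the chunk's first and last offset and
rebases the offsets, so shared values are not duplicated:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical model of one list chunk (NOT an Arrow type).
struct ListChunk {
    // rows + 1 entries, absolute positions into *values
    std::vector<std::int64_t> offsets;
    // child data, possibly shared by several chunks
    std::shared_ptr<const std::vector<std::int64_t>> values;
};

// Result: one offsets array starting at 0 and one contiguous values array.
struct Flattened {
    std::vector<std::int64_t> offsets;
    std::vector<std::int64_t> values;
};

Flattened flatten(const std::vector<ListChunk>& chunks) {
    Flattened out;
    out.offsets.push_back(0);
    for (const ListChunk& chunk : chunks) {
        if (chunk.offsets.empty()) continue;
        // Only [first, last) of the child data belongs to this chunk.
        const std::int64_t first = chunk.offsets.front();
        const std::int64_t last = chunk.offsets.back();
        assert(0 <= first && first <= last);
        assert(static_cast<std::size_t>(last) <= chunk.values->size());
        // Position in the output where this chunk's values will start.
        const auto base = static_cast<std::int64_t>(out.values.size());
        out.values.insert(out.values.end(),
                          chunk.values->begin() + first,
                          chunk.values->begin() + last);
        // Rebase the chunk's offsets onto the output values array.
        for (std::size_t i = 1; i < chunk.offsets.size(); ++i) {
            out.offsets.push_back(base + (chunk.offsets[i] - first));
        }
    }
    return out;
}
```

This version copies each value once. To avoid even that copy, the same
first/last pair could be kept as a (pointer, length) view per chunk. In Arrow
C++, I believe the equivalent bounds come from ListArray::value_offset(0) and
ListArray::value_offset(length()), which could be passed to Array::Slice on
the result of values(), but I have not verified this against the file above.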

I can provide more info, but to keep the initial request short and simple, I'll 
leave it at that.

Thanks in advance,
Arthur


--
Niranda Perera
https://niranda.dev/
@n1r44<https://twitter.com/N1R44>
