Re: [C++] Need an example on how to extract data from a column of type Array(int64) with multiple chunks

David Li Mon, 21 Nov 2022 14:05:45 -0800

Hi Arthur,

Sorry for the late reply. Is it possible to provide examples of each kind of 
file? I can try to take a look; at the very least, the behavior here seems 
confusing.


-David

On Mon, Nov 21, 2022, at 15:19, Arthur Passos wrote:
> Hi guys.
> 
> I could not find written evidence that both shared and non-shared 
> *arrow::ListArray::values* can co-exist, but that seems to be the case since 
> I have files that trigger both cases. If any of you have evidence that 
> supports this, or that shows it is not accurate, I would appreciate it.
> 
> In any case, what I ended up doing is checking whether the offsets are 
> zero-based or not. If the former, that means *arrow::ListArray::values* is 
> not shared across chunks. If the latter, it is shared. This leads to the 
> following logic for *getNested* and *getOffsets*:
> 
> *getNested:*
> Loop over all chunks and call *arrow::ListArray::Flatten* to properly slice 
> based on offsets. This avoids duplicated data in case 
> *arrow::ListArray::values()* is shared.
> 
> *getOffsets:*
> Use a variable, start_offset, to track the current base offset. Loop through 
> all chunks and check whether the chunk's offsets are zero-based. If they are, 
> start_offset is updated to the last offset collected. Each offset is then 
> stored as follows:
> 
> auto offset = arrow_offsets.Value(i);
> offsets_data.emplace_back(start_offset + offset);
> 
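The getOffsets logic described above can be sketched without Arrow, using plain vectors in place of each chunk's offsets array. This is only a sketch under stated assumptions: `mergeChunkOffsets` and its input layout are hypothetical names made up for illustration, not code from the linked pull request, and it assumes that a chunk whose offsets are not zero-based continues exactly where the previous chunk ended.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: merges per-chunk list offsets into one global offsets
// vector. Each inner vector stands in for one chunk's offsets array
// (number of lists + 1 entries). The leading offset of each chunk is not
// stored, matching the convention used elsewhere in this thread.
std::vector<uint64_t> mergeChunkOffsets(const std::vector<std::vector<int32_t>> & chunks)
{
    std::vector<uint64_t> merged;
    uint64_t start_offset = 0;

    for (const auto & offsets : chunks)
    {
        if (offsets.empty())
            continue;

        // Zero-based chunk: its values are not shared with the previous
        // chunk, so continue counting from the last offset collected so far.
        if (offsets.front() == 0 && !merged.empty())
            start_offset = merged.back();

        for (std::size_t i = 1; i < offsets.size(); ++i)
            merged.push_back(start_offset + static_cast<uint64_t>(offsets[i]));
    }
    return merged;
}
```

With two chunks of two lists each, both `{{0, 2, 4}, {0, 2, 4}}` (values not shared) and `{{0, 2, 4}, {4, 6, 8}}` (values shared) produce the same merged offsets. An alternative that does not depend on the zero check would be to subtract each chunk's first offset before adding the last offset collected.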
> 
> Full code can be found in: https://github.com/ClickHouse/ClickHouse/pull/43297
> 
> 
> Best,
> Arthur
> 
> 
> 
> *From:* Arthur Passos <[email protected]>
> *Sent:* Wednesday, November 16, 2022 16:30
> *To:* [email protected] <[email protected]>
> *Cc:* Alan Souza <[email protected]>
> *Subject:* RE: [C++] Need an example on how to extract data from a column of 
> type Array(int64) with multiple chunks 
>  
> Hi Niranda,
> 
> Yes, the offsets are properly set, and if I call 
> *arrow::ListArray::Flatten()*, it slices based on those offsets and the data 
> is "correct". The problem is that this is not always the case: I have just 
> tested against a much simpler test Parquet file and this logic doesn't apply. 
> There, the *arrow::ListArray::values* member is not shared across chunks and 
> the offsets are all zero-based. The file that triggers the former case 
> contains confidential data, but the latter is generated with the Python 
> script below:
> 
> import pyarrow as pa
> import pyarrow.parquet as pq
> arr = pa.array([[1, 2] for i in range(70000)])
> table  = pa.table([arr], ["arr"])
> pq.write_table(table, "a-test.parquet")
> 
> So it looks like arrow::ListArray::values might or might not be shared across 
> chunks. If it is shared, the offsets are not zero-based. If it is not shared, 
> the offsets are zero-based. I get the impression that this is an 
> implementation detail and that I am running into these problems because I am 
> accessing "low-level APIs". If so, what would be the proper, reliable way to 
> extract the offsets and the nested column data when the type is not known at 
> compile time AND the column might contain multiple chunks?
> 
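One way to see why slicing by offsets copes with both layouts: a chunk's lists only ever refer to the range between its first and last offset in its values array, whether or not that array is shared. The sketch below uses plain vectors. `Chunk`, `flattenChunk` and `flattenAll` are hypothetical stand-ins rather than Arrow types, and null lists are ignored for simplicity.

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for one list chunk: a (possibly shared) values buffer
// plus the chunk's own offsets (number of lists + 1 entries).
struct Chunk
{
    std::shared_ptr<std::vector<int64_t>> values;
    std::vector<int32_t> offsets;
};

// Returns only the values this chunk's lists refer to, i.e. the range
// [offsets.front(), offsets.back()) of its values buffer.
std::vector<int64_t> flattenChunk(const Chunk & chunk)
{
    if (chunk.offsets.empty())
        return {};
    auto begin = chunk.values->begin() + chunk.offsets.front();
    auto end = chunk.values->begin() + chunk.offsets.back();
    return std::vector<int64_t>(begin, end);
}

// Concatenates the flattened values of all chunks, in chunk order.
std::vector<int64_t> flattenAll(const std::vector<Chunk> & chunks)
{
    std::vector<int64_t> result;
    for (const auto & chunk : chunks)
    {
        auto part = flattenChunk(chunk);
        result.insert(result.end(), part.begin(), part.end());
    }
    return result;
}
```

Two chunks that share one four-element buffer with offsets `{0, 2}` and `{2, 4}`, and two chunks that each own a two-element buffer with offsets `{0, 2}`, flatten to the same four values, with nothing duplicated.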
> 
> I already shared above how I am extracting the arrow nested column from an 
> arrow list column. For reference, the below method is the one used to extract 
> the offsets. It starts at index 1 because I do not store 0 offsets.
> 
> auto readOffsetsFromArrowListColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
> {
>     std::vector<uint64_t> offsets;
> 
>     offsets.reserve(arrow_column->length());
> 
>     for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks());
>          chunk_i < num_chunks; ++chunk_i)
>     {
>         arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
>         auto arrow_offsets_array = list_chunk.offsets();
>         auto & arrow_offsets = dynamic_cast<arrow::Int32Array &>(*arrow_offsets_array);
>         for (int64_t i = 1; i < arrow_offsets.length(); ++i)
>             offsets.emplace_back(arrow_offsets.Value(i));
>     }
>     return std::make_shared<NumericColumn<uint64_t>>(std::move(offsets));
> }
> 
> Numeric column (Int64) data is extracted with the method below:
> 
> template <typename NumericType>
> static auto readNumericColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
> {
>     std::vector<NumericType> array;
> 
>     for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks());
>          chunk_i < num_chunks; ++chunk_i)
>     {
>         std::shared_ptr<arrow::Array> chunk = arrow_column->chunk(chunk_i);
>         auto chunk_length = chunk->length();
>         if (chunk_length == 0)
>             continue;
> 
>         /// buffers[0] is a null bitmap and buffers[1] are actual values
>         std::shared_ptr<arrow::Buffer> buffer = chunk->data()->buffers[1];
>         /// Skip chunk->offset() elements: a sliced array (such as one that
>         /// Flatten may return) shares its parent's buffer and starts there.
>         const auto * raw_data = reinterpret_cast<const NumericType *>(buffer->data()) + chunk->offset();
>         array.insert(array.end(), raw_data, raw_data + chunk_length);
>     }
> 
>     return std::make_shared<NumericColumn<NumericType>>(std::move(array));
> }
> 
> Last but not least, these methods are called recursively by readArrowColumn 
> below:
> 
> std::shared_ptr<Column> readArrowColumn(auto arrow_column)
> {
>     switch (arrow_column->type()->id())
>     {
>         case arrow::Type::INT64:
>         {
>             return readNumericColumn<uint64_t>(arrow_column);
>         }
>         case arrow::Type::LIST:
>         {
>             auto arrow_nested_column = getNestedArrowColumn(arrow_column);
>             auto nested_column = readArrowColumn(arrow_nested_column);
>             auto offsets_column = readOffsetsFromArrowListColumn(arrow_column);
>             return std::make_shared<ArrayColumn>(nested_column, offsets_column);
>         }
>     }
>     return nullptr;
> }
> 
> Thanks,
> Arthur
> 
> 
> *From:* Niranda Perera <[email protected]>
> *Sent:* Wednesday, November 16, 2022 12:55
> *To:* [email protected] <[email protected]>
> *Cc:* Alan Souza <[email protected]>
> *Subject:* Re: [C++] Need an example on how to extract data from a column of 
> type Array(int64) with multiple chunks 
>  
> Did you check the offsets array? As far as I understand, one way of 
> constructing chunks of list arrays is to reuse a single global values array 
> in every chunk and to have monotonically increasing offsets in the offsets 
> arrays. 
> If the offsets are all zero-based, it would be a bug. 
> 
> On Wed, Nov 16, 2022 at 9:10 AM Arthur Passos <[email protected]> wrote:
>> Hi Alan,
>> 
>> In my case, *arrow::ListArray::values* seems to point to the same memory 
>> location for all chunks. It feels like I need to offset it by the chunk 
>> offset or something like that, but that would assume the 
>> *arrow::ListArray::values* method always points to the same memory location 
>> for all chunks, which doesn't seem to be the case for other files.
>> 
>> Thanks for the ArrowWriteProperties tip.
>> 
>> Best,
>> Arthur
>> 
>> 
>> *From:* Alan Souza via user <[email protected]>
>> *Sent:* Wednesday, November 16, 2022 11:02
>> *To:* [email protected] <[email protected]>
>> *Subject:* Re: [C++] Need an example on how to extract data from a column of 
>> type Array(int64) with multiple chunks 
>>  
>> 
>> Hello Arthur. I am using something like this:
>> 
>> 
>> auto chunked_column = table->GetColumnByName(col_name);
>> // I have only one chunk but this is not a problem
>> auto listArray = std::static_pointer_cast<arrow::LargeListArray>(chunked_column->chunk(0));
>> auto array = std::static_pointer_cast<arrow::FloatArray>(listArray->values());
>> 
>> In this example I am using LargeListArray, but it is similar for 
>> ListArray.
>> 
>> Not related to your issue, but it is necessary to customize the options of 
>> ArrowWriterProperties to save all the type information, for instance:
>> 
>> parquet::ArrowWriterProperties::Builder builder;
>> builder.store_schema();
>> 
>> 
>> Without this, the Parquet file created by the Arrow library has a 
>> ListArray instead of a LargeListArray in these columns.
>> 
>> On Wednesday, 16 November 2022 at 10:39:02 GMT-3, Arthur Passos 
>> <[email protected]> wrote:
>> 
>> 
>> Hi Niranda
>> 
>> Yes, one of the columns (there are over 50 columns in this file) is of type 
>> List<Int64>. You can see that in the parquet-tools inspect output below:
>> 
>>> arthur@arthur:~/parquet-validation$ parquet-tools inspect 
>>> ~/Downloads/test_file.parquet | grep test_array_column -A 10 
>>> path: test_array_column.list.element
>>> max_definition_level: 2
>>> max_repetition_level: 1
>>> physical_type: INT64
>>> logical_type: None
>>> converted_type (legacy): NONE
>>> compression: GZIP (space_saved: 56%)
>> 
>> As far as I know, the Arrow library represents List columns with an array of 
>> offsets and one or more chunks of memory storing the nested column data. 
>> On my side, I have a very similar structure, so I would like to extract both 
>> the array of offsets and the nested column data with the least amount of 
>> copying possible.
>> 
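For readers less familiar with this layout, here is a minimal sketch of how an offsets array delimits the individual lists inside a flat values buffer. It uses plain vectors rather than Arrow arrays, and `splitLists` is a hypothetical helper introduced only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: rebuilds the individual lists from a flat values
// buffer and an offsets array with (number of lists + 1) entries.
// List i spans values[offsets[i]] up to, but not including, values[offsets[i + 1]].
std::vector<std::vector<int64_t>> splitLists(
    const std::vector<int64_t> & values, const std::vector<int32_t> & offsets)
{
    std::vector<std::vector<int64_t>> lists;
    for (std::size_t i = 0; i + 1 < offsets.size(); ++i)
        lists.emplace_back(values.begin() + offsets[i], values.begin() + offsets[i + 1]);
    return lists;
}
```

For values `{1, 2, 3, 4, 5}` and offsets `{0, 2, 2, 5}`, this yields three lists: `{1, 2}`, an empty list, and `{3, 4, 5}`. Two equal consecutive offsets encode an empty list.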
>> Best,
>> Arthur
>> 
>> 
>> 
>> *From:* Niranda Perera <[email protected]>
>> *Sent:* Wednesday, November 16, 2022 10:28
>> *To:* [email protected] <[email protected]>
>> *Subject:* Re: [C++] Need an example on how to extract data from a column of 
>> type Array(int64) with multiple chunks 
>>  
>> Hi Arthur, 
>> 
>> I'm not very clear about the use case here. Just to clarify, in your original 
>> Parquet file, do you have List<int64>-typed columns? 
>> 
>> On Wed, Nov 16, 2022 at 8:02 AM Arthur Passos <[email protected]> wrote:
>>> Hi
>>> 
>>> I am reading a Parquet file with arrow::RecordBatchReader, and the 
>>> arrow::Table returned contains columns with two chunks 
>>> (column->num_chunks() == 2). The column in question is of type 
>>> Array(Int64), although the problem is not limited to that type.
>>> 
>>> I want to extract the data (nested column data) as well as the offsets from 
>>> that column. I have found only one example 
>>> <https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121>
>>>  of Array columns and it assumes the nested type is known at compile time 
>>> AND the column has only one chunk.
>>> 
>>> I have tried to loop over the Array(Int64) column chunks and grab the 
>>> `values()` member, but for some reason, for that specific Parquet file, the 
>>> values member points to the same memory location in every chunk. Therefore, 
>>> if I do something like the below, I end up with duplicated data:
>>> 
>>> 
>>> static std::shared_ptr<arrow::ChunkedArray> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
>>> {
>>>     arrow::ArrayVector array_vector;
>>>     array_vector.reserve(arrow_column->num_chunks());
>>>     for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks());
>>>          chunk_i < num_chunks; ++chunk_i)
>>>     {
>>>         arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
>>>         std::shared_ptr<arrow::Array> chunk = list_chunk.values();
>>>         array_vector.emplace_back(std::move(chunk));
>>>     }
>>>     return std::make_shared<arrow::ChunkedArray>(array_vector);
>>> }
>>> 
>>> I can provide more info, but to keep the initial request short and simple, 
>>> I'll leave it at that.
>>> 
>>> Thanks in advance,
>>> Arthur
>> 
>> 
>> -- 
>> 
>> Niranda Perera
>> https://niranda.dev/
>> @n1r44 <https://twitter.com/N1R44>
>> 
> 
> 
> -- 
> 
> Niranda Perera
> https://niranda.dev/
> @n1r44 <https://twitter.com/N1R44>
>
