RE: [C++] Need an example on how to extract data from a column of type Array(int64) with multiple chunks

Arthur Passos Tue, 22 Nov 2022 03:03:00 -0800

Hi David,

Thanks for the response. Just recapping, I have two files that trigger two 
different cases: 1. array data is shared across chunks and 2. array data is not 
shared across chunks. "Array data" being arrow::ListArray::values. In the 
former, offsets are monotonically increasing. In the latter, they are zero 
based.


Unfortunately, the file that triggers the first case contains confidential data 
from one of our customers. I have spent a fair amount of time trying to 
generate an equivalent file, but failed to do so. For the latter case, I can 
certainly provide an example. Below is a Python script that generates it.


import pyarrow as pa
import pyarrow.parquet as pq
import random


def gen_array(offset):
    array = []
    array_length = random.randint(0, 9)
    for i in range(array_length):
        array.append(i + offset)

    return array


def gen_arrays(number_of_arrays):
    list_of_arrays = []
    for i in range(number_of_arrays):
        list_of_arrays.append(gen_array(i))
    return list_of_arrays

arr = pa.array(gen_arrays(70000))
table = pa.table([arr], ["arr"])
pq.write_table(table, "int-list-zero-based-chunked-array.parquet")


Thanks,
Arthur
________________________________
From: David Li <[email protected]>
Sent: Monday, November 21, 2022 19:05
To: dl <[email protected]>
Subject: Re: [C++] Need an example on how to extract data from a column of type 
Array(int64) with multiple chunks

Hi Arthur,

Sorry for the late reply - is it possible to provide examples of each kind of 
file? I can try to take a look, at least the behavior here seems confusing.

-David

On Mon, Nov 21, 2022, at 15:19, Arthur Passos wrote:
Hi guys.

I could not find written evidence that both shared and non shared 
arrow::ListArray::values can co-exist, but that seems to be the case since I 
have files that trigger both cases. If any of you have evidence that supports 
this or that shows this is not accurate, it'll be appreciated.

In any case, what I ended up doing is checking whether the offsets are 
zero-based or not. If the former, that means arrow::ListArray::values is not 
shared across chunks. If the latter, it is shared. This leads to the following 
logic for getNested and getOffsets:

getNested:

Loop over all chunks and call arrow::ListArray::Flatten to properly slice based 
on offsets. This will avoid duplicated data in case arrow::ListArray::values() 
is shared.
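
A minimal sketch of why flattening per chunk avoids the duplication, using only 
the standard library. ListChunk, flattenChunk and the container types are 
hypothetical stand-ins for arrow::ListArray and its child array, not Arrow API, 
and nulls are ignored. It models the slicing that Flatten performs: only the 
values covered by a chunk's first and last offsets are kept.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one arrow::ListArray chunk: `values` models the
// child array (possibly the same data in every chunk) and `offsets` models
// the chunk's int32 offsets.
struct ListChunk {
    std::vector<int64_t> values;
    std::vector<int32_t> offsets;
};

// Models the slicing done by Flatten (nulls ignored): keep only the child
// values that this chunk's offsets actually cover.
std::vector<int64_t> flattenChunk(const ListChunk & chunk) {
    assert(!chunk.offsets.empty());
    const int32_t first = chunk.offsets.front();
    const int32_t last = chunk.offsets.back();
    assert(first <= last && static_cast<std::size_t>(last) <= chunk.values.size());
    return std::vector<int64_t>(chunk.values.begin() + first, chunk.values.begin() + last);
}

// Concatenates the flattened values of every chunk.
std::vector<int64_t> getNested(const std::vector<ListChunk> & chunks) {
    std::vector<int64_t> nested;
    for (const auto & chunk : chunks) {
        const std::vector<int64_t> part = flattenChunk(chunk);
        nested.insert(nested.end(), part.begin(), part.end());
    }
    return nested;
}
```

In this model, two chunks that both carry the full values {1..6} with offsets 
{0, 2, 3} and {3, 4, 6} yield each value once, and two chunks with their own 
values and zero-based offsets yield the plain concatenation.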

getOffsets:

Use a variable, start_offset, to track the running offset. Loop through all 
chunks and check whether the chunk's first offset is zero. If it is, 
start_offset is updated to the last offset collected. Each offset is then 
stored as follows:

auto offset = arrow_offsets.Value(i);
offsets_data.emplace_back(start_offset + offset);
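
The offset handling described above can be sketched with the standard library 
alone. This is an illustration under assumptions, not the code from the pull 
request: each inner vector is a hypothetical stand-in for one chunk's offsets, 
and the function name and signature are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Each inner vector is a hypothetical stand-in for one chunk's
// arrow::ListArray offsets (int32 values, leading entry included).
std::vector<uint64_t> getOffsets(const std::vector<std::vector<int32_t>> & chunk_offsets) {
    std::vector<uint64_t> offsets_data;
    uint64_t start_offset = 0;
    for (const auto & arrow_offsets : chunk_offsets) {
        assert(!arrow_offsets.empty());
        // A zero-based chunk restarts its numbering, so shift it by the
        // last offset collected so far.
        if (arrow_offsets.front() == 0 && !offsets_data.empty())
            start_offset = offsets_data.back();
        // Index 0 is skipped because the leading offset is not stored.
        for (std::size_t i = 1; i < arrow_offsets.size(); ++i)
            offsets_data.emplace_back(start_offset + static_cast<uint64_t>(arrow_offsets[i]));
    }
    return offsets_data;
}
```

With this sketch, zero-based chunks {0, 2, 3} and {0, 1, 3} and monotonically 
increasing chunks {0, 2, 3} and {3, 4, 6} both produce the same stored offsets.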


Full code can be found in: https://github.com/ClickHouse/ClickHouse/pull/43297



Best,
Arthur


________________________________

From: Arthur Passos <[email protected]>
Sent: Wednesday, November 16, 2022 16:30
To: [email protected] <[email protected]>
Cc: Alan Souza <[email protected]>
Subject: RE: [C++] Need an example on how to extract data from a column of type 
Array(int64) with multiple chunks

Hi Niranda,

Yes, the offsets are properly set, and if I call arrow::ListArray::Flatten(), 
it'll slice based on those offsets and the data will be "correct". The problem 
is that this is not always true: I have just tested against a much simpler test 
parquet file and this logic doesn't apply. In that file, the 
arrow::ListArray::values member is not shared across all chunks and offsets are 
all zero-based. The file that triggers the former case contains confidential 
data, but the latter is generated with the Python script below:


import pyarrow as pa
import pyarrow.parquet as pq
arr = pa.array([[1, 2] for i in range(70000)])
table  = pa.table([arr], ["arr"])
pq.write_table(table, "a-test.parquet")

So it looks like arrow::ListArray::values might or might not be shared across 
chunks. If it's shared, then offsets are not zero-based. If it's not shared, 
offsets are zero-based. I am under the impression that this is an 
implementation detail and that I am facing such problems because I am accessing 
"low level APIs". If that's so, what would be the proper and reliable way to 
extract the offsets and nested column data if the type is not known at compile 
time AND the column might contain multiple chunks?


I already shared above how I am extracting the arrow nested column from an 
arrow list column. For reference, the below method is the one used to extract 
the offsets. It starts at index 1 because I do not store 0 offsets.


auto readOffsetsFromArrowListColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column) {
    std::vector<uint64_t> offsets;

    offsets.reserve(arrow_column->length());

    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
    {
        arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        auto arrow_offsets_array = list_chunk.offsets();
        auto & arrow_offsets = dynamic_cast<arrow::Int32Array &>(*arrow_offsets_array);
        for (int64_t i = 1; i < arrow_offsets.length(); ++i)
            offsets.emplace_back(arrow_offsets.Value(i));
    }
    return std::make_shared<NumericColumn<uint64_t>>(std::move(offsets));
}

Numeric column (Int64) data extraction is with the below method:


template <typename NumericType>
static auto readNumericColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    std::vector<NumericType> array;

    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
    {
        std::shared_ptr<arrow::Array> chunk = arrow_column->chunk(chunk_i);
        auto chunk_length = chunk->length();
        if (chunk_length == 0)
            continue;

        /// buffers[0] is a null bitmap and buffers[1] are actual values
        std::shared_ptr<arrow::Buffer> buffer = chunk->data()->buffers[1];
        const auto * raw_data = reinterpret_cast<const NumericType *>(buffer->data());
        array.insert(array.end(), raw_data, raw_data + chunk_length);
    }

    return std::make_shared<NumericColumn<NumericType>>(std::move(array));
}
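
One thing that may matter for the method above: if a chunk is a slice of a 
larger buffer (for example, a slice of values shared by several list chunks), 
reading buffers[1] from its start could pick up values that belong to another 
chunk. Arrow's ArrayData carries an offset field for this purpose. The sketch 
below models the idea with the standard library only; NumericChunk and 
readNumericChunks are hypothetical stand-ins, not Arrow API.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for the parts of arrow::ArrayData used here: a values
// buffer that may be shared, plus the logical offset and length of the chunk.
struct NumericChunk {
    std::shared_ptr<const std::vector<int64_t>> buffer;
    int64_t offset;
    int64_t length;
};

// Copies each chunk's logical range [offset, offset + length) into one vector.
std::vector<int64_t> readNumericChunks(const std::vector<NumericChunk> & chunks) {
    std::vector<int64_t> out;
    for (const auto & chunk : chunks) {
        if (chunk.length == 0)
            continue;
        assert(chunk.offset >= 0);
        assert(chunk.offset + chunk.length <= static_cast<int64_t>(chunk.buffer->size()));
        // Start at the chunk's logical offset, not at the start of the buffer.
        const int64_t * raw_data = chunk.buffer->data() + chunk.offset;
        out.insert(out.end(), raw_data, raw_data + chunk.length);
    }
    return out;
}
```

In this model, two chunks that share one buffer of six values, with offsets 0 
and 3 and length 3 each, are read back as the six values in order rather than 
as the first three values twice.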

Last but not least, these methods get called recursively by the below 
readArrowColumn:


std::shared_ptr<Column> readArrowColumn(auto arrow_column) {
    switch (arrow_column->type()->id()) {
        case arrow::Type::INT64:
        {
            return readNumericColumn<uint64_t>(arrow_column);
        }
        case arrow::Type::LIST:
        {
            auto arrow_nested_column = getNestedArrowColumn(arrow_column);
            auto nested_column = readArrowColumn(arrow_nested_column);
            auto offsets_column = readOffsetsFromArrowListColumn(arrow_column);
            return std::make_shared<ArrayColumn>(nested_column, offsets_column);
        }
    }
    return nullptr;

}

Thanks,
Arthur

________________________________

From: Niranda Perera <[email protected]>
Sent: Wednesday, November 16, 2022 12:55
To: [email protected] <[email protected]>
Cc: Alan Souza <[email protected]>
Subject: Re: [C++] Need an example on how to extract data from a column of type 
Array(int64) with multiple chunks

Did you check the offset array? AFAIU one way of constructing chunks of list 
arrays, is duplicating a global value array, and having monotonically 
increasing offsets in the offset arrays.
If the offsets are all zero-based, it would be a bug.

On Wed, Nov 16, 2022 at 9:10 AM Arthur Passos 
<[email protected]<mailto:[email protected]>> wrote:
Hi Alan,

In my case, arrow::ListArray::values seems to point to the same memory location 
for all chunks. It feels like I need to offset it by the chunk offset or 
something like that, but that would assume the arrow::ListArray::values method 
always points to the same memory location for all chunks, which doesn't seem to 
be the case for other files.

Thanks for the ArrowWriteProperties tip.

Best,
Arthur

________________________________

From: Alan Souza via user <[email protected]<mailto:[email protected]>>
Sent: Wednesday, November 16, 2022 11:02
To: [email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>
Subject: Re: [C++] Need an example on how to extract data from a column of type 
Array(int64) with multiple chunks


Hello Arthur. I am using something like this:


auto chunked_column = table->GetColumnByName(col_name);
// I have only one chunk but this is not a problem
auto listArray = std::static_pointer_cast<arrow::LargeListArray>(chunked_column->chunk(0));
auto array = std::static_pointer_cast<arrow::FloatArray>(listArray->values());

In this example I am using the LargeListArray, but it is similar to the 
ListArray.

Not related to your issue, but it is necessary to customize the options of the 
ArrowWriterProperties to save all the type information, for instance:

parquet::ArrowWriterProperties::Builder builder;
builder.store_schema();


Without this, the parquet file created by the arrow library has a ListArray 
instead of a LargeListArray on these columns.

On Wednesday, 16 November 2022 at 10:39:02 GMT-3, Arthur Passos 
<[email protected]<mailto:[email protected]>> wrote:


Hi Niranda

Yes, one of the columns (there are over 50 columns in this file), is of type 
List<Int64>. You can see that in the parquet-tools inspect output below:

arthur@arthur:~/parquet-validation$ parquet-tools inspect ~/Downloads/test_file.parquet | grep test_array_column -A 10
path: test_array_column.list.element
max_definition_level: 2
max_repetition_level: 1
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: GZIP (space_saved: 56%)

As far as I know, the arrow lib represents List columns with an array of 
offsets and one or more chunks of memory storing the nested column data. On 
my side, I have a very similar structure, so I would like to extract both the 
array of offsets and the nested column data with the least amount of copying 
possible.
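
To make the offsets-plus-values layout concrete, here is a minimal sketch using 
plain standard-library vectors in place of Arrow arrays. The listAt helper is 
hypothetical and ignores nulls; it only shows that list i spans the values from 
offsets[i] up to, but not including, offsets[i + 1].

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns list `i` from a flat values array and its offsets, where list i
// spans values[offsets[i]] up to (not including) values[offsets[i + 1]].
std::vector<int64_t> listAt(const std::vector<int64_t> & values,
                            const std::vector<int32_t> & offsets,
                            std::size_t i) {
    assert(i + 1 < offsets.size());
    assert(offsets[i] <= offsets[i + 1]);
    assert(static_cast<std::size_t>(offsets[i + 1]) <= values.size());
    return std::vector<int64_t>(values.begin() + offsets[i], values.begin() + offsets[i + 1]);
}
```

For values {1, 2, 3, 4, 5} and offsets {0, 2, 2, 5}, the three lists are 
{1, 2}, an empty list, and {3, 4, 5}.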

Best,
Arthur


________________________________

From: Niranda Perera <[email protected]<mailto:[email protected]>>
Sent: Wednesday, November 16, 2022 10:28
To: [email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>
Subject: Re: [C++] Need an example on how to extract data from a column of type 
Array(int64) with multiple chunks

Hi Arthur,

I'm not very clear about the usecase here. Just to clarify, in your original 
parquet file, do you have List<int64> typed columns?

On Wed, Nov 16, 2022 at 8:02 AM Arthur Passos 
<[email protected]<mailto:[email protected]>> wrote:
Hi

I am reading a parquet file with arrow::RecordBatchReader and the arrow::Table 
returned contains columns with two chunks (column->num_chunks() == 2). The 
column in question is of type Array(Int64), although the problem is not limited 
to that type.

I want to extract the data (nested column data) as well as the offsets from 
that column. I have found only one 
example<https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121>
 of Array columns and it assumes the nested type is known at compile time AND 
the column has only one chunk.

I have tried to loop over the Array(Int64) column chunks and grab the 
`values()` member, but for some reason, for that specific Parquet file, the 
values member points to the same memory location in every chunk. Therefore, if 
I do something like the below, I end up with duplicated data:



static std::shared_ptr<arrow::ChunkedArray> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column)
{
    arrow::ArrayVector array_vector;
    array_vector.reserve(arrow_column->num_chunks());
    for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i)
    {
        arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i)));
        std::shared_ptr<arrow::Array> chunk = list_chunk.values();
        array_vector.emplace_back(std::move(chunk));
    }
    return std::make_shared<arrow::ChunkedArray>(array_vector);
}

I can provide more info, but to keep the initial request short and simple, I'll 
leave it at that.

Thanks in advance,
Arthur


--

Niranda Perera
https://niranda.dev/
@n1r44<https://twitter.com/N1R44>

