Hi Arthur, Sorry for the late reply - is it possible to provide examples of each kind of file? I can try to take a look, at least the behavior here seems confusing.
-David On Mon, Nov 21, 2022, at 15:19, Arthur Passos wrote: > Hi guys. > > I could not find written evidence that both shared and non shared > *arrow::ListArray::values *can co-exist, but that seems to be the case since > I have files that trigger both cases. If any of you have evidence that > supports this or that shows this is not accurate, it'll be appreciated. > > In any case, what I ended up doing is checking whether the offsets are > zero-based or not. If the former, that means *arrow::ListArray::values* is > not shared across chunks. If the latter, it is shared. This leads to the > following logic for *getNested* and *getOffsets*: > > *getNested:* > ** > Loop over all chunks and call *arrow::ListArray::Flatten *to properly slice > based on offsets. This will avoid duplicated data in case > *arrow::ListArray::values() *is shared. > > *getOffsets:* > ** > Use a variable to control current offset. Loop through all chunks and check > if the chunk offset is zero. If it is, current_offset is updated to the last > offset collected. Then, offset is stored as follows: auto offset = > arrow_offsets.Value(i); offsets_data.emplace_back(start_offset + offset); > > > Full code can be found in: https://github.com/ClickHouse/ClickHouse/pull/43297 <https://github.com/ClickHouse/ClickHouse/pull/43297> > Flatten list type arrow chunks on parsing by arthurpassos · Pull Request > #43297 · ClickHouse/ClickHouse > <https://github.com/ClickHouse/ClickHouse/pull/43297> > Changelog category (leave one): Bug Fix (user-visible misbehavior in official > stable or prestable release) Changelog entry (a user-readable short > description of the changes that goes to CHANGELOG... > github.com > ** > > > Best, > Arthur > > > > *De:* Arthur Passos <[email protected]> > *Enviado:* quarta-feira, 16 de novembro de 2022 16:30 > *Para:* [email protected] <[email protected]> > *Cc:* Alan Souza <[email protected]> > *Assunto:* RE: [C++] Need an example on how to extract data from a column of > type Array(int64) with multiple chunks > > Hi Niranda, > > Yes, the offsets are properly set and if call *arrow::ListArray::Flatten()*,* > *it'll slice based on those offsets and data will be "correct". The problem > is that this is not always true, I have just tested against a much simpler > test parquet file and this logic doesn't apply. The *arrow::ListArray::values > *member is not shared across all chunks and offsets are all zero-based. The > file that triggers the former case contains confidential data, but the latter > is generated with the below python script: > > import pyarrow as pa > import pyarrow.parquet as pq > arr = pa.array([[1, 2] for i in range(70000)]) > table = pa.table([arr], ["arr"]) > pq.write_table(table, "a-test.parquet") > > So it looks like arrow::ListArray::values might or might not be shared across > chunks. If it's shared, then offsets are not zero based. If it's not shared, > offsets are zero based. I am under the feeling this is an implementation > detail and I am facing such problems because I am accessing "low level APIs"? > If that's so, what would be the proper/ reliable way to extract the offsets > and nested column data if type is not known at compile time AND it might > contain multiple chunks. > > > I already shared above how I am extracting the arrow nested column from an > arrow list column. For reference, the below method is the one used to extract > the offsets. It starts at index 1 because I do not store 0 offsets. > > auto readOffsetsFromArrowListColumn(std::shared_ptr<arrow::ChunkedArray> & > arrow_column) { > std::vector<uint64_t> offsets; > > offsets.reserve(arrow_column->length()); > > for (size_t chunk_i = 0, num_chunks = > static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; > ++chunk_i) > { > arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray > &>(*(arrow_column->chunk(chunk_i))); > auto arrow_offsets_array = list_chunk.offsets(); > auto & arrow_offsets = dynamic_cast<arrow::Int32Array > &>(*arrow_offsets_array); > for (int64_t i = 1; i < arrow_offsets.length(); ++i) > offsets.emplace_back(arrow_offsets.Value(i)); > } > return std::make_shared<NumericColumn<uint64_t>>(std::move(offsets)); > } > Numeric column (Int64) data extraction is with the below method: > > template <typename NumericType> > static auto readNumericColumn(std::shared_ptr<arrow::ChunkedArray> & > arrow_column) > { > std::vector<NumericType> array; > > for (size_t chunk_i = 0, num_chunks = > static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; > ++chunk_i) > { > std::shared_ptr<arrow::Array> chunk = arrow_column->chunk(chunk_i); > auto chunk_length = chunk->length(); > if (chunk_length == 0) > continue; > > /// buffers[0] is a null bitmap and buffers[1] are actual values > std::shared_ptr<arrow::Buffer> buffer = chunk->data()->buffers[1]; > const auto * raw_data = reinterpret_cast<const NumericType > *>(buffer->data()); > array.insert(array.end(), raw_data, raw_data + chunk_length); > } > > return std::make_shared<NumericColumn<NumericType>>(std::move(array)); > } > Last but not least, these methods get called recursively by the below > readArrowColumn: > > std::shared_ptr<Column> readArrowColumn(auto arrow_column) { > switch (arrow_column->type()->id()) { > case arrow::Type::*INT64*: > { > return readNumericColumn<uint64_t>(arrow_column); > } > case arrow::Type::*LIST*: > { > auto arrow_nested_column = getNestedArrowColumn(arrow_column); > auto nested_column = readArrowColumn(arrow_nested_column); > auto offsets_column = > readOffsetsFromArrowListColumn(arrow_column); > return std::make_shared<ArrayColumn>(nested_column, > offsets_column); > } > } > return nullptr; > > } > > Thanks, > Arthur > > > *De:* Niranda Perera <[email protected]> > *Enviado:* quarta-feira, 16 de novembro de 2022 12:55 > *Para:* [email protected] <[email protected]> > *Cc:* Alan Souza <[email protected]> > *Assunto:* Re: [C++] Need an example on how to extract data from a column of > type Array(int64) with multiple chunks > > Did you check the offset array? AFAIU one way of constructing chunks of list > arrays, is duplicating a global value array, and having monotonically > increasing offsets in the offset arrays. > If the offsets are all zero-based, it would be a bug. > > On Wed, Nov 16, 2022 at 9:10 AM Arthur Passos <[email protected]> wrote: >> Hi Alan, >> >> In my case, *arrow::ListArray::values* seems to point to the same memory >> location for all chunks. It feels like I need to offset it by the chunk >> offset or something like that, but that would assume the >> *arrow::ListArray::values* method always point to the same memory location >> for all chunks, which doesn't seem to be the case for other files. >> >> Thanks for the ArrowWriteProperties tip. >> >> Best, >> Arthur >> >> >> *De:* Alan Souza via user <[email protected]> >> *Enviado:* quarta-feira, 16 de novembro de 2022 11:02 >> *Para:* [email protected] <[email protected]> >> *Assunto:* Re: [C++] Need an example on how to extract data from a column of >> type Array(int64) with multiple chunks >> >> >> Hello Arthur. I am using something like this: >> >> >> auto chunked_column = table->GetColumnByName(col_name); >> auto listArray = >> std::static_pointer_cast<arrow::LargeListArray>(chunked_column->chunk(0));* >> // I have only one chunk but this is not a problem* >> auto array = >> std::static_pointer_cast<arrow::FloatArray>(listArray->values()); >> >> In this example I am using the LargeListArray but it is similar to the >> ListArray >> >> Not related to your issue. but is necessary to customize the options of the >> ArrowWriterProperties to save all the type information, for instance: >> >> parquet::ArrowWriterProperties::Builder builder; >> builder.store_schema(); >> >> >> Without this the parquet file is created by the arrow library has a >> ListArray instead of using a LargeListArray on these columns. >> >> On Wednesday, 16 November 2022 at 10:39:02 GMT-3, Arthur Passos >> <[email protected]> wrote: >> >> >> Hi Niranda >> >> Yes, one of the columns (there are over 50 columns in this file), is of type >> List<Int64>. You can see that in the parquet-tools inspect output below: >> >>> arthur@arthur:~/parquet-validation$ parquet-tools inspect >>> ~/Downloads/test_file.parquet | grep test_array_column -A 10 >>> path: test_array_column.list.element >>> max_definition_level: 2 >>> max_repetition_level: 1 >>> physical_type: INT64 >>> logical_type: None >>> converted_type (legacy): NONE >>> compression: GZIP (space_saved: 56%) >> >> As far as I know, the arrow lib represents List columns with an array of >> offsets and one or more chunks of memory storing the nested column data (). >> On my side, I have a very similar structure, so I would like to extract both >> the array of offsets and the nested column data with the less amount of >> copying possible. >> >> Best, >> Arthur >> >> >> >> *De:* Niranda Perera <[email protected]> >> *Enviado:* quarta-feira, 16 de novembro de 2022 10:28 >> *Para:* [email protected] <[email protected]> >> *Assunto:* Re: [C++] Need an example on how to extract data from a column of >> type Array(int64) with multiple chunks >> >> Hi Arthur, >> >> I'm not very clear about the usecase here. Just to clarify, in your original >> parquet file, do you have List<int64> typed columns? >> >> On Wed, Nov 16, 2022 at 8:02 AM Arthur Passos <[email protected]> wrote: >>> Hi >>> >>> I am reading a parquet file with arrow::RecordBatchReader and the >>> arrow::Table returned contains columns with two chunks >>> (column->num_chunks() == 2). The column in question, although not limited >>> to, is of type Array(Int64). >>> >>> I want to extract the data (nested column data) as well as the offsets from >>> that column. I have found only one example >>> <https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121> >>> of Array columns and it assumes the nested type is known at compile time >>> AND the column has only one chunk. >>> >>> I have tried to loop over the Array(Int64) column chunks and grab the >>> `values()` member, but for some reason, for that specific Parquet file, the >>> values member point to the same memory location. Therefore, if I do >>> something like the below, I end up with duplicated data: >>> >>> >>> static std::shared_ptr<arrow::ChunkedArray> >>> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column) >>> { >>> arrow::ArrayVector array_vector; >>> array_vector.reserve(arrow_column->num_chunks()); >>> for (size_t chunk_i = 0, num_chunks = >>> static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; >>> ++chunk_i) >>> { >>> arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray >>> &>(*(arrow_column->chunk(chunk_i))); >>> std::shared_ptr<arrow::Array> chunk = list_chunk.values(); >>> array_vector.emplace_back(std::move(chunk)); >>> } >>> return std::make_shared<arrow::ChunkedArray>(array_vector); >>> } >>> >>> I can provide more info, but to keep the initial request short and simple, >>> I'll leave it at that. >>> >>> Thanks in advance, >>> Arthur >> >> >> -- >> >> Niranda Perera >> https://niranda.dev/ >> @n1r44 <https://twitter.com/N1R44> >> > > > -- > > Niranda Perera > https://niranda.dev/ > @n1r44 <https://twitter.com/N1R44> >
