Hi David, Thanks for the response. Just recapping, I have two files that trigger two different cases: 1. array data is shared across chunks and 2. array data is not shared across chunks. "Array data" being arrow::ListArray::values. In the former, offsets are monotonically increasing. In the latter, they are zero based.
Unfortunately, the file that triggers the first case contains confidential data from one of our customers. I have spent a fair amount of time trying to generate one, but failed to do so. The latter, I can certainly provide an example. Below is a python script that'll generate it. import pyarrow as pa import pyarrow.parquet as pq import random def gen_array(offset): array = [] array_length = random.randint(0, 9) for i in range(array_length): array.append(i + offset) return array def gen_arrays(number_of_arrays): list_of_arrays = [] for i in range(number_of_arrays): list_of_arrays.append(gen_array(i)) return list_of_arrays arr = pa.array(gen_arrays(70000)) table = pa.table([arr], ["arr"]) pq.write_table(table, "int-list-zero-based-chunked-array.parquet") Thanks, Arthur ________________________________ De: David Li <[email protected]> Enviado: segunda-feira, 21 de novembro de 2022 19:05 Para: dl <[email protected]> Assunto: Re: [C++] Need an example on how to extract data from a column of type Array(int64) with multiple chunks Hi Arthur, Sorry for the late reply - is it possible to provide examples of each kind of file? I can try to take a look, at least the behavior here seems confusing. -David On Mon, Nov 21, 2022, at 15:19, Arthur Passos wrote: Hi guys. I could not find written evidence that both shared and non shared arrow::ListArray::values can co-exist, but that seems to be the case since I have files that trigger both cases. If any of you have evidence that supports this or that shows this is not accurate, it'll be appreciated. In any case, what I ended up doing is checking whether the offsets are zero-based or not. If the former, that means arrow::ListArray::values is not shared across chunks. If the latter, it is shared. This leads to the following logic for getNested and getOffsets: getNested: Loop over all chunks and call arrow::ListArray::Flatten to properly slice based on offsets. This will avoid duplicated data in case arrow::ListArray::values() is shared. getOffsets: Use a variable to control current offset. Loop through all chunks and check if the chunk offset is zero. If it is, current_offset is updated to the last offset collected. Then, offset is stored as follows: auto offset = arrow_offsets.Value(i); offsets_data.emplace_back(start_offset + offset); Full code can be found in: https://github.com/ClickHouse/ClickHouse/pull/43297 [https://opengraph.githubassets.com/69a4bf186b1cc2a41e7b0fefe62ecea59a6bb57fa574f7b0fbb13e44aa7fbcfa/ClickHouse/ClickHouse/pull/43297]<https://github.com/ClickHouse/ClickHouse/pull/43297> Flatten list type arrow chunks on parsing by arthurpassos · Pull Request #43297 · ClickHouse/ClickHouse<https://github.com/ClickHouse/ClickHouse/pull/43297> Changelog category (leave one): Bug Fix (user-visible misbehavior in official stable or prestable release) Changelog entry (a user-readable short description of the changes that goes to CHANGELOG... github.com Best, Arthur ________________________________ De: Arthur Passos <[email protected]> Enviado: quarta-feira, 16 de novembro de 2022 16:30 Para: [email protected] <[email protected]> Cc: Alan Souza <[email protected]> Assunto: RE: [C++] Need an example on how to extract data from a column of type Array(int64) with multiple chunks Hi Niranda, Yes, the offsets are properly set and if call arrow::ListArray::Flatten(), it'll slice based on those offsets and data will be "correct". The problem is that this is not always true, I have just tested against a much simpler test parquet file and this logic doesn't apply. The arrow::ListArray::values member is not shared across all chunks and offsets are all zero-based. The file that triggers the former case contains confidential data, but the latter is generated with the below python script: import pyarrow as pa import pyarrow.parquet as pq arr = pa.array([[1, 2] for i in range(70000)]) table = pa.table([arr], ["arr"]) pq.write_table(table, "a-test.parquet") So it looks like arrow::ListArray::values might or might not be shared across chunks. If it's shared, then offsets are not zero based. If it's not shared, offsets are zero based. I am under the feeling this is an implementation detail and I am facing such problems because I am accessing "low level APIs"? If that's so, what would be the proper/ reliable way to extract the offsets and nested column data if type is not known at compile time AND it might contain multiple chunks. I already shared above how I am extracting the arrow nested column from an arrow list column. For reference, the below method is the one used to extract the offsets. It starts at index 1 because I do not store 0 offsets. auto readOffsetsFromArrowListColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column) { std::vector<uint64_t> offsets; offsets.reserve(arrow_column->length()); for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i) { arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i))); auto arrow_offsets_array = list_chunk.offsets(); auto & arrow_offsets = dynamic_cast<arrow::Int32Array &>(*arrow_offsets_array); for (int64_t i = 1; i < arrow_offsets.length(); ++i) offsets.emplace_back(arrow_offsets.Value(i)); } return std::make_shared<NumericColumn<uint64_t>>(std::move(offsets)); } Numeric column (Int64) data extraction is with the below method: template <typename NumericType> static auto readNumericColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column) { std::vector<NumericType> array; for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i) { std::shared_ptr<arrow::Array> chunk = arrow_column->chunk(chunk_i); auto chunk_length = chunk->length(); if (chunk_length == 0) continue; /// buffers[0] is a null bitmap and buffers[1] are actual values std::shared_ptr<arrow::Buffer> buffer = chunk->data()->buffers[1]; const auto * raw_data = reinterpret_cast<const NumericType *>(buffer->data()); array.insert(array.end(), raw_data, raw_data + chunk_length); } return std::make_shared<NumericColumn<NumericType>>(std::move(array)); } Last but not least, these methods get called recursively by the below readArrowColumn: std::shared_ptr<Column> readArrowColumn(auto arrow_column) { switch (arrow_column->type()->id()) { case arrow::Type::INT64: { return readNumericColumn<uint64_t>(arrow_column); } case arrow::Type::LIST: { auto arrow_nested_column = getNestedArrowColumn(arrow_column); auto nested_column = readArrowColumn(arrow_nested_column); auto offsets_column = readOffsetsFromArrowListColumn(arrow_column); return std::make_shared<ArrayColumn>(nested_column, offsets_column); } } return nullptr; } Thanks, Arthur ________________________________ De: Niranda Perera <[email protected]> Enviado: quarta-feira, 16 de novembro de 2022 12:55 Para: [email protected] <[email protected]> Cc: Alan Souza <[email protected]> Assunto: Re: [C++] Need an example on how to extract data from a column of type Array(int64) with multiple chunks Did you check the offset array? AFAIU one way of constructing chunks of list arrays, is duplicating a global value array, and having monotonically increasing offsets in the offset arrays. If the offsets are all zero-based, it would be a bug. On Wed, Nov 16, 2022 at 9:10 AM Arthur Passos <[email protected]<mailto:[email protected]>> wrote: Hi Alan, In my case, arrow::ListArray::values seems to point to the same memory location for all chunks. It feels like I need to offset it by the chunk offset or something like that, but that would assume the arrow::ListArray::values method always point to the same memory location for all chunks, which doesn't seem to be the case for other files. Thanks for the ArrowWriteProperties tip. Best, Arthur ________________________________ De: Alan Souza via user <[email protected]<mailto:[email protected]>> Enviado: quarta-feira, 16 de novembro de 2022 11:02 Para: [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>> Assunto: Re: [C++] Need an example on how to extract data from a column of type Array(int64) with multiple chunks Hello Arthur. I am using something like this: auto chunked_column = table->GetColumnByName(col_name); auto listArray = std::static_pointer_cast<arrow::LargeListArray>(chunked_column->chunk(0)); // I have only one chunk but this is not a problem auto array = std::static_pointer_cast<arrow::FloatArray>(listArray->values()); In this example I am using the LargeListArray but it is similar to the ListArray Not related to your issue. but is necessary to customize the options of the ArrowWriterProperties to save all the type information, for instance: parquet::ArrowWriterProperties::Builder builder; builder.store_schema(); Without this the parquet file is created by the arrow library has a ListArray instead of using a LargeListArray on these columns. On Wednesday, 16 November 2022 at 10:39:02 GMT-3, Arthur Passos <[email protected]<mailto:[email protected]>> wrote: Hi Niranda Yes, one of the columns (there are over 50 columns in this file), is of type List<Int64>. You can see that in the parquet-tools inspect output below: arthur@arthur:~/parquet-validation$ parquet-tools inspect ~/Downloads/test_file.parquet | grep test_array_column -A 10 path: test_array_column.list.element max_definition_level: 2 max_repetition_level: 1 physical_type: INT64 logical_type: None converted_type (legacy): NONE compression: GZIP (space_saved: 56%) As far as I know, the arrow lib represents List columns with an array of offsets and one or more chunks of memory storing the nested column data (). On my side, I have a very similar structure, so I would like to extract both the array of offsets and the nested column data with the less amount of copying possible. Best, Arthur ________________________________ De: Niranda Perera <[email protected]<mailto:[email protected]>> Enviado: quarta-feira, 16 de novembro de 2022 10:28 Para: [email protected]<mailto:[email protected]> <[email protected]<mailto:[email protected]>> Assunto: Re: [C++] Need an example on how to extract data from a column of type Array(int64) with multiple chunks Hi Arthur, I'm not very clear about the usecase here. Just to clarify, in your original parquet file, do you have List<int64> typed columns? On Wed, Nov 16, 2022 at 8:02 AM Arthur Passos <[email protected]<mailto:[email protected]>> wrote: Hi I am reading a parquet file with arrow::RecordBatchReader and the arrow::Table returned contains columns with two chunks (column->num_chunks() == 2). The column in question, although not limited to, is of type Array(Int64). I want to extract the data (nested column data) as well as the offsets from that column. I have found only one example<https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121> of Array columns and it assumes the nested type is known at compile time AND the column has only one chunk. I have tried to loop over the Array(Int64) column chunks and grab the `values()` member, but for some reason, for that specific Parquet file, the values member point to the same memory location. Therefore, if I do something like the below, I end up with duplicated data: static std::shared_ptr<arrow::ChunkedArray> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column) { arrow::ArrayVector array_vector; array_vector.reserve(arrow_column->num_chunks()); for (size_t chunk_i = 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; ++chunk_i) { arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i))); std::shared_ptr<arrow::Array> chunk = list_chunk.values(); array_vector.emplace_back(std::move(chunk)); } return std::make_shared<arrow::ChunkedArray>(array_vector); } I can provide more info, but to keep the initial request short and simple, I'll leave it at that. Thanks in advance, Arthur -- Niranda Perera https://niranda.dev/ @n1r44<https://twitter.com/N1R44> -- Niranda Perera https://niranda.dev/ @n1r44<https://twitter.com/N1R44>
