Actually to answer my own question, this is likely expected behavior. What I bet is going on is the String column contains some pretty big strings and ends up being chunked itself. Then the TableBatchReader used for iteration [1] slices the list array to even out the chunks [2].
If you wanted to avoid this specifically for Parquet with the current implementation you could use ColumnReader [3] directly but I don't think we want to guarantee the exact structure of Arrays. Thanks, Micah [1] https://github.com/apache/arrow/blob/147b5c922efe19d34ef7e7cda635b7d8a07be2eb/cpp/src/parquet/arrow/reader.cc#L1043 [2] https://github.com/apache/arrow/blob/e1cf9041c482486c314a3d143f4b01d35baeaab4/cpp/src/arrow/table.cc#L661 [3] https://github.com/apache/arrow/blob/147b5c922efe19d34ef7e7cda635b7d8a07be2eb/cpp/src/parquet/arrow/reader.h#L289 On Wed, Nov 23, 2022 at 12:13 AM Micah Kornfield <[email protected]> wrote: > Hi Arthur, > >> I am reading a parquet file with arrow::RecordBatchReader and the >> arrow::Table returned contains columns with two chunks >> (column->num_chunks() == 2). The column in question, although not limited >> to, is of type Array(Int64). > > > Would it be possible to provide the code you are using to instantiate the > RecordBatchReader and then construct the Table? With the sample python > code, reading the data back in python (pq.read_table) produces a single > chunked array for me. Also my understanding of the read path of Parquet is > that it should be creating a new offset of each array it returns (with > values not shared [1]). So I think something must be doing this chunking > at a higher level but could very well be mistaken. > > Thanks, > Micah > > [1] > https://github.com/apache/arrow/blob/147b5c922efe19d34ef7e7cda635b7d8a07be2eb/cpp/src/parquet/arrow/reader.cc#L620 > > On Tue, Nov 22, 2022 at 12:22 PM Arthur Passos <[email protected]> > wrote: > >> Hello again, >> >> I talked to our customer and he was able to generate a randomized file >> that triggers the discussed situation. Therefore, I have attached both >> files so you can experiment with it. >> >> Files attached: >> >> *shared-data-monotonically-increasing-offset.parquet - *Contains a >> List(Int64) and String column. Internally, it is represented with two >> chunks. For the List(Int64) column, *arrow::ListArray::values *is shared >> across chunks and offsets are monotonically increasing. >> >> *zero-based-offsets.parquet* - Contains a List(Int64) column. It is >> represented with two chunks. *arrow::ListArray::values *is not shared >> and offsets are zero based. This file was generated using the previously >> shared python script. >> >> Thanks, >> Arthur >> >> ------------------------------ >> *De:* Arthur Passos <[email protected]> >> *Enviado:* terça-feira, 22 de novembro de 2022 08:02 >> *Para:* dl <[email protected]> >> *Assunto:* RE: [C++] Need an example on how to extract data from a >> column of type Array(int64) with multiple chunks >> >> Hi David, >> >> Thanks for the response. Just recapping, I have two files that trigger >> two different cases: 1. array data is shared across chunks and 2. array >> data is not shared across chunks. "Array data" being >> *arrow::ListArray::values*. In the former, offsets are monotonically >> increasing. In the latter, they are zero based. >> >> Unfortunately, the file that triggers the first case contains >> confidential data from one of our customers. I have spent a fair amount of >> time trying to generate one, but failed to do so. The latter, I can >> certainly provide an example. Below is a python script that'll generate it. >> >> >> import pyarrow as pa >> import pyarrow.parquet as pq >> import random >> >> >> def gen_array(offset): >> array = [] >> array_length = random.randint(0, 9) >> for i in range(array_length): >> array.append(i + offset) >> >> return array >> >> >> def gen_arrays(number_of_arrays): >> list_of_arrays = [] >> for i in range(number_of_arrays): >> list_of_arrays.append(gen_array(i)) >> return list_of_arrays >> >> arr = pa.array(gen_arrays(70000)) >> table = pa.table([arr], ["arr"]) >> pq.write_table(table, "int-list-zero-based-chunked-array.parquet") >> >> >> Thanks, >> Arthur >> ------------------------------ >> *De:* David Li <[email protected]> >> *Enviado:* segunda-feira, 21 de novembro de 2022 19:05 >> *Para:* dl <[email protected]> >> *Assunto:* Re: [C++] Need an example on how to extract data from a >> column of type Array(int64) with multiple chunks >> >> Hi Arthur, >> >> Sorry for the late reply - is it possible to provide examples of each >> kind of file? I can try to take a look, at least the behavior here seems >> confusing. >> >> -David >> >> On Mon, Nov 21, 2022, at 15:19, Arthur Passos wrote: >> >> Hi guys. >> >> I could not find written evidence that both shared and non shared >> *arrow::ListArray::values >> *can co-exist, but that seems to be the case since I have files that >> trigger both cases. If any of you have evidence that supports this or that >> shows this is not accurate, it'll be appreciated. >> >> In any case, what I ended up doing is checking whether the offsets are >> zero-based or not. If the former, that means *arrow::ListArray::values* >> is not shared across chunks. If the latter, it is shared. This leads to the >> following logic for *getNested* and *getOffsets*: >> >> *getNested:* >> >> Loop over all chunks and call *arrow::ListArray::Flatten *to properly >> slice based on offsets. This will avoid duplicated data in case >> *arrow::ListArray::values() >> *is shared. >> >> *getOffsets:* >> >> Use a variable to control current offset. Loop through all chunks and >> check if the chunk offset is zero. If it is, current_offset is updated to >> the last offset collected. Then, offset is stored as follows: auto offset >> = arrow_offsets.Value(i); offsets_data.emplace_back(start_offset + >> offset); >> >> >> Full code can be found in: >> https://github.com/ClickHouse/ClickHouse/pull/43297 >> <https://github.com/ClickHouse/ClickHouse/pull/43297> >> Flatten list type arrow chunks on parsing by arthurpassos · Pull Request >> #43297 · ClickHouse/ClickHouse >> <https://github.com/ClickHouse/ClickHouse/pull/43297> >> Changelog category (leave one): Bug Fix (user-visible misbehavior in >> official stable or prestable release) Changelog entry (a user-readable >> short description of the changes that goes to CHANGELOG... >> github.com >> ** >> >> >> Best, >> Arthur >> >> >> ------------------------------ >> >> *De:* Arthur Passos <[email protected]> >> *Enviado:* quarta-feira, 16 de novembro de 2022 16:30 >> *Para:* [email protected] <[email protected]> >> *Cc:* Alan Souza <[email protected]> >> *Assunto:* RE: [C++] Need an example on how to extract data from a >> column of type Array(int64) with multiple chunks >> >> Hi Niranda, >> >> Yes, the offsets are properly set and if call >> *arrow::ListArray::Flatten()*, it'll slice based on those offsets and >> data will be "correct". The problem is that this is not always true, I have >> just tested against a much simpler test parquet file and this logic doesn't >> apply. The *arrow::ListArray::values *member is not shared across all >> chunks and offsets are all zero-based. The file that triggers the former >> case contains confidential data, but the latter is generated with the below >> python script: >> >> import pyarrow as pa >> import pyarrow.parquet as pq >> arr = pa.array([[1, 2] for i in range(70000)]) >> table = pa.table([arr], ["arr"]) >> pq.write_table(table, "a-test.parquet") >> >> >> So it looks like arrow::ListArray::values might or might not be shared >> across chunks. If it's shared, then offsets are not zero based. If it's not >> shared, offsets are zero based. I am under the feeling this is an >> implementation detail and I am facing such problems because I am accessing >> "low level APIs"? If that's so, what would be the proper/ reliable way to >> extract the offsets and nested column data if type is not known at compile >> time AND it might contain multiple chunks. >> >> >> I already shared above how I am extracting the arrow nested column from >> an arrow list column. For reference, the below method is the one used to >> extract the offsets. It starts at index 1 because I do not store 0 offsets. >> >> auto readOffsetsFromArrowListColumn(std::shared_ptr<arrow::ChunkedArray> & >> arrow_column) { >> std::vector<uint64_t> offsets; >> >> offsets.reserve(arrow_column->length()); >> >> for (size_t chunk_i = 0, num_chunks = >> static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; >> ++chunk_i) >> { >> arrow::ListArray & list_chunk = dynamic_cast<arrow::ListArray >> &>(*(arrow_column->chunk(chunk_i))); >> auto arrow_offsets_array = list_chunk.offsets(); >> auto & arrow_offsets = dynamic_cast<arrow::Int32Array >> &>(*arrow_offsets_array); >> for (int64_t i = 1; i < arrow_offsets.length(); ++i) >> offsets.emplace_back(arrow_offsets.Value(i)); >> } >> return std::make_shared<NumericColumn<uint64_t>>(std::move(offsets)); >> } >> >> Numeric column (Int64) data extraction is with the below method: >> >> template <typename NumericType> >> static auto readNumericColumn(std::shared_ptr<arrow::ChunkedArray> & >> arrow_column) >> { >> std::vector<NumericType> array; >> >> for (size_t chunk_i = 0, num_chunks = >> static_cast<size_t>(arrow_column->num_chunks()); chunk_i < num_chunks; >> ++chunk_i) >> { >> std::shared_ptr<arrow::Array> chunk = arrow_column->chunk(chunk_i); >> auto chunk_length = chunk->length(); >> if (chunk_length == 0) >> continue; >> >> /// buffers[0] is a null bitmap and buffers[1] are actual values >> std::shared_ptr<arrow::Buffer> buffer = chunk->data()->buffers[1]; >> const auto * raw_data = reinterpret_cast<const NumericType >> *>(buffer->data()); >> array.insert(array.end(), raw_data, raw_data + chunk_length); >> } >> >> return std::make_shared<NumericColumn<NumericType>>(std::move(array)); >> } >> >> Last but not least, these methods get called recursively by the below >> readArrowColumn: >> >> std::shared_ptr<Column> readArrowColumn(auto arrow_column) { >> switch (arrow_column->type()->id()) { >> case arrow::Type::*INT64*: >> { >> return readNumericColumn<uint64_t>(arrow_column); >> } >> case arrow::Type::*LIST*: >> { >> auto arrow_nested_column = getNestedArrowColumn(arrow_column); >> auto nested_column = readArrowColumn(arrow_nested_column); >> auto offsets_column = >> readOffsetsFromArrowListColumn(arrow_column); >> return std::make_shared<ArrayColumn>(nested_column, >> offsets_column); >> } >> } >> return nullptr; >> >> } >> >> >> Thanks, >> Arthur >> >> ------------------------------ >> >> *De:* Niranda Perera <[email protected]> >> *Enviado:* quarta-feira, 16 de novembro de 2022 12:55 >> *Para:* [email protected] <[email protected]> >> *Cc:* Alan Souza <[email protected]> >> *Assunto:* Re: [C++] Need an example on how to extract data from a >> column of type Array(int64) with multiple chunks >> >> Did you check the offset array? AFAIU one way of constructing chunks of >> list arrays, is duplicating a global value array, and having monotonically >> increasing offsets in the offset arrays. >> If the offsets are all zero-based, it would be a bug. >> >> On Wed, Nov 16, 2022 at 9:10 AM Arthur Passos <[email protected]> >> wrote: >> >> Hi Alan, >> >> In my case, *arrow::ListArray::values* seems to point to the same memory >> location for all chunks. It feels like I need to offset it by the chunk >> offset or something like that, but that would assume the >> *arrow::ListArray::values* method always point to the same memory >> location for all chunks, which doesn't seem to be the case for other files. >> >> Thanks for the ArrowWriteProperties tip. >> >> Best, >> Arthur >> >> ------------------------------ >> >> *De:* Alan Souza via user <[email protected]> >> *Enviado:* quarta-feira, 16 de novembro de 2022 11:02 >> *Para:* [email protected] <[email protected]> >> *Assunto:* Re: [C++] Need an example on how to extract data from a >> column of type Array(int64) with multiple chunks >> >> >> Hello Arthur. I am using something like this: >> >> >> auto chunked_column = table->GetColumnByName(col_name); >> auto listArray = std::static_pointer_cast<arrow::LargeListArray >> >(chunked_column->chunk(0));* // I have only one chunk but this is not a >> problem* >> auto array = std::static_pointer_cast<arrow::FloatArray>(listArray-> >> values()); >> >> In this example I am using the LargeListArray but it is similar to the >> ListArray >> >> Not related to your issue. but is necessary to customize the options of >> the ArrowWriterProperties to save all the type information, for instance: >> >> parquet::ArrowWriterProperties::Builder builder; >> builder.store_schema(); >> >> >> Without this the parquet file is created by the arrow library has a >> ListArray instead of using a LargeListArray on these columns. >> >> On Wednesday, 16 November 2022 at 10:39:02 GMT-3, Arthur Passos < >> [email protected]> wrote: >> >> >> Hi Niranda >> >> Yes, one of the columns (there are over 50 columns in this file), is of >> type List<Int64>. You can see that in the parquet-tools inspect output >> below: >> >> arthur@arthur:~/parquet-validation$ parquet-tools inspect >> ~/Downloads/test_file.parquet | grep test_array_column -A 10 >> path: test_array_column.list.element >> max_definition_level: 2 >> max_repetition_level: 1 >> physical_type: INT64 >> logical_type: None >> converted_type (legacy): NONE >> compression: GZIP (space_saved: 56%) >> >> >> As far as I know, the arrow lib represents List columns with an array of >> offsets and one or more chunks of memory storing the nested column data (). >> On my side, I have a very similar structure, so I would like to extract >> both the array of offsets and the nested column data with the less amount >> of copying possible. >> >> Best, >> Arthur >> >> >> ------------------------------ >> >> *De:* Niranda Perera <[email protected]> >> *Enviado:* quarta-feira, 16 de novembro de 2022 10:28 >> *Para:* [email protected] <[email protected]> >> *Assunto:* Re: [C++] Need an example on how to extract data from a >> column of type Array(int64) with multiple chunks >> >> Hi Arthur, >> >> I'm not very clear about the usecase here. Just to clarify, in your >> original parquet file, do you have List<int64> typed columns? >> >> On Wed, Nov 16, 2022 at 8:02 AM Arthur Passos <[email protected]> >> wrote: >> >> Hi >> >> I am reading a parquet file with arrow::RecordBatchReader and the >> arrow::Table returned contains columns with two chunks >> (column->num_chunks() == 2). The column in question, although not limited >> to, is of type Array(Int64). >> >> I want to extract the data (nested column data) as well as the offsets >> from that column. I have found only one example >> <https://github.com/apache/arrow/blob/master/cpp/examples/arrow/row_wise_conversion_example.cc#L121> >> of Array columns and it assumes the nested type is known at compile >> time AND the column has only one chunk. >> >> I have tried to loop over the Array(Int64) column chunks and grab the >> `values()` member, but for some reason, for that specific Parquet file, the >> values member point to the same memory location. Therefore, if I do >> something like the below, I end up with duplicated data: >> >> >> static std::shared_ptr<arrow::ChunkedArray> >> getNestedArrowColumn(std::shared_ptr<arrow::ChunkedArray> & arrow_column) >> { arrow::ArrayVector array_vector; >> array_vector.reserve(arrow_column->num_chunks()); for (size_t chunk_i = >> 0, num_chunks = static_cast<size_t>(arrow_column->num_chunks()); chunk_i < >> num_chunks; ++chunk_i) { arrow::ListArray & list_chunk = >> dynamic_cast<arrow::ListArray &>(*(arrow_column->chunk(chunk_i))); >> std::shared_ptr<arrow::Array> chunk = list_chunk.values(); >> array_vector.emplace_back(std::move(chunk)); } return >> std::make_shared<arrow::ChunkedArray>(array_vector); >> } >> >> >> I can provide more info, but to keep the initial request short and >> simple, I'll leave it at that. >> >> Thanks in advance, >> Arthur >> >> >> >> -- >> >> Niranda Perera >> https://niranda.dev/ >> @n1r44 <https://twitter.com/N1R44> >> >> >> >> -- >> >> Niranda Perera >> https://niranda.dev/ >> @n1r44 <https://twitter.com/N1R44> >> >> >>
