Hi Ahmed,

It is valid to concatenate RecordBatches, and the process you describe seems fine.
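
For what it's worth, this is roughly the shape I would expect to work (a
minimal sketch against arrow/parquet 14, inside a function returning
anyhow::Result<()>; the file name and variable names are placeholders, not
your exact code):

    use std::fs::File;
    use arrow::record_batch::RecordBatch;
    use parquet::arrow::ArrowWriter;

    // `record_batches: Vec<RecordBatch>` read from the Parquet files,
    // `schema: SchemaRef` shared by all of them.
    let combined = record_batches
        .chunks(100)
        .map(|chunk| RecordBatch::concat(&schema, chunk))
        .collect::<Result<Vec<_>, _>>()?;

    // Writing the concatenated batches back out with ArrowWriter:
    let file = File::create("combined.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema.clone(), None)?;
    for batch in &combined {
        writer.write(batch)?;
    }
    writer.close()?;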

Your description certainly sounds as though something in `concat` is
producing incorrect RecordBatches -- would it be possible to provide more
information and file a ticket at
https://github.com/apache/arrow-rs/issues?


Andrew

p.s. I wonder if you are using `StructArray`s or `ListArray`s?
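
In case it helps to narrow things down, here is a rough sketch of a helper
(my own, not an existing API) that lists the nested columns in a schema:

    use arrow::datatypes::{DataType, Schema};

    // Return the names of the struct/list columns -- the nested types.
    fn nested_columns(schema: &Schema) -> Vec<String> {
        schema
            .fields()
            .iter()
            .filter(|f| matches!(
                f.data_type(),
                DataType::Struct(_)
                    | DataType::List(_)
                    | DataType::LargeList(_)
                    | DataType::FixedSizeList(_, _)
            ))
            .map(|f| f.name().clone())
            .collect()
    }

Knowing whether any of those are present would help narrow down where the
corruption is coming from.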


On Thu, May 19, 2022 at 4:47 AM Ahmed Riza <[email protected]> wrote:

> Hi,
>
> If we have an Arrow RecordBatch per Parquet file created via
> ParquetFileArrowReader, is it valid to concatenate these multiple batches?
>
> Let's say we have 1000 Parquet files, and have created a Vec<RecordBatch>
> containing 1000 Record Batches. What we'd like to do is take chunks of,
> say, 100 of these at a time, and concatenate them to produce a vector of 10
> Record Batches.  Something like the following:
>
>             let combined_record_batches = record_batches
>                 .chunks(100)
>                 .map(|rb_chunk| RecordBatch::concat(&schema, rb_chunk))
>                 .collect::<anyhow::Result<Vec<_>>>()?;
>
> Whilst the above works as far as concatenating them goes, we've found that
> the resulting Record Batches cannot be converted to Parquet as they seem to
> be corrupted somehow.  That is, using an ArrowWriter and writing these
> concatenated Record Batches results in panics such as the following:
>
> A thread panicked, PanicInfo { payload: Any { .. }, message: Some(index
> out of bounds: the len is 163840 but the index is 18446744073709387776),
> location: Location { file: "/home/ahmed/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-14.0.0/src/arrow/levels.rs", line: 504, col: 41 }, can_unwind: true }
>
> Thanks,
> Ahmed Riza
>
