Raphael has proposed a PR [1] to improve this situation.

Ahmed, I wonder if you have a chance to add your opinion.

[1] https://github.com/apache/arrow-rs/pull/1719
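
In the meantime, one workaround is to give each chunk of batches its own
writer and its own in-memory buffer, so that no two threads ever share an
output stream, and only merge the results afterwards. A rough sketch of
what I mean (untested, written from memory against the parquet 14 API;
rayon provides the parallelism, and `write_chunks_in_parallel` is just an
illustrative name):

    use std::sync::Arc;
    use arrow::datatypes::Schema;
    use arrow::record_batch::RecordBatch;
    use parquet::arrow::ArrowWriter;
    use parquet::errors::Result;
    use parquet::file::writer::InMemoryWriteableCursor;
    use rayon::prelude::*;

    fn write_chunks_in_parallel(
        schema: Arc<Schema>,
        chunks: Vec<Vec<RecordBatch>>,
    ) -> Result<Vec<Vec<u8>>> {
        chunks
            .into_par_iter()
            .map(|chunk| {
                // A private cursor per chunk: no shared stream, so the
                // encoded bytes of different chunks cannot interleave.
                let cursor = InMemoryWriteableCursor::default();
                let mut writer =
                    ArrowWriter::try_new(cursor.clone(), schema.clone(), None)?;
                for batch in &chunk {
                    writer.write(batch)?;
                }
                writer.close()?;
                // Each element is a complete, self-contained Parquet file.
                Ok(cursor.data())
            })
            .collect()
    }

The catch is that this produces one Parquet file per chunk; stitching them
into a single file is still a serial step today, which is essentially what
the parallel-write ticket below is about.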

On Sat, May 21, 2022 at 6:42 AM Andrew Lamb <[email protected]> wrote:

> Thanks Ahmed, yes I can see if you tried to write multiple RecordBatches
> to the same stream concurrently this would cause a problem.
>
> I filed [1] for the corrupt file and [2] for supporting parallel write --
> if you are able to provide examples of the parallel code that compiles, as
> well as what you did with parquet2, that would be most helpful. Either via
> email or directly on the ticket.
>
> Thanks again for the report,
> Andrew
>
> [1] https://github.com/apache/arrow-rs/issues/1717
> [2] https://github.com/apache/arrow-rs/issues/1718
>
> On Thu, May 19, 2022 at 9:45 AM Ahmed Riza <[email protected]> wrote:
>
>> Hi Andrew,
>>
>> Thanks for checking. Turns out that this was my bad.  What I did
>> subsequently with the concatenated batches was naive and broken.
>>
>> I was attempting to build a single Parquet file from the batches, in
>> what I thought was a parallel manner, using the ArrowWriter.  I tried
>> to "parallelise" the following serial code:
>>
>>             let cursor = InMemoryWriteableCursor::default();
>>             let mut writer =
>>                 ArrowWriter::try_new(cursor.clone(), schema, None)?;
>>             for batch in batches {
>>                 writer.write(batch)?;
>>             }
>>             writer.close()?;
>>
>> I realised that although the compiler accepted my incorrect parallel
>> version of this code, it was in fact not sound, which caused the
>> corruption.
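>>
>> Roughly, the version that compiled but corrupted the output was shaped
>> like the following (a simplified sketch from memory): each rayon task
>> built its own ArrowWriter, but over a clone of the *same*
>> InMemoryWriteableCursor, so the bytes of several Parquet files ended up
>> interleaved in one shared buffer.
>>
>>             use parquet::errors::ParquetError;
>>             use rayon::prelude::*;
>>
>>             let cursor = InMemoryWriteableCursor::default();
>>             batches.par_iter().try_for_each(|batch| {
>>                 // Compiles: the cursor is Clone and each task has its
>>                 // own writer.  Broken: every writer appends to the
>>                 // same underlying buffer concurrently.
>>                 let mut writer =
>>                     ArrowWriter::try_new(cursor.clone(), schema.clone(), None)?;
>>                 writer.write(batch)?;
>>                 writer.close()?;
>>                 Ok::<_, ParquetError>(())
>>             })?;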
>>
>> I can't see a way to do this in parallel with the current
>> implementation.  I think parquet2 can do it, but I had trouble with
>> parquet2 as it couldn't handle the deeply nested Parquet we have.  I
>> will check further as to where parquet2 is falling over and raise it
>> on the parquet2 tracker.
>>
>> Thanks,
>> Ahmed.
>>
>> On Thu, May 19, 2022 at 12:21 PM Andrew Lamb <[email protected]>
>> wrote:
>>
>>> Hi Ahmed,
>>>
>>> It is valid to concatenate batches and the process you describe seems
>>> fine.
>>>
>>> Your description certainly sounds as if there is something wrong with
>>> `concat` that is producing incorrect RecordBatches -- would it be possible
>>> to provide more information and file a ticket in
>>> https://github.com/apache/arrow-rs/issues ?
>>>
>>>
>>> Andrew
>>>
>>> p.s. I wonder if you are using `StructArray` or `ListArray`s?
>>>
>>>
>>> On Thu, May 19, 2022 at 4:47 AM Ahmed Riza <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> If we have an Arrow RecordBatch per Parquet file created via
>>>> ParquetFileArrowReader, is it valid to concatenate these multiple batches?
>>>>
>>>> Let's say we have 1000 Parquet files and have created a
>>>> Vec<RecordBatch> containing 1000 Record Batches.  We'd like to take
>>>> chunks of, say, 100 of these at a time and concatenate them to
>>>> produce a vector of 10 Record Batches.  Something like the following:
>>>>
>>>>             let combined_record_batches = record_batches
>>>>                 .chunks(100)
>>>>                 .map(|rb_chunk| RecordBatch::concat(&schema, rb_chunk))
>>>>                 .collect::<anyhow::Result<Vec<_>>>()?;
>>>>
>>>> Whilst the above works as far as concatenating them goes, we've
>>>> found that the resulting Record Batches cannot be converted to Parquet as
>>>> they seem to be corrupted somehow.  That is, using an ArrowWriter and
>>>> writing these concatenated Record Batches results in panics such as the
>>>> following:
>>>>
>>>> A thread panicked, PanicInfo { payload: Any { .. }, message:
>>>> Some(index out of bounds: the len is 163840 but the index is
>>>> 18446744073709387776), location: Location { file:
>>>> "/home/ahmed/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-14.0.0/src/arrow/levels.rs",
>>>> line: 504, col: 41 }, can_unwind: true }
>>>>
>>>> Thanks,
>>>> Ahmed Riza
>>>>
>>>
>>
>> --
>> Ahmed Riza
>>
>
