Thank you! Getting feedback about the API changes from a user would be very
helpful.

Andrew

On Mon, May 23, 2022 at 6:44 AM Ahmed Riza <[email protected]> wrote:

> Thanks Andrew.  Will take a deeper look.  Can see API changes, in
> particular around the in-memory cursor we are currently using.
>
> Also need to create a minimal Parquet file to demonstrate the issues we've
> seen.
>
> Thanks
> Ahmed.
>
> On Mon, 23 May 2022, 11:28 Andrew Lamb, <[email protected]> wrote:
>
>> Raphael has a proposed PR[1] to improve this situation.
>>
>> Ahmed, I wonder if you have a chance to add your opinion.
>>
>> [1] https://github.com/apache/arrow-rs/pull/1719
>>
>> On Sat, May 21, 2022 at 6:42 AM Andrew Lamb <[email protected]> wrote:
>>
>>> Thanks Ahmed, yes I can see that if you tried to write multiple
>>> RecordBatches to the same stream concurrently this would cause a problem.
>>>
>>> I filed [1] for the corrupt file and [2] for supporting parallel writes --
>>> if you are able to provide examples of the parallel code that compiles, as
>>> well as what you did with parquet2, that would be most helpful, either via
>>> email or directly on the ticket.
>>>
>>> Thanks again for the report,
>>> Andrew
>>>
>>> [1] https://github.com/apache/arrow-rs/issues/1717
>>> [2] https://github.com/apache/arrow-rs/issues/1718
>>>
>>>
>>>
>>>
>>> On Thu, May 19, 2022 at 9:45 AM Ahmed Riza <[email protected]> wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>> Thanks for checking. Turns out that this was my bad.  What I did
>>>> subsequently with the concatenated batches was naive and broken.
>>>>
>>>> I was attempting to build a single Parquet file from the batches in what
>>>> I thought was a parallel manner using the ArrowWriter.  I tried to
>>>> "parallelise" the following serial code:
>>>>
>>>>             let cursor = InMemoryWriteableCursor::default();
>>>>             let mut writer = ArrowWriter::try_new(cursor.clone(), schema, None)?;
>>>>             for batch in batches {
>>>>                 writer.write(batch)?;
>>>>             }
>>>>             writer.close()?;
>>>>
>>>> I realised that although the compiler accepted my incorrect parallel
>>>> version of this code, it was in fact not sound, which caused the corruption.
>>>>
>>>> I can't see a way to do this in parallel with the current
>>>> implementation.  I think parquet2 can do this, but I had trouble with
>>>> parquet2 as it couldn't handle the deeply nested Parquet we have.  Will
>>>> check further as to where parquet2 is falling over and raise it on the
>>>> parquet2 project.
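>>>>
>>>> In the meantime, the closest I can get is something like the sketch
>>>> below (untested against our real nested schema; `batch_chunks` is a
>>>> hypothetical Vec<Vec<RecordBatch>> split off the original batches):
>>>> give each chunk its own cursor and ArrowWriter on a separate thread,
>>>> which yields several independent Parquet files rather than the single
>>>> file I was after.
>>>>
>>>>             use std::thread;
>>>>
>>>>             // One writer and one buffer per chunk, so no state is
>>>>             // shared between threads.
>>>>             let handles: Vec<_> = batch_chunks
>>>>                 .into_iter()
>>>>                 .map(|chunk| {
>>>>                     let schema = schema.clone();
>>>>                     thread::spawn(move || -> anyhow::Result<Vec<u8>> {
>>>>                         let cursor = InMemoryWriteableCursor::default();
>>>>                         let mut writer =
>>>>                             ArrowWriter::try_new(cursor.clone(), schema, None)?;
>>>>                         for batch in chunk {
>>>>                             writer.write(&batch)?;
>>>>                         }
>>>>                         writer.close()?;
>>>>                         // data() copies out the bytes written; each
>>>>                         // result is a complete, standalone file.
>>>>                         Ok(cursor.data())
>>>>                     })
>>>>                 })
>>>>                 .collect();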
>>>>
>>>> Thanks,
>>>> Ahmed.
>>>>
>>>> On Thu, May 19, 2022 at 12:21 PM Andrew Lamb <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Ahmed,
>>>>>
>>>>> It is valid to concatenate batches and the process you describe seems
>>>>> fine.
>>>>>
>>>>> Your description certainly sounds as if there is something wrong with
>>>>> `concat` that is producing incorrect RecordBatches -- would it be possible
>>>>> to provide more information and file a ticket at
>>>>> https://github.com/apache/arrow-rs/issues ?
>>>>>
>>>>>
>>>>> Andrew
>>>>>
>>>>> p.s. I wonder if you are using `StructArray` or `ListArray`s?
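>>>>>
>>>>> A small self-contained reproducer along these lines would be ideal for
>>>>> the ticket (a hypothetical sketch -- the nested schema and values are
>>>>> made up, and the import paths are from the parquet 14 era, so adjust
>>>>> both to whatever actually triggers the panic for you):
>>>>>
>>>>> use std::sync::Arc;
>>>>> use arrow::array::{ArrayRef, Int32Array, StructArray};
>>>>> use arrow::datatypes::{DataType, Field, Schema};
>>>>> use arrow::record_batch::RecordBatch;
>>>>> use parquet::arrow::ArrowWriter;
>>>>> use parquet::util::cursor::InMemoryWriteableCursor;
>>>>>
>>>>> fn main() -> anyhow::Result<()> {
>>>>>     // A single nested (struct) column, since the panic comes from the
>>>>>     // nested-level computation in arrow/levels.rs.
>>>>>     let inner = Field::new("x", DataType::Int32, false);
>>>>>     let values: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
>>>>>     let strct = StructArray::from(vec![(inner.clone(), values)]);
>>>>>     let schema = Arc::new(Schema::new(vec![Field::new(
>>>>>         "s",
>>>>>         DataType::Struct(vec![inner]),
>>>>>         false,
>>>>>     )]));
>>>>>     let batch =
>>>>>         RecordBatch::try_new(schema.clone(), vec![Arc::new(strct) as ArrayRef])?;
>>>>>
>>>>>     // Concatenate two copies, then try to write the result to Parquet.
>>>>>     let combined = RecordBatch::concat(&schema, &[batch.clone(), batch])?;
>>>>>     let cursor = InMemoryWriteableCursor::default();
>>>>>     let mut writer = ArrowWriter::try_new(cursor.clone(), schema, None)?;
>>>>>     writer.write(&combined)?;
>>>>>     writer.close()?;
>>>>>     Ok(())
>>>>> }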
>>>>>
>>>>>
>>>>> On Thu, May 19, 2022 at 4:47 AM Ahmed Riza <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> If we have one Arrow RecordBatch per Parquet file, created via
>>>>>> ParquetFileArrowReader, is it valid to concatenate these batches?
>>>>>>
>>>>>> Let's say we have 1000 Parquet files and have created a Vec<RecordBatch>
>>>>>> containing 1000 RecordBatches.  What we'd like to do is take chunks of,
>>>>>> say, 100 of these at a time and concatenate them to produce a vector of
>>>>>> 10 RecordBatches.  Something like the following:
>>>>>>
>>>>>>             let combined_record_batches = record_batches
>>>>>>                 .chunks(100)
>>>>>>                 .map(|rb_chunk| {
>>>>>>                     RecordBatch::concat(&schema, rb_chunk)
>>>>>>                         .map_err(anyhow::Error::from)
>>>>>>                 })
>>>>>>                 .collect::<anyhow::Result<Vec<_>>>()?;
>>>>>>
>>>>>> Whilst the above works as far as the concatenation goes, we've found
>>>>>> that the resulting RecordBatches cannot be written to Parquet, as they
>>>>>> seem to be corrupted somehow.  That is, using an ArrowWriter to write
>>>>>> these concatenated RecordBatches results in panics such as the
>>>>>> following:
>>>>>>
>>>>>> A thread panicked, PanicInfo { payload: Any { .. }, message:
>>>>>> Some(index out of bounds: the len is 163840 but the index is
>>>>>> 18446744073709387776), location: Location { file:
>>>>>> "/home/ahmed/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet-14.0.0/src/arrow/levels.rs",
>>>>>> line: 504, col: 41 }, can_unwind: true }
>>>>>>
>>>>>> Thanks,
>>>>>> Ahmed Riza
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Ahmed Riza
>>>>
>>>
