Bump. I would really appreciate it if someone could help me with this.
Cheers,
Sivaprasanna

On Sun, Apr 5, 2020 at 9:28 PM Sivaprasanna <sivaprasanna...@gmail.com> wrote:

> Hello,
>
> Context: I am working on a solution to enable bulk writing for the ORC
> format in Apache Flink[1], a stream processing framework.
>
> The scenario is this: Flink receives elements/records (which could be of
> any Java type) one by one, and we want to write them in bulk to get the
> actual benefit of ORC. To solve this, I have tried two approaches:
>
> 1. As and when an element is received, convert that single element into a
> VectorizedRowBatch and call writer.addRowBatch(rowBatch). This happens for
> every incoming record, meaning a new VectorizedRowBatch is created per
> record and added via addRowBatch() one by one.
> 2. As and when an element is received, add it to a list. When we want to
> write, create a single VectorizedRowBatch, iterate over the list, transform
> each record into ColumnVectors, and add them all to that same
> VectorizedRowBatch.
>
> With both approaches, the records ended up in one stripe and all the
> records were intact in the output ORC files. I also wasn't able to find any
> significant difference in file size between the two approaches. So I want
> to understand the differences and trade-offs between these two approaches.
> Are there any differences w.r.t. compression between them?
>
> [1] https://issues.apache.org/jira/browse/FLINK-10114
>
> Thanks,
> Sivaprasanna
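For reference, the buffering pattern in approach 2 can be sketched generically in plain Java. This is only a minimal sketch of the record-buffering logic: the `RowBuffer` class and its flush callback are hypothetical stand-ins (not ORC API); in real code the callback would populate the ColumnVectors of a reused VectorizedRowBatch and call writer.addRowBatch(batch) once per full batch, rather than once per record as in approach 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper illustrating approach 2: buffer incoming records and
// flush them as one batch once the buffer reaches the batch size, so the
// writer sees full batches instead of one-row batches.
class RowBuffer<T> {
    private final int maxBatchSize;          // e.g. 1024, the typical default row-batch size
    private final List<T> pending = new ArrayList<>();
    private final Consumer<List<T>> flusher; // stand-in for: fill ColumnVectors + writer.addRowBatch(batch)
    private int flushes = 0;

    RowBuffer(int maxBatchSize, Consumer<List<T>> flusher) {
        this.maxBatchSize = maxBatchSize;
        this.flusher = flusher;
    }

    // Called once per incoming element; triggers a flush only when a full
    // batch has accumulated.
    void add(T record) {
        pending.add(record);
        if (pending.size() >= maxBatchSize) {
            flush();
        }
    }

    // Hands the buffered records to the flush callback as a single batch,
    // then clears the buffer. Also called once at the end for the remainder.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        flusher.accept(pending);
        pending.clear();
        flushes++;
    }

    int flushCount() {
        return flushes;
    }
}
```

With a batch size of 1024, feeding 3000 records through `add()` followed by a final `flush()` would produce three batch writes instead of 3000 single-row ones, which is the difference between the two approaches described above.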