Bump. I would really appreciate it if someone could help me with this.
Cheers,
Sivaprasanna

On Sun, Apr 5, 2020 at 9:28 PM Sivaprasanna <sivaprasanna...@gmail.com> wrote:

> Hello,
>
> Context: I am working on a solution to enable bulk writing for the ORC
> format in Apache Flink[1], a stream processing framework.
>
> The scenario is this: Flink receives elements/records (which could be of
> any Java type) one by one, and we want to write them in bulk to get the
> actual benefit of ORC. To solve this, I have tried two approaches:
>
> 1. As and when an element is received, convert that single element into a
> VectorizedRowBatch and call writer.addRowBatch(rowBatch). This happens for
> every incoming record, meaning a new VectorizedRowBatch is created per
> record and added via addRowBatch() one by one.
> 2. As and when an element is received, add it to a list. When we want to
> write, create a single VectorizedRowBatch, iterate over the list, transform
> each record into ColumnVectors, and add them all to that same
> VectorizedRowBatch.
>
> With both approaches, the records ended up in one stripe and all the
> records were intact in the output ORC files. I also wasn't able to find any
> significant difference in file size between the two approaches. So I want
> to understand the differences and trade-offs between these two approaches.
> Are there any differences w.r.t. compression between them?
>
> [1] https://issues.apache.org/jira/browse/FLINK-10114
>
> Thanks,
> Sivaprasanna
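For reference, the buffering pattern in approach 2 can be sketched generically in plain Java. This is only a minimal sketch of the record-buffering logic: the `RowBuffer` class and its flush callback are hypothetical stand-ins (not ORC API); in real code the callback would populate the ColumnVectors of a reused VectorizedRowBatch and call writer.addRowBatch(batch) once per full batch, rather than once per record as in approach 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper illustrating approach 2: buffer incoming records and
// flush them as one batch once the buffer reaches the batch size, so the
// writer sees full batches instead of one-row batches.
class RowBuffer<T> {
    private final int maxBatchSize;          // e.g. 1024, the typical default row-batch size
    private final List<T> pending = new ArrayList<>();
    private final Consumer<List<T>> flusher; // stand-in for: fill ColumnVectors + writer.addRowBatch(batch)
    private int flushes = 0;

    RowBuffer(int maxBatchSize, Consumer<List<T>> flusher) {
        this.maxBatchSize = maxBatchSize;
        this.flusher = flusher;
    }

    // Called once per incoming element; triggers a flush only when a full
    // batch has accumulated.
    void add(T record) {
        pending.add(record);
        if (pending.size() >= maxBatchSize) {
            flush();
        }
    }

    // Hands the buffered records to the flush callback as a single batch,
    // then clears the buffer. Also called once at the end for the remainder.
    void flush() {
        if (pending.isEmpty()) {
            return;
        }
        flusher.accept(pending);
        pending.clear();
        flushes++;
    }

    int flushCount() {
        return flushes;
    }
}
```

With a batch size of 1024, feeding 3000 records through `add()` followed by a final `flush()` would produce three batch writes instead of 3000 single-row ones, which is the difference between the two approaches described above.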