ORC has to buffer the entire stripe in memory so that it can write the data in column order rather than row order. If you have large blobs that you can't buffer, I'd suggest writing them to a side file and storing the offsets and lengths in the ORC file. That way you can write the large blobs without spending all of your memory caching them (on either read or write).
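
A rough sketch of the side-file idea, in case it helps. It only uses the JDK and does not touch the ORC API. The class name BlobSideFile, the BlobRef pair, and the 64 KB chunk size are all made up for illustration. The assumption is that each blob arrives as an InputStream (for example from java.sql.Blob.getBinaryStream()) and that the returned offset and length would go into two bigint columns of the ORC row in place of the blob itself.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of ORC: appends blobs to a side file and
// returns the (offset, length) pair to store in the ORC row.
final class BlobSideFile implements AutoCloseable {

    // Location of one blob inside the side file.
    static final class BlobRef {
        final long offset;
        final long length;

        BlobRef(long offset, long length) {
            this.offset = offset;
            this.length = length;
        }
    }

    private final FileChannel channel;

    BlobSideFile(Path path) {
        try {
            channel = FileChannel.open(path,
                    StandardOpenOption.CREATE,
                    StandardOpenOption.READ,
                    StandardOpenOption.WRITE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Streams the blob to the end of the side file in fixed-size chunks,
    // so at most one chunk is held in memory at a time.
    BlobRef append(InputStream in) {
        try {
            long start = channel.size();
            channel.position(start);
            byte[] chunk = new byte[64 * 1024]; // assumed chunk size
            long written = 0;
            int n;
            while ((n = in.read(chunk)) != -1) {
                ByteBuffer buf = ByteBuffer.wrap(chunk, 0, n);
                while (buf.hasRemaining()) {
                    written += channel.write(buf);
                }
            }
            return new BlobRef(start, written);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads one blob back. This loads the whole blob into memory; a real
    // reader would more likely stream it out in chunks as well.
    byte[] read(BlobRef ref) {
        try {
            ByteBuffer buf = ByteBuffer.allocate(Math.toIntExact(ref.length));
            long pos = ref.offset;
            while (buf.hasRemaining()) {
                int n = channel.read(buf, pos);
                if (n < 0) {
                    throw new EOFException("side file shorter than expected");
                }
                pos += n;
            }
            return buf.array();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            channel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For each database row you would call append() on the blob stream and put the returned offset and length into the row batch you hand to the ORC writer. The stripe buffer then only holds two longs per blob, whatever the blob size.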
.. Owen

On Mon, Aug 21, 2017 at 6:44 AM, Ozsvath, Tamas (GE Corporate, consultant) <[email protected]> wrote:

> Dear Apache users,
>
> We would like to create ORC files with org.apache.orc.Writer. Our tests
> were okay until we tried creating an ORC file from a database table that
> contained blobs. We have tried changing the following settings, but
> none of them helped:
>
> org.apache.orc.OrcFile.WriterOptions:
>   bufferSize()
>   stripeSize()
>   blockSize()
>   enforceBufferSize()
>
> Is there a way to continuously populate the ORC file (flushing data out of
> memory as we go), instead of flushing data out of memory only upon
> closing the file writer? What is the best practice for creating an ORC file
> from a data source that contains blobs and can't be handled purely in memory?
>
> Any information is appreciated!
>
> Thanks,
> Tamas
