Hi Hinko,

Please see my inline answers below:

On Tue, Mar 28, 2023 at 3:16 PM Hinko Kocevar <hinko.koce...@ess.eu> wrote:

> Hi,
>
> I have a couple of questions about the persistence and consistency of the
> data when written to the file. In my use case I generally expect that the
> data rate is high enough that I can write sizable (1GB or more) ORC
> files in a short (less than 60 seconds) amount of time. There could be
> occasions where the data stream is significantly reduced. What I'm
> afraid of is having a substantial amount of data already in the open
> ORC file that is still being written to, albeit now slowly, and risking
> the loss of that already "written" data in case the writer process dies
> (file not cleanly closed). I would like to still have large files even if
> the data rate is slow, and I am willing to wait for the data to accumulate
> up to the desired file size.
>
> I'm specifically interested in the C++ writer.
>
> How does the ORC writer handle the situation where some data has been
> written to the file, and then the writer process dies?
>
> In what state is such a file? Can its contents be recovered?
>

If close() is not called on the writer, the writer is in an abnormal state
and data cannot be recovered.


>
> How is data persisted to the file; is the data buffered in the ORC library
> or in the OS or directly written to the file?
>

An ORC file consists of one or more stripes. Data is buffered in memory
and flushed to disk or a remote file system once the buffered size reaches
a certain threshold (e.g. an estimated size of 128MB after compression).
A new stripe is then created to buffer incoming writes.

That said, if close() is not called, the footer is never written to the
ORC file, and the footer is vital for reading all stripes. The data will be
lost anyway, even for stripes that have already been flushed.
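
To make the stripe and footer behaviour described above concrete, here is a
minimal toy sketch in plain C++. It is NOT the ORC file format and does not
use the ORC C++ API: ToyWriter, readToyFooter and the "TOYF" marker are names
made up for illustration. It only assumes the general idea that stripes are
flushed as they fill up and that the reader starts from a footer written at
close time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <optional>
#include <string>
#include <vector>

// Toy model only: hypothetical names, not the real ORC layout or API.
class ToyWriter {
 public:
  ToyWriter(const std::string& path, std::size_t stripeBytes)
      : out_(path, std::ios::binary | std::ios::trunc),
        stripeBytes_(stripeBytes) {}

  // Rows are buffered in memory; a stripe is written once the buffer is full.
  void add(const std::string& row) {
    assert(!closed_);
    buffer_ += row;
    if (buffer_.size() >= stripeBytes_) flushStripe();
  }

  // Writes the last stripe, then the footer (one length per stripe), the
  // stripe count and a marker, so a reader can start from the end of the file.
  void close() {
    if (closed_) return;
    flushStripe();
    for (std::uint64_t len : stripeLengths_) writeU64(len);
    writeU64(stripeLengths_.size());
    out_.write("TOYF", 4);
    out_.close();
    closed_ = true;
  }

 private:
  void flushStripe() {
    if (buffer_.empty()) return;
    out_.write(buffer_.data(), static_cast<std::streamsize>(buffer_.size()));
    out_.flush();
    stripeLengths_.push_back(buffer_.size());
    buffer_.clear();
  }

  void writeU64(std::uint64_t v) {
    out_.write(reinterpret_cast<const char*>(&v), sizeof v);
  }

  std::ofstream out_;
  std::size_t stripeBytes_;
  std::string buffer_;
  std::vector<std::uint64_t> stripeLengths_;
  bool closed_ = false;
};

// Returns the stripe lengths if the file ends with a valid toy footer,
// or std::nullopt if the footer is missing (e.g. close() was never called).
std::optional<std::vector<std::uint64_t>> readToyFooter(
    const std::string& path) {
  std::ifstream in(path, std::ios::binary | std::ios::ate);
  if (!in) return std::nullopt;
  const std::streamoff size = in.tellg();
  const std::streamoff trailer = 4 + 8;  // marker + stripe count
  if (size < trailer) return std::nullopt;

  char magic[4];
  in.seekg(size - 4);
  in.read(magic, 4);
  if (!in || std::string(magic, 4) != "TOYF") return std::nullopt;

  std::uint64_t count = 0;
  in.seekg(size - trailer);
  in.read(reinterpret_cast<char*>(&count), sizeof count);
  if (!in || count > static_cast<std::uint64_t>(size - trailer) / 8) {
    return std::nullopt;
  }

  std::vector<std::uint64_t> lengths(count);
  in.seekg(size - trailer - static_cast<std::streamoff>(count * 8));
  in.read(reinterpret_cast<char*>(lengths.data()),
          static_cast<std::streamsize>(count * 8));
  if (!in) return std::nullopt;
  return lengths;
}
```

With a 16-byte stripe threshold and five 8-byte rows, a writer that calls
close() produces a file from which readToyFooter returns three stripe
lengths. A writer that is destroyed without close() leaves two stripes on
disk, but readToyFooter returns std::nullopt because there is no footer to
locate them.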


>
> Can a file be "temporarily closed" as a precaution, on demand (i.e. to
> achieve consistency on read), but then still be written to further until
> the desired file size is achieved and closed for good? I'm imagining
> intermediate "footers" that would be superseded by the final footer.
>

That approach works and is used for Hive ACID tables. However, the C++
writer does not implement it yet.


>
> Thank you,
> Hinko
>


Best,
Gang
