Thanks for explaining, Gang!

> That works and is used for the Hive ACID table. However, the C++ writer does
> not implement this yet.

Seems that would solve the issues I'm worrying about. I guess you are referring to the second-to-last paragraph in https://orc.apache.org/docs/acid.html.

Is it safe to assume that the missing support in C++ is due to a lack of manpower or need, and not a technical issue? Is there any plan to have this in the C++ codebase in the future?

I'm interested to see how that works in the Java writer (I assume that Hive is using that implementation). Looking at OrcAcidUtils.java and then FileDump.java, I can see how the "_flush_length" side file is utilized. I can't find any code that actually writes the offsets to the "_flush_length" file, though. I would also be interested in seeing when and how the preliminary footers are written.

Thanks,
Hinko

________________________________________
From: Gang Wu <ust...@gmail.com>
Sent: Tuesday, March 28, 2023 9:29:31 AM
To: user@orc.apache.org
Subject: Re: Writing in c++ and data persistence

Hi Hinko,

Please see my inline answers below:

On Tue, Mar 28, 2023 at 3:16 PM Hinko Kocevar <hinko.koce...@ess.eu> wrote:

> Hi,
>
> I have a couple of questions about the persistence and consistency of the data when written to the file. In my use case I generally expect the data rate to be high enough that I can write sizable (1 GB or more) ORC files in a short (less than 60 seconds) amount of time. There could be occasions where the data stream is significantly reduced. What I'm afraid of is having a substantial amount of data already in the open ORC file that is still being written to, albeit now slowly, and risking the loss of that already "written" data in case the writer process dies (file not cleanly closed). I would like to still have large files even if the data rate is slow, and I am willing to wait for the data to accumulate up to the desired file size. I'm specifically interested in the C++ writer.
>
> How does the ORC writer handle the situation where some data has been written to the file, and then the writer process dies? In what state is such a file? Can its contents be recovered?

If close() is not called on the writer, the writer is in an abnormal state and the data cannot be recovered.

> How is data persisted to the file: is the data buffered in the ORC library or in the OS, or written directly to the file?

The ORC file consists of one or more stripes. The data is buffered in memory and flushed to the disk or remote file system when the buffered size reaches a certain threshold (e.g. an estimated size of 128 MB after compression); a new stripe is then created and buffered for incoming writes. That said, if close() is not called, the ORC file does not get a footer, which is vital for reading all stripes. The data will be lost anyway.

> Can a file be "temporarily closed" as a precaution, on demand (i.e. to achieve consistency on read), but then still be written to further until the desired file size is achieved and it is closed for good? I'm imagining intermediate "footers" that would be superseded by the final footer.

That works and is used for the Hive ACID table. However, the C++ writer does not implement this yet.

> Thank you,
> Hinko

Best,
Gang
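
The behaviour Gang describes above (rows buffered in memory, stripes flushed at a threshold, a footer written only by close()) can be illustrated with a small toy model. This is only a sketch in Python using the standard library. It is not the ORC file format and not the ORC C++ or Java writer API. The names `ToyStripeWriter`, `read_all`, `write_rows`, the `TOY1` marker, the row-count threshold, and the JSON encoding are all hypothetical and were chosen for illustration.

```python
import io
import json
import struct

MAGIC = b"TOY1"  # hypothetical end-of-file marker, not ORC's


class ToyStripeWriter:
    """Toy model only: NOT the ORC format or any ORC writer API."""

    def __init__(self, sink, stripe_rows=3):
        self.sink = sink
        self.stripe_rows = stripe_rows  # flush threshold (rows, not bytes)
        self.buffer = []                # rows not yet written to the sink
        self.stripes = []               # (offset, length) of flushed stripes

    def add(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.stripe_rows:
            self._flush_stripe()

    def _flush_stripe(self):
        if not self.buffer:
            return
        payload = json.dumps(self.buffer).encode("utf-8")
        self.stripes.append((self.sink.tell(), len(payload)))
        self.sink.write(payload)
        self.buffer = []

    def close(self):
        # Flush the last partial stripe, then write the footer that
        # records where every stripe lives, followed by its length.
        self._flush_stripe()
        footer = json.dumps(self.stripes).encode("utf-8")
        self.sink.write(footer)
        self.sink.write(struct.pack(">I", len(footer)) + MAGIC)


def read_all(data):
    """Read every row back; requires the footer written by close()."""
    if len(data) < 8 or data[-4:] != MAGIC:
        raise ValueError("no footer: stripes cannot be located")
    (footer_len,) = struct.unpack(">I", data[-8:-4])
    footer = json.loads(data[-8 - footer_len:-8])
    rows = []
    for offset, length in footer:
        rows.extend(json.loads(data[offset:offset + length]))
    return rows


def write_rows(rows, close):
    """Write rows to an in-memory sink, optionally skipping close()."""
    sink = io.BytesIO()
    writer = ToyStripeWriter(sink, stripe_rows=3)
    for row in rows:
        writer.add(row)
    if close:
        writer.close()
    return sink.getvalue()
```

With `close=True`, `read_all(write_rows(range(7), close=True))` returns all seven rows. With `close=False`, two full stripes have already reached the sink, but `read_all` raises `ValueError` because there is no footer to locate them, and the seventh row never left the in-memory buffer. In this toy model, an intermediate footer of the kind Hinko asks about would amount to writing the stripe list periodically and recording elsewhere how much of the file is valid.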