Re: Writing in c++ and data persistence

Gang Wu Wed, 29 Mar 2023 18:56:31 -0700

Thanks Owen for the explanation!

If this is for streaming purposes, why not use the Avro file format? The
overhead of appending to Avro is significantly lower compared to ORC.


Best,
Gang

On Thu, Mar 30, 2023 at 12:27 AM Owen O'Malley <owen.omal...@gmail.com>
wrote:

> On Wed, Mar 29, 2023 at 8:18 AM Hinko Kocevar <hinko.koce...@ess.eu>
> wrote:
>
>> >  I guess it resides in the Apache Hive code base but haven't got a
>> chance to take a look at it yet.
>>
>> It does. I managed to find some references there. Thinking about it
>> afterwards, it strikes me that the whole Hive ACID business is not what I
>> need to deal with.
>>
>> > I don't regard it as a missing feature in the C++ for following reasons:
>> >    This is an internal implementation of the Hive ACID table.
>>
>
> I disagree with this assessment. The Java intermediate footer was intended
> for exactly the streaming use case and it was useful for ACID. I guess I
> should clarify the website.
>
>>
>
>> >    It is not a common practice to append to an existing ORC file since
>> some file systems do not support append semantics and files are immutable
>> after close.
>>
>
> Yes, it doesn't work on any of the cloud storage systems that I've looked
> at (eg. s3, azure, google). HDFS and local file systems work. *smile*
>
>
>> >    If the final close() is not called and the writer still has some
>> buffered data, the final ORC file is still malformed.
>>
>
> If you are writing partial ORC files, you need to create a side file that
> has the valid tail positions (as 8 byte unsigned values). This allows
> readers to get up to the last footer without reading new rows as the footer.
>
> The Java reader takes the file end position as an optional parameter.
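
For illustration, here is a minimal C++ sketch of such a side file (the thread later refers to one named "_flush_length"). It appends each valid tail position as an 8-byte unsigned value and reads back the last complete entry. The function names are made up for this example, and big-endian byte order is an assumption chosen to match what Java's DataOutputStream.writeLong() produces; check the Java reader before relying on it.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>

// Append one 8-byte unsigned offset to the side file.
// Byte order is big-endian (an assumption, see the note above).
bool appendTailOffset(const std::string& sideFile, std::uint64_t offset) {
  std::ofstream out(sideFile, std::ios::binary | std::ios::app);
  if (!out) return false;
  char buf[8];
  for (int i = 0; i < 8; ++i) {
    buf[i] = static_cast<char>((offset >> (56 - 8 * i)) & 0xff);
  }
  out.write(buf, sizeof(buf));
  out.flush();
  return static_cast<bool>(out);
}

// Read the last complete 8-byte entry into `offset`.
// A torn trailing write (fewer than 8 bytes) is ignored.
// Returns false if the file is missing or holds no complete entry.
bool readLastTailOffset(const std::string& sideFile, std::uint64_t& offset) {
  std::ifstream in(sideFile, std::ios::binary | std::ios::ate);
  if (!in) return false;
  const std::streamoff size = in.tellg();
  assert(size >= 0);
  const std::streamoff entries = size / 8;
  if (entries == 0) return false;
  in.seekg((entries - 1) * 8);
  unsigned char buf[8];
  in.read(reinterpret_cast<char*>(buf), sizeof(buf));
  if (in.gcount() != 8) return false;
  std::uint64_t value = 0;
  for (unsigned char b : buf) value = (value << 8) | b;
  offset = value;
  return true;
}
```

A reader would call readLastTailOffset() first and treat the returned value as the end of the file, so that bytes written after the last intermediate footer are never parsed as a footer.
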
>
>> That sounds sensible. I'm not after the ability of editing/appending to
>> the existing closed file, though. My use case involves coding an
>> application that would stream the data from remote sources directly into
>> ORC files (with required amount of pre-buffering for column creation). What
>> I understand so far is that creating ORC files this way might not be
>> common. I suspect that data would usually be "logged" by Kafka and then ORC
>> files would be written using Kafka as a source. I can see how doing that
>> relieves the ORC code of possible writer app failures.
>>
>
> You probably need to handle it at a higher level, otherwise you won't be
> guaranteed to have exactly once semantics from Kafka. The intermediate
> footers allow readers to read partial files and thus the writer doesn't
> need to start a new file, but you could have similar semantics that way.
>
> I'd suggest something like:
>
> pick a spot to checkpoint at:
>   * capture the Kafka offsets
>   * write intermediate footers in all of the ORC files
>   * append the current offset to the side file
>   * if any of this fails, repeat until it works
>
> Of course, if the topic is distributed, you need multiple kafka & ORC
> offsets.
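
The checkpoint steps above could be sketched roughly as follows in C++. Everything here is hypothetical: the CheckpointHooks struct and its three callbacks are placeholders for whatever Kafka client and ORC writer code the application uses, and none of them are part of the ORC or Kafka APIs. The sketch covers a single ORC file and bounds the retries instead of looping forever. Recording the Kafka offsets together with the ORC tail offset in the last step is one interpretation of the list, not something the thread spells out.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical hooks supplied by the application (not ORC or Kafka APIs).
struct CheckpointHooks {
  // Fills in the current Kafka offsets, one per partition being consumed.
  std::function<bool(std::vector<std::int64_t>&)> captureKafkaOffsets;
  // Writes an intermediate footer and reports the resulting ORC tail offset.
  std::function<bool(std::uint64_t&)> writeIntermediateFooter;
  // Durably records the Kafka offsets together with the ORC tail offset.
  std::function<bool(const std::vector<std::int64_t>&, std::uint64_t)> recordCheckpoint;
};

// Runs the steps in order and starts over from the top if any step fails.
// Returns the number of attempts used, or 0 if maxAttempts was exhausted.
int checkpoint(const CheckpointHooks& hooks, int maxAttempts) {
  assert(maxAttempts > 0);
  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    std::vector<std::int64_t> kafkaOffsets;
    std::uint64_t orcTail = 0;
    if (hooks.captureKafkaOffsets(kafkaOffsets) &&
        hooks.writeIntermediateFooter(orcTail) &&
        hooks.recordCheckpoint(kafkaOffsets, orcTail)) {
      return attempt;
    }
  }
  return 0;
}
```

With several ORC files or a distributed topic, the same loop would need a footer write per file and a vector of ORC tail offsets instead of a single value.
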
>
>
>> > BTW, I actually had experimented appending to ORC file years ago and
>> found that the metadata overhead is non-negligible if many file footers
>> have been appended.
>>
>> I'm questioning my attempt in trying to flush the data and create
>> intermediate footers. One of the unknowns for me is knowing when to flush;
>> flushing "at the right time" in order not to create runt stripes for
>> example. I haven't looked into the footer/metadata overhead yet. I'm still
>> intrigued in how far I can get with trying to hack this "flushing" up in
>> C++ codebase.
>>
>> FWIW, I did locate Java code that does the writing of the intermediate
>> footer: WriterImpl.java:804 writeIntermediateFooter().
>>
>> At a glance it looks very similar to the WriterImpl.java:767 close() with
>> the exception that the flush (fsync() equivalent ?) call is made instead of
>> close(). I hacked the C++ code to create a similar method (based off the
>> C++ WriterImpl::close()).
>>
>>
>>   long WriterImpl::writeIntermediateFooter() {
>>     if (stripeRows > 0) {
>>       writeStripe();
>>     }
>>     if (stripesAtLastFlush != numStripes) {
>>       writeMetadata();
>>       writeFileFooter();
>>       stripesAtLastFlush = numStripes;
>>       writePostscript();
>>       lastFlushOffset = outStream->getLength();
>>       outStream->flush();
>>     }
>>     return stripesAtLastFlush;
>>   }
>>
>> numStripes is incremented at the end of WriterImpl::writeStripe().
>>
>> In the tests I tried to create a comparable file with Java code and C++
>> code: "struct<col1:int>" and write 10 batches * 10 elements. I can see that
>> more stripes are indeed created and C++ orc-metadata shows reasonable
>> output. The Java tools meta command fails to gather file statistics,
>> though. For some reason my C++ created ORC file now reports
>> fileStats.size() = 22 instead of fileStats.size() = 2 (in ReaderImpl.java
>> ColumnStatistics[] deserializeStats()). Seems like C++ code accumulates the
>> file stats for each stripe created with flush (10 stripes) and also adds
>> one for the close (total 2 * 11 = 22).
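
The arithmetic behind that guess can be reproduced with a toy model. This is not ORC code: ToyFooter and the two functions below are invented for illustration, and the sketch only shows how a footer that keeps appending per-column statistics without clearing the old entries would reach 2 columns * 11 footer writes = 22 entries. It does not confirm that this is what the C++ writer actually does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a file footer holding one statistics entry per column.
struct ToyFooter {
  std::vector<int> statistics;
};

// Adds one entry per column on every call. When clearFirst is false, the
// entries from earlier footer writes are kept (the suspected behaviour).
void writeToyFooter(ToyFooter& footer, std::size_t columns, bool clearFirst) {
  assert(columns > 0);
  if (clearFirst) footer.statistics.clear();
  for (std::size_t c = 0; c < columns; ++c) footer.statistics.push_back(0);
}

// Number of statistics entries left after `footerWrites` footer writes.
std::size_t statsAfter(std::size_t footerWrites, std::size_t columns,
                       bool clearFirst) {
  ToyFooter footer;
  for (std::size_t i = 0; i < footerWrites; ++i) {
    writeToyFooter(footer, columns, clearFirst);
  }
  return footer.statistics.size();
}
```

Here statsAfter(11, 2, false) yields 22 and statsAfter(11, 2, true) yields 2, matching the two counts reported above for "struct<col1:int>" (root struct plus col1) with 10 flushes and one close.
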
>>
>
> Make sure to pass the offset of the footer to the java meta tool.
> Otherwise, it defaults to the end of the file.
>
>
>>
>> I would need to dig deeper into how writeMetadata() and
>> writeFileFooter() differ between Java and C++ in order to understand
>> what is going on. This code in the Java WriterImpl.java caught my eye and
>> might be relevant, but I have no idea what to make of it:
>>
>
> That is mostly a red herring related to the history of the development.
>
>>
>>   private void writeMetadata() throws IOException {
>>     // The physical writer now has the stripe statistics, so we pass a
>>     // new builder in here.
>>     physicalWriter.writeFileMetadata(OrcProto.Metadata.newBuilder());
>>   }
>>
>> If you have any pointers on where the culprit might lie it would be
>> highly appreciated.
>>
>> Thank you for your time!
>>
>> Cheers,
>> Hinko
>> ________________________________________
>> From: Gang Wu <ust...@gmail.com>
>> Sent: Wednesday, March 29, 2023 3:40:11 AM
>> To: user@orc.apache.org
>> Subject: Re: Writing in c++ and data persistence
>>
>> Yes, I am referring to https://orc.apache.org/docs/acid.html for
>> appending additional footers to an existing ORC file. I guess it resides in
>> the Apache Hive code base but haven't got a chance to take a look at it yet.
>>
>> I don't regard it as a missing feature in the C++ for following reasons:
>>
>>   *   This is an internal implementation of the Hive ACID table.
>>   *   It is not a common practice to append to an existing ORC file since
>> some file systems do not support append semantics and files are immutable
>> after close.
>>   *   If the final close() is not called and the writer still has some
>> buffered data, the final ORC file is still malformed.
>>
>> BTW, I actually had experimented appending to ORC file years ago and
>> found that the metadata overhead is non-negligible if many file footers
>> have been appended.
>>
>> That said, if you think this feature is a good use case for you, you are
>> welcome to submit a PR to add it.
>>
>> Best,
>> Gang
>>
>> On Tue, Mar 28, 2023 at 8:10 PM Hinko Kocevar <hinko.koce...@ess.eu
>> <mailto:hinko.koce...@ess.eu>> wrote:
>> Thanks for explaining Gang!
>>
>> > That works and is used for the Hive ACID table. However, the C++ writer
>> does not implement this yet.
>>
>> Seems that would solve the issues I'm worrying about. I guess you are
>> referring to the second to last paragraph in
>> https://orc.apache.org/docs/acid.html. Is it safe to assume that the
>> missing support in C++ is due to lack of manpower / need, and not a
>> technical issue? Is there any plan to have this in C++ codebase in the
>> future?
>>
>> I'm interested to see how that works in java writer (I assume that Hive
>> is using that implementation). Looking at OrcAcidUtils.java and then
>> FileDump.java I can see how the "_flush_length" side file is utilized. I
>> can't find any code that actually writes the offsets to the
>> "_flush_length" file, though. I would also be interested in seeing when and
>> how the preliminary footers are written, too.
>>
>> Thanks,
>> Hinko
>> ________________________________________
>> From: Gang Wu <ust...@gmail.com<mailto:ust...@gmail.com>>
>> Sent: Tuesday, March 28, 2023 9:29:31 AM
>> To: user@orc.apache.org<mailto:user@orc.apache.org>
>> Subject: Re: Writing in c++ and data persistence
>>
>> Hi Hinko,
>>
>> Please see my inline answers below:
>>
>> On Tue, Mar 28, 2023 at 3:16 PM Hinko Kocevar <hinko.koce...@ess.eu
>> <mailto:hinko.koce...@ess.eu><mailto:hinko.koce...@ess.eu<mailto:
>> hinko.koce...@ess.eu>>> wrote:
>> Hi,
>>
>> I have a couple of questions about the persistence and consistency of the
>> data when written to the file. In my use case I generally expect that the
>> data rate is high enough that I can write sizable (1GB or more) ORC
>> files in a short (less than 60 seconds) amount of time. There could be
>> occasions where the data stream would be significantly reduced. What
>> I'm afraid of then is having a substantial amount of data already in the
>> open ORC file that is still being written to, albeit now slowly, and risking
>> losing that already "written" data in case the writer process dies (file
>> not cleanly closed). I would still like to have large files even if the
>> data rate is slow, and I am willing to wait for the data to accumulate up to
>> the desired file size.
>>
>> I'm specifically interested in the C++ writer.
>>
>> How does the ORC writer handle the situation where some data has been
>> written to the file, and then the writer process dies?
>>
>> In what state is such file? Can its contents be recovered?
>>
>> If close() is not called on the writer, the writer is in an abnormal
>> state and data cannot be recovered.
>>
>>
>> How is data persisted to the file; is the data buffered in the ORC
>> library or in the OS or directly written to the file?
>>
>> The ORC file consists of one or more stripes. The data is buffered
>> in memory and flushed to disk or a remote file system once the buffered
>> size reaches a certain threshold (e.g. an estimated size of 128MB after
>> compression); a new stripe is then created and buffered for incoming
>> writes.
>>
>> That said, if close() is not called then the writer does not write the
>> footer, which is vital to read all stripes. The data will be lost anyway.
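
As a rough mental model of the behaviour described above, the following toy C++ class buffers incoming bytes, "flushes" a stripe once a threshold is reached, and only reports data as readable after close() has written the footer. ToyStripeWriter is invented for this illustration and is unrelated to the real ORC writer classes; the real threshold is an estimated compressed size rather than a raw byte count.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of stripe buffering; not the ORC writer.
class ToyStripeWriter {
 public:
  explicit ToyStripeWriter(std::size_t stripeThresholdBytes)
      : threshold_(stripeThresholdBytes) {
    assert(threshold_ > 0);
  }

  // Buffer incoming data and flush a stripe once the threshold is reached.
  void addRows(std::size_t bytes) {
    buffered_ += bytes;
    if (buffered_ >= threshold_) flushStripe();
  }

  // Flush whatever is still buffered and write the footer.
  void close() {
    if (buffered_ > 0) flushStripe();
    footerWritten_ = true;
  }

  // Without a footer a reader cannot locate any stripe, flushed or not.
  std::size_t readableBytes() const { return footerWritten_ ? flushed_ : 0; }
  std::size_t flushedBytes() const { return flushed_; }
  std::size_t bufferedBytes() const { return buffered_; }

 private:
  void flushStripe() {
    flushed_ += buffered_;
    buffered_ = 0;
  }

  std::size_t threshold_;
  std::size_t buffered_ = 0;
  std::size_t flushed_ = 0;
  bool footerWritten_ = false;
};
```

In this model a crash before close() leaves readableBytes() at 0 even though flushedBytes() may be large, which is the situation an intermediate footer is meant to address.
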
>>
>>
>> Can a file be "temporarily closed" as a precaution, on demand (i.e. to
>> achieve consistency on read), but then still be written to further until
>> the desired file size is achieved and closed for good? I'm imagining
>> intermediate "footers" that would be superseded by the final footer.
>>
>> That works and is used for the Hive ACID table. However, the C++ writer
>> does not implement this yet.
>>
>>
>> Thank you,
>> Hinko
>>
>>
>> Best,
>> Gang
>>
>
