Thank you both for the valuable insights!

> Yes, it doesn't work on any of the cloud storage systems that I've looked at 
> (eg. s3, azure, google). HDFS and local file systems work. *smile*

I'm on premises using a non-HDFS-like file system, but I'm not after appending 
to and/or editing existing files.

> If you are writing partial ORC files, you need to create a side file that has 
> the valid tail positions (as 8 byte unsigned values). This allows readers to 
> get up to the last footer without reading new rows as the footer.

Ack!
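To make sure I understand the side-file scheme: something like the following 
minimal sketch, I assume (the big-endian byte order here is my assumption based 
on Java DataOutputStream semantics; the file name and function names are 
illustrative, not the real API):

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>   // std::remove (used when re-running the sketch)
#include <fstream>
#include <string>
#include <vector>

// Append one 8-byte unsigned tail position to the side file, big-endian
// (assumed to match Java DataOutputStream semantics; verify against
// OrcAcidUtils before relying on this).
void appendTailPosition(const std::string& sideFile, uint64_t pos) {
  std::ofstream out(sideFile, std::ios::binary | std::ios::app);
  for (int shift = 56; shift >= 0; shift -= 8) {
    out.put(static_cast<char>((pos >> shift) & 0xFF));
  }
}

// Read all recorded tail positions; a reader would use the last one as the
// logical end of file instead of the physical file length.
std::vector<uint64_t> readTailPositions(const std::string& sideFile) {
  std::vector<uint64_t> positions;
  std::ifstream in(sideFile, std::ios::binary);
  char buf[8];
  while (in.read(buf, 8)) {
    uint64_t pos = 0;
    for (int i = 0; i < 8; ++i) {
      pos = (pos << 8) | static_cast<unsigned char>(buf[i]);
    }
    positions.push_back(pos);
  }
  return positions;
}
```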

> If this is for streaming purposes, why not use the AVRO file format? The 
> overhead of appending to AVRO is significantly lower compared to ORC.

TBH, I was just looking at how AVRO would work for me compared to ORC. 
According to some tests others have made [1], AVRO is indeed good for 
ingestion but has poor random read access (no predicate push-down built in), 
while for sequential access I would probably be OK. That got me thinking: use 
AVRO with a custom file/data container that embeds statistics after each 
block, which would then be used for predicate push-down. For my use case I 
would need to do this for two columns only (ns timestamp, i.e. uint64, and 
strings of max 60 chars), and I do not have a nested data structure to worry 
about. Doing that would mean I would need to come up with a custom data reader 
on the analysis side.
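The per-block statistics I have in mind would look roughly like this (a 
sketch of the idea only; BlockStats, blockMayMatch, and blocksToRead are 
hypothetical names, not any existing AVRO or ORC API):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-block statistics embedded after each AVRO block; the
// field set matches the two columns described above (ns timestamp and a
// short string).
struct BlockStats {
  uint64_t minTs;
  uint64_t maxTs;
  std::string minStr;  // lexicographic min of the string column
  std::string maxStr;  // lexicographic max
};

// Predicate push-down on the timestamp column: a block can be skipped
// entirely when its [minTs, maxTs] range misses the query range.
bool blockMayMatch(const BlockStats& s, uint64_t queryLo, uint64_t queryHi) {
  return s.maxTs >= queryLo && s.minTs <= queryHi;
}

// Scan only blocks whose stats overlap the query range; returns the indices
// of blocks that must actually be decoded.
std::vector<size_t> blocksToRead(const std::vector<BlockStats>& stats,
                                 uint64_t queryLo, uint64_t queryHi) {
  std::vector<size_t> out;
  for (size_t i = 0; i < stats.size(); ++i) {
    if (blockMayMatch(stats[i], queryLo, queryHi)) out.push_back(i);
  }
  return out;
}
```

The custom reader on the analysis side would read the stats records first and 
only decode the surviving blocks.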

FWIW, I opened https://github.com/apache/orc/pull/1455 with very early code 
that allows me to write intermediate footers. I'm able to open/scan/read data 
& metadata afterwards with the C++ and Java tools. Not all aspects of the use 
case have been tested yet.


[1] 
https://db-blog.web.cern.ch/blog/zbigniew-baranowski/2017-01-performance-comparison-different-file-formats-and-storage-engines

Thanks,
Hinko
________________________________________
From: Gang Wu <ust...@gmail.com>
Sent: Thursday, March 30, 2023 3:56:12 AM
To: user@orc.apache.org
Subject: Re: Writing in c++ and data persistence

Thanks Owen for the explanation!

If this is for streaming purposes, why not use the AVRO file format? The 
overhead of appending to AVRO is significantly lower compared to ORC.

Best,
Gang

On Thu, Mar 30, 2023 at 12:27 AM Owen O'Malley <owen.omal...@gmail.com> wrote:
On Wed, Mar 29, 2023 at 8:18 AM Hinko Kocevar <hinko.koce...@ess.eu> wrote:
>  I guess it resides in the Apache Hive code base but haven't got a chance to 
> take a look at it yet.

It does. I managed to find some references there. Thinking about it 
afterwards, it strikes me that the whole Hive ACID business is not what I need 
to deal with.

> I don't regard it as a missing feature in the C++ library, for the following 
> reasons:
>    This is an internal implementation of the Hive ACID table.

I disagree with this assessment. The Java intermediate footer was intended for 
exactly the streaming use case and it was useful for ACID. I guess I should 
clarify the website.

>    It is not a common practice to append to an existing ORC file since some 
> file systems do not support append semantics and files are immutable after 
> close.

Yes, it doesn't work on any of the cloud storage systems that I've looked at 
(eg. s3, azure, google). HDFS and local file systems work. *smile*

>    If the final close() is not called and the writer still has some buffered 
> data, the final ORC file is still malformed.

If you are writing partial ORC files, you need to create a side file that has 
the valid tail positions (as 8 byte unsigned values). This allows readers to 
get up to the last footer without reading new rows as the footer.

The Java reader takes the file end position as an optional parameter.

That sounds sensible. I'm not after the ability to edit/append to an existing 
closed file, though. My use case involves coding an application that would 
stream data from remote sources directly into ORC files (with the required 
amount of pre-buffering for column creation). What I understand so far is that 
creating ORC files this way might not be common. I suspect that data would 
usually be "logged" by Kafka and then ORC files would be written using Kafka 
as a source. I can see how doing that relieves the ORC code of possible writer 
app failures.

You probably need to handle it at a higher level, otherwise you won't be 
guaranteed to have exactly once semantics from Kafka. The intermediate footers 
allow readers to read partial files and thus the writer doesn't need to start a 
new file, but you could have similar semantics that way.

I'd suggest something like:

pick a spot to checkpoint at:
  * capture the kafka offsets
  * write intermediate footers in all of the orc files
  * append the current offset to the side file
  * if any of this fails, repeat until it works

Of course, if the topic is distributed, you need multiple kafka & ORC offsets.
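The checkpoint procedure Owen suggests could be sketched as follows (all names 
here are hypothetical stand-ins for real Kafka/ORC calls; the stub footer 
writer just succeeds immediately, whereas a real one could fail and trigger 
the retry):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One checkpoint record: Kafka offsets plus the ORC tail position, as
// appended to the side file.
struct Checkpoint {
  std::vector<uint64_t> kafkaOffsets;  // one per partition
  uint64_t orcTailPosition;            // offset of the intermediate footer
};

// Stand-in for writeIntermediateFooter() on the ORC writer: returns the
// tail position on success. Here it just succeeds and grows a fake file.
bool tryWriteIntermediateFooter(uint64_t& tailPos) {
  static uint64_t fakeLength = 0;
  fakeLength += 1024;
  tailPos = fakeLength;
  return true;
}

// One checkpoint step, retried until it succeeds, per the steps above.
Checkpoint checkpoint(const std::vector<uint64_t>& currentKafkaOffsets,
                      std::vector<Checkpoint>& sideFile) {
  Checkpoint cp;
  cp.kafkaOffsets = currentKafkaOffsets;    // 1. capture the Kafka offsets
  for (;;) {
    uint64_t tail;
    if (!tryWriteIntermediateFooter(tail))  // 2. write intermediate footers
      continue;                             // 4. on failure, repeat
    cp.orcTailPosition = tail;
    sideFile.push_back(cp);                 // 3. append offsets to side file
    return cp;
  }
}
```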


> BTW, I actually experimented with appending to an ORC file years ago and 
> found that the metadata overhead is non-negligible when many file footers 
> have been appended.

I'm questioning my attempt at trying to flush the data and create intermediate 
footers. One of the unknowns for me is knowing when to flush; flushing "at the 
right time" in order not to create runt stripes, for example. I haven't looked 
into the footer/metadata overhead yet. I'm still intrigued by how far I can 
get trying to hack this "flushing" into the C++ codebase.

FWIW, I did locate Java code that does the writing of the intermediate footer: 
WriterImpl.java:804 writeIntermediateFooter().

At a glance it looks very similar to the WriterImpl.java:767 close(), with the 
exception that a flush (fsync() equivalent?) call is made instead of close(). 
I hacked the C++ code to create a similar method (based on the C++ 
WriterImpl::close()):


  long WriterImpl::writeIntermediateFooter() {
    // Flush any buffered rows into a (possibly short) stripe first.
    if (stripeRows > 0) {
      writeStripe();
    }
    // Only rewrite the tail if new stripes were added since the last flush.
    if (stripesAtLastFlush != numStripes) {
      writeMetadata();
      writeFileFooter();
      stripesAtLastFlush = numStripes;
      writePostscript();
      lastFlushOffset = outStream->getLength();
      outStream->flush();
    }
    return stripesAtLastFlush;
  }

numStripes is incremented at the end of WriterImpl::writeStripe().

In the tests I tried to create comparable files with the Java and the C++ 
code: "struct<col1:int>", writing 10 batches * 10 elements. I can see that 
more stripes are indeed created, and the C++ orc-metadata shows reasonable 
output. The Java tools' meta command fails to gather file statistics, though. 
For some reason my C++-created ORC file now reports fileStats.size() = 22 
instead of fileStats.size() = 2 (in ReaderImpl.java ColumnStatistics[] 
deserializeStats()). It seems the C++ code accumulates the file stats for each 
stripe created with flush (10 stripes) plus one for the close, for both 
columns (2 * 11 = 22).

Make sure to pass the offset of the footer to the java meta tool. Otherwise, it 
defaults to the end of the file.


I would need to dig deeper into how writeMetadata() and writeFileFooter() 
differ between Java and C++ in order to understand what is going on. This code 
in the Java WriterImpl.java caught my eye as possibly relevant, but I have no 
idea what to make of it:

That is mostly a red herring related to the history of the development.

  private void writeMetadata() throws IOException {
    // The physical writer now has the stripe statistics, so we pass a
    // new builder in here.
    physicalWriter.writeFileMetadata(OrcProto.Metadata.newBuilder());
  }

If you have any pointers on where the culprit might lie, they would be highly 
appreciated.

Thank you for your time!

Cheers,
Hinko
________________________________________
From: Gang Wu <ust...@gmail.com>
Sent: Wednesday, March 29, 2023 3:40:11 AM
To: user@orc.apache.org
Subject: Re: Writing in c++ and data persistence

Yes, I am referring to https://orc.apache.org/docs/acid.html for appending 
additional footers to an existing ORC file. I guess it resides in the Apache 
Hive code base but haven't got a chance to take a look at it yet.

I don't regard it as a missing feature in the C++ library, for the following 
reasons:

  *   This is an internal implementation of the Hive ACID table.
  *   It is not a common practice to append to an existing ORC file since some 
file systems do not support append semantics and files are immutable after 
close.
  *   If the final close() is not called and the writer still has some buffered 
data, the final ORC file is still malformed.

BTW, I actually experimented with appending to an ORC file years ago and found 
that the metadata overhead is non-negligible when many file footers have been 
appended.

That said, if you think this feature is a good use case for you, you are 
welcome to submit a PR to add it.

Best,
Gang

On Tue, Mar 28, 2023 at 8:10 PM Hinko Kocevar <hinko.koce...@ess.eu> wrote:
Thanks for explaining Gang!

> That works and is used for the Hive ACID table. However, the C++ writer does 
> not implement this yet.

Seems that would solve the issues I'm worrying about. I guess you are 
referring to the second-to-last paragraph in 
https://orc.apache.org/docs/acid.html. Is it safe to assume that the missing 
support in C++ is due to lack of manpower/need, and not a technical issue? Is 
there any plan to have this in the C++ codebase in the future?

I'm interested to see how that works in the Java writer (I assume that Hive is 
using that implementation). Looking at OrcAcidUtils.java and then 
FileDump.java I can see how the "_flush_length" side file is utilized. I can't 
find any code that actually writes the offsets to the "_flush_length" file, 
though. I would also be interested in seeing when and how the preliminary 
footers are written.

Thanks,
Hinko
________________________________________
From: Gang Wu <ust...@gmail.com>
Sent: Tuesday, March 28, 2023 9:29:31 AM
To: user@orc.apache.org
Subject: Re: Writing in c++ and data persistence

Hi Hinko,

Please see my inline answers below:

On Tue, Mar 28, 2023 at 3:16 PM Hinko Kocevar <hinko.koce...@ess.eu> wrote:
Hi,

I have a couple of questions about the persistence and consistency of data 
written to a file. In my use case I generally expect the data rate to be high 
enough that I can write sized (1 GB or more) ORC files in a short (less than 
60 seconds) amount of time. There could be occasions where the data stream is 
significantly reduced. What I'm afraid of is having a substantial amount of 
data already in an opened ORC file that is still being written to, albeit now 
slowly, and risking losing that already "written" data in case the writer 
process dies (file not cleanly closed). I would still like to have large files 
even if the data rate slows, and I am willing to wait for the data to 
accumulate up to the desired file size.

I'm specifically interested in the C++ writer.

How does the ORC writer handle the situation where some data has been written 
to the file, and then the writer process dies?

In what state is such file? Can its contents be recovered?

If close() is not called on the writer, the writer is in an abnormal state and 
data cannot be recovered.


How is data persisted to the file; is the data buffered in the ORC library or 
in the OS or directly written to the file?

The ORC file consists of one or more stripes. The data will be buffered in 
memory and flushed to disk or the remote file system once the buffered size 
reaches a certain threshold (e.g. an estimated size of 128MB after 
compression), and a new stripe will be created and buffered for incoming 
writes.

That said, if close() is not called then the ORC file does not get a footer, 
which is vital to read all stripes. The data will be lost anyway.
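The buffering and footer behavior can be illustrated with a stripped-down 
sketch in plain C++ (ToyStripeWriter and the byte threshold are illustrative 
stand-ins for the real ORC writer, which estimates post-compression size):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of stripe buffering: rows accumulate in memory, and a new
// stripe is flushed once the buffered size crosses the configured threshold.
class ToyStripeWriter {
 public:
  explicit ToyStripeWriter(size_t stripeThresholdBytes)
      : threshold_(stripeThresholdBytes) {}

  void addRow(size_t rowBytes) {
    buffered_ += rowBytes;
    if (buffered_ >= threshold_) flushStripe();
  }

  // Without close(), any still-buffered rows and the footer that indexes
  // the stripes never reach the file -- which is why data is lost if the
  // writer dies before close().
  void close() {
    if (buffered_ > 0) flushStripe();
    footerWritten_ = true;
  }

  size_t stripesFlushed() const { return stripes_.size(); }
  bool readable() const { return footerWritten_; }

 private:
  void flushStripe() {
    stripes_.push_back(buffered_);
    buffered_ = 0;
  }
  size_t threshold_;
  size_t buffered_ = 0;
  std::vector<size_t> stripes_;  // flushed stripe sizes
  bool footerWritten_ = false;
};
```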


Can a file be "temporarily closed" as a precaution, on demand (i.e. to achieve 
consistency on read), but then still be written to until the desired file size 
is achieved and it is closed for good? I'm imagining intermediate "footers" 
that would be superseded by the final footer.

That works and is used for the Hive ACID table. However, the C++ writer does 
not implement this yet.


Thank you,
Hinko


Best,
Gang
