Re: Write reliability in Iceberg

2020-01-28 Thread Ryan Blue
Hi Gautam, Hadoop tables are not intended to be used when the file system doesn't support atomic rename because of the problems you describe. Atomic rename is a requirement for correctness in Hadoop tables. That is why we also have metastore tables, where some other atomic swap is used. I
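For context, a minimal sketch of the rename-based commit that Hadoop tables depend on. The class name and paths below are illustrative, not the actual HadoopTableOperations code:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Simplified sketch of the commit pattern behind Hadoop tables
// (hypothetical class; not Iceberg's real implementation).
public class RenameCommitSketch {
  public static boolean commit(FileSystem fs, Path tempMetadata, int nextVersion,
                               Path tableLocation) throws IOException {
    Path finalMetadata = new Path(tableLocation, "metadata/v" + nextVersion + ".metadata.json");
    // On HDFS, rename is atomic and fails if the destination already exists,
    // so exactly one concurrent writer wins version N. On a store without
    // atomic rename (e.g. S3), two writers can both "succeed" and one commit
    // is silently lost -- the correctness problem described above.
    return fs.rename(tempMetadata, finalMetadata);
  }
}
```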

Re: Write reliability in Iceberg

2020-01-28 Thread suds
We have referred to https://iceberg.incubator.apache.org/custom-catalog/ and implemented the atomic operation using DynamoDB optimistic locking. The Iceberg codebase has an excellent test case to validate a custom implementation.
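A rough sketch of what such a conditional pointer swap can look like with the AWS SDK for Java; the catalog table name and attribute names below are hypothetical, not taken from the linked doc:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ConditionalCheckFailedException;
import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest;
import java.util.HashMap;
import java.util.Map;

// Sketch of optimistic locking for a custom catalog commit. The catalog
// table "iceberg_catalog" and its attributes are illustrative names.
public class DynamoCommitSketch {
  public static void commit(AmazonDynamoDB dynamo, String tableId,
                            String expectedLocation, String newLocation) {
    Map<String, AttributeValue> key = new HashMap<>();
    key.put("table_id", new AttributeValue(tableId));

    Map<String, AttributeValue> values = new HashMap<>();
    values.put(":expected", new AttributeValue(expectedLocation));
    values.put(":new", new AttributeValue(newLocation));

    UpdateItemRequest request = new UpdateItemRequest()
        .withTableName("iceberg_catalog")
        .withKey(key)
        .withUpdateExpression("SET metadata_location = :new")
        // The swap succeeds only if no other writer moved the pointer first.
        .withConditionExpression("metadata_location = :expected")
        .withExpressionAttributeValues(values);
    try {
      dynamo.updateItem(request);
    } catch (ConditionalCheckFailedException e) {
      // Another commit won the race; re-read the table and retry.
      throw new IllegalStateException("Commit failed: concurrent update", e);
    }
  }
}
```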

Re: Write reliability in Iceberg

2020-01-28 Thread Ryan Blue
Thanks for pointing out those references, suds! And thanks to Mouli (for writing the doc) and Anton (for writing the test)! On Tue, Jan 28, 2020 at 2:05 PM suds wrote: > We have referred https://iceberg.incubator.apache.org/custom-catalog/ and > implemented atomic operation using dynamo

Re: Iceberg tombstone?

2020-01-28 Thread Ryan Blue
> Sorting seems to condition deletes at least, right? Sorting is an optimization. Equality deletes, like id = 10, can be implemented by keeping a hash set of deleted values or by merging two sorted lists. The latter option doesn't require a lot of memory for a large set of deletes, which is why
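A minimal sketch of the two strategies described above, applied to an equality delete like `id = 10`; all names here are illustrative:

```java
import java.util.Iterator;
import java.util.List;
import java.util.Set;

// Two ways to apply equality deletes to a stream of rows (sketch only).
public class EqualityDeleteSketch {
  // Strategy 1: hash set of deleted values. O(1) per row, but the whole
  // delete set must fit in memory.
  static boolean isLiveHashed(long rowId, Set<Long> deletedIds) {
    return !deletedIds.contains(rowId);
  }

  // Strategy 2: merge two sorted streams. Only one delete value is held at
  // a time, so a large delete set needs very little memory -- the point made
  // in the message above.
  static void emitLiveMerged(Iterator<Long> sortedRowIds,
                             Iterator<Long> sortedDeletedIds, List<Long> out) {
    Long nextDelete = sortedDeletedIds.hasNext() ? sortedDeletedIds.next() : null;
    while (sortedRowIds.hasNext()) {
      long rowId = sortedRowIds.next();
      // Advance the delete stream until its head is >= the current row id.
      while (nextDelete != null && nextDelete < rowId) {
        nextDelete = sortedDeletedIds.hasNext() ? sortedDeletedIds.next() : null;
      }
      if (nextDelete == null || nextDelete != rowId) {
        out.add(rowId); // row survives: no matching delete value
      }
    }
  }
}
```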

help/best place to store arbitrary snapshot metadata

2020-01-28 Thread Dabeluchi Ndubisi
Hi, We would like to store snapshot metadata that is necessary for producing/consuming incremental data. An example of this is the maximum value of an event timeline that we have processed so far, so that we know where to read from next. Some of the possible options that we have discovered

Re: help/best place to store arbitrary snapshot metadata

2020-01-28 Thread Ryan Blue
Hi Dabby, I think your assessment is right. - Table metadata isn’t versioned with snapshots and is a good fit for table-level configuration; it sounds like what you need is additional information about a snapshot, so table properties don’t make sense. - Using upper and lower bounds
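If your Iceberg version exposes SnapshotUpdate.set, a snapshot summary entry can carry this kind of per-snapshot watermark. A sketch, where the "watermark" key is a hypothetical choice:

```java
import org.apache.iceberg.DataFile;
import org.apache.iceberg.Table;

// Sketch: store the max processed event timestamp in the snapshot summary,
// so it is versioned with the snapshot (unlike table properties).
public class SnapshotSummarySketch {
  static void commitWithWatermark(Table table, DataFile file, long maxEventTs) {
    table.newAppend()
        .appendFile(file)
        .set("watermark", String.valueOf(maxEventTs)) // written to the snapshot summary
        .commit();
  }

  static long readWatermark(Table table) {
    return Long.parseLong(table.currentSnapshot().summary().get("watermark"));
  }
}
```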

Write reliability in Iceberg

2020-01-28 Thread Gautam
Hello Devs, We are currently working on building out a high write throughput pipeline with Iceberg where hundreds or thousands of writers (and thousands of readers) could be accessing a table at any given moment. We are facing the issue called out by [1]. According to

Re: Write reliability in Iceberg

2020-01-28 Thread Gautam
Thanks Ryan and Suds for the suggestions; we are looking into these options. We currently don't have any external catalog or locking service and depend purely on commit retries. Additionally, we don't have any of our metadata in Hive Metastore, and we want to leverage the underlying filesystem
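For reference, commit retry behavior is tunable through Iceberg's table properties (commit.retry.num-retries and related keys); a sketch with illustrative values only:

```java
import org.apache.iceberg.Table;

// Sketch: raise the retry budget for a heavily contended table. The
// property keys are Iceberg's; the values below are illustrative.
public class RetryTuningSketch {
  static void raiseRetryBudget(Table table) {
    table.updateProperties()
        .set("commit.retry.num-retries", "20")    // more attempts under contention
        .set("commit.retry.min-wait-ms", "250")   // initial backoff
        .set("commit.retry.max-wait-ms", "60000") // backoff cap
        .commit();
  }
}
```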