Hi Gautam,
Hadoop tables are not intended to be used when the file system doesn't
support atomic rename because of the problems you describe. Atomic rename
is a requirement for correctness in Hadoop tables.
That is why we also have metastore tables, where some other atomic swap is
used.
We have referred to https://iceberg.incubator.apache.org/custom-catalog/ and
implemented the atomic operation using DynamoDB optimistic locking. The
Iceberg codebase has an excellent test case to validate custom
implementations.
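The optimistic-locking commit described above can be sketched as a compare-and-swap on a version number. This is an in-memory stand-in with hypothetical names, not the actual implementation; against DynamoDB the same check would be expressed as a conditional `PutItem` (e.g. `ConditionExpression="version = :expected"`) so the pointer swap is atomic:

```python
class VersionConflict(Exception):
    """Raised when another writer committed first; caller should retry."""


class OptimisticCatalog:
    """In-memory sketch of an optimistically locked catalog."""

    def __init__(self):
        self._items = {}  # table name -> (version, metadata_location)

    def load(self, table):
        return self._items.get(table, (0, None))

    def commit(self, table, expected_version, new_location):
        # The version check and the write must happen atomically;
        # DynamoDB's conditional write gives you that in one call.
        current_version, _ = self.load(table)
        if current_version != expected_version:
            raise VersionConflict(
                f"expected v{expected_version}, found v{current_version}"
            )
        self._items[table] = (expected_version + 1, new_location)
```

A writer that loses the race gets a `VersionConflict`, re-reads the latest metadata, and retries its commit on top of it.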
Thanks for pointing out those references, suds!
And thanks to Mouli (for writing the doc) and Anton (for writing the test)!
On Tue, Jan 28, 2020 at 2:05 PM suds wrote:
> We have referred https://iceberg.incubator.apache.org/custom-catalog/ and
> implemented atomic operation using dynamo
> Sorting seems to be required for deletes at least, right?
Sorting is an optimization. Equality deletes, like id = 10, can be
implemented by keeping a hash set of deleted values or by merging two
sorted lists. The latter option doesn't require a lot of memory for a large
set of deletes, which is why
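The two strategies can be sketched side by side. This is a minimal illustration over Python lists with a hypothetical row shape; the real merge operates over sorted data files, but the memory trade-off is the same:

```python
def apply_deletes_hashset(rows, deleted_ids):
    """Keep a hash set of deleted key values; simple, but the whole
    delete set must fit in memory."""
    deleted = set(deleted_ids)
    return [r for r in rows if r["id"] not in deleted]


def apply_deletes_merge(sorted_rows, sorted_deleted_ids):
    """Stream both sides in sorted order, advancing the delete cursor
    as needed; memory use is constant regardless of delete count."""
    out = []
    deletes = iter(sorted_deleted_ids)
    d = next(deletes, None)
    for r in sorted_rows:
        while d is not None and d < r["id"]:
            d = next(deletes, None)
        if d != r["id"]:
            out.append(r)
    return out
```

Both produce the same result; the merge variant only works when data and deletes share a sort order, which is why sorting helps here even though it is an optimization rather than a requirement.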
Hi,
We would like to store snapshot metadata that is necessary for
producing/consuming incremental data. An example of this is the maximum value
of an event timeline that we have processed so far, so that we know where to
read from next.
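One way to picture this is a string-to-string summary map attached to each snapshot, with the consumer walking back from the newest snapshot to find the last committed watermark. A minimal sketch (the key name and dict layout are hypothetical, not Iceberg's API):

```python
def latest_watermark(snapshots, key="max-event-ts"):
    """Find the most recent watermark committed with any snapshot.

    `snapshots` is a list ordered oldest -> newest, each a dict with a
    'summary' string map, mirroring per-snapshot key/value metadata.
    Not every snapshot has to carry the key (e.g. compaction commits).
    """
    for snap in reversed(snapshots):
        if key in snap["summary"]:
            return snap["summary"][key]
    return None
```

Because the value rides along with the snapshot that produced it, a rollback to an older snapshot automatically rolls the watermark back too, which table-level properties would not give you.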
Some of the possible options that we have discovered
Hi Dabby,
I think your assessment is right.
- Table properties aren’t versioned with snapshots and are a good fit for
table-level configuration. It sounds like what you need is additional
information about a snapshot, so table properties don’t make sense.
- Using upper and lower bounds
Hello Devs,
We are currently working on building out a high write
throughput pipeline with Iceberg where hundreds or thousands of writers
(and thousands of readers) could be accessing a table at any given moment.
We are facing the issue called out by [1]. According to
Thanks Ryan and Suds for the suggestions, we are looking into these
options.
We currently don't have any external catalog or locking service and depend
purely on commit retries. Additionally, we don't have any of our metadata
in Hive Metastore, and we want to leverage the underlying filesystem
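A pure commit-retry strategy under heavy contention usually needs backoff to avoid writers stampeding each other. A minimal sketch, where `attempt_commit` and `refresh` are hypothetical stand-ins for the catalog's compare-and-swap and metadata reload:

```python
import random
import time


def commit_with_retries(attempt_commit, refresh, max_attempts=5):
    """Retry an optimistic commit, rebasing on the latest metadata
    after each conflict.

    attempt_commit(base) -> bool  # hypothetical CAS against the catalog
    refresh() -> base             # hypothetical metadata reload
    """
    base = refresh()
    for attempt in range(max_attempts):
        if attempt_commit(base):
            return True
        # Exponential backoff with jitter spreads out contending writers
        # so they don't all retry at the same instant.
        time.sleep(min(0.1 * (2 ** attempt), 2.0) * random.random())
        base = refresh()
    return False
```

With hundreds of concurrent writers, though, retries alone tend to degrade badly, which is why batching commits through fewer committers or an external coordination point is worth considering.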