[ https://issues.apache.org/jira/browse/HUDI-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
sivabalan narayanan resolved HUDI-2285.
---------------------------------------
    Fix Version/s: 0.10.0
       Resolution: Fixed

> Metadata Table Synchronous Design
> ---------------------------------
>
>                 Key: HUDI-2285
>                 URL: https://issues.apache.org/jira/browse/HUDI-2285
>             Project: Apache Hudi
>          Issue Type: Sub-task
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.10.0
>
> h2. *Motivation*
> HUDI Metadata Table version 1 (0.7 release) supports the file-listing optimization. We intend to add support for additional information - a record-level index (UUID) and column indexes (column range index) - to the metadata table. This requires re-architecting the table design for large scale (50 billion+ records), for synchronous operations, and for reduced reader-side overhead.
> # Limit the amount of syncing required on the reader side
> ## Syncing on the reader side may negate the benefits of the secondary index
> ## Not syncing on the reader side simplifies the design and reduces testing
> # Allow moving to a multi-writer design with operations running in separate pipelines
> ## E.g. Clustering / Clean / Backfills in separate pipelines
> # Ease of debugging
> # Scale - should be able to handle very large inserts - millions of keys, thousands of data files written
>
> h3. *Writer Side*
> The lifecycle of a HUDI operation is listed below. The example shows a COMMIT operation, but the steps apply to all types of operations.
> # SparkHoodieWriteClient.commit(...) is called by the ingestion process at time T1
> # Create the requested instant t1.commit.requested
> # Create the inflight instant t1.commit.inflight
> # Perform the write of the RDD into the dataset and create the HoodieCommitMetadata
> # HoodieMetadataTableWriter.update(CommitMetadata, t1, WriteStatus)
> ## This performs a delta-commit into the HUDI Metadata Table, updating the file listing, record-level index (future) and column indexes (future) together from the data collected in the WriteStatus.
> ## This commit completes before the commit that was started on the dataset completes.
> ## This creates t1.deltacommit on the Metadata Table.
> ## Since the Metadata Table has inline clean and inline compaction, those additional operations may also take place at this time.
> # Complete the commit by creating t1.commit
> Inline compaction will only compact those log blocks which can be deemed readable as per the algorithm described for the reader side in the next section.
>
> h3. *Reader Side*
> # List the dataset to find all completed instants - e.g. t1.commit, t2.commit … t10.commit
> # Since these instants are completed, their related metadata has already been written to the metadata table as part of the respective deltacommits - t1.deltacommit, t2.deltacommit … t10.deltacommit
> # Find the last completed instant on the dataset - t10.commit
> # Open the FileSlice on the metadata partition with the following constraints (see the sketch after this list):
> ## Any base file with time > t10 cannot be used
> ## Any log block whose timestamp is not in the list of completed instants (#1 above) cannot be used
> # Only in ingestion-failure cases does the latest base file (created by compaction) or some log blocks have to be ignored. In success cases, this process should not add extra overhead except for listing the dataset.
>
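> The reader-side visibility rule above can be captured in a few lines. The sketch below is illustrative only - the class and method names are hypothetical, not Hudi APIs - and it assumes fixed-width instant timestamps so that lexicographic comparison matches time ordering:
> {code:java}
> import java.util.List;
> import java.util.Optional;
> import java.util.Set;
> import java.util.stream.Collectors;
>
> // Hypothetical sketch of the reader-side rule: which parts of the latest
> // metadata FileSlice may a reader use, given the dataset's completed instants.
> public class MetadataReaderVisibility {
>
>   // A base file is usable only if its time is <= the last completed instant on the dataset.
>   static Optional<String> usableBaseFile(Optional<String> baseFileInstant, String lastCompletedInstant) {
>     return baseFileInstant.filter(t -> t.compareTo(lastCompletedInstant) <= 0);
>   }
>
>   // A log block is usable only if its timestamp is one of the completed instants on the dataset.
>   static List<String> usableLogBlocks(List<String> logBlockInstants, Set<String> completedInstants) {
>     return logBlockInstants.stream()
>         .filter(completedInstants::contains)
>         .collect(Collectors.toList());
>   }
>
>   public static void main(String[] args) {
>     // Completed instants on the dataset (step 1 of the reader-side algorithm).
>     Set<String> completed = Set.of("t01", "t02", "t09", "t10");
>     String lastCompleted = "t10";
>
>     // Latest FileSlice of the metadata partition: a base file plus log blocks.
>     Optional<String> baseFile = Optional.of("t08");          // created by compaction at t08
>     List<String> logBlocks = List.of("t09", "t10", "t11");   // t11 failed or is still inflight
>
>     System.out.println(usableBaseFile(baseFile, lastCompleted)); // Optional[t08]
>     System.out.println(usableLogBlocks(logBlocks, completed));   // [t09, t10] - t11 is ignored
>   }
> }
> {code}
> Because unreadable log blocks are simply filtered out, a failed or inflight write to the metadata table does not have to be rolled back before readers can proceed; its blocks are never read.
>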
> h3. *Multi Write Support*
> Since each operation on the metadata table writes to the same files (the file-listing partition has a single FileSlice), we can only allow single-writer access to the metadata table. For this, the Transaction Manager is used to lock the table before any updates.
> In essence, each multi-writer operation will contend for the same lock to write its updates to the metadata table before the operation completes. In practice this should not be an issue, as the operations complete at different times and metadata table updates should be fast.
>
> h3. *Upgrade/Downgrade*
> The two versions (the current one and this new one) differ in schema, and it is complicated to check whether the table is in sync. It is simpler to re-bootstrap, as only the file listing needs to be re-bootstrapped.
>
> h3. *Support for shards in metadata table partitions*
> 1. There will be a fixed number of shards for each Metadata Table partition.
> 2. Shards are implemented using filenames of the format fileId00ABCD, where ABCD is the shard number. This allows easy identification of the files and their order while still keeping the names unique (see the sketch after this list).
> 3. Shards are pre-allocated at bootstrap time.
> 4. Currently only the files partition has 1 shard, but this mechanism is required for the record-level index, so it is implemented here.
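> A minimal sketch of the shard naming scheme above, using hypothetical helper names (the exact prefix and the width of the shard suffix are illustrative, not the actual Hudi file naming):
> {code:java}
> // Hypothetical illustration of shard file names of the form fileId00ABCD,
> // where ABCD is the zero-padded shard number appended to a fixed file-id prefix.
> public class MetadataShardNaming {
>
>   // Build the file id for a shard: prefix plus a 4-digit, zero-padded shard number.
>   static String shardFileId(String fileIdPrefix, int shard) {
>     return String.format("%s%04d", fileIdPrefix, shard);
>   }
>
>   // Recover the shard number from the last 4 characters of the shard file id,
>   // so files can be identified and ordered by name alone.
>   static int shardNumber(String shardFileId) {
>     return Integer.parseInt(shardFileId.substring(shardFileId.length() - 4));
>   }
>
>   public static void main(String[] args) {
>     // Shards are pre-allocated at bootstrap time; the count is fixed per partition.
>     int numShards = 4;
>     for (int shard = 0; shard < numShards; shard++) {
>       String id = shardFileId("fileId00", shard);
>       System.out.println(id + " -> shard " + shardNumber(id));
>       // fileId000000 -> shard 0, fileId000001 -> shard 1, ...
>     }
>   }
> }
> {code}
> With a fixed shard count, a record key can presumably be routed to a shard deterministically (for example, hash(key) % numShards), which is what makes this scheme useful for the planned record-level index.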