[ 
https://issues.apache.org/jira/browse/HUDI-2285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan resolved HUDI-2285.
---------------------------------------
    Fix Version/s: 0.10.0
       Resolution: Fixed

> Metadata Table Synchronous Design
> ---------------------------------
>
>                 Key: HUDI-2285
>                 URL: https://issues.apache.org/jira/browse/HUDI-2285
>             Project: Apache Hudi
>          Issue Type: Sub-task
>            Reporter: Prashant Wason
>            Assignee: Prashant Wason
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.10.0
>
>
> h2. *Motivation*
> HUDI Metadata Table version 1 (0.7 release) supports file-listing 
> optimization. We intend to add support for additional information in the 
> metadata table: a record-level index (UUID) and column indexes (column 
> range index). This requires re-architecting the table design for large 
> scale (50 billion+ records), for synchronous operations, and to reduce 
> reader-side overhead.
>  # Limit the amount of syncing required on the reader side
>  # Syncing on the reader side may negate the benefits of the secondary index
>  # Not syncing on the reader side simplifies the design and reduces testing 
> effort
>  # Allow moving to a multi-writer design with operations running in separate 
> pipelines
>  # E.g. Clustering / Clean / Backfills in separate pipelines
>  # Ease of debugging
>  # Scale: should handle very large inserts (millions of keys, thousands of 
> data files written)
>  
> h3. *Writer Side*
> The lifecycle of a HUDI operation is listed below. The example shows a 
> COMMIT operation, but the steps apply to all operation types.
>  # SparkHoodieWriteClient.commit(...) is called by the ingestion process at 
> time T1
>  # Create the requested instant t1.commit.requested
>  # Create the inflight instant t1.commit.inflight
>  # Write the RDD into the dataset and create the HoodieCommitMetadata
>  # HoodieMetadataTableWriter.update(CommitMetadata, t1, WriteStatus)
>  # This performs a delta-commit into the HUDI Metadata Table, updating the 
> file listing, the record-level index (future), and the column indexes 
> (future) together from the data collected in the WriteStatus.
>  # This delta-commit completes before the commit started on the dataset 
> completes.
>  # This creates the t1.deltacommit on the Metadata Table.
>  # Since the Metadata Table has inline clean and inline compaction, those 
> additional operations may also take place at this time
>  # Complete the commit by creating t1.commit
> Inline compaction will only compact log blocks that are deemed readable per 
> the reader-side algorithm described in the next section.
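> A minimal sketch of this ordering follows. Only SparkHoodieWriteClient and 
> HoodieMetadataTableWriter are named in this design; the interfaces and 
> helper methods below are placeholders, not the actual Hudi APIs.
> {code:java}
> import java.util.List;
> 
> // Placeholder timeline operations on the dataset.
> interface Timeline {
>   void createRequestedInstant(String t); // t1.commit.requested
>   void createInflightInstant(String t);  // t1.commit.inflight
>   void completeCommit(String t);         // t1.commit
> }
> 
> // Placeholder for the metadata table writer.
> interface MetadataTableWriter {
>   // Delta-commits the file listing (and, in the future, the record-level
>   // and column indexes) at the same instant time; inline clean/compaction
>   // on the metadata table may run inside this call.
>   void update(Object commitMetadata, String t, List<?> writeStatuses);
> }
> 
> class SynchronousCommitSketch {
>   void commit(Timeline timeline, MetadataTableWriter metadataWriter,
>               Object commitMetadata, List<?> writeStatuses, String t1) {
>     timeline.createRequestedInstant(t1);
>     timeline.createInflightInstant(t1);
>     // ... write the RDD into the dataset here, producing writeStatuses ...
>     // The metadata table delta-commit (t1.deltacommit) completes BEFORE
>     // the dataset commit is finalized.
>     metadataWriter.update(commitMetadata, t1, writeStatuses);
>     timeline.completeCommit(t1); // creates t1.commit on the dataset
>   }
> }
> {code}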
> h3. *Reader Side*
>  # List the dataset to find all completed instants - e.g. t1.commit, 
> t2.commit … t10.commit
>  # Since these instants are completed, their related metadata has already 
> been written to the metadata table as part of the respective delta-commits - 
> t1.deltacommit, t2.deltacommit … t10.deltacommit
>  # Find the last completed instant on the dataset - t10.commit
>  # Open the FileSlice on the metadata partition with the following 
> constraints:
>  ## Any base file with time > t10 cannot be used
>  ## Any log blocks whose timestamp is not in the list of completed instants 
> (#1 above) cannot be used
>  # Only in ingestion-failure cases may the latest base file (created by 
> compaction) or some log blocks have to be ignored. In the success case, 
> this process adds no extra overhead beyond listing the dataset.
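> A minimal sketch of the filtering rules above, assuming placeholder 
> BaseFile/LogBlock types and instant times that compare lexicographically:
> {code:java}
> import java.util.List;
> import java.util.Set;
> import java.util.stream.Collectors;
> 
> class MetadataReaderSketch {
>   record BaseFile(String instantTime) {}
>   record LogBlock(String instantTime) {}
> 
>   // A base file written after the last completed instant on the dataset
>   // (t10) comes from an in-flight compaction and cannot be used yet.
>   static boolean baseFileUsable(BaseFile f, String lastCompleted) {
>     return f.instantTime().compareTo(lastCompleted) <= 0;
>   }
> 
>   // A log block is readable only if its timestamp belongs to a completed
>   // instant on the dataset (step 1 above).
>   static List<LogBlock> readableLogBlocks(List<LogBlock> blocks,
>                                           Set<String> completedInstants) {
>     return blocks.stream()
>         .filter(b -> completedInstants.contains(b.instantTime()))
>         .collect(Collectors.toList());
>   }
> }
> {code}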
>  
> h3. *Multi Write Support*
> Since each operation on the metadata table writes to the same files (the 
> file-listing partition has a single FileSlice), only single-writer access 
> to the metadata table can be allowed. For this, the Transaction Manager is 
> used to lock the table before any update.
> In essence, each multi-writer operation will contend for the same lock to 
> write its updates to the metadata table before the operation completes. In 
> practice this should rarely be an issue, as the operations complete at 
> different times and metadata table operations should be fast.
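> A simplified illustration of this single-writer constraint, using a 
> JVM-local lock as a stand-in for Hudi's Transaction Manager and its 
> external lock provider:
> {code:java}
> import java.util.concurrent.locks.ReentrantLock;
> 
> class MetadataTableLockSketch {
>   // Stand-in for the cross-process lock taken via the Transaction Manager.
>   private final ReentrantLock tableLock = new ReentrantLock();
> 
>   void updateMetadataTable(Runnable deltaCommit) {
>     tableLock.lock(); // contend with other pipelines (clean, clustering, ...)
>     try {
>       deltaCommit.run(); // apply this operation's delta-commit
>     } finally {
>       tableLock.unlock(); // let the next operation proceed
>     }
>   }
> }
> {code}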
>  
> h3. *Upgrade/Downgrade*
> The two versions (the current one and this new one) differ in schema, and 
> it is complicated to check whether the table is in sync. It is simpler to 
> re-bootstrap, since only the file listing needs to be re-bootstrapped.
> h3. *Support for shards in metadata table partitions*
>  # There will be a fixed number of shards for each Metadata Table partition.
>  # Shards are implemented using filenames of the format fileId00ABCD, where 
> ABCD is the shard number. This allows easy identification of the files and 
> their order while still keeping the names unique. (A naming sketch follows 
> this list.)
>  # Shards are pre-allocated at bootstrap time.
>  # Currently only the files partition has 1 shard, but this code is 
> required for the record-level index, so it is implemented here.
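> A hypothetical sketch of the shard naming and key-to-shard mapping (the 
> helper names and the hash-based mapping are illustrative assumptions, not 
> part of this design's specification):
> {code:java}
> class ShardNamingSketch {
>   // Append a fixed-width shard number to the fileId so names stay unique
>   // and sort in shard order,
>   // e.g. shardFileId("fileId00", 12) -> "fileId000012".
>   static String shardFileId(String fileIdPrefix, int shard) {
>     return String.format("%s%04d", fileIdPrefix, shard);
>   }
> 
>   // Illustrative stable mapping of a record key to one of the
>   // pre-allocated shards.
>   static int shardFor(String recordKey, int numShards) {
>     return Math.floorMod(recordKey.hashCode(), numShards);
>   }
> }
> {code}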


