This is an automated email from the ASF dual-hosted git repository. bhavanisudha pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
     new 83b0744d6cf [DOCS] Update write operations page (#9619)
83b0744d6cf is described below

commit 83b0744d6cfdea9671e7ed2b1cf4fa285904f8af
Author: Bhavani Sudha Saktheeswaran <2179254+bhasu...@users.noreply.github.com>
AuthorDate: Tue Sep 12 09:57:28 2023 -0700

    [DOCS] Update write operations page (#9619)
---
 website/docs/write_operations.md   | 49 ++++++++++++++++++++++++++++++++++++++
 website/src/theme/DocPage/index.js |  2 +-
 2 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/website/docs/write_operations.md b/website/docs/write_operations.md
index fc0791bcf20..2942dabf1ef 100644
--- a/website/docs/write_operations.md
+++ b/website/docs/write_operations.md
@@ -34,6 +34,55 @@ Hudi supports implementing two types of deletes on data stored in Hudi tables, b
 - Using DataSource, set `PAYLOAD_CLASS_OPT_KEY` to `"org.apache.hudi.EmptyHoodieRecordPayload"`. This will remove all the records in the DataSet being submitted.
 - Using DataSource or Hudi Streamer, add a column named `_hoodie_is_deleted` to DataSet. The value of this column must be set to `true` for all the records to be deleted and either `false` or left null for any records which are to be upserted.
+### BOOTSTRAP
+Hudi supports migrating your existing large tables into a Hudi table using the `bootstrap` operation. There are a couple of ways to approach this. Please refer to the
+[bootstrapping page](https://hudi.apache.org/docs/migration_guide) for more details.
+
+### INSERT_OVERWRITE
+This operation is used to rewrite all the partitions that are present in the input. This operation can be faster
+than `upsert` for batch ETL jobs that recompute entire target partitions at once (as opposed to incrementally
+updating the target tables). This is because we are able to completely bypass the indexing, precombining and other repartitioning
+steps in the upsert write path.
This comes in handy for backfills and similar use cases.
+
+### INSERT_OVERWRITE_TABLE
+This operation can be used to overwrite the entire table. The Hudi cleaner will eventually clean up
+the previous table snapshot's file groups asynchronously based on the configured cleaning policy. This operation is much
+faster than issuing explicit deletes.
+
+### DELETE_PARTITION
+In addition to deleting individual records, Hudi supports deleting entire partitions in bulk using this operation.
+Deletion of specific partitions can be done using the config
+[`hoodie.datasource.write.partitions.to.delete`](https://hudi.apache.org/docs/configurations#hoodiedatasourcewritepartitionstodelete).
+
+
+## Configs
+Here are the basic configs relevant to the write operation types mentioned above. Please refer to [Write Options](https://hudi.apache.org/docs/configurations#Write-Options) for more Spark-based configs and [Flink options](https://hudi.apache.org/docs/next/configurations#Flink-Options) for Flink-based configs.
+
+**Spark-based configs:**
+
+| Config Name | Default | Description [...]
+|------------------------------------------------|----------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- [...]
+| hoodie.datasource.write.operation | upsert (Optional) | Whether to do upsert, insert or bulk_insert for the write operation. Use bulk_insert to load new data into a table, and thereafter use upsert/insert. Bulk insert uses a disk-based write path to scale to large inputs without needing to cache them.<br /><br />`Config Param: OPERATION` [...]
+| hoodie.datasource.write.precombine.field | ts (Optional) | Field used in preCombining before the actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)<br /><br />`Config Param: PRECOMBINE_FIELD` [...]
+| hoodie.combine.before.insert | false (Optional) | When inserted records share the same key, controls whether they should first be combined (i.e., de-duplicated) before writing to storage.<br /><br />`Config Param: COMBINE_BEFORE_INSERT` [...]
+| hoodie.datasource.write.insert.drop.duplicates | false (Optional) | If set to true, records from the incoming dataframe will not overwrite existing records with the same key during the write operation. This config is deprecated as of 0.14.0. Please use hoodie.datasource.insert.dup.policy instead.<br /><br />`Config Param: INSERT_DROP_DUPS` [...]
+| hoodie.bulkinsert.sort.mode | NONE (Optional) | org.apache.hudi.execution.bulkinsert.BulkInsertSortMode: Modes for sorting records during bulk insert. <ul><li>`NONE(default)`: No sorting. Fastest and matches `spark.write.parquet()` in number of files and overhead.</li><li>`GLOBAL_SORT`: This ensures best file sizes, with lowest memory overhead at the cost of sorting.</li><li>`PARTITION_SORT`: Strikes a balance by only sorting within a Spark RDD partition, still keep [...]
+| hoodie.bootstrap.base.path | N/A **(Required)** | **Applicable only when** operation type is `bootstrap`. Base path of the dataset that needs to be bootstrapped as a Hudi table<br /><br />`Config Param: BASE_PATH`<br />`Since Version: 0.6.0` [...]
+| hoodie.bootstrap.mode.selector | org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector (Optional) | Selects the mode in which each file/partition in the bootstrapped dataset gets bootstrapped<br />Possible values:<ul><li>`org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector`: In this mode, the full record data is not copied into Hudi, so it avoids the full cost of rewriting the dataset. Instead, 'skeleton' files co [...]
+| hoodie.datasource.write.partitions.to.delete | N/A **(Required)** | **Applicable only when** operation type is `delete_partition`. Comma-separated list of partitions to delete. Allows use of wildcard *<br /><br />`Config Param: PARTITIONS_TO_DELETE` [...]
+
+
+**Flink-based configs:**
+
+| Config Name | Default | Description |
+|------------------------------------------------|----------------------|-------------------------------------------------------------------------------------|
+| write.operation | upsert (Optional) | The write operation that this write should perform<br /><br /> `Config Param: OPERATION`|
+| precombine.field | ts (Optional) | Field used in preCombining before the actual write. When two records have the same key value, we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)<br /><br /> `Config Param: PRECOMBINE_FIELD`|
+| write.precombine | false (Optional) | Flag to indicate whether to drop duplicates before insert/upsert.
By default, these cases accept duplicates to gain extra performance: 1) the insert operation; 2) upsert for a MOR table, since the MOR table de-duplicates on read<br /><br /> `Config Param: PRE_COMBINE`|
+| write.bulk_insert.sort_input | true (Optional) | Whether to sort the inputs by specific fields for bulk insert tasks, default true<br /><br /> `Config Param: WRITE_BULK_INSERT_SORT_INPUT` |
+| write.bulk_insert.sort_input.by_record_key | false (Optional) | Whether to sort the inputs by record keys for bulk insert tasks, default false<br /><br /> `Config Param: WRITE_BULK_INSERT_SORT_INPUT_BY_RECORD_KEY` |
+
+
 ## Writing path
 The following is an inside look on the Hudi write path and the sequence of events that occur during a write.

diff --git a/website/src/theme/DocPage/index.js b/website/src/theme/DocPage/index.js
index 817f8474215..ef73ef448bc 100644
--- a/website/src/theme/DocPage/index.js
+++ b/website/src/theme/DocPage/index.js
@@ -128,7 +128,7 @@ function DocPageContent({
 );
 }

-const arrayOfPages = (matchPath) => [`${matchPath}/configurations`, `${matchPath}/basic_configurations`, `${matchPath}/timeline`, `${matchPath}/table_types`, `${matchPath}/migration_guide`, `${matchPath}/compaction`, `${matchPath}/clustering`, `${matchPath}/indexing`, `${matchPath}/metadata`, `${matchPath}/metadata_indexing`, `${matchPath}/record_payload`, `${matchPath}/file_sizing`, `${matchPath}/hoodie_cleaner`, `${matchPath}/concurrency_control`];
+const arrayOfPages = (matchPath) => [`${matchPath}/configurations`, `${matchPath}/basic_configurations`, `${matchPath}/timeline`, `${matchPath}/table_types`, `${matchPath}/migration_guide`, `${matchPath}/compaction`, `${matchPath}/clustering`, `${matchPath}/indexing`, `${matchPath}/metadata`, `${matchPath}/metadata_indexing`, `${matchPath}/record_payload`, `${matchPath}/file_sizing`, `${matchPath}/hoodie_cleaner`, `${matchPath}/concurrency_control`, `${matchPath}/write_operations`];

 const showCustomStylesForDocs = (matchPath, pathname) =>
arrayOfPages(matchPath).includes(pathname); function DocPage(props) { const {
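For context on the docs change above: a minimal sketch of how the `delete_partition` operation and its `hoodie.datasource.write.partitions.to.delete` config (both documented in the updated page) could be driven from PySpark. The table name, base path, and partition values are hypothetical, and `spark`/`df` are assumed to be an existing SparkSession/DataFrame with the Hudi Spark bundle on the classpath.

```python
# Sketch only: option keys come from the write_operations.md table above;
# the table name, path, and partition values are invented for illustration.
hudi_delete_partition_opts = {
    "hoodie.table.name": "trips",  # hypothetical table name
    "hoodie.datasource.write.operation": "delete_partition",
    # Comma-separated list of partition paths to delete; wildcard * is allowed.
    "hoodie.datasource.write.partitions.to.delete": "2023/09/11,2023/09/12",
}

# With a live SparkSession and the Hudi bundle available, the options would
# be applied to a DataSource write, e.g.:
#
#   (df.write.format("hudi")
#      .options(**hudi_delete_partition_opts)
#      .mode("append")
#      .save("/tmp/hudi/trips"))
```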