[jira] [Commented] (HUDI-2658) When disable auto clean, do not check if MIN_COMMITS_TO_KEEP was larger CLEANER_COMMITS_RETAINED

sivabalan narayanan (Jira) Wed, 29 Dec 2021 20:13:04 -0800


    [ 
https://issues.apache.org/jira/browse/HUDI-2658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466674#comment-17466674
 ]


sivabalan narayanan commented on HUDI-2658:
-------------------------------------------

Closing as invalid. 

Comment from the PR:

I think making this validation conditional could compromise the integrity of 
min-instants, since a user can toggle auto clean at any time. What if the same 
table has a writer and a compactor with different auto clean settings? The 
writer could disable auto clean, trigger archival, and end up with fewer 
commits; the compactor would then run and see fewer actual instants than 
min-instants. I think keeping this logic consistent is important.
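
To make the point concrete, here is a minimal sketch of an unconditional 
check. It is not Hudi's actual code: the class and method names are 
hypothetical, and only the two config keys and the values 3 and 10 come from 
the exception quoted below. The autoClean flag is accepted but deliberately 
ignored, so every writer and compactor on the same table is held to the same 
rule.

```java
// Hypothetical stand-in for the validation discussed in this thread.
// Not Hudi's implementation; names are illustrative assumptions.
public class CommitRetentionCheck {

    /** True when archival keeps strictly more commits than the cleaner retains. */
    public static boolean isValid(int minCommitsToKeep, int cleanerCommitsRetained) {
        return minCommitsToKeep > cleanerCommitsRetained;
    }

    /**
     * Validates unconditionally. The autoClean flag is intentionally unused:
     * it can differ between jobs on the same table and can be toggled at any
     * time, so it must not change the outcome of the check.
     */
    public static void validate(int minCommitsToKeep, int cleanerCommitsRetained,
                                boolean autoClean) {
        if (!isValid(minCommitsToKeep, cleanerCommitsRetained)) {
            throw new IllegalArgumentException(String.format(
                "Increase hoodie.keep.min.commits=%d to be greater than "
                    + "hoodie.cleaner.commits.retained=%d.",
                minCommitsToKeep, cleanerCommitsRetained));
        }
    }

    public static void main(String[] args) {
        // Values from the reported failure: rejected even with auto clean off.
        try {
            validate(3, 10, false);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        // A setting that satisfies the rule passes regardless of the flag.
        validate(11, 10, false);
        validate(11, 10, true);
        System.out.println("accepted: min commits 11 > retained 10");
    }
}
```

Under this sketch, a job that hits the exception would raise 
hoodie.keep.min.commits above hoodie.cleaner.commits.retained rather than 
rely on auto clean being disabled.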

 

> When disable auto clean, do not check if MIN_COMMITS_TO_KEEP was larger 
> CLEANER_COMMITS_RETAINED
> ------------------------------------------------------------------------------------------------
>
>                 Key: HUDI-2658
>                 URL: https://issues.apache.org/jira/browse/HUDI-2658
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Yue Zhang
>            Priority: Major
>              Labels: pull-request-available, sev:normal
>
> The exception below is thrown even when auto clean is disabled.
> {code:java}
> 21/10/18 05:54:20,149 ERROR Misc: Streaming batch fail, shutting down whole application immediately.
> java.lang.IllegalArgumentException: Increase hoodie.keep.min.commits=3 to be greater than hoodie.cleaner.commits.retained=10. Otherwise, there is risk of incremental pull missing data from few instants.
>     at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)
>     at org.apache.hudi.config.HoodieCompactionConfig$Builder.build(HoodieCompactionConfig.java:355)
>     at org.apache.hudi.config.HoodieWriteConfig$Builder.setDefaults(HoodieWriteConfig.java:1396)
>     at org.apache.hudi.config.HoodieWriteConfig$Builder.build(HoodieWriteConfig.java:1436)
>     at org.apache.hudi.DataSourceUtils.createHoodieConfig(DataSourceUtils.java:188)
>     at org.apache.hudi.DataSourceUtils.createHoodieClient(DataSourceUtils.java:193)
>     at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$3.apply(HoodieSparkSqlWriter.scala:166)
>     at org.apache.hudi.HoodieSparkSqlWriter$$anonfun$3.apply(HoodieSparkSqlWriter.scala:166)
>     at scala.Option.getOrElse(Option.scala:121)
>     at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:166)
>     at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:145)
>     at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>     at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>     at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>     at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>     at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>     at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>     at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
>     at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
>     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
>     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
>     at tv.freewheel.reporting.ssql.sinkers.HudiSinker.sink(HudiSinker.scala:20)
>     at tv.freewheel.reporting.realtime.core.schedulers.RuleScheduler$$anonfun$execSink$1$$anonfun$apply$1.apply$mcV$sp(RuleScheduler.scala:73)
>     at tv.freewheel.reporting.realtime.utils.Misc$.failFast(Misc.scala:72)
>     at tv.freewheel.reporting.realtime.core.schedulers.RuleScheduler$$anonfun$execSink$1.apply(RuleScheduler.scala:73)
>     at tv.freewheel.reporting.realtime.core.schedulers.RuleScheduler$$anonfun$execSink$1.apply(RuleScheduler.scala:71)
>     at scala.Option.foreach(Option.scala:257)
>     at tv.freewheel.reporting.realtime.core.schedulers.RuleScheduler.execSink(RuleScheduler.scala:71)
>     at tv.freewheel.reporting.realtime.core.schedulers.RuleScheduler$$anonfun$submitRecursively$3$$anonfun$1.apply$mcV$sp(RuleScheduler.scala:35)
>     at tv.freewheel.reporting.realtime.utils.Misc$$anon$2.run(Misc.scala:31)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
