[ 
https://issues.apache.org/jira/browse/KYLIN-4307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nichunen resolved KYLIN-4307.
-----------------------------
    Resolution: Fixed

> ExponentialBackoffRetry implemented a incorrect retry policy
> ------------------------------------------------------------
>
>                 Key: KYLIN-4307
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4307
>             Project: Kylin
>          Issue Type: Bug
>          Components: Storage - HBase
>    Affects Versions: v2.6.4
>         Environment: HBase 1.1.5
>            Reporter: Liang Geng
>            Assignee: Yaqian Zhang
>            Priority: Major
>             Fix For: v3.1.0
>
>         Attachments: Screen Shot 2019-12-19 at 12.16.05.png, 
> image-2020-02-13-15-58-23-114.png
>
>
> Today we had encountered a build error. The step "Build N-Dimension Cuboid : 
> level 1" failed caused by WriteConflictException.
>  
>  
> {code:java}
> 2019-12-19 03:19:59,049 TRACE [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from 
> oldTs: 1576696795196, to newTs: 1576696799048, operation result: 
> true2019-12-19 03:19:59,049 TRACE [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from 
> oldTs: 1576696795196, to newTs: 1576696799048, operation result: 
> true2019-12-19 03:20:02,775 INFO  [BadQueryDetector] 
> service.BadQueryDetector:147 : Detect bad query.2019-12-19 03:20:03,254 TRACE 
> [Scheduler 1528161485 Job d33e65d0-4214-3b96-9256-607a730cb53f-880] 
> hbase.HBaseResourceStore:335 : Update row 
> /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: 
> 1576696793142, to newTs: 1576696803253, operation result: true2019-12-19 
> 03:20:03,256 TRACE [Scheduler 1528161485 Job 
> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from 
> oldTs: 1576696803253, to newTs: 1576696803255, operation result: 
> true2019-12-19 03:20:04,187 INFO  [FetcherRunner 2062551616-143] 
> threadpool.DefaultFetcherRunner:85 : Job Fetcher: 2 should running, 2 actual 
> running, 0 stopped, 0 ready, 1754 already succeed, 7 error, 10 discarded, 0 
> others2019-12-19 03:20:09,124 TRACE [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from 
> oldTs: 1576696799048, to newTs: 1576696809122, operation result: 
> true2019-12-19 03:20:09,125 TRACE [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from 
> oldTs: 1576696809122, to newTs: 1576696809124, operation result: 
> true2019-12-19 03:20:13,263 TRACE [Scheduler 1528161485 Job 
> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from 
> oldTs: 1576696803255, to newTs: 1576696813262, operation result: 
> true2019-12-19 03:20:24,133 INFO  [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] 
> persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1 
> seconds.2019-12-19 03:20:25,228 TRACE [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from 
> oldTs: 1576696809124, to newTs: 1576696819131, operation result: 
> false2019-12-19 03:20:25,284 TRACE [Scheduler 1528161485 Job 
> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from 
> oldTs: 1576696813262, to newTs: 1576696825245, operation result: 
> true2019-12-19 03:20:25,299 ERROR [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 : 
> error execute MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, 
> name=Build N-Dimension Cuboid : level 1, 
> state=RUNNING}org.apache.kylin.common.persistence.WriteConflictException: 
> Overwriting conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, 
> expect old TS 1576696809124, but it is 1576696819131 at 
> org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.lambda$checkAndPutResourceWithRetry$3(ResourceStore.java:423)
>  at 
> org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceWithRetry(ResourceStore.java:423)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceCheckpoint(ResourceStore.java:411)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:405)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:388)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:368)
>  at 
> org.apache.kylin.job.dao.ExecutableDao.writeJobOutputResource(ExecutableDao.java:252)
>  at 
> org.apache.kylin.job.dao.ExecutableDao.updateJobOutput(ExecutableDao.java:426)
>  at 
> org.apache.kylin.job.execution.ExecutableManager.addJobInfo(ExecutableManager.java:566)
>  at 
> org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:163)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>  at 
> org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>  at 
> org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {code}
>  
> At 03:20:24,133, a thread "Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879" is trying to write content to the 
> path "/execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08" but the 
> exception was raised for some reasons. We don't know what kind of exception 
> happened because the exception was eaten by doWithRetry.
> ExponentialBackoffRetry will retry if the following exceptions happened: 
> SocketTimeoutException, or ConnectException, or RetriesExhaustedException or 
> any other exceptions configured in 
> "kylin.resourcestore.connection-exceptions".
> By default the retry policy is disabled, but we have enabled it by setting 
> "kylin.resourcestore.reconnect-enabled".
>  
> {code:java}
> // A successful written, 1576696809122 -> 1576696809124// A successful 
> written, 1576696809122 -> 15766968091242019-12-19 03:20:09,125 TRACE 
> [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] 
> hbase.HBaseResourceStore:335 : Update row 
> /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 
> 1576696809122, to newTs: 1576696809124, operation result: true
> // A written with exception but actually persistence to HBase, 1576696809124 
> -> 1576696819131// ExponentialBackoffRetry.doWithRetry catched the exception, 
> and then back-off2019-12-19 03:20:24,133 INFO  [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] 
> persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1 
> seconds.
> // Retry as back-off time reached. But this time, the invoking of checkAndPut 
> failed because the content is already written.2019-12-19 03:20:25,228 TRACE 
> [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] 
> hbase.HBaseResourceStore:335 : Update row 
> /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 
> 1576696809124, to newTs: 1576696819131, operation result: false
> // checkAndPutResourceImpl throws WriteConflictException2019-12-19 
> 03:20:25,299 ERROR [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 : 
> error execute MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, 
> name=Build N-Dimension Cuboid : level 1, 
> state=RUNNING}org.apache.kylin.common.persistence.WriteConflictException: 
> Overwriting conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, 
> expect old TS 1576696809124, but it is 1576696819131 at 
> org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.lambda$checkAndPutResourceWithRetry$3(ResourceStore.java:423)
>  at 
> org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceWithRetry(ResourceStore.java:423)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceCheckpoint(ResourceStore.java:411)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:405)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:388)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:368)
>  at 
> org.apache.kylin.job.dao.ExecutableDao.writeJobOutputResource(ExecutableDao.java:252)
>  at 
> org.apache.kylin.job.dao.ExecutableDao.updateJobOutput(ExecutableDao.java:426)
>  at 
> org.apache.kylin.job.execution.ExecutableManager.addJobInfo(ExecutableManager.java:566)
>  at 
> org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:163)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>  at 
> org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>  at 
> org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {code}
>  
> As shown in the log, the thread failed to call checkAndPutResourceImpl and 
> sleep 1 second to back-off at "2019-12-19 03:20:25,228".
> At 03:20:25,299, the WriteConflictException was raised. The message is 
> "expect old TS 1576696809124, but it is 1576696819131". Obviously, the thread 
> is trying to update with the same timestamp. That proves the first written is 
> successful but throws exceptions. And then, that exception was caught by 
> doWithRetry.
> I'm not a HBase expert, the picture above shows my rough guess:  !Screen Shot 
> 2019-12-19 at 12.16.05.png!
> Thus, we think ExponentialBackoffRetry implemented an incorrect retry policy. 
> Determining whether to retry only by the type of exception is not reliable. 
> At least, we should pull the last modified time to verify the written is 
> successful or not. Right?
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to