[ https://issues.apache.org/jira/browse/KYLIN-4307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
nichunen resolved KYLIN-4307. ----------------------------- Resolution: Fixed > ExponentialBackoffRetry implemented a incorrect retry policy > ------------------------------------------------------------ > > Key: KYLIN-4307 > URL: https://issues.apache.org/jira/browse/KYLIN-4307 > Project: Kylin > Issue Type: Bug > Components: Storage - HBase > Affects Versions: v2.6.4 > Environment: HBase 1.1.5 > Reporter: Liang Geng > Assignee: Yaqian Zhang > Priority: Major > Fix For: v3.1.0 > > Attachments: Screen Shot 2019-12-19 at 12.16.05.png, > image-2020-02-13-15-58-23-114.png > > > Today we had encountered a build error. The step "Build N-Dimension Cuboid : > level 1" failed caused by WriteConflictException. > > > {code:java} > 2019-12-19 03:19:59,049 TRACE [Scheduler 1528161485 Job > 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : > Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from > oldTs: 1576696795196, to newTs: 1576696799048, operation result: > true2019-12-19 03:19:59,049 TRACE [Scheduler 1528161485 Job > 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : > Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from > oldTs: 1576696795196, to newTs: 1576696799048, operation result: > true2019-12-19 03:20:02,775 INFO [BadQueryDetector] > service.BadQueryDetector:147 : Detect bad query.2019-12-19 03:20:03,254 TRACE > [Scheduler 1528161485 Job d33e65d0-4214-3b96-9256-607a730cb53f-880] > hbase.HBaseResourceStore:335 : Update row > /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: > 1576696793142, to newTs: 1576696803253, operation result: true2019-12-19 > 03:20:03,256 TRACE [Scheduler 1528161485 Job > d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : > Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from > oldTs: 1576696803253, to newTs: 1576696803255, operation result: > true2019-12-19 03:20:04,187 INFO [FetcherRunner 2062551616-143] > threadpool.DefaultFetcherRunner:85 : Job Fetcher: 2 should running, 2 actual > running, 0 stopped, 0 ready, 1754 already succeed, 7 error, 10 discarded, 0 > others2019-12-19 03:20:09,124 TRACE [Scheduler 1528161485 Job > 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : > Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from > oldTs: 1576696799048, to newTs: 1576696809122, operation result: > true2019-12-19 03:20:09,125 TRACE [Scheduler 1528161485 Job > 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : > Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from > oldTs: 1576696809122, to newTs: 1576696809124, operation result: > true2019-12-19 03:20:13,263 TRACE [Scheduler 1528161485 Job > d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : > Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from > oldTs: 1576696803255, to newTs: 1576696813262, operation result: > true2019-12-19 03:20:24,133 INFO [Scheduler 1528161485 Job > 4759f195-9443-2604-5358-d5804d9aaefc-879] > persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1 > seconds.2019-12-19 03:20:25,228 TRACE [Scheduler 1528161485 Job > 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : > Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from > oldTs: 1576696809124, to newTs: 1576696819131, operation result: > false2019-12-19 03:20:25,284 TRACE [Scheduler 1528161485 Job > d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : > Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from > oldTs: 1576696813262, to newTs: 1576696825245, operation result: > true2019-12-19 03:20:25,299 ERROR [Scheduler 1528161485 Job > 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 : > error execute MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, > name=Build N-Dimension Cuboid : level 1, > state=RUNNING}org.apache.kylin.common.persistence.WriteConflictException: > Overwriting conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, > expect old TS 1576696809124, but it is 1576696819131 at > org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338) > at > org.apache.kylin.common.persistence.ResourceStore.lambda$checkAndPutResourceWithRetry$3(ResourceStore.java:423) > at > org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52) > at > org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceWithRetry(ResourceStore.java:423) > at > org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceCheckpoint(ResourceStore.java:411) > at > org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:405) > at > org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:388) > at > org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:368) > at > org.apache.kylin.job.dao.ExecutableDao.writeJobOutputResource(ExecutableDao.java:252) > at > org.apache.kylin.job.dao.ExecutableDao.updateJobOutput(ExecutableDao.java:426) > at > org.apache.kylin.job.execution.ExecutableManager.addJobInfo(ExecutableManager.java:566) > at > org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:163) > at > org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167) > at > org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71) > at > org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167) > at > org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > > At 03:20:24,133, a thread "Scheduler 1528161485 Job > 4759f195-9443-2604-5358-d5804d9aaefc-879" is trying to write content to the > path "/execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08" but the > exception was raised for some reasons. We don't know what kind of exception > happened because the exception was eaten by doWithRetry. > ExponentialBackoffRetry will retry if the following exceptions happened: > SocketTimeoutException, or ConnectException, or RetriesExhaustedException or > any other exceptions configured in > "kylin.resourcestore.connection-exceptions". > By default the retry policy is disabled, but we have enabled it by setting > "kylin.resourcestore.reconnect-enabled". > > {code:java} > // A successful written, 1576696809122 -> 1576696809124// A successful > written, 1576696809122 -> 15766968091242019-12-19 03:20:09,125 TRACE > [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] > hbase.HBaseResourceStore:335 : Update row > /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: > 1576696809122, to newTs: 1576696809124, operation result: true > // A written with exception but actually persistence to HBase, 1576696809124 > -> 1576696819131// ExponentialBackoffRetry.doWithRetry catched the exception, > and then back-off2019-12-19 03:20:24,133 INFO [Scheduler 1528161485 Job > 4759f195-9443-2604-5358-d5804d9aaefc-879] > persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1 > seconds. > // Retry as back-off time reached. But this time, the invoking of checkAndPut > failed because the content is already written.2019-12-19 03:20:25,228 TRACE > [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] > hbase.HBaseResourceStore:335 : Update row > /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: > 1576696809124, to newTs: 1576696819131, operation result: false > // checkAndPutResourceImpl throws WriteConflictException2019-12-19 > 03:20:25,299 ERROR [Scheduler 1528161485 Job > 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 : > error execute MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, > name=Build N-Dimension Cuboid : level 1, > state=RUNNING}org.apache.kylin.common.persistence.WriteConflictException: > Overwriting conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, > expect old TS 1576696809124, but it is 1576696819131 at > org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338) > at > org.apache.kylin.common.persistence.ResourceStore.lambda$checkAndPutResourceWithRetry$3(ResourceStore.java:423) > at > org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52) > at > org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceWithRetry(ResourceStore.java:423) > at > org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceCheckpoint(ResourceStore.java:411) > at > org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:405) > at > org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:388) > at > org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:368) > at > org.apache.kylin.job.dao.ExecutableDao.writeJobOutputResource(ExecutableDao.java:252) > at > org.apache.kylin.job.dao.ExecutableDao.updateJobOutput(ExecutableDao.java:426) > at > org.apache.kylin.job.execution.ExecutableManager.addJobInfo(ExecutableManager.java:566) > at > org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:163) > at > org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167) > at > org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71) > at > org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167) > at > org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > > As shown in the log, the thread failed to call checkAndPutResourceImpl and > sleep 1 second to back-off at "2019-12-19 03:20:25,228". > At 03:20:25,299, the WriteConflictException was raised. The message is > "expect old TS 1576696809124, but it is 1576696819131". Obviously, the thread > is trying to update with the same timestamp. That proves the first written is > successful but throws exceptions. And then, that exception was caught by > doWithRetry. > I'm not a HBase expert, the picture above shows my rough guess: !Screen Shot > 2019-12-19 at 12.16.05.png! > Thus, we think ExponentialBackoffRetry implemented an incorrect retry policy. > Determining whether to retry only by the type of exception is not reliable. > At least, we should pull the last modified time to verify the written is > successful or not. Right? > -- This message was sent by Atlassian Jira (v8.3.4#803005)