Hi Chunen, I've created a Jira issue: https://issues.apache.org/jira/browse/KYLIN-4307. We are trying to fix the issue. And the PR will be created later.
Regards, Liang On Thu, Dec 19, 2019 at 2:52 PM nichunen <[email protected]> wrote: > Hi Liang, > > I have closely read your analysis for this case, and I agree with you. > > Would you please open a Jira issue? What’s more, pull request for this is > welcome. > > Best regards, > > > > Ni Chunen / George > > > On 12/19/2019 12:24,liang<[email protected]> <[email protected]> wrote: > > Hi there, > > Today we had encountered a build error. The step "Build N-Dimension Cuboid > : level 1" failed caused by WriteConflictException. > > >> 2019-12-19 03:19:59,049 TRACE [Scheduler 1528161485 Job >> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : >> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from >> oldTs: 1576696795196, to newTs: 1576696799048, operation result: true >> 2019-12-19 03:20:02,775 INFO [BadQueryDetector] >> service.BadQueryDetector:147 : Detect bad query. >> 2019-12-19 03:20:03,254 TRACE [Scheduler 1528161485 Job >> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : >> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from >> oldTs: 1576696793142, to newTs: 1576696803253, operation result: true >> 2019-12-19 03:20:03,256 TRACE [Scheduler 1528161485 Job >> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : >> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from >> oldTs: 1576696803253, to newTs: 1576696803255, operation result: true >> 2019-12-19 03:20:04,187 INFO [FetcherRunner 2062551616-143] >> threadpool.DefaultFetcherRunner:85 : Job Fetcher: 2 should running, 2 >> actual running, 0 stopped, 0 ready, 1754 already succeed, 7 error, 10 >> discarded, 0 others >> 2019-12-19 03:20:09,124 TRACE [Scheduler 1528161485 Job >> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : >> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from >> oldTs: 1576696799048, to newTs: 1576696809122, operation result: true >> 2019-12-19 03:20:09,125 TRACE [Scheduler 1528161485 Job >> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : >> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from >> oldTs: 1576696809122, to newTs: 1576696809124, operation result: true >> 2019-12-19 03:20:13,263 TRACE [Scheduler 1528161485 Job >> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : >> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from >> oldTs: 1576696803255, to newTs: 1576696813262, operation result: true >> 2019-12-19 03:20:24,133 INFO [Scheduler 1528161485 Job >> 4759f195-9443-2604-5358-d5804d9aaefc-879] >> persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1 >> seconds. >> 2019-12-19 03:20:25,228 TRACE [Scheduler 1528161485 Job >> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : >> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from >> oldTs: 1576696809124, to newTs: 1576696819131, operation result: false >> 2019-12-19 03:20:25,284 TRACE [Scheduler 1528161485 Job >> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : >> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from >> oldTs: 1576696813262, to newTs: 1576696825245, operation result: true >> 2019-12-19 03:20:25,299 ERROR [Scheduler 1528161485 Job >> 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 : >> error execute >> MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, name=Build >> N-Dimension Cuboid : level 1, state=RUNNING} >> org.apache.kylin.common.persistence.WriteConflictException: Overwriting >> conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, expect >> old TS 1576696809124, but it is 1576696819131 >> at >> org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338) >> at >> org.apache.kylin.common.persistence.ResourceStore.lambda$checkAndPutResourceWithRetry$3(ResourceStore.java:423) >> at >> org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52) >> at >> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceWithRetry(ResourceStore.java:423) >> at >> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceCheckpoint(ResourceStore.java:411) >> at >> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:405) >> at >> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:388) >> at >> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:368) >> at >> org.apache.kylin.job.dao.ExecutableDao.writeJobOutputResource(ExecutableDao.java:252) >> at >> org.apache.kylin.job.dao.ExecutableDao.updateJobOutput(ExecutableDao.java:426) >> at >> org.apache.kylin.job.execution.ExecutableManager.addJobInfo(ExecutableManager.java:566) >> at >> org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:163) >> at >> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167) >> at >> org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71) >> at >> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167) >> at >> org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114) >> at >> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) >> at >> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) >> at java.lang.Thread.run(Thread.java:745) >> > > > At 03:20:24,133, the thread "Scheduler 1528161485 Job > 4759f195-9443-2604-5358-d5804d9aaefc-879" is trying to write content to the > path "/execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08" but the > exception was raised for some reasons. > > ExponentialBackoffRetry will retry if the following exceptions happened: > SocketTimeoutException, or ConnectException, or RetriesExhaustedException > or any other exceptions configured in > "kylin.resourcestore.connection-exceptions" > > By default the retry policy is disabled, but we have enabled it by setting > "kylin.resourcestore.reconnect-enabled". Because we have a heavy loading > HBase cluster. > > Because failed to call checkAndPutResourceImpl, the thread "Scheduler > 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879" sleeps 1 second to > back-off. > > >> 2019-12-19 03:20:09,125 TRACE [Scheduler 1528161485 Job >> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : >> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from >> oldTs: 1576696809122, to newTs: 1576696809124, operation result: true >> 2019-12-19 03:20:13,263 TRACE [Scheduler 1528161485 Job >> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : >> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from >> oldTs: 1576696803255, to newTs: 1576696813262, operation result: true >> 2019-12-19 03:20:24,133 INFO [Scheduler 1528161485 Job >> 4759f195-9443-2604-5358-d5804d9aaefc-879] >> persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1 >> seconds. >> 2019-12-19 03:20:25,228 TRACE [Scheduler 1528161485 Job >> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : >> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from >> oldTs: 1576696809124, to newTs: 1576696819131, operation result: false >> 2019-12-19 03:20:25,284 TRACE [Scheduler 1528161485 Job >> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : >> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from >> oldTs: 1576696813262, to newTs: 1576696825245, operation result: true >> 2019-12-19 03:20:25,299 ERROR [Scheduler 1528161485 Job >> 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 : >> error execute >> MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, name=Build >> N-Dimension Cuboid : level 1, state=RUNNING} >> org.apache.kylin.common.persistence.WriteConflictException: Overwriting >> conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, expect >> old TS 1576696809124, but it is 1576696819131 >> at >> org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338) >> > > > Here is the guess of what's happended at that time: > > [image: Screen Shot 2019-12-19 at 12.16.05.png] > Thus, we think ExponentialBackoffRetry implemented an incorrect retry > policy. Determining whether to retry only by the type of exception is not > reliable. At least, we should pull the last modified time to verify the > written is successful or not. Right? > >
