Hi Chunen,

I've created a Jira issue: https://issues.apache.org/jira/browse/KYLIN-4307.
We are working on a fix and will open a PR later.

Regards,
Liang

On Thu, Dec 19, 2019 at 2:52 PM nichunen <[email protected]> wrote:

> Hi Liang,
>
> I have closely read your analysis for this case, and I agree with you.
>
> Would you please open a Jira issue? A pull request for this is also
> welcome.
>
> Best regards,
>
>
>
> Ni Chunen / George
>
>
> On 12/19/2019 12:24, liang <[email protected]> wrote:
>
> Hi there,
>
> Today we encountered a build error. The step "Build N-Dimension Cuboid
> : level 1" failed with a WriteConflictException.
>
>
>> 2019-12-19 03:19:59,049 TRACE [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
>> oldTs: 1576696795196, to newTs: 1576696799048, operation result: true
>> 2019-12-19 03:20:02,775 INFO  [BadQueryDetector]
>> service.BadQueryDetector:147 : Detect bad query.
>> 2019-12-19 03:20:03,254 TRACE [Scheduler 1528161485 Job
>> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
>> oldTs: 1576696793142, to newTs: 1576696803253, operation result: true
>> 2019-12-19 03:20:03,256 TRACE [Scheduler 1528161485 Job
>> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
>> oldTs: 1576696803253, to newTs: 1576696803255, operation result: true
>> 2019-12-19 03:20:04,187 INFO  [FetcherRunner 2062551616-143]
>> threadpool.DefaultFetcherRunner:85 : Job Fetcher: 2 should running, 2
>> actual running, 0 stopped, 0 ready, 1754 already succeed, 7 error, 10
>> discarded, 0 others
>> 2019-12-19 03:20:09,124 TRACE [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
>> oldTs: 1576696799048, to newTs: 1576696809122, operation result: true
>> 2019-12-19 03:20:09,125 TRACE [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
>> oldTs: 1576696809122, to newTs: 1576696809124, operation result: true
>> 2019-12-19 03:20:13,263 TRACE [Scheduler 1528161485 Job
>> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
>> oldTs: 1576696803255, to newTs: 1576696813262, operation result: true
>> 2019-12-19 03:20:24,133 INFO  [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879]
>> persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1
>> seconds.
>> 2019-12-19 03:20:25,228 TRACE [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
>> oldTs: 1576696809124, to newTs: 1576696819131, operation result: false
>> 2019-12-19 03:20:25,284 TRACE [Scheduler 1528161485 Job
>> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
>> oldTs: 1576696813262, to newTs: 1576696825245, operation result: true
>> 2019-12-19 03:20:25,299 ERROR [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 :
>> error execute
>> MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, name=Build
>> N-Dimension Cuboid : level 1, state=RUNNING}
>> org.apache.kylin.common.persistence.WriteConflictException: Overwriting
>> conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, expect
>> old TS 1576696809124, but it is 1576696819131
>> at
>> org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)
>> at
>> org.apache.kylin.common.persistence.ResourceStore.lambda$checkAndPutResourceWithRetry$3(ResourceStore.java:423)
>> at
>> org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
>> at
>> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceWithRetry(ResourceStore.java:423)
>> at
>> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceCheckpoint(ResourceStore.java:411)
>> at
>> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:405)
>> at
>> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:388)
>> at
>> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:368)
>> at
>> org.apache.kylin.job.dao.ExecutableDao.writeJobOutputResource(ExecutableDao.java:252)
>> at
>> org.apache.kylin.job.dao.ExecutableDao.updateJobOutput(ExecutableDao.java:426)
>> at
>> org.apache.kylin.job.execution.ExecutableManager.addJobInfo(ExecutableManager.java:566)
>> at
>> org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:163)
>> at
>> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>> at
>> org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
>> at
>> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>> at
>> org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>
>
> At 03:20:24,133, the thread "Scheduler 1528161485 Job
> 4759f195-9443-2604-5358-d5804d9aaefc-879" was trying to write content to the
> path "/execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08", but an
> exception was raised for some reason.
>
> ExponentialBackoffRetry retries when one of the following exceptions occurs:
> SocketTimeoutException, ConnectException, RetriesExhaustedException,
> or any other exception configured in
> "kylin.resourcestore.connection-exceptions".
>
> By default the retry policy is disabled, but we have enabled it by setting
> "kylin.resourcestore.reconnect-enabled" because our HBase cluster is
> heavily loaded.
>
> Because the call to checkAndPutResourceImpl failed, the thread "Scheduler
> 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879" slept for 1 second
> to back off.
>
>
>> 2019-12-19 03:20:09,125 TRACE [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
>> oldTs: 1576696809122, to newTs: 1576696809124, operation result: true
>> 2019-12-19 03:20:13,263 TRACE [Scheduler 1528161485 Job
>> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
>> oldTs: 1576696803255, to newTs: 1576696813262, operation result: true
>> 2019-12-19 03:20:24,133 INFO  [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879]
>> persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1
>> seconds.
>> 2019-12-19 03:20:25,228 TRACE [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from
>> oldTs: 1576696809124, to newTs: 1576696819131, operation result: false
>> 2019-12-19 03:20:25,284 TRACE [Scheduler 1528161485 Job
>> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 :
>> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from
>> oldTs: 1576696813262, to newTs: 1576696825245, operation result: true
>> 2019-12-19 03:20:25,299 ERROR [Scheduler 1528161485 Job
>> 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 :
>> error execute
>> MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, name=Build
>> N-Dimension Cuboid : level 1, state=RUNNING}
>> org.apache.kylin.common.persistence.WriteConflictException: Overwriting
>> conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, expect
>> old TS 1576696809124, but it is 1576696819131
>> at
>> org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)
>>
>
>
> Here is our guess at what happened at that time:
>
> [image: Screen Shot 2019-12-19 at 12.16.05.png]
> Thus, we think ExponentialBackoffRetry implements an incorrect retry
> policy. Deciding whether to retry based only on the exception type is not
> reliable. At the very least, we should read the last modified time to verify
> whether the write succeeded. Right?
>
>
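
For illustration, here is a minimal Java sketch of the race the log appears to show and of the "verify before retrying" idea proposed above. It is not Kylin code: FlakyStore, naiveRetry, verifiedRetry and loseNextAck are hypothetical names, and the in-memory store only imitates a timestamp-guarded check-and-put. The assumption is that the first write was applied on the server (the row already carried newTs 1576696819131) while the client saw a timeout, so the retry compared against the stale oldTs and reported a conflict.

```java
import java.net.SocketTimeoutException;

// Hypothetical sketch; none of these names are Kylin or HBase APIs.
class VerifiedRetrySketch {

    /** In-memory stand-in for one metadata row guarded by its timestamp. */
    static class FlakyStore {
        private long ts;
        private boolean loseNextAck;

        FlakyStore(long initialTs) { this.ts = initialTs; }

        /** Make the next applied write throw, as if the response were lost. */
        void loseNextAck() { loseNextAck = true; }

        long currentTs() { return ts; }

        /** Compare-and-set on the timestamp; returns false on a conflict. */
        boolean checkAndPut(long oldTs, long newTs) throws SocketTimeoutException {
            if (ts != oldTs) {
                return false;
            }
            ts = newTs; // the write is applied...
            if (loseNextAck) {
                loseNextAck = false;
                // ...but the caller never learns that it succeeded.
                throw new SocketTimeoutException("ack lost after the write was applied");
            }
            return true;
        }
    }

    /** Retries based on the exception type only (back-off sleep omitted). */
    static boolean naiveRetry(FlakyStore store, long oldTs, long newTs, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return store.checkAndPut(oldTs, newTs);
            } catch (SocketTimeoutException e) {
                // fall through and try again with the same oldTs
            }
        }
        throw new IllegalStateException("retries exhausted");
    }

    /** Same loop, but checks whether the failed attempt actually landed. */
    static boolean verifiedRetry(FlakyStore store, long oldTs, long newTs, int maxAttempts) {
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return store.checkAndPut(oldTs, newTs);
            } catch (SocketTimeoutException e) {
                if (store.currentTs() == newTs) {
                    return true; // the earlier attempt was applied; do not retry
                }
            }
        }
        throw new IllegalStateException("retries exhausted");
    }
}
```

With loseNextAck() set, naiveRetry returns false (the conflict seen in the log) even though the row already holds newTs, while verifiedRetry returns true. Comparing timestamps alone cannot distinguish our own write from another writer that happened to use the same newTs, and a real store could also fail on the verification read, so an actual fix would likely need a stronger check.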
