[jira] [Closed] (JAMES-3900) Running task updates stalled on the Distributed task manager

Benoit Tellier (Jira) Tue, 25 Apr 2023 04:22:30 -0700


     [ 
https://issues.apache.org/jira/browse/JAMES-3900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Benoit Tellier closed JAMES-3900.
---------------------------------
    Resolution: Fixed

> Running task updates stalled on the Distributed task manager
> ------------------------------------------------------------
>
>                 Key: JAMES-3900
>                 URL: https://issues.apache.org/jira/browse/JAMES-3900
>             Project: James Server
>          Issue Type: Improvement
>          Components: task
>            Reporter: Benoit Tellier
>            Priority: Major
>             Fix For: 3.8.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> While performing a long reindexing, we encountered the following error:
> {code:java}
> reactor.core.Exceptions$ErrorCallbackNotImplemented: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT5S
> Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT5S
>       at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:207)
>       at io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715)
>       at io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
>       at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703)
>       at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790)
>       at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503)
>       at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>       at java.base/java.lang.Thread.run(Unknown Source)
> {code}
> After this, scheduled updates for the task no longer happen.
> After investigation, errors raised while polling updates within
> SerialTaskManager are not handled, so Reactor's default behaviour applies:
> the whole subscription is cancelled.
> We should likely handle this error and prevent it from aborting the overall
> process. I will propose a PR doing just this.
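
A minimal, dependency-free sketch of the idea described above, written in plain Java rather than Reactor and not taken from the James code base: each poll is wrapped so that a failure is recorded and skipped instead of ending the polling loop. The names `ResilientPolling` and `pollAll` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

// Hypothetical illustration, not the SerialTaskManager implementation.
final class ResilientPolling {

    // Runs every poll in order. A poll that throws is recorded in `failures`
    // and skipped, so the polls that come after it still run.
    static <T> List<T> pollAll(List<Callable<T>> polls, List<Exception> failures) {
        List<T> updates = new ArrayList<>();
        for (Callable<T> poll : polls) {
            try {
                updates.add(poll.call());
            } catch (Exception e) {
                // A real poller would log the error here (e.g. a query timeout).
                failures.add(e);
            }
        }
        return updates;
    }
}
```

In Reactor terms, the comparable approach would be to recover from the error inside each poll (for instance with an operator such as `onErrorResume`), so that the error never reaches the outer subscription.
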
> Also, using event sourcing to manage task updates is a somewhat debatable
> choice... At one update every 30s, a task generating 10KB of JSON per update
> (not uncommon, e.g. if a task generates a large error report...) and running
> for a week could easily generate 200MB of data being read at consistency
> level SERIAL from Cassandra, which is likely too much to expect, to be
> honest... (not mentioning the *massive* deserialization effort...)
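
The estimate above can be checked with a few lines of arithmetic. This sketch assumes that 10KB means 10,000 bytes and that every update has the same size; the class and method names are hypothetical.

```java
import java.time.Duration;

// Hypothetical helper for a back-of-the-envelope estimate.
final class UpdateVolumeEstimate {

    // Number of updates emitted over `runtime` at one update per `interval`.
    static long updateCount(Duration runtime, Duration interval) {
        return runtime.getSeconds() / interval.getSeconds();
    }

    // Total bytes accumulated if every update is `bytesPerUpdate` bytes.
    static long totalBytes(Duration runtime, Duration interval, long bytesPerUpdate) {
        return updateCount(runtime, interval) * bytesPerUpdate;
    }
}
```

For a one-week run, `updateCount(Duration.ofDays(7), Duration.ofSeconds(30))` gives 20,160 updates. At 10,000 bytes each that is 201,600,000 bytes, in line with the roughly 200MB quoted above.
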
> As such, I propose to move polling updates management out of the aggregate
> and have a dedicated storage API for it. I will likely do it in a follow-up
> to this ticket...
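
One possible shape for such a storage API, as a hedged in-memory sketch: keep only the latest update per task, so that a read returns a single document rather than the full history. `TaskUpdateStore`, its methods, and the use of `String` for task ids and JSON payloads are all assumptions made for illustration; the ticket does not specify the API.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical interface; not an existing James API.
interface TaskUpdateStore {
    void storeLatest(String taskId, String updateJson);

    Optional<String> readLatest(String taskId);
}

// In-memory stand-in for a real backend: each write overwrites the previous
// update for the task, so storage and reads stay bounded by one update.
final class InMemoryTaskUpdateStore implements TaskUpdateStore {
    private final Map<String, String> latest = new ConcurrentHashMap<>();

    @Override
    public void storeLatest(String taskId, String updateJson) {
        latest.put(taskId, updateJson);
    }

    @Override
    public Optional<String> readLatest(String taskId) {
        return Optional.ofNullable(latest.get(taskId));
    }
}
```

With this shape, the cost of reading a task's progress no longer grows with how long the task has been running, which is the concern raised about replaying every update event.
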



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
