Benoit Tellier created JAMES-3900:
-------------------------------------

             Summary: Running task updates stalled on the Distributed task manager
                 Key: JAMES-3900
                 URL: https://issues.apache.org/jira/browse/JAMES-3900
             Project: James Server
          Issue Type: Improvement
          Components: task
            Reporter: Benoit Tellier
             Fix For: 3.8.0


While performing a long reindexing, we encountered the following error:


{code:java}
reactor.core.Exceptions$ErrorCallbackNotImplemented: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT5S
Caused by: com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT5S
        at com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:207)
        at io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715)
        at io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)
        at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703)
        at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790)
        at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Unknown Source)
{code}


After this error, scheduled updates for the task no longer happen.

After investigation: errors raised while polling updates within SerialTaskManager are not handled, and Reactor's default behaviour in that case is to cancel the whole subscription.

We should likely handle this error and prevent it from aborting the overall process. I will propose a PR doing just that.
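
For illustration, a minimal sketch of the kind of per-poll error handling meant here, assuming the updates are polled from a periodic Reactor Flux (the class and method names below are placeholders, not the actual SerialTaskManager code):

{code:java}
// Sketch only: a single DriverTimeoutException raised by one poll should be
// logged and skipped instead of cancelling the whole polling subscription,
// which is what Reactor does by default when no error callback is registered.
import java.time.Duration;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;

public class PollingErrorHandlingSketch {
    private static final Logger LOGGER = LoggerFactory.getLogger(PollingErrorHandlingSketch.class);

    Flux<String> pollUpdates() {
        return Flux.interval(Duration.ofSeconds(30))
            .concatMap(tick -> readAdditionalInformation()
                // Handle the failure per poll: log it, emit nothing for this
                // tick, and let the next tick try again.
                .onErrorResume(e -> {
                    LOGGER.warn("Skipping one task update poll after an error", e);
                    return Mono.empty();
                }));
    }

    // Hypothetical placeholder for the Cassandra read that timed out in the stack trace above.
    private Mono<String> readAdditionalInformation() {
        return Mono.just("serialized additional information");
    }
}
{code}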

Also, using event sourcing for managing task updates is a somewhat debatable choice... At one update every 30s, a task generating 10KB of JSON per update (not uncommon, e.g. if a task generates a large error report...) and running for a week could easily generate 200MB of data, all read at consistency level SERIAL from Cassandra, which is likely too much to expect to be honest... (not mentioning the *massive* deserialization effort...)

As such, I propose to move the management of polling updates out of the aggregate and have a dedicated storage API for it. I will likely do this in a follow-up of this ticket...
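
To give an idea of the direction (all names below are hypothetical, this is not an existing James API), such a dedicated storage API could keep only the latest snapshot per task rather than a growing event history:

{code:java}
// Purely illustrative interface sketch. The idea is to store and read the
// *latest* additional information snapshot per task, instead of appending
// every 30s update to the event-sourced aggregate and replaying the full
// history on each read.
import reactor.core.publisher.Mono;

public interface TaskAdditionalInformationStore {
    // Upsert the latest snapshot for the given task.
    Mono<Void> save(String taskId, String serializedAdditionalInformation);

    // Read only the latest snapshot, no event replay involved.
    Mono<String> readLatest(String taskId);
}
{code}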




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
