Noel J. Bergman wrote:
The following is a summary of the problem.

  1) It occurs ONLY when using JDBCSpoolRepository for RemoteDelivery
  2) If there are more items in the spool than fit in the cache, it is
     possible to delay delivery for messages that ought to be delivered.
  3) If iterating through the cache takes more than one second, it is
     possible to spinloop.
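
Item 2 can be modeled in a few lines. This is a hypothetical sketch, not James's actual code: the names, the cache size of 2, and the per-message delays are all invented for illustration. The point is that a cache filled by `ORDER BY last_updated LIMIT maxcache` can hide a message whose own retry delay has already elapsed.

```python
# Hypothetical model (NOT James's actual code) of the JDBCSpoolRepository
# cache problem: the cache holds only the MAX_CACHE entries with the oldest
# last_updated, but RemoteDelivery schedules each message for
# last_updated + its own retry delay, so a message that is already due
# can fall outside the cached slice and sit undelivered.

MAX_CACHE = 2  # stand-in for the configurable cache size

def load_cache(spool):
    """Mimic 'SELECT key, last_updated ... ORDER BY last_updated LIMIT maxcache'."""
    return sorted(spool, key=lambda m: m["last_updated"])[:MAX_CACHE]

def accept(spool, now):
    """Scan only the cached slice for a message whose retry time has arrived."""
    for m in load_cache(spool):
        if m["last_updated"] + m["delay"] <= now:
            return m["key"]
    return None  # nothing eligible *in the cache*; the thread waits and retries

spool = [
    {"key": "a", "last_updated": 0,    "delay": 10_000},  # next attempt at 10000
    {"key": "b", "last_updated": 100,  "delay": 10_000},  # next attempt at 10100
    {"key": "c", "last_updated": 5000, "delay": 0},       # due at 5000
]

print(accept(spool, now=6000))  # prints: None -- "c" is due, but the cache only sees a and b
```

At `now=6000` message `c` is due for delivery, yet `accept` returns nothing, because the two oldest-by-`last_updated` entries fill the cache and neither is due yet.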

I'm investigating the problem further. It happened again today, even though I raised maxcache to 10000 and I had fewer than 10000 messages, so something weird is happening and I have to look at it more closely. Furthermore, I'm under the impression that I have a similar issue on the main spool manager too. Maybe there are multiple problems, so I have to fix some of them in order to check for the others.

There are a variety of approaches.  One is to fix it.  So far, neither
Stefano nor I (not that I've had much time to look, but he spent all day
on it) have come up with a trivial fix.  The kinds of fixes this code
needs would push the release back for weeks.  At that point I might as
well implement the right long-term change, planned for the next release,
rather than a one-off bandaid just to get v2.3 out.

The long-term change needs a DB change, and we decided to keep the DB structure unchanged until 3.0. So, IMHO, we need a fix for 2.3 and 2.4 that doesn't involve changing the DB to replace last_updated with next_processing_time or something similar.
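
For what it's worth, the appeal of that post-3.0 schema change can be sketched with the same kind of toy model. This is an assumed illustration, not a design that has been agreed on: if each row stored an explicit `next_processing_time`, the cache query could order by readiness directly, and a due message could never be pushed out of the cached window by older-but-not-due entries.

```python
# Hypothetical sketch of the proposed (post-3.0) fix: store an explicit
# next_processing_time per message, so the cache query orders by when a
# message is actually due rather than by last_updated.

MAX_CACHE = 2  # stand-in for the configurable cache size

def load_cache(spool):
    """Mimic 'SELECT key ... ORDER BY next_processing_time LIMIT maxcache'."""
    return sorted(spool, key=lambda m: m["next_processing_time"])[:MAX_CACHE]

def accept(spool, now):
    """The soonest-due messages are always in the cache, so a due one is found."""
    for m in load_cache(spool):
        if m["next_processing_time"] <= now:
            return m["key"]
    return None

spool = [
    {"key": "a", "next_processing_time": 10_000},
    {"key": "b", "next_processing_time": 10_100},
    {"key": "c", "next_processing_time": 5_000},
]

print(accept(spool, now=6000))  # prints: c -- the due message now sorts to the front
```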

Alternatively, we could add a configuration parameter for the hardcoded
timeout value (there is already one for the cache size), document the
potential problem, and release JAMES v2.3.

IMHO the problem is not the timeout: the timeout is there to prevent all the threads from running the same query on the repository when there are no messages. Without the timeout you would need 50 queries to decide that there is nothing to do; with the timeout this is avoided. Increasing the timeout is a hack, and it would work only because we already have another hack whereby our threads wake up every 60 seconds (we don't do this for file repositories, which behave better with respect to this issue).
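
The role of that timeout is roughly a wait-with-timeout guard. The following is a simplified, assumed sketch (class name, API, and timeouts are invented, and real delivery threads would loop): an idle thread that finds the spool empty blocks on a shared condition instead of re-querying, and either a newly added message or the timeout wakes it up.

```python
import threading

# Assumed, simplified sketch of the wait-with-timeout guard: after one
# query finds nothing, idle delivery threads sleep on a shared condition
# rather than hammering the repository with the same query.

class SpoolGuard:
    def __init__(self):
        self._cond = threading.Condition()
        self._pending = []  # stand-in for the spool's ready messages

    def add(self, key):
        with self._cond:
            self._pending.append(key)
            self._cond.notify_all()  # wake any idle delivery threads

    def accept(self, timeout):
        with self._cond:
            if not self._pending:
                # One query found nothing; sleep until notified or timed out
                # instead of immediately querying again.
                self._cond.wait(timeout)
            return self._pending.pop(0) if self._pending else None

guard = SpoolGuard()
print(guard.accept(timeout=0.01))  # prints: None (empty spool; one wait, one miss)
guard.add("m1")
print(guard.accept(timeout=0.01))  # prints: m1
```

Removing the wait entirely would turn that single miss into a busy loop of repository queries, which is the 50-queries scenario described above.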

I do not want to just remove the cache, which is one of Stefano's
suggestions.  The cache prevents JAMES from crashing when the message
arrival rate is higher than it can process.  Throwing OOMs and possibly
discarding messages in the process is not acceptable.

I think the behaviour we have now is buggy and difficult both to understand and to fix. I want to have it fixed on my system before deciding what to do about 2.3.0.

And my preferred solution, now, is not removing the cache but completely rewriting the caching algorithm and the accept mechanism without changing the DB.

Recognize that part of the problem is the conflation of the RemoteDelivery
spool and the main pipeline spool, which have different requirements, since
the former applies scheduling on top of the spool.  Again, that's on the
roadmap to change, but wasn't planned for v2.3.

        --- Noel

Well, we have a bug and we may need to change the original plan.
I still think there is more to this issue to be discovered, so I will talk about possible solutions later, once I've investigated this hard issue a little further.

Stefano

