Noel J. Bergman wrote:
> The following is a summary of the problem:
>
> 1) It occurs ONLY when using JDBCSpoolRepository for RemoteDelivery.
> 2) If there are more items in the spool than fit in the cache, it is
>    possible to delay delivery for messages that ought to be delivered.
> 3) If iterating through the cache takes more than one second, it is
>    possible to spinloop.
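For anyone reading along without the code handy, the failure mode looks
roughly like the sketch below. The class and member names are invented
for illustration; the real logic lives in JDBCSpoolRepository's
accept(), and only the shape of the loop matters here.

    import java.util.Map;
    import java.util.TreeMap;

    // Illustrative only -- NOT the actual JDBCSpoolRepository source.
    class SpoolCacheSketch {
        // At most maxcache (eligible-time -> key) entries loaded from
        // the spool table; anything beyond that is invisible here.
        private final TreeMap<Long, String> cache =
            new TreeMap<Long, String>();

        synchronized String accept() throws InterruptedException {
            while (true) {
                long cutoff = System.currentTimeMillis();
                for (Map.Entry<Long, String> e : cache.entrySet()) {
                    if (e.getKey() <= cutoff) { // message is due now
                        cache.remove(e.getKey());
                        return e.getValue();
                    }
                }
                // Point 2: a message that is due but did not fit in the
                // cache is never seen here, so it sits in the database
                // for at least another timeout.
                // Point 3: if the scan above takes more than a second,
                // entries keep becoming due while we iterate, and the
                // threads keep re-scanning instead of really waiting.
                wait(1000L); // the hardcoded timeout mentioned below
            }
        }
    }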
I'm investigating the problem further. It happened again today, even
though I had raised maxcache to 10000 and had fewer than 10000 messages
in the spool, so something odd is going on and I need to look into it
more closely.
Furthermore, I'm under the impression that I have a similar issue with
the main spool manager as well... Maybe there are multiple problems, so
I'll have to fix some of them before I can check for the others.
> There are a variety of approaches. One is to fix it. So far neither
> Stefano nor I (not that I've had much time to look, but he spent all
> day on it) have come up with a trivial fix. The kinds of fixes this
> code needs would push the release back by weeks. At that point I might
> as well implement the right long-term change, planned for the next
> release, rather than a one-off band-aid for v2.3.
The long-term change needs a DB change, and we decided to keep the DB
structure unchanged until 3.0, so IMHO we need a fix for 2.3 and 2.4
that doesn't involve altering the DB to replace last_updated with
next_processing_time or something similar.
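To make the idea concrete: with such a column the database itself would
do the scheduling, roughly along the lines below. This is only a sketch
of the 3.0-era idea; the table and column names other than
next_processing_time are placeholders, not the real schema.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Timestamp;

    // Sketch of the long-term idea, not something proposed for
    // 2.3/2.4: store the scheduler's decision in the row itself so
    // the query returns only messages that are actually due.
    class NextProcessingTimeSketch {
        String nextDueKey(Connection conn) throws SQLException {
            PreparedStatement ps = conn.prepareStatement(
                "SELECT message_name FROM spool "
                + "WHERE next_processing_time <= ? "
                + "ORDER BY next_processing_time");
            try {
                ps.setTimestamp(1,
                    new Timestamp(System.currentTimeMillis()));
                ResultSet rs = ps.executeQuery();
                return rs.next() ? rs.getString(1) : null;
            } finally {
                ps.close(); // also closes the ResultSet
            }
        }
    }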
> Alternatively, we could add a configuration parameter for the
> hardcoded timeout value (there is already one for the cache size),
> document the potential problem, and release JAMES v2.3.
IMHO the problem is not the timeout: the timeout is there to prevent
all the threads from running the same query against the repository when
there are no messages. Without the timeout you would need 50 queries
just to decide that there is nothing to do; with the timeout that is
avoided. Increasing the timeout is a hack, and it would only work
because we already have another hack whereby our threads wake up every
60 seconds (we don't do this for the file repositories, which behave
better with respect to this issue).
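In miniature, the pattern the timeout protects looks something like the
sketch below. The names are invented, the pending list stands in for
the spool table, and 50 is just our delivery thread count.

    import java.util.LinkedList;

    // Illustrative consumer/producer pairing, not the real repository.
    class TimeoutSketch {
        private final LinkedList<String> pending =
            new LinkedList<String>();

        synchronized String accept() throws InterruptedException {
            while (pending.isEmpty()) {
                // All 50 delivery threads park here instead of each
                // re-running the same empty query in a tight loop; the
                // timeout is a safety net in case a notify is missed
                // (cf. the 60-second wake-up hack mentioned above).
                wait(1000L);
            }
            return pending.removeFirst();
        }

        synchronized void store(String key) {
            pending.add(key);
            notifyAll(); // wake a parked delivery thread immediately
        }
    }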
> I do not want to just remove the cache, which is one of Stefano's
> suggestions. The cache prevents JAMES from crashing when messages
> arrive faster than it can process them. Throwing OOMs, and possibly
> discarding messages in the process, is not acceptable.
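For context, the cache here is essentially a bounded buffer of message
keys: however large the spool table grows, only up to maxcache keys are
materialized per load. A rough sketch, with placeholder names:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.LinkedList;

    // Illustrative only: why the cache bounds memory use.
    class BoundedLoadSketch {
        private final LinkedList<String> cache =
            new LinkedList<String>();
        private final int maxCache = 1000; // configurable cache size

        void loadCache(Connection conn) throws SQLException {
            PreparedStatement ps = conn.prepareStatement(
                "SELECT message_name FROM spool ORDER BY last_updated");
            try {
                ps.setMaxRows(maxCache); // never pull more than fit
                ResultSet rs = ps.executeQuery();
                cache.clear();
                while (rs.next()) {
                    // keys only; message bodies stay in the DB
                    cache.add(rs.getString(1));
                }
            } finally {
                ps.close();
            }
        }
    }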
I think the behaviour we have now is buggy and difficult both to
understand and to fix. I want it fixed on my own system before deciding
what to do with 2.3.0.
And my preferred solution, at the moment, is not removing the cache but
a complete rewrite of the caching algorithm and the accept mechanism,
without changing the DB.
> Recognize that part of the problem is the conflation of the
> RemoteDelivery spool and the main pipeline spool, which have different
> requirements, since the former applies scheduling on top of the spool.
> Again, that's on the roadmap to change, but wasn't planned for v2.3.
>
> --- Noel
Well, we have a bug, and we may need to change the original plan.
I still think there is more to this issue to be discovered, so I will
talk about possible solutions later, once I have investigated this
hard problem a little more.
Stefano