There were actually two issues. First, the recovery mechanism was a relatively slow batch process, which made recovery take longer than necessary. We are fixing this by making it incremental instead of batch (think traditional file systems with fsck versus log-oriented file systems with concurrent repair).
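To illustrate the batch-versus-incremental distinction, here is a rough sketch in Python. It is not ODE's scheduler code; `recover_batch`, `IncrementalRecovery`, and the `chunk` parameter are hypothetical names chosen only for this example.

```python
from collections import deque
from typing import Callable, Deque, Iterable


def recover_batch(stored_jobs: Iterable[str], run: Callable[[str], None]) -> None:
    """Batch style: every stored job is replayed before anything else can run."""
    for job in list(stored_jobs):
        run(job)


class IncrementalRecovery:
    """Incremental style: replay at most `chunk` jobs per step, so the caller
    can serve new work between steps (hypothetical illustration only)."""

    def __init__(self, stored_jobs: Iterable[str], chunk: int = 2) -> None:
        self._pending: Deque[str] = deque(stored_jobs)
        self._chunk = chunk

    def step(self, run: Callable[[str], None]) -> int:
        """Replay up to `chunk` pending jobs and return how many were replayed."""
        replayed = 0
        while self._pending and replayed < self._chunk:
            run(self._pending.popleft())
            replayed += 1
        return replayed

    @property
    def done(self) -> bool:
        """True once no stored jobs remain to be replayed."""
        return not self._pending


if __name__ == "__main__":
    log = []
    recovery = IncrementalRecovery(["job-1", "job-2", "job-3"], chunk=2)
    while not recovery.done:
        recovery.step(log.append)
        log.append("(new request served)")  # normal work interleaves with recovery
    print(log)
```

With the batch version the server is busy until the whole backlog has been replayed; with the incremental version new requests get a turn after each small chunk.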
Second, there was an issue that created duplicate jobs <http://issues.apache.org/jira/browse/ODE-424> when a job failed. On systems where failures are frequent and repetitive, this could lead to a significant number of outstanding jobs and therefore exacerbate the first problem.

alex

On Tue, Nov 25, 2008 at 9:47 AM, Chris Taylor <[EMAIL PROTECTED]> wrote:

> Thanks, Alex. The problem description is a little confusing in this Jira,
> though. What is it that happens, exactly?
>
> ________________________________
> From: Alex Boisvert <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Tuesday, November 25, 2008 11:22:49 AM
> Subject: Re: Client calling retired process?
>
> On Tue, Nov 25, 2008 at 9:06 AM, Chris Taylor <[EMAIL PROTECTED]> wrote:
>
> > In the meantime, this is causing a secondary issue in that when we hit the
> > original OOM, we build up a lot of rescheduled jobs (sometimes well over a
> > hundred) apparently for requests that cannot be satisfied. When the server
> > starts up again, it immediately pegs at full capacity trying to satisfy
> > these. Other than deleting the rescheduled jobs from ODE_JOB, is there some
> > way to change the configuration of ODE to limit how many of these it
> > reschedules so as not to back it up?
>
> This was recently fixed on the Ode 1.x branch
> http://issues.apache.org/jira/browse/ODE-425
>
> but still needs to be ported to the trunk
> http://issues.apache.org/jira/browse/ODE-430
>
> I'm hoping it will happen in the next week or so.
>
> alex
