Hey BFost,

Totally agreed here, and with Mike on it. This is an issue that we need to fix. Thanks to Mike and others for taking the time to document this, and I am +1 with Brian that, along with the documentation, we should think of a strategy to fix this and implement it in 0.5.

Mike, I think you offered to file a JIRA issue. Does that offer still stand? :)

Thanks!

Cheers,
Chris

On Apr 10, 2012, at 10:58 AM, Brian Foster wrote:

> hey chris,
>
> i believe mike is talking about the following case:
>
> 1) queue is full
> 2) scheduler pops job from queue and begins trying to find a node for job
> 3) queue now has 1 open slot
> 4) another job is given to the resource manager and is placed in the queue
> 5) queue is now full again
> 6) scheduler fails to schedule popped job
> 7) scheduler pushes job back into the queue
> 8) queue is full so exception is thrown and job is lost
>
> -brian
>
> On Apr 10, 2012, at 07:08 AM, "Mattmann, Chris A (388J)"
> <[email protected]> wrote:
>
>> Hi Mike,
>>
>> On Apr 9, 2012, at 9:12 AM, Cayanan, Michael D (388J) wrote:
>>
>> > Hey Chris,
>> >
>> > Comments are below.
>> >> "At the time of this writing, jobs that cannot be added to the queue
>> >> disappear...."
>> >>
>> >> I think we should be more clear than "disappear". They don't disappear.
>> >> The Scheduler will try and send a Job to the BatchMgr, and if there is
>> >> an exception, it tries to re-queue the Job back onto the JobStack. If
>> >> it's unable to do that, then there is an issue, but it at the very
>> >> least tries to re-queue the job if there was an issue.
>> >
>> > The reason this blurb was put into the wiki was because when Gabe and I
>> > were looking through the Resource Manager code, this is what looks to be
>> > happening. Check out the piece of code that tries to add a job:
>>
>> Reaching Max queue size is different than saying that jobs that cannot be
>> added to the queue disappear. I think we should explicitly state:
>>
>> "At the time of this writing, when the queue has reached the max queue
>> size, a message is logged by the Scheduler saying there is a Job Queue
>> Exception adding a job to the queue, and then the Job is dropped."
>>
>> I think that's more accurate based on your code walk. I was thinking
>> based on your above message that you were talking about Jobs that
>> couldn't be Scheduled for whatever reason (e.g., the Batch Mgr being
>> down, or a Batch Stub being down) in which case they are re-queued.
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: [email protected]
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
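
For anyone following the thread, the eight-step case Brian describes in the quoted message can be sketched as a small single-threaded simulation. This is only an illustration under stated assumptions: `BoundedJobQueue`, `JobQueueFull`, and `simulate_lost_job` are hypothetical names made up for this sketch, not the actual Resource Manager classes, and the interleaving that would really come from two threads is written out by hand in step order.

```python
from collections import deque


class JobQueueFull(Exception):
    """Raised when a job is pushed onto a queue that is already at capacity."""


class BoundedJobQueue:
    """Hypothetical stand-in for the job queue (not the real OODT API)."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.jobs = deque()

    def push(self, job):
        if len(self.jobs) >= self.max_size:
            raise JobQueueFull(f"queue full, cannot add {job}")
        self.jobs.append(job)

    def pop(self):
        return self.jobs.popleft()


def simulate_lost_job():
    """Replay the eight steps in order; return (jobs still queued, lost job)."""
    queue = BoundedJobQueue(max_size=2)
    queue.push("job-A")
    queue.push("job-B")         # 1) queue is full
    popped = queue.pop()        # 2) scheduler pops a job; 3) one slot opens
    queue.push("job-C")         # 4) new job arrives; 5) queue is full again
    scheduled = False           # 6) assume the scheduler finds no node
    lost = None
    if not scheduled:
        try:
            queue.push(popped)  # 7) scheduler tries to re-queue the job
        except JobQueueFull:
            lost = popped       # 8) exception is thrown and the job is dropped
    return list(queue.jobs), lost
```

Calling `simulate_lost_job()` leaves the two later jobs in the queue and returns the popped job as the lost one, which is the behavior the wiki blurb is trying to describe.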
