Hi Florian,

Please run the job manually, either outside the scheduling window or with the schedule turned off. What is the reason for the job abort?
Karl

On Tue, Feb 4, 2014 at 3:30 AM, Florian Schmedding <[email protected]> wrote:

> Hi Karl,
>
> yes, I've coincidentally seen "Aborted" in the end time column when I
> refreshed the job status just after the number of active documents was
> zero. At the next refresh the job was starting up. After looking in the
> history I found out that it even started a third time. You can see the
> history of a single day below (job continue, end, start, stop, unwait,
> wait). The start method is "Start at beginning of schedule window". Job
> invocation is "complete". Hop count mode is "Delete unreachable
> documents".
>
> 02.03.2014 18:41  job end
> 02.03.2014 18:28  job start
> 02.03.2014 18:14  job start
> 02.03.2014 18:00  job start
> 02.03.2014 17:49  job end
> 02.03.2014 17:27  job end
> 02.03.2014 17:13  job start
> 02.03.2014 17:00  job start
> 02.03.2014 16:13  job end
> 02.03.2014 16:00  job start
> 02.03.2014 15:41  job end
> 02.03.2014 15:27  job start
> 02.03.2014 15:14  job start
> 02.03.2014 15:00  job start
> 02.03.2014 14:13  job end
> 02.03.2014 14:00  job start
> 02.03.2014 13:13  job end
> 02.03.2014 13:00  job start
> 02.03.2014 12:27  job end
> 02.03.2014 12:14  job start
> 02.03.2014 12:00  job start
> 02.03.2014 11:13  job end
> 02.03.2014 11:00  job start
> 02.03.2014 10:13  job end
> 02.03.2014 10:00  job start
> 02.03.2014 09:29  job end
> 02.03.2014 09:14  job start
> 02.03.2014 09:00  job start
>
> Best,
> Florian
>
>> Hi Florian,
>>
>> Jobs don't just abort randomly. Are you sure that the job aborted? Or
>> did it just restart?
>>
>> As for "is this normal", it depends on how you have created your job.
>> If you selected the "Start within schedule window" selection, MCF will
>> restart the job whenever it finishes and run it until the end of the
>> scheduling window.
>>
>> Karl
>>
>> On Mon, Feb 3, 2014 at 12:24 PM, Florian Schmedding
>> <[email protected]> wrote:
>>
>>> Hi Karl,
>>>
>>> I've just observed that the job was started according to its schedule
>>> and crawled all documents correctly (I had chosen to re-ingest all
>>> documents before the run). However, after finishing the last document
>>> (zero active documents) it was somehow aborted and restarted
>>> immediately. Is this expected behavior?
>>>
>>> Best,
>>> Florian
>>>
>>>> Hi Florian,
>>>>
>>>> Based on this schedule, your crawls will be able to start whenever
>>>> the hour turns. So they can start every hour on the hour. If the
>>>> last crawl crossed an hour boundary, the next crawl will start
>>>> immediately, I believe.
>>>>
>>>> Karl
>>>>
>>>> On Wed, Jan 15, 2014 at 1:04 PM, Florian Schmedding
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi Karl,
>>>>>
>>>>> these are the values:
>>>>>
>>>>> Priority: 5
>>>>> Start method: Start at beginning of schedule window
>>>>> Schedule type: Scan every document once
>>>>> Minimum recrawl interval: Not applicable
>>>>> Expiration interval: Not applicable
>>>>> Reseed interval: Not applicable
>>>>> Scheduled time: Any day of week at 12 am, 1 am, 2 am, 3 am, 4 am,
>>>>> 5 am, 6 am, 7 am, 8 am, 9 am, 10 am, 11 am, 12 pm, 1 pm, 2 pm,
>>>>> 3 pm, 4 pm, 5 pm, 6 pm, 7 pm, 8 pm, 9 pm, 10 pm, 11 pm
>>>>> Maximum run time: No limit
>>>>> Job invocation: Complete
>>>>>
>>>>> Maybe it is because I've changed the job from continuous crawling
>>>>> to this schedule. I started it a few times manually, too. I
>>>>> couldn't notice anything strange in the job setup or in the
>>>>> respective entries in the database.
>>>>>
>>>>> Regards,
>>>>> Florian
>>>>>
>>>>>> Hi Florian,
>>>>>>
>>>>>> I was unable to reproduce the behavior you described.
>>>>>>
>>>>>> Could you view your job, and post a screen shot of that page? I
>>>>>> want to see what your schedule record(s) look like.
>>>>>>
>>>>>> Thanks,
>>>>>> Karl
>>>>>>
>>>>>> On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Florian,
>>>>>>>
>>>>>>> I've never noted this behavior before. I'll see if I can
>>>>>>> reproduce it here.
>>>>>>>
>>>>>>> Karl
>>>>>>>
>>>>>>> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Karl,
>>>>>>>>
>>>>>>>> the scheduled job seems to work as expected. However, it runs
>>>>>>>> twice: it starts at the beginning of the scheduled time,
>>>>>>>> finishes, and immediately starts again. After finishing the
>>>>>>>> second run it waits for the next scheduled time. Why does it
>>>>>>>> run twice? The start method is "Start at beginning of schedule
>>>>>>>> window".
>>>>>>>>
>>>>>>>> Yes, you're right about the checking guarantee. Currently, our
>>>>>>>> interval is long enough for a complete crawler run.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Florian
>>>>>>>>
>>>>>>>>> Hi Florian,
>>>>>>>>>
>>>>>>>>> It is impossible to *guarantee* that a document will be
>>>>>>>>> checked, because if load on the crawler is high enough, it
>>>>>>>>> will fall behind. But I will look into adding the feature you
>>>>>>>>> request.
>>>>>>>>>
>>>>>>>>> Karl
>>>>>>>>>
>>>>>>>>> On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Karl,
>>>>>>>>>>
>>>>>>>>>> yes, in our case it is necessary to make sure that new
>>>>>>>>>> documents are discovered and indexed within a certain
>>>>>>>>>> interval. I have created a feature request on that. In the
>>>>>>>>>> meantime we will try to use a scheduled job instead.
>>>>>>>>>>
>>>>>>>>>> Thanks for your help,
>>>>>>>>>> Florian
>>>>>>>>>>
>>>>>>>>>>> Hi Florian,
>>>>>>>>>>>
>>>>>>>>>>> What you are seeing is "dynamic crawling" behavior. The time
>>>>>>>>>>> between refetches of a document is based on the history of
>>>>>>>>>>> fetches of that document. The recrawl interval is the
>>>>>>>>>>> initial time between document fetches, but if a document
>>>>>>>>>>> does not change, the interval for the document increases
>>>>>>>>>>> according to a formula.
>>>>>>>>>>>
>>>>>>>>>>> I would need to look at the code to be able to give you the
>>>>>>>>>>> precise formula, but if you need a limit on the amount of
>>>>>>>>>>> time between document fetch attempts, I suggest you create
>>>>>>>>>>> a ticket and I will look into adding that as a feature.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Karl
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> the parameters reseed interval and recrawl interval of a
>>>>>>>>>>>> continuous crawling job are not quite clear to me.
The documentation > >> >> tells > >> >> >>> that > >> >> >>> >> the > >> >> >>> >> >> reseed interval is the time after which the seeds are > >> checked > >> >> >>> again, > >> >> >>> >> and > >> >> >>> >> >> the recrawl interval is the time after which a document is > >> >> >>> checked > >> >> >>> >> for > >> >> >>> >> >> changes. > >> >> >>> >> >> > >> >> >>> >> >> However, we observed that the recrawl interval for a > >> document > >> >> >>> >> increases > >> >> >>> >> >> after each check. On the other hand, the reseed interval > >> seems > >> >> to > >> >> >>> be > >> >> >>> >> set > >> >> >>> >> >> up correctly in the database metadata about the seed > >> >> documents. > >> >> >>> Yet > >> >> >>> >> the > >> >> >>> >> >> web server does not receive requests at each time the > >> interval > >> >> >>> >> elapses > >> >> >>> >> >> but > >> >> >>> >> >> only after several intervals have elapsed. > >> >> >>> >> >> > >> >> >>> >> >> We are using a web connector. The web server does not tell > >> the > >> >> >>> client > >> >> >>> >> to > >> >> >>> >> >> cache the documents. Any help would be appreciated. > >> >> >>> >> >> > >> >> >>> >> >> Best regards, > >> >> >>> >> >> Florian > >> >> >>> >> >> > >> >> >>> >> >> > >> >> >>> >> >> > >> >> >>> >> >> > >> >> >>> >> > > >> >> >>> >> > >> >> >>> >> > >> >> >>> >> > >> >> >>> > > >> >> >>> > >> >> >>> > >> >> >>> > >> >> >> > >> >> > > >> >> > >> >> > >> >> > >> > > >> > >> > >> > > > > >
