Hi Karl,

thanks for the fix. However, it is a bit difficult to try because I do not have a test system with the same setup. Before applying it, I am going to log all output from ManifoldCF to check whether any error is visible when a job completes and then restarts unexpectedly.
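To make the failure mode discussed below easier to follow, here is a minimal Python model of the "last checked" logic Karl describes further down: the time is updated immediately when the start check answers "no", but only via the completion path when a job actually runs. All names and numbers are made up for illustration; this is not ManifoldCF code.

```python
# Toy model of the Job start thread's check, as described in the thread.
# Illustrative only; the real logic lives in ManifoldCF's JobManager.

class Job:
    def __init__(self, window_start):
        self.window_start = window_start  # schedule window start (ms)
        self.last_checked = 0             # the job's "last checked" time (ms)
        self.running = False

    def time_match(self, now):
        # "Time match FOUND" when the window start falls in (last_checked, now].
        return self.last_checked < self.window_start <= now

    def check_start(self, now):
        """Periodic check. On "no", last_checked advances immediately;
        on "yes", it advances only later, in the completion path."""
        if self.time_match(now):
            self.running = True           # start; last_checked NOT updated yet
            return True
        self.last_checked = now           # answer was "no": update right away
        return False

    def finish_job(self, now):
        """Normal termination sequence (finishJob): updates last_checked."""
        self.running = False
        self.last_checked = now

    def abort_without_finish(self):
        """Broken path: the job stops without reaching finishJob()."""
        self.running = False              # last_checked stays stale!

# Normal run: completion advances last_checked, so the next check is "no".
good = Job(window_start=1000)
good.check_start(1005)                    # match -> job starts
good.finish_job(1400)                     # completion updates last_checked
assert good.check_start(1405) is False    # no match: job stays idle

# Broken run: finishJob() is skipped, last_checked stays stale,
# so the same window keeps matching and the job restarts.
bad = Job(window_start=1000)
bad.check_start(1005)                     # match -> job starts
bad.abort_without_finish()                # termination sequence skipped
assert bad.check_start(1405) is True      # match AGAIN: job restarts
```

This reproduces the pattern in the logs below: in the bad case the "last checked" value stays frozen (1391446794075), so the Job start thread keeps finding a time match and re-launching the job.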
Best,
Florian

> Any luck with this?
> Karl
>
> On Tue, Feb 4, 2014 at 4:15 PM, Karl Wright <[email protected]> wrote:
>
>> I've created a branch at:
>> https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-880 .
>> This contains my proposed fix; please try it out. If you would like, I can
>> also attach a patch, although I'm not certain it would apply properly onto
>> MCF 1.4.1 sources.
>>
>> Karl
>>
>> On Tue, Feb 4, 2014 at 2:37 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Florian,
>>>
>>> I'm pretty sure now that what is happening is that your output connector
>>> is throwing some kind of exception when it is asked to remove documents
>>> during the cleanup phase of the crawl. The state transitions in the
>>> framework seem to be incorrect under these conditions, and the error is
>>> likely not logged into the job's error field. The ticket I've created to
>>> address this is CONNECTORS-880.
>>>
>>> Karl
>>>
>>> On Tue, Feb 4, 2014 at 2:14 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> The code path for an abort sequence looks pretty iron-clad. The fact
>>>> that the bad-case output:
>>>>
>>>> >>>>>>
>>>> DEBUG 2014-02-03 18:27:45,387 (Finisher thread) - Marked job 1385573203052 for shutdown
>>>> DEBUG 2014-02-03 18:27:52,737 (Job notification thread) - Found job 1385573203052 in need of notification
>>>> <<<<<<
>>>>
>>>> does not include:
>>>>
>>>> >>>>>>
>>>> DEBUG 2014-02-03 16:13:32,995 (Job reset thread) - Job 1385573203052 now completed
>>>> <<<<<<
>>>>
>>>> is very significant, because the last-check time would typically be
>>>> updated in that method, JobManager.finishJob(). If an abort took place,
>>>> it would have started BEFORE all this; once the job state gets set to
>>>> STATUS_SHUTTINGDOWN, there is no way that the job can be aborted either
>>>> manually or by repository-connector related activity. At that time the
>>>> job is cleaning up documents that are no longer reachable. I will check
>>>> to see what happens if the output connector throws an exception during
>>>> this phase; it's the only thing I can think of that might potentially
>>>> derail the job from finishing.
>>>>
>>>> Karl
>>>>
>>>> On Tue, Feb 4, 2014 at 1:29 PM, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Hi Florian,
>>>>>
>>>>> The only way this can happen is if the proper job termination state
>>>>> sequence does not take place. When MCF checks to see if a job should
>>>>> be started, if it determines that the answer is "no" it updates the
>>>>> job record immediately with a new "last checked" value. But if it
>>>>> starts the job, it waits for the job completion to take place before
>>>>> updating the job's "last checked" time. When a job aborts, at first
>>>>> glance it looks like it also does the right thing, but clearly that's
>>>>> not true, and there must be a bug somewhere in how this condition is
>>>>> handled.
>>>>>
>>>>> I'll create a ticket to research this. In the interim, I suggest you
>>>>> figure out why your job is aborting in the first place.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>> On Tue, Feb 4, 2014 at 11:49 AM, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Hi Florian,
>>>>>>
>>>>>> I do not expect errors to appear in the tomcat log.
>>>>>>
>>>>>> But this is interesting:
>>>>>>
>>>>>> Good:
>>>>>>
>>>>>> >>>>>>
>>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391439592120, and now it is 1391439602151
>>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Time match FOUND within interval 1391439592120 to 1391439602151
>>>>>> ...
>>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391440412615, and now it is 1391440427102
>>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - No time match found within interval 1391440412615 to 1391440427102
>>>>>> <<<<<<
>>>>>> "last checked" time for job is updated.
>>>>>>
>>>>>> Bad:
>>>>>>
>>>>>> >>>>>>
>>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391446804106
>>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Time match FOUND within interval 1391446794075 to 1391446804106
>>>>>> ...
>>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391447647733
>>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Time match FOUND within interval 1391446794075 to 1391447647733
>>>>>> <<<<<<
>>>>>> Note that the "last checked" time is NOT updated.
>>>>>>
>>>>>> I don't understand why, in one case, the "last checked" time is being
>>>>>> updated for the job, and is not in another case. I will look to see if
>>>>>> there is any way in the code that this can happen.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Tue, Feb 4, 2014 at 10:45 AM, Florian Schmedding <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>>
>>>>>>> there are no errors in the Tomcat logs. Currently, the Manifold log
>>>>>>> contains only the job log messages (<property
>>>>>>> name="org.apache.manifoldcf.jobs" value="ALL"/>). I include two log
>>>>>>> snippets, one from a normal run, and one where the job got repeated
>>>>>>> two times. I noticed the thread sequence "Finisher - Job reset - Job
>>>>>>> notification" when the job finally terminates, and the thread
>>>>>>> sequence "Finisher - Job notification" when the job gets restarted
>>>>>>> again instead of terminating.
>>>>>>>
>>>>>>> DEBUG 2014-02-03 15:59:52,130 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391439582108, and now it is 1391439592119
>>>>>>> DEBUG 2014-02-03 15:59:52,131 (Job start thread) - No time match found within interval 1391439582108 to 1391439592119
>>>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391439592120, and now it is 1391439602151
>>>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Time match FOUND within interval 1391439592120 to 1391439602151
>>>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Job '1385573203052' is within run window at 1391439602151 ms. (which starts at 1391439600000 ms.)
>>>>>>> DEBUG 2014-02-03 16:00:02,288 (Job start thread) - Signalled for job start for job 1385573203052
>>>>>>> DEBUG 2014-02-03 16:00:11,319 (Startup thread) - Marked job 1385573203052 for startup
>>>>>>> DEBUG 2014-02-03 16:00:12,719 (Startup thread) - Job 1385573203052 is now started
>>>>>>> DEBUG 2014-02-03 16:13:30,234 (Finisher thread) - Marked job 1385573203052 for shutdown
>>>>>>> DEBUG 2014-02-03 16:13:32,995 (Job reset thread) - Job 1385573203052 now completed
>>>>>>> DEBUG 2014-02-03 16:13:37,541 (Job notification thread) - Found job 1385573203052 in need of notification
>>>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391440412615, and now it is 1391440427102
>>>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - No time match found within interval 1391440412615 to 1391440427102
>>>>>>>
>>>>>>>
>>>>>>> DEBUG 2014-02-03 17:59:54,078 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446784053, and now it is 1391446794074
>>>>>>> DEBUG 2014-02-03 17:59:54,078 (Job start thread) - No time match found within interval 1391446784053 to 1391446794074
>>>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391446804106
>>>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Time match FOUND within interval 1391446794075 to 1391446804106
>>>>>>> DEBUG 2014-02-03 18:00:04,110 (Job start thread) - Job '1385573203052' is within run window at 1391446804106 ms. (which starts at 1391446800000 ms.)
>>>>>>> DEBUG 2014-02-03 18:00:04,178 (Job start thread) - Signalled for job start for job 1385573203052
>>>>>>> DEBUG 2014-02-03 18:00:11,710 (Startup thread) - Marked job 1385573203052 for startup
>>>>>>> DEBUG 2014-02-03 18:00:13,408 (Startup thread) - Job 1385573203052 is now started
>>>>>>> DEBUG 2014-02-03 18:14:04,286 (Finisher thread) - Marked job 1385573203052 for shutdown
>>>>>>> DEBUG 2014-02-03 18:14:06,777 (Job notification thread) - Found job 1385573203052 in need of notification
>>>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391447647733
>>>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Time match FOUND within interval 1391446794075 to 1391447647733
>>>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Job '1385573203052' is within run window at 1391447647733 ms. (which starts at 1391446800000 ms.)
>>>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391447657740
>>>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Time match FOUND within interval 1391446794075 to 1391447657740
>>>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Job '1385573203052' is within run window at 1391447657740 ms. (which starts at 1391446800000 ms.)
>>>>>>> DEBUG 2014-02-03 18:14:17,899 (Job start thread) - Signalled for job start for job 1385573203052
>>>>>>> DEBUG 2014-02-03 18:14:26,787 (Startup thread) - Marked job 1385573203052 for startup
>>>>>>> DEBUG 2014-02-03 18:14:28,636 (Startup thread) - Job 1385573203052 is now started
>>>>>>> DEBUG 2014-02-03 18:27:45,387 (Finisher thread) - Marked job 1385573203052 for shutdown
>>>>>>> DEBUG 2014-02-03 18:27:52,737 (Job notification thread) - Found job 1385573203052 in need of notification
>>>>>>> DEBUG 2014-02-03 18:27:59,356 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391448479353
>>>>>>> DEBUG 2014-02-03 18:27:59,358 (Job start thread) - Time match FOUND within interval 1391446794075 to 1391448479353
>>>>>>> DEBUG 2014-02-03 18:27:59,358 (Job start thread) - Job '1385573203052' is within run window at 1391448479353 ms. (which starts at 1391446800000 ms.)
>>>>>>> DEBUG 2014-02-03 18:27:59,430 (Job start thread) - Signalled for job start for job 1385573203052
>>>>>>> DEBUG 2014-02-03 18:28:09,309 (Startup thread) - Marked job 1385573203052 for startup
>>>>>>> DEBUG 2014-02-03 18:28:10,727 (Startup thread) - Job 1385573203052 is now started
>>>>>>> DEBUG 2014-02-03 18:41:18,202 (Finisher thread) - Marked job 1385573203052 for shutdown
>>>>>>> DEBUG 2014-02-03 18:41:23,636 (Job reset thread) - Job 1385573203052 now completed
>>>>>>> DEBUG 2014-02-03 18:41:25,368 (Job notification thread) - Found job 1385573203052 in need of notification
>>>>>>> DEBUG 2014-02-03 18:41:32,403 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391449283114, and now it is 1391449292400
>>>>>>> DEBUG 2014-02-03 18:41:32,403 (Job start thread) - No time match found within interval 1391449283114 to 1391449292400
>>>>>>>
>>>>>>>
>>>>>>> Do you need another log output?
>>>>>>>
>>>>>>> Best,
>>>>>>> Florian
>>>>>>>
>>>>>>> > Also, what does the log have to say? If there is an error aborting
>>>>>>> > the job, there should be some record of it in the manifoldcf.log.
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Karl
>>>>>>> >
>>>>>>> > On Tue, Feb 4, 2014 at 6:16 AM, Karl Wright <[email protected]> wrote:
>>>>>>> >
>>>>>>> >> Hi Florian,
>>>>>>> >>
>>>>>>> >> Please run the job manually, when outside the scheduling window or
>>>>>>> >> with the scheduling off. What is the reason for the job abort?
>>>>>>> >>
>>>>>>> >> Karl
>>>>>>> >>
>>>>>>> >> On Tue, Feb 4, 2014 at 3:30 AM, Florian Schmedding <[email protected]> wrote:
>>>>>>> >>
>>>>>>> >>> Hi Karl,
>>>>>>> >>>
>>>>>>> >>> yes, I've coincidentally seen "Aborted" in the end time column
>>>>>>> >>> when I refreshed the job status just after the number of active
>>>>>>> >>> documents was zero. At the next refresh the job was starting up.
>>>>>>> >>> After looking in the history I found out that it even started a
>>>>>>> >>> third time. You can see the history of a single day below (job
>>>>>>> >>> continue, end, start, stop, unwait, wait). The start method is
>>>>>>> >>> "Start at beginning of schedule window". Job invocation is
>>>>>>> >>> "complete". Hop count mode is "Delete unreachable documents".
>>>>>>> >>>
>>>>>>> >>> 02.03.2014 18:41 job end
>>>>>>> >>> 02.03.2014 18:28 job start
>>>>>>> >>> 02.03.2014 18:14 job start
>>>>>>> >>> 02.03.2014 18:00 job start
>>>>>>> >>> 02.03.2014 17:49 job end
>>>>>>> >>> 02.03.2014 17:27 job end
>>>>>>> >>> 02.03.2014 17:13 job start
>>>>>>> >>> 02.03.2014 17:00 job start
>>>>>>> >>> 02.03.2014 16:13 job end
>>>>>>> >>> 02.03.2014 16:00 job start
>>>>>>> >>> 02.03.2014 15:41 job end
>>>>>>> >>> 02.03.2014 15:27 job start
>>>>>>> >>> 02.03.2014 15:14 job start
>>>>>>> >>> 02.03.2014 15:00 job start
>>>>>>> >>> 02.03.2014 14:13 job end
>>>>>>> >>> 02.03.2014 14:00 job start
>>>>>>> >>> 02.03.2014 13:13 job end
>>>>>>> >>> 02.03.2014 13:00 job start
>>>>>>> >>> 02.03.2014 12:27 job end
>>>>>>> >>> 02.03.2014 12:14 job start
>>>>>>> >>> 02.03.2014 12:00 job start
>>>>>>> >>> 02.03.2014 11:13 job end
>>>>>>> >>> 02.03.2014 11:00 job start
>>>>>>> >>> 02.03.2014 10:13 job end
>>>>>>> >>> 02.03.2014 10:00 job start
>>>>>>> >>> 02.03.2014 09:29 job end
>>>>>>> >>> 02.03.2014 09:14 job start
>>>>>>> >>> 02.03.2014 09:00 job start
>>>>>>> >>>
>>>>>>> >>> Best,
>>>>>>> >>> Florian
>>>>>>> >>>
>>>>>>> >>> > Hi Florian,
>>>>>>> >>> >
>>>>>>> >>> > Jobs don't just abort randomly. Are you sure that the job
>>>>>>> >>> > aborted? Or did it just restart?
>>>>>>> >>> >
>>>>>>> >>> > As for "is this normal", it depends on how you have created
>>>>>>> >>> > your job. If you selected the "Start within schedule window"
>>>>>>> >>> > selection, MCF will restart the job whenever it finishes and
>>>>>>> >>> > run it until the end of the scheduling window.
>>>>>>> >>> >
>>>>>>> >>> > Karl
>>>>>>> >>> >
>>>>>>> >>> > On Mon, Feb 3, 2014 at 12:24 PM, Florian Schmedding <[email protected]> wrote:
>>>>>>> >>> >
>>>>>>> >>> >> Hi Karl,
>>>>>>> >>> >>
>>>>>>> >>> >> I've just observed that the job was started according to its
>>>>>>> >>> >> schedule and crawled all documents correctly (I've chosen to
>>>>>>> >>> >> re-ingest all documents before the run). However, after
>>>>>>> >>> >> finishing the last document (zero active documents) it was
>>>>>>> >>> >> somehow aborted and restarted immediately. Is this expected
>>>>>>> >>> >> behavior?
>>>>>>> >>> >>
>>>>>>> >>> >> Best,
>>>>>>> >>> >> Florian
>>>>>>> >>> >>
>>>>>>> >>> >> > Hi Florian,
>>>>>>> >>> >> >
>>>>>>> >>> >> > Based on this schedule, your crawls will be able to start
>>>>>>> >>> >> > whenever the hour turns. So they can start every hour on
>>>>>>> >>> >> > the hour. If the last crawl crossed an hour boundary, the
>>>>>>> >>> >> > next crawl will start immediately, I believe.
>>>>>>> >>> >> >
>>>>>>> >>> >> > Karl
>>>>>>> >>> >> >
>>>>>>> >>> >> > On Wed, Jan 15, 2014 at 1:04 PM, Florian Schmedding <[email protected]> wrote:
>>>>>>> >>> >> >
>>>>>>> >>> >> >> Hi Karl,
>>>>>>> >>> >> >>
>>>>>>> >>> >> >> these are the values:
>>>>>>> >>> >> >>
>>>>>>> >>> >> >> Priority: 5
>>>>>>> >>> >> >> Start method: Start at beginning of schedule window
>>>>>>> >>> >> >> Schedule type: Scan every document once
>>>>>>> >>> >> >> Minimum recrawl interval: Not applicable
>>>>>>> >>> >> >> Expiration interval: Not applicable
>>>>>>> >>> >> >> Reseed interval: Not applicable
>>>>>>> >>> >> >> Scheduled time: Any day of week at 12 am, 1 am, 2 am, 3 am, 4 am, 5 am, 6 am, 7 am, 8 am, 9 am, 10 am, 11 am, 12 pm, 1 pm, 2 pm, 3 pm, 4 pm, 5 pm, 6 pm, 7 pm, 8 pm, 9 pm, 10 pm, 11 pm
>>>>>>> >>> >> >> Maximum run time: No limit
>>>>>>> >>> >> >> Job invocation: Complete
>>>>>>> >>> >> >>
>>>>>>> >>> >> >> Maybe it is because I've changed the job from continuous
>>>>>>> >>> >> >> crawling to this schedule. I started it a few times
>>>>>>> >>> >> >> manually, too. I couldn't notice anything strange in the
>>>>>>> >>> >> >> job setup or in the respective entries in the database.
>>>>>>> >>> >> >>
>>>>>>> >>> >> >> Regards,
>>>>>>> >>> >> >> Florian
>>>>>>> >>> >> >>
>>>>>>> >>> >> >> > Hi Florian,
>>>>>>> >>> >> >> >
>>>>>>> >>> >> >> > I was unable to reproduce the behavior you described.
>>>>>>> >>> >> >> >
>>>>>>> >>> >> >> > Could you view your job, and post a screen shot of that
>>>>>>> >>> >> >> > page? I want to see what your schedule record(s) look
>>>>>>> >>> >> >> > like.
>>>>>>> >>> >> >> >
>>>>>>> >>> >> >> > Thanks,
>>>>>>> >>> >> >> > Karl
>>>>>>> >>> >> >> >
>>>>>>> >>> >> >> > On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright <[email protected]> wrote:
>>>>>>> >>> >> >> >
>>>>>>> >>> >> >> >> Hi Florian,
>>>>>>> >>> >> >> >>
>>>>>>> >>> >> >> >> I've never noted this behavior before. I'll see if I
>>>>>>> >>> >> >> >> can reproduce it here.
>>>>>>> >>> >> >> >>
>>>>>>> >>> >> >> >> Karl
>>>>>>> >>> >> >> >>
>>>>>>> >>> >> >> >> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding <[email protected]> wrote:
>>>>>>> >>> >> >> >>
>>>>>>> >>> >> >> >>> Hi Karl,
>>>>>>> >>> >> >> >>>
>>>>>>> >>> >> >> >>> the scheduled job seems to work as expected. However,
>>>>>>> >>> >> >> >>> it runs two times: it starts at the beginning of the
>>>>>>> >>> >> >> >>> scheduled time, finishes, and immediately starts
>>>>>>> >>> >> >> >>> again. After finishing the second run it waits for the
>>>>>>> >>> >> >> >>> next scheduled time. Why does it run two times? The
>>>>>>> >>> >> >> >>> start method is "Start at beginning of schedule
>>>>>>> >>> >> >> >>> window".
>>>>>>> >>> >> >> >>>
>>>>>>> >>> >> >> >>> Yes, you're right about the checking guarantee.
>>>>>>> >>> >> >> >>> Currently, our interval is long enough for a complete
>>>>>>> >>> >> >> >>> crawler run.
>>>>>>> >>> >> >> >>>
>>>>>>> >>> >> >> >>> Best,
>>>>>>> >>> >> >> >>> Florian
>>>>>>> >>> >> >> >>>
>>>>>>> >>> >> >> >>> > Hi Florian,
>>>>>>> >>> >> >> >>> >
>>>>>>> >>> >> >> >>> > It is impossible to *guarantee* that a document will
>>>>>>> >>> >> >> >>> > be checked, because if load on the crawler is high
>>>>>>> >>> >> >> >>> > enough, it will fall behind. But I will look into
>>>>>>> >>> >> >> >>> > adding the feature you request.
>>>>>>> >>> >> >> >>> >
>>>>>>> >>> >> >> >>> > Karl
>>>>>>> >>> >> >> >>> >
>>>>>>> >>> >> >> >>> > On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding <[email protected]> wrote:
>>>>>>> >>> >> >> >>> >
>>>>>>> >>> >> >> >>> >> Hi Karl,
>>>>>>> >>> >> >> >>> >>
>>>>>>> >>> >> >> >>> >> yes, in our case it is necessary to make sure that
>>>>>>> >>> >> >> >>> >> new documents are discovered and indexed within a
>>>>>>> >>> >> >> >>> >> certain interval. I have created a feature request
>>>>>>> >>> >> >> >>> >> on that. In the meantime we will try to use a
>>>>>>> >>> >> >> >>> >> scheduled job instead.
>>>>>>> >>> >> >> >>> >>
>>>>>>> >>> >> >> >>> >> Thanks for your help,
>>>>>>> >>> >> >> >>> >> Florian
>>>>>>> >>> >> >> >>> >>
>>>>>>> >>> >> >> >>> >> > Hi Florian,
>>>>>>> >>> >> >> >>> >> >
>>>>>>> >>> >> >> >>> >> > What you are seeing is "dynamic crawling"
>>>>>>> >>> >> >> >>> >> > behavior. The time between refetches of a
>>>>>>> >>> >> >> >>> >> > document is based on the history of fetches of
>>>>>>> >>> >> >> >>> >> > that document. The recrawl interval is the
>>>>>>> >>> >> >> >>> >> > initial time between document fetches, but if a
>>>>>>> >>> >> >> >>> >> > document does not change, the interval for the
>>>>>>> >>> >> >> >>> >> > document increases according to a formula.
>>>>>>> >>> >> >> >>> >> >
>>>>>>> >>> >> >> >>> >> > I would need to look at the code to be able to
>>>>>>> >>> >> >> >>> >> > give you the precise formula, but if you need a
>>>>>>> >>> >> >> >>> >> > limit on the amount of time between document
>>>>>>> >>> >> >> >>> >> > fetch attempts, I suggest you create a ticket and
>>>>>>> >>> >> >> >>> >> > I will look into adding that as a feature.
>>>>>>> >>> >> >> >>> >> >
>>>>>>> >>> >> >> >>> >> > Thanks,
>>>>>>> >>> >> >> >>> >> > Karl
>>>>>>> >>> >> >> >>> >> >
>>>>>>> >>> >> >> >>> >> > On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <[email protected]> wrote:
>>>>>>> >>> >> >> >>> >> >
>>>>>>> >>> >> >> >>> >> >> Hello,
>>>>>>> >>> >> >> >>> >> >>
>>>>>>> >>> >> >> >>> >> >> the parameters reseed interval and recrawl
>>>>>>> >>> >> >> >>> >> >> interval of a continuous crawling job are not
>>>>>>> >>> >> >> >>> >> >> quite clear to me. The documentation tells that
>>>>>>> >>> >> >> >>> >> >> the reseed interval is the time after which the
>>>>>>> >>> >> >> >>> >> >> seeds are checked again, and the recrawl
>>>>>>> >>> >> >> >>> >> >> interval is the time after which a document is
>>>>>>> >>> >> >> >>> >> >> checked for changes.
>>>>>>> >>> >> >> >>> >> >>
>>>>>>> >>> >> >> >>> >> >> However, we observed that the recrawl interval
>>>>>>> >>> >> >> >>> >> >> for a document increases after each check. On
>>>>>>> >>> >> >> >>> >> >> the other hand, the reseed interval seems to be
>>>>>>> >>> >> >> >>> >> >> set up correctly in the database metadata about
>>>>>>> >>> >> >> >>> >> >> the seed documents. Yet the web server does not
>>>>>>> >>> >> >> >>> >> >> receive requests each time the interval elapses
>>>>>>> >>> >> >> >>> >> >> but only after several intervals have elapsed.
>>>>>>> >>> >> >> >>> >> >>
>>>>>>> >>> >> >> >>> >> >> We are using a web connector. The web server
>>>>>>> >>> >> >> >>> >> >> does not tell the client to cache the documents.
>>>>>>> >>> >> >> >>> >> >> Any help would be appreciated.
>>>>>>> >>> >> >> >>> >> >>
>>>>>>> >>> >> >> >>> >> >> Best regards,
>>>>>>> >>> >> >> >>> >> >> Florian
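As a footnote to the "dynamic crawling" discussion at the bottom of this thread: the behavior of the refetch interval growing while a document stays unchanged can be pictured as a simple back-off. The doubling factor and the cap below are assumptions chosen for illustration; they are not ManifoldCF's actual formula, which Karl notes would have to be read from the code.

```python
def next_recrawl_interval(current_interval_ms, changed,
                          base_interval_ms=60_000,
                          growth_factor=2.0,
                          max_interval_ms=24 * 3600 * 1000):
    """Back-off sketch: reset to the base interval when the document changed,
    otherwise grow the interval up to a cap. Growth factor and cap are
    illustrative assumptions, not ManifoldCF's real formula."""
    if changed:
        return base_interval_ms
    return min(int(current_interval_ms * growth_factor), max_interval_ms)

# A document that never changes is refetched less and less often,
# which matches the observation that the web server sees requests
# only after several nominal intervals have elapsed:
interval = 60_000
schedule = []
for _ in range(5):
    schedule.append(interval)
    interval = next_recrawl_interval(interval, changed=False)
# schedule == [60000, 120000, 240000, 480000, 960000]
```

Under a scheme like this, only a change to the document (or an explicit cap, like the feature request discussed above) pulls the refetch interval back down.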
