Hi Karl,

thanks for the fix. However, it is a bit difficult to try because I do not have a test system with the same setup. Before applying it, I am going to log all output from ManifoldCF to check whether any error is visible when a job completes and then restarts unexpectedly.
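To make the failure mode discussed below easier to follow, here is a minimal Python model of the "last checked" logic Karl describes further down: the time is updated immediately when the start check answers "no", but only via the completion path when a job actually runs. All names and numbers are made up for illustration; this is not ManifoldCF code.

```python
# Toy model of the Job start thread's check, as described in the thread.
# Illustrative only; the real logic lives in ManifoldCF's JobManager.

class Job:
    def __init__(self, window_start):
        self.window_start = window_start  # schedule window start (ms)
        self.last_checked = 0             # the job's "last checked" time (ms)
        self.running = False

    def time_match(self, now):
        # "Time match FOUND" when the window start falls in (last_checked, now].
        return self.last_checked < self.window_start <= now

    def check_start(self, now):
        """Periodic check. On "no", last_checked advances immediately;
        on "yes", it advances only later, in the completion path."""
        if self.time_match(now):
            self.running = True           # start; last_checked NOT updated yet
            return True
        self.last_checked = now           # answer was "no": update right away
        return False

    def finish_job(self, now):
        """Normal termination sequence (finishJob): updates last_checked."""
        self.running = False
        self.last_checked = now

    def abort_without_finish(self):
        """Broken path: the job stops without reaching finishJob()."""
        self.running = False              # last_checked stays stale!

# Normal run: completion advances last_checked, so the next check is "no".
good = Job(window_start=1000)
good.check_start(1005)                    # match -> job starts
good.finish_job(1400)                     # completion updates last_checked
assert good.check_start(1405) is False    # no match: job stays idle

# Broken run: finishJob() is skipped, last_checked stays stale,
# so the same window keeps matching and the job restarts.
bad = Job(window_start=1000)
bad.check_start(1005)                     # match -> job starts
bad.abort_without_finish()                # termination sequence skipped
assert bad.check_start(1405) is True      # match AGAIN: job restarts
```

This reproduces the pattern in the logs below: in the bad case the "last checked" value stays frozen (1391446794075), so the Job start thread keeps finding a time match and re-launching the job.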
Best,
Florian

> Any luck with this?
> Karl
>
> On Tue, Feb 4, 2014 at 4:15 PM, Karl Wright <[email protected]> wrote:
>
>> I've created a branch at:
>> https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-880 .
>> This contains my proposed fix; please try it out. If you would like, I can
>> also attach a patch, although I'm not certain it would apply properly onto
>> MCF 1.4.1 sources.
>>
>> Karl
>>
>> On Tue, Feb 4, 2014 at 2:37 PM, Karl Wright <[email protected]> wrote:
>>
>>> Hi Florian,
>>>
>>> I'm pretty sure now that what is happening is that your output connector
>>> is throwing some kind of exception when it is asked to remove documents
>>> during the cleanup phase of the crawl. The state transitions in the
>>> framework seem to be incorrect under these conditions, and the error is
>>> likely not logged into the job's error field. The ticket I've created to
>>> address this is CONNECTORS-880.
>>>
>>> Karl
>>>
>>> On Tue, Feb 4, 2014 at 2:14 PM, Karl Wright <[email protected]> wrote:
>>>
>>>> The code path for an abort sequence looks pretty iron-clad. The fact
>>>> that the bad-case output:
>>>>
>>>> >>>>>>
>>>> DEBUG 2014-02-03 18:27:45,387 (Finisher thread) - Marked job 1385573203052 for shutdown
>>>> DEBUG 2014-02-03 18:27:52,737 (Job notification thread) - Found job 1385573203052 in need of notification
>>>> <<<<<<
>>>>
>>>> does not include:
>>>>
>>>> >>>>>>
>>>> DEBUG 2014-02-03 16:13:32,995 (Job reset thread) - Job 1385573203052 now completed
>>>> <<<<<<
>>>>
>>>> is very significant, because the last-check time would typically be
>>>> updated in that method, JobManager.finishJob(). If an abort took place,
>>>> it would have started BEFORE all this; once the job state gets set to
>>>> STATUS_SHUTTINGDOWN, there is no way that the job can be aborted either
>>>> manually or by repository-connector related activity. At that time the
>>>> job is cleaning up documents that are no longer reachable. I will check
>>>> to see what happens if the output connector throws an exception during
>>>> this phase; it's the only thing I can think of that might potentially
>>>> derail the job from finishing.
>>>>
>>>> Karl
>>>>
>>>> On Tue, Feb 4, 2014 at 1:29 PM, Karl Wright <[email protected]> wrote:
>>>>
>>>>> Hi Florian,
>>>>>
>>>>> The only way this can happen is if the proper job termination state
>>>>> sequence does not take place. When MCF checks to see if a job should
>>>>> be started, if it determines that the answer is "no" it updates the
>>>>> job record immediately with a new "last checked" value. But if it
>>>>> starts the job, it waits for the job completion to take place before
>>>>> updating the job's "last checked" time. When a job aborts, at first
>>>>> glance it looks like it also does the right thing, but clearly that's
>>>>> not true, and there must be a bug somewhere in how this condition is
>>>>> handled.
>>>>>
>>>>> I'll create a ticket to research this. In the interim, I suggest you
>>>>> figure out why your job is aborting in the first place.
>>>>>
>>>>> Thanks,
>>>>> Karl
>>>>>
>>>>> On Tue, Feb 4, 2014 at 11:49 AM, Karl Wright <[email protected]> wrote:
>>>>>
>>>>>> Hi Florian,
>>>>>>
>>>>>> I do not expect errors to appear in the tomcat log.
>>>>>>
>>>>>> But this is interesting:
>>>>>>
>>>>>> Good:
>>>>>>
>>>>>> >>>>>>
>>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391439592120, and now it is 1391439602151
>>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Time match FOUND within interval 1391439592120 to 1391439602151
>>>>>> ...
>>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391440412615, and now it is 1391440427102
>>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - No time match found within interval 1391440412615 to 1391440427102
>>>>>> <<<<<<
>>>>>> "last checked" time for job is updated.
>>>>>>
>>>>>> Bad:
>>>>>>
>>>>>> >>>>>>
>>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391446804106
>>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Time match FOUND within interval 1391446794075 to 1391446804106
>>>>>> ...
>>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391447647733
>>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Time match FOUND within interval 1391446794075 to 1391447647733
>>>>>> <<<<<<
>>>>>> Note that the "last checked" time is NOT updated.
>>>>>>
>>>>>> I don't understand why, in one case, the "last checked" time is being
>>>>>> updated for the job, and is not in another case. I will look to see if
>>>>>> there is any way in the code that this can happen.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>> On Tue, Feb 4, 2014 at 10:45 AM, Florian Schmedding <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Karl,
>>>>>>>
>>>>>>> there are no errors in the Tomcat logs. Currently, the Manifold log
>>>>>>> contains only the job log messages (<property
>>>>>>> name="org.apache.manifoldcf.jobs" value="ALL"/>). I include two log
>>>>>>> snippets, one from a normal run, and one where the job got repeated
>>>>>>> two times. I noticed the thread sequence "Finisher - Job reset - Job
>>>>>>> notification" when the job finally terminates, and the thread
>>>>>>> sequence "Finisher - Job notification" when the job gets restarted
>>>>>>> again instead of terminating.
>>>>>>>
>>>>>>> DEBUG 2014-02-03 15:59:52,130 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391439582108, and now it is 1391439592119
>>>>>>> DEBUG 2014-02-03 15:59:52,131 (Job start thread) - No time match found within interval 1391439582108 to 1391439592119
>>>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391439592120, and now it is 1391439602151
>>>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Time match FOUND within interval 1391439592120 to 1391439602151
>>>>>>> DEBUG 2014-02-03 16:00:02,153 (Job start thread) - Job '1385573203052' is within run window at 1391439602151 ms. (which starts at 1391439600000 ms.)
>>>>>>> DEBUG 2014-02-03 16:00:02,288 (Job start thread) - Signalled for job start for job 1385573203052
>>>>>>> DEBUG 2014-02-03 16:00:11,319 (Startup thread) - Marked job 1385573203052 for startup
>>>>>>> DEBUG 2014-02-03 16:00:12,719 (Startup thread) - Job 1385573203052 is now started
>>>>>>> DEBUG 2014-02-03 16:13:30,234 (Finisher thread) - Marked job 1385573203052 for shutdown
>>>>>>> DEBUG 2014-02-03 16:13:32,995 (Job reset thread) - Job 1385573203052 now completed
>>>>>>> DEBUG 2014-02-03 16:13:37,541 (Job notification thread) - Found job 1385573203052 in need of notification
>>>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391440412615, and now it is 1391440427102
>>>>>>> DEBUG 2014-02-03 16:13:47,105 (Job start thread) - No time match found within interval 1391440412615 to 1391440427102
>>>>>>>
>>>>>>>
>>>>>>> DEBUG 2014-02-03 17:59:54,078 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446784053, and now it is 1391446794074
>>>>>>> DEBUG 2014-02-03 17:59:54,078 (Job start thread) - No time match found within interval 1391446784053 to 1391446794074
>>>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391446804106
>>>>>>> DEBUG 2014-02-03 18:00:04,109 (Job start thread) - Time match FOUND within interval 1391446794075 to 1391446804106
>>>>>>> DEBUG 2014-02-03 18:00:04,110 (Job start thread) - Job '1385573203052' is within run window at 1391446804106 ms. (which starts at 1391446800000 ms.)
>>>>>>> DEBUG 2014-02-03 18:00:04,178 (Job start thread) - Signalled for job start for job 1385573203052
>>>>>>> DEBUG 2014-02-03 18:00:11,710 (Startup thread) - Marked job 1385573203052 for startup
>>>>>>> DEBUG 2014-02-03 18:00:13,408 (Startup thread) - Job 1385573203052 is now started
>>>>>>> DEBUG 2014-02-03 18:14:04,286 (Finisher thread) - Marked job 1385573203052 for shutdown
>>>>>>> DEBUG 2014-02-03 18:14:06,777 (Job notification thread) - Found job 1385573203052 in need of notification
>>>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391447647733
>>>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Time match FOUND within interval 1391446794075 to 1391447647733
>>>>>>> DEBUG 2014-02-03 18:14:07,736 (Job start thread) - Job '1385573203052' is within run window at 1391447647733 ms. (which starts at 1391446800000 ms.)
>>>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391447657740
>>>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Time match FOUND within interval 1391446794075 to 1391447657740
>>>>>>> DEBUG 2014-02-03 18:14:17,744 (Job start thread) - Job '1385573203052' is within run window at 1391447657740 ms. (which starts at 1391446800000 ms.)
>>>>>>> DEBUG 2014-02-03 18:14:17,899 (Job start thread) - Signalled for job start for job 1385573203052
>>>>>>> DEBUG 2014-02-03 18:14:26,787 (Startup thread) - Marked job 1385573203052 for startup
>>>>>>> DEBUG 2014-02-03 18:14:28,636 (Startup thread) - Job 1385573203052 is now started
>>>>>>> DEBUG 2014-02-03 18:27:45,387 (Finisher thread) - Marked job 1385573203052 for shutdown
>>>>>>> DEBUG 2014-02-03 18:27:52,737 (Job notification thread) - Found job 1385573203052 in need of notification
>>>>>>> DEBUG 2014-02-03 18:27:59,356 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391446794075, and now it is 1391448479353
>>>>>>> DEBUG 2014-02-03 18:27:59,358 (Job start thread) - Time match FOUND within interval 1391446794075 to 1391448479353
>>>>>>> DEBUG 2014-02-03 18:27:59,358 (Job start thread) - Job '1385573203052' is within run window at 1391448479353 ms. (which starts at 1391446800000 ms.)
>>>>>>> DEBUG 2014-02-03 18:27:59,430 (Job start thread) - Signalled for job start for job 1385573203052
>>>>>>> DEBUG 2014-02-03 18:28:09,309 (Startup thread) - Marked job 1385573203052 for startup
>>>>>>> DEBUG 2014-02-03 18:28:10,727 (Startup thread) - Job 1385573203052 is now started
>>>>>>> DEBUG 2014-02-03 18:41:18,202 (Finisher thread) - Marked job 1385573203052 for shutdown
>>>>>>> DEBUG 2014-02-03 18:41:23,636 (Job reset thread) - Job 1385573203052 now completed
>>>>>>> DEBUG 2014-02-03 18:41:25,368 (Job notification thread) - Found job 1385573203052 in need of notification
>>>>>>> DEBUG 2014-02-03 18:41:32,403 (Job start thread) - Checking if job 1385573203052 needs to be started; it was last checked at 1391449283114, and now it is 1391449292400
>>>>>>> DEBUG 2014-02-03 18:41:32,403 (Job start thread) - No time match found within interval 1391449283114 to 1391449292400
>>>>>>>
>>>>>>>
>>>>>>> Do you need another log output?
>>>>>>>
>>>>>>> Best,
>>>>>>> Florian
>>>>>>>
>>>>>>> > Also, what does the log have to say? If there is an error aborting
>>>>>>> > the job, there should be some record of it in the manifoldcf.log.
>>>>>>> >
>>>>>>> > Thanks,
>>>>>>> > Karl
>>>>>>> >
>>>>>>> > On Tue, Feb 4, 2014 at 6:16 AM, Karl Wright <[email protected]> wrote:
>>>>>>> >
>>>>>>> >> Hi Florian,
>>>>>>> >>
>>>>>>> >> Please run the job manually, when outside the scheduling window or
>>>>>>> >> with the scheduling off. What is the reason for the job abort?
>>>>>>> >>
>>>>>>> >> Karl
>>>>>>> >>
>>>>>>> >> On Tue, Feb 4, 2014 at 3:30 AM, Florian Schmedding <[email protected]> wrote:
>>>>>>> >>
>>>>>>> >>> Hi Karl,
>>>>>>> >>>
>>>>>>> >>> yes, I've coincidentally seen "Aborted" in the end time column
>>>>>>> >>> when I refreshed the job status just after the number of active
>>>>>>> >>> documents was zero. At the next refresh the job was starting up.
>>>>>>> >>> After looking in the history I found out that it even started a
>>>>>>> >>> third time. You can see the history of a single day below (job
>>>>>>> >>> continue, end, start, stop, unwait, wait). The start method is
>>>>>>> >>> "Start at beginning of schedule window". Job invocation is
>>>>>>> >>> "complete". Hop count mode is "Delete unreachable documents".
>>>>>>> >>>
>>>>>>> >>> 02.03.2014 18:41 job end
>>>>>>> >>> 02.03.2014 18:28 job start
>>>>>>> >>> 02.03.2014 18:14 job start
>>>>>>> >>> 02.03.2014 18:00 job start
>>>>>>> >>> 02.03.2014 17:49 job end
>>>>>>> >>> 02.03.2014 17:27 job end
>>>>>>> >>> 02.03.2014 17:13 job start
>>>>>>> >>> 02.03.2014 17:00 job start
>>>>>>> >>> 02.03.2014 16:13 job end
>>>>>>> >>> 02.03.2014 16:00 job start
>>>>>>> >>> 02.03.2014 15:41 job end
>>>>>>> >>> 02.03.2014 15:27 job start
>>>>>>> >>> 02.03.2014 15:14 job start
>>>>>>> >>> 02.03.2014 15:00 job start
>>>>>>> >>> 02.03.2014 14:13 job end
>>>>>>> >>> 02.03.2014 14:00 job start
>>>>>>> >>> 02.03.2014 13:13 job end
>>>>>>> >>> 02.03.2014 13:00 job start
>>>>>>> >>> 02.03.2014 12:27 job end
>>>>>>> >>> 02.03.2014 12:14 job start
>>>>>>> >>> 02.03.2014 12:00 job start
>>>>>>> >>> 02.03.2014 11:13 job end
>>>>>>> >>> 02.03.2014 11:00 job start
>>>>>>> >>> 02.03.2014 10:13 job end
>>>>>>> >>> 02.03.2014 10:00 job start
>>>>>>> >>> 02.03.2014 09:29 job end
>>>>>>> >>> 02.03.2014 09:14 job start
>>>>>>> >>> 02.03.2014 09:00 job start
>>>>>>> >>>
>>>>>>> >>> Best,
>>>>>>> >>> Florian
>>>>>>> >>>
>>>>>>> >>> > Hi Florian,
>>>>>>> >>> >
>>>>>>> >>> > Jobs don't just abort randomly. Are you sure that the job
>>>>>>> >>> > aborted? Or did it just restart?
>>>>>>> >>> >
>>>>>>> >>> > As for "is this normal", it depends on how you have created
>>>>>>> >>> > your job. If you selected the "Start within schedule window"
>>>>>>> >>> > selection, MCF will restart the job whenever it finishes and
>>>>>>> >>> > run it until the end of the scheduling window.
>>>>>>> >>> >
>>>>>>> >>> > Karl
>>>>>>> >>> >
>>>>>>> >>> > On Mon, Feb 3, 2014 at 12:24 PM, Florian Schmedding <[email protected]> wrote:
>>>>>>> >>> >
>>>>>>> >>> >> Hi Karl,
>>>>>>> >>> >>
>>>>>>> >>> >> I've just observed that the job was started according to its
>>>>>>> >>> >> schedule and crawled all documents correctly (I've chosen to
>>>>>>> >>> >> re-ingest all documents before the run). However, after
>>>>>>> >>> >> finishing the last document (zero active documents) it was
>>>>>>> >>> >> somehow aborted and restarted immediately. Is this expected
>>>>>>> >>> >> behavior?
>>>>>>> >>> >>
>>>>>>> >>> >> Best,
>>>>>>> >>> >> Florian
>>>>>>> >>> >>
>>>>>>> >>> >> > Hi Florian,
>>>>>>> >>> >> >
>>>>>>> >>> >> > Based on this schedule, your crawls will be able to start
>>>>>>> >>> >> > whenever the hour turns. So they can start every hour on
>>>>>>> >>> >> > the hour. If the last crawl crossed an hour boundary, the
>>>>>>> >>> >> > next crawl will start immediately, I believe.
>>>>>>> >>> >> >
>>>>>>> >>> >> > Karl
>>>>>>> >>> >> >
>>>>>>> >>> >> > On Wed, Jan 15, 2014 at 1:04 PM, Florian Schmedding <[email protected]> wrote:
>>>>>>> >>> >> >
>>>>>>> >>> >> >> Hi Karl,
>>>>>>> >>> >> >>
>>>>>>> >>> >> >> these are the values:
>>>>>>> >>> >> >>
>>>>>>> >>> >> >> Priority: 5
>>>>>>> >>> >> >> Start method: Start at beginning of schedule window
>>>>>>> >>> >> >> Schedule type: Scan every document once
>>>>>>> >>> >> >> Minimum recrawl interval: Not applicable
>>>>>>> >>> >> >> Expiration interval: Not applicable
>>>>>>> >>> >> >> Reseed interval: Not applicable
>>>>>>> >>> >> >> Scheduled time: Any day of week at 12 am, 1 am, 2 am, 3 am, 4 am, 5 am, 6 am, 7 am, 8 am, 9 am, 10 am, 11 am, 12 pm, 1 pm, 2 pm, 3 pm, 4 pm, 5 pm, 6 pm, 7 pm, 8 pm, 9 pm, 10 pm, 11 pm
>>>>>>> >>> >> >> Maximum run time: No limit
>>>>>>> >>> >> >> Job invocation: Complete
>>>>>>> >>> >> >>
>>>>>>> >>> >> >> Maybe it is because I've changed the job from continuous
>>>>>>> >>> >> >> crawling to this schedule. I started it a few times
>>>>>>> >>> >> >> manually, too. I couldn't notice anything strange in the
>>>>>>> >>> >> >> job setup or in the respective entries in the database.
>>>>>>> >>> >> >>
>>>>>>> >>> >> >> Regards,
>>>>>>> >>> >> >> Florian
>>>>>>> >>> >> >>
>>>>>>> >>> >> >> > Hi Florian,
>>>>>>> >>> >> >> >
>>>>>>> >>> >> >> > I was unable to reproduce the behavior you described.
>>>>>>> >>> >> >> >
>>>>>>> >>> >> >> > Could you view your job, and post a screen shot of that
>>>>>>> >>> >> >> > page? I want to see what your schedule record(s) look
>>>>>>> >>> >> >> > like.
>>>>>>> >>> >> >> >
>>>>>>> >>> >> >> > Thanks,
>>>>>>> >>> >> >> > Karl
>>>>>>> >>> >> >> >
>>>>>>> >>> >> >> > On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright <[email protected]> wrote:
>>>>>>> >>> >> >> >
>>>>>>> >>> >> >> >> Hi Florian,
>>>>>>> >>> >> >> >>
>>>>>>> >>> >> >> >> I've never noted this behavior before. I'll see if I
>>>>>>> >>> >> >> >> can reproduce it here.
>>>>>>> >>> >> >> >>
>>>>>>> >>> >> >> >> Karl
>>>>>>> >>> >> >> >>
>>>>>>> >>> >> >> >> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding <[email protected]> wrote:
>>>>>>> >>> >> >> >>
>>>>>>> >>> >> >> >>> Hi Karl,
>>>>>>> >>> >> >> >>>
>>>>>>> >>> >> >> >>> the scheduled job seems to work as expected. However,
>>>>>>> >>> >> >> >>> it runs two times: it starts at the beginning of the
>>>>>>> >>> >> >> >>> scheduled time, finishes, and immediately starts
>>>>>>> >>> >> >> >>> again. After finishing the second run it waits for the
>>>>>>> >>> >> >> >>> next scheduled time. Why does it run two times? The
>>>>>>> >>> >> >> >>> start method is "Start at beginning of schedule
>>>>>>> >>> >> >> >>> window".
>>>>>>> >>> >> >> >>>
>>>>>>> >>> >> >> >>> Yes, you're right about the checking guarantee.
>>>>>>> >>> >> >> >>> Currently, our interval is long enough for a complete
>>>>>>> >>> >> >> >>> crawler run.
>>>>>>> >>> >> >> >>>
>>>>>>> >>> >> >> >>> Best,
>>>>>>> >>> >> >> >>> Florian
>>>>>>> >>> >> >> >>>
>>>>>>> >>> >> >> >>> > Hi Florian,
>>>>>>> >>> >> >> >>> >
>>>>>>> >>> >> >> >>> > It is impossible to *guarantee* that a document will
>>>>>>> >>> >> >> >>> > be checked, because if load on the crawler is high
>>>>>>> >>> >> >> >>> > enough, it will fall behind. But I will look into
>>>>>>> >>> >> >> >>> > adding the feature you request.
>>>>>>> >>> >> >> >>> >
>>>>>>> >>> >> >> >>> > Karl
>>>>>>> >>> >> >> >>> >
>>>>>>> >>> >> >> >>> > On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding <[email protected]> wrote:
>>>>>>> >>> >> >> >>> >
>>>>>>> >>> >> >> >>> >> Hi Karl,
>>>>>>> >>> >> >> >>> >>
>>>>>>> >>> >> >> >>> >> yes, in our case it is necessary to make sure that
>>>>>>> >>> >> >> >>> >> new documents are discovered and indexed within a
>>>>>>> >>> >> >> >>> >> certain interval. I have created a feature request
>>>>>>> >>> >> >> >>> >> on that. In the meantime we will try to use a
>>>>>>> >>> >> >> >>> >> scheduled job instead.
>>>>>>> >>> >> >> >>> >>
>>>>>>> >>> >> >> >>> >> Thanks for your help,
>>>>>>> >>> >> >> >>> >> Florian
>>>>>>> >>> >> >> >>> >>
>>>>>>> >>> >> >> >>> >> > Hi Florian,
>>>>>>> >>> >> >> >>> >> >
>>>>>>> >>> >> >> >>> >> > What you are seeing is "dynamic crawling"
>>>>>>> >>> >> >> >>> >> > behavior. The time between refetches of a
>>>>>>> >>> >> >> >>> >> > document is based on the history of fetches of
>>>>>>> >>> >> >> >>> >> > that document. The recrawl interval is the
>>>>>>> >>> >> >> >>> >> > initial time between document fetches, but if a
>>>>>>> >>> >> >> >>> >> > document does not change, the interval for the
>>>>>>> >>> >> >> >>> >> > document increases according to a formula.
>>>>>>> >>> >> >> >>> >> >
>>>>>>> >>> >> >> >>> >> > I would need to look at the code to be able to
>>>>>>> >>> >> >> >>> >> > give you the precise formula, but if you need a
>>>>>>> >>> >> >> >>> >> > limit on the amount of time between document
>>>>>>> >>> >> >> >>> >> > fetch attempts, I suggest you create a ticket and
>>>>>>> >>> >> >> >>> >> > I will look into adding that as a feature.
>>>>>>> >>> >> >> >>> >> >
>>>>>>> >>> >> >> >>> >> > Thanks,
>>>>>>> >>> >> >> >>> >> > Karl
>>>>>>> >>> >> >> >>> >> >
>>>>>>> >>> >> >> >>> >> > On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <[email protected]> wrote:
>>>>>>> >>> >> >> >>> >> >
>>>>>>> >>> >> >> >>> >> >> Hello,
>>>>>>> >>> >> >> >>> >> >>
>>>>>>> >>> >> >> >>> >> >> the parameters reseed interval and recrawl
>>>>>>> >>> >> >> >>> >> >> interval of a continuous crawling job are not
>>>>>>> >>> >> >> >>> >> >> quite clear to me. The documentation tells that
>>>>>>> >>> >> >> >>> >> >> the reseed interval is the time after which the
>>>>>>> >>> >> >> >>> >> >> seeds are checked again, and the recrawl
>>>>>>> >>> >> >> >>> >> >> interval is the time after which a document is
>>>>>>> >>> >> >> >>> >> >> checked for changes.
>>>>>>> >>> >> >> >>> >> >>
>>>>>>> >>> >> >> >>> >> >> However, we observed that the recrawl interval
>>>>>>> >>> >> >> >>> >> >> for a document increases after each check. On
>>>>>>> >>> >> >> >>> >> >> the other hand, the reseed interval seems to be
>>>>>>> >>> >> >> >>> >> >> set up correctly in the database metadata about
>>>>>>> >>> >> >> >>> >> >> the seed documents. Yet the web server does not
>>>>>>> >>> >> >> >>> >> >> receive requests each time the interval elapses
>>>>>>> >>> >> >> >>> >> >> but only after several intervals have elapsed.
>>>>>>> >>> >> >> >>> >> >>
>>>>>>> >>> >> >> >>> >> >> We are using a web connector. The web server
>>>>>>> >>> >> >> >>> >> >> does not tell the client to cache the documents.
>>>>>>> >>> >> >> >>> >> >> Any help would be appreciated.
>>>>>>> >>> >> >> >>> >> >>
>>>>>>> >>> >> >> >>> >> >> Best regards,
>>>>>>> >>> >> >> >>> >> >> Florian
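As a footnote to the "dynamic crawling" discussion at the bottom of this thread: the behavior of the refetch interval growing while a document stays unchanged can be pictured as a simple back-off. The doubling factor and the cap below are assumptions chosen for illustration; they are not ManifoldCF's actual formula, which Karl notes would have to be read from the code.

```python
def next_recrawl_interval(current_interval_ms, changed,
                          base_interval_ms=60_000,
                          growth_factor=2.0,
                          max_interval_ms=24 * 3600 * 1000):
    """Back-off sketch: reset to the base interval when the document changed,
    otherwise grow the interval up to a cap. Growth factor and cap are
    illustrative assumptions, not ManifoldCF's real formula."""
    if changed:
        return base_interval_ms
    return min(int(current_interval_ms * growth_factor), max_interval_ms)

# A document that never changes is refetched less and less often,
# which matches the observation that the web server sees requests
# only after several nominal intervals have elapsed:
interval = 60_000
schedule = []
for _ in range(5):
    schedule.append(interval)
    interval = next_recrawl_interval(interval, changed=False)
# schedule == [60000, 120000, 240000, 480000, 960000]
```

Under a scheme like this, only a change to the document (or an explicit cap, like the feature request discussed above) pulls the refetch interval back down.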
