Also, what does the log have to say? If there is an error aborting the job, there should be some record of it in the manifoldcf.log.
Thanks,
Karl

On Tue, Feb 4, 2014 at 6:16 AM, Karl Wright <[email protected]> wrote:
> Hi Florian,
>
> Please run the job manually, when outside the scheduling window or with
> the scheduling off. What is the reason for the job abort?
>
> Karl
>
> On Tue, Feb 4, 2014 at 3:30 AM, Florian Schmedding
> <[email protected]> wrote:
>> Hi Karl,
>>
>> yes, I've coincidentally seen "Aborted" in the end time column when I
>> refreshed the job status just after the number of active documents was
>> zero. At the next refresh the job was starting up. After looking in the
>> history I found out that it even started a third time. You can see the
>> history of a single day below (job continue, end, start, stop, unwait,
>> wait). The start method is "Start at beginning of schedule window". Job
>> invocation is "complete". Hop count mode is "Delete unreachable
>> documents".
>>
>> 02.03.2014 18:41  job end
>> 02.03.2014 18:28  job start
>> 02.03.2014 18:14  job start
>> 02.03.2014 18:00  job start
>> 02.03.2014 17:49  job end
>> 02.03.2014 17:27  job end
>> 02.03.2014 17:13  job start
>> 02.03.2014 17:00  job start
>> 02.03.2014 16:13  job end
>> 02.03.2014 16:00  job start
>> 02.03.2014 15:41  job end
>> 02.03.2014 15:27  job start
>> 02.03.2014 15:14  job start
>> 02.03.2014 15:00  job start
>> 02.03.2014 14:13  job end
>> 02.03.2014 14:00  job start
>> 02.03.2014 13:13  job end
>> 02.03.2014 13:00  job start
>> 02.03.2014 12:27  job end
>> 02.03.2014 12:14  job start
>> 02.03.2014 12:00  job start
>> 02.03.2014 11:13  job end
>> 02.03.2014 11:00  job start
>> 02.03.2014 10:13  job end
>> 02.03.2014 10:00  job start
>> 02.03.2014 09:29  job end
>> 02.03.2014 09:14  job start
>> 02.03.2014 09:00  job start
>>
>> Best,
>> Florian
>>
>>> Hi Florian,
>>>
>>> Jobs don't just abort randomly. Are you sure that the job aborted? Or
>>> did it just restart?
>>>
>>> As for "is this normal", it depends on how you have created your job.
>>> If you selected the "Start within schedule window" selection, MCF
>>> will restart the job whenever it finishes and run it until the end of
>>> the scheduling window.
>>>
>>> Karl
>>>
>>> On Mon, Feb 3, 2014 at 12:24 PM, Florian Schmedding
>>> <[email protected]> wrote:
>>>> Hi Karl,
>>>>
>>>> I've just observed that the job was started according to its schedule
>>>> and crawled all documents correctly (I've chosen to re-ingest all
>>>> documents before the run). However, after finishing the last document
>>>> (zero active documents) it was somehow aborted and restarted
>>>> immediately. Is this expected behavior?
>>>>
>>>> Best,
>>>> Florian
>>>>
>>>>> Hi Florian,
>>>>>
>>>>> Based on this schedule, your crawls will be able to start whenever
>>>>> the hour turns. So they can start every hour on the hour. If the
>>>>> last crawl crossed an hour boundary, the next crawl will start
>>>>> immediately, I believe.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Wed, Jan 15, 2014 at 1:04 PM, Florian Schmedding
>>>>> <[email protected]> wrote:
>>>>>> Hi Karl,
>>>>>>
>>>>>> these are the values:
>>>>>> Priority: 5
>>>>>> Start method: Start at beginning of schedule window
>>>>>> Schedule type: Scan every document once
>>>>>> Minimum recrawl interval: Not applicable
>>>>>> Expiration interval: Not applicable
>>>>>> Reseed interval: Not applicable
>>>>>> Scheduled time: Any day of week, every hour (12 am through 11 pm)
>>>>>> Maximum run time: No limit
>>>>>> Job invocation: Complete
>>>>>>
>>>>>> Maybe it is because I've changed the job from continuous crawling
>>>>>> to this schedule. I started it a few times manually, too.
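The schedule-window behavior Karl describes can be sketched roughly as follows. This is an illustrative Python sketch only, not ManifoldCF's actual scheduler code; the function name and the window representation are assumptions:

```python
from datetime import datetime

def should_restart(job_finished_at: datetime, window_end: datetime) -> bool:
    # With "Start within schedule window", the scheduler keeps relaunching
    # the job until the window closes; "Start at beginning of schedule
    # window" launches it only once per window.
    return job_finished_at < window_end

# A window that closes at 16:00: a job finishing at 15:41 is relaunched,
# while one finishing at 16:13 waits for the next window.
window_end = datetime(2014, 3, 2, 16, 0)
print(should_restart(datetime(2014, 3, 2, 15, 41), window_end))  # True
print(should_restart(datetime(2014, 3, 2, 16, 13), window_end))  # False
```

This matches the history above, where runs ending before the top of the hour (e.g. 15:41) are followed by another start inside the same window.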
>>>>>> I couldn't notice anything strange in the job setup or in the
>>>>>> respective entries in the database.
>>>>>>
>>>>>> Regards,
>>>>>> Florian
>>>>>>
>>>>>>> Hi Florian,
>>>>>>>
>>>>>>> I was unable to reproduce the behavior you described.
>>>>>>>
>>>>>>> Could you view your job, and post a screen shot of that page? I
>>>>>>> want to see what your schedule record(s) look like.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>> On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright <[email protected]>
>>>>>>> wrote:
>>>>>>>> Hi Florian,
>>>>>>>>
>>>>>>>> I've never noted this behavior before. I'll see if I can
>>>>>>>> reproduce it here.
>>>>>>>>
>>>>>>>> Karl
>>>>>>>>
>>>>>>>> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> Hi Karl,
>>>>>>>>>
>>>>>>>>> the scheduled job seems to work as expected. However, it runs
>>>>>>>>> two times: it starts at the beginning of the scheduled time,
>>>>>>>>> finishes, and immediately starts again. After finishing the
>>>>>>>>> second run it waits for the next scheduled time. Why does it
>>>>>>>>> run two times? The start method is "Start at beginning of
>>>>>>>>> schedule window".
>>>>>>>>>
>>>>>>>>> Yes, you're right about the checking guarantee. Currently, our
>>>>>>>>> interval is long enough for a complete crawler run.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Florian
>>>>>>>>>
>>>>>>>>>> Hi Florian,
>>>>>>>>>>
>>>>>>>>>> It is impossible to *guarantee* that a document will be
>>>>>>>>>> checked, because if load on the crawler is high enough, it
>>>>>>>>>> will fall behind.
But >> I >> >> >> will >> >> >> >>> look >> >> >> >>> > into adding the feature you request. >> >> >> >>> > >> >> >> >>> > Karl >> >> >> >>> > >> >> >> >>> > >> >> >> >>> > On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding < >> >> >> >>> > [email protected]> wrote: >> >> >> >>> > >> >> >> >>> >> Hi Karl, >> >> >> >>> >> >> >> >> >>> >> yes, in our case it is necessary to make sure that new >> >> documents >> >> >> are >> >> >> >>> >> discovered and indexed within a certain interval. I have >> >> created >> >> >> a >> >> >> >>> >> feature >> >> >> >>> >> request on that. In the meantime we will try to use a >> >> scheduled >> >> >> job >> >> >> >>> >> instead. >> >> >> >>> >> >> >> >> >>> >> Thanks for your help, >> >> >> >>> >> Florian >> >> >> >>> >> >> >> >> >>> >> >> >> >> >>> >> > Hi Florian, >> >> >> >>> >> > >> >> >> >>> >> > What you are seeing is "dynamic crawling" behavior. The >> >> time >> >> >> >>> between >> >> >> >>> >> > refetches of a document is based on the history of fetches >> >> of >> >> >> that >> >> >> >>> >> > document. The recrawl interval is the initial time between >> >> >> >>> document >> >> >> >>> >> > fetches, but if a document does not change, the interval >> for >> >> >> the >> >> >> >>> >> document >> >> >> >>> >> > increases according to a formula. >> >> >> >>> >> > >> >> >> >>> >> > I would need to look at the code to be able to give you the >> >> >> >>> precise >> >> >> >>> >> > formula, but if you need a limit on the amount of time >> >> between >> >> >> >>> >> document >> >> >> >>> >> > fetch attempts, I suggest you create a ticket and I will >> >> look >> >> >> into >> >> >> >>> >> adding >> >> >> >>> >> > that as a feature. 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Karl
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding
>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> the parameters reseed interval and recrawl interval of a
>>>>>>>>>>>>> continuous crawling job are not quite clear to me. The
>>>>>>>>>>>>> documentation says that the reseed interval is the time
>>>>>>>>>>>>> after which the seeds are checked again, and the recrawl
>>>>>>>>>>>>> interval is the time after which a document is checked
>>>>>>>>>>>>> for changes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, we observed that the recrawl interval for a
>>>>>>>>>>>>> document increases after each check. On the other hand,
>>>>>>>>>>>>> the reseed interval seems to be set up correctly in the
>>>>>>>>>>>>> database metadata about the seed documents. Yet the web
>>>>>>>>>>>>> server does not receive requests each time the interval
>>>>>>>>>>>>> elapses but only after several intervals have elapsed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> We are using a web connector. The web server does not tell
>>>>>>>>>>>>> the client to cache the documents. Any help would be
>>>>>>>>>>>>> appreciated.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Florian
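Florian's observation that the web server sees requests "only after several intervals have elapsed" is what any increasing-interval scheme produces. A minimal simulation, assuming a simple interval-doubling backoff (an assumption for illustration; ManifoldCF's real formula may differ):

```python
def fetch_times(recrawl_interval_min, changes, start=0):
    # Simulate refetch times under an assumed interval-doubling backoff:
    # each unchanged fetch doubles the wait, a changed fetch resets it.
    t, interval, times = start, recrawl_interval_min, []
    for changed in changes:
        t += interval
        times.append(t)
        interval = recrawl_interval_min if changed else interval * 2
    return times

# A document that never changes, with a 60-minute recrawl interval, is
# fetched at minutes 60, 180, 420 -- gaps of one, two, then four intervals.
print(fetch_times(60, [False, False, False]))  # [60, 180, 420]
```

So even with the recrawl interval set to 60 minutes, the gap between actual fetches of a stable document quickly spans several nominal intervals, matching what was observed.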
