Hi Florian,

Jobs don't just abort randomly. Are you sure that the job aborted? Or did it just restart?
As for "is this normal", it depends on how you have created your job. If you selected the "Start within schedule window" selection, MCF will restart the job whenever it finishes and run it until the end of the scheduling window.

Karl

On Mon, Feb 3, 2014 at 12:24 PM, Florian Schmedding <[email protected]> wrote:

> Hi Karl,
>
> I've just observed that the job was started according to its schedule and
> crawled all documents correctly (I've chosen to re-ingest all documents
> before the run). However, after finishing the last document (zero active
> documents) it was somehow aborted and restarted immediately. Is this an
> expected behavior?
>
> Best,
> Florian
>
> > Hi Florian,
> >
> > Based on this schedule, your crawls will be able to start whenever the
> > hour turns. So they can start every hour on the hour. If the last crawl
> > crossed an hour boundary, the next crawl will start immediately, I
> > believe.
> >
> > Karl
> >
> > On Wed, Jan 15, 2014 at 1:04 PM, Florian Schmedding <
> > [email protected]> wrote:
> >
> >> Hi Karl,
> >>
> >> these are the values:
> >>
> >> Priority: 5
> >> Start method: Start at beginning of schedule window
> >> Schedule type: Scan every document once
> >> Minimum recrawl interval: Not applicable
> >> Expiration interval: Not applicable
> >> Reseed interval: Not applicable
> >> Scheduled time: Any day of week at 12 am, 1 am, 2 am, 3 am, 4 am,
> >> 5 am, 6 am, 7 am, 8 am, 9 am, 10 am, 11 am, 12 pm, 1 pm, 2 pm, 3 pm,
> >> 4 pm, 5 pm, 6 pm, 7 pm, 8 pm, 9 pm, 10 pm, 11 pm
> >> Maximum run time: No limit
> >> Job invocation: Complete
> >>
> >> Maybe it is because I've changed the job from continuous crawling to
> >> this schedule. I started it a few times manually, too. I couldn't
> >> notice anything strange in the job setup or in the respective entries
> >> in the database.
> >>
> >> Regards,
> >> Florian
> >>
> >> > Hi Florian,
> >> >
> >> > I was unable to reproduce the behavior you described.
> >> >
> >> > Could you view your job, and post a screen shot of that page? I want
> >> > to see what your schedule record(s) look like.
> >> >
> >> > Thanks,
> >> > Karl
> >> >
> >> > On Tue, Jan 14, 2014 at 6:09 AM, Karl Wright <[email protected]> wrote:
> >> >
> >> >> Hi Florian,
> >> >>
> >> >> I've never noted this behavior before. I'll see if I can reproduce
> >> >> it here.
> >> >>
> >> >> Karl
> >> >>
> >> >> On Tue, Jan 14, 2014 at 5:36 AM, Florian Schmedding <
> >> >> [email protected]> wrote:
> >> >>
> >> >>> Hi Karl,
> >> >>>
> >> >>> the scheduled job seems to work as expected. However, it runs two
> >> >>> times: It starts at the beginning of the scheduled time, finishes,
> >> >>> and immediately starts again. After finishing the second run it
> >> >>> waits for the next scheduled time. Why does it run two times? The
> >> >>> start method is "Start at beginning of schedule window".
> >> >>>
> >> >>> Yes, you're right about the checking guarantee. Currently, our
> >> >>> interval is long enough for a complete crawler run.
> >> >>>
> >> >>> Best,
> >> >>> Florian
> >> >>>
> >> >>> > Hi Florian,
> >> >>> >
> >> >>> > It is impossible to *guarantee* that a document will be checked,
> >> >>> > because if load on the crawler is high enough, it will fall
> >> >>> > behind. But I will look into adding the feature you request.
> >> >>> >
> >> >>> > Karl
> >> >>> >
> >> >>> > On Sun, Jan 5, 2014 at 9:08 AM, Florian Schmedding <
> >> >>> > [email protected]> wrote:
> >> >>> >
> >> >>> >> Hi Karl,
> >> >>> >>
> >> >>> >> yes, in our case it is necessary to make sure that new documents
> >> >>> >> are discovered and indexed within a certain interval. I have
> >> >>> >> created a feature request on that. In the meantime we will try
> >> >>> >> to use a scheduled job instead.
> >> >>> >>
> >> >>> >> Thanks for your help,
> >> >>> >> Florian
> >> >>> >>
> >> >>> >> > Hi Florian,
> >> >>> >> >
> >> >>> >> > What you are seeing is "dynamic crawling" behavior. The time
> >> >>> >> > between refetches of a document is based on the history of
> >> >>> >> > fetches of that document. The recrawl interval is the initial
> >> >>> >> > time between document fetches, but if a document does not
> >> >>> >> > change, the interval for the document increases according to
> >> >>> >> > a formula.
> >> >>> >> >
> >> >>> >> > I would need to look at the code to be able to give you the
> >> >>> >> > precise formula, but if you need a limit on the amount of time
> >> >>> >> > between document fetch attempts, I suggest you create a ticket
> >> >>> >> > and I will look into adding that as a feature.
> >> >>> >> >
> >> >>> >> > Thanks,
> >> >>> >> > Karl
> >> >>> >> >
> >> >>> >> > On Sat, Jan 4, 2014 at 7:56 AM, Florian Schmedding <
> >> >>> >> > [email protected]> wrote:
> >> >>> >> >
> >> >>> >> >> Hello,
> >> >>> >> >>
> >> >>> >> >> the parameters reseed interval and recrawl interval of a
> >> >>> >> >> continuous crawling job are not quite clear to me. The
> >> >>> >> >> documentation tells that the reseed interval is the time
> >> >>> >> >> after which the seeds are checked again, and the recrawl
> >> >>> >> >> interval is the time after which a document is checked for
> >> >>> >> >> changes.
> >> >>> >> >>
> >> >>> >> >> However, we observed that the recrawl interval for a document
> >> >>> >> >> increases after each check. On the other hand, the reseed
> >> >>> >> >> interval seems to be set up correctly in the database
> >> >>> >> >> metadata about the seed documents. Yet the web server does
> >> >>> >> >> not receive requests at each time the interval elapses but
> >> >>> >> >> only after several intervals have elapsed.
> >> >>> >> >>
> >> >>> >> >> We are using a web connector. The web server does not tell
> >> >>> >> >> the client to cache the documents. Any help would be
> >> >>> >> >> appreciated.
> >> >>> >> >>
> >> >>> >> >> Best regards,
> >> >>> >> >> Florian
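[Editor's note] The "dynamic crawling" behavior Karl describes earlier in the thread (the refetch interval for an unchanged document grows over time) can be illustrated with a small sketch. The growth factor, cap, and reset rule below are assumptions for illustration only; as Karl notes, the precise formula lives in the ManifoldCF source, and this is not it.

```python
def next_recrawl_interval(current_interval_ms, document_changed,
                          base_interval_ms=60_000,
                          growth_factor=2.0,
                          max_interval_ms=86_400_000):
    """Return the time to wait before the next fetch of a document.

    Hypothetical sketch of dynamic crawling: the configured recrawl
    interval is only the *initial* spacing; each fetch that finds the
    document unchanged stretches the spacing further.
    """
    if document_changed:
        # A detected change resets the schedule to the base interval.
        return base_interval_ms
    # Unchanged: back off, but never beyond the cap. A "maximum recrawl
    # interval" feature (the ticket Florian filed) would supply this cap.
    return min(current_interval_ms * growth_factor, max_interval_ms)
```

With these illustrative defaults, an unchanged document would be refetched after 1, 2, 4, 8, ... minutes, which matches Florian's observation that the web server only sees a request after several nominal intervals have elapsed.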

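[Editor's note] The start-method distinction behind the "runs twice" question can be sketched as follows. The function name and the simplified scheduler model are hypothetical, not MCF's actual code; times may be any comparable values (datetimes in practice).

```python
def should_start(start_method, now, window_start, window_end,
                 runs_this_window):
    """Decide whether an idle job should (re)start in its schedule window.

    Illustrative sketch only: MCF's real scheduler is more involved.
    """
    if not (window_start <= now < window_end):
        # Outside the schedule window: never start.
        return False
    if start_method == "Start within schedule window":
        # Restart the job whenever it finishes, until the window closes.
        return True
    if start_method == "Start at beginning of schedule window":
        # Run once per window, at the window's opening.
        return runs_this_window == 0
    return False
```

Under this model, a job with hourly windows that finishes a crawl which crossed an hour boundary finds itself inside a fresh window and starts again immediately, consistent with Karl's explanation of the double run.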