Depending on your settings, Nutch does this as well. You can even configure
different increment and decrement rates per MIME type.
The fetch schedule algorithms are pluggable, so you can override them at any
point of interest and customize them as far as you need.
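
As a rough illustration only (this is not Nutch's code, just a minimal sketch
of the interval update described in the AdaptiveFetchSchedule Javadoc quoted
further down): it assumes "by a factor of" means multiplying by
(1 - DEC_FACTOR) or (1 + INC_FACTOR), it leaves out the SYNC_DELTA shifting
entirely, and the names AdaptiveSchedule, PER_MIME and schedule_for as well as
the per-type values are hypothetical.

```python
from dataclasses import dataclass

MINUTE = 60.0
DAY = 24 * 60 * 60.0


@dataclass
class AdaptiveSchedule:
    """Interval update only; defaults taken from the quoted Javadoc."""
    inc_factor: float = 0.2          # INC_FACTOR default
    dec_factor: float = 0.2          # DEC_FACTOR default
    min_interval: float = MINUTE     # MIN_INTERVAL default (1 minute)
    max_interval: float = 365 * DAY  # MAX_INTERVAL default (365 days)

    def next_interval(self, interval: float, changed: bool) -> float:
        """Return the next fetch interval in seconds."""
        if changed:
            # Page changed: fetch more often (assumed multiplicative decrease).
            interval *= 1.0 - self.dec_factor
        else:
            # Page unchanged: fetch less often (assumed multiplicative increase).
            interval *= 1.0 + self.inc_factor
        # Clamp to [min_interval, max_interval].
        return max(self.min_interval, min(self.max_interval, interval))


# Hypothetical per-MIME-type overrides; the values are made up for illustration.
PER_MIME = {
    "text/html": AdaptiveSchedule(inc_factor=0.1, dec_factor=0.3),
}
DEFAULT = AdaptiveSchedule()


def schedule_for(mime_type: str) -> AdaptiveSchedule:
    """Pick the schedule for a MIME type, falling back to the defaults."""
    return PER_MIME.get(mime_type, DEFAULT)
```

With the default factors, a 100-minute interval becomes 80 minutes after a
change and 120 minutes after an unchanged fetch.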
-----Original message-----
> From:Walter Underwood <wun...@wunderwood.org>
> Sent: Wednesday 3rd August 2016 20:03
> To: solr-user@lucene.apache.org
> Subject: Re: SOLR + Nutch set up (UNCLASSIFIED)
>
> That’s good news.
>
> It should reset the interval estimate on page change instead of slowly
> shortening it.
>
> I’m pretty sure that Ultraseek used a bounded exponential backoff when the
> page had not changed.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>
> > On Aug 3, 2016, at 10:51 AM, Marco Scalone <marcoscal...@gmail.com> wrote:
> >
> > Nutch also has an adaptive strategy:
> >
> >> This class implements an adaptive re-fetch algorithm. This works as
> >> follows:
> >>
> >> - for pages that have changed since the last fetchTime, decrease their
> >>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
> >> - for pages that haven't changed since the last fetchTime, increase
> >>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
> >>   If SYNC_DELTA property is true, then:
> >>   - calculate a delta = fetchTime - modifiedTime
> >>   - try to synchronize with the time of change, by shifting the next
> >>     fetchTime by a fraction of the difference between the last
> >>     modification time and the last fetch time. I.e. the next fetch time
> >>     will be set to fetchTime + fetchInterval - delta * SYNC_DELTA_RATE
> >>   - if the adjusted fetch interval is bigger than the delta, then
> >>     fetchInterval = delta.
> >> - the minimum value of fetchInterval may not be smaller than
> >>   MIN_INTERVAL (default is 1 minute).
> >> - the maximum value of fetchInterval may not be bigger than
> >>   MAX_INTERVAL (default is 365 days).
> >>
> >> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
> >> the algorithm, so that the fetch interval either increases or decreases
> >> infinitely, with little relevance to the page changes. Please use
> >> main(String[])
> >> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
> >> method to test the values before applying them in a production system.
> >>
> >
> > From:
> > https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
> >
> >
> > 2016-08-03 14:45 GMT-03:00 Walter Underwood <wun...@wunderwood.org>:
> >
> >> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
> >> in Ultraseek.
> >>
> >> I think we were the only people who built an adaptive crawler for
> >> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
> >> to Mike Lynch. He looked at me like I had three heads and didn’t even
> >> answer me.
> >>
> >> Ultraseek also has great support for sites that need login. If you use
> >> that, you’ll need to find a way to do that with another crawler.
> >>
> >> wunder
> >> Walter Underwood
> >> Former Ultraseek Principal Engineer
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/ (my blog)
> >>
> >>
> >>> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US)
> >>> <kris.t.musshorn....@mail.mil> wrote:
> >>>
> >>> CLASSIFICATION: UNCLASSIFIED
> >>>
> >>> We are currently using Ultraseek and looking to deprecate it in favor of
> >>> Solr/Nutch.
> >>> Ultraseek runs all the time, auto-detects when pages have changed, and
> >>> automatically reindexes them.
> >>> Is this possible with Solr/Nutch?
> >>>
> >>> Thanks,
> >>> Kris
> >>>
> >>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>> Kris T. Musshorn
> >>> FileMaker Developer - Contractor - Catapult Technology Inc.
> >>> US Army Research Lab
> >>> Aberdeen Proving Ground
> >>> Application Management & Development Branch
> >>> 410-278-7251
> >>> kris.t.musshorn....@mail.mil
> >>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>>
> >>>
> >>>
> >>> CLASSIFICATION: UNCLASSIFIED
> >>
> >>
>
>
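
For comparison, the alternative Walter describes in the quoted message (reset
the interval estimate when a page changes, and back off exponentially up to a
bound while it stays unchanged) could be sketched roughly like this. It is not
Ultraseek's or Nutch's implementation; the function name backoff_interval and
the base, factor and cap values are assumptions chosen for illustration.

```python
HOUR = 3600.0
DAY = 24 * HOUR


def backoff_interval(interval: float, changed: bool,
                     base: float = HOUR,       # assumed reset value
                     factor: float = 2.0,      # assumed backoff multiplier
                     cap: float = 30 * DAY) -> float:  # assumed upper bound
    """Return the next fetch interval in seconds."""
    if changed:
        # Page changed: reset the estimate instead of shortening it gradually.
        return base
    # Page unchanged: grow exponentially, but never beyond the cap.
    return min(interval * factor, cap)
```

The difference from the multiplicative decrease is that a single detected
change brings the page straight back to the base interval, however long it had
been stable before.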