That’s good news. It should reset the interval estimate on page change instead of slowly shortening it.
I’m pretty sure that Ultraseek used a bounded exponential backoff when the page had not changed.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Aug 3, 2016, at 10:51 AM, Marco Scalone <marcoscal...@gmail.com> wrote:
>
> Nutch also has adaptive strategy:
>
>> This class implements an adaptive re-fetch algorithm. This works as
>> follows:
>>
>> - for pages that has changed since the last fetchTime, decrease their
>>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
>> - for pages that haven't changed since the last fetchTime, increase
>>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>>   If SYNC_DELTA property is true, then:
>>     - calculate a delta = fetchTime - modifiedTime
>>     - try to synchronize with the time of change, by shifting the next
>>       fetchTime by a fraction of the difference between the last
>>       modification time and the last fetch time. I.e. the next fetch
>>       time will be set to
>>       fetchTime + fetchInterval - delta * SYNC_DELTA_RATE
>>     - if the adjusted fetch interval is bigger than the delta, then
>>       fetchInterval = delta.
>> - the minimum value of fetchInterval may not be smaller than
>>   MIN_INTERVAL (default is 1 minute).
>> - the maximum value of fetchInterval may not be bigger than
>>   MAX_INTERVAL (default is 365 days).
>>
>> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
>> the algorithm, so that the fetch interval either increases or decreases
>> infinitely, with little relevance to the page changes. Please use
>> main(String[])
>> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
>> method to test the values before applying them in a production system.
>>
>
> From:
> https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
>
>
> 2016-08-03 14:45 GMT-03:00 Walter Underwood <wun...@wunderwood.org>:
>
>> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
>> in Ultraseek.
>>
>> I think we were the only people who built an adaptive crawler for
>> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
>> to Mike Lynch. He looked at me like I had three heads and didn’t even
>> answer me.
>>
>> Ultraseek also has great support for sites that need login. If you use
>> that, you’ll need to find a way to do that with another crawler.
>>
>> wunder
>> Walter Underwood
>> Former Ultraseek Principal Engineer
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>>
>>> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US)
>>> <kris.t.musshorn....@mail.mil> wrote:
>>>
>>> CLASSIFICATION: UNCLASSIFIED
>>>
>>> We are currently using ultraseek and looking to deprecate it in favor of
>>> solr/nutch.
>>> Ultraseek runs all the time and auto detects when pages have changed and
>>> automatically reindexes them.
>>> Is this possible with SOLR/nutch?
>>>
>>> Thanks,
>>> Kris
>>>
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> Kris T. Musshorn
>>> FileMaker Developer - Contractor - Catapult Technology Inc.
>>> US Army Research Lab
>>> Aberdeen Proving Ground
>>> Application Management & Development Branch
>>> 410-278-7251
>>> kris.t.musshorn....@mail.mil
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>>
>>>
>>> CLASSIFICATION: UNCLASSIFIED
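
For anyone comparing the two approaches discussed in this thread, here is a rough sketch of the difference. It is not Nutch's code and not Ultraseek's. The first function is one reading of the quoted Javadoc (multiply the interval by 1 - DEC_FACTOR on change, by 1 + INC_FACTOR otherwise, then clamp) and leaves out the SYNC_DELTA step. The second is a guess at "reset on change, bounded exponential backoff otherwise". The function names, the one-hour base interval, and the doubling factor are all assumptions made up for illustration.

```python
# Defaults taken from the quoted Nutch documentation.
MIN_INTERVAL = 60.0                # seconds (1 minute)
MAX_INTERVAL = 365 * 24 * 3600.0   # seconds (365 days)
INC_FACTOR = 0.2
DEC_FACTOR = 0.2


def clamp(interval):
    """Keep the interval inside [MIN_INTERVAL, MAX_INTERVAL]."""
    return max(MIN_INTERVAL, min(MAX_INTERVAL, interval))


def nutch_style_interval(interval, changed):
    """One reading of the quoted rules: shrink the interval by DEC_FACTOR
    when the page changed, grow it by INC_FACTOR when it did not.
    The SYNC_DELTA adjustment is deliberately omitted."""
    if changed:
        interval *= 1.0 - DEC_FACTOR
    else:
        interval *= 1.0 + INC_FACTOR
    return clamp(interval)


def reset_backoff_interval(interval, changed, base=3600.0, growth=2.0):
    """Hypothetical alternative: jump straight back to a short base
    interval when the page changed, otherwise back off exponentially
    up to MAX_INTERVAL. base and growth are assumed values."""
    if changed:
        return clamp(base)
    return clamp(interval * growth)


if __name__ == "__main__":
    # A page fetched once a day that has just changed.
    day = 24 * 3600.0
    print(nutch_style_interval(day, changed=True))    # shortened by 20%
    print(reset_backoff_interval(day, changed=True))  # reset to the base interval
```

With the multiplicative rule, a page that starts changing after a long quiet period takes many fetches to get back to a short interval. The reset variant gets there in one step, at the cost of some extra fetches for pages that change only once.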