Nutch also has an adaptive strategy:

> This class implements an adaptive re-fetch algorithm. This works as
> follows:
>
> - for pages that have changed since the last fetchTime, decrease their
>   fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
> - for pages that haven't changed since the last fetchTime, increase
>   their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
> - If SYNC_DELTA property is true, then:
>   - calculate a delta = fetchTime - modifiedTime
>   - try to synchronize with the time of change, by shifting the next
>     fetchTime by a fraction of the difference between the last modification
>     time and the last fetch time. I.e. the next fetch time will be set to
>     fetchTime + fetchInterval - delta * SYNC_DELTA_RATE
>   - if the adjusted fetch interval is bigger than the delta, then
>     fetchInterval = delta.
> - the minimum value of fetchInterval may not be smaller than
>   MIN_INTERVAL (default is 1 minute).
> - the maximum value of fetchInterval may not be bigger than
>   MAX_INTERVAL (default is 365 days).
>
> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
> the algorithm, so that the fetch interval either increases or decreases
> infinitely, with little relevance to the page changes. Please use
> main(String[])
> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
> method to test the values before applying them in a production system.
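To make the quoted rules concrete, here is a minimal Python sketch of them. It is not the Nutch source; it follows the Javadoc text literally. The class and method names are hypothetical, times are assumed to be in seconds, "by a factor of F" is read as multiplying by (1 - F) or (1 + F), and the 0.5 value for sync_delta_rate is only an illustrative placeholder because the quoted text gives no default.

```python
from dataclasses import dataclass

MINUTE = 60.0
DAY = 24 * 60 * MINUTE


@dataclass
class AdaptiveSchedule:
    """Hypothetical helper mirroring the quoted Javadoc, not Nutch's class."""

    # Defaults stated in the quoted Javadoc.
    dec_factor: float = 0.2
    inc_factor: float = 0.2
    min_interval: float = MINUTE      # 1 minute
    max_interval: float = 365 * DAY   # 365 days
    sync_delta: bool = False
    # Assumption: illustrative value only; the quoted text gives no default.
    sync_delta_rate: float = 0.5

    def next_fetch(self, fetch_time, modified_time, interval, changed):
        """Return (next_fetch_time, new_interval), all values in seconds."""
        # Assumption: "by a factor of F" means scaling by (1 - F) or (1 + F).
        if changed:
            interval *= 1.0 - self.dec_factor
        else:
            interval *= 1.0 + self.inc_factor

        ref_time = fetch_time
        if self.sync_delta:
            delta = fetch_time - modified_time
            # As the quoted text states: an interval bigger than delta
            # is set to delta.
            if interval > delta:
                interval = delta
            # Shift the reference point toward the time of change.
            ref_time = fetch_time - delta * self.sync_delta_rate

        # Keep the interval within [min_interval, max_interval].
        interval = max(self.min_interval, min(self.max_interval, interval))
        return ref_time + interval, interval
```

For example, AdaptiveSchedule().next_fetch(fetch_time, modified_time, interval, changed=True) shortens the interval for a page that changed, while changed=False lengthens it. Passing sync_delta=True to the constructor enables the shift toward the last modification time.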
The Javadoc quoted above is from:
https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html

2016-08-03 14:45 GMT-03:00 Walter Underwood <wun...@wunderwood.org>:

> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
> in Ultraseek.
>
> I think we were the only people who built an adaptive crawler for
> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
> to Mike Lynch. He looked at me like I had three heads and didn’t even
> answer me.
>
> Ultraseek also has great support for sites that need login. If you use
> that, you’ll need to find a way to do that with another crawler.
>
> wunder
> Walter Underwood
> Former Ultraseek Principal Engineer
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> > On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US)
> > <kris.t.musshorn....@mail.mil> wrote:
> >
> > CLASSIFICATION: UNCLASSIFIED
> >
> > We are currently using ultraseek and looking to deprecate it in favor of
> > solr/nutch.
> > Ultraseek runs all the time and auto detects when pages have changed and
> > automatically reindexes them.
> > Is this possible with SOLR/nutch?
> >
> > Thanks,
> > Kris
> >
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~
> > Kris T. Musshorn
> > FileMaker Developer - Contractor - Catapult Technology Inc.
> > US Army Research Lab
> > Aberdeen Proving Ground
> > Application Management & Development Branch
> > 410-278-7251
> > kris.t.musshorn....@mail.mil
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~