Re: SOLR + Nutch set up (UNCLASSIFIED)

Marco Scalone Wed, 03 Aug 2016 10:52:41 -0700

Nutch also has an adaptive re-fetch strategy:

> This class implements an adaptive re-fetch algorithm. This works as
> follows:
>
>    - for pages that have changed since the last fetchTime, decrease their
>    fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
>    - for pages that haven't changed since the last fetchTime, increase
>    their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>    If SYNC_DELTA property is true, then:
>       - calculate a delta = fetchTime - modifiedTime
>       - try to synchronize with the time of change, by shifting the next
>       fetchTime by a fraction of the difference between the last modification
>       time and the last fetch time. I.e. the next fetch time will be set to
>       fetchTime + fetchInterval - delta * SYNC_DELTA_RATE
>       - if the adjusted fetch interval is bigger than the delta, then
>       fetchInterval = delta.
>    - the minimum value of fetchInterval may not be smaller than
>    MIN_INTERVAL (default is 1 minute).
>    - the maximum value of fetchInterval may not be bigger than
>    MAX_INTERVAL (default is 365 days).
>
> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize
> the algorithm, so that the fetch interval either increases or decreases
> infinitely, with little relevance to the page changes. Please use
> main(String[])
> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
> method to test the values before applying them in a production system.
>


From:
https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
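
To make the quoted description concrete, here is a rough Python sketch of the rules as I read them. It is not Nutch's actual Java code. The names AdaptiveParams and next_fetch are made up for illustration, the 0.3 default for SYNC_DELTA_RATE is an assumption (the Javadoc excerpt gives no default), and I am assuming the SYNC_DELTA step applies after either adjustment. All times are in seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdaptiveParams:
    # Defaults follow the quoted Javadoc, except sync_delta_rate,
    # which is an assumed value for illustration only.
    inc_factor: float = 0.2
    dec_factor: float = 0.2
    min_interval: float = 60.0                # 1 minute
    max_interval: float = 365 * 24 * 3600.0   # 365 days
    sync_delta: bool = True
    sync_delta_rate: float = 0.3


def next_fetch(fetch_time, modified_time, interval, changed,
               p=AdaptiveParams()):
    """Return (next_fetch_time, new_interval) per the quoted rules."""
    # Changed pages get a shorter interval, unchanged pages a longer one.
    if changed:
        interval *= 1.0 - p.dec_factor
    else:
        interval *= 1.0 + p.inc_factor

    ref_time = fetch_time
    if p.sync_delta:
        delta = fetch_time - modified_time
        # Quoted rule: an adjusted interval bigger than delta becomes delta.
        if interval > delta:
            interval = delta
        # Shift the next fetch toward the time of the last change.
        ref_time = fetch_time - delta * p.sync_delta_rate

    # Keep the interval within [MIN_INTERVAL, MAX_INTERVAL].
    interval = max(p.min_interval, min(interval, p.max_interval))
    return ref_time + interval, interval
```

For example, calling next_fetch(10000.0, 9000.0, 1000.0, changed=False) first grows the interval to about 1200 s, cuts it back to delta (1000 s), and then schedules the next fetch 300 s earlier than fetchTime + fetchInterval would have.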


2016-08-03 14:45 GMT-03:00 Walter Underwood <wun...@wunderwood.org>:

> I’m pretty sure Nutch uses a batch crawler instead of the adaptive crawler
> in Ultraseek.
>
> I think we were the only people who built an adaptive crawler for
> enterprise use. I tried to get Ultraseek open-sourced. I made the argument
> to Mike Lynch. He looked at me like I had three heads and didn’t even
> answer me.
>
> Ultraseek also has great support for sites that need login. If you use
> that, you’ll need to find a way to do that with another crawler.
>
> wunder
> Walter Underwood
> Former Ultraseek Principal Engineer
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL (US) <kris.t.musshorn....@mail.mil> wrote:
> >
> > CLASSIFICATION: UNCLASSIFIED
> >
> > We are currently using ultraseek and looking to deprecate it in favor of solr/nutch.
> > Ultraseek runs all the time and auto detects when pages have changed and automatically reindexes them.
> > Is this possible with SOLR/nutch?
> >
> > Thanks,
> > Kris
> >
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~
> > Kris T. Musshorn
> > FileMaker Developer - Contractor - Catapult Technology Inc.
> > US Army Research Lab
> > Aberdeen Proving Ground
> > Application Management & Development Branch
> > 410-278-7251
> > kris.t.musshorn....@mail.mil
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >
> >
> > CLASSIFICATION: UNCLASSIFIED
>
>
