RE: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Markus Jelsma
No, just run it continuously, always! By default everything is refetched (if 
possible) every 30 days. Just read the descriptions of the adaptive schedule 
and its Javadoc. It is simple to use, but its outcome is sometimes hard to 
predict, because you never know what will change, or when.

You will be fine with defaults if you have a small site. Just set the interval 
to a few days, or more if your site is slightly larger.
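
For reference, here is a minimal Python sketch of the update rule as the AdaptiveFetchSchedule Javadoc quoted in this thread describes it. It is not Nutch's actual code: the names AdaptiveParams and next_schedule are made up for illustration, the 0.3 sync-delta rate is an assumed value, and applying the sync-delta step after either adjustment is only one reading of the description.

```python
from dataclasses import dataclass

DAY = 24 * 60 * 60  # seconds per day


@dataclass(frozen=True)
class AdaptiveParams:
    """Hypothetical container for the knobs named in the Javadoc."""
    inc_factor: float = 0.2            # INC_FACTOR default from the Javadoc
    dec_factor: float = 0.2            # DEC_FACTOR default from the Javadoc
    min_interval: float = 60.0         # MIN_INTERVAL: 1 minute
    max_interval: float = 365.0 * DAY  # MAX_INTERVAL: 365 days
    sync_delta: bool = False           # SYNC_DELTA switch
    sync_delta_rate: float = 0.3       # assumed; the thread gives no default


DEFAULTS = AdaptiveParams()


def next_schedule(fetch_time, modified_time, interval, changed, params=DEFAULTS):
    """Return (next_fetch_time, new_interval); all values are in seconds."""
    if changed:
        interval *= 1.0 - params.dec_factor   # page changed: look sooner
    else:
        interval *= 1.0 + params.inc_factor   # page unchanged: look later

    shift = 0.0
    if params.sync_delta:
        delta = fetch_time - modified_time
        if delta > 0:
            # Cap the interval at the age of the last change and pull the
            # next fetch earlier by a fraction of that age.
            interval = min(interval, delta)
            shift = delta * params.sync_delta_rate

    # Keep the interval within [MIN_INTERVAL, MAX_INTERVAL].
    interval = max(params.min_interval, min(interval, params.max_interval))
    return fetch_time + interval - shift, interval


if __name__ == "__main__":
    # Follow one page through a few fetches, starting from a 3-day interval.
    interval = 3.0 * DAY
    for changed in (False, False, True, False):
        _, interval = next_schedule(0.0, 0.0, interval, changed)
        print("changed" if changed else "unchanged", round(interval / DAY, 2), "days")
```

With factors of 0.2 the interval moves slowly in both directions, which is why a page that suddenly starts changing takes several fetches to get a short interval again.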

M.

 
 

RE: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)

2016-08-03 Thread Musshorn, Kris T CTR USARMY RDECOM ARL (US)
CLASSIFICATION: UNCLASSIFIED

Shall I assume that, even though Nutch has an adaptive capability, I would 
still have to figure out how to trigger it to look for content that needs 
updating?

Thanks,
Kris

~~
Kris T. Musshorn
FileMaker Developer - Contractor – Catapult Technology Inc.  
US Army Research Lab 
Aberdeen Proving Ground 
Application Management & Development Branch 
410-278-7251
kris.t.musshorn@mail.mil
~~


-Original Message-
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Wednesday, August 03, 2016 2:03 PM
To: solr-user@lucene.apache.org
Subject: [Non-DoD Source] Re: SOLR + Nutch set up (UNCLASSIFIED)

That’s good news.

It should reset the interval estimate on page change instead of slowly 
shortening it.

I’m pretty sure that Ultraseek used a bounded exponential backoff when the page 
had not changed.
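
A minimal Python sketch of the policy described above (reset on change, bounded exponential backoff while the page stays unchanged). This is not Ultraseek's implementation, which is not shown anywhere in this thread; the function name backoff_interval, the one-day base, the doubling factor and the 32-day cap are all hypothetical.

```python
DAY = 24 * 60 * 60  # seconds per day


def backoff_interval(interval, changed, base=1 * DAY, factor=2.0, cap=32 * DAY):
    """Return the next revisit interval in seconds.

    base, factor and cap are illustrative values, not taken from any product.
    """
    if changed:
        # Page changed: drop straight back to the base interval
        # instead of shortening the old estimate step by step.
        return float(base)
    # Page unchanged: grow the interval exponentially, but never past the cap.
    return min(interval * factor, float(cap))


if __name__ == "__main__":
    interval = float(DAY)
    for changed in (False, False, False, True, False):
        interval = backoff_interval(interval, changed)
        print("changed" if changed else "unchanged", interval / DAY, "days")
```

The difference from the 0.2-factor scheme is in the changed branch: one detected change is enough to bring the page back to frequent fetching.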

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)


> On Aug 3, 2016, at 10:51 AM, Marco Scalone wrote:
> 
> Nutch also has an adaptive strategy:
> 
>> This class implements an adaptive re-fetch algorithm. This works as
>> follows:
>> 
>>   - for pages that have changed since the last fetchTime, decrease their
>>     fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
>>   - for pages that haven't changed since the last fetchTime, increase
>>     their fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
>>     If SYNC_DELTA property is true, then:
>>       - calculate a delta = fetchTime - modifiedTime
>>       - try to synchronize with the time of change, by shifting the next
>>         fetchTime by a fraction of the difference between the last
>>         modification time and the last fetch time. I.e. the next fetch
>>         time will be set to
>>         fetchTime + fetchInterval - delta * SYNC_DELTA_RATE
>>       - if the adjusted fetch interval is bigger than the delta, then
>>         fetchInterval = delta.
>>   - the minimum value of fetchInterval may not be smaller than
>>     MIN_INTERVAL (default is 1 minute).
>>   - the maximum value of fetchInterval may not be bigger than
>>     MAX_INTERVAL (default is 365 days).
>> 
>> NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may
>> destabilize the algorithm, so that the fetch interval either
>> increases or decreases infinitely, with little relevance to the page
>> changes. Please use the main(String[]) method
>> <https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html#main%28java.lang.String[]%29>
>> to test the values before applying them in a production system.
>> 
> 
> From:
> https://nutch.apache.org/apidocs/apidocs-1.2/org/apache/nutch/crawl/AdaptiveFetchSchedule.html
> 
> 
> 2016-08-03 14:45 GMT-03:00 Walter Underwood:
> 
>> I’m pretty sure Nutch uses a batch crawler instead of the adaptive 
>> crawler in Ultraseek.
>> 
>> I think we were the only people who built an adaptive crawler for 
>> enterprise use. I tried to get Ultraseek open-sourced. I made the 
>> argument to Mike Lynch. He looked at me like I had three heads and 
>> didn’t even answer me.
>> 
>> Ultraseek also has great support for sites that need login. If you 
>> use that, you’ll need to find a way to do that with another crawler.
>> 
>> wunder
>> Walter Underwood
>> Former Ultraseek Principal Engineer
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>> 
>> 
>>> On Aug 3, 2016, at 10:12 AM, Musshorn, Kris T CTR USARMY RDECOM ARL
>>> (US) wrote:
>>> 
>>> CLASSIFICATION: UNCLASSIFIED
>>> 
>>> We are currently using Ultraseek and are looking to deprecate it in
>>> favor of Solr/Nutch.
>>> Ultraseek runs all the time, automatically detects when pages have
>>> changed, and reindexes them.
>>> Is this possible with Solr/Nutch?
>>> 
>>> Thanks,
>>> Kris
>>> 
>>> ~~
>>> Kris T. Musshorn
>>> FileMaker Developer - Contractor - Catapult Technology Inc.
>>> US Army Research Lab
>>> Aberdeen Proving Ground
>>> Application Management & Development Branch
>>> 410-278-7251
>>> kris.t.musshorn@mail.mil
>>> ~~
>>> 