Re: Whether Nutch AdaptiveFetchSchedule can do recrawling automatically?

Lewis John Mcgibbney Thu, 18 Apr 2013 16:55:26 -0700

Hi Raja,
FetchSchedule [0] defines the contract for implementations that
manipulate fetch times and re-fetch intervals. FetchScheduleFactory [1]
caches the instance in the ObjectCache.
Neither the interface nor the factory automates (or semi-automates) the
actual scheduling, i.e. they do not launch fetches themselves. Instead, the
parameters and behaviour defined by your FetchSchedule implementation are
consulted when a fetching job is executed.


You asked whether you can control this through scripts; the answer is yes.
I have continuous crawls running as nightly jobs, all of which are scripted
and managed via cron.

Simply put, if the page is ready to be crawled AND the job is executed,
then the page will be fetched within the next segment or batch.
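
To make the "ready AND executed" point concrete, here is a rough toy model
in Java. It is only a sketch: RecrawlSketch, Page, Schedule,
FixedIntervalSchedule and runJob are hypothetical names made up for this
illustration and are not Nutch's actual classes or method signatures. It
also uses a fixed interval rather than an adaptive one, to keep it short.
The schedule only answers "is this page due?"; nothing is fetched until
something external (cron, in my case) runs the job.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only: every name here is hypothetical, not Nutch's real API. */
final class RecrawlSketch {

    /** A page record holding the time at which it next becomes due. */
    static final class Page {
        final String url;
        long nextFetchTime;        // epoch millis
        final long intervalMillis; // re-fetch interval

        Page(String url, long nextFetchTime, long intervalMillis) {
            this.url = url;
            this.nextFetchTime = nextFetchTime;
            this.intervalMillis = intervalMillis;
        }
    }

    /** The schedule is passive: it is consulted, it never starts a fetch. */
    interface Schedule {
        boolean isDue(Page page, long now);
        void recordFetch(Page page, long fetchTime);
    }

    /** Simplest possible policy: re-fetch after a fixed interval. */
    static final class FixedIntervalSchedule implements Schedule {
        @Override
        public boolean isDue(Page page, long now) {
            return page.nextFetchTime <= now;
        }

        @Override
        public void recordFetch(Page page, long fetchTime) {
            page.nextFetchTime = fetchTime + page.intervalMillis;
        }
    }

    /** One externally triggered job run; returns the URLs it "fetched". */
    static List<String> runJob(List<Page> pages, Schedule schedule, long now) {
        List<String> fetched = new ArrayList<>();
        for (Page page : pages) {
            if (schedule.isDue(page, now)) {   // schedule consulted here only
                fetched.add(page.url);
                schedule.recordFetch(page, now);
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        List<Page> pages = new ArrayList<>();
        pages.add(new Page("http://example.com/a", 0L, day));
        pages.add(new Page("http://example.com/b", 2 * day, day));
        Schedule schedule = new FixedIntervalSchedule();

        // Each call stands in for one cron-triggered run of the crawl job.
        System.out.println("run at day 1: " + runJob(pages, schedule, day));
        System.out.println("run at day 2: " + runJob(pages, schedule, 2 * day));
    }
}
```

If runJob is never called, no page is fetched, however overdue it is. That
is the role cron plays in my setup.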

hth
Lewis

[0]
http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/crawl/FetchSchedule.java
[1]
http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/crawl/FetchScheduleFactory.java


On Thu, Apr 18, 2013 at 5:53 AM, vivekvl <[email protected]> wrote:

> Curious to know whether Nutch AdaptiveFetchSchedule can do recrawling
> automatically?
>
> I observed Hadoop automatically reinitiates the interrupted Jobs. Otherwise
> Hadoop is always up and running with Nutch jobs configured to it. In this
> scenario if a page is ready to be crawled based on adaptive schedule,
> whether Nutch will recrawl the page?
>
> Also I like to know the best approach for continuous crawling for live
> environment.
>
> Thanks,
> Raja



-- 
*Lewis*
