> I want to re-crawl my sites every hour, so I wrote a script for this. I
> edited some properties in nutch-site.xml, but my re-crawler fetches URLs
> only three times and then stops fetching. This means my Nutch crawl is
> not updated after 3 hours. These are my changes in nutch-site.xml:
>
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>30</value>
>   <description>The default number of seconds between re-fetches of a page
> (30 days).</description> </property>
> <property>
>   <name>db.fetch.schedule.class</name>
>   <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
>   <description>The implementation of fetch schedule. DefaultFetchSchedule
> simply adds the original fetchInterval to the last fetch time, regardless
> of page changes.</description> </property>
>  <property>
>   <name>solr.commit.size</name>
>   <value>10</value>
>   <description>Defines the number of documents to send to Solr in a single
> update batch. Decrease when handling very large documents to prevent Nutch
> from running out of memory.</description> </property>
>  <property>
>   <name>db.fetch.interval.max</name>
>   <value>36000</value>
>   <description>The maximum number of seconds between re-fetches of a page
> (90 days). After this period every page in the db will be re-tried, no
> matter what is its status.</description> </property>
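Note that both fetch-interval properties are expressed in seconds, so a value of 30 means 30 seconds, not 30 days, and 36000 means 10 hours, not 90 days — the pasted descriptions are the stock texts for the defaults. A minimal sketch of values for an hourly re-fetch cycle, using the same property names as above (the exact numbers are assumptions to tune, not a definitive fix for the stalled re-crawl):

```xml
<!-- Sketch: hourly re-fetch cycle; all interval values are in seconds. -->
<property>
  <name>db.fetch.interval.default</name>
  <value>3600</value> <!-- consider a page due for re-fetch after 1 hour -->
</property>
<property>
  <name>db.fetch.interval.max</name>
  <value>86400</value> <!-- force a re-try after at most 1 day (assumed cap) -->
</property>
```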
