Dear Sir,
I am customizing Nutch 2.2 to crawl my seed list, which contains about 30 URLs.
I need to re-crawl these URLs every 24 minutes and fetch only newly added links.
I added the following configuration to the nutch-site.xml file and ran the
following command:

<property>
  <name>db.fetch.interval.default</name>
  <value>1800</value>
  <description>The default number of seconds between re-fetches of a page
  (here 1800 seconds, i.e. 30 minutes).
  </description>
</property>

<property>
  <name>db.update.purge.404</name>
  <value>true</value>
  <description>If true, updatedb will purge records with status DB_GONE
  from the CrawlDB.
  </description>
</property>


./crawl urls/ testdb http://localhost:8983/solr 2
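
For context, my understanding of the crawl script's arguments (a sketch based on the Nutch 2.x bin/crawl usage; the paths and crawl ID are my own):

```shell
# Nutch 2.x crawl script usage, as I understand it:
#   bin/crawl <seedDir> <crawlId> <solrUrl> <numberOfRounds>
# The final argument (2) is the number of generate/fetch/update rounds,
# so later rounds can follow outlinks discovered in earlier ones.
./crawl urls/ testdb http://localhost:8983/solr 2
```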


However, whenever I run the above command, Nutch crawls deeper and deeper
instead of just re-fetching the seed URLs.
Would you please tell me where the problem is?
Regards,
