How to investigate recrawl issue

Matteo Diarena Wed, 29 Apr 2015 13:46:49 -0700

Dear all,

I'm completely new to Apache Nutch. I started using it only a few days ago
and I'm impressed by its capabilities.


I'm experiencing a small issue that I hope someone can help me fix.

I configured a test instance of Apache Nutch (1.9) to crawl a news website
using the following parameters:

 

<configuration>
<property>
  <name>http.agent.name</name>
  <value>NewsWatcher Agent</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>50</value>
  <description></description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|scoring-depth</value>
  <description></description>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <value>300</value>
  <description></description>
</property>
</configuration>

 

I run the ./bin/crawl command from cron every five minutes with
_maxdepth_=2, because I want to update my index frequently with only the new
articles published on the homepage, without crawling the whole site.

 

On the first run everything is fine, but on later runs the homepage no
longer seems to be updated.

According to the log file the whole process completes without problems, but
the new articles published on the homepage do not appear in my index.

 

When I inspect the crawldb with the readdb command, I always get the same
signature for the homepage, even when the page has changed.
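
A check of this kind can be sketched as below. The crawldb path
(crawl/crawldb), the URL and the helper name show_signature are placeholders
chosen for illustration, not the values of the real setup. The sketch assumes
it is run from the Nutch runtime directory and that the record printed by
"readdb -url" contains a line labelled Signature.

```shell
#!/bin/sh
# Placeholders (assumptions): override via environment for a real setup.
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
PAGE_URL="${PAGE_URL:-http://www.example.com/}"

# Hypothetical helper: print only the signature line of the CrawlDb
# record stored for one URL. "readdb <crawldb> -url <url>" prints the
# record for that URL; grep keeps the line that mentions the signature.
show_signature() {
  ./bin/nutch readdb "$CRAWLDB" -url "$PAGE_URL" | grep -i 'signature'
}
```

Calling show_signature before and after a crawl round, and comparing the two
lines, shows whether the stored signature for the homepage changed.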

 

Can anyone help me to understand how to investigate this issue? 

Is there something else I can check after the log file?

Is there any debug option I can enable?

 

Thanks a lot in advance, everybody,

Matteo
