There must be some config variable that allows setting timeModified to the
current date on injection. You need to inject the home page URL on each run.
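
A minimal sketch of what that could look like from the cron job, assuming the
crawldb lives in crawl/crawldb and the seed list in urls/ (both paths, and the
two function names, are placeholders I made up; adjust them to your setup):

```shell
#!/bin/sh
# Assumed layout: override these variables to match your installation.
NUTCH_BIN="${NUTCH_BIN:-bin/nutch}"
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
SEED_DIR="${SEED_DIR:-urls}"

# Re-inject the seed URLs (the home page) into the crawldb.
# Meant to be called at the start of every cron run, before the crawl.
reinject_seeds() {
    "$NUTCH_BIN" inject "$CRAWLDB" "$SEED_DIR"
}

# Print the crawldb record for one URL (status, fetch time, signature),
# to check whether anything changed between two runs.
inspect_url() {
    "$NUTCH_BIN" readdb "$CRAWLDB" -url "$1"
}
```

Calling reinject_seeds and then inspect_url with the home page URL after each
run should show whether the fetch time and signature move. Whether injecting an
already known URL resets its record depends on the injector settings; I believe
db.injector.overwrite and db.injector.update in nutch-default.xml are the ones
to look at, but please verify that against your version.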


hth
Alex.



-----Original Message-----
From: Matteo Diarena <[email protected]>
To: user <[email protected]>
Sent: Wed, Apr 29, 2015 1:46 pm
Subject: How to investigate recrawl issue


Dear all,

I'm completely new to Apache Nutch. I started using it only a few days ago
and I am impressed by its capabilities.

I'm experiencing a small issue that I hope someone can help me fix.

I configured a test instance of Apache Nutch (1.9) to crawl a news website
using the following parameters:

 

<configuration>
<property>
  <name>http.agent.name</name>
  <value>NewsWatcher Agent</value>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>50</value>
  <description></description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|scoring-depth</value>
  <description></description>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <value>300</value>
  <description></description>
</property>
</configuration>

 

I then run a cron job that invokes the ./bin/crawl command every five minutes
with a _maxdepth_=2, because I want to update my index frequently with only
the new articles published on the homepage, without crawling the whole site.

 

On the first run everything is fine, but afterwards it seems the homepage is
no longer updated.

Looking at the log file, the whole process seems OK, but I cannot see the new
articles published on the homepage in my index.

 

Looking in the crawldb with the readdb command, I always get the same
signature even if the page has changed.

 

Can anyone help me understand how to investigate this issue?

Is there something else I can check besides the log file?

Is there any debug option I can enable?

 

Thanks a lot everybody in advance,

Matteo
