There must be some config variable that allows setting timeModified to the current date when a URL is injected. You need to inject the home page URL on each run.
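A rough sketch of what such a setting might look like in conf/nutch-site.xml. It assumes the db.injector.overwrite property listed in the stock nutch-default.xml is present in your version, so please check that file before relying on it:

```xml
<!-- Assumption: db.injector.overwrite exists in your Nutch 1.x release;
     verify the name and default in conf/nutch-default.xml. -->
<property>
  <name>db.injector.overwrite</name>
  <value>true</value>
  <description>Let injected seed URLs replace their existing CrawlDb
  records, so the home page is handled as a fresh entry on each
  run.</description>
</property>
```

If this works as I expect, re-injecting the home page URL before each crawl should reset its CrawlDb record so that it is picked up again by the generator. You can check the record with bin/nutch readdb <crawldb> -url <home page url> before and after a run and compare the fetch time and signature.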
hth
Alex.

-----Original Message-----
From: Matteo Diarena <[email protected]>
To: user <[email protected]>
Sent: Wed, Apr 29, 2015 1:46 pm
Subject: How to investigate recrawl issue

Dear all,

I'm completely new to Apache Nutch. I started using it only a few days ago and I was impressed by its capabilities.

I'm experiencing a little issue that I hope someone can help me fix. I configured a test instance of Apache Nutch (1.9) to crawl a news website using the following parameters:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>NewsWatcher Agent</value>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>50</value>
    <description></description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|scoring-depth</value>
    <description></description>
  </property>
  <property>
    <name>db.fetch.interval.default</name>
    <value>300</value>
    <description></description>
  </property>
</configuration>

I am running the ./bin/crawl command from cron every five minutes with _maxdepth_=2, because I want to frequently update my index with only the new articles published on the homepage, without crawling the whole site.

The first run is fine, but after that the homepage no longer seems to be updated. According to the log file the whole process completes without errors, but the new articles published on the homepage do not appear in my index. When I look in the crawldb with the readdb command, I always get the same signature, even if the page has changed.

Can anyone help me understand how to investigate this issue? Is there something else I can check besides the log file? Is there any debug option I can enable?

Thanks a lot to everybody in advance,
Matteo

