Nutch assigns every page a default re-fetch interval, which is 30 days out of the box. There are also adaptive parameters that will increase or decrease this interval. If you want a page re-indexed as often as you describe, you need to either re-inject the page with the injector's overwrite parameter set to true, and/or use a plugin like urlmeta to force a re-index interval value.
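As a rough sketch of the settings mentioned above, a nutch-site.xml override might look something like this. The property names are as I remember them from nutch-default.xml in the 1.x line, so treat them as assumptions and check them against your own copy. The values are only illustrative (300 is taken from your config).

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Default re-fetch interval, in seconds (assumed name; the stock
       value corresponds to 30 days). 300 = five minutes. -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>300</value>
  </property>

  <!-- Assumed name: switches to the adaptive schedule, which lengthens
       or shortens the interval depending on whether the page changed. -->
  <property>
    <name>db.fetch.schedule.class</name>
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>

  <!-- Assumed name: lets a re-injected URL overwrite the record that
       already exists in the crawldb. -->
  <property>
    <name>db.injector.overwrite</name>
    <value>true</value>
  </property>
</configuration>
```

With something like this in place, re-injecting the homepage URL before each crawl run should reset its crawldb entry, though whether that is enough on its own depends on your Nutch version.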
Spend some time in the nutch-default.xml file. It has all the levers that can be adjusted for Nutch.

jeff

On Wed, Apr 29, 2015 at 4:19 PM, <[email protected]> wrote:

> There must be some config variable that lets you set timeModified to the
> current date when a URL is injected. You need to inject the homepage URL
> on each run.
>
> hth
> Alex.
>
> -----Original Message-----
> From: Matteo Diarena <[email protected]>
> To: user <[email protected]>
> Sent: Wed, Apr 29, 2015 1:46 pm
> Subject: How to investigate recrawl issue
>
> Dear all,
>
> I'm completely new to Apache Nutch. I started using it only a few days
> ago and I was impressed by its capabilities.
>
> I'm experiencing a little issue I hope someone can help me fix.
>
> I configured a test instance of Apache Nutch (1.9) to crawl a news
> website using the following parameters:
>
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>NewsWatcher Agent</value>
>   </property>
>   <property>
>     <name>fetcher.threads.per.queue</name>
>     <value>50</value>
>     <description></description>
>   </property>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|scoring-depth</value>
>     <description></description>
>   </property>
>   <property>
>     <name>db.fetch.interval.default</name>
>     <value>300</value>
>     <description></description>
>   </property>
> </configuration>
>
> I am running a cron job over the ./bin/crawl command every five minutes
> with _maxdepth_=2, because I want to frequently update my index with
> only the new articles published on the homepage, without crawling the
> whole site.
>
> On the first run everything is fine, but after that it seems the
> homepage is not updated anymore.
>
> Looking at the log file, the whole process seems to be OK, but I cannot
> see new articles published on the homepage in my index.
>
> Looking in the crawldb with the readdb command, I always get the same
> signature even if the page has changed.
>
> Can anyone help me understand how to investigate this issue?
>
> Is there something else I can check besides the log file?
> Is there any debug option I can enable?
>
> Thanks a lot everybody in advance,
> Matteo

