Re: How to investigate recrawl issue

Jeff Cocking Wed, 29 Apr 2015 16:04:38 -0700

Nutch assigns every page a default refetch interval for reindexing, which
is 30 days out of the box.  There are also adaptive parameters that will
increase/decrease this interval.  If you want to reindex a page that fast,
you need to re-inject the page with the overwrite parameter set to true
and/or use a plugin like urlmeta to force in a reindex interval value.
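
As a rough sketch of the settings mentioned above, an override in
conf/nutch-site.xml could look like the following.  The name
db.fetch.interval.default comes from the original post; db.injector.overwrite
is the property name as I recall it from nutch-default.xml in the 1.x line, so
treat it as an assumption and check it against your own copy of the file.  The
values are illustrative only.

```xml
<configuration>
  <!-- Refetch interval in seconds; 300 is the value used in the original post. -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>300</value>
  </property>
  <!-- Assumed property name: lets re-injected URLs overwrite the existing
       CrawlDb records. Verify it exists in your nutch-default.xml. -->
  <property>
    <name>db.injector.overwrite</name>
    <value>true</value>
  </property>
</configuration>
```

With something like this in place, the home page URL would still have to be
injected again on each run for the overwrite to take effect.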


Spend some time in the nutch-default.xml file. It has all the levers that
can be adjusted for Nutch.

jeff

On Wed, Apr 29, 2015 at 4:19 PM, <[email protected]> wrote:

> There must be some config variable that allows setting timeModified to the
> current date when injected. You need to inject the home page URL on each run.
>
>
> hth
> Alex.
>
>
>
> -----Original Message-----
> From: Matteo Diarena <[email protected]>
> To: user <[email protected]>
> Sent: Wed, Apr 29, 2015 1:46 pm
> Subject: How to investigate recrawl issue
>
>
> Dear all,
>
> I'm completely new to Apache Nutch. I started using it only a few days ago
> and I was impressed by its capabilities.
>
> I'm experiencing a little issue I hope someone can help me fix:
>
> I configured a test instance of Apache Nutch (1.9) to crawl a news website
> using the following parameters:
>
>
>
> <configuration>
>   <property>
>     <name>http.agent.name</name>
>     <value>NewsWatcher Agent</value>
>   </property>
>   <property>
>     <name>fetcher.threads.per.queue</name>
>     <value>50</value>
>     <description></description>
>   </property>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|scoring-depth</value>
>     <description></description>
>   </property>
>   <property>
>     <name>db.fetch.interval.default</name>
>     <value>300</value>
>     <description></description>
>   </property>
> </configuration>
>
>
>
> and I am running a cron job that invokes the ./bin/crawl command every five
> minutes with a _maxdepth_=2, because I want to frequently update my index
> with only the new articles published on the homepage, without crawling the
> whole site.
>
>
>
> The first run is fine, but afterwards it seems the homepage is not updated
> anymore.
>
> Looking at the log file, the whole process seems OK, but I cannot see the
> new articles published on the homepage in my index.
>
>
>
> Looking in the crawldb with the readdb command, I always obtain the same
> signature even if the page has changed.
>
>
>
> Can anyone help me understand how to investigate this issue?
>
> Is there something else I can check besides the log file?
>
> Is there any debug option I can enable?
>
>
>
> Thanks a lot everybody in advance,
>
> Matteo
