Hi all, this is a pretty basic question, so apologies in advance (I haven't been able to find an answer).
If I have a web site/server whose content (both HTML pages and PDF/Word/Excel documents etc.) is constantly changing, i.e. new files are being added and existing files deleted or updated, how does Nutch deal with this?

I have set up Nutch with Solr and can see a digest field for each file in the Solr index, which seems to be some form of hash. However, when I run the Nutch crawl it only seems to add new files. Does Nutch have some mechanism for detecting deleted and updated files? How does it deal with sites that are constantly changing, and how do people trigger their crawls on such sites?

Apologies if this is all a bit vague, but I'm struggling to decide the best way to explain what I'm trying to achieve without a better understanding of the underlying processes.

Regards
Paul
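To make the "triggering crawls" part of the question concrete: at the moment I just run the crawl by hand, and I'm guessing people schedule recurring crawls with something like cron. A rough sketch of what I imagine (the paths, seed directory, round count, and Solr URL below are placeholders, not my actual setup):

```shell
# Hypothetical crontab entry: recrawl every night at 02:00 and index into Solr.
# NUTCH_HOME, urls/, crawl/, and the Solr URL are placeholder values.
0 2 * * * $NUTCH_HOME/bin/crawl urls/ crawl/ http://localhost:8983/solr/ 2 >> /var/log/nutch-recrawl.log 2>&1
```

Is something along these lines what people actually do, or is there a better-supported way to keep the index in step with a changing site?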

