Thanks Tom,
I don't see that property in my nutch-default.xml. I think it's probably
from an older version.
I'm just gonna write a util method that queries the Gora store for the
pages that are gone and deletes the matching documents from the index.
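A rough sketch of the shape such a util could take. This is not the real Nutch or Gora API: PageRecord, its gone flag, and the delete callback are hypothetical stand-ins for a row read from the store and for whatever client talks to the index, so the actual store query and delete call would need to be plugged in.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class GonePageCleaner {

    /** Hypothetical stand-in for a row read from the Gora-backed web table. */
    public static final class PageRecord {
        final String url;
        final boolean gone; // assumed flag: true when the last fetch ended in a 404

        public PageRecord(String url, boolean gone) {
            this.url = url;
            this.gone = gone;
        }
    }

    /**
     * Walks the stored pages and passes the URL of every "gone" page to the
     * supplied delete callback (for example, a wrapper around the index client).
     * Returns the URLs for which a delete was issued.
     */
    public static List<String> purgeGone(Iterable<PageRecord> pages,
                                         Consumer<String> deleteFromIndex) {
        List<String> deleted = new ArrayList<>();
        for (PageRecord page : pages) {
            if (page.gone) {
                deleteFromIndex.accept(page.url);
                deleted.add(page.url);
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        // Example data only; a real run would iterate over a store query result.
        List<PageRecord> pages = Arrays.asList(
            new PageRecord("http://example.com/alive", false),
            new PageRecord("http://example.com/removed", true));
        List<String> deleted =
            purgeGone(pages, url -> System.out.println("delete " + url));
        System.out.println(deleted.size() + " document(s) deleted");
    }
}
```

Passing the delete step in as a callback keeps the store scan separate from the index client, so the same loop can be tried against a list in memory before pointing it at the live index.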
On 05/16/2017 04:12 AM, Tom Chiverton wrote:
Do you need to set
db.update.purge.404=true
?
Tom
On 15/05/17 20:35, Ben Vachon wrote:
Hi all,
I'm working with Nutch 2.3.1 and I have a problem that I'm hoping the
community can help me with.
A page is fetched successfully and subsequently indexed during the
initial run of a crawler, but later, the page no longer exists on the
server (404 not found). When I run the crawler again to update the
index, I would like my IndexWriter to delete the document for this page.
I have the necessary code for this in my IndexWriter, but pages that
are not successfully fetched are not parsed either, and therefore never
even reach my IndexFilters, let alone the IndexWriter. The page is
ignored instead of deleted.
Any tips for handling this?
Thanks,
Ben V.