Hi all,

I'm working with Nutch 2.3.1 and have a problem that I hope the community can help with. A page is fetched and indexed successfully during the crawler's initial run, but it is later removed from the server (404 Not Found). When I run the crawler again to update the index, I would like my IndexWriter to delete the document for this page. I have the necessary code for this in my IndexWriter, but pages that are not fetched successfully are never parsed, so they never reach my IndexFilters, let alone the IndexWriter.
The page is ignored instead of deleted.
Any tips for handling this?

Thanks,

Ben V.
