Hi Louis

What do you get in the crawldb for that URL? Which version of Nutch are you using?
The indexer takes a -deleteGone parameter. Are you using it?

Julien

On 12 May 2014 19:36, Louis Keeble <[email protected]> wrote:

> Hi all,
>
> I am using the elasticsearch nutch indexing plugin to index a web site and
> add search info to an elasticsearch index. It has been working well so far.
> As a test, I removed a single document from a previously indexed web
> site and re-ran the nutch crawler on this web site. The web site correctly
> gave an HTTP 404 (deleted) status for the deleted document when it was
> fetched by nutch. The crawl seemed to finish successfully BUT the deleted
> document is still showing up in the elasticsearch index. I expected/hoped
> it would be deleted from the index.
>
> Does anyone have any idea why the deleted (from web site) document is not
> being deleted from the Elasticsearch index?
>
> Here's what I see for this document when I dump nutch's most recent
> segment data:
>
> Recno:: 136
> URL:: http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form
>
> CrawlDatum::
> Version: 7
> Status: 2 (db_fetched)
> Fetch time: Fri May 09 11:49:53 CDT 2014
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 3600 seconds (0 days)
> Score: 0.006134969
> Signature: ba168f4ecf34ccbb1adea384a5f5a78d
> Metadata:
>   _ngt_=1399668503370
>   Content-Type=text/html
>   _pst_=success(1), lastModified=0
>   _rs_=154
>
> CrawlDatum::
> Version: 7
> Status: 37 (fetch_gone)
> Fetch time: Fri May 09 15:49:33 CDT 2014
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 3600 seconds (0 days)
> Score: 0.006134969
> Signature: ba168f4ecf34ccbb1adea384a5f5a78d
> Metadata:
>   _ngt_=1399668503370
>   Content-Type=text/html
>   _pst_=notfound(14), lastModified=0: http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form
>   _rs_=6748
>
> I see that there are two "CrawlDatum" records: one has a status of 2
> (db_fetched) and the other has a status of 37 (fetch_gone).
> The indexing data looked like it was sent to elasticsearch successfully,
> based on the hadoop.log, but there isn't a lot of information provided in
> hadoop.log for elasticsearch.
>
> Thanks,
>
> -Lou

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
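
To see at a glance which URLs in a segment dump carry a fetch_gone record (the pages a delete-aware indexing run would be expected to act on), a small script along these lines might help. It is only a sketch: it assumes the plain-text dump layout quoted in this thread ("URL::" followed by "Status: N (name)" lines), and the function names `statuses_by_url` and `gone_urls` are made up for illustration; they are not part of Nutch.

```python
import re
from collections import defaultdict

# Matches dump lines such as "Status: 37 (fetch_gone)".
STATUS_RE = re.compile(r"^Status:\s*(\d+)\s*\((\w+)\)")


def statuses_by_url(dump_text):
    """Map each URL in a segment dump to the status names listed under it."""
    result = defaultdict(list)
    current_url = None
    expect_url = False  # True when "URL::" was on its own line (wrapped mail)
    for raw in dump_text.splitlines():
        # Tolerate email quote markers ("> ") in front of each line.
        line = raw.lstrip("> ").strip()
        if line.startswith("URL::"):
            rest = line[len("URL::"):].strip()
            if rest:
                current_url = rest
            else:
                expect_url = True
            continue
        if expect_url and line:
            current_url = line
            expect_url = False
            continue
        match = STATUS_RE.match(line)
        if match and current_url is not None:
            result[current_url].append(match.group(2))
    return dict(result)


def gone_urls(dump_text):
    """Return the sorted URLs that have at least one fetch_gone record."""
    return sorted(
        url
        for url, names in statuses_by_url(dump_text).items()
        if "fetch_gone" in names
    )


# Abbreviated copy of the record quoted in this thread.
sample = """\
Recno:: 136
URL:: http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form

CrawlDatum::
Version: 7
Status: 2 (db_fetched)

CrawlDatum::
Version: 7
Status: 37 (fetch_gone)
"""

if __name__ == "__main__":
    for url in gone_urls(sample):
        print(url)
```

For the record above, the script lists the single URL, since it has both a db_fetched and a fetch_gone entry. Whether such a page is actually removed from the index still depends on how the indexing job is invoked, which is what the -deleteGone question is about.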

