Hi Julien,

I am using Nutch 1.9 with ElasticSearch 0.9 (via the plugin). When the site is initially indexed, I see that the crawldb contains the soon-to-be-deleted URL as follows:

http://dahl/pages/Jim_Bloggs_School/to_be_deleted_aardvark
Version: 7
Status: 2 (db_fetched)
Fetch time: Tue May 20 11:49:53 CDT 2014
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 900 seconds (0 days)
Score: 0.0060975607
Signature: ecad1fb13879445c07ca3c9b302077a4
Metadata: Content-Type=text/html _pst_=success(1), lastModified=0 _rs_=478

When I delete the document and rerun the crawl, I no longer see any record of the deleted document in the new crawldb dump. However, the document remains in the ElasticSearch index.

Note: the method I use is to generate a web page containing only the documents that I want indexed, and to use that page as the initial seed page. So, the second time the crawler runs, it sees a seed page *without* my deleted document. However, I assume that the retry interval of 900 seconds (see the crawldb record above) means that Nutch will try to refetch the deleted document, at which point it will get a 404 from the web server because the document has been deleted.

I have the following settings in my nutch-site.xml file (among other settings):

"link.delete.gone" : "true",
"db.update.purge.404" : "true"

(Don't worry about the non-XML formatting; this is JSON, but it gets translated to XML during a pre-processing step. I have put my best guess at the generated XML in a P.S. at the very bottom of this message, below the quoted thread.)

** I am not currently using the -deleteGone parameter anywhere. **

I am using the bin/crawl all-in-one script, something like this:

bin/crawl <seed_url_folder> <path_to_crawldb_files> -depth 2 -topN 10000

Where would I put the -deleteGone parameter? Should I just add it as another parameter at the end? I saw online that -deleteGone is a valid parameter of the bin/nutch command, but I am not sure about bin/crawl. Maybe I need to run bin/nutch for this? (I have sketched what I think that would look like in the same P.S. below.)

Thanks for your help!

-Lou

________________________________
From: Julien Nioche <[email protected]>
To: "[email protected]" <[email protected]>; Louis Keeble <[email protected]>
Sent: Monday, May 19, 2014 8:03 AM
Subject: Re: Nutch with elasticsearch plugin not removing a deleted doc from the elasticsearch index

Hi Louis

What do you get in the crawldb for that URL? Which version of Nutch are you using? The indexer takes a -deleteGone parameter; are you using it?

Julien

On 12 May 2014 19:36, Louis Keeble <[email protected]> wrote:
> Hi all,
>
> I am using the elasticsearch nutch indexing plugin to index a web site
> and add search info to an elasticsearch index. It has been working well
> so far. As a test, I removed a single document from a previously indexed
> web site and re-ran the nutch crawler on this web site. The web site
> correctly returned an HTTP 404 for the deleted document when it was
> fetched by nutch. The crawl seemed to finish successfully BUT the deleted
> document is still showing up in the elasticsearch index. I expected/hoped
> it would be deleted from the index.
>
> Does anyone have any idea why the deleted (from web site) document is
> not being deleted from the Elasticsearch index?
>
> Here's what I see for this document when I dump nutch's most recent
> segment data:
>
> Recno:: 136
> URL:: http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form
>
> CrawlDatum::
> Version: 7
> Status: 2 (db_fetched)
> Fetch time: Fri May 09 11:49:53 CDT 2014
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 3600 seconds (0 days)
> Score: 0.006134969
> Signature: ba168f4ecf34ccbb1adea384a5f5a78d
> Metadata:
>   _ngt_=1399668503370
>   Content-Type=text/html
>   _pst_=success(1), lastModified=0
>   _rs_=154
>
> CrawlDatum::
> Version: 7
> Status: 37 (fetch_gone)
> Fetch time: Fri May 09 15:49:33 CDT 2014
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 3600 seconds (0 days)
> Score: 0.006134969
> Signature: ba168f4ecf34ccbb1adea384a5f5a78d
> Metadata:
>   _ngt_=1399668503370
>   Content-Type=text/html
>   _pst_=notfound(14), lastModified=0: http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form
>   _rs_=6748
>
> I see that there are two "CrawlDatum" records, one has a status of 2
> (db_fetched) and the other has a status of 37 (fetch_gone). The indexing
> data looked like it was sent to elasticsearch successfully based on the
> hadoop.log, but there isn't a lot of information provided in hadoop.log
> for elasticsearch.
>
> Thanks,
>
> -Lou

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
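
P.S. A couple of sketches to make the above more concrete.

First, this is what I believe the two properties mentioned above look like once our JSON-to-XML pre-processing has run, i.e. standard nutch-site.xml property entries inside the existing <configuration> element. If the generated file differs from this, that could be part of my problem:

  <property>
    <name>db.update.purge.404</name>
    <value>true</value>
  </property>
  <property>
    <name>link.delete.gone</name>
    <value>true</value>
  </property>

Second, here is my best guess at running the indexing and cleaning steps directly with bin/nutch instead of through bin/crawl, with -deleteGone on the index step. The paths (crawl/crawldb, crawl/linkdb, crawl/segments) are just placeholders for my own directories, and I am not certain of the exact argument order, so please correct me if I have this wrong:

  # index the segments, asking the indexer plugins to delete gone documents
  bin/nutch index crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments -deleteGone

  # remove documents marked as gone (404) in the crawldb from the index
  bin/nutch clean crawl/crawldb

In case it helps, the kind of commands that produce the crawldb and segment dumps quoted in this thread look roughly like:

  bin/nutch readdb crawl/crawldb -dump crawldb_dump
  bin/nutch readseg -dump crawl/segments/<segment_name> segment_dump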

