Hi Lou

> Where would I put the -deleteGone parameter? Just add another parameter
> to the end? I saw online that -deleteGone is a valid parameter of the
> bin/nutch command but am not sure about bin/crawl. Maybe I need to run
> bin/nutch for this?

Just modify the crawl script and add the -deleteGone parameter to the index step:

    $bin/nutch index -D solr.server.url=$SOLRURL $CRAWL_PATH/crawldb \
      -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT -deleteGone

HTH

Julien
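For an Elasticsearch setup like Lou's, a minimal sketch of the same edit
(hedged: the indexer-elastic plugin reads its connection settings from
nutch-site.xml, so the solr.server.url property only matters when
indexer-solr is active; $CRAWL_PATH and $SEGMENT are the crawl script's
own variables):

    # In bin/crawl (Nutch 1.9), find the indexing step and append -deleteGone
    # so that pages whose fetch status came back "gone" (e.g. HTTP 404) are
    # deleted from the index instead of being skipped.
    $bin/nutch index $CRAWL_PATH/crawldb \
      -linkdb $CRAWL_PATH/linkdb \
      $CRAWL_PATH/segments/$SEGMENT \
      -deleteGone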
On 20 May 2014 23:00, Louis Keeble <[email protected]> wrote:

> Hi Julien,
>
> I am using Nutch 1.9 with ElasticSearch 0.9 (via the plugin).
>
> When the site is initially indexed I see that the crawldb contains the
> soon-to-be-deleted URL as follows:
>
> http://dahl/pages/Jim_Bloggs_School/to_be_deleted_aardvark
> Version: 7
> Status: 2 (db_fetched)
> Fetch time: Tue May 20 11:49:53 CDT 2014
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 900 seconds (0 days)
> Score: 0.0060975607
> Signature: ecad1fb13879445c07ca3c9b302077a4
> Metadata:
>         Content-Type=text/html
>         _pst_=success(1), lastModified=0
>         _rs_=478
>
> When I delete the document and rerun the crawl I no longer see any record
> of the deleted document in the new crawldb dump. However, the document
> remains in the ElasticSearch index.
>
> Note: the method used is to generate a web page with only the documents
> that I want to be indexed and use that page as the initial seed page.
> So, the second time the crawler runs, it sees a seed page *without* my
> deleted document. However, I assume that the retry interval of 900
> seconds (see crawldb record above) means that nutch will try to refetch
> the deleted document, at which point it will get a 404 (deleted) from
> the web server.
>
> I have the following settings in my nutch-site.xml file (among other
> settings):
>
> "link.delete.gone" : "true",
> "db.update.purge.404" : "true"
>
> (Don't worry about the non-XML formatting; this is JSON but it gets
> translated to XML during a pre-processing step.)
>
> ** I am not currently using the -deleteGone parameter anywhere. **
>
> I am using the bin/crawl all-in-one script, something like this:
>
> bin/crawl <seed_url_folder> <path_to_crawldb_files> -depth 2 -topN 10000
>
> Where would I put the -deleteGone parameter? Just add another parameter
> to the end? I saw online that -deleteGone is a valid parameter of the
> bin/nutch command but am not sure about bin/crawl. Maybe I need to run
> bin/nutch for this?
>
> Thanks for your help!
>
> -Lou
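Since the crawl script is just a wrapper around bin/nutch, the same step can
also be run by hand against an existing crawl. A rough sketch, assuming the
standard crawl-directory layout (crawldb, linkdb, segments/*) under whatever
<path_to_crawldb_files> was above:

    # Hypothetical one-off run of the indexing step outside bin/crawl.
    # The segment chosen is simply the most recent one on disk.
    CRAWL_PATH=crawl
    SEGMENT=$(ls -d "$CRAWL_PATH"/segments/* | sort | tail -n 1)
    bin/nutch index "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb \
      "$SEGMENT" -deleteGone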
>
> ________________________________
> From: Julien Nioche <[email protected]>
> To: "[email protected]" <[email protected]>; Louis Keeble <[email protected]>
> Sent: Monday, May 19, 2014 8:03 AM
> Subject: Re: Nutch with elasticsearch plugin not removing a deleted doc
> from the elasticsearch index
>
> Hi Louis
>
> What do you get in the crawldb for that URL? Which version of Nutch are
> you using?
>
> The indexer takes a -deleteGone parameter; are you using it?
>
> Julien
>
> On 12 May 2014 19:36, Louis Keeble <[email protected]> wrote:
>
> > Hi all,
> >
> > I am using the elasticsearch nutch indexing plugin to index a web site
> > and add search info to an elasticsearch index. It has been working
> > well so far. As a test, I removed a single document from a previously
> > indexed web site and re-ran the nutch crawler on this web site. The
> > web site correctly gave an HTTP 404 (deleted) status for the deleted
> > document when it was fetched by nutch. The crawl seemed to finish
> > successfully BUT the deleted document is still showing up in the
> > elasticsearch index. I expected/hoped it would be deleted from the
> > index.
> >
> > Does anyone have any idea why the deleted (from web site) document is
> > not being deleted from the Elasticsearch index?
> >
> > Here's what I see for this document when I dump nutch's most recent
> > segment data:
> >
> > Recno:: 136
> > URL:: http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form
> >
> > CrawlDatum::
> > Version: 7
> > Status: 2 (db_fetched)
> > Fetch time: Fri May 09 11:49:53 CDT 2014
> > Modified time: Wed Dec 31 18:00:00 CST 1969
> > Retries since fetch: 0
> > Retry interval: 3600 seconds (0 days)
> > Score: 0.006134969
> > Signature: ba168f4ecf34ccbb1adea384a5f5a78d
> > Metadata:
> >         _ngt_=1399668503370
> >         Content-Type=text/html
> >         _pst_=success(1), lastModified=0
> >         _rs_=154
> >
> > CrawlDatum::
> > Version: 7
> > Status: 37 (fetch_gone)
> > Fetch time: Fri May 09 15:49:33 CDT 2014
> > Modified time: Wed Dec 31 18:00:00 CST 1969
> > Retries since fetch: 0
> > Retry interval: 3600 seconds (0 days)
> > Score: 0.006134969
> > Signature: ba168f4ecf34ccbb1adea384a5f5a78d
> > Metadata:
> >         _ngt_=1399668503370
> >         Content-Type=text/html
> >         _pst_=notfound(14), lastModified=0:
> >         http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form
> >         _rs_=6748
> >
> > I see that there are two CrawlDatum records: one has a status of 2
> > (db_fetched) and the other has a status of 37 (fetch_gone). The
> > indexing data looked like it was sent to elasticsearch successfully
> > based on the hadoop.log, but there isn't a lot of information provided
> > in hadoop.log for elasticsearch.
> >
> > Thanks,
> >
> > -Lou

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
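For anyone retracing this thread, a hedged way to check both halves of the
problem after a re-crawl: whether the crawldb has marked (or, with
db.update.purge.404=true, purged) the gone page, and whether the document is
still in Elasticsearch. The crawl path, host, port, and index name below are
assumptions ("nutch" is a common default index name for the indexer-elastic
plugin):

    # Look up the single URL in the crawldb; with db.update.purge.404=true
    # the record may disappear entirely rather than be marked db_gone.
    bin/nutch readdb crawl/crawldb -url \
      "http://dahl/pages/Jim_Bloggs_School/to_be_deleted_aardvark"

    # Ask Elasticsearch (0.90-era REST API) whether the page is still indexed;
    # host, port, and index name are assumed.
    curl "http://localhost:9200/nutch/_search?q=to_be_deleted_aardvark&pretty=true"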

