Hi Julien, 


I am using Nutch 1.9 with ElasticSearch 0.9 (via the plugin).


When the site is initially indexed I see that the crawldb contains the 
soon-to-be-deleted URL as follows:

http://dahl/pages/Jim_Bloggs_School/to_be_deleted_aardvark      Version: 7
Status: 2 (db_fetched)
Fetch time: Tue May 20 11:49:53 CDT 2014
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 900 seconds (0 days)
Score: 0.0060975607
Signature: ecad1fb13879445c07ca3c9b302077a4
Metadata:
        Content-Type=text/html
        _pst_=success(1), lastModified=0
        _rs_=478


When I delete the document and rerun the crawl, I no longer see any record of 
the deleted document in the new crawldb dump.
However, the document remains in the ElasticSearch index.
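
(For reference, a single URL can be checked against the crawldb without 
reading through a full dump. The sketch below assumes the crawldb lives at 
crawl/crawldb, which is a placeholder path and not taken from the setup 
described here. It only prints the command unless RUN=1 is set.)

```shell
#!/bin/sh
# Sketch: look up one URL in the crawldb with "readdb -url".
# CRAWLDB is a placeholder path (an assumption); adjust to the real location.
CRAWLDB="${CRAWLDB:-crawl/crawldb}"
URL="http://dahl/pages/Jim_Bloggs_School/to_be_deleted_aardvark"

# Build the lookup command as a string so it can be inspected first.
readdb_cmd() {
  echo "bin/nutch readdb $CRAWLDB -url $1"
}

# Dry run by default; set RUN=1 to actually invoke Nutch.
if [ "${RUN:-0}" = "1" ]; then
  sh -c "$(readdb_cmd "$URL")"
else
  readdb_cmd "$URL"
fi
```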

Note: the method used is to generate a web page with only the documents that I 
want to be indexed and use that page as the initial seed page.
So, the second time the crawler runs, it sees a seed page *without* my deleted 
document.
However, I assume that the retry interval of 900 seconds (see crawldb record 
above) means that nutch will try to refetch the deleted document, at which 
point it will get a 404 (deleted) from the web server. 


I have the following settings in my nutch-site.xml file (among other settings):


    "link.delete.gone"    : "true",
    "db.update.purge.404" : "true"

(Don't worry about the non-XML formatting; this is JSON, but it gets translated 
to XML during a pre-processing step.)


** I am not currently using the -deleteGone parameter anywhere.  **

I am using the bin/crawl all-in-one script, something like this:

bin/crawl <seed_url_folder> <path_to_crawldb_files> -depth 2 -topN 10000


Where would I put the -deleteGone parameter? Do I just add another parameter to 
the end? I saw online that -deleteGone is a valid parameter of the bin/nutch 
command, but I am not sure about bin/crawl. Maybe I need to run bin/nutch for this?
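
For illustration, this is roughly what running the indexing step on its own 
with the flag could look like. It is a sketch under assumptions: the crawl 
directory name "crawl" and its crawldb/linkdb/segments layout are placeholders, 
and it says nothing about whether bin/crawl forwards -deleteGone. The script 
only prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch: run the Nutch indexing step by hand with -deleteGone.
# CRAWL_DIR and the sub-directory names are assumptions, not taken from the
# setup described in this message.
CRAWL_DIR="${CRAWL_DIR:-crawl}"

# Build the index command for every segment under $CRAWL_DIR/segments.
index_cmd() {
  echo "bin/nutch index $CRAWL_DIR/crawldb -linkdb $CRAWL_DIR/linkdb -dir $CRAWL_DIR/segments -deleteGone"
}

# Dry run by default; set RUN=1 to actually invoke Nutch.
if [ "${RUN:-0}" = "1" ]; then
  sh -c "$(index_cmd)"
else
  index_cmd
fi
```

Here the flag goes to the index step itself rather than to the crawl 
script, so the fetch and updatedb rounds would have to be run before it.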


Thanks for your help!


 
-Lou


________________________________
 From: Julien Nioche <[email protected]>
To: "[email protected]" <[email protected]>; Louis Keeble 
<[email protected]> 
Sent: Monday, May 19, 2014 8:03 AM
Subject: Re: Nutch with elasticsearch plugin not removing a deleted doc from 
the elasticsearch index
 

Hi Louis

What do you get in the crawldb for that URL? Which version of Nutch are you
using?

The indexer takes a -deleteGone parameter, are you using it?

Julien





On 12 May 2014 19:36, Louis Keeble <[email protected]> wrote:

>
>
>
>
> Hi all,
>
>
> I am using the elasticsearch nutch indexing plugin to index a web site and
> add search info to an elasticsearch index. It has been working well so far.
> As a test, I removed a single document from a previously indexed web
> site and re-ran the nutch crawler on this web site. The web site correctly
> gave an HTTP 404 (deleted) status for the deleted document when it was
> fetched by nutch. The crawl seemed to finish successfully BUT the deleted
> document is still showing up in the elasticsearch index. I expected/hoped
> it would be deleted from the index.
>
> Does anyone have any idea why the deleted (from web site) document is not
> being deleted from the Elasticsearch index?
>
>
> Here's what I see for this document when I dump nutch's most recent
> segment data:
>
> Recno:: 136
> URL::
>  http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form
>
> CrawlDatum::
> Version: 7
> Status: 2 (db_fetched)
> Fetch time: Fri May 09 11:49:53 CDT 2014
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 3600 seconds (0 days)
> Score: 0.006134969
> Signature: ba168f4ecf34ccbb1adea384a5f5a78d
> Metadata:
>         _ngt_=1399668503370
>         Content-Type=text/html
>         _pst_=success(1), lastModified=0
>         _rs_=154
>
> CrawlDatum::
> Version: 7
> Status: 37 (fetch_gone)
> Fetch time: Fri May 09 15:49:33 CDT 2014
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 3600 seconds (0 days)
> Score: 0.006134969
> Signature: ba168f4ecf34ccbb1adea384a5f5a78d
> Metadata:
>         _ngt_=1399668503370
>         Content-Type=text/html
>         _pst_=notfound(14), lastModified=0:
> http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form
>         _rs_=6748
>
>
> I see that there are two "CrawlDatum" records, one has a status of 2
> (db_fetched) and the other has a status of 37 (fetch_gone). The indexing
> data looked like it was sent to elasticsearch successfully based on the
> hadoop.log, but there isn't a lot of information provided in hadoop.log for
> elasticsearch.
> Thanks,
>
>
>
> -Lou




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
