Hi all, 


I am using the Elasticsearch indexing plugin for Nutch to index a web site and
add search info to an Elasticsearch index. It has been working well so far. As
a test, I removed a single document from a previously indexed web site and
re-ran the Nutch crawler on that site. The web site correctly returned an HTTP
404 status for the deleted document when Nutch fetched it. The crawl seemed to
finish successfully, BUT the deleted document is still showing up in the
Elasticsearch index. I expected/hoped it would be deleted from the index.

Does anyone have any idea why the deleted (from web site) document is not being 
deleted from the Elasticsearch index? 


Here's what I see for this document when I dump nutch's most recent segment 
data:

Recno:: 136
URL:: http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form

CrawlDatum::
Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 09 11:49:53 CDT 2014
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 3600 seconds (0 days)
Score: 0.006134969
Signature: ba168f4ecf34ccbb1adea384a5f5a78d
Metadata:
        _ngt_=1399668503370
        Content-Type=text/html
        _pst_=success(1), lastModified=0
        _rs_=154

CrawlDatum::
Version: 7
Status: 37 (fetch_gone)
Fetch time: Fri May 09 15:49:33 CDT 2014
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 3600 seconds (0 days)
Score: 0.006134969
Signature: ba168f4ecf34ccbb1adea384a5f5a78d
Metadata:
        _ngt_=1399668503370
        Content-Type=text/html
        _pst_=notfound(14), lastModified=0: http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form
        _rs_=6748


I see that there are two "CrawlDatum" records: one has a status of 2
(db_fetched) and the other has a status of 37 (fetch_gone). Based on
hadoop.log, the indexing data appears to have been sent to Elasticsearch
successfully, but hadoop.log does not provide much information about the
Elasticsearch step.

Thanks,


 
-Lou
