Hi all,
I am using the nutch Elasticsearch indexing plugin to crawl a web site and add
its search data to an Elasticsearch index. It has been working well so far. As a
test, I removed a single document from a previously indexed web site and
re-ran the nutch crawler on that site. The web site correctly returned an HTTP
404 (Not Found) status for the removed document when nutch fetched it. The
crawl seemed to finish successfully, BUT the removed document is still showing
up in the Elasticsearch index. I expected/hoped it would be deleted from the
index.
Does anyone have any idea why the deleted (from web site) document is not being
deleted from the Elasticsearch index?
Here's what I see for this document when I dump nutch's most recent segment
data:
Recno:: 136
URL:: http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form
CrawlDatum::
Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 09 11:49:53 CDT 2014
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 3600 seconds (0 days)
Score: 0.006134969
Signature: ba168f4ecf34ccbb1adea384a5f5a78d
Metadata:
_ngt_=1399668503370
Content-Type=text/html
_pst_=success(1), lastModified=0
_rs_=154
CrawlDatum::
Version: 7
Status: 37 (fetch_gone)
Fetch time: Fri May 09 15:49:33 CDT 2014
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 3600 seconds (0 days)
Score: 0.006134969
Signature: ba168f4ecf34ccbb1adea384a5f5a78d
Metadata:
_ngt_=1399668503370
Content-Type=text/html
_pst_=notfound(14), lastModified=0: http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form
_rs_=6748
I see that there are two "CrawlDatum" records: one has a status of 2
(db_fetched) and the other has a status of 37 (fetch_gone). Judging by
hadoop.log, the indexing data was sent to Elasticsearch successfully, but
hadoop.log records very little detail about the Elasticsearch step.
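In case it helps anyone reproduce the check, here is a rough Python sketch that lists the CrawlDatum statuses per URL from a text segment dump like the one above and picks out the URLs with a fetch_gone record. The function names (statuses_by_url, gone_urls) are made up for illustration and are not part of nutch. The parsing assumes the dump layout shown in this message ("URL::" followed by the URL on the same or the next line, then "Status: N (name)" lines), which may differ in other nutch versions.

```python
import re
from collections import defaultdict

# Matches lines such as "Status: 37 (fetch_gone)" (layout assumed from the dump above).
STATUS_RE = re.compile(r"^Status:\s*(\d+)\s*\((\w+)\)")


def statuses_by_url(dump_text):
    """Map each URL in the dump to the list of (code, name) statuses seen for it."""
    result = defaultdict(list)
    url = None
    expect_url = False  # True when "URL::" was seen with the URL on a following line
    for raw in dump_text.splitlines():
        line = raw.strip()
        if line.startswith("URL::"):
            rest = line[len("URL::"):].strip()
            if rest:
                url = rest
            else:
                expect_url = True
            continue
        if expect_url:
            if line:  # skip blank lines while waiting for the wrapped URL
                url = line
                expect_url = False
            continue
        match = STATUS_RE.match(line)
        if match and url is not None:
            result[url].append((int(match.group(1)), match.group(2)))
    return dict(result)


def gone_urls(dump_text):
    """Return the URLs that have at least one fetch_gone record."""
    return [
        url
        for url, statuses in statuses_by_url(dump_text).items()
        if any(name == "fetch_gone" for _, name in statuses)
    ]
```

Calling gone_urls() on the contents of the dump file returns the URLs that nutch saw as gone in that segment, which can then be compared against what is still present in the index.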
Thanks,
-Lou