Hi Louis

What do you get in the crawldb for that URL? Which version of Nutch are you
using?

The indexer takes a -deleteGone parameter; are you using it?
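
A rough sketch of both checks, assuming Nutch 1.x (1.7 or later, where indexing plugins are driven by `bin/nutch index`) and a hypothetical `crawl/` directory layout. The paths are placeholders, not taken from your setup. The script only prints the two commands so they can be reviewed and adapted before being run.

```shell
#!/bin/sh
# Sketch only: CRAWLDB, LINKDB and SEGMENTS are hypothetical placeholder paths.
CRAWLDB="crawl/crawldb"
LINKDB="crawl/linkdb"
SEGMENTS="crawl/segments"
URL="http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form"

# 1. Look up the crawldb entry for the URL; after the 404 the status
#    should be db_gone rather than db_fetched.
CHECK_CMD="bin/nutch readdb $CRAWLDB -url $URL"

# 2. Re-run indexing with -deleteGone so that gone documents are sent
#    to the index backend as deletions.
INDEX_CMD="bin/nutch index $CRAWLDB -linkdb $LINKDB -dir $SEGMENTS -deleteGone"

# Print the commands for review instead of executing them.
echo "$CHECK_CMD"
echo "$INDEX_CMD"
```

If the crawldb entry still shows db_fetched, the crawldb was probably not updated from the segment containing the 404, and -deleteGone alone will not help.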

Julien




On 12 May 2014 19:36, Louis Keeble <[email protected]> wrote:

>
> Hi all,
>
>
> I am using the elasticsearch nutch indexing plugin to index a web site and
> add search info to an elasticsearch index. It has been working well so far.
> As a test, I removed a single document from a previously indexed web
> site and re-ran the nutch crawler on this web site. The web site correctly
> returned an HTTP 404 (Not Found) status for the deleted document when it was
> fetched by nutch. The crawl seemed to finish successfully, but the deleted
> document is still showing up in the elasticsearch index. I expected/hoped
> it would be deleted from the index.
>
> Does anyone have any idea why the deleted (from web site) document is not
> being deleted from the Elasticsearch index?
>
>
> Here's what I see for this document when I dump nutch's most recent
> segment data:
>
> Recno:: 136
> URL::
>  http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form
>
> CrawlDatum::
> Version: 7
> Status: 2 (db_fetched)
> Fetch time: Fri May 09 11:49:53 CDT 2014
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 3600 seconds (0 days)
> Score: 0.006134969
> Signature: ba168f4ecf34ccbb1adea384a5f5a78d
> Metadata:
>         _ngt_=1399668503370
>         Content-Type=text/html
>         _pst_=success(1), lastModified=0
>         _rs_=154
>
> CrawlDatum::
> Version: 7
> Status: 37 (fetch_gone)
> Fetch time: Fri May 09 15:49:33 CDT 2014
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 3600 seconds (0 days)
> Score: 0.006134969
> Signature: ba168f4ecf34ccbb1adea384a5f5a78d
> Metadata:
>         _ngt_=1399668503370
>         Content-Type=text/html
>         _pst_=notfound(14), lastModified=0:
> http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form
>         _rs_=6748
>
>
> I see that there are two "CrawlDatum" records: one has a status of 2
> (db_fetched) and the other has a status of 37 (fetch_gone). The indexing
> data looked like it was sent to elasticsearch successfully based on the
> hadoop.log, but there isn't a lot of information provided in hadoop.log for
> elasticsearch.
> Thanks,
>
>
>
> -Lou




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
