Hi Lou

> Where would I put the -deleteGone parameter? Just add another parameter
> to the end? I saw online that -deleteGone is a valid parameter of the
> bin/nutch command but am not sure about bin/crawl. Maybe I need to run
> bin/nutch for this?

Just modify the crawl script and add the -deleteGone parameter to the index step:

    $bin/nutch index -D solr.server.url=$SOLRURL $CRAWL_PATH/crawldb \
      -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT -deleteGone

HTH

Julien
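For an Elasticsearch setup like Lou's, a minimal sketch of the same edit
(hedged: the indexer-elastic plugin reads its connection settings from
nutch-site.xml, so the solr.server.url property only matters when
indexer-solr is active; $CRAWL_PATH and $SEGMENT are the crawl script's
own variables):

    # In bin/crawl (Nutch 1.9), find the indexing step and append -deleteGone
    # so that pages whose fetch status came back "gone" (e.g. HTTP 404) are
    # deleted from the index instead of being skipped.
    $bin/nutch index $CRAWL_PATH/crawldb \
      -linkdb $CRAWL_PATH/linkdb \
      $CRAWL_PATH/segments/$SEGMENT \
      -deleteGone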
On 20 May 2014 23:00, Louis Keeble <[email protected]> wrote:

> Hi Julien,
>
> I am using Nutch 1.9 with ElasticSearch 0.9 (via the plugin).
>
> When the site is initially indexed I see that the crawldb contains the
> soon-to-be-deleted URL as follows:
>
> http://dahl/pages/Jim_Bloggs_School/to_be_deleted_aardvark
> Version: 7
> Status: 2 (db_fetched)
> Fetch time: Tue May 20 11:49:53 CDT 2014
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 900 seconds (0 days)
> Score: 0.0060975607
> Signature: ecad1fb13879445c07ca3c9b302077a4
> Metadata:
>         Content-Type=text/html
>         _pst_=success(1), lastModified=0
>         _rs_=478
>
> When I delete the document and rerun the crawl I no longer see any record
> of the deleted document in the new crawldb dump. However, the document
> remains in the ElasticSearch index.
>
> Note: the method used is to generate a web page with only the documents
> that I want to be indexed and use that page as the initial seed page.
> So, the second time the crawler runs, it sees a seed page *without* my
> deleted document. However, I assume that the retry interval of 900
> seconds (see crawldb record above) means that nutch will try to refetch
> the deleted document, at which point it will get a 404 (deleted) from
> the web server.
>
> I have the following settings in my nutch-site.xml file (among other
> settings):
>
> "link.delete.gone" : "true",
> "db.update.purge.404" : "true"
>
> (Don't worry about the non-XML formatting; this is JSON but it gets
> translated to XML during a pre-processing step.)
>
> ** I am not currently using the -deleteGone parameter anywhere. **
>
> I am using the bin/crawl all-in-one script, something like this:
>
> bin/crawl <seed_url_folder> <path_to_crawldb_files> -depth 2 -topN 10000
>
> Where would I put the -deleteGone parameter? Just add another parameter
> to the end? I saw online that -deleteGone is a valid parameter of the
> bin/nutch command but am not sure about bin/crawl. Maybe I need to run
> bin/nutch for this?
>
> Thanks for your help!
>
> -Lou
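Since the crawl script is just a wrapper around bin/nutch, the same step can
also be run by hand against an existing crawl. A rough sketch, assuming the
standard crawl-directory layout (crawldb, linkdb, segments/*) under whatever
<path_to_crawldb_files> was above:

    # Hypothetical one-off run of the indexing step outside bin/crawl.
    # The segment chosen is simply the most recent one on disk.
    CRAWL_PATH=crawl
    SEGMENT=$(ls -d "$CRAWL_PATH"/segments/* | sort | tail -n 1)
    bin/nutch index "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb \
      "$SEGMENT" -deleteGone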
>
> ________________________________
> From: Julien Nioche <[email protected]>
> To: "[email protected]" <[email protected]>; Louis Keeble <[email protected]>
> Sent: Monday, May 19, 2014 8:03 AM
> Subject: Re: Nutch with elasticsearch plugin not removing a deleted doc
> from the elasticsearch index
>
> Hi Louis
>
> What do you get in the crawldb for that URL? Which version of Nutch are
> you using?
>
> The indexer takes a -deleteGone parameter; are you using it?
>
> Julien
>
> On 12 May 2014 19:36, Louis Keeble <[email protected]> wrote:
>
> > Hi all,
> >
> > I am using the elasticsearch nutch indexing plugin to index a web site
> > and add search info to an elasticsearch index. It has been working
> > well so far. As a test, I removed a single document from a previously
> > indexed web site and re-ran the nutch crawler on this web site. The
> > web site correctly gave an HTTP 404 (deleted) status for the deleted
> > document when it was fetched by nutch. The crawl seemed to finish
> > successfully BUT the deleted document is still showing up in the
> > elasticsearch index. I expected/hoped it would be deleted from the
> > index.
> >
> > Does anyone have any idea why the deleted (from web site) document is
> > not being deleted from the Elasticsearch index?
> >
> > Here's what I see for this document when I dump nutch's most recent
> > segment data:
> >
> > Recno:: 136
> > URL:: http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form
> >
> > CrawlDatum::
> > Version: 7
> > Status: 2 (db_fetched)
> > Fetch time: Fri May 09 11:49:53 CDT 2014
> > Modified time: Wed Dec 31 18:00:00 CST 1969
> > Retries since fetch: 0
> > Retry interval: 3600 seconds (0 days)
> > Score: 0.006134969
> > Signature: ba168f4ecf34ccbb1adea384a5f5a78d
> > Metadata:
> >         _ngt_=1399668503370
> >         Content-Type=text/html
> >         _pst_=success(1), lastModified=0
> >         _rs_=154
> >
> > CrawlDatum::
> > Version: 7
> > Status: 37 (fetch_gone)
> > Fetch time: Fri May 09 15:49:33 CDT 2014
> > Modified time: Wed Dec 31 18:00:00 CST 1969
> > Retries since fetch: 0
> > Retry interval: 3600 seconds (0 days)
> > Score: 0.006134969
> > Signature: ba168f4ecf34ccbb1adea384a5f5a78d
> > Metadata:
> >         _ngt_=1399668503370
> >         Content-Type=text/html
> >         _pst_=notfound(14), lastModified=0:
> >         http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form
> >         _rs_=6748
> >
> > I see that there are two CrawlDatum records: one has a status of 2
> > (db_fetched) and the other has a status of 37 (fetch_gone). The
> > indexing data looked like it was sent to elasticsearch successfully
> > based on the hadoop.log, but there isn't a lot of information provided
> > in hadoop.log for elasticsearch.
> >
> > Thanks,
> >
> > -Lou

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
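For anyone retracing this thread, a hedged way to check both halves of the
problem after a re-crawl: whether the crawldb has marked (or, with
db.update.purge.404=true, purged) the gone page, and whether the document is
still in Elasticsearch. The crawl path, host, port, and index name below are
assumptions ("nutch" is a common default index name for the indexer-elastic
plugin):

    # Look up the single URL in the crawldb; with db.update.purge.404=true
    # the record may disappear entirely rather than be marked db_gone.
    bin/nutch readdb crawl/crawldb -url \
      "http://dahl/pages/Jim_Bloggs_School/to_be_deleted_aardvark"

    # Ask Elasticsearch (0.90-era REST API) whether the page is still indexed;
    # host, port, and index name are assumed.
    curl "http://localhost:9200/nutch/_search?q=to_be_deleted_aardvark&pretty=true"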

