On Monday 18 July 2011 15:13:41 Marek Bachmann wrote:
> Hi List,
> 
> I have a small test set for working with nutch and solr. I wanted to
> see, if it is possible to delete pages from the solr index after nutch
> had fetched them with 404.
> 
> As far as I know, there is a command "solrclean" which should handle
> this task. It should go through the crawldb and delete all URLs that
> are marked as gone.
> 
> But for some reasons it doesn't work right in my case:
> 
> I had made a crawl over a set of pages. All of them were fetchable
> (200). The total count of URLs was 1999. I indexed them successfully
> into Solr.
> 
> After that, I wanted to know, what will happen after a recrawl if pages
> disappear. So I logged into the CMS and deleted a category of pages.
> 
> After a recrawl, updatedb, invertlinks etc, my crawldb looked like this:
> 
> root@hrz-vm180:/home/nutchServer/intranet_nutch/runtime/local/bin#
> ./nutch readdb crawl/crawldb/ -stats
> CrawlDb statistics start: crawl/crawldb/
> Statistics for CrawlDb: crawl/crawldb/
> TOTAL urls:   1999
> retry 0:      1999
> min score:    0.0
> avg score:    0.04724062
> max score:    7.296
> status 3 (db_gone):   169
> status 6 (db_notmodified):    1830
> CrawlDb statistics: done
> 
> That's what I had expected. 169 pages are gone. Fine.
> 
> Next I ran solrclean and solrdedup.
> 
> root@hrz-vm180:/home/nutchServer/intranet_nutch/runtime/local/bin#
> ./nutch solrclean crawl/crawldb/ http://hrz-vm180:8983/solr
> SolrClean: starting at 2011-07-18 15:04:51
> SolrClean: deleting 169 documents
> SolrClean: deleted a total of 169 documents
> SolrClean: finished at 2011-07-18 15:04:54, elapsed: 00:00:02
> 
> So SolrClean reports that it deleted all 169 documents that are gone.
> 
> But when I query for the word "Datendienste", Solr still responds
> with pages that are actually gone. Here is an example:
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <response>
> 
> <lst name="responseHeader">
>    <int name="status">0</int>
>    <int name="QTime">2</int>
>    <lst name="params">
>      <str name="indent">on</str>
>      <str name="start">0</str>
> 
>      <str name="q">id:"http://www.uni-kassel.de/its-baustelle/datendienste.html"</str>
>      <str name="version">2.2</str>
>      <str name="rows">10</str>
>    </lst>
> </lst>
> <result name="response" numFound="1" start="0">
>    <doc>
>      <float name="boost">1.7714221</float>
> 
>      <str name="content">IT Servicezentrum: Datendienste (...)</str>
>      <long name="contentLength">3947</long>
>      <date name="date">2011-07-11T14:26:33.933Z</date>
>      <str name="digest">06c077b62a8012772e6365333c74312d</str>
>      <str name="id">http://www.uni-kassel.de/its-baustelle/datendienste.html</str>
>      <str name="segment">20110708154848</str>
> 
>      <str name="title">IT Servicezentrum: Datendienste</str>
>      <date name="tstamp">2011-07-11T14:26:33.933Z</date>
>      <arr name="type"><str>text/html</str><str>text</str><str>html</str></arr>
>      <str name="url">http://www.uni-kassel.de/its-baustelle/datendienste.html</str>
>    </doc>
> </result>
> 
> </response>
> 
> After that I checked the URL
> "http://www.uni-kassel.de/its-baustelle/datendienste.html" in the
> crawldb and got:
> 
> ./nutch readdb crawl/crawldb/ -url
> http://www.uni-kassel.de/its-baustelle/datendienste.html
> URL: http://www.uni-kassel.de/its-baustelle/datendienste.html
> not found

This means (IIRC) that the URL is not in the CrawlDb, and whatever is not in 
the CrawlDb cannot be removed by solrclean. Check the Solr log and see whether 
it actually receives the delete commands. Did you issue a commit as well?
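If the deletes were sent but never committed, they will not be visible to 
searchers yet. As a sketch (assuming the Solr base URL from your commands 
and the document id shown in the response above), the delete plus a commit 
can also be issued by hand through Solr's XML update handler:

```shell
# Hedged sketch: manually delete the stale document by id and commit,
# using the Solr URL and document id quoted earlier in this thread.
curl "http://hrz-vm180:8983/solr/update?commit=true" \
  -H "Content-Type: text/xml" \
  --data-binary '<delete><id>http://www.uni-kassel.de/its-baustelle/datendienste.html</id></delete>'
```

If that makes the document disappear from your id: query, the missing piece 
in the solrclean run was most likely the commit.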

> 
> Now I am wondering, why the page is still in the solr index.
> 
> Thank you

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
