Hi List,

I have a small test set for working with Nutch and Solr. I wanted to see whether it is possible to delete pages from the Solr index after Nutch has fetched them with a 404.

As far as I know, there is a command "solrclean" which should handle this task. It should go through the crawldb and delete all URLs that are marked as gone.

But for some reason it doesn't work right in my case:

I ran a crawl over a set of pages. All of them were fetchable (200). The total count of URLs was 1999. I indexed them successfully into Solr.

After that, I wanted to know what happens on a recrawl when pages disappear, so I logged into the CMS and deleted a category of pages.

After a recrawl (updatedb, invertlinks, etc.), my crawldb looked like this:

root@hrz-vm180:/home/nutchServer/intranet_nutch/runtime/local/bin# ./nutch readdb crawl/crawldb/ -stats
CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls:     1999
retry 0:        1999
min score:      0.0
avg score:      0.04724062
max score:      7.296
status 3 (db_gone):     169
status 6 (db_notmodified):      1830
CrawlDb statistics: done

That's what I had expected: 169 pages are gone. Fine.

Next I ran solrclean and solrdedup.

root@hrz-vm180:/home/nutchServer/intranet_nutch/runtime/local/bin# ./nutch solrclean crawl/crawldb/ http://hrz-vm180:8983/solr
SolrClean: starting at 2011-07-18 15:04:51
SolrClean: deleting 169 documents
SolrClean: deleted a total of 169 documents
SolrClean: finished at 2011-07-18 15:04:54, elapsed: 00:00:02

So SolrClean reports that it deleted all 169 documents that are gone.

But when I query for the word "Datendienste", Solr still responds with pages that are actually gone. Here is an example:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">2</int>
  <lst name="params">
    <str name="indent">on</str>
    <str name="start">0</str>
    <str name="q">id:"http://www.uni-kassel.de/its-baustelle/datendienste.html"</str>
    <str name="version">2.2</str>
    <str name="rows">10</str>
  </lst>
</lst>
<result name="response" numFound="1" start="0">
  <doc>
    <float name="boost">1.7714221</float>
    <str name="content">IT Servicezentrum: Datendienste (...)</str>
    <long name="contentLength">3947</long>
    <date name="date">2011-07-11T14:26:33.933Z</date>
    <str name="digest">06c077b62a8012772e6365333c74312d</str>
    <str name="id">http://www.uni-kassel.de/its-baustelle/datendienste.html</str>
    <str name="segment">20110708154848</str>
    <str name="title">IT Servicezentrum: Datendienste</str>
    <date name="tstamp">2011-07-11T14:26:33.933Z</date>
    <arr name="type"><str>text/html</str><str>text</str><str>html</str></arr>
    <str name="url">http://www.uni-kassel.de/its-baustelle/datendienste.html</str>
  </doc>
</result>
</response>
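To rule out anything on the admin UI side, I could also send the same id query directly. This is only a sketch of how I would build the request; host and core path are from my setup, and %22 encodes the double quotes around the id:

```shell
# Build the direct select URL for the stale document's id
# (host/port from my setup; assumes the default select handler).
SOLR="http://hrz-vm180:8983/solr"
ID="http://www.uni-kassel.de/its-baustelle/datendienste.html"
QUERY="$SOLR/select?q=id:%22$ID%22"
echo "$QUERY"
# Against the live Solr, one would then run:
# curl "$QUERY"
# After a successful clean (and commit), numFound should be 0.
```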

After that I checked the URL "http://www.uni-kassel.de/its-baustelle/datendienste.html" in the crawldb and got:

./nutch readdb crawl/crawldb/ -url http://www.uni-kassel.de/its-baustelle/datendienste.html
URL: http://www.uni-kassel.de/its-baustelle/datendienste.html
not found

Now I am wondering why the page is still in the Solr index.
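As a workaround I could delete the stale document by hand. This is just a sketch, assuming the standard Solr XML update handler; the echo only shows the body that would be POSTed, and the curl lines would need the live Solr:

```shell
# Hypothetical manual cleanup for one stale document
# (host/port and id are from my setup).
SOLR="http://hrz-vm180:8983/solr"
ID="http://www.uni-kassel.de/its-baustelle/datendienste.html"
PAYLOAD="<delete><id>$ID</id></delete>"
echo "$PAYLOAD"   # the delete body that would be sent
# Against the live Solr, send the delete and then an explicit commit,
# since deletes only become visible to searches after a commit:
# curl "$SOLR/update" -H "Content-Type: text/xml" --data-binary "$PAYLOAD"
# curl "$SOLR/update?commit=true" -H "Content-Type: text/xml" --data-binary "<commit/>"
```

But even if that works, I would still like to understand why solrclean alone does not remove the page from the search results.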

Thank you
