Hi List,
I have a small test setup for working with Nutch and Solr. I wanted to
see whether it is possible to delete pages from the Solr index after
Nutch has fetched them with a 404.
As far as I know, there is a command "solrclean" which should handle
this task. It should go through the crawldb and delete all URLs that
are marked as gone.
But for some reason it doesn't work correctly in my case:
I did a crawl over a set of pages. All of them were fetchable
(200). The total count of URLs was 1999. I indexed them successfully
into Solr.
After that, I wanted to know what would happen after a recrawl if pages
disappear. So I logged into the CMS and deleted a category of pages.
After a recrawl, updatedb, invertlinks, etc., my crawldb looked like this:
root@hrz-vm180:/home/nutchServer/intranet_nutch/runtime/local/bin#
./nutch readdb crawl/crawldb/ -stats
CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls: 1999
retry 0: 1999
min score: 0.0
avg score: 0.04724062
max score: 7.296
status 3 (db_gone): 169
status 6 (db_notmodified): 1830
CrawlDb statistics: done
That's what I had expected: 169 pages are gone. Fine.
Next I ran solrclean and solrdedup.
root@hrz-vm180:/home/nutchServer/intranet_nutch/runtime/local/bin#
./nutch solrclean crawl/crawldb/ http://hrz-vm180:8983/solr
SolrClean: starting at 2011-07-18 15:04:51
SolrClean: deleting 169 documents
SolrClean: deleted a total of 169 documents
SolrClean: finished at 2011-07-18 15:04:54, elapsed: 00:00:02
So SolrClean says that it deleted all 169 documents that are gone.
But when I query for the word "Datendienste", Solr still responds
with pages that are actually gone. Here is an example:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">2</int>
<lst name="params">
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">id:"http://www.uni-kassel.de/its-baustelle/datendienste.html"</str>
<str name="version">2.2</str>
<str name="rows">10</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<float name="boost">1.7714221</float>
<str name="content">IT Servicezentrum: Datendienste (...)</str>
<long name="contentLength">3947</long>
<date name="date">2011-07-11T14:26:33.933Z</date>
<str name="digest">06c077b62a8012772e6365333c74312d</str>
<str name="id">http://www.uni-kassel.de/its-baustelle/datendienste.html</str>
<str name="segment">20110708154848</str>
<str name="title">IT Servicezentrum: Datendienste</str>
<date name="tstamp">2011-07-11T14:26:33.933Z</date>
<arr name="type"><str>text/html</str><str>text</str><str>html</str></arr>
<str name="url">http://www.uni-kassel.de/its-baustelle/datendienste.html</str>
</doc>
</result>
</response>
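(For reference, I issued that query against Solr's /select handler. The
snippet below only rebuilds the request URL from the parameters shown in
the response header above, so you can see exactly how the id value is
quoted and escaped; it does not contact the server.)

```python
from urllib.parse import urlencode

# Base URL as used in the solrclean command above; /select is Solr's
# standard query handler. The parameters mirror the response header.
base = "http://hrz-vm180:8983/solr/select"
params = {
    "q": 'id:"http://www.uni-kassel.de/its-baustelle/datendienste.html"',
    "start": "0",
    "rows": "10",
    "indent": "on",
}
# urlencode percent-escapes the colons, quotes and slashes in the id.
query_url = base + "?" + urlencode(params)
print(query_url)
```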
After that, I checked the URL
"http://www.uni-kassel.de/its-baustelle/datendienste.html" in the
crawldb and got:
./nutch readdb crawl/crawldb/ -url
http://www.uni-kassel.de/its-baustelle/datendienste.html
URL: http://www.uni-kassel.de/its-baustelle/datendienste.html
not found
Now I am wondering why the page is still in the Solr index.
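As a next debugging step, I suppose I could send the delete by hand to
see whether the document disappears then. The sketch below only builds
the XML body that one would POST to http://hrz-vm180:8983/solr/update
(followed by a commit); it does not actually contact the server, and I
am assuming the document id in Solr matches the URL shown above.

```python
from xml.sax.saxutils import escape

# Hypothetical manual delete-by-id, assuming the Solr document id
# equals the page URL (as in the query response above).
doc_id = "http://www.uni-kassel.de/its-baustelle/datendienste.html"

# Body for a POST to /solr/update; escape() guards against XML
# metacharacters in the id (none occur in this particular URL).
delete_body = "<delete><id>%s</id></delete>" % escape(doc_id)
commit_body = "<commit/>"
print(delete_body)
print(commit_body)
```

If a hand-issued delete plus commit makes the document vanish, that
would suggest the SolrClean run above deleted by a different id (or
never committed).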
Thank you