Hi List,
I have a small test setup for working with Nutch and Solr. I wanted to
see whether it is possible to delete pages from the Solr index after
Nutch has fetched them with a 404.
As far as I know, there is a command "solrclean" which should handle
this task. It should go through the crawldb and delete all URLs that
are marked as gone.
But for some reason it doesn't work correctly in my case:
I did a crawl over a set of pages. All of them were fetchable
(200). The total count of URLs was 1999. I indexed them successfully
into Solr.
After that, I wanted to know what would happen after a recrawl if pages
disappear. So I logged into the CMS and deleted a category of pages.
After a recrawl, updatedb, invertlinks, etc., my crawldb looked like this:
root@hrz-vm180:/home/nutchServer/intranet_nutch/runtime/local/bin#
./nutch readdb crawl/crawldb/ -stats
CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls: 1999
retry 0: 1999
min score: 0.0
avg score: 0.04724062
max score: 7.296
status 3 (db_gone): 169
status 6 (db_notmodified): 1830
CrawlDb statistics: done
That's what I had expected: 169 pages are gone. Fine.
Next I ran solrclean and solrdedup.
root@hrz-vm180:/home/nutchServer/intranet_nutch/runtime/local/bin#
./nutch solrclean crawl/crawldb/ http://hrz-vm180:8983/solr
SolrClean: starting at 2011-07-18 15:04:51
SolrClean: deleting 169 documents
SolrClean: deleted a total of 169 documents
SolrClean: finished at 2011-07-18 15:04:54, elapsed: 00:00:02
So SolrClean says that it deleted all 169 documents that are gone.
But when I query for the word "Datendienste", Solr still responds
with pages that are actually gone. Here is an example:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">2</int>
<lst name="params">
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">id:"http://www.uni-kassel.de/its-baustelle/datendienste.html"</str>
<str name="version">2.2</str>
<str name="rows">10</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<float name="boost">1.7714221</float>
<str name="content">IT Servicezentrum: Datendienste (...)</str>
<long name="contentLength">3947</long>
<date name="date">2011-07-11T14:26:33.933Z</date>
<str name="digest">06c077b62a8012772e6365333c74312d</str>
<str name="id">http://www.uni-kassel.de/its-baustelle/datendienste.html</str>
<str name="segment">20110708154848</str>
<str name="title">IT Servicezentrum: Datendienste</str>
<date name="tstamp">2011-07-11T14:26:33.933Z</date>
<arr name="type"><str>text/html</str><str>text</str><str>html</str></arr>
<str name="url">http://www.uni-kassel.de/its-baustelle/datendienste.html</str>
</doc>
</result>
</response>
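(For reference, I issued that query against Solr's /select handler. The
snippet below only rebuilds the request URL from the parameters shown in
the response header above, so you can see exactly how the id value is
quoted and escaped; it does not contact the server.)

```python
from urllib.parse import urlencode

# Base URL as used in the solrclean command above; /select is Solr's
# standard query handler. The parameters mirror the response header.
base = "http://hrz-vm180:8983/solr/select"
params = {
    "q": 'id:"http://www.uni-kassel.de/its-baustelle/datendienste.html"',
    "start": "0",
    "rows": "10",
    "indent": "on",
}
# urlencode percent-escapes the colons, quotes and slashes in the id.
query_url = base + "?" + urlencode(params)
print(query_url)
```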
After that, I checked the URL
"http://www.uni-kassel.de/its-baustelle/datendienste.html" in the
crawldb and got:
./nutch readdb crawl/crawldb/ -url
http://www.uni-kassel.de/its-baustelle/datendienste.html
URL: http://www.uni-kassel.de/its-baustelle/datendienste.html
not found
Now I am wondering why the page is still in the Solr index.
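As a next debugging step, I suppose I could send the delete by hand to
see whether the document disappears then. The sketch below only builds
the XML body that one would POST to http://hrz-vm180:8983/solr/update
(followed by a commit); it does not actually contact the server, and I
am assuming the document id in Solr matches the URL shown above.

```python
from xml.sax.saxutils import escape

# Hypothetical manual delete-by-id, assuming the Solr document id
# equals the page URL (as in the query response above).
doc_id = "http://www.uni-kassel.de/its-baustelle/datendienste.html"

# Body for a POST to /solr/update; escape() guards against XML
# metacharacters in the id (none occur in this particular URL).
delete_body = "<delete><id>%s</id></delete>" % escape(doc_id)
commit_body = "<commit/>"
print(delete_body)
print(commit_body)
```

If a hand-issued delete plus commit makes the document vanish, that
would suggest the SolrClean run above deleted by a different id (or
never committed).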
Thank you