On 18.07.2011 16:04, Markus Jelsma wrote:
A 404 URL should not disappear from the CrawlDB at all, unless it is
filtered out via URL filters. Can you check? Perhaps something else is going on.

I'll start the process over again :)


On Monday 18 July 2011 15:59:22 Marek Bachmann wrote:
On 18.07.2011 15:43, Markus Jelsma wrote:
On Monday 18 July 2011 15:13:41 Marek Bachmann wrote:
Hi List,

I have a small test set for working with Nutch and Solr. I wanted to
see if it is possible to delete pages from the Solr index after Nutch
has fetched them with a 404.

As far as I know, there is a command "solrclean" which should handle
this task. It should go through the crawldb and delete all URLs that are
marked as gone.

But for some reason it doesn't work right in my case:

I made a crawl over a set of pages. All of them were fetchable
(200). The total count of URLs was 1999. I indexed them successfully
into Solr.

After that, I wanted to know what happens on a recrawl when pages
disappear, so I logged into the CMS and deleted a category of pages.

After a recrawl, updatedb, invertlinks, etc., my CrawlDB looked like this:

root@hrz-vm180:/home/nutchServer/intranet_nutch/runtime/local/bin#
./nutch readdb crawl/crawldb/ -stats
CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls:     1999
retry 0:        1999
min score:      0.0
avg score:      0.04724062
max score:      7.296
status 3 (db_gone):     169
status 6 (db_notmodified):      1830
CrawlDb statistics: done

That's what I had expected. 169 pages are gone. Fine.

Next I ran solrclean and solrdedup.

root@hrz-vm180:/home/nutchServer/intranet_nutch/runtime/local/bin#
./nutch solrclean crawl/crawldb/ http://hrz-vm180:8983/solr
SolrClean: starting at 2011-07-18 15:04:51
SolrClean: deleting 169 documents
SolrClean: deleted a total of 169 documents
SolrClean: finished at 2011-07-18 15:04:54, elapsed: 00:00:02

So SolrClean says that it deleted all 169 documents that are
gone.
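
(A quick, hedged way to sanity-check the index size around such a run is a
rows=0 match-all query against the Solr host from above; numFound should
drop by 169 once the deletes are committed:)

# count all documents in the index; only numFound in the header matters here
curl -G 'http://hrz-vm180:8983/solr/select' \
  --data-urlencode 'q=*:*' --data-urlencode 'rows=0'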

But when I query for the word "Datendienste", Solr still responds
with pages that are actually gone. Here is an example:

<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="start">0</str>
      <str name="q">id:"http://www.uni-kassel.de/its-baustelle/datendienste.html"</str>
      <str name="version">2.2</str>
      <str name="rows">10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <float name="boost">1.7714221</float>
      <str name="content">IT Servicezentrum: Datendienste (...)</str>
      <long name="contentLength">3947</long>
      <date name="date">2011-07-11T14:26:33.933Z</date>
      <str name="digest">06c077b62a8012772e6365333c74312d</str>
      <str name="id">http://www.uni-kassel.de/its-baustelle/datendienste.html</str>
      <str name="segment">20110708154848</str>
      <str name="title">IT Servicezentrum: Datendienste</str>
      <date name="tstamp">2011-07-11T14:26:33.933Z</date>
      <arr name="type"><str>text/html</str><str>text</str><str>html</str></arr>
      <str name="url">http://www.uni-kassel.de/its-baustelle/datendienste.html</str>
    </doc>
  </result>
</response>
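
(For reference, the same lookup can be reproduced from the shell; a hedged
sketch in which curl takes care of URL-encoding the quoted id:)

# query the document by id, mirroring the parameters in the response above
curl -G 'http://hrz-vm180:8983/solr/select' \
  --data-urlencode 'q=id:"http://www.uni-kassel.de/its-baustelle/datendienste.html"' \
  --data-urlencode 'indent=on' --data-urlencode 'rows=10'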

After that I checked the URL
"http://www.uni-kassel.de/its-baustelle/datendienste.html" in the
CrawlDB and got:

./nutch readdb crawl/crawldb/ -url
http://www.uni-kassel.de/its-baustelle/datendienste.html
URL: http://www.uni-kassel.de/its-baustelle/datendienste.html
not found

This means (IIRC) the URL is not in the CrawlDB and whatever is not in
the CrawlDB cannot be removed.

Argh, ok, I'll test that. I interpreted the "not found" as the HTTP
status instead of "not found in the db"... :-/

... You are right, I dumped the db and the URL really isn't there
anymore... I guess for now I have to increase the number of retries for
404 pages.
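
(The dump check was along these lines; a hedged sketch, where the output
directory name is arbitrary:)

# dump the CrawlDB as plain text and grep for the URL in question;
# no match confirms it is really gone from the CrawlDB
./nutch readdb crawl/crawldb/ -dump crawldb_dump
grep -r 'its-baustelle/datendienste' crawldb_dump/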

Does that mean that if I configure Nutch to delete URLs from the db
after a number of retries, it would no longer be possible to delete
these pages from Solr automatically?
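
(If the retry ceiling is indeed the relevant knob here, which is an
assumption, the shipped property should be db.fetch.retry.max; a hedged way
to inspect it, overriding the value in conf/nutch-site.xml rather than
editing the default:)

# show the property and its documentation in the stock configuration
grep -B 1 -A 4 'db.fetch.retry.max' conf/nutch-default.xml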

Check the Solr log and see if it actually receives
the delete commands. Did you issue a commit as well?

The command with the list of 169 elements is sent to Solr, and after
that it commits as well.

Thank you very much :-)

Now I am wondering why the page is still in the Solr index.
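
(One hedged way to narrow this down: send the delete and the commit by hand
to the Solr XML update handler, with the id from above, and watch the Solr
log while doing so:)

# delete the document by id, then commit, via the XML update handler
curl 'http://hrz-vm180:8983/solr/update' -H 'Content-Type: text/xml' \
  --data-binary '<delete><id>http://www.uni-kassel.de/its-baustelle/datendienste.html</id></delete>'
curl 'http://hrz-vm180:8983/solr/update' -H 'Content-Type: text/xml' \
  --data-binary '<commit/>'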

Thank you

