A 404 URL should not disappear from the CrawlDB under any circumstances,
unless it is filtered out via URL filters. Can you check? Perhaps
something else is going on.
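
To rule the filters out, you can run the URL through the configured
filter chain. A quick sketch, assuming the URLFilterChecker tool that
ships with Nutch 1.x (it reads URLs from stdin and prints a + or -
verdict per URL):

echo "http://www.uni-kassel.de/its-baustelle/datendienste.html" | \
  ./nutch org.apache.nutch.net.URLFilterChecker -allCombined

A leading '-' in the output means one of the configured filters rejects
the URL.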

On Monday 18 July 2011 15:59:22 Marek Bachmann wrote:
> On 18.07.2011 15:43, Markus Jelsma wrote:
> > On Monday 18 July 2011 15:13:41 Marek Bachmann wrote:
> >> Hi List,
> >> 
> >> I have a small test set for working with Nutch and Solr. I wanted to
> >> see if it is possible to delete pages from the Solr index after Nutch
> >> has fetched them with a 404.
> >> 
> >> As far as I know, there is a command "solrclean" which should handle
> >> this task. It should go through the crawldb and delete all URLs that
> >> are marked as gone.
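> >> 
> >> If I understand it correctly, on the Solr side this boils down to a
> >> delete-by-id for each gone URL, followed by a commit. Something like
> >> this against the update handler (the document id here is made up):
> >> 
> >> curl 'http://hrz-vm180:8983/solr/update' -H 'Content-Type: text/xml' \
> >>      --data-binary '<delete><id>http://www.example.com/gone.html</id></delete>'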
> >> 
> >> But for some reason it doesn't work correctly in my case:
> >> 
> >> I made a crawl over a set of pages. All of them were fetchable
> >> (200). The total count of URLs was 1999. I indexed them successfully
> >> to Solr.
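> >> 
> >> For reference, the indexing call looked roughly like this (paths are
> >> from my layout; the exact arguments depend on the Nutch version):
> >> 
> >> ./nutch solrindex http://hrz-vm180:8983/solr crawl/crawldb crawl/linkdb crawl/segments/*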
> >> 
> >> After that, I wanted to know what would happen after a recrawl if
> >> pages disappear. So I logged into the CMS and deleted a category of
> >> pages.
> >> 
> >> After a recrawl, updatedb, invertlinks, etc., my crawldb looked like
> >> this:
> >> 
> >> root@hrz-vm180:/home/nutchServer/intranet_nutch/runtime/local/bin#
> >> ./nutch readdb crawl/crawldb/ -stats
> >> CrawlDb statistics start: crawl/crawldb/
> >> Statistics for CrawlDb: crawl/crawldb/
> >> TOTAL urls:        1999
> >> retry 0:   1999
> >> min score: 0.0
> >> avg score: 0.04724062
> >> max score: 7.296
> >> status 3 (db_gone):        169
> >> status 6 (db_notmodified): 1830
> >> CrawlDb statistics: done
> >> 
> >> That's what I had expected. 169 pages are gone. Fine.
> >> 
> >> Next I ran solrclean and solrdedup.
> >> 
> >> root@hrz-vm180:/home/nutchServer/intranet_nutch/runtime/local/bin#
> >> ./nutch solrclean crawl/crawldb/ http://hrz-vm180:8983/solr
> >> SolrClean: starting at 2011-07-18 15:04:51
> >> SolrClean: deleting 169 documents
> >> SolrClean: deleted a total of 169 documents
> >> SolrClean: finished at 2011-07-18 15:04:54, elapsed: 00:00:02
> >> 
> >> So SolrClean says that it deleted all 169 documents that are gone.
> >> 
> >> But when I query for the word "Datendienste", Solr still responds
> >> with pages that are actually gone. Here is an example:
> >> 
> >> <?xml version="1.0" encoding="UTF-8"?>
> >> <response>
> >>   <lst name="responseHeader">
> >>     <int name="status">0</int>
> >>     <int name="QTime">2</int>
> >>     <lst name="params">
> >>       <str name="indent">on</str>
> >>       <str name="start">0</str>
> >>       <str name="q">id:"http://www.uni-kassel.de/its-baustelle/datendienste.html"</str>
> >>       <str name="version">2.2</str>
> >>       <str name="rows">10</str>
> >>     </lst>
> >>   </lst>
> >>   <result name="response" numFound="1" start="0">
> >>     <doc>
> >>       <float name="boost">1.7714221</float>
> >>       <str name="content">IT Servicezentrum: Datendienste (...)</str>
> >>       <long name="contentLength">3947</long>
> >>       <date name="date">2011-07-11T14:26:33.933Z</date>
> >>       <str name="digest">06c077b62a8012772e6365333c74312d</str>
> >>       <str name="id">http://www.uni-kassel.de/its-baustelle/datendienste.html</str>
> >>       <str name="segment">20110708154848</str>
> >>       <str name="title">IT Servicezentrum: Datendienste</str>
> >>       <date name="tstamp">2011-07-11T14:26:33.933Z</date>
> >>       <arr name="type"><str>text/html</str><str>text</str><str>html</str></arr>
> >>       <str name="url">http://www.uni-kassel.de/its-baustelle/datendienste.html</str>
> >>     </doc>
> >>   </result>
> >> </response>
> >> 
> >> After that I checked the URL
> >> "http://www.uni-kassel.de/its-baustelle/datendienste.html" in the
> >> crawldb and got:
> >> 
> >> ./nutch readdb crawl/crawldb/ -url
> >> http://www.uni-kassel.de/its-baustelle/datendienste.html
> >> URL: http://www.uni-kassel.de/its-baustelle/datendienste.html
> >> not found
> > 
> > This means (IIRC) the URL is not in the CrawlDB and whatever is not in
> > the CrawlDB cannot be removed.
> 
> Argh, ok, I'll test that. I interpreted the "not found" as the HTTP
> status instead of "not found in the db"... :-/
> 
> ... You are right, I dumped the db and the URL really isn't there
> anymore... I guess for now I have to increase the number of retries for
> 404 pages.
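> 
> This is roughly how I checked, in case someone wants to reproduce it
> (the dump directory name is arbitrary):
> 
> ./nutch readdb crawl/crawldb -dump crawldb_dump
> grep -r "datendienste.html" crawldb_dump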
> 
> Does that mean that if I configure Nutch so that it deletes URLs from
> the db after a number of retries, it would no longer be possible to
> delete these pages automatically in Solr?
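> 
> The property I am thinking of is db.fetch.retry.max in nutch-site.xml;
> I am not sure it is really the right knob for 404s, and the value 10 is
> just an example:
> 
> <property>
>   <name>db.fetch.retry.max</name>
>   <value>10</value>
> </property>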
> 
> > Check the Solr log and see if it actually receives
> > the delete commands. Did you issue a commit as well?
> 
> The command with the list of 169 elements is sent to Solr, and after
> that it commits as well.
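> 
> To rule out a missing commit on the Solr side, I can also send one by
> hand and re-run the query:
> 
> curl 'http://hrz-vm180:8983/solr/update' -H 'Content-Type: text/xml' --data-binary '<commit/>'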
> 
> Thank you very much :-)
> 
> >> Now I am wondering why the page is still in the Solr index.
> >> 
> >> Thank you

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
