A 404 URL should not disappear from the CrawlDB unless it is filtered out via the URL filters. Can you check? Perhaps something else is going on.
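One way to check (a sketch: the `readdb -dump` command is standard Nutch, but the dump fragment below is simulated so the grep step can be shown without a running crawl):

```shell
# Dump the CrawlDB to plain text first, e.g.:
#   ./nutch readdb crawl/crawldb/ -dump crawldb_dump
# Simulated fragment of such a dump (what a surviving db_gone record looks like):
cat > crawldb_dump.txt <<'EOF'
http://www.uni-kassel.de/its-baustelle/datendienste.html Version: 7
Status: 3 (db_gone)
Fetch time: Mon Jul 18 15:04:51 CEST 2011
Retries since fetch: 0
EOF

# If the URL is still in the CrawlDB, grep prints its record;
# empty output means the entry was purged and solrclean cannot see it.
grep -A 1 'datendienste.html' crawldb_dump.txt
```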
On Monday 18 July 2011 15:59:22 Marek Bachmann wrote:
> On 18.07.2011 15:43, Markus Jelsma wrote:
> > On Monday 18 July 2011 15:13:41 Marek Bachmann wrote:
> >> Hi List,
> >>
> >> I have a small test set for working with Nutch and Solr. I wanted to
> >> see if it is possible to delete pages from the Solr index after Nutch
> >> has fetched them with a 404.
> >>
> >> As far as I know, there is a command "solrclean" which should handle
> >> this task. It should go through the CrawlDB and delete all URLs that
> >> are marked as gone.
> >>
> >> But for some reason it doesn't work right in my case:
> >>
> >> I made a crawl over a set of pages. All of them were fetchable (200).
> >> The total count of URLs was 1999. I indexed them successfully into
> >> Solr.
> >>
> >> After that, I wanted to know what would happen after a recrawl if
> >> pages disappear. So I logged into the CMS and deleted a category of
> >> pages.
> >>
> >> After a recrawl, updatedb, invertlinks etc., my CrawlDB looked like
> >> this:
> >>
> >> root@hrz-vm180:/home/nutchServer/intranet_nutch/runtime/local/bin#
> >> ./nutch readdb crawl/crawldb/ -stats
> >> CrawlDb statistics start: crawl/crawldb/
> >> Statistics for CrawlDb: crawl/crawldb/
> >> TOTAL urls: 1999
> >> retry 0: 1999
> >> min score: 0.0
> >> avg score: 0.04724062
> >> max score: 7.296
> >> status 3 (db_gone): 169
> >> status 6 (db_notmodified): 1830
> >> CrawlDb statistics: done
> >>
> >> That's what I had expected. 169 pages are gone. Fine.
> >>
> >> Next I ran solrclean and solrdedup:
> >>
> >> root@hrz-vm180:/home/nutchServer/intranet_nutch/runtime/local/bin#
> >> ./nutch solrclean crawl/crawldb/ http://hrz-vm180:8983/solr
> >> SolrClean: starting at 2011-07-18 15:04:51
> >> SolrClean: deleting 169 documents
> >> SolrClean: deleted a total of 169 documents
> >> SolrClean: finished at 2011-07-18 15:04:54, elapsed: 00:00:02
> >>
> >> So SolrClean says that it deleted all of the 169 documents that are
> >> gone.
> >> But when I query for the word "Datendienste", Solr still responds
> >> with pages that are actually gone. An example:
> >>
> >> <?xml version="1.0" encoding="UTF-8"?>
> >> <response>
> >>   <lst name="responseHeader">
> >>     <int name="status">0</int>
> >>     <int name="QTime">2</int>
> >>     <lst name="params">
> >>       <str name="indent">on</str>
> >>       <str name="start">0</str>
> >>       <str name="q">id:"http://www.uni-kassel.de/its-baustelle/datendienste.html"</str>
> >>       <str name="version">2.2</str>
> >>       <str name="rows">10</str>
> >>     </lst>
> >>   </lst>
> >>   <result name="response" numFound="1" start="0">
> >>     <doc>
> >>       <float name="boost">1.7714221</float>
> >>       <str name="content">IT Servicezentrum: Datendienste (...)</str>
> >>       <long name="contentLength">3947</long>
> >>       <date name="date">2011-07-11T14:26:33.933Z</date>
> >>       <str name="digest">06c077b62a8012772e6365333c74312d</str>
> >>       <str name="id">http://www.uni-kassel.de/its-baustelle/datendienste.html</str>
> >>       <str name="segment">20110708154848</str>
> >>       <str name="title">IT Servicezentrum: Datendienste</str>
> >>       <date name="tstamp">2011-07-11T14:26:33.933Z</date>
> >>       <arr name="type"><str>text/html</str><str>text</str><str>html</str></arr>
> >>       <str name="url">http://www.uni-kassel.de/its-baustelle/datendienste.html</str>
> >>     </doc>
> >>   </result>
> >> </response>
> >>
> >> After that I checked the URL
> >> "http://www.uni-kassel.de/its-baustelle/datendienste.html" in the
> >> CrawlDB and got:
> >>
> >> ./nutch readdb crawl/crawldb/ -url
> >> http://www.uni-kassel.de/its-baustelle/datendienste.html
> >> URL: http://www.uni-kassel.de/its-baustelle/datendienste.html
> >> not found
> >
> > This means (IIRC) the URL is not in the CrawlDB, and whatever is not
> > in the CrawlDB cannot be removed.
>
> Argh, ok, I'll test that.
> I interpreted the "not found" as the HTTP status instead of "not found
> in the db"... :-/
>
> ... You are right, I dumped the db and the URL really isn't there
> anymore... I guess for now I have to increase the number of retries for
> 404 pages.
>
> That means, if I configure Nutch so that it deletes URLs from the db
> after a number of retries, it would no longer be possible to delete
> these pages automatically in Solr?

> > Check the Solr log and see if it actually receives the delete
> > commands. Did you issue a commit as well?
>
> The command with the list of 169 elements is sent to Solr, and after
> that it commits as well.
>
> Thank you very much :-)

> >> Now I am wondering why the page is still in the Solr index.
> >>
> >> Thank you

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
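For reference on the commit question above: deleting a document from Solr by hand uses the standard XML update syntax, and the same kind of messages should show up in the Solr log when solrclean runs. This is a sketch of the generic Solr update format, not necessarily solrclean's exact payload; the id value is the one from this thread:

```xml
<!-- POST to http://hrz-vm180:8983/solr/update -->
<delete>
  <id>http://www.uni-kassel.de/its-baustelle/datendienste.html</id>
</delete>

<!-- a separate commit makes the deletion visible to searches -->
<commit/>
```

Without the commit, deleted documents keep showing up in query results, which is why it is worth checking the Solr log for both messages.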

