Hi Jigal,

are you indexing with the -deleteGone option, i.e. bin/nutch index ... -deleteGone? That option tells the indexer to delete gone pages (404s) from the index.

Purging 404s from the CrawlDb should be done only from time to time, to keep the CrawlDb small. Normally, 404s are recorded so that they are not refetched frequently.

> Another issue is that the title tag contents appears at the beginning of
> the "content" field before the actual page contents.

Yes, this is the case. In general, that is not wrong if "content" is a pure search field and is not used as a display field. It's a known feature request [1], so let's implement it now as a configurable option. If you have time to work on it, that's fine. If not, I could get it done in the next few days.

Best,
Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-1749

On 09/13/2016 09:24 AM, Jigal van Hemert | alterNET internet BV wrote:
> Hi,
>
> The daily indexing seems to be working so far (field "indexed" is updated),
> but pages that return a 404 are not removed from the solr index. The
> content they return is also not included in the index. They just seem to be
> ignored.
> At first db.update.purge.404 was set to true, but upon reading a bit
> further on that setting it seemed to me that this would remove the pages
> from the Nutch db, essentially leaving them alone without updating the solr
> index. So I changed it to false, hoping that they would now be removed from
> the index. Alas, nothing changed.
>
> Another issue is that the title tag contents appears at the beginning of
> the "content" field before the actual page contents. This looks a bit
> silly so I searched for a place where it might be configured. Nothing in
> schema.xml, schema-solr4.xml and solrindex-mapping.xml.
> Maybe I've overlooked something, but I couldn't find any setting that might
> explain this.
> Is there a way to remove the title tag contents from the "content" field?
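
For reference, here is a minimal sketch of the setting discussed in this thread. It assumes the override is placed in conf/nutch-site.xml (the usual location for local overrides; the exact file on your installation is an assumption on my part) and that other properties are left at their defaults. With the value false, 404 records stay in the CrawlDb, which is the normal mode of operation described above.

```xml
<!-- Sketch only: goes inside the <configuration> element of nutch-site.xml
     (assumed location); all other properties are omitted here. -->
<property>
  <name>db.update.purge.404</name>
  <!-- false: keep 404 (gone) records in the CrawlDb so they are not
       refetched frequently; set to true only for an occasional cleanup. -->
  <value>false</value>
</property>
```

This setting only affects the CrawlDb. Removing the documents from the Solr index is a separate matter, handled at indexing time via -deleteGone.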

