Hi Jigal,

>> are you indexing with
>>   bin/nutch index ... -deleteGone
>
> No, I'm using:
>
> bin/crawl urls/[projectname] crawls/[projectname]
> http://solr_server.tld/solr/[projectname] 2

Ok, understood. In bin/crawl, deletion of 404s is done by first calling
  bin/nutch index ...
and then
  bin/nutch clean ...
This should have the same effect as indexing with -deleteGone.

If you are using Nutch 1.12, also have a look at this bug, which could be
the reason for your problem:
  https://issues.apache.org/jira/browse/NUTCH-2269
Do you see similar errors in the logs?

>>> Another issue is that the title tag contents appears at the beginning of
>>> the "content" field before the actual page contents.
>>
> Good to know that I didn't miss a setting :-)
> Unfortunately I have zero knowledge about Java coding (I'm a PHP guy who
> spends a lot of free time on the FOSS project TYPO3).
>
> For the time being I can report back that it's hardcoded and that it can't
> be configured. Thanks for that information (really; no sarcasm)!

Ok, I hope to get it addressed soon.

Cheers,
Sebastian

On 09/14/2016 09:51 AM, Jigal van Hemert | alterNET internet BV wrote:
> Hi Sebastian,
>
> Thanks for the reply.
>
> On 13 September 2016 at 17:14, Sebastian Nagel <[email protected]>
> wrote:
>
>> are you indexing with
>>   bin/nutch index ... -deleteGone
>>
>
> No, I'm using:
>
> bin/crawl urls/[projectname] crawls/[projectname]
> http://solr_server.tld/solr/[projectname] 2
>
>
>> Purging 404s from CrawlDb should be done only from time to time
>> to keep the CrawlDb small. Normally, 404s are recorded to avoid
>> that they are refetched frequently.
>>
>
> I'm not too concerned about 404s in CrawlDb, but about the fact that they
> are not removed from the solr index.
> It's only a few hundred URLs that need to be indexed and even if it were
> thousands of 404 items it would not be a problem for a looooong time :-)
>
>
>>
>>> Another issue is that the title tag contents appears at the beginning of
>>> the "content" field before the actual page contents.
>>
>> Yes, this is the case. In general, it's not wrong if "content" is a pure
>> search field and not used as display field. It's a known feature request
>> [1], so let's implement it now as a configurable option. If you have time
>> to work on it that's fine. If not I could get it done the next days.
>>
>
> Good to know that I didn't miss a setting :-)
> Unfortunately I have zero knowledge about Java coding (I'm a PHP guy who
> spends a lot of free time on the FOSS project TYPO3).
>
> For the time being I can report back that it's hardcoded and that it can't
> be configured. Thanks for that information (really; no sarcasm)!
>

