Hi Jigal,

are you indexing with the -deleteGone option?
  bin/nutch index ... -deleteGone

Purging 404s from the CrawlDb should be done only from time to time
to keep the CrawlDb small. Normally, 404s stay recorded there so
that they are not refetched frequently.
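
As a rough sketch, and assuming the property is overridden in
conf/nutch-site.xml (the file location is an assumption here, it is not
stated in your mail), leaving the purge switched off would look like this:

  <property>
    <name>db.update.purge.404</name>
    <value>false</value>
  </property>

With this value the 404 entries stay in the CrawlDb, so the indexing
step can still see that they are gone.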

> Another issue is that the title tag contents appear at the beginning of
> the "content" field before the actual page contents.

Yes, this is the case. In general, that is not a problem as long as "content"
is a pure search field and is not used as a display field. It's a known feature
request [1], so let's implement it now as a configurable option. If you have
time to work on it, that's fine. If not, I could get it done in the next few days.

Best,
Sebastian

[1] https://issues.apache.org/jira/browse/NUTCH-1749


On 09/13/2016 09:24 AM, Jigal van Hemert | alterNET internet BV wrote:
> Hi,
> 
> The daily indexing seems to be working so far (field "indexed" is updated),
> but pages that return a 404 are not removed from the Solr index. The
> content they return is also not included in the index. They just seem to be
> ignored.
> At first db.update.purge.404 was set to true, but upon reading a bit
> further on that setting it seemed to me that this would remove the pages
> from the Nutch db, essentially leaving them alone without updating the solr
> index. So I changed it to false, hoping that they would now be removed from
> the index. Alas, nothing changed.
> 
> Another issue is that the title tag contents appear at the beginning of
> the "content" field before the actual page contents. This looks a bit
> silly so I searched for a place where it might be configured. Nothing in
> schema.xml, schema-solr4.xml and solrindex-mapping.xml.
> Maybe I've overlooked something, but I couldn't find any setting that might
> explain this.
> Is there a way to remove the title tag contents from the "content" field?
> 
