Hi Jigal,

>> are you indexing with
>>   bin/nutch index ... -deleteGone
>>
>
> No, I'm using:
>
> bin/crawl urls/[projectname] crawls/[projectname]
> http://solr_server.tld/solr/[projectname] 2

Ok, understood. In bin/crawl, 404s are deleted
by first calling
  bin/nutch index ...
and then
  bin/nutch clean ...

This should have the same effect as indexing with -deleteGone.
If you are using Nutch 1.12, also have a look at this bug, which
could be the reason for your problem:
  https://issues.apache.org/jira/browse/NUTCH-2269
Do you see similar errors in the logs?
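
One way to check, as a rough sketch: the shell function below prints error
lines from a Nutch log that mention the clean, index, or Solr steps. The
function name find_clean_errors and the default path logs/hadoop.log are
assumptions for illustration (adjust the path to wherever your installation
writes its log), and the patterns are generic, not the exact message from
NUTCH-2269.

```shell
#!/bin/sh
# Print ERROR/Exception lines from a Nutch log that mention the
# clean, index, or Solr steps, prefixed with their line numbers.
# Assumption: the log is at logs/hadoop.log unless a path is given.
find_clean_errors() {
    log_file="${1:-logs/hadoop.log}"
    if [ ! -r "$log_file" ]; then
        echo "cannot read $log_file" >&2
        return 2
    fi
    # Keep lines that look like errors, then narrow to the relevant steps.
    grep -n -E 'ERROR|Exception' "$log_file" | grep -i -E 'clean|index|solr'
}
```

Run it from the Nutch runtime directory as find_clean_errors, or pass
another log path as the first argument.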

>>> Another issue is that the title tag contents appears at the beginning of
>>> the "content" field before the actual page contents.
>>
> Good to know that I didn't miss a setting :-)
> Unfortunately I have zero knowledge about Java coding (I'm a PHP guy who
> spends a lot of free time on the FOSS project TYPO3).
>
> For the time being I can report back that it's hardcoded and that it can't
> be configured. Thanks for that information (really; no sarcasm)!
>

Ok, I hope to get it addressed soon.

Cheers,
Sebastian

On 09/14/2016 09:51 AM, Jigal van Hemert | alterNET internet BV wrote:
> Hi Sebastian,
> 
> Thanks for the reply.
> 
> On 13 September 2016 at 17:14, Sebastian Nagel <[email protected]>
> wrote:
> 
>> are you indexing with
>>   bin/nutch index ... -deleteGone
>>
> 
> No, I'm using:
> 
> bin/crawl urls/[projectname] crawls/[projectname]
> http://solr_server.tld/solr/[projectname] 2
> 
> 
>> Purging 404s from CrawlDb should be done only from time to time
>> to keep the CrawlDb small. Normally, 404s are recorded so that
>> they are not refetched frequently.
>>
> 
> I'm not too concerned about 404s in CrawlDb, but about the fact that they
> are not removed from the solr index.
> It's only a few hundred URLs that need to be indexed and even if it were
> thousands of 404 items it would not be a problem for a looooong time :-)
> 
> 
>>
>>> Another issue is that the title tag contents appears at the beginning of
>>> the "content" field before the actual page contents.
>>
>> Yes, this is the case. In general, it's not wrong if "content" is a pure
>> search field and not used as display field. It's a known feature request
>> [1],
>> so let's implement it now as a configurable option. If you have time
>> to work on it, that's fine. If not, I could get it done in the next few days.
>>
> 
> Good to know that I didn't miss a setting :-)
> Unfortunately I have zero knowledge about Java coding (I'm a PHP guy who
> spends a lot of free time on the FOSS project TYPO3).
> 
> For the time being I can report back that it's hardcoded and that it can't
> be configured. Thanks for that information (really; no sarcasm)!
> 
> 
