I finally found the error. I was 100% sure that I had set 
parser.character.encoding.default to utf-8 in my nutch-site.xml, but the 
property was missing.
Setting it fixed my problem! I assume I accidentally deleted the setting 
while testing another feature.
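
For anyone who finds this thread later, the entry looks roughly like the 
sketch below. Only the property name and the utf-8 value come from my setup; 
the surrounding file is reduced to the bare minimum here, so merge it into 
your existing configuration rather than copying it wholesale. As far as I 
know, Nutch falls back to windows-1252 when this property is not overridden, 
which would explain why only some pages (presumably those where no charset 
could be detected) ended up with broken umlauts.

```xml
<!-- nutch-site.xml: only the relevant property is shown -->
<configuration>
  <property>
    <!-- fallback encoding used when a page's charset cannot be determined -->
    <name>parser.character.encoding.default</name>
    <value>utf-8</value>
  </property>
</configuration>
```

Pages that were already indexed with the wrong encoding presumably need to 
be re-parsed and re-indexed before the fix shows up in Solr.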

Thanks to Gora for his hints!

Cheers
Peter


> On 11.11.2015 at 14:57, Gora Mohanty <[email protected]> wrote:
> 
> On 11 November 2015 at 19:06, Peter Kraume <[email protected]> wrote:
>> I have been crawling a German website with Nutch 1.8 into a Solr 4.8.0 
>> index for a couple of weeks, and everything was fine with umlauts.
>> 
>> Now many (but not all) documents in the Solr index have garbled 
>> umlauts. I’m not aware of any changes that have been made to the website 
>> (which uses UTF-8) or to the Nutch crawler settings. What puzzles me is 
>> that the umlauts are correct in some documents and broken in others.
>> 
>> Do you have any hints for me where I can start debugging this strange issue?
> 
> Are there examples of publicly available web pages where the umlaut is
> correctly processed, and where it is not?
> 
> Regards,
> Gora
