I finally found the error. I was 100% sure that I had set parser.character.encoding.default to utf-8 in my nutch-site.xml, but the setting was missing. Adding it fixed my problem! I assume I deleted it by accident while testing another feature.
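For anyone who runs into the same thing, the entry looks roughly like this. This is a minimal sketch: it assumes the property goes inside the top-level <configuration> element of conf/nutch-site.xml, and the description text is my own wording, not copied from Nutch.

```xml
<!-- conf/nutch-site.xml (sketch; other properties omitted) -->
<configuration>
  <property>
    <name>parser.character.encoding.default</name>
    <!-- Fallback encoding used when a page's encoding cannot be determined otherwise -->
    <value>utf-8</value>
    <description>Fallback character encoding for parsed pages (description is illustrative).</description>
  </property>
</configuration>
```

As far as I know, without this override Nutch falls back to the value from nutch-default.xml (windows-1252), which would explain why only some documents were affected: pages whose encoding was detected some other way stayed fine, while the rest were decoded with the wrong fallback.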
Thanks to Gora for his hints!

Cheers
Peter

> On 11.11.2015 at 14:57, Gora Mohanty <[email protected]> wrote:
>
> On 11 November 2015 at 19:06, Peter Kraume <[email protected]> wrote:
>> I have been crawling a German website with Nutch 1.8 into a Solr 4.8.0 index
>> for a couple of weeks, and everything was fine with umlauts.
>>
>> Now I have a lot of (but not all) documents in the Solr index with garbled
>> umlauts. I'm not aware of any changes that have been made to the website
>> (which uses UTF-8) or to the Nutch crawler settings. What puzzles me is that
>> there are documents where the umlauts are correct and others where they
>> are broken.
>>
>> Do you have any hints for me where I can start debugging this strange issue?
>
> Are there examples of publicly available web pages where the umlaut is
> correctly processed, and where it is not?
>
> Regards,
> Gora

