On 11 November 2015 at 19:06, Peter Kraume <[email protected]> wrote:
> I have been crawling a German website with Nutch 1.8 into a Solr 4.8.0 index
> for a couple of weeks, and everything was fine with umlauts.
>
> Now many (but not all) documents in the Solr index have garbled
> umlauts. I’m not aware of any changes that have been made to the website
> (which uses UTF-8) or to the Nutch crawler settings. What puzzles me is that
> the umlauts are correct in some documents and broken in others.
>
> Do you have any hints on where I can start debugging this strange issue?

Are there examples of publicly available web pages where the umlauts are
processed correctly, and others where they are not?

Regards,
Gora
