Merlin

Just to make sure I understand what is going on here: you are getting searches from external crawlers. These are coming in the form of HTTP requests, I assume?

Have you checked the encoding specified in these requests (in the Content-Type header)? If the encoding is not specified, then iso-8859-1 is usually assumed. Also, have you checked the default encoding of your container? If you are using Tomcat, that is set with the URIEncoding attribute, for example:

    <Connector address="localhost" port="8000" protocol="HTTP/1.1"
               connectionTimeout="20000" URIEncoding="UTF-8" />

François

On Aug 28, 2011, at 3:10 PM, Merlin Morgenstern wrote:

> I double checked all code on that page and it looks like everything is in
> utf-8 and works just perfect. The problematic URLs are always called by bots
> like Googlebot. Looks like they are operating with a different encoding.
> The page itself has a utf-8 meta tag.
>
> So it looks like I have to find a way to check the encoding and convert it
> appropriately. This should be a common Solr problem if all search
> engines treat utf-8 that way, right?
>
> Any ideas how to fix that? Is there maybe a special Solr functionality for
> this?
>
> 2011/8/27 François Schiettecatte <fschietteca...@gmail.com>
>
>> Merlin
>>
>> Ü encodes to two bytes in utf-8 (C3 9C), and one in iso-8859-1 (%DC), so
>> it looks like there is a charset mismatch somewhere.
>>
>> Cheers
>>
>> François
>>
>> On Aug 27, 2011, at 6:34 AM, Merlin Morgenstern wrote:
>>
>>> Hello,
>>>
>>> I am having problems with searches that are issued from spiders that
>>> contain the ASCII encoded character "ü"
>>>
>>> For example in: "Übersetzung"
>>>
>>> The Solr log shows the following query request: /suche/%DCbersetzung
>>> which has been translated into the Solr query: q=?ersetzung
>>>
>>> If you enter the search term directly as a user into the search box, it
>>> will result in:
>>> /suche/Übersetzung which returns perfect results.
>>>
>>> I am decoding the URL within PHP: $term = trim(urldecode($q));
>>>
>>> Somehow urldecode() translates the character Ü (%DC) into a ?, which is an
>>> illegal first character in Solr.
>>>
>>> I tried it without urldecode(), with rawurldecode() and with utf8_decode(),
>>> but none of those helped.
>>>
>>> Thank you for any help or hint on how to solve that problem.
>>>
>>> Regards, Merlin
>>
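
For the charset mismatch discussed in this thread, one possible workaround on the PHP side is to decode the percent-escapes to raw bytes and only then decide which charset they are in. The following is a minimal sketch, not something taken from the thread: the function name normalize_search_term() is hypothetical, it assumes the mbstring extension is available, and it assumes that any term that is not valid UTF-8 was sent as ISO-8859-1 (which is what a lone %DC for "Ü" suggests, but a crawler could in principle use another charset).

```php
<?php
// Hypothetical helper (not from the thread): decode a percent-encoded
// search term and make sure the result is valid UTF-8 before it is
// passed on to Solr.
function normalize_search_term(string $q): string
{
    // rawurldecode() only turns %XX escapes into raw bytes;
    // it performs no charset conversion.
    $term = rawurldecode($q);

    // Assumption: bytes that are not valid UTF-8 are ISO-8859-1.
    if (!mb_check_encoding($term, 'UTF-8')) {
        $term = mb_convert_encoding($term, 'UTF-8', 'ISO-8859-1');
    }

    return trim($term);
}

// The crawler-style request from the log (ISO-8859-1 escape).
echo normalize_search_term('%DCbersetzung'), "\n";
// The same word with a UTF-8 escape is left unchanged after decoding.
echo normalize_search_term('%C3%9Cbersetzung'), "\n";
```

Both calls should yield the same UTF-8 string, so the Solr query no longer starts with an undecodable byte. Checking validity first, rather than always converting, avoids double-encoding terms that already arrive as UTF-8.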