Crap, you're right. I have a well-tested application that uses
UTF-8 everywhere possible, and I just tested it with some Russian
text. Solr's coughing up this exception:
Jul 18, 2006 6:00:05 PM org.apache.solr.core.SolrException log
SEVERE: java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.solr.search.QueryParsing.parseSort(QueryParsing.java:141)
    at org.apache.solr.request.StandardRequestHandler.handleRequest(StandardRequestHandler.java:96)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:592)
    at org.apache.solr.servlet.SolrServlet.doGet(SolrServlet.java:94)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:596)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
    at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
    at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:568)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1530)
    at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:633)
    at org.mortbay.http.HttpContext.handle(HttpContext.java:1482)
    at org.mortbay.http.HttpServer.service(HttpServer.java:909)
    at org.mortbay.http.HttpConnection.service(HttpConnection.java:820)
    at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:986)
    at org.mortbay.http.HttpConnection.handle(HttpConnection.java:837)
    at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:245)
    at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
    at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
You're going directly against Solr/Jetty, right? Not proxied or
mod_rewrite'd through Apache?
Solr isn't properly decoding the data being received by the servlet.
I think that I can fix this using some of the tricks that I've
learned in building my site. More later.
How much testing have people done using UTF-8 data on Solr?
phil.
On Jul 18, 2006, at 5:53 PM, Tricia Williams wrote:
Hi all,
I'm trying to adapt our old Cocoon/Lucene-based web search
application to one that is more Solr-ish. Our old web app could
handle queries containing Cyrillic characters. With the packaged
example admin interface, entering a query with a string of Cyrillic
characters causes a java.lang.ArrayIndexOutOfBoundsException. I've
also noted that the URL built from the search form is not UTF-8
encoded. If I manipulate the query string by hand and insert a UTF-8
encoded string in the q= parameter, the values are interpreted
incorrectly, so I cannot use that as a workaround. My sample query
is Канада (the English word _canada_ translated into Russian), or
%D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0 (UTF-8), or
%26%231050%3B%26%231072%3B%26%231085%3B%26%231072%3B%26%231076%3B%26%231072%3B
(Solr URL encoding).
I would appreciate any advice or suggestions that would allow me
to search for Cyrillic text in Solr. If anyone knows why Solr
produces this strange encoding, a brief explanation of what causes
the behaviour and what the encoding is (Unicode?) would be helpful.
If anyone else has forced Solr to accept UTF-8 encoded q= parameters
with success, I would love to know how you did it.
Thanks in advance!
Tricia
ps. I am using Mozilla Firefox as my main browser, which leads to
the behaviour I reported above. IE 6.0 works fine for Cyrillic,
although there is still a strange but different encoding
(%CA%E0%ED%E0%E4%E0 for the same query as before).
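
For anyone trying to identify the three encodings quoted in this
message, here is a minimal Java sketch that percent-decodes each
sample string. The class and method names are made up for
illustration and are not part of Solr. Reading the IE string as
windows-1251 is an assumption that happens to match the bytes, and
the Firefox string appears to decode to HTML numeric character
references (&#1050; and so on) rather than to raw characters. Those
references contain ';' characters, which may be related to the
parseSort frame in the stack trace, but that is a guess and not
something confirmed in this thread.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Illustrative only: class and method names are hypothetical, not part of Solr.
final class QueryEncodingSketch {

    // Percent-decode a query-string value, interpreting the bytes in the given charset.
    static String decode(String value, String charset) {
        try {
            return URLDecoder.decode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charset, e);
        }
    }

    // Percent-encode a value as UTF-8, as a client could do before building the q= parameter.
    static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // The three sample strings quoted in the message.
        String utf8 = "%D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0";
        String firefox = "%26%231050%3B%26%231072%3B%26%231085%3B"
                + "%26%231072%3B%26%231076%3B%26%231072%3B";
        String ie = "%CA%E0%ED%E0%E4%E0";

        // UTF-8 bytes decode to the Cyrillic word itself.
        System.out.println(decode(utf8, "UTF-8"));
        // The Firefox form decodes to ASCII text: HTML numeric character references.
        System.out.println(decode(firefox, "UTF-8"));
        // Assumption: the IE bytes are windows-1251 (a common Cyrillic code page).
        System.out.println(decode(ie, "windows-1251"));
        // Encoding the word as UTF-8 reproduces the first sample string.
        System.out.println(encodeUtf8("\u041A\u0430\u043D\u0430\u0434\u0430"));
    }
}
```

Whether the server then reads the q= bytes as UTF-8 depends on how
the servlet container is configured, which this sketch does not
cover.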
--
Whirlycott
Philip Jacob
[EMAIL PROTECTED]
http://www.whirlycott.com/phil/