On Thu, Jul 30, 2009 at 6:34 PM, Bill Au<bill.w...@gmail.com> wrote:
> FYI, it took me a while to discover that SolrJ by default uses a GET request
> for query, which uses ISO-8859-1.

That depends on the servlet container. SolrJ GET requests are sent in UTF-8.
Some servlet containers such as Tomcat need extra configuration to treat URLs
as UTF-8 instead of latin-1, but the standard
http://www.ietf.org/rfc/rfc3986.txt clearly specifies UTF-8.

To test the servlet container configuration, check out
example/exampledocs/test_utf8.sh

-Yonik
http://www.lucidimagination.com

> I had to explicitly use a POST to do query in
> SolrJ in order to get it to use UTF-8.
>
> Bill
>
> On Tue, Jul 28, 2009 at 5:27 PM, Robert Muir <rcm...@gmail.com> wrote:
>
>> Bill, somewhere in the process I think you might be treating your
>> UTF-8 text as ISO-8859-1.
>>
>> Your character: 00B5 (µ)
>> Bits: 10110101
>>
>> UTF8-encoded: 11000010 10110101
>>
>> If you were to treat these bytes as ISO-8859-1 (e.g. reading from a
>> file or wrong url encoding) then it looks like:
>> 0xC2 (Â) followed by 0xB5 (µ)
>>
>>
>> On Tue, Jul 28, 2009 at 3:26 PM, Bill Au<bill.w...@gmail.com> wrote:
>> > I am using SolrJ to index the word µTorrent. After a commit I was not
>> > able to query for it. It turns out that the document in my Solr index
>> > contains the word ÂµTorrent instead of µTorrent. Does anyone have any
>> > idea what's going on???
>> >
>> > Bill
>> >
>>
>> --
>> Robert Muir
>> rcm...@gmail.com
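
For anyone who wants to see the mix-up from this thread at the byte level, it can be reproduced with the JDK alone. The sketch below is not SolrJ code and does not talk to a servlet container. The class and method names (EncodingDemo, misdecode, encodeUtf8, decodeLatin1) are made up for illustration. It assumes the client percent-encodes the query value as UTF-8 and the container decodes the URL as ISO-8859-1, which is the situation described above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {

    // Encode text as UTF-8, then (wrongly) read the same bytes back as ISO-8859-1.
    public static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Percent-encode a query value the way a UTF-8 client would.
    public static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Decode a percent-encoded value the way a latin-1 configured container would.
    public static String decodeLatin1(String encoded) {
        try {
            return URLDecoder.decode(encoded, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // ISO-8859-1 is always available
        }
    }

    public static void main(String[] args) {
        String word = "\u00B5Torrent"; // U+00B5 MICRO SIGN followed by "Torrent"

        String encoded = encodeUtf8(word);
        System.out.println(encoded); // %C2%B5Torrent

        // The latin-1 decode yields two characters (U+00C2, U+00B5) for the one sent.
        String garbled = decodeLatin1(encoded);
        System.out.println(garbled.length() - word.length()); // 1

        // Same result as misreading the raw UTF-8 bytes directly.
        System.out.println(garbled.equals(misdecode(word))); // true
    }
}
```

The extra character is U+00C2, the latin-1 reading of the UTF-8 lead byte 0xC2. On Tomcat, the extra configuration mentioned above is typically the URIEncoding="UTF-8" attribute on the Connector element in server.xml; check the documentation for your container version before relying on that.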