On Thu, Jul 30, 2009 at 6:34 PM, Bill Au<bill.w...@gmail.com> wrote:
> FYI, it took me a while to discover that SolrJ by default uses a GET request
> for query, which uses ISO-8859-1.

That depends on the servlet container.  SolrJ GET requests are sent in
UTF-8.  Some servlet containers such as Tomcat need extra
configuration to treat URLs as UTF-8 instead of latin-1, but the
standard http://www.ietf.org/rfc/rfc3986.txt clearly specifies UTF-8.

To test the servlet container configuration, check out
example/exampledocs/test_utf8.sh
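
For illustration only, here is a minimal Java sketch of what goes wrong when
the client sends a UTF-8 URL and the container decodes it as latin-1. It uses
only java.net.URLEncoder and java.net.URLDecoder; the class and method names
are made up for this example and it is not SolrJ or Tomcat code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class UrlCharsetDemo {
    // Percent-encode a query value using the named charset (client side).
    static String encode(String value, String charset) {
        try {
            return URLEncoder.encode(value, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Percent-decode a query value using the named charset,
    // standing in for what a servlet container does with the URL.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String term = "\u00B5Torrent";            // the word µTorrent
        String sent = encode(term, "UTF-8");      // what a UTF-8 client puts in the URL
        System.out.println(sent);
        // Container configured for UTF-8: the term round-trips.
        System.out.println(decode(sent, "UTF-8").equals(term));
        // Container assuming latin-1: an extra U+00C2 appears before the micro sign.
        System.out.println(decode(sent, "ISO-8859-1").equals("\u00C2\u00B5Torrent"));
    }
}
```

The term only survives when both sides agree on the charset, which is why the
container configuration matters even though the client sends UTF-8.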

-Yonik
http://www.lucidimagination.com

> I had to explicitly use a POST to do query in
> SolrJ in order to get it to use UTF-8.
>
> Bill
>
> On Tue, Jul 28, 2009 at 5:27 PM, Robert Muir <rcm...@gmail.com> wrote:
>
>> Bill, somewhere in the process I think you might be treating your
>> UTF-8 text as ISO-8859-1.
>>
>> Your character: 00B5 (µ)
>> Bits: 10110101
>>
>> UTF8-encoded: 11000010 10110101
>>
>> If you were to treat these bytes as ISO-8859-1 (i.e. reading from a
>> file or wrong url encoding) then it looks like:
>> 0xC2 (Â) followed by 0xB5 (µ)
>>
>>
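
The byte-level description above can be reproduced with a small Java sketch.
This is only an illustration using the standard java.nio.charset API; the
class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Render one byte as an 8-digit binary string.
    static String bits(byte b) {
        String s = Integer.toBinaryString(b & 0xFF);
        return "00000000".substring(s.length()) + s;
    }

    // Encode text as UTF-8, then wrongly decode those bytes as ISO-8859-1.
    static String misread(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String micro = "\u00B5"; // U+00B5, the micro sign
        // Print the bit pattern of each UTF-8 byte of the character.
        for (byte b : micro.getBytes(StandardCharsets.UTF_8)) {
            System.out.println(bits(b));
        }
        // One character becomes two when the bytes are read as latin-1.
        System.out.println(misread(micro).length());
    }
}
```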
>> On Tue, Jul 28, 2009 at 3:26 PM, Bill Au<bill.w...@gmail.com> wrote:
>> > I am using SolrJ to index the word µTorrent.  After a commit I was not able
>> > to query for it.  It turns out that the document in my Solr index contains
>> > the word ÂµTorrent instead of µTorrent.  Does anyone have any idea what's
>> > going on?
>> >
>> > Bill
>> >
>>
>>
>>
>> --
>> Robert Muir
>> rcm...@gmail.com
>>
>
