Re: µTorrent indexed as ÂµTorrent

2009-07-30 Thread Bill Au
Thanks, Robert.  That's exactly what my problem was.  Things work fine now
that I make sure all my processing (indexing and querying) uses UTF-8.  FYI,
it took me a while to discover that SolrJ by default uses a GET request for
queries, which uses ISO-8859-1.  I had to explicitly use a POST for queries
in SolrJ in order to get it to use UTF-8.

Bill

On Tue, Jul 28, 2009 at 5:27 PM, Robert Muir rcm...@gmail.com wrote:

 Bill, somewhere in the process I think you might be treating your
 UTF-8 text as ISO-8859-1.

 Your character: 00B5 (µ)
 Bits: 10110101

 UTF8-encoded: 11000010 10110101

 If you were to treat these bytes as ISO-8859-1 (e.g. reading from a
 file or wrong url encoding) then it looks like:
 0xC2 (Â) followed by 0xB5 (µ)


 On Tue, Jul 28, 2009 at 3:26 PM, Bill Au bill.w...@gmail.com wrote:
  I am using SolrJ to index the word µTorrent.  After a commit I was not able
  to query for it.  It turns out that the document in my Solr index contains
  the word ÂµTorrent instead of µTorrent.  Does anyone have any idea what's
  going on?
 
  Bill
 



 --
 Robert Muir
 rcm...@gmail.com



Re: µTorrent indexed as ÂµTorrent

2009-07-30 Thread Yonik Seeley
On Thu, Jul 30, 2009 at 6:34 PM, Bill Au bill.w...@gmail.com wrote:
 FYI, it took me a while to discover that SolrJ by default uses a GET request
 for query, which uses ISO-8859-1.

That depends on the servlet container.  SolrJ GET requests are sent in
UTF-8.  Some servlet containers such as Tomcat need extra
configuration to treat URLs as UTF-8 instead of latin-1, but the
standard http://www.ietf.org/rfc/rfc3986.txt clearly specifies UTF-8.
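
To illustrate the point about servlet containers: the same percent-encoded
query string decodes to different text depending on which charset the
receiving side assumes.  The following is only a sketch using the JDK's
URLEncoder and URLDecoder.  The class and method names are made up for this
example, and it does not involve SolrJ or a real servlet container.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class UrlCharsetDemo {
    // Percent-encode a query value using UTF-8 bytes (what a UTF-8 client sends).
    public static String encodeUtf8(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Decode a percent-encoded value with the charset the receiver assumes.
    public static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String word = "\u00B5Torrent";          // µTorrent
        String encoded = encodeUtf8(word);
        System.out.println(encoded);            // %C2%B5Torrent

        // A receiver that assumes UTF-8 recovers the original word.
        System.out.println(decode(encoded, "UTF-8").equals(word));

        // A receiver that assumes ISO-8859-1 sees two characters, Â and µ.
        String wrong = decode(encoded, "ISO-8859-1");
        System.out.println(wrong.equals("\u00C2\u00B5Torrent"));
    }
}
```

In Tomcat, as far as I know, the setting that controls this is the
URIEncoding attribute on the Connector element in server.xml.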

To test the servlet container configuration, check out
example/exampledocs/test_utf8.sh

-Yonik
http://www.lucidimagination.com




Re: µTorrent indexed as ÂµTorrent

2009-07-28 Thread Robert Muir
Bill, somewhere in the process I think you might be treating your
UTF-8 text as ISO-8859-1.

Your character: 00B5 (µ)
Bits: 10110101

UTF8-encoded: 11000010 10110101

If you were to treat these bytes as ISO-8859-1 (e.g. reading from a
file or wrong url encoding) then it looks like:
0xC2 (Â) followed by 0xB5 (µ)
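
The effect described above can be reproduced with the JDK alone.  This is a
minimal sketch, not anything taken from SolrJ: it encodes the word as UTF-8
and then deliberately decodes the bytes as ISO-8859-1.  The class and method
names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encode as UTF-8, then (wrongly) decode the same bytes as ISO-8859-1.
    public static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Render the UTF-8 bytes of a string as space-separated hex pairs.
    public static String utf8Hex(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String micro = "\u00B5";                 // µ, U+00B5
        System.out.println(utf8Hex(micro));      // C2 B5

        // One character in, two characters out: U+00C2 (Â) then U+00B5 (µ).
        String garbled = misdecode(micro + "Torrent");
        System.out.println(garbled.length());    // 9
        System.out.println(garbled.equals("\u00C2\u00B5Torrent"));
    }
}
```

The string literals use Unicode escapes so the result does not depend on the
encoding of the source file or the console.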


On Tue, Jul 28, 2009 at 3:26 PM, Bill Au bill.w...@gmail.com wrote:
 I am using SolrJ to index the word µTorrent.  After a commit I was not able
 to query for it.  It turns out that the document in my Solr index contains
 the word ÂµTorrent instead of µTorrent.  Does anyone have any idea what's
 going on?

 Bill




-- 
Robert Muir
rcm...@gmail.com