Re: µTorrent indexed as ÂµTorrent
Thanks, Robert. That's exactly what my problem was. Things work fine after I make sure that all my processing (index and query) is using UTF-8.

FYI, it took me a while to discover that SolrJ by default uses a GET request for queries, which uses ISO-8859-1. I had to explicitly use a POST for queries in SolrJ in order to get it to use UTF-8.

Bill

On Tue, Jul 28, 2009 at 5:27 PM, Robert Muir rcm...@gmail.com wrote:
> Bill, somewhere in the process I think you might be treating your UTF-8 text as ISO-8859-1.
>
> Your character: 00B5 (µ)
> Bits: 10110101
> UTF8-encoded: 11000010 10110101
>
> If you were to treat these bytes as ISO-8859-1 (e.g. reading from a file or wrong URL encoding), then it looks like: 0xC2 (Â) followed by 0xB5 (µ)
>
> On Tue, Jul 28, 2009 at 3:26 PM, Bill Au bill.w...@gmail.com wrote:
>> I am using SolrJ to index the word µTorrent. After a commit I was not able to query for it. It turns out that the document in my Solr index contains the word ÂµTorrent instead of µTorrent. Does anyone have any idea what's going on?
>>
>> Bill
>
> --
> Robert Muir
> rcm...@gmail.com
Re: µTorrent indexed as ÂµTorrent
On Thu, Jul 30, 2009 at 6:34 PM, Bill Au bill.w...@gmail.com wrote:
> FYI, it took me a while to discover that SolrJ by default uses a GET request for queries, which uses ISO-8859-1.

That depends on the servlet container. SolrJ GET requests are sent in UTF-8. Some servlet containers, such as Tomcat, need extra configuration to treat URLs as UTF-8 instead of Latin-1, but the standard (http://www.ietf.org/rfc/rfc3986.txt) clearly specifies UTF-8.

To test the servlet container configuration, check out example/exampledocs/test_utf8.sh

-Yonik
http://www.lucidimagination.com

> I had to explicitly use a POST for queries in SolrJ in order to get it to use UTF-8.
>
> Bill
>
> On Tue, Jul 28, 2009 at 5:27 PM, Robert Muir rcm...@gmail.com wrote:
>> Bill, somewhere in the process I think you might be treating your UTF-8 text as ISO-8859-1.
>>
>> Your character: 00B5 (µ)
>> Bits: 10110101
>> UTF8-encoded: 11000010 10110101
>>
>> If you were to treat these bytes as ISO-8859-1 (e.g. reading from a file or wrong URL encoding), then it looks like: 0xC2 (Â) followed by 0xB5 (µ)
>>
>> On Tue, Jul 28, 2009 at 3:26 PM, Bill Au bill.w...@gmail.com wrote:
>>> I am using SolrJ to index the word µTorrent. After a commit I was not able to query for it. It turns out that the document in my Solr index contains the word ÂµTorrent instead of µTorrent. Does anyone have any idea what's going on?
>>>
>>> Bill
>>
>> --
>> Robert Muir
>> rcm...@gmail.com
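
The point above about servlet containers can be illustrated without Solr at all. The following is only a sketch using the standard java.net classes: it percent-encodes the query term as UTF-8, the way the reply says SolrJ sends GET requests, and then decodes it once as UTF-8 and once as ISO-8859-1 to imitate a container that has not been configured for UTF-8 URLs. The class and method names (UrlCharsetDemo, encodeUtf8, decodeAs) are made up for this illustration and are not part of SolrJ or any container.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class; it does not talk to Solr or to a servlet container.
class UrlCharsetDemo {

    // Percent-encode a query term using UTF-8 bytes.
    static String encodeUtf8(String term) {
        try {
            return URLEncoder.encode(term, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Decode a percent-encoded term with the given charset, standing in for
    // whatever charset the receiving side applies to the URL.
    static String decodeAs(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String term = "\u00B5Torrent";           // µTorrent, written with an escape
        String encoded = encodeUtf8(term);        // %C2%B5Torrent

        System.out.println(encoded);
        // Correct decoding gives back the original term.
        System.out.println(decodeAs(encoded, "UTF-8").equals(term));
        // Latin-1 decoding turns the two bytes C2 B5 into two characters.
        System.out.println(decodeAs(encoded, "ISO-8859-1").equals("\u00C2\u00B5Torrent"));
    }
}
```

The strings are compared rather than printed directly because what a console shows for non-ASCII characters depends on its own encoding settings.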
Re: µTorrent indexed as ÂµTorrent
Bill, somewhere in the process I think you might be treating your UTF-8 text as ISO-8859-1.

Your character: 00B5 (µ)
Bits: 10110101
UTF8-encoded: 11000010 10110101

If you were to treat these bytes as ISO-8859-1 (e.g. reading from a file or wrong URL encoding), then it looks like: 0xC2 (Â) followed by 0xB5 (µ)

On Tue, Jul 28, 2009 at 3:26 PM, Bill Au bill.w...@gmail.com wrote:
> I am using SolrJ to index the word µTorrent. After a commit I was not able to query for it. It turns out that the document in my Solr index contains the word ÂµTorrent instead of µTorrent. Does anyone have any idea what's going on?
>
> Bill

--
Robert Muir
rcm...@gmail.com
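
The byte-level explanation above can be reproduced in a few lines of plain Java. This is a minimal sketch, not anything taken from Solr or SolrJ: it encodes the text as UTF-8, decodes the same bytes as ISO-8859-1 to produce the corrupted form, and shows that the damage is reversible as long as nothing else has altered the bytes. The class and method names (MojibakeDemo, misdecode, repair, hex) are invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class reproducing the UTF-8 / ISO-8859-1 mix-up.
class MojibakeDemo {

    // Encode as UTF-8, then (wrongly) decode the same bytes as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake: recover the raw bytes via ISO-8859-1, decode as UTF-8.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Format bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u00B5Torrent";   // µTorrent
        String garbled = misdecode(original); // U+00C2 U+00B5 followed by "Torrent"

        // The single character U+00B5 becomes two bytes in UTF-8.
        System.out.println(hex("\u00B5".getBytes(StandardCharsets.UTF_8))); // C2 B5
        System.out.println(garbled.length() - original.length());            // 1
        System.out.println(repair(garbled).equals(original));                // true
    }
}
```

ISO-8859-1 maps every byte value to exactly one character, which is why the round trip in repair loses nothing; the same trick is not safe with charsets that leave some byte values undefined.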