Re: Cyrillic characters

Yonik Seeley Tue, 18 Jul 2006 16:17:18 -0700

OK, let's split up the indexing side from the query side for a moment
and assume that you are indexing correctly (setting the content-type
correctly, etc.).


I just added a new value to the multi-valued features field in the
solr.xml example document:
 "Good unicode support: héllo (hello with an accent over the e)"
or in the XML:
 <field name="features">Good unicode support: h&#xE9;llo (hello with
an accent over the e)</field>

I used a numeric entity because post.sh doesn't specify any
content-type (ascii or latin1 may be assumed).  But as I said, let's
assume things are indexed correctly for now.
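
As a rough illustration of the numeric-entity approach, here is a minimal
Python sketch. The helper name to_hex_entities is hypothetical and not
part of Solr or post.sh; it only rewrites non-ASCII characters and does
not escape XML metacharacters such as & or <.

```python
import html

def to_hex_entities(text):
    """Replace every non-ASCII character with a hexadecimal XML character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)

# The accented e (U+00E9) becomes a pure-ASCII entity.
escaped = to_hex_entities("h\u00e9llo")

# Decoding the entity recovers the original text.
roundtrip = html.unescape(escaped)

print(escaped)
print(roundtrip == "h\u00e9llo")
```

Because the result is pure ASCII, it should survive regardless of which
single-byte charset the receiving side assumes.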

The URI standard says the following:
'''When a new URI scheme defines a component that represents textual
data consisting of characters from the Universal Character Set [UCS],
the data should first be encoded as octets according to the UTF-8
character encoding [STD63]; then only those octets that do not
correspond to characters in the unreserved set should be
percent-encoded. For example, the character A would be represented as
"A", the character LATIN CAPITAL LETTER A WITH GRAVE would be
represented as "%C3%80", and the character KATAKANA LETTER A would be
represented as "%E3%82%A2".'''

http://www.gbiv.com/protocols/uri/rfc/rfc3986.html

So, the Unicode code point for the e with an acute accent is \u00E9.
In UTF-8 encoding it's a two-byte sequence: 0xc3,0xa9
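
The encoding steps can be checked with a short Python sketch using the
standard library's urllib.parse.quote, which encodes to UTF-8 by
default. The variable names are only illustrative.

```python
from urllib.parse import quote

# U+00E9 is LATIN SMALL LETTER E WITH ACUTE.
e_acute = "\u00e9"

# Step 1: encode the character as UTF-8 octets.
utf8_bytes = e_acute.encode("utf-8")

# Step 2: percent-encode every octet outside the unreserved set.
encoded = quote("h\u00e9llo", safe="")

# The three examples quoted from the URI standard.
rfc_examples = {ch: quote(ch, safe="") for ch in ("A", "\u00c0", "\u30a2")}

print([hex(b) for b in utf8_bytes])
print(encoded)
print(rfc_examples)
```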

In both Firefox and IE, the following URI works fine to find the document:
http://localhost:8983/solr/select/?stylesheet=&q=h%C3%A9llo

If I try pasting héllo from Notepad directly into the URL, IE works
fine, but Firefox substitutes the accented e with %E9 (the single
Latin-1 byte rather than the UTF-8 sequence), which is incorrect.
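
To see why %E9 is a problem, here is a small Python sketch comparing the
two encodings. It assumes the server decodes the query string as UTF-8;
that assumption is mine for the sake of the example, not something
verified against a particular servlet container.

```python
from urllib.parse import quote, unquote

word = "h\u00e9llo"

# Percent-encoding the UTF-8 octets, as the URI standard describes.
as_utf8 = quote(word, encoding="utf-8")

# Percent-encoding the Latin-1 octet instead.
as_latin1 = quote(word, encoding="latin-1")

# A UTF-8 decoder recovers the first form but not the second:
# the lone byte 0xE9 is not a valid UTF-8 sequence.
decoded_ok = unquote(as_utf8, encoding="utf-8", errors="strict")
decoded_bad = unquote(as_latin1, encoding="utf-8", errors="replace")

print(as_utf8, as_latin1)
print(decoded_ok == word, decoded_bad == word)
```

With errors="replace", the invalid byte comes back as U+FFFD, so the
decoded query would no longer match the indexed term.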

I haven't tried more complicated examples yet, and I haven't tried
wget, etc., but things look like they are working as expected so far
(with the exception of a Firefox bug).

-Yonik
