AW: UTF-8 2-byte vs 4-byte encodings

Burkamp, Christian Wed, 02 May 2007 02:26:41 -0700

Gereon,

The four bytes do not look like a single valid UTF-8 encoded character. 4-byte 
characters in UTF-8 start with the binary sequence "11110...", and 0xC2 does 
not. (For reference, see the excellent Wikipedia article on UTF-8 encoding.)
Your problem looks like something interpreted your valid 2-byte UTF-8 encoded 
character as two single-byte characters in some other encoding, such as 
ISO-8859-1, and then encoded each of them as UTF-8 again. This happens if you 
send XML updates to Solr via HTTP without setting the encoding properly. It is 
not sufficient to declare the encoding in the XML; you also need an HTTP 
header that sets it ("Content-Type: text/xml; charset=UTF-8").
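
To make the effect concrete, here is a small Python sketch of the double 
encoding. It assumes the wrong encoding was ISO-8859-1, which is only a guess; 
the helper name utf8_sequence_length is made up for this illustration. Under 
that assumption the result is 0xC3 0x83 0xC2 0xA9, which differs in the first 
byte from what you reported, so either a different encoding was involved or 
there is a typo in the byte dump.

```python
# Sketch only: assumes the misreading encoding was ISO-8859-1.

def utf8_sequence_length(lead_byte):
    """Return the sequence length announced by a UTF-8 lead byte.

    Returns 0 for continuation bytes (10xxxxxx) and invalid lead bytes.
    """
    if lead_byte & 0b10000000 == 0b00000000:
        return 1  # 0xxxxxxx: ASCII
    if lead_byte & 0b11100000 == 0b11000000:
        return 2  # 110xxxxx
    if lead_byte & 0b11110000 == 0b11100000:
        return 3  # 1110xxxx
    if lead_byte & 0b11111000 == 0b11110000:
        return 4  # 11110xxx
    return 0


original = "\u00e9".encode("utf-8")        # e acute as UTF-8: C3 A9
misread = original.decode("iso-8859-1")    # wrongly read as two characters
double_encoded = misread.encode("utf-8")   # each one encoded again

print(original.hex(" "))                   # c3 a9
print(double_encoded.hex(" "))             # c3 83 c2 a9
print(utf8_sequence_length(double_encoded[0]))  # 2, not 4
```

So the four bytes are really two 2-byte characters, not one 4-byte character.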

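A minimal sketch of an update request that carries the charset in the HTTP 
header, using only the Python standard library. The URL, the field names and 
the function name build_update_request are assumptions for illustration; 
adjust them to your installation.

```python
import urllib.request

# Assumed location of the update handler; change it to match your setup.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"


def build_update_request(xml_text, url=SOLR_UPDATE_URL):
    """Build a POST request whose body and header agree on UTF-8."""
    body = xml_text.encode("utf-8")  # raw UTF-8 bytes on the wire
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/xml; charset=UTF-8"},
        method="POST",
    )


# Hypothetical document containing a raw (unescaped) e acute.
doc = (
    '<add><doc><field name="id">1</field>'
    '<field name="name">caf\u00e9</field></doc></add>'
)
request = build_update_request(doc)

# Sending is left commented out so the sketch runs without a server:
# urllib.request.urlopen(request)
```

The point is only that the bytes in the body and the charset named in the 
header match; any HTTP client that lets you set the header will do.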

--Christian

-----Original Message-----
From: Gereon Steffens [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, 2 May 2007 09:59
To: solr-user@lucene.apache.org
Subject: UTF-8 2-byte vs 4-byte encodings


Hi,

I have a question regarding UTF-8 encodings, illustrated by the 
utf8-example.xml file. This file contains raw, unescaped UTF-8 characters, for 
example the "e acute" character, represented as two bytes 0xC3 0xA9. When this 
file is added to Solr and retrieved later, the XML output contains a four-byte 
representation of that character, namely 0xC2 0x83 0xC2 0xA9.

If, on the other hand, the input data contains this same character as an entity 
&#xE9; the output contains the two-byte encoded representation 0xC3 0xA9.

Why is that so, and is there a way to always get characters like these out of 
Solr as their two-byte representations?

The reason I'm asking is that I often have to deal with CDATA sections in my 
input files that contain raw (two-byte) UTF-8 characters, which can't be 
written as entities there.

Thanks,
Gereon
