Gereon,

The four bytes do not form a single valid UTF-8 encoded character. 4-byte characters in UTF-8 start with the binary sequence "11110...", whereas 0xC2 is binary 11000010, the lead byte of a 2-byte sequence. (For reference, see the excellent Wikipedia article on UTF-8 encoding.) Your problem looks like something interpreted your valid 2-byte UTF-8 encoded character as two single-byte characters in some other encoding and then encoded each of them as UTF-8 again. This happens if you send XML updates to Solr via HTTP without setting the encoding properly. It is not sufficient to declare the encoding in the XML; you also need an HTTP header that sets it ("Content-Type: text/xml; charset=UTF-8").
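To make the effect concrete, here is a minimal Python sketch, not taken from Solr itself. It assumes the single-byte encoding involved is ISO-8859-1, which is only a guess; the exact garbled bytes depend on which encoding was actually applied. The helper names and the update URL in the usage note are hypothetical.

```python
import urllib.request


def utf8_sequence_length(lead_byte: int) -> int:
    """Return the sequence length announced by a UTF-8 lead byte, or 0 if it is not one."""
    if lead_byte & 0b10000000 == 0b00000000:
        return 1  # 0xxxxxxx: plain ASCII
    if lead_byte & 0b11100000 == 0b11000000:
        return 2  # 110xxxxx
    if lead_byte & 0b11110000 == 0b11100000:
        return 3  # 1110xxxx
    if lead_byte & 0b11111000 == 0b11110000:
        return 4  # 11110xxx
    return 0  # 10xxxxxx continuation byte, or an invalid byte


def misread_as_latin1(text: str) -> bytes:
    """Encode as UTF-8, wrongly decode as ISO-8859-1 (assumption), re-encode as UTF-8."""
    return text.encode("utf-8").decode("iso-8859-1").encode("utf-8")


def build_update_request(url: str, xml_text: str) -> urllib.request.Request:
    """Build (but do not send) a POST whose header declares the body as UTF-8."""
    return urllib.request.Request(
        url,
        data=xml_text.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=UTF-8"},
        method="POST",
    )


e_acute = "\u00e9"
print(e_acute.encode("utf-8"))        # the valid 2-byte form
print(misread_as_latin1(e_acute))     # two 2-byte sequences, four bytes in total
print(utf8_sequence_length(0xC2))     # 0xC2 announces a 2-byte sequence
```

A request built with `build_update_request("http://localhost:8983/solr/update", xml)` (the URL is a placeholder) carries the charset in the HTTP header, so the receiver does not have to guess how the body is encoded.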
--Christian

-----Original Message-----
From: Gereon Steffens [mailto:[EMAIL PROTECTED]
Sent: Wednesday, 2 May 2007 09:59
To: solr-user@lucene.apache.org
Subject: UTF-8 2-byte vs 4-byte encodings

Hi,

I have a question regarding UTF-8 encodings, illustrated by the utf8-example.xml file. This file contains raw, unescaped UTF-8 characters, for example the "e acute" character, represented as the two bytes 0xC3 0xA9. When this file is added to Solr and retrieved later, the XML output contains a four-byte representation of that character, namely 0xC2 0x83 0xC2 0xA9. If, on the other hand, the input data contains this same character as an entity &#xE9;, the output contains the two-byte encoded representation 0xC3 0xA9.

Why is that so, and is there a way to always get characters like these out of Solr as their two-byte representations? The reason I'm asking is that I often have to deal with CDATA sections in my input files that contain raw (two-byte) UTF-8 characters that can't be encoded as entities.

Thanks,
Gereon