Re: [users] Help reqd on how OOo 2 stores Unicode in content.xml

Andy Pepperdine Mon, 02 Jan 2006 01:15:54 -0800

On Monday 02 January 2006 02:36, Shriramana Sharma wrote:
> Hello.
>
> Opening OOo (first Writer then Calc) I entered the Unicode sequence:
>
> 0928 092e 0938 094d 0924 0947
>
> (Devanagari script for namaste = "I bow to you" = greeting)
>
> but I find that OOo (both Writer and Calc) stores it as the following
> sequence in content.xml -
>
> e0 a4 a8 e0 a4 ae e0 a4 b8 e0 a5 8d e0 a4 a4 e0 a5 87


You need to distinguish between the code point of a Unicode character and the 
way in which it is encoded for transport between applications (or for 
storage, sending to a printer, etc.). The wide-character representation on 
Linux is typically 32 bits per character (I'm guessing that OOo does so). The 
Unicode Standard now defines code points up to U+10FFFF, which need 21 bits, 
so a single 16-bit unit can no longer hold every character. Systems built 
around 16-bit characters, such as Windows, reach the higher code points with 
UTF-16 surrogate pairs.

To transmit this data the encoding is by default UTF-8, which does not depend 
on the endianness of the implementation. It is a stream of bytes in which the 
high bits of each byte are flag bits and the remaining bits carry the data 
(roughly speaking). For the Devanagari block (U+0900 to U+097F), UTF-8 needs 
three bytes per character, and you can see this in the sequence you gave by 
dividing it into triplets: e0 a4 a8 is U+0928, e0 a4 ae is U+092E, and so on.
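
If you want to check the triplets yourself, here is a minimal Python sketch. 
It encodes your six code points twice: once with Python's built-in UTF-8 
codec, and once by hand using the three-byte bit layout (1110xxxx 10xxxxxx 
10xxxxxx). The helper names encode_three_byte and to_hex are made up for this 
illustration and are not part of OOo or any library.

```python
# Code points from the original message (Devanagari "namaste").
CODE_POINTS = [0x0928, 0x092E, 0x0938, 0x094D, 0x0924, 0x0947]


def encode_three_byte(cp: int) -> bytes:
    """Hand-encode one code point in U+0800..U+FFFF as three UTF-8 bytes.

    Illustrative helper only; it covers just the three-byte case.
    """
    in_range = 0x0800 <= cp <= 0xFFFF
    is_surrogate = 0xD800 <= cp <= 0xDFFF
    if not in_range or is_surrogate:
        raise ValueError(f"U+{cp:04X} is not a three-byte UTF-8 code point")
    return bytes([
        0xE0 | (cp >> 12),          # 1110xxxx: flag bits + top 4 data bits
        0x80 | ((cp >> 6) & 0x3F),  # 10xxxxxx: flag bits + middle 6 data bits
        0x80 | (cp & 0x3F),         # 10xxxxxx: flag bits + low 6 data bits
    ])


def to_hex(data: bytes) -> str:
    """Format bytes as space-separated lowercase hex pairs."""
    return " ".join(f"{b:02x}" for b in data)


text = "".join(chr(cp) for cp in CODE_POINTS)
builtin = text.encode("utf-8")
manual = b"".join(encode_three_byte(cp) for cp in CODE_POINTS)

print(to_hex(builtin))
print(to_hex(manual))
```

Both lines should print the same byte sequence that you found in content.xml, 
which shows that OOo stored exactly what you typed, just in UTF-8 form.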

You might find these useful
   ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO.html
and
   http://www.alanwood.net/unicode/

or as you are talking about Linux, try
   man 7 utf-8

Andy.
