On Jun 24, 2011, at 6:04 AM, Florian Zeitz wrote:

> On 24.06.2011 11:24, Remko Tronçon wrote:
>>> So I'd say that we should refer to characters in a string, and deal with
>>> Unicode code-points in the abstract.
>> 
>> I'm wondering whether 'code points' are any better than UTF-8 based
>> positioning. Isn't it possible that a codepoint position also points
>> inside a character/glyph/...? Peter could probably shed some light on
>> this.
>> 
> FWIW, I think using codepoints solves a somewhat different problem.
> 
> If we count codepoints we can delete "half a character", e.g. remove the
> "combining cedilla" from ç, but if we count UTF-(8,16) based we can
> delete "half a codepoint", rendering the result undecodable, which is far
> worse.

The protocol ought to be defined in wire terms… but should state a few
guidelines on the handling of characters composed of multiple code points.

For instance, if a character is sent as <X> <Y> (Y being a combining 
character), I have little problem with <Y> being edited away so long as <X> by 
itself is valid… or being replaced with <Z> (another combining character) 
without touching <X>.
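
To make the distinction concrete, here is a minimal sketch in Python (purely
illustrative, not part of any specification; the helper names
delete_code_point, delete_utf8_byte and is_decodable are made up for this
example). It assumes <X> is "c" and <Y> is U+0327 COMBINING CEDILLA, and
compares editing by code point with editing by UTF-8 byte offset:

```python
import unicodedata


def delete_code_point(text: str, index: int) -> str:
    """Remove the single code point at `index` (hypothetical helper)."""
    return text[:index] + text[index + 1:]


def delete_utf8_byte(data: bytes, index: int) -> bytes:
    """Remove the single byte at `index` of the UTF-8 wire form (hypothetical helper)."""
    return data[:index] + data[index + 1:]


def is_decodable(data: bytes) -> bool:
    """Return True if `data` is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# <X> <Y>: 'c' followed by U+0327 COMBINING CEDILLA (renders as c with cedilla).
decomposed = "c\u0327"

# Code-point editing: <Y> is edited away, <X> alone is still valid text.
without_mark = delete_code_point(decomposed, 1)

# Code-point editing: <Y> is replaced by <Z> (U+0301 COMBINING ACUTE ACCENT).
replaced_mark = decomposed[:1] + "\u0301"

# Byte editing: U+0327 is two bytes in UTF-8; dropping the second one
# leaves a dangling lead byte that no receiver can decode.
wire = decomposed.encode("utf-8")
truncated = delete_utf8_byte(wire, 2)

print(is_decodable(without_mark.encode("utf-8")))
print(is_decodable(replaced_mark.encode("utf-8")))
print(is_decodable(truncated))
```

Both code-point edits leave well-formed text on the wire, whereas the
byte-level edit does not, which is the "half a codepoint" case described
above.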

It's my view that the client needs to be aware enough of what's happening
in the GUI and on the wire to ensure both are sane. If you try to design this
such that clients don't have to be aware of what is really going on on the
wire or in the GUI, it will be quite fragile and prone to interoperability
problems.

-- Kurt
