Re: [Standards] RTT, take 2

Mark Rejhon Fri, 24 Jun 2011 07:24:24 -0700

Regarding: http://xmpp.org/extensions/inbox/realtimetext.html
(Replies to Remko Tronçon, David Cridland)

On Fri, Jun 24, 2011 at 9:04 AM, Florian Zeitz <[email protected]> wrote:

> On 24.06.2011 11:24, Remko Tronçon wrote:
> > I'm wondering whether 'code points' are any better than UTF-8 based
> > positioning. Isn't it possible that a codepoint position also points
> > inside a character/glyph/...? Peter could probably shed some light on
> > this.
> >
> FWIW, I think using codepoints solves a somewhat different problem.
> If we count codepoints we can delete "half a character", e.g. remove the
> "combining cedilla" from ç, but if we count UTF-(8,16) based we can
> delete "half a codepoint" rendering the result undecodeable which is far
> worse.
>

Florian is correct -- this is one of the many reasons why we don't want to
use UTF-8 byte counting for indexes and lengths in XMPP RTT real-time
editing (text inserts and deletes). With byte offsets, a slightly buggy
client can split a code point and leave the text undecodable, which makes
interoperability much worse.
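
To illustrate Florian's point, here is a minimal Python sketch (not taken
from the spec or from RealJabber). It deletes "one unit" at offset 4 of
"Tronçon", once counting code points and once counting UTF-8 bytes. The
helper name is_valid_utf8 is made up for this example.

```python
# "Tron\u00e7on" uses the precomposed c-with-cedilla (U+00E7).
text = "Tron\u00e7on"

# Code point counting: Python str indexes by code point, so deleting
# one unit at index 4 removes the whole U+00E7.
by_codepoint = text[:4] + text[5:]

# UTF-8 byte counting: U+00E7 is encoded as two bytes (C3 A7), so
# deleting one byte at offset 4 leaves a stray continuation byte.
raw = text.encode("utf-8")
by_byte = raw[:4] + raw[5:]


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(by_codepoint)            # still a valid string
print(is_valid_utf8(by_byte))  # the byte-level delete broke the encoding
```

The code point delete can at worst strip a combining mark; the byte delete
can produce something the receiver cannot decode at all.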

On Fri, Jun 24, 2011 at 5:38 AM, Dave Cridland <[email protected]> wrote:

> As in, adding a "C" character at the fifth code-point of "Tronçon" might
> give you "TroncÇon", or "TronçCon", depending on whether "ç" is a
> "c-with-cedilla" or a "c" followed by a "combining cedilla"?
>
> Yes, I'm quite sure that's possible.
>

Real-time editing worked fine in both cases, thanks to section 5.2.1,
"Monitoring Message Edits". The pre-edit string is compared to the post-edit
string to determine which code points changed. Although I did not publish
the algorithm, it is simpler than most people think -- about 50 lines of
code (l.340-390 of RealTimeText.cs in the RealJabber open source). By
scanning from the left and from the right for unchanged characters (even if
the length has changed), you find the changed section in the middle of the
string and extract it. It works even with pastes, auto-spellcheckers,
auto-accenting, and complex multi-keypress keyboard entry (multiple dead
characters), because we aren't concerned with the input method, only with
how the message changed. That is why I added section 5.2.1, "Monitoring
Message Edits", to the Implementor Notes; it is recommended instead of
monitoring individual keypresses.
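
The left/right scan described above can be sketched roughly as follows.
This is an illustrative Python version written for this message, not the
RealJabber C# code; the names diff_edit and apply_edit are hypothetical.
Python strings index by code point, which matches the counting used here.

```python
def diff_edit(before: str, after: str):
    """Return (position, deleted_count, inserted_text), in code points."""
    limit = min(len(before), len(after))

    # Scan from the left for the unchanged prefix.
    prefix = 0
    while prefix < limit and before[prefix] == after[prefix]:
        prefix += 1

    # Scan from the right for the unchanged suffix, without letting it
    # overlap the prefix in either string.
    suffix = 0
    while (suffix < limit - prefix
           and before[len(before) - 1 - suffix] == after[len(after) - 1 - suffix]):
        suffix += 1

    # Whatever lies between prefix and suffix is the changed section.
    deleted_count = len(before) - prefix - suffix
    inserted_text = after[prefix:len(after) - suffix]
    return prefix, deleted_count, inserted_text


def apply_edit(text: str, position: int, deleted_count: int, inserted_text: str) -> str:
    """Replay an edit on the receiving side."""
    return text[:position] + inserted_text + text[position + deleted_count:]


# A single missing letter typed into the middle of a word.
print(diff_edit("Hello wrld", "Hello world"))
```

Because only the before and after strings are compared, it does not matter
whether the change came from a keypress, a paste, or a spellchecker
replacement.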

In fact, you can use any operating system's textbox and let the operating
system worry about presentation, which is why we aren't worried about
counting individual glyphs (besides, we have no control over counting glyphs
with most GUI frameworks).

> I don't have a solution, either, except to note that this applies to UTF-8
> octets etc as well, unless you normalize all strings first - but then it's
> really not clear to me how to translate editing actions in a GUI into that
> form.
>

The editing actions need to be executed before normalizing, because there is
no consistent normalization standard across platforms. This is an additional
reason we don't count glyphs: one platform may treat a glyph as 2
characters, and another platform as 1 character. The method we chose solves
that problem.
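
As a small illustration of why normalizing first would shift positions,
this Python sketch uses the standard unicodedata module to compare the
composed (NFC) and decomposed (NFD) forms of the same visible text. The
choice of NFC and NFD here is only an example; the spec discussion above
does not name specific forms.

```python
import unicodedata

word = "Tron\u00e7on"

nfc = unicodedata.normalize("NFC", word)  # c-with-cedilla as 1 code point
nfd = unicodedata.normalize("NFD", word)  # "c" + combining cedilla (U+0327)

# Same glyphs on screen, different code point counts.
print(len(nfc), len(nfd))

# The same index lands on different things in the two forms: a letter
# in one, a combining mark in the other.
print(repr(nfc[5]), repr(nfd[5]))
```

An index computed against one form is therefore not valid against the
other, so sender and receiver have to apply edits to the same
un-normalized code point sequence.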

Mark Rejhon
