Regarding: http://xmpp.org/extensions/inbox/realtimetext.html (Replies to Remko Tronçon, David Cridland)

On Fri, Jun 24, 2011 at 9:04 AM, Florian Zeitz wrote:

> On 24.06.2011 11:24, Remko Tronçon wrote:
> > I'm wondering whether 'code points' are any better than UTF-8 based
> > positioning. Isn't it possible that a codepoint position also points
> > inside a character/glyph/...? Peter could probably shed some light on
> > this.
> >
> FWIW, I think using codepoints solves somewhat different problem.
> If we count codepoints we can delete "half a character", e.g. remove the
> "combining cedilla" from ç, but if we count UTF-(8,16) based we can
> delete "half a codepoint" rendering the result undecodeable which is far
> worse.
>

Florian is correct -- this is one of the many reasons why we don't want to use a "UTF-8 counting methodology" for indexes and lengths in XMPP RTT real-time editing (text inserts and deletes). With octet counting, a slightly buggy client can split a code point and leave the other side with undecodable text, so interoperability failures are much worse.

On Fri, Jun 24, 2011 at 5:38 AM, Dave Cridland wrote:

> As in, adding a "C" character at the fifth code-point of "Tronçon" might
> give you "TroncÇon", or "TronçCon", depending on whether "ç" is a
> "c-with-cedilla" or a "c" followed by a "combining cedilla"?
>
> Yes, I'm quite sure that's possible.
>

Real-time editing worked fine in both cases, thanks to the approach in section 5.2.1, "Monitoring Message Edits": the pre-edit string is compared to the post-edit string to determine which code points changed. Although I did not publish the algorithm, it is simpler than most people think -- 50 lines of code (lines 340-390 of RealTimeText.cs in the open-source RealJabber project). By scanning from the left and from the right for unchanged characters (even if the length has changed), you find the changed section in the middle of the string and extract it.
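
A minimal Python sketch of the left/right scan described above, for illustration only. It is not the RealJabber code: the function names find_edit and apply_edit are made up here, and the return shape (position, number of code points deleted, inserted text) is an assumption rather than the wire format of the spec. Python strings are indexed by code point, so the positions below are code-point positions.

```python
def find_edit(before: str, after: str):
    """Compare pre-edit and post-edit text; return (position, deleted, inserted).

    Hypothetical helper: position and deleted are counted in code points.
    """
    limit = min(len(before), len(after))

    # Scan from the left for the unchanged prefix.
    left = 0
    while left < limit and before[left] == after[left]:
        left += 1

    # Scan from the right for the unchanged suffix, without overlapping the prefix.
    right = 0
    while right < limit - left and before[-1 - right] == after[-1 - right]:
        right += 1

    # Whatever remains in the middle is the changed section.
    deleted = len(before) - left - right
    inserted = after[left:len(after) - right]
    return left, deleted, inserted


def apply_edit(text: str, position: int, deleted: int, inserted: str) -> str:
    """Hypothetical receiver side: replay one edit on the receiver's copy."""
    return text[:position] + inserted + text[position + deleted:]


# "ç" as one precomposed code point, then as "c" + combining cedilla (U+0327).
print(find_edit("Tron\u00e7on", "Tron\u00e7Con"))    # (5, 0, 'C')
print(find_edit("Tronc\u0327on", "Tronc\u0327Con"))  # (6, 0, 'C')
```

In both cases the sender reports an edit relative to the exact string it already sent, so the receiver ends up with the same text whichever form of "ç" was used.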

This approach works even with pastes, auto-spellcheckers, auto-accenting, and complex multi-keypress keyboard entry (multiple dead keys), because we are not concerned with the input method, only with how the message changed. That is why I added section 5.2.1, "Monitoring Message Edits", to the Implementor Notes: it is recommended instead of monitoring individual keypresses. In fact, you can use any operating system's textbox and let the operating system worry about presentation, which is why we aren't worried about counting individual glyphs (besides, most GUI frameworks give us no control over how glyphs are counted).

> I don't have a solution, either, except to note that this applies to UTF-8
> octets etc as well, unless you normalize all strings first - but then it's
> really not clear to me how to translate editing actions in a GUI into that
> form.
>

The editing actions need to be executed before normalizing, because different platforms do not normalize consistently. This is an additional reason we don't count based on glyphs. One platform may represent a glyph as two code points, and another platform as one. The method we chose solves that problem.

Mark Rejhon
