On 2012-07-26 5:34 PM, "Gunnar Hellström" <[email protected]> wrote:
>
> I think we have not solved this issue yet.
>
> On 2012-07-25 11:06, Kevin Smith wrote:
>>>
>>> >4.5.4.3 - "A single UTF-8 encoded character equals one code point" -
>>> >this isn't true, is it?
>>> >
>>> >If we instead say
>>> >"A single UTF-8 encoded Unicode Character equals one code point."
>>> >Is true, and then we need to define Unicode Character as the Character
>>> >concept used in the Unicode standard.
>>> >And maybe a note saying that "Note that some visible characters are composed
>>> >of more than one Unicode Character."
>>
>> My concern here is the lack of precision about normalisation is
>> worrying me. I'm not yet convinced that nothing's going to change
>> composition anywhere important - and one code point (unicode
>> character) in one place could be more than one code point (unicode
>> character) elsewhere. I'm feeling quite uncomfortable about the effect
>> this will potentially have on interoperability - and I think it could
>> easily be solved by saying "before calculating the rtt transforms to
>> send the sender must apply normalisation to the string and before
>> applying the transformations to the rtt buffer the recipient must
>> apply normalisation to them, where we pick one of the normalisation
>> types and stick with it. The other option suggested to me when I was
>> asking people about the effect this would have on interop was to
>> require RTT to include what normalisation is used, so the sender would
>> send an update with normalisation=NFKC or whatever.
>
> I think that normalization in the endpoints are manageable. They should just be done outside the path where p and n calculations are done.
> But Kevin indicated that network equipment might also do Unicode normalization. Then we must introduce some suitable rule against that.
>
> E.g. "If network equipment makes Unicode normalization of <rtt/> elements, then they must recalculate n and p after that action."

Generally, in most reasonable situations in XMPP, normalizing an already-normalized Unicode string results in no changes. Kevin says to specify a normalization format, but how do we know which normalization network equipment uses? So we have to carefully choose the normalization form that is least likely to be affected by further unexpected passes of normalization. Anyway, as long as you normalize first at the sender end, any further normalization is usually harmless.

There are different normalization forms, so researching which specific normalization to choose in advance has merit, but these points factor in:

- It only affects mid-message editing for the most part, whereas 99 percent plus of typing is at the end.
- If servers and network equipment violate standards and rudely modify code points, message inconsistencies are generally erased during the once-every-10-seconds Message Reset (or by final message delivery in <body/>).
- Do a full, complete normalization so that, from then on, most or all normalization subsets are likely to have no damaging effect on real-time text in these rare situations.
- In my experience, I have not run into any situation where it is an issue.
- Are there special situations? Do country-wide Great Firewalls modify code points in text-based packets, for example? Presently, I feel this is beyond the scope of XEP-0301, and the rest of the real-time message is probably a lost cause until the next line.
- Again, rare normalization damage (which I have never seen, not even with realjabber.org, talk.l.google.com, or Openfire) is self-repairing anyway via Message Reset.
- I did many tests; I copied and pasted torture test strings, including funny bidirectional text with lots of superimposed characters and strange Unicode emoticons, and they transmit/edit in sync on both ends. I will keep testing....

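To make the normalization point concrete, here is a minimal sketch in Python using the standard unicodedata module. It is only an illustration, not anything from XEP-0301: the helper names normalize_before_rtt and code_point_count are hypothetical, and the choice of NFKC as the "fuller" sender-side form is an assumption made for the example. It shows that a second pass of the same form changes nothing, that NFC leaves an NFKC-normalized string alone, and that a different form such as NFD can still change the code point count that p and n depend on.

```python
import unicodedata


def code_point_count(text: str) -> int:
    """Number of Unicode code points (a Python 3 str is a code point sequence)."""
    return len(text)


def normalize_before_rtt(text: str, form: str = "NFC") -> str:
    """Hypothetical sender-side step: normalize before computing rtt edits.

    The default form is an assumption for illustration only.
    """
    return unicodedata.normalize(form, text)


# Test string: U+FB01 "fi" ligature, then "e" followed by U+0301 combining acute.
raw = "\ufb01e\u0301"

# Assumed "fuller" normalization at the sender.
sent = normalize_before_rtt(raw, "NFKC")

# A second pass with the same form is a no-op (normalization is idempotent).
same_form_again = unicodedata.normalize("NFKC", sent)

# NFKC output is already in NFC, so an NFC pass in the network changes nothing.
nfc_in_transit = unicodedata.normalize("NFC", sent)

# A decomposing form (NFD) still splits the accented letter into two code points.
nfd_in_transit = unicodedata.normalize("NFD", sent)

if __name__ == "__main__":
    print(code_point_count(raw), code_point_count(sent))
    print(same_form_again == sent, nfc_in_transit == sent)
    print(code_point_count(nfd_in_transit))
```

In this example the sender's text survives a later NFKC or NFC pass unchanged, but an NFD pass in transit would still shift positions, which is the case where recalculating n and p (or waiting for the next Message Reset) would be needed.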
Personally, I think the Unicode code point handling is fine, but I agree several minor edits may be needed, such as specifying a strict/fuller sender normalization standard (applied before the rtt encode) so that further normalization is unlikely to affect code points.

Thanks,
Mark Rejhon
