On Thu, Jul 26, 2012 at 6:04 PM, Mark Rejhon <[email protected]> wrote:
>
> On 2012-07-26 5:34 PM, "Gunnar Hellström" <[email protected]>
> wrote:
>>
>> I think we have not solved this issue yet.
>>
>> On 2012-07-25 11:06, Kevin Smith wrote:
>>>>
>>>> >4.5.4.3 - "A single UTF-8 encoded character equals one code point" -
>>>> >this isn't true, is it?
>>>> >
>>>> >If we instead say
>>>> >"A single UTF-8 encoded Unicode Character equals one code point."
>>>> >Is true, and then we need to define Unicode Character as the Character
>>>> >concept used in the Unicode standard.
>>>> >And maybe a note saying that "Note that some visible characters are
>>>> >composed of more than one Unicode Character."
>>>
>>> My concern here is the lack of precision about normalisation is
>>> worrying me. I'm not yet convinced that nothing's going to change
>>> composition anywhere important - and one code point (unicode
>>> character) in one place could be more than one code point (unicode
>>> character) elsewhere. I'm feeling quite uncomfortable about the effect
>>> this will potentially have on interoperability - and I think it could
>>> easily be solved by saying "before calculating the rtt transforms to
>>> send the sender must apply normalisation to the string and before
>>> applying the transformations to the rtt buffer the recipient must
>>> apply normalisation to them, where we pick one of the normalisation
>>> types and stick with it. The other option suggested to me when I was
>>> asking people about the effect this would have on interop was to
>>> require RTT to include what normalisation is used, so the sender would
>>> send an update with normalisation=NFKC or whatever.
>>
>> I think that normalization in the endpoints are manageable. They should
>> just be done outside the path where p and n calculations are done.
>> But Kevin indicated that network equipment might also do Unicode
>> normalization. Then we must introduce some suitable rule against that.
>>
>> E.g.
>> "If network equipment makes Unicode normalization of <rtt/> elements,
>> then they must recalculate n and p after that action."
>
> Generally, in most reasonable situations in XMPP, normalizing an
> already-normalized Unicode string, results in no changes. Kevin says to
> specify a normalization format, but how do we know what normalization
> network equipment uses? So we have to carefully choose the normalization
> standard that is least likely to be affected by further unexpected passes of
> normalization.
>
> Anyway, as long as you normalize first at the sender end, any further
> normalization is usually harmless. There are different standards of
> normalization, so research in choosing specific normalization in advance,
> has merit, but factoring into:
>
> - It only affects mid message editing for the most part; where 99 percent
> plus of typing is at the end.
>
> - If servers and network equipment violates standards and rudely modifies
> code points, Message inconsistencies are generally erased during the
> once-every-10-seconds Message Reset (or final message delivery in <body/>)
>
> - Do a full, complete normalization so that from thereafter, most/all
> normalization subsets likely has no damaging effects to real-time text in
> these rare situations.
>
> - Experience has shown I have not run into any situation where it is an
> issue.
>
> - Are there special situations? Does country-wide Great Firewalls modify
> code points n text based packets, for example? Presently, I feel this is
> beyond scope of XEP-0301 and the rest of the real-time message is probably a
> lost cause, until the next line.
>
> - Again, rare normalization damage (which I have never seen, not even with
> realjabber.org, talk.l.google.com, or Openfire) is self repairing anyway via
> Message Reset.
>
> - I did many tests; I copy and pasted tortrue test strings including funny
> bidirectional text with lots of superimposed characters and strange Unicode
> emoticons, and they transmit/edit in sync on both ends. I will keep
> testing....
>
> Personally, I think the Unicode Code Point handling is fine but I agree
> several minor edits may be needed, such as the need to specify a
> strict/fuller sender normalization standard (before the rtt encode) so that
> further normalization is unlikely to affect code points.
>
> Thanks
> Mark Rejhon

Checking the Unicode standards, I realize that what I was describing was various NFC algorithms re-normalizing text that is already NFC (the common normalization being to compact all the combining characters into their most compact form). Now that we have the appropriate terminology (NFC, NFD, NFKC, NFKD): I didn't realize that's what you were referring to.
http://unicode.org/reports/tr15/

I am now assuming that is what Kevin / Gunnar are referring to: the four different "normal forms". I will now speak in proper Unicode normalization terminology (NFC, NFD, NFKC, NFKD).

Assuming the path is standards-compliant (standard XML processors), I've found it doesn't matter if the two ends use different normal forms (e.g. one end executes NFC normalization before <rtt/> encode, and the other end converts to NFD after <rtt/> decode), as long as the real-time message itself is unaffected in between.

About XML parser / server / network driven normalization:

--- XML processors do normalize attribute values (attribute-value normalization, which deals with whitespace and references so that attributes can be compared), but they do not modify the Unicode normal form of the strings within tags. Real-time text is transmitted in the inner text of a <t/> element, so any normal form (NFC, NFD, NFKC, NFKD) is acceptable as long as subsequent action elements use the same normal form (e.g. NFC for one <rtt/> element and NFD for the next <rtt/> element would be a big "no-no").

--- Some XML parsers do provide the ability to turn on normalization of Unicode text (as a flag), so either that has to be disabled, or you simply normalize first (to the same normal form, e.g. NFC), so that parser-driven normalization has no effect. If we specify normalizing to NFC, that's not going to be a helpful mention in XEP-0301 if the XML parser is currently configured to normalize to a different normal form (e.g. NFD). So that's a plus and a minus at the same time. If I am given a choice, I prefer not to mention a specific normal form.

--- In actual practice, by default, most XML parser libraries do not normalize automatically for you (at least not without developer consent).

--- I've found XMPP servers don't normalize on the server side (jabber.org / talk.l.google.com / OpenFire). If there are different servers that convert to different specific normal forms, then that is bad for XEP-0301 interop of mid-message edits anyway. Though if I had to mention anything, then NFC normalization is probably the one I should mention -- though that would still be affected by any servers that decide to convert to NFD / NFKC / NFKD. (If such servers exist, then: bad, server, bad, bad!)

--- TCP/IP routing doesn't modify normalization. That would be tantamount to packet content modification, which is largely a big no-no anyway, and beyond the scope of XEP-0301.

--- It's not "the end of the world" if there's a normalization catastrophe, since two methods cause quick recovery of any normalization-messed-up edits: the Message Reset and the <body/> delivery. Normalization concerns can also essentially be bypassed altogether with "Basic Real Time Text". Although that is not a good excuse to rely on such mechanisms (originally designed for backwards compatibility and good user experience during MUC/simultaneous login), it is worth pointing out, and further experience by multiple vendors can tighten up the standard with regard to normal forms during the Draft stage.

Based on the above, I have come to the conclusion that it is NOT necessary to specify the normal form to use -- just that normalization should occur outside of the RTT codec chain (from encoding on the sender to decoding on the recipient). I feel the advantage of being normalization-agnostic outweighs the risk of the chain doing its own normalization.
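
To make the "outside of the RTT codec chain" point concrete, here is a minimal Python sketch using the standard unicodedata module. The insert_at helper is hypothetical: it only stands in for applying an insert action at a code point position p, and it is not the XEP-0301 wire format. The sample word is also just an assumption chosen for illustration.

```python
import unicodedata


def insert_at(buffer: str, p: int, text: str) -> str:
    """Hypothetical stand-in for an insert action at code point position p."""
    # A Python str is a sequence of code points, so slicing works per code point.
    return buffer[:p] + text + buffer[p:]


word = "caf\u00e9s"  # "cafes" with a precomposed e-acute (U+00E9)
nfc = unicodedata.normalize("NFC", word)  # 5 code points
nfd = unicodedata.normalize("NFD", word)  # 6 code points: e + U+0301

# Re-normalizing to the same form changes nothing.
nfc_again = unicodedata.normalize("NFC", nfc)

# The sender computes p on its NFC buffer: insert "!" just before the "s".
p = nfc.index("s")
sender_view = insert_at(nfc, p, "!")

# Case 1: the recipient buffer holds the same code points, so the edit matches.
same_form_view = insert_at(nfc, p, "!")

# Case 2: the buffer was converted to NFD inside the chain, so the same p now
# lands between "e" and its combining accent, and the edit goes out of sync.
mixed_form_view = insert_at(nfd, p, "!")

# Case 3: normalizing only after decoding (outside the chain) is harmless.
displayed = unicodedata.normalize("NFD", same_form_view)

print(len(nfc), len(nfd))
print(sender_view == same_form_view)
print(unicodedata.normalize("NFC", mixed_form_view) == sender_view)
print(unicodedata.normalize("NFC", displayed) == sender_view)
```

Case 2 is the situation that would break mid-message edits, and case 3 is why the two ends can still prefer different normal forms for their own storage or display.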

Senders can use any Unicode normal form (NFC, NFD, NFKC, NFKD) before encoding the <rtt/> element, and as long as the channel is standards-compliant (a standards-compliant XML parser that doesn't modify normal forms inside inner text), it's not going to be converted to a different normal form. It'll come out intact on the other end (at least on all XEP-0301-functioning XMPP chains I've ever tried).

Note: I do realize that some wording tweaks are needed to make the Unicode stuff readable to a wider audience, but I still think it's not necessary to specify a normal form. (Although I *could* mention that NFC is the preferred normal form at the sender side, since any XML parsers and XMPP servers that decide to do 'rude' normalization will usually normalize to the most bandwidth-compact normal form, which is NFC. But I've not even seen this happen in the real world.)

There are pros and cons to mentioning a specific normal form. The best move is to not specify one (NFC, NFD, NFKC, NFKD), but if one has to be mentioned for senders during the pre-RTT-encode step, I'd say "SHOULD be NFC", since experience shows it does not need to be REQUIRED.

Thanks
Mark Rejhon
