Hello Peter,

Thanks for clarifying the unclear areas -- appreciated! I do have a few further questions about Unicode handling:
On Thu, Aug 23, 2012 at 11:20 AM, Peter Saint-Andre <[email protected]> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 8/22/12 1:24 PM, Mark Rejhon wrote:
>> Hello,
>>
>> I've managed to address most of Peter's section 1-5 concerns.
>> However, for the remainder -- I need advice from anyone on
>> unaddressed parts of Peter's comments about XEP-0301 (
>> http://xmpp.org/extensions/xep-0301.html ) There are only five
>> major areas of clarifications I need relating to Peter's recent
>> comments.

[snip]

>> ******* CLARIFICATION #5 ********
>> Peter complimented that the
>> Unicode section was much better.
>> http://xmpp.org/extensions/xep-0301.html#accurate_processing_of_action_elements
>> However, suggestions of further clarifications are also welcome:
>>
>>>> OLD Multiple Unicode code points (e.g. combining marks,
>>>> accents) can form a combining character sequence.
>>>>
>>>> NEW Multiple Unicode code points (e.g. combining marks,
>>>> accents) can form a combining character sequence. In addition,
>>>> some combining character sequences (represented by multiple
>>>> code points) can be transformed into a visually equivalent
>>>> composite character (represented by a single code point), or
>>>> vice-versa (e.g., under Unicode normalization).
>>>
>>> [Comment & Change Made] That's true. But as we already both
>>> know,
>
> But implementers might not.

Sending implementations might do it, so receiving clients may be receiving them anyway. That's why I wrote:
"(However, recipients SHOULD NOT assume this behavior from sending clients. See Guidelines for Recipients)."
That is the last sentence of the 3rd paragraph of Section 4.7.2:
http://xmpp.org/extensions/xep-0301.html#guidelines_for_senders

Receiving clients that are unable to combine a sequence of combining characters will just display them the same way they would for a normal <body/>.
One common way that GUIs in messaging clients handle unrecognized Unicode is to display a sequence of blocks ([] [] []) or question marks (? ? ?), one placeholder per Unicode code point. So it looks exactly the same as when a sending client transmits, in <body/>, a sequence of combining characters that the recipient client does not recognize.

I'm speaking of pre-existing clients, of course. You can't prevent senders from sending a sequence of combining characters that the recipient may not recognize, so recipients that do not support them will just display it as a sequence of unrecognized Unicode characters, typically boxes/blocks/placeholder characters (whatever the operating system supports).

I have observed that exactly the same thing happens with combining characters in real-time text (provided no unexpected code point modifications take place to internally-stored real-time messages, as I already specify in the specification). Actual RealJabber testing, in a client that does differential encoding (section 6.4.1 compliant), shows consistent behaviour for unrecognized Unicode in <body/> versus <rtt/>: unrecognized characters are simply displayed using placeholder characters (the appearance of the placeholder characters is implementation/platform specific).

The Unicode.org NFC algorithm clearly specifies behaviour for un-combinable sequences of code points (those that cannot be replaced by a composite character). This means senders can still potentially send them, and recipients need to do something *minimum*. For example, the minimum behaviour might be to treat code points like array elements to be inserted into an array of code points, and then pass the resulting string along to the GUI, using the same rendering mechanism for <rtt/> that is normally used to render <body/>. Voilà: it results in exactly the same unhandled Unicode character handling (placeholder characters) in recipients that do not support a specific sequence.
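As a rough illustration of that minimum behaviour, here is a small Python sketch. It is only an assumption-laden example, not RealJabber code and not text from XEP-0301: the helper name insert_text and the sample strings are made up for illustration. It stores the message as an array of code points, inserts incoming code points by index, and uses the standard unicodedata module to show that NFC composes some sequences but leaves others as multiple code points.

```python
import unicodedata


def insert_text(message: str, position: int, text: str) -> str:
    """Insert `text` at a code point index (hypothetical helper).

    A Python str is already a sequence of code points, so the message is
    treated as a plain array and no normalization is applied to it.
    """
    code_points = list(message)
    code_points[position:position] = list(text)
    return "".join(code_points)


# A recipient receives "e", then later only a combining acute accent.
stored = insert_text("caf", 3, "e")
stored = insert_text(stored, 4, "\u0301")  # COMBINING ACUTE ACCENT

# "e" + U+0301 has a precomposed equivalent (U+00E9), so NFC composes it.
nfc_composable = unicodedata.normalize("NFC", "e\u0301")

# "q" + U+0301 has no precomposed equivalent, so NFC keeps both code points.
nfc_uncombinable = unicodedata.normalize("NFC", "q\u0301")

# The stored string is handed to the GUI unchanged; how an unsupported
# sequence is drawn (placeholder boxes, etc.) is up to the platform.
print(len(stored), len(nfc_composable), len(nfc_uncombinable))
```

The point of the sketch is that the recipient never needs to understand the sequence: it only keeps the code points in order and lets the normal text renderer deal with them, exactly as it would for <body/>.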
So text via <rtt/> will render the same as text via <body/>, including sequences of combining characters sent by the sender. In the ideal "I followed the spec properly" situation, <rtt/> shows no different "unhandled Unicode" behaviour than <body/>.

>>> not all combining character sequences can be sent as a single
>>> composite character (e.g. single code point). So I had hoped
>>> that was automatically implied, but I guess I have to teach more
>>> Unicode here, eh? :-)
>
> Nothing is automatic and in specifications I prefer not to trust in
> the power of implication. :)

We can't stop senders from sending sequences of combining characters, not even for <body/>. Recipients that do not support them (for either <body/> or <rtt/>) will simply fall back to the normal handling mechanisms for unrecognized Unicode characters. Ideally, <rtt/> should be no different from <body/> behavior in this regard, and implementations that follow differential encoding (Monitoring Text Changes Instead Of Key Presses) will generally have exactly this behavior, even in recipients that do not support valid sequences of combining characters.

Given this perspective, do I have to explain handling of unrecognized sequences of Unicode combining characters?

>>> "Multiple Unicode code points (e.g. combining marks, accents) can
>>> form a combining character sequence. This can also occur in
>>> situations where there isn't a visually equivalent composite
>>> character of a single code point (e.g. when doing Unicode
>>> normalization)" Is this shorter version acceptable?
>
> No, because it's not as accurate.

See above.

>>> The standalone combining mark will never be displayed -- it's
>>> only during transmission.
>
> Only what during transmission?

The incomplete sequence would only exist during transmission (inside an Insert Text <t></t> wrapper).

>>> See differential encoding according to section 6.4.1 (e.g.
>>> turning a valid two-character sequence into a valid
>>> three-character sequence, by transmitting only the combining mark
>>> detected by differential encoder algorithm in section 6.4.1)
>>>
>>> Perhaps I need to add an additional sentence to make this little
>>> tidbit clearer? If so, what do you suggest?
>
> Maybe just clarify that you're talking about "modifying a valid
> complete combining character sequence, to a new valid combining
> character sequence" -- that wasn't clear to me.

Will do.

Thanks!
Mark Rejhon
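
P.S. To make the "transmitting only the combining mark" case concrete, here is a minimal Python sketch of a text-change comparison. It is an illustrative assumption, not the algorithm text from section 6.4.1: the function name diff_text and its (position, removed_count, inserted_text) return shape are hypothetical. It compares the previous and current message by common prefix and suffix, counted in code points.

```python
def diff_text(old: str, new: str) -> tuple[int, int, str]:
    """Return (position, removed_count, inserted_text) turning old into new.

    Hypothetical helper: positions and counts are in code points, since a
    Python str is indexed by code point.
    """
    limit = min(len(old), len(new))

    # Length of the common prefix.
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1

    # Length of the common suffix that does not overlap the prefix.
    suffix = 0
    while (suffix < limit - prefix
           and old[len(old) - 1 - suffix] == new[len(new) - 1 - suffix]):
        suffix += 1

    removed_count = len(old) - prefix - suffix
    inserted_text = new[prefix:len(new) - suffix]
    return prefix, removed_count, inserted_text


# A valid two-code-point sequence becomes a valid three-code-point sequence.
old = "a\u0300"          # "a" + COMBINING GRAVE ACCENT
new = "a\u0300\u0323"    # ... + COMBINING DOT BELOW
print(diff_text(old, new))
```

In this case the only text that needs to be sent is the lone combining mark U+0323 at position 2, which is why a standalone combining mark can appear inside an Insert Text element during transmission even though both the old and the new message are complete, valid sequences.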
