Re: [Standards] XEP-0301 0.5 comments -Unicode characters

Mark Rejhon Thu, 26 Jul 2012 15:04:30 -0700

On 2012-07-26 5:34 PM, "Gunnar Hellström" <[email protected]>
wrote:
>
> I think we have not solved this issue yet.
>
> On 2012-07-25 11:06, Kevin Smith wrote:
>>>
>>> >4.5.4.3 - "A single UTF-8 encoded character equals one code point" -
>>> >this isn't true, is it?
>>> >
>>> >If we instead say
>>> >"A single UTF-8 encoded Unicode Character equals one code point."
>>> >Is true, and then we need to define Unicode Character as the Character
>>> >concept used in the Unicode standard.
>>> >And maybe a note saying that "Note that some visible characters are
>>> >composed of more than one Unicode Character."
>>
>> My concern here is the lack of precision about normalisation is
>> worrying me. I'm not yet convinced that nothing's going to change
>> composition anywhere important - and one code point (unicode
>> character) in one place could be more than one code point (unicode
>> character) elsewhere. I'm feeling quite uncomfortable about the effect
>> this will potentially have on interoperability - and I think it could
>> easily be solved by saying "before calculating the rtt transforms to
>> send the sender must apply normalisation to the string and before
>> applying the transformations to the rtt buffer the recipient must
>> apply normalisation to them, where we pick one of the normalisation
>> types and stick with it. The other option suggested to me when I was
>> asking people about the effect this would have on interop was to
>> require RTT to include what normalisation is used, so the sender would
>> send an update with normalisation=NFKC or whatever.
>
> I think that normalization in the endpoints are manageable. They should
> just be done outside the path where p and n calculations are done.
> But Kevin indicated that network equipment might also do Unicode
> normalization. Then we must introduce some suitable rule against that.
>
> E.g. "If network equipment makes Unicode normalization of <rtt/>
> elements, then they must recalculate n and p after that action."


Generally, in most reasonable situations in XMPP, normalizing an
already-normalized Unicode string results in no changes.  Kevin says to
specify a normalization form, but how do we know which normalization
network equipment uses?  So we have to carefully choose the normalization
form that is least likely to be affected by further unexpected passes
of normalization.
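
As an illustration of that choice, here is a minimal Python sketch (not
part of XEP-0301) that checks which normalization forms survive a second
pass of another form.  It assumes only the standard library's
unicodedata module; the helper name is_stable and the sample string are
made up for this example, and a real evaluation would need far more
test data.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def is_stable(text, first, second):
    """Return True if applying `second` after `first` leaves the text unchanged.

    Hypothetical helper for illustration only.
    """
    once = unicodedata.normalize(first, text)
    return unicodedata.normalize(second, once) == once

# Sample: e + combining acute, the "fi" ligature, and A with ring above.
sample = "e\u0301 \ufb01 \u00c5"

for first in FORMS:
    survivors = [second for second in FORMS if is_stable(sample, first, second)]
    print(first, "is unchanged by a later pass of", survivors)
```

For this particular sample, text normalized to NFKC at the sender is left
alone by a later NFC or NFKC pass, whereas text normalized to NFC is still
changed by a later NFKC pass.  No single form is immune to every other
form, so this only narrows the choice.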

Anyway, as long as you normalize first at the sender end, any further
normalization is usually harmless.  There are different normalization
forms, so researching which specific one to choose in advance has merit,
but these factors should be weighed as well:

- It only affects mid-message editing for the most part, and 99 percent
plus of typing happens at the end of the message.

- If servers and network equipment violate standards and rudely modify
code points, message inconsistencies are generally erased during the
once-every-10-seconds Message Reset (or final message delivery in <body/>).

- Do a full, complete normalization so that thereafter most/all
normalization subsets are likely to have no damaging effect on real-time
text in these rare situations.

- In my experience so far, I have not run into any situation where it is
an issue.

- Are there special situations?  Do country-wide Great Firewalls modify
code points in text-based packets, for example?  Presently, I feel this is
beyond the scope of XEP-0301, and in that case the rest of the real-time
message is probably a lost cause until the next line.

- Again, rare normalization damage (which I have never seen, not even with
realjabber.org, talk.l.google.com, or Openfire) is self-repairing anyway
via Message Reset.

- I did many tests; I copied and pasted torture test strings, including
funny bidirectional text with lots of superimposed characters and strange
Unicode emoticons, and they transmit/edit in sync on both ends.  I will
keep testing....

Personally, I think the Unicode Code Point handling is fine but I agree
several minor edits may be needed, such as the need to specify a
strict/fuller sender normalization standard (before the rtt encode) so that
further normalization is unlikely to affect code points.
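
To show why normalizing before the rtt encode matters, here is a small
Python sketch under stated assumptions: prepare_for_rtt and
code_point_count are hypothetical helper names, and NFC is only a
placeholder default, since the form has not been chosen yet.  It relies
on the fact that Python's len() on a str counts code points.

```python
import unicodedata

def prepare_for_rtt(text, form="NFC"):
    """Normalize on the sender before any positions or counts are computed.

    Hypothetical helper; the default form is an assumption, not a decision.
    """
    return unicodedata.normalize(form, text)

def code_point_count(text):
    """Count Unicode code points (len() of a Python str does exactly this)."""
    return len(text)

composed = "caf\u00e9"      # e-acute as one precomposed code point
decomposed = "cafe\u0301"   # e followed by a combining acute accent

# The two strings look identical on screen but differ in code point count,
# so position (p) and count (n) values computed on one would not match
# the other.
print(code_point_count(composed), code_point_count(decomposed))

# After sender-side normalization both yield the same code point sequence.
print(prepare_for_rtt(composed) == prepare_for_rtt(decomposed))
```

If both ends count code points on the same normalized sequence, the p and
n values line up regardless of how the text was originally typed or
pasted.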

Thanks
Mark Rejhon
