Correction below:

> The text here says that "calculations of p and n values MUST be based on
>> Unicode code points". Are you sure that you mean code points? Given that
>> XMPP mandates the use of UTF-8, I think it would be safer and easier to
>> say "UTF-8-encoded code points" (the point about "Some Unicode encodings
>> use a variable number of bytes per Unicode character" is true but
>> hopefully irrelevant here).
>
>
> Unfortunately it's critically relevant.  UTF-8 is simply a transmission
> medium and not a storage/processing layer medium for most XMPP libraries.
>  More than 75% of XMPP libraries do provide the original UTF-8 data,
>

OOOPS - correction -- I meant "*more than 75% of XMPP libraries do NOT provide
the original UTF-8 data*"


> incoming text is already converted to the programming language's Unicode
> format.  The library typically converts incoming text to the programming
> language's native string format, which is almost never UTF-8.
> .....However, UTF-8/UTF-16/UTF-16LE/UTF-16BE/UTF-32 all are exactly equal
> when all are processed as Code Points.  Therefore, we get consistency, no
> matter how the XMPP library presents Unicode strings.   Much safer!
> ....Java doesn't have an 8-bit 'char' type -- its 'char' is 16 bits wide.
> ....One popular example, the Java 'smack' library (and Android 'asmack')
> stores incoming XMPP XML as a 16-bit Unicode string (UTF-16).  Java does
> not have an 8-bit 'char' type, except as byte arrays, which aren't easily
> processable.   Incoming XML in this Java library is converted to UTF-16 for
> internal processing by the library, and outgoing XML is converted back to
> UTF-8.
> ....If I used UTF-8 byte offsets instead, I could get corruption if I
> accidentally spliced a multi-byte UTF-8 character (i.e. during real-time
> editing in a poorly implemented UTF-8 client).  Splicing via Unicode Code
> Points is much safer, and does not result in a character suddenly becoming
> a different character, and buggy clients will show very clear and easily
> fixable splicing errors (i.e. off-by-one errors), such as text
> inserted/deleted that's one-character-off.
> ....Code Points are therefore the safest and most interoperable method:
> they work regardless of whatever Unicode format the programming language
> uses (XMPP library, language limitations, etc.) and therefore allow
> XEP-0301 to be compatible with virtually all XMPP libraries worldwide.
>  Just look at Java -- it doesn't even have an 8-bit 'char' type!
> ....Code Points allow complete interop between clients of dissimilar
> Unicode storage formats.  A client using char8 libpurple interops with
> char16 libpurple, a Java client using UTF-16 interops with C++ UTF-32, no
> problem -- because all strings convert to UTF-8 for transmission.
> ....See this flowchart:
> http://www.realjabber.org/flowchart_of_xmpp_rtt_path.pdf
>
> Can you suggest other alternatives, that would satisfy this situation?
>  (we tried)
> Or given the info above, do you agree code points is the safest way?
>
> Sincerely,
> Mark Rejhon
>
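
To make the code point argument above concrete, here is a minimal Java
sketch.  It is only an illustration, not text from XEP-0301: the class
name CodePointSplice and its helpers are hypothetical, and I am assuming
that p is a 0-based position counted in code points and that an erase
removes the n code points just before p.  It relies only on the standard
String.codePointCount and String.offsetByCodePoints methods, so the same
p and n values stay valid even though Java stores the string as UTF-16.

```java
import java.nio.charset.StandardCharsets;

final class CodePointSplice {
    // Length in Unicode code points, independent of the UTF-16 storage.
    static int length(String s) {
        return s.codePointCount(0, s.length());
    }

    // Insert text before code point position p (assumed 0-based).
    static String insert(String s, int p, String text) {
        int i = s.offsetByCodePoints(0, p); // code point index -> char index
        return s.substring(0, i) + text + s.substring(i);
    }

    // Erase the n code points that end just before position p (assumption).
    static String erase(String s, int p, int n) {
        int end = s.offsetByCodePoints(0, p);
        int start = s.offsetByCodePoints(end, -n);
        return s.substring(0, start) + s.substring(end);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600, 'b'
        System.out.println("code points:       " + length(s));
        System.out.println("UTF-16 code units: " + s.length());
        System.out.println("UTF-8 bytes:       "
                + s.getBytes(StandardCharsets.UTF_8).length);
        System.out.println(insert(s, 2, "X"));
        System.out.println(erase(s, 2, 1));
    }
}
```

The three lengths differ for the same string, which is why only the code
point count is a safe basis for p and n.  Because both helpers move by
whole code points, they never cut a surrogate pair in half.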
