Correction below:
>> The text here says that "calculations of p and n values MUST be based on
>> Unicode code points". Are you sure that you mean code points? Given that
>> XMPP mandates the use of UTF-8, I think it would be safer and easier to
>> say "UTF-8-encoded code points" (the point about "Some Unicode encodings
>> use a variable number of bytes per Unicode character" is true but
>> hopefully irrelevant here).
>
> Unfortunately it's critically relevant. UTF-8 is simply a transmission
> medium and not a storage/processing layer medium for most XMPP libraries.
> More than 75% of XMPP libraries do provide the original UTF-8 data,

OOOPS - correction -- I meant "*more than 75% of XMPP libraries do NOT provide the original UTF-8 data*"

> incoming text is already converted to the programming language's Unicode
> format. The library typically converts incoming text to the programming
> language's native string format, which is almost never UTF-8.
>
> However, UTF-8/UTF-16/UTF-16LE/UTF-16BE/UTF-32 are all exactly equal
> when processed as code points. Therefore, we get consistency, no
> matter how the XMPP library presents Unicode strings. Much safer!
>
> Java doesn't have an 8-bit 'char' type -- its 'char' is 16 bits wide.
>
> One popular example, the Java 'smack' library (and Android 'asmack'),
> stores incoming XMPP XML as a 16-bit Unicode string (UTF-16). Java does
> not have an 8-bit 'char' type, except as byte arrays, which aren't easily
> processable. Incoming XML in this Java library is converted to UTF-16 for
> internal processing by the library, and outgoing XML is converted back to
> UTF-8.
>
> If I theoretically used UTF-8 byte offsets, I could get corruption if I
> accidentally split a multi-byte UTF-8 character (e.g. during real-time
> editing in a poorly implemented UTF-8 client). Splicing via Unicode code
> points is much safer and does not result in a character suddenly becoming
> a different character, and buggy clients will show very clear and easily
> fixable splicing errors (i.e. off-by-one errors), such as text
> inserted/deleted one character off.
>
> Code points are therefore the safest and most interoperable method, one
> that works regardless of whatever Unicode format the programming language
> uses (XMPP library, language limitations, etc.) and therefore allows
> XEP-0301 to be compatible with virtually all XMPP libraries worldwide.
> Just look at Java -- it doesn't even have an 8-bit 'char' type!
>
> Code points allow complete interop between clients of dissimilar
> Unicode storage formats. A client using char8 libpurple interops with
> char16 libpurple, a Java client using UTF-16 interops with C++ UTF-32, no
> problem -- because all strings convert to UTF-8 for transmission.
>
> See this flowchart:
> http://www.realjabber.org/flowchart_of_xmpp_rtt_path.pdf
>
> Can you suggest other alternatives that would satisfy this situation?
> (we tried)
> Or, given the info above, do you agree code points are the safest way?
>
> Sincerely,
> Mark Rejhon
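
To make the code-point argument above concrete, here is a minimal Python sketch. It is only an illustration, not text from XEP-0301: the sample string, the helper name `insert_at_code_point`, and the 0-based position `p` are all assumptions made for this example. It shows that the same text has different lengths in UTF-8 bytes and UTF-16 code units but one code point count, and that splicing at a byte offset can break a multi-byte character while splicing at a code point offset cannot.

```python
# Illustrative sketch only: sample text and helper name are hypothetical.

# "naïve 😀 ok": 'ï' takes 2 bytes in UTF-8; the emoji takes 4 bytes in
# UTF-8 and 2 code units (a surrogate pair) in UTF-16.
text = "na\u00efve \U0001F600 ok"

# Python str is indexed by code point, so len() counts code points.
code_points = len(text)                           # 10
utf8_bytes = len(text.encode("utf-8"))            # 14
utf16_units = len(text.encode("utf-16-le")) // 2  # 11


def insert_at_code_point(s: str, p: int, new_text: str) -> str:
    """Insert new_text before the code point at 0-based position p."""
    return s[:p] + new_text + s[p:]


# Splicing by code point position keeps every character intact.
spliced = insert_at_code_point(text, 3, "X")

# Splicing the raw UTF-8 bytes at offset 3 lands in the middle of the
# 2-byte sequence for 'ï', so the result is no longer valid UTF-8.
raw = text.encode("utf-8")
broken = raw[:3] + b"X" + raw[3:]
try:
    broken.decode("utf-8")
    byte_splice_ok = True
except UnicodeDecodeError:
    byte_splice_ok = False
```

A client that stores strings as UTF-16 (as Java does) would need to translate the code point position into its own code unit index before splicing, but the position value exchanged between clients stays the same.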
