Hi Marvin :)

On 12/7/20 4:22 PM, Marvin W wrote:
> On 04.12.20 21:23, Florian Schmaus wrote:
>> And I am in favor of code points because it allows us to aim for the extended grapheme cluster algorithm, while also allowing for the "simply count code points" fallback.
>
> XEP-0426 already discusses why it's using codepoints instead of grapheme clusters in its rationale: […]
>
> Also I forgot to mention that grapheme clusters are locale specific (example: "ch" is considered a single grapheme cluster in Slovak).

We do have xml:lang, don't we?

> Finally, I don't think that it's generally inappropriate to point inside a grapheme cluster (even if that's hard to implement). An example of where it seems appropriate to reference a part of a grapheme cluster is this: https://larma.de/grapheme.html

Fair point. (I am not sure about the relevance, though).

Let us ignore grapheme clusters for a moment and focus on XEP-0426: Have you considered Unicode normalization? Especially the case where a text that was originally in decomposed form is normalized to composed form: this would corrupt the code point indexes.
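To make the concern concrete, here is a minimal sketch in Python, where len() and indexing on str operate on code points. The sample text and the begin/end variables are made up for illustration and are not taken from XEP-0426.

```python
import unicodedata

# "é" in decomposed form: "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "cafe\u0301 ok"
# The same text after some hop or database normalized it to composed form.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))  # 8 code points
print(len(composed))    # 7 code points

# A reference to "ok", computed by the sender against the decomposed text.
begin = decomposed.index("ok")
end = begin + 2

print(decomposed[begin:end])  # "ok", as intended
print(composed[begin:end])    # "k": the indexes are off by one
```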
XMPP does not require any Unicode normal form. Nor does XML 1.0 (as far as I can tell). Furthermore, XMPP does not require that the Unicode form is maintained.
Hence it would be perfectly possible that the Unicode normal form of text exchanged via XMPP changes between hops. While I am not aware of an implementation that does that, it is not forbidden. And if you think that this will never happen, then please also keep in mind that stanzas may be persisted in a database, for example when put in the MAM archive, and a database engine may perform normalization of the data.
I think that due to this, XEP-0426 should specify that counting happens with the text in NFC form. Or am I missing something?
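A rough sketch of what I mean, again in Python. The helper name nfc_slice and the sample text are hypothetical, and this is not what XEP-0426 currently specifies; it only illustrates that counting on the NFC form gives the same answer whichever form the text arrives in.

```python
import unicodedata

def nfc_slice(body: str, begin: int, end: int) -> str:
    """Hypothetical helper: resolve code point indexes on the NFC form of body."""
    normalized = unicodedata.normalize("NFC", body)
    return normalized[begin:end]

# The same message as sent (decomposed) and as stored (composed).
as_sent = "cafe\u0301 ok"
as_stored = unicodedata.normalize("NFC", as_sent)

# Indexes counted on the NFC form select the same text on both sides.
print(nfc_slice(as_sent, 5, 7))    # "ok"
print(nfc_slice(as_stored, 5, 7))  # "ok"
```

The price is that every entity resolving a reference has to normalize first, and the indexes no longer address the raw text as received.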
- Florian
_______________________________________________
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: [email protected]
_______________________________________________
