Oops, the following should have been sent to the list.
On 19-12-2019 15:02, Ralph Meijer wrote:
On 19-12-2019 13:59, Andrew Nenakhov wrote:
Wed, 18 Dec 2019 at 20:12, Ralph Meijer <[email protected]>:
My assumption was that we are looking at character data on the
abstract layer /after/ parsing XML. You shouldn't see entities there
(they'd be resolved to their respective characters), nor should you
see <![CDATA[]]> wrappers.
Hm, please define the 'abstract' layer more precisely. Taking the example
from the XEP proposal, which is the true abstract layer:
this, image.png, or this: image.png? Or the layer with 'code points'?
Is it really any better than escaped XML text?
This approach is also not very practical. When you process stanzas on a
server, you most often just take the stanza as is and pass all
reference data along, without converting it to the abstract layer and
back. Plus, in a web client this means an additional escaping and
unescaping routine every time something is sent or displayed, because
browsers require their own escaping.
Abstract as in the abstract sequence of characters after parsing,
however represented by your programming language. If I parse an XML
document <blah><![CDATA[less < more]]></blah>, and request the text for
the `blah` node, I get an object that encodes the abstract sequence of
characters: `less < more`. In Python, for example, that'd be
represented by a unicode string object.
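To make that concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree. The second document, which uses an entity instead of a CDATA section, is my own addition for comparison; both should yield the same abstract character sequence.

```python
import xml.etree.ElementTree as ET

# Two serializations of the same character data: one wrapped in a
# CDATA section, one using a predefined entity.
cdata_doc = "<blah><![CDATA[less < more]]></blah>"
entity_doc = "<blah>less &lt; more</blah>"

# After parsing, neither the CDATA wrapper nor the entity is visible;
# the parser hands back a plain str.
cdata_text = ET.fromstring(cdata_doc).text
entity_text = ET.fromstring(entity_doc).text

print(cdata_text)                 # less < more
print(cdata_text == entity_text)  # True
print(len(cdata_text))            # 11
```

Any offsets computed on this string are independent of how the sender chose to escape the text on the wire.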
See also https://www.unicode.org/versions/Unicode12.1.0/ch03.pdf#G2212
for various definitions around characters, code points, glyphs,
graphemes, and the like. So yes, you'd be counting ZWJs and such for
your example, and I think it tallies up to 7 for just man/man/boy/boy,
without Fitzpatrick modifiers, hair variations, hair color, direction.
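As a rough check of that tally, the sketch below builds the man/man/boy/boy sequence by hand in Python 3, where len() on a str counts code points. The variable names are only illustrative, and the UTF-16 and UTF-8 figures are there to show how the count would differ if one counted code units or bytes instead.

```python
# Building blocks of the family emoji sequence.
MAN = "\U0001F468"  # U+1F468 MAN
BOY = "\U0001F466"  # U+1F466 BOY
ZWJ = "\u200D"      # U+200D ZERO WIDTH JOINER

# man ZWJ man ZWJ boy ZWJ boy: displayed as one glyph where supported.
family = ZWJ.join([MAN, MAN, BOY, BOY])

code_points = len(family)                           # 4 emoji + 3 ZWJ
utf16_units = len(family.encode("utf-16-le")) // 2  # surrogate pairs count twice
utf8_bytes = len(family.encode("utf-8"))            # 4 bytes per emoji, 3 per ZWJ

print(code_points, utf16_units, utf8_bytes)  # 7 11 25
```

A language whose native strings are UTF-16 would report 11 for the same text unless it counts code points explicitly.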
With regard to having to re-encode for HTML representation: unfortunate
as that may be, other situations require other transformations too,
such as encoding to UTF-8, before the text can be used in other
systems (UI, storage, etc.).
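A small sketch of what I mean, assuming the parsed text is kept as a Python str and only transformed at the boundary to another system. The variable names are hypothetical; html.escape and str.encode are from the standard library.

```python
import html

# The abstract character sequence, as obtained after XML parsing.
text = "less < more"

# Boundary transformations: each target system gets its own encoding.
for_html = html.escape(text)     # for display in a browser
for_wire = text.encode("utf-8")  # for storage or transmission

print(for_html)                  # less &lt; more
print(len(text), len(for_html))  # 11 14

# Offsets are computed on `text`, never on the escaped form.
assert html.unescape(for_html) == text
assert for_wire.decode("utf-8") == text
```

The escaped form is longer than the original, which is why offsets taken on escaped text would not line up across implementations.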
If you want consistent counting on all platforms and languages,
counting Unicode characters seems to be the best way forward.
--
ralphm