Re: [Standards] Proposed XMPP Extension: Character counting in message bodies

Ralph Meijer Fri, 20 Dec 2019 04:04:05 -0800

Oops, the following should have been sent to the list.


On 19-12-2019 15:02, Ralph Meijer wrote:

On 19-12-2019 13:59, Andrew Nenakhov wrote:
Wed, 18 Dec 2019 at 20:12, Ralph Meijer <[email protected]>:
    My assumption was that we are looking at character data on the
    abstract layer /after/ parsing XML. You shouldn't see entities there
    (they'd be resolved to their respective characters), nor should you
    see <![CDATA[ ]]> wrappers.
Hm, please define the 'abstract' layer more precisely. Citing the example from the XEP proposal, which is the true abstract layer? This: [image.png], or this: [image.png]? Or the layer with 'codepoints'? Is it really any better than escaped XML text?
This approach is also not very practical. When you do stanza processing on a server, most often you just take the stanza as is, passing all the referenced data along without converting it to the abstract layer and back. Plus, in a web client this means an additional escaping/unescaping routine every time something is sent or displayed, because browsers require their own escaping.
Abstract as in the abstract sequence of characters after parsing, however represented by your programming language. If I parse an XML document <blah><![CDATA[less < more]]></blah>, and request the text for the `blah` node, I get an object that encodes the abstract sequence of characters: `less < more`. In Python, for example, that'd be represented by a unicode string object.
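
To illustrate the point about the parsed layer, here is a minimal sketch in Python. It assumes the standard-library xml.etree.ElementTree parser, which is just a convenient choice for the example and not something the XEP prescribes. The `docs` and `texts` names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Three different serializations of the same character data:
# a CDATA section, a predefined entity, and a numeric character reference.
docs = [
    "<blah><![CDATA[less < more]]></blah>",
    "<blah>less &lt; more</blah>",
    "<blah>less &#60; more</blah>",
]

# After parsing, the text of the `blah` node is the same abstract
# sequence of characters in every case.
texts = [ET.fromstring(doc).text for doc in docs]

for doc, text in zip(docs, texts):
    print(f"{doc!r} -> {text!r} ({len(text)} characters)")
```

Whatever the wire form looked like, counting on the parsed text gives the same number for all three documents.
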
See also https://www.unicode.org/versions/Unicode12.1.0/ch03.pdf#G2212 for various definitions around characters, code points, glyphs, graphemes, and the like. So yes, you'd be counting ZWJs and such for your example, and I think it tallies up to 7 for just man/man/boy/boy, without Fitzpatrick modifiers, hair variations, hair color, direction.
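
The tally of 7 can be checked with a short Python 3 sketch, assuming the example is the family emoji built from man, man, boy, boy joined by ZWJ (U+200D). The variable names are only illustrative. The UTF-8 and UTF-16 figures are included to show how counts diverge once you count encoding units instead of code points.

```python
# man ZWJ man ZWJ boy ZWJ boy, written with escapes so the source stays ASCII.
MAN = "\U0001F468"
BOY = "\U0001F466"
ZWJ = "\u200D"
family = ZWJ.join([MAN, MAN, BOY, BOY])

# In Python 3, len() on a str counts Unicode code points.
code_points = len(family)

# Counts in encoding units differ per encoding.
utf8_bytes = len(family.encode("utf-8"))
utf16_units = len(family.encode("utf-16-le")) // 2

print(code_points, utf8_bytes, utf16_units)
```

Four emoji plus three joiners give 7 code points, while the same string is 25 bytes in UTF-8 and 11 code units in UTF-16, even though it typically renders as a single glyph.
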
With regard to having to re-encode for HTML representation, as unfortunate as that may be, other situations require other transformations, like encoding in UTF-8, for the text to be used in other systems (UI, storage, etc.).
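
As a rough sketch of that, again in Python with only the standard library: the same abstract string is transformed differently for HTML display and for storage or the wire, while the count is taken on the abstract string itself. The sample text and variable names are made up for the example.

```python
import html

# The abstract sequence of characters, as obtained after XML parsing.
text = "café < thé"

# Transformation for display in a browser: HTML escaping.
for_html = html.escape(text)

# Transformation for storage or the wire: UTF-8 encoding.
for_wire = text.encode("utf-8")

# Counting happens on the abstract string, not on either derived form.
print(len(text), len(for_html), len(for_wire))

# Both transformations are reversible, so nothing is lost.
assert html.unescape(for_html) == text
assert for_wire.decode("utf-8") == text
```

The escaped and encoded forms have different lengths from each other and from the original, which is why neither is a good basis for a count that has to agree across implementations.
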
If you want consistent counting on all platforms and languages, counting Unicode characters seems to be the best way forward.

--
ralphm

_______________________________________________
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: [email protected]
_______________________________________________
