On Monday, 12 March 2018 16:31:57 CET Sam Whited wrote:
> On Mon, Mar 12, 2018, at 10:17, Jonas Wielicki wrote:
> > This is true, XML restricts to Scalar Values. Thanks, I didn’t know that
> > term.
> Your entire reply seems to be hinged on the fact that all XML libraries
> return scalar values

No, I’m arguing in terms of the XML standard, not implementations. But let’s
go down that route.

> , but this isn't true to my knowledge. Most XML
> libraries are going to do whatever the language they're written in does. If
> you're using Python 3, you're going to get scalar values

Yep. (There are numerous XML library bindings for Python, all of which, to my
knowledge, return str (a sequence of code points); I’m not quoting each
individual one.)

> (assuming it's
> returning a string which in Python 3 is scalar values… more or less), if
> you use libxml2, or expat, or the

libxml2 uses UTF-8 internally, but aliases the xmlChar type, which is intended
to be used with the xmlUTF8 family of functions providing access to the
code points. [1]

libexpat uses UTF-8 or UTF-16 (compile-time switch, XML_UNICODE) and leaves the
developer alone with that, as far as I can tell. [2]

> Go encoding/xml library, etc. you're probably going to get bytes.

This one’s funny. It doesn’t specify which encoding you get when you unmarshal
into a []byte. string is clear: it is UTF-8 internally, but string uses the
concept of runes, which are code points [3]. Unmarshaling has the option of
using either.

I have no knowledge about Java or Rust, though. AFAIK with JavaScript one is
doomed anyways, since it can only do UTF-16, which will be a PITA no matter
which route we go.

Acknowledging that XMPP enforces UTF-8 (so UTF-8 would be a reasonable choice
for everything in an implementation which chooses to go down that route), I’m
not going to die on this hill.
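
To make the difference between code point offsets and UTF-8 byte offsets concrete, here is a minimal Python 3 sketch. The sample string and the helper names (is_valid_utf8, byte_range) are made up for illustration and are not taken from any of the libraries mentioned above.

```python
# Hypothetical sample text: "n", U+00E4 (a with diaeresis), U+1F642 (emoji), "!"
text = "n\u00e4\U0001F642!"

# A Python 3 str is indexed by code point.
print(len(text))   # 4 code points

# The same text encoded as UTF-8, as it would travel over an XMPP stream.
utf8 = text.encode("utf-8")
print(len(utf8))   # 8 bytes (1 + 2 + 4 + 1)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def byte_range(s: str, start: int, end: int) -> tuple[int, int]:
    """Translate the code point range [start, end) into UTF-8 byte offsets."""
    return len(s[:start].encode("utf-8")), len(s[:end].encode("utf-8"))


# A range expressed in code points always selects whole characters.
selected = text[1:3]

# A byte range can end in the middle of a multi-byte sequence:
# utf8[0:2] is "n" plus only the first byte of U+00E4.
print(is_valid_utf8(utf8[0:2]))   # False

# Converting the code point range gives byte offsets on character boundaries.
start, end = byte_range(text, 1, 3)
print(utf8[start:end].decode("utf-8") == selected)   # True
```

An implementation that already holds code points needs the extra encode step shown in byte_range to produce byte offsets, while one that works on raw UTF-8 has to check boundaries as is_valid_utf8 does.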

Still, using bytes of UTF-8 in a layer which clearly operates on Character Data
defined in terms of Scalar Values is breaking abstractions for no gain. On the
contrary:

- we’ll have to add wording on how to handle ranges which start or end in the
  middle of a UTF-8 multi-byte sequence;
- implementations which already get code points need an additional
  encode-to-UTF-8 step plus validation;
- implementations which already work on UTF-8 need the utility functions to
  operate on code points anyways.

kind regards,
Jonas

   [1]: http://www.xmlsoft.org/html/libxml-xmlstring.html
   [2]: https://www.xml.com/pub/1999/09/expat/index.html
        If this isn’t up-to-date, sorry. I wasn’t able to find anything more
        recent on the expat website.
   [3]: https://blog.golang.org/strings
_______________________________________________
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: standards-unsubscr...@xmpp.org
_______________________________________________