On Mon, Mar 12, 2018, at 11:16, Jonas Wielicki wrote:
> libxml2 uses UTF-8 internally, but aliases the xmlChar type, which is
> intended to be used with the xmlUTF8 family of functions providing access
> to the Codepoints. [1]

As far as I can tell you just get bytes externally, and that XML string type is only for validation internally. I may be wrong, though: it's been a while since I've used libxml2, and I haven't used it extensively outside of other languages' wrappers.

> libexpat uses UTF-8 or UTF-16 (compile time switch, XML_UNICODE) and leaves
> the developer alone with that, as far as I can tell. [2]

Right, bytes.

> This one’s funny. It doesn’t specify which encoding you get when you
> unmarshal into a []byte. string is clear, and UTF-8 internally, but string
> uses the concept of runes which are Codepoints [3]. Unmarshaling has the
> option of using either.

Go strings aren't runes, they're byte strings. String literals are guaranteed to be valid UTF-8 (as long as they contain no byte-level escapes), but a string value in general can hold arbitrary bytes. Runes are a distinct concept.

> I have no knowledge about Java or Rust though.

Rust will more or less do either happily. You can get bytes or code points.

> Acknowledging that XMPP enforces UTF-8 (so it would be a reasonable choice to
> use UTF-8 for everything in an implementation which chooses to go down that
> route), I’m not going to die on this hill.

> Still, using bytes of UTF-8 in a
> layer which clearly operates on Character Data defined in terms of Scalar
> Values

I am disputing this; it does not "clearly" operate on scalar values in any way.

> is breaking abstractions for no gain (on the contrary, we’ll have to
> add wording on how to handle mid-UTF-8-word ranges;

Now we have to add the same wording about handling mid-scalar-value ranges. There's no difference, except that I have to decode UTF-8 when I might otherwise not have to do so (depending on the system I'm using, my requirements, etc.).

> in the case of
> implementations which already get code points one needs an additional
> encode-to-utf8-step + validation; implementations which already work on
> UTF-8 need the utility functions to operate on code points anyways).

This will always be a problem. In the case of a system that operates on UTF-8 and hands me bytes, I now have to re-decode the UTF-8 even though I know it's already valid. This isn't an argument for one or the other.

—Sam
_______________________________________________
Standards mailing list
Info: https://mail.jabber.org/mailman/listinfo/standards
Unsubscribe: standards-unsubscr...@xmpp.org
_______________________________________________
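
For reference, the byte-versus-code-point distinction argued over above can be sketched in Go. This is only a minimal illustration, not wording from any XEP: the helper names `codePointRangeToBytes` and `isBoundary` are made up for this example. It shows that a range counted in code points has to be translated by walking the UTF-8, while a range counted in bytes needs a check that it does not start or end mid-sequence.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// codePointRangeToBytes (hypothetical helper) converts a half-open range
// [begin, end) counted in code points into the matching byte range of the
// UTF-8 string s. ok is false if s is not valid UTF-8 or the range does
// not fit inside s.
func codePointRangeToBytes(s string, begin, end int) (bBegin, bEnd int, ok bool) {
	if begin < 0 || end < begin || !utf8.ValidString(s) {
		return 0, 0, false
	}
	bBegin, bEnd = -1, -1
	n := 0 // number of code points seen so far
	// Ranging over a string yields the byte offset at which each rune starts.
	for i := range s {
		if n == begin {
			bBegin = i
		}
		if n == end {
			bEnd = i
		}
		n++
	}
	// Either boundary may sit at the very end of the string.
	if begin == n {
		bBegin = len(s)
	}
	if end == n {
		bEnd = len(s)
	}
	if bBegin < 0 || bEnd < 0 {
		return 0, 0, false
	}
	return bBegin, bEnd, true
}

// isBoundary (hypothetical helper) reports whether byte offset i falls on a
// code point boundary. It assumes s is already known to be valid UTF-8.
func isBoundary(s string, i int) bool {
	if i == 0 || i == len(s) {
		return true
	}
	return i > 0 && i < len(s) && utf8.RuneStart(s[i])
}

func main() {
	s := "na\u00efve \u263a" // U+00EF takes 2 bytes, U+263A takes 3

	// The same string has different lengths depending on the unit.
	fmt.Println(len(s), utf8.RuneCountInString(s))

	// Code point range [2, 3) covers only U+00EF.
	b, e, ok := codePointRangeToBytes(s, 2, 3)
	fmt.Println(b, e, ok, s[b:e] == "\u00ef")

	// Byte offset 3 lands inside the 2-byte sequence for U+00EF.
	fmt.Println(isBoundary(s, 2), isBoundary(s, 3))
}
```

Either way some translation or validation step is needed; which one is cheaper depends on whether the XML layer hands over bytes or decoded code points.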