On Fri, Dec 4, 2020, at 20:23, Florian Schmaus wrote:
> I begin to feel that a lot of your rationale is based on the idea that
> you always (/often?) have access to the raw UTF-8 bytes as they
> appeared on the wire.

Yes, most of it is.

> While this is probably true for languages where the String type's native
> encoding is also UTF-8. It is usually not true for others. For
> example, widely used XML parser in Java will return Java's String
> type, which is UTF-16 (or ISO-8859-1 [1]) based.

Yes, this is fair. I was thinking you could probably always get the raw
bytes, but it does look like a lot of these *only* do DOM-based parsing
and don't keep the original representation.

> However, given that there is a wide variety here, I am not sure if it
> is worth to take any of that into consideration.

Yes, fair enough.

> Instead, my rationale is based on the idea that you always have
> access to the Unicode code points of the textual content obtained
> from the XML.

I do not have that access without converting from UTF-8 to code points
in the hot path, where the conversion would be inappropriate. It's
effectively the same thing: I don't want to convert from bytes to code
points, you don't want to convert from code points to bytes. Some
languages will have to do the conversion either way, so it seems worth
using the representation that allows the most flexibility with the
least work on, e.g., IoT devices written in C that are optimizing for
performance, where passing along the bytes as received on the wire
(possibly with some validation that the range is accurate) is
acceptable.

> And I am in favor of code points because it allows us to aim for the
> extended grapheme cluster algorithm, while also allowing for the
> "simply count code points" fallback.

If you do bytes you could also easily convert to code points and then
to grapheme clusters. It also allows for the simple "count code points"
or "count bytes" fallback.

--Sam
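
P.S. To make the "count bytes" / "count code points" fallbacks concrete, here is a minimal Python sketch. The function names are hypothetical and chosen only for illustration; they do not come from any XEP or library. It assumes the input is already valid UTF-8, and it does not attempt grapheme cluster segmentation, which would need the Unicode segmentation rules on top of the code points.

```python
def count_bytes(raw: bytes) -> int:
    """Length in UTF-8 bytes, exactly as received on the wire."""
    return len(raw)


def count_code_points(raw: bytes) -> int:
    """Count code points in valid UTF-8 without building a decoded string.

    Each code point contributes exactly one byte that is NOT a
    continuation byte (continuation bytes have the form 0b10xxxxxx).
    """
    return sum(1 for b in raw if (b & 0xC0) != 0x80)


def byte_offset_to_code_point_index(raw: bytes, offset: int) -> int:
    """Map a byte offset to a code point index (hypothetical helper).

    Rejects offsets that are out of range or that fall in the middle of
    a multi-byte sequence, which is the kind of range validation a
    receiver of byte offsets might want to do.
    """
    if offset < 0 or offset > len(raw):
        raise ValueError("offset out of range")
    if offset < len(raw) and (raw[offset] & 0xC0) == 0x80:
        raise ValueError("offset falls inside a code point")
    return count_code_points(raw[:offset])


def count_utf16_code_units(text: str) -> int:
    """Length as a UTF-16 based string type would typically report it."""
    return len(text.encode("utf-16-le")) // 2


# "héllo " followed by a thumbs-up emoji and a skin tone modifier.
sample = "h\u00e9llo \U0001F44D\U0001F3FD"
wire = sample.encode("utf-8")
```

For this sample the byte count, the code point count, and the UTF-16 code unit count all differ, and the emoji plus its modifier would be a single extended grapheme cluster, so whichever unit is picked, implementations on the other side need exactly one conversion step like the ones above.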
