Thanks for the review! In case you're interested, I'd also welcome feedback on Locale Identifiers <https://goo.gl/kizkrm>
Mark

On Thu, Nov 22, 2018 at 11:27 AM Henri Sivonen <hsivo...@hsivonen.fi> wrote:

> On Tue, Oct 2, 2018 at 3:04 PM Mark Davis ☕️ <m...@macchiato.com> wrote:
>
>>> * The Python 3.3 model mentions the disadvantages of memory usage
>>> cliffs but doesn't mention the associated performance cliffs. It would
>>> be good to also mention that when a string manipulation causes the
>>> storage to expand or contract, there's a performance impact that's not
>>> apparent from the nature of the operation if the programmer's
>>> intuition works on the assumption that the programmer is dealing with
>>> UTF-32.
>>
>> The focus was on immutable string models, but I didn't make that clear.
>> Added some text.
>
> Thanks.
>
>>> * The UTF-16/Latin1 model is missing. It's used by SpiderMonkey, DOM
>>> text node storage in Gecko, (I believe but am not 100% sure) V8 and,
>>> optionally, HotSpot
>>> (https://docs.oracle.com/javase/9/vm/java-hotspot-virtual-machine-performance-enhancements.htm#JSJVM-GUID-3BB4C26F-6DE7-4299-9329-A3E02620D50A).
>>> That is, text has UTF-16 semantics, but if the high half of every code
>>> unit in a string is zero, only the lower half is stored. This has
>>> properties analogous to the Python 3.3 model, except non-BMP doesn't
>>> expand to UTF-32 but uses UTF-16 surrogate pairs.
>>
>> Thanks, will add.
>
> V8 source code shows it has a OneByteString storage option:
> https://cs.chromium.org/chromium/src/v8/src/objects/string.h?sq=package:chromium&g=0&l=494
> From hearsay, I'm convinced that it means Latin1, but I've failed to find
> a clear quotable statement from a V8 developer to that effect.
>
>>> 3) Type-system-tagged UTF-8 (the model of Rust): Valid UTF-8 buffers
>>> have a different type in the type system than byte buffers. To go from
>>> a byte buffer to a UTF-8 buffer, UTF-8 validity is checked. Once data
>>> has been tagged as valid UTF-8, the validity is trusted completely so
>>> that iteration by code point does not have "else" branches for
>>> malformed sequences. If data that the type system indicates to be
>>> valid UTF-8 wasn't actually valid, it would be nasal demon time. The
>>> language has a default "safe" side and an opt-in "unsafe" side. The
>>> unsafe side is for performing low-level operations in a way where the
>>> responsibility of upholding invariants is moved from the compiler to
>>> the programmer. It's impossible to violate the UTF-8 validity
>>> invariant using the safe part of the language.
>>
>> Added a quote based on this; please check if it is ok.
>
> Looks accurate. Thanks.
>
> --
> Henri Sivonen
> hsivo...@hsivonen.fi
> https://hsivonen.fi/
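
For anyone following the thread, the type-system-tagged model from point 3 can be sketched in a few lines of Rust. This is only an illustration, not text from the document under review: the helper names `tag_as_utf8` and `count_code_points` are hypothetical, and the only standard-library calls used are `std::str::from_utf8`, `std::str::from_utf8_unchecked`, and `str::chars`.

```rust
use std::str;

/// Hypothetical helper: try to tag a byte buffer as UTF-8.
/// This is the one place where validity is actually checked.
fn tag_as_utf8(bytes: &[u8]) -> Option<&str> {
    str::from_utf8(bytes).ok()
}

/// Hypothetical helper: count code points in already-tagged text.
/// The signature takes `&str`, so the caller cannot pass unvalidated
/// bytes from safe code, and no malformed-sequence case is handled here.
fn count_code_points(text: &str) -> usize {
    text.chars().count()
}

fn main() {
    // "naïve 😀" encoded as UTF-8, and a buffer ending in a stray 0xFF byte.
    let valid: &[u8] = "na\u{ef}ve \u{1F600}".as_bytes();
    let invalid: &[u8] = &[0x66, 0x6f, 0xff];

    // Safe side: the conversion either succeeds or reports failure.
    match tag_as_utf8(valid) {
        Some(text) => println!("{} code points", count_code_points(text)),
        None => println!("not valid UTF-8"),
    }
    assert!(tag_as_utf8(invalid).is_none());

    // Opt-in unsafe side: the check is skipped and the programmer takes
    // over responsibility for the invariant. Passing `invalid` here
    // instead would be undefined behavior.
    let trusted: &str = unsafe { str::from_utf8_unchecked(valid) };
    assert_eq!(trusted, "na\u{ef}ve \u{1F600}");
}
```

The point of the sketch is that `count_code_points` cannot be reached with unvalidated bytes except through the `unsafe` block, which is where responsibility for the invariant moves from the compiler to the programmer.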