> On 11. Jan 2018, at 06:36, Chris Lattner via swift-dev <swift-dev@swift.org> wrote:
>
> On Jan 10, 2018, at 11:55 AM, Michael Ilseman via swift-dev <swift-dev@swift.org> wrote:
>
>> (A gist-formatted version of this email can be found at
>> https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f)
>
> I’m very very excited for this, thank you for the detailed writeup and consideration of the effects and tradeoffs involved.
>
>> Given that ordering is not fit for human consumption, but rather machine processing, it might as well be fast. The current ordering differs on Darwin and Linux platforms, with Linux in particular suffering from poor performance due to the choice of ordering (UCA with DUCET) and older versions of ICU. Instead, the [String Comparison Prototype](https://github.com/apple/swift/pull/12115) provides a simpler ordering that allows for many common-case fast paths and optimizations. For all the Unicode enthusiasts out there, this is the lexicographical ordering of NFC-normalized UTF-16 code units.
>
> Thank you for fixing this. Your tradeoffs make perfect sense to me.
>
>> ### Small String Optimization
>> ..
>> For example, assuming a 16-byte String struct and 8 bits used for flags and discriminators (including discriminators for which small form a String is in), 120 bits are available for a small string payload. 120 bits can hold 7 UTF-16 code units, which is sufficient for most graphemes and many common words and separators. 120 bits can also fit 15 ASCII/UTF-8 code units without any packing, which suffices for many programmer/system strings (which have a strong skew towards ASCII).
>>
>> We may also want a compact 5-bit encoding for formatted numbers, such as 64-bit memory addresses in hex, `Int.max` in base-10, and `Double` in base-10, which would require 18, 19, and 24 characters respectively. 120 bits with a 5-bit encoding would fit all of these. This would speed up the creation and interpolation of many strings containing numbers.
>
> I think it is important to consider that having more special cases and different representations slows down nearly *every* operation on string, because they have to check and detangle all of the possible representations. Given the ability to hold 15 digits of ASCII, I don’t see why it would be worthwhile to worry about a 5-bit representation for digits. String should not be an Any!
>
> The tradeoff here is that you’d be biasing the design to favor creation and interpolation of many strings containing *large* numbers, at the cost of general string performance everywhere. This doesn’t sound like a good tradeoff to me, particularly when people writing extremely performance-sensitive code probably won’t find it good enough anyway.
>
>> Final details are still in exploration. If the construction and interpretation of small bit patterns can remain behind a resilience barrier, new forms could be added in the future. However, the performance impact of this would need to be carefully evaluated.
>
> I’d love to see performance numbers on this.
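As a rough illustration of the 16-byte layout sketched in the quoted proposal: the type name, bit assignments, and accessors below are hypothetical and are not the stdlib's actual representation, but they show how one byte of flags/discriminators leaves a 15-byte inline payload, and how the small-form check can be a single flag test.

    // Hypothetical sketch of a 16-byte small-string layout: one byte of
    // flags/discriminators plus a 15-byte inline payload (enough for
    // 15 ASCII/UTF-8 code units, 7 UTF-16 code units, or 24 five-bit
    // characters, matching the arithmetic in the proposal above).
    struct SmallStringLayoutSketch {
        var flags: UInt8   // bit 0: "is small", bit 1: "contents are ASCII"
        var payload: (UInt8, UInt8, UInt8, UInt8, UInt8, UInt8, UInt8, UInt8,
                      UInt8, UInt8, UInt8, UInt8, UInt8, UInt8, UInt8)

        var isSmall: Bool { return flags & 0b01 != 0 }
        var isASCII: Bool { return flags & 0b10 != 0 }
    }

    // The whole value still fits in two machine words on a 64-bit platform.
    assert(MemoryLayout<SmallStringLayoutSketch>.size == 16)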
> Have you considered a design where you have exactly one small string representation: a sequence of 15 UTF-8 bytes? This holds your 15 bytes of ASCII, probably more non-ASCII characters on average than 7 UTF-16 code units, and only needs one discriminator branch on the entry point of hot functions that want to touch the bytes. If you have a lot of algorithms that are sped up by knowing they have ASCII, you could go with two bits: “is small” and “is ASCII”, where “is ASCII” is set in both the small-string and normal-string cases.
>
> Finally, what tradeoffs do you see between a 1-word vs. 2-word string? Are we really destined to have 2 words? That’s still much better than the 3 words we have now, but for some workloads it is significant bloat.
>
>> ### Unmanaged Strings
>>
>> Such unmanaged strings can be used when the lifetime of the storage is known to outlive all uses.
>
> Just like StringRef! +1, this concept can be a huge performance win… but can also be a huge source of UB if used wrong. :-(
>
>> As such, they don’t need to participate in reference counting. In the future, perhaps with ownership annotations or sophisticated static analysis, Swift could convert managed strings into unmanaged ones as an optimization when it knows the contents would not escape. Similarly, in the future, a String constructed from a Substring that is known to not outlive the Substring’s owner can use an unmanaged form to skip copying the contents. This would benefit String’s status as the recommended common-currency type for API design.
>
> This could also have implications for StaticString.
>
>> Some kind of unmanaged string construct is an often-requested feature for highly performance-sensitive code. We haven’t thought too deeply about how best to surface this as API, and it can be sliced and diced many ways. Some considerations:
>
> Other questions/considerations:
>
> - Here and now, could we get the vast majority of this benefit by having the ABI pass string arguments as +0 and guarantee their lifetime across calls? What is the tradeoff of that?
> - Does this concept even matter if/when we can write an argument as “borrowed String”? I suppose this is a bit more general in that it should be usable in properties and other things that the (initial) ownership design isn’t scoped to support, since it doesn’t have lifetimes.
> - Array has exactly the same sort of concern; why is this a string-specific problem?
> - How does this work with mutation? Wouldn’t we really have to have something like MutableArrayRef, since it’s talking about the mutability of the elements, not something that var/let can support?
>
> When I think about this, it seems like a “non-owning *slice*” of some sort. If you went that direction, then you wouldn’t have to build it into the String currency type, just like it isn’t in LLVM.
>
>> ### Performance Flags
>
> Nice.
>
>> ### Miscellaneous
>>
>> There are many other tweaks and improvements possible under the new ABI prior to stabilization, such as:
>>
>> * Guaranteed nul-termination for String’s storage classes for faster C bridging.
>
> This is probably best as another perf flag.
>
>> * Unification of String and Substring’s various Views.
>> * Some form of opaque string, such as an existential, which can be used as an extension point.
>> * More compact/performant indices.
>
> What is the current theory on eager vs. lazy bridging with Objective-C? Can we get to an eager bridging model? That alone would dramatically simplify and speed up every operation on a Swift string.
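To make the “non-owning slice” idea above concrete, here is a minimal sketch assuming a hypothetical UnmanagedUTF8String type (not an actual or proposed stdlib API): it wraps a buffer pointer, participates in no reference counting, and is only valid for as long as the underlying storage is.

    // Hypothetical non-owning view of UTF-8 storage, in the spirit of LLVM's
    // StringRef. Nothing here is reference counted; the caller must guarantee
    // the storage outlives every use, otherwise behavior is undefined.
    struct UnmanagedUTF8String {
        let bytes: UnsafeBufferPointer<UInt8>

        var count: Int { return bytes.count }
        var isEmpty: Bool { return bytes.isEmpty }

        // Materialize an owning String only when one is actually required.
        func copyToString() -> String {
            return String(decoding: bytes, as: UTF8.self)
        }
    }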
>
>> ## String Ergonomics
>
> I need to run now, but I’ll read and comment on this section when I have a chance.
>
> That said, just today I had to write the code below, and the ‘charToByte’ part of it is absolutely tragic. Are there any thoughts about how to improve character literal processing?
>
> -Chris
>
> func decodeHex(_ string: String) -> [UInt8] {
>   func charToByte(_ c: String) -> UInt8 {
>     return c.utf8.first! // swift makes char processing grotty. :-(
>   }
>
>   func hexToInt(_ c: UInt8) -> UInt8 {
>     switch c {
>     case charToByte("0")...charToByte("9"): return c - charToByte("0")
>     case charToByte("a")...charToByte("f"): return c - charToByte("a") + 10
>     case charToByte("A")...charToByte("F"): return c - charToByte("A") + 10
>     default: fatalError("invalid hexadecimal character")
>     }
>   }
>
>   var result: [UInt8] = []
>
>   assert(string.count & 1 == 0, "must get a pair of hexadecimal characters")
>   var it = string.utf8.makeIterator()
>   while let byte1 = it.next(),
>         let byte2 = it.next() {
>     result.append((hexToInt(byte1) << 4) | hexToInt(byte2))
>   }
>
>   return result
> }
>
> print(decodeHex("01FF"))
> print(decodeHex("8001"))
> print(decodeHex("80a1bcd3"))
In this particular example, you can use UInt8.init(ascii:) to make it a little nicer (no need for the charToByte function):

func hexToInt(_ c: UInt8) -> UInt8 {
  switch c {
  case UInt8(ascii: "0")...UInt8(ascii: "9"): return c - UInt8(ascii: "0")
  case UInt8(ascii: "a")...UInt8(ascii: "f"): return c - UInt8(ascii: "a") + 10
  case UInt8(ascii: "A")...UInt8(ascii: "F"): return c - UInt8(ascii: "A") + 10
  default: fatalError("invalid hexadecimal character")
  }
}
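Putting the two pieces together, the full example can then be written without charToByte at all; the following is an untested sketch with the same behavior as the original:

    func decodeHex(_ string: String) -> [UInt8] {
        func hexToInt(_ c: UInt8) -> UInt8 {
            switch c {
            case UInt8(ascii: "0")...UInt8(ascii: "9"): return c - UInt8(ascii: "0")
            case UInt8(ascii: "a")...UInt8(ascii: "f"): return c - UInt8(ascii: "a") + 10
            case UInt8(ascii: "A")...UInt8(ascii: "F"): return c - UInt8(ascii: "A") + 10
            default: fatalError("invalid hexadecimal character")
            }
        }

        assert(string.count & 1 == 0, "must get a pair of hexadecimal characters")

        var result: [UInt8] = []
        var it = string.utf8.makeIterator()
        // Consume two hex digits at a time: high nibble, then low nibble.
        while let high = it.next(), let low = it.next() {
            result.append((hexToInt(high) << 4) | hexToInt(low))
        }
        return result
    }

    print(decodeHex("01FF")) // [1, 255]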