On Jan 10, 2018, at 11:55 AM, Michael Ilseman via swift-dev <swift-dev@swift.org> wrote:

> (A gist-formatted version of this email can be found at
> https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f)
I’m very, very excited for this. Thank you for the detailed writeup and consideration of the effects and tradeoffs involved.

> Given that ordering is not fit for human consumption, but rather machine
> processing, it might as well be fast. The current ordering differs on Darwin
> and Linux platforms, with Linux in particular suffering from poor performance
> due to choice of ordering (UCA with DUCET) and older versions of ICU.
> Instead, [String Comparison Prototype](https://github.com/apple/swift/pull/12115)
> provides a simpler ordering that allows for many common-case fast-paths and
> optimizations. For all the Unicode enthusiasts out there, this is the
> lexicographical ordering of NFC-normalized UTF-16 code units.

Thank you for fixing this. Your tradeoffs make perfect sense to me.

> ### Small String Optimization
> ...
> For example, assuming a 16-byte String struct and 8 bits used for flags and
> discriminators (including discriminators for which small form a String is
> in), 120 bits are available for a small string payload. 120 bits can hold 7
> UTF-16 code units, which is sufficient for most graphemes and many common
> words and separators. 120 bits can also fit 15 ASCII/UTF-8 code units without
> any packing, which suffices for many programmer/system strings (which have a
> strong skew towards ASCII).
>
> We may also want a compact 5-bit encoding for formatted numbers, such as
> 64-bit memory addresses in hex, `Int.max` in base-10, and `Double` in
> base-10, which would require 18, 19, and 24 characters respectively. 120 bits
> with a 5-bit encoding would fit all of these. This would speed up the
> creation and interpolation of many strings containing numbers.

I think it is important to consider that having more special cases and different representations slows down nearly *every* operation on string, because they all have to check for and detangle all of the possible representations. Given the ability to hold 15 digits of ASCII, I don’t see why it would be worthwhile to worry about a 5-bit representation for digits. The tradeoff here is that you’d be biasing the design to favor creation and interpolation of strings containing *large* numbers, at the cost of general string performance everywhere. This doesn’t sound like a good tradeoff to me, particularly when people writing extremely performance-sensitive code probably won’t find it good enough anyway.

> Final details are still in exploration. If the construction and
> interpretation of small bit patterns can remain behind a resilience barrier,
> new forms could be added in the future. However, the performance impact of
> this would need to be carefully evaluated.

I’d love to see performance numbers on this. Have you considered a design where you have exactly one small string representation: a sequence of 15 UTF-8 bytes? This holds your 15 bytes of ASCII, probably more non-ASCII characters on average than 7 UTF-16 code units, and only needs one discriminator branch on the entry point of hot functions that want to touch the bytes. If you have a lot of algorithms that are sped up by knowing they have ASCII, you could go with two bits: “is small” and “isASCII”, where isASCII is set in both the small string and normal string cases. (A sketch of this single-form layout follows below.)

Finally, what tradeoffs do you see between a 1-word vs 2-word string? Are we really destined to have 2 words? That’s still much better than the 3 words we have now, but for some workloads it is a significant bloat.
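To make the single-form idea concrete, here is a minimal sketch, assuming a 16-byte struct with the flag bits and inline length packed into the high byte of the second word. The name StringSketch and all bit positions are hypothetical, not the actual stdlib layout:

struct StringSketch {
    var words: (UInt64, UInt64)  // 16 bytes total

    // Hypothetical flag bits in the high byte of the second word.
    var isSmall: Bool { return (words.1 >> 63) & 1 != 0 }  // discriminator
    var isASCII: Bool { return (words.1 >> 62) & 1 != 0 }  // set in both forms

    // Hot paths pay exactly one discriminator branch on entry.
    var utf8Count: Int {
        if isSmall {
            return Int((words.1 >> 56) & 0xF)  // inline length, 0...15
        }
        return largeFormCount()
    }

    func largeFormCount() -> Int {
        // Out-of-line form: count lives in the heap allocation (omitted here).
        fatalError("out-of-line form is not part of this sketch")
    }
}

The remaining 15 bytes of the two words would hold the inline UTF-8 payload in the small form, or the pointer and count in the normal form.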
> ### Unmanaged Strings
>
> Such unmanaged strings can be used when the lifetime of the storage is known
> to outlive all uses.

Just like StringRef! +1, this concept can be a huge performance win… but it can also be a huge source of UB if used wrong. :-(

> As such, they don’t need to participate in reference counting. In the future,
> perhaps with ownership annotations or sophisticated static analysis, Swift
> could convert managed strings into unmanaged ones as an optimization when it
> knows the contents would not escape. Similarly in the future, a String
> constructed from a Substring that is known to not outlive the Substring’s
> owner can use an unmanaged form to skip copying the contents. This would
> benefit String’s status as the recommended common-currency type for API
> design. This could also have implications for StaticString.
>
> Some kind of unmanaged string construct is an often-requested feature for
> highly performance-sensitive code. We haven’t thought too deeply about how
> best to surface this as API and it can be sliced and diced many ways. Some
> considerations:

Other questions/considerations:

- Here and now, could we get the vast majority of this benefit by having the ABI pass string arguments as +0 and guarantee their lifetime across calls? What is the tradeoff of that?
- Does this concept even matter if/when we can write an argument as “borrowed String”? I suppose this is a bit more general, in that it should be usable as properties and other things that the (initial) ownership design isn’t scoped to support, since it doesn’t have lifetimes.
- Array has exactly the same sort of concern; why is this a string-specific problem?
- How does this work with mutation? Wouldn’t we really have to have something like MutableArrayRef, since it’s talking about the mutability of the elements, not something that var/let can support?

When I think about this, it seems like a “non-owning *slice*” of some sort. If you went that direction, then you wouldn’t have to build it into the String currency type, just like it isn’t in LLVM.
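For illustration, a hedged sketch of what such a non-owning slice could look like, analogous to LLVM’s StringRef. UnmanagedUTF8View is a hypothetical name, not proposed API:

struct UnmanagedUTF8View {
    // A borrowed view of UTF-8 bytes; the caller must guarantee the
    // lifetime of the underlying storage. No reference counting occurs.
    let bytes: UnsafeBufferPointer<UInt8>

    var count: Int { return bytes.count }

    var isASCII: Bool {
        // ASCII iff no byte has its high bit set.
        return bytes.allSatisfy { $0 & 0x80 == 0 }
    }
}

// Usage: the view is only valid inside the closure, which is exactly the
// lifetime caveat (and UB hazard) noted above.
let utf8 = Array("hello".utf8)
utf8.withUnsafeBufferPointer { buffer in
    let view = UnmanagedUTF8View(bytes: buffer)
    print(view.count, view.isASCII)  // prints "5 true"
}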
> ### Performance Flags

Nice.

> ### Miscellaneous
>
> There are many other tweaks and improvements possible under the new ABI prior
> to stabilization, such as:
>
> * Guaranteed nul-termination for String’s storage classes for faster C
>   bridging.

This is probably best as another perf flag.

> * Unification of String and Substring’s various Views.
> * Some form of opaque string, such as an existential, which can be used as an
>   extension point.
> * More compact/performant indices.

What is the current theory on eager vs lazy bridging with Objective-C? Can we get to an eager bridging model? That alone would dramatically simplify and speed up every operation on a Swift string.

> ## String Ergonomics

I need to run now, but I’ll read and comment on this section when I have a chance. That said, just today I had to write the code below, and the ‘charToByte’ part of it is absolutely tragic. Are there any thoughts about how to improve character literal processing?

-Chris

func decodeHex(_ string: String) -> [UInt8] {
    func charToByte(_ c: String) -> UInt8 {
        return c.utf8.first!  // Swift makes char processing grotty. :-(
    }

    func hexToInt(_ c: UInt8) -> UInt8 {
        switch c {
        case charToByte("0")...charToByte("9"):
            return c - charToByte("0")
        case charToByte("a")...charToByte("f"):
            return c - charToByte("a") + 10
        case charToByte("A")...charToByte("F"):
            return c - charToByte("A") + 10
        default:
            fatalError("invalid hexadecimal character")
        }
    }

    var result: [UInt8] = []
    assert(string.count & 1 == 0, "must get a pair of hexadecimal characters")
    var it = string.utf8.makeIterator()
    while let byte1 = it.next(), let byte2 = it.next() {
        result.append((hexToInt(byte1) << 4) | hexToInt(byte2))
    }
    return result
}

print(decodeHex("01FF"))
print(decodeHex("8001"))
print(decodeHex("80a1bcd3"))
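As a point of comparison, the standard library’s UInt8(ascii:) initializer takes a Unicode scalar literal and yields its ASCII value, which already removes the need for the charToByte helper in code like this. A sketch of hexToInt rewritten with it, same semantics as above:

func hexToInt(_ c: UInt8) -> UInt8 {
    switch c {
    case UInt8(ascii: "0")...UInt8(ascii: "9"):
        return c - UInt8(ascii: "0")
    case UInt8(ascii: "a")...UInt8(ascii: "f"):
        return c - UInt8(ascii: "a") + 10
    case UInt8(ascii: "A")...UInt8(ascii: "F"):
        return c - UInt8(ascii: "A") + 10
    default:
        fatalError("invalid hexadecimal character")
    }
}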