> On Jan 11, 2018, at 2:17 PM, Michael Ilseman via swift-dev
> <swift-dev@swift.org> wrote:
>
>> On Jan 11, 2018, at 2:06 PM, Tony Allevato <tony.allev...@gmail.com> wrote:
>>
>> On Thu, Jan 11, 2018 at 12:32 PM Michael Ilseman via swift-dev
>> <swift-dev@swift.org> wrote:
>> Hi Chris!
>>
>> +CC Michael Gottesman, as I veer into talking about ARC.
>>
>>> On Jan 10, 2018, at 9:29 PM, Chris Lattner <clatt...@nondot.org> wrote:
>>>
>>> On Jan 10, 2018, at 11:55 AM, Michael Ilseman via swift-dev
>>> <swift-dev@swift.org> wrote:
>>>> (A gist-formatted version of this email can be found at
>>>> https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f)
>>>
>>> I’m very very excited for this, thank you for the detailed writeup and
>>> consideration of the effects and tradeoffs involved.
>>>
>>>> Given that ordering is not fit for human consumption, but rather machine
>>>> processing, it might as well be fast. The current ordering differs on
>>>> Darwin and Linux platforms, with Linux in particular suffering from poor
>>>> performance due to choice of ordering (UCA with DUCET) and older
>>>> versions of ICU. Instead, [String Comparison
>>>> Prototype](https://github.com/apple/swift/pull/12115) provides a simpler
>>>> ordering that allows for many common-case fast-paths and optimizations.
>>>> For all the Unicode enthusiasts out there, this is the lexicographical
>>>> ordering of NFC-normalized UTF-16 code units.
>>>
>>> Thank you for fixing this. Your tradeoffs make perfect sense to me.
>>>
>>>> ### Small String Optimization
>>> ..
>>>> For example, assuming a 16-byte String struct and 8 bits used for flags
>>>> and discriminators (including discriminators indicating which small form
>>>> a String is in), 120 bits are available for a small string payload. 120
>>>> bits can hold 7 UTF-16 code units, which is sufficient for most
>>>> graphemes and many common words and separators. 120 bits can also fit 15
>>>> ASCII/UTF-8 code units without any packing, which suffices for many
>>>> programmer/system strings (which have a strong skew towards ASCII).
>>>>
>>>> We may also want a compact 5-bit encoding for formatted numbers, such as
>>>> 64-bit memory addresses in hex, `Int.max` in base-10, and `Double` in
>>>> base-10, which would require 18, 19, and 24 characters respectively. 120
>>>> bits with a 5-bit encoding would fit all of these. This would speed up
>>>> the creation and interpolation of many strings containing numbers.
>>>
>>> I think it is important to consider that having more special cases and
>>> different representations slows down nearly *every* operation on string,
>>> because they have to check and detangle all of the possible
>>> representations.
>>
>> nit: Multiple small representations would slow down nearly every operation
>> on *small* strings. That is, we would initially branch on an isSmall
>> check, and then small strings would pay the cost of further inspection.
>> This could also slow down String construction, depending on how
>> aggressively we try to pack/transcode and whether we also have
>> specialized inits.
>>
>>> Given the ability to hold 15 digits of ascii, I don’t see why it would
>>> be worthwhile to worry about a 5-bit representation for digits. String
>>> should be an Any!
>>>
>>> The tradeoff here is that you’d be biasing the design to favor creation
>>> and interpolation of many strings containing *large* numbers, at the
>>> cost of general string performance anywhere.
>>> This doesn’t sound like a good tradeoff to me, particularly when people
>>> writing extremely performance-sensitive code probably won’t find it good
>>> enough anyway.
>>>
>>>> Final details are still in exploration. If the construction and
>>>> interpretation of small bit patterns can remain behind a resilience
>>>> barrier, new forms could be added in the future. However, the
>>>> performance impact of this would need to be carefully evaluated.
>>>
>>> I’d love to see performance numbers on this. Have you considered a
>>> design where you have exactly one small string representation: a
>>> sequence of 15 UTF-8 bytes? This holds your 15 bytes of ASCII, probably
>>> more non-ASCII characters on average than 7 UTF-16 code units, and only
>>> needs one discriminator branch on the entry point of hot functions that
>>> want to touch the bytes. If you have a lot of algorithms that are sped
>>> up by knowing they have ASCII, you could go with two bits: “is small”
>>> and “is ASCII”, where “is ASCII” is set in both the small string and
>>> normal string cases.
>>>
>>
>> We have not yet done the investigation, so all details could change. This
>> is my (perhaps flawed!) reasoning as to why, in general, it may be useful
>> to have multiple small forms. Again, this will change as we learn more.
>>
>> Strings are created and copied (i.e. a value-copy/retain) more often than
>> they are read (and strings are read far more often than they are
>> modified). The “high-order bit” for performance is whether a string is
>> small or not, as that avoids an allocation and, just as importantly,
>> skips managing the memory with ARC. Beyond that, we’re debating smaller
>> performance deltas, which are still important. This reasoning also
>> influences where we choose to place our resilience barriers, which can be
>> relaxed in the future.
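[Editor's note] As a back-of-envelope check of the capacities discussed above, a sketch under the email's assumed layout (16-byte struct, 8 bits of flags/discriminators) — hypothetical, not shipping ABI:

```swift
// Hypothetical layout from the thread: 16-byte struct minus 8 bits of
// flags/discriminators leaves a 120-bit small-string payload.
let payloadBits = 128 - 8                      // 120

print(payloadBits / 16)                        // UTF-16 code units that fit: 7
print(payloadBits / 8)                         // UTF-8 code units that fit: 15
print(payloadBits / 5)                         // chars in a 5-bit encoding: 24

// Character counts the 5-bit form would need to cover:
print(String(UInt64.max, radix: 16).count + 2) // "0x" + 16 hex digits = 18
print(String(Int.max).count)                   // Int.max in base-10: 19 digits
```

This is where the thread's 18/19/24 figures come from: only the 5-bit form's 24 characters comfortably covers all three, while 15 UTF-8 code units does not.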
>>
>> One of the nice things about having a small representation for each of
>> our backing storage kinds (ASCII/UTF-8 vs UTF-16) is that it would give
>> us a very fast way of conservatively checking whether we can form a
>> small string. E.g., if we’re forming a Substring over a portion of a
>> UTF-16-backed string, we can check whether the range holds 7 or fewer
>> code units to avoid the ARC. I don’t know if it would be worth the
>> attempt to detect whether it would fit in 15 transcoded UTF-8 code units
>> vs bumping the ref count. Avoiding the extra packing attempt may help
>> with branch mis-predicts, though I’m *way* out of my depth when it comes
>> to reasoning about mis-predicts as they emerge in the wild. There is a
>> “temporal locality of Unicode-ness” in string processing, though.
>>
>> As for what a 7-code-unit UTF-16 small form could hold that couldn’t fit
>> in 15 code units of UTF-8, the biggest cases are latter-BMP scalars, of
>> which a small UTF-8 string could only store 5. This may not be a big
>> deal.
>>
>> As to a 5-bit encoding, again, all details are pending more
>> experimentation. Its importance may be diminished by more efficient,
>> low-level String construction interfaces. The difference between 15 and
>> 24 characters could be big, considering that formatted addresses and
>> Doubles fit in that range. We’ll see.
>>
>>> Finally, what tradeoffs do you see between a 1-word vs 2-word string?
>>> Are we really destined to have 2 words? That’s still much better than
>>> the 3 words we have now, but for some workloads it is a significant
>>> bloat.
>>
>> <repeat disclaimer about final details being down to real data>. Some
>> arguments in favor of 2-word, presented roughly in order of impact:
>>
>> 1. This allows the String type to accommodate llvm::StringRef-style
>> usages. This is pretty broad usage: “mmap a file and treat its contents
>> as a String”, “store all my contents in an llvm::BumpPtr which outlives
>> uses”, un-owned slices, etc.
>> One-word String would greatly limit this to only whole-string,
>> nul-terminated cases.
>>
>> 2. Two-word String fits more small strings. Exactly where 7 vs 15 UTF-8
>> code units lie along the diminishing-returns curve depends on the data
>> set. One example is NSString, which (according to reasoning at
>> https://www.mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html)
>> considered it important enough to have 6- and 5-bit reduced ASCII
>> character sets to squeeze strings of up to 11 characters into a word.
>> 15-code-unit small strings would be a superset of tagged NSStrings,
>> meaning we could bridge them eagerly in-line, while 7-code-unit small
>> strings would be a subset (and also a strong argument against eagerly
>> bridging them).
>>
>> If you have access to any interesting data sets and can report back some
>> statistics, that would be immensely helpful!
>>
>> 3. More bits available to reserve for future-proofing, etc., though many
>> of these could be stored in the header.
>>
>> 4. The second word can cache useful information from large strings.
>> `endIndex` is a very frequently requested computed property, and it
>> could be stored directly in-line rather than loaded from memory (though
>> perhaps a load happens anyway in a subsequent read of the string).
>> Alternatively, we could store the grapheme count or some other piece of
>> information that we’d otherwise have to recompute. More experimentation
>> is needed here.
>>
>> 5. (vague and hand-wavy) Two words fit into a nice groove that three
>> words don’t: 2 words is a rule-of-thumb size for very small buffers.
>> It’s a common heap alignment, stack alignment, vector width,
>> double-word-load width, etc. One-word Strings may be under-utilizing
>> available resources; that is, the second word will often be there for
>> use anyway.
>> The main case where this is not true, and 1-word shines, is aggregates
>> of String.
>>
>> I’m seeking some kind of ruthlessly pragmatic balance here, and I think
>> 2-word is currently winning. Again, this is up to further investigation
>> and debate. FWIW, I believe Rust’s String is 3 words, i.e. it’s the
>> std::vector-style pointer+length+capacity, but I haven’t looked into
>> their String vs OsString vs OsStr vs … model.
>>
>>>> ### Unmanaged Strings
>>>>
>>>> Such unmanaged strings can be used when the lifetime of the storage is
>>>> known to outlive all uses.
>>>
>>> Just like StringRef! +1, this concept can be a huge performance win…
>>> but can also be a huge source of UB if used wrong. :-(
>>>
>>
>> When it comes to naming, I’m more prone to argue for “UnsafeString”
>> rather than “UnmanagedString”, but that’s a later debate for SE ;-)
>>
>>>> As such, they don’t need to participate in reference counting. In the
>>>> future, perhaps with ownership annotations or sophisticated static
>>>> analysis, Swift could convert managed strings into unmanaged ones as
>>>> an optimization when it knows the contents would not escape. Similarly,
>>>> in the future, a String constructed from a Substring that is known to
>>>> not outlive the Substring’s owner can use an unmanaged form to skip
>>>> copying the contents. This would benefit String’s status as the
>>>> recommended common-currency type for API design.
>>>
>>> This could also have implications for StaticString.
>>
>> Yup!
>>
>>>> Some kind of unmanaged string construct is an often-requested feature
>>>> for highly performance-sensitive code. We haven’t thought too deeply
>>>> about how best to surface this as API, and it can be sliced and diced
>>>> many ways. Some considerations:
>>>
>>> Other questions/considerations:
>>> - Here and now, could we get the vast majority of this benefit by
>>> having the ABI pass string arguments as +0 and guarantee their lifetime
>>> across calls? What is the tradeoff of that?
>>
>> Michael Gottesman (+CC) is doing this sort of investigation right now,
>> evaluating the tradeoffs of more widespread use of the +0 convention. My
>> current understanding is that even with +0, many retain/release pairs
>> cannot be eliminated without more knowledge of the callee, but I don’t
>> know the reasons why. Hopefully he can elaborate.
>>
>>> - Does this concept even matter if/when we can write an argument as
>>> “borrowed String”? I suppose this is a bit more general, in that it
>>> should be usable as properties and other things that the (initial)
>>> ownership design isn’t scoped to support, since it doesn’t have
>>> lifetimes.
>>
>> Right, it’s a little more general.
>>
>>> - Array has exactly the same sort of concern; why is this a
>>> string-specific problem?
>>
>> I think it’s interesting because slices may emerge more often in
>> practice on strings than on arrays, e.g. for presenting parsing results.
>> Also, applications that care about this level of performance may
>> custom-allocate all their string data contiguously (e.g. llvm::BumpPtr).
>> I believe that projects such as llbuild have done this, forcing
>> themselves into an API schism between String and
>> UnsafeBufferPointer<UInt8>, often duplicating functionality. In theory
>> all of this could also apply to Array, but it seems to happen more for
>> strings.
>>
>>> - How does this work with mutation? Wouldn’t we really have to have
>>> something like MutableArrayRef, since it’s talking about the mutability
>>> of the elements, not something that var/let can support?
>>>
>>
>> Good point. If this is more of an “UnsafeString”-like concept, there’s
>> an analogy with Unsafe[Mutable]BufferPointer.
>>
>>> When I think about this, it seems like a “non-owning *slice*” of some
>>> sort. If you went that direction, then you wouldn’t have to build it
>>> into the String currency type, just like it isn’t in LLVM.
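[Editor's note] A minimal sketch of the kind of non-owning view being debated here, using a hypothetical `UnmanagedUTF8String` type (illustrative only, not proposed API):

```swift
// Illustrative sketch (not proposed API): a StringRef-style non-owning
// view over UTF-8 bytes. The caller must guarantee the underlying
// storage outlives every use -- misuse is undefined behavior, hence the
// thread's preference for "Unsafe" in the eventual name.
struct UnmanagedUTF8String {
    let bytes: UnsafeBufferPointer<UInt8>

    var count: Int { return bytes.count }

    // Materialize a managed String (copying) when one is required;
    // safe only while the underlying storage is still alive.
    func asString() -> String {
        return String(decoding: bytes, as: UTF8.self)
    }
}
```

Whether something like this stays a separate type (as Chris suggests, mirroring LLVM) or folds into the String currency type is exactly the open design question in this exchange.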
>>
>> Yes, though you would lose the ability to give them to APIs expecting a
>> String when you know it’s safe to do so, e.g. when the contents are
>> effectively immortal relative to the API call and its effects. Careful
>> use of “unsafe” or “unmanaged” in type names and argument labels is
>> needed here.
>>
>>>> ### Performance Flags
>>>
>>> Nice.
>>>
>>>> ### Miscellaneous
>>>>
>>>> There are many other tweaks and improvements possible under the new
>>>> ABI prior to stabilization, such as:
>>>>
>>>> * Guaranteed nul-termination for String’s storage classes for faster C
>>>> bridging.
>>>
>>> This is probably best as another perf flag.
>>>
>>
>> I was thinking that if we always zero-fill our storage classes and
>> reserve one extra character, then the flag would be redundant with the
>> discriminator bits. However, that’s assuming the string doesn’t store a
>> nul character in the middle. I agree this might need a separate perf
>> flag.
>>
>>>> * Unification of String and Substring’s various Views.
>>>> * Some form of opaque string, such as an existential, which can be
>>>> used as an extension point.
>>>> * More compact/performant indices.
>>>
>>> What is the current theory on eager vs lazy bridging with Objective-C?
>>> Can we get to an eager bridging model? That alone would dramatically
>>> simplify and speed up every operation on a Swift string.
>>>
>>
>> I don’t think that will be feasible in general, although perhaps it
>> could happen for the most common occurrences.
>>
>>>> ## String Ergonomics
>>>
>>> I need to run now, but I’ll read and comment on this section when I
>>> have a chance.
>>>
>>> That said, just today I had to write the code below, and the
>>> ‘charToByte’ part of it is absolutely tragic. Are there any thoughts
>>> about how to improve character literal processing?
>>>
>>> -Chris
>>>
>>> func decodeHex(_ string: String) -> [UInt8] {
>>>   func charToByte(_ c: String) -> UInt8 {
>>>     return c.utf8.first! // swift makes char processing grotty. :-(
>>>   }
>>>
>>>   func hexToInt(_ c: UInt8) -> UInt8 {
>>>     switch c {
>>>     case charToByte("0")...charToByte("9"): return c - charToByte("0")
>>>     case charToByte("a")...charToByte("f"): return c - charToByte("a") + 10
>>>     case charToByte("A")...charToByte("F"): return c - charToByte("A") + 10
>>>     default: fatalError("invalid hexadecimal character")
>>>     }
>>>   }
>>>
>>
>> Yeah, trying to do ASCII value ranges is currently pretty rough. Here’s
>> what I did recently when I wanted something similar, using Unicode
>> scalar literal ranges:
>>
>> extension Character {
>>   public var isASCII: Bool {
>>     guard let scalar = unicodeScalars.first, unicodeScalars.count == 1
>>       else { return false }
>>     return scalar.value <= 0x7f
>>   }
>>   public var isASCIIAlphaNumeric: Bool {
>>     guard isASCII else { return false }
>>     switch unicodeScalars.first! {
>>     case "0"..."9": return true
>>     case "A"..."Z": return true
>>     case "a"..."z": return true
>>     default: return false
>>     }
>>   }
>> }
>>
>> +1 to UnicodeScalars. I've found that in cases where you can't
>> guarantee/don't know whether the underlying string storage is ASCII or
>> UTF-16, if performance is a concern, accessing the scalars gives you the
>> lowest-cost conversions (ASCII -> Scalar is always a simple promotion,
>> and UTF-16 -> Scalar is a simple promotion in all but the relatively
>> rare cases where you have a surrogate pair). On the other hand,
>> transcoding to UTF-8 can be more costly if all you want to do is compare
>> the Unicode code points numerically anyway.
>>
>> Using UnicodeScalars here also has the advantage of looking cleaner in
>> the source code (like those ranges above) thanks to
>> ExpressibleByUnicodeScalarLiteral, which UInt8 doesn't have.
>>
>> For a more general solution, I think a `var numericValue: Int? { get }`
>> on Character would make sense. Unicode defines (at least one) semantics
>> for this, and ICU provides this functionality already.
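[Editor's note] A hand-rolled sketch of the `numericValue`-style property suggested above, restricted to ASCII digits for illustration; a real implementation would consult Unicode's full Numeric_Value property, which ICU exposes:

```swift
// Illustrative only: a digit-value property on Character limited to
// ASCII digits. Unicode's Numeric_Value covers far more (other scripts'
// digits, numeric ideographs, fractions, etc.).
extension Character {
    var asciiDigitValue: Int? {
        // Exactly one scalar, and it must be an ASCII digit.
        guard unicodeScalars.count == 1 else { return nil }
        let scalar = unicodeScalars.first!
        switch scalar {
        case "0"..."9":
            return Int(scalar.value - 0x30) // 0x30 is the scalar value of "0"
        default:
            return nil
        }
    }
}
```

Usage would look like `Character("7").asciiDigitValue`, yielding an optional that is nil for non-digits, which sidesteps the `fatalError` fallback in the hex example above.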
>>
>> I threw together a small library recently that adds extensions to
>> UnicodeScalar for most of the properties exposed by ICU, and I would
>> *love* to see these become first-class:
>> https://github.com/allevato/icu-swift/tree/master/Sources/ICU. In other
>> words, I'm happy to turn bits of this into proposals and refactor them
>> into implementation PRs :)
>
> That looks awesome!
>
> I think a few general APIs for querying categories/properties would
> satisfy most of the expert-level needs, and it would be highly valuable
> to provide frequently used (and easily explained) properties. For the
> stdlib, these may not be as exhaustive as your package, balancing
> usefulness and discoverability against API bloat. There would always be a
> place for an exhaustive/convenience package for experts and enthusiasts.
>
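[Editor's note] For readers following along later: first-class scalar property queries of roughly this shape did eventually land in the standard library as `Unicode.Scalar.Properties` (SE-0211). A usage sketch, assuming a Swift 5 or later stdlib rather than anything in this thread:

```swift
// Usage sketch of stdlib Unicode scalar properties in the spirit of
// this discussion (assumes the SE-0211 API, which postdates the thread).
let scalar: Unicode.Scalar = "9"

// Unicode's Numeric_Value property, surfaced as an optional Double.
print(scalar.properties.numericValue as Any)

// General category / boolean property queries, backed by ICU data.
print(scalar.properties.isAlphabetic)
```

The stdlib ended up striking roughly the balance described above: a curated set of frequently used properties, with exhaustive coverage left to packages.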
I forgot to say: Yes, please make a pitch and drive a proposal! :-)
>
>>> var result: [UInt8] = []
>>>
>>> assert(string.count & 1 == 0, "must get a pair of hexadecimal characters")
>>> var it = string.utf8.makeIterator()
>>> while let byte1 = it.next(),
>>>       let byte2 = it.next() {
>>>   result.append((hexToInt(byte1) << 4) | hexToInt(byte2))
>>> }
>>>
>>> return result
>>> }
>>>
>>> print(decodeHex("01FF"))
>>> print(decodeHex("8001"))
>>> print(decodeHex("80a1bcd3"))
>>
>> _______________________________________________
>> swift-dev mailing list
>> swift-dev@swift.org
>> https://lists.swift.org/mailman/listinfo/swift-dev
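[Editor's note] For completeness, the `charToByte` shim in Chris's example can be dropped today by matching on UTF-8 code units with `UInt8(ascii:)` — a sketch of the same algorithm, not proposed API:

```swift
// decodeHex without the charToByte shim: UInt8(ascii:) gives the ASCII
// code unit of a scalar literal directly, so ranges stay readable.
func decodeHex(_ string: String) -> [UInt8] {
    func hexToInt(_ c: UInt8) -> UInt8 {
        switch c {
        case UInt8(ascii: "0")...UInt8(ascii: "9"): return c - UInt8(ascii: "0")
        case UInt8(ascii: "a")...UInt8(ascii: "f"): return c - UInt8(ascii: "a") + 10
        case UInt8(ascii: "A")...UInt8(ascii: "F"): return c - UInt8(ascii: "A") + 10
        default: fatalError("invalid hexadecimal character")
        }
    }

    precondition(string.utf8.count % 2 == 0,
                 "must get a pair of hexadecimal characters")
    var result: [UInt8] = []
    var it = string.utf8.makeIterator()
    while let hi = it.next(), let lo = it.next() {
        result.append((hexToInt(hi) << 4) | hexToInt(lo))
    }
    return result
}

print(decodeHex("01FF")) // [1, 255]
```

This keeps the exact structure of the original while removing the `c.utf8.first!` detour that prompted the "grotty" complaint.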