> On Jan 11, 2018, at 2:17 PM, Michael Ilseman via swift-dev
> <[email protected]> wrote:
>
>
>
>> On Jan 11, 2018, at 2:06 PM, Tony Allevato <[email protected]> wrote:
>>
>>
>>
>> On Thu, Jan 11, 2018 at 12:32 PM Michael Ilseman via swift-dev
>> <[email protected]> wrote:
>> Hi Chris!
>>
>> +CC Michael Gottesman, as I veer into talking about ARC.
>>
>>
>>
>>> On Jan 10, 2018, at 9:29 PM, Chris Lattner <[email protected]> wrote:
>>>
>>> On Jan 10, 2018, at 11:55 AM, Michael Ilseman via swift-dev
>>> <[email protected]> wrote:
>>>> (A gist-formatted version of this email can be found at
>>>> https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f)
>>>
>>> I’m very very excited for this, thank you for the detailed writeup and
>>> consideration of the effects and tradeoffs involved.
>>>
>>
>>>> Given that ordering is not fit for human consumption, but rather machine
>>>> processing, it might as well be fast. The current ordering differs on
>>>> Darwin and Linux platforms, with Linux in particular suffering from poor
>>>> performance due to choice of ordering (UCA with DUCET) and older versions
>>>> of ICU. Instead, [String Comparison
>>>> Prototype](https://github.com/apple/swift/pull/12115) provides a simpler ordering
>>>> that allows for many common-case fast-paths and optimizations. For all the
>>>> Unicode enthusiasts out there, this is the lexicographical ordering of
>>>> NFC-normalized UTF-16 code units.
>>>
>>> Thank you for fixing this. Your tradeoffs make perfect sense to me.
>>>
>>>> ### Small String Optimization
>>> ..
>>>> For example, assuming a 16-byte String struct and 8 bits used for flags
>>>> and discriminators (including discriminators for which small form a String
>>>> is in), 120 bits are available for a small string payload. 120 bits can
>>>> hold 7 UTF-16 code units, which is sufficient for most graphemes and many
>>>> common words and separators. 120 bits can also fit 15 ASCII/UTF-8 code
>>>> units without any packing, which suffices for many programmer/system
>>>> strings (which have a strong skew towards ASCII).
>>>>
>>>> We may also want a compact 5-bit encoding for formatted numbers, such as
>>>> 64-bit memory addresses in hex, `Int.max` in base-10, and `Double` in
>>>> base-10, which would require 18, 19, and 24 characters respectively. 120
>>>> bits with a 5-bit encoding would fit all of these. This would speed up the
>>>> creation and interpolation of many strings containing numbers.
>>>
>>> I think it is important to consider that having more special cases and
>>> different representations slows down nearly *every* operation on string
>>> because they have to check and detangle all of the possible representations.
>>
>> nit: Multiple small representations would slow down nearly every operation
>> on *small* strings. That is, we would initially branch on a isSmall check,
>> and then small strings would pay the cost of further inspection. This could
>> also slow down String construction, depending on how aggressively we try to
>> pack/transcode and whether we also have specialized inits.
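>>
>> To make that cost concrete, here is a toy model of the branching structure
>> (all names hypothetical, not the actual implementation):
>>
>> enum SmallForm { case utf8, utf16, fiveBit }
>>
>> struct ToyString {
>>   var isSmall: Bool
>>   var smallForm: SmallForm
>>   var payload: [UInt16]
>>
>>   // Every hot entry point branches once on small-vs-large; with
>>   // multiple small forms, small strings then pay a second dispatch.
>>   var utf16CodeUnits: [UInt16] {
>>     guard isSmall else { return payload }  // large: one branch
>>     switch smallForm {                     // small: a second branch
>>     case .utf16:   return payload
>>     case .utf8:    return payload          // would transcode here
>>     case .fiveBit: return payload          // would unpack here
>>     }
>>   }
>> }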
>>
>>> Given the ability to hold 15 digits of ascii, I don’t see why it would be
>>> worthwhile to worry about a 5-bit representation for digits. String should
>>> be an Any!
>>>
>>> The tradeoff here is that you’d be biasing the design to favor creation and
>>> interpolation of many strings containing *large* numbers, at the cost of
>>> general string performance anywhere. This doesn’t sound like a good
>>> tradeoff for me, particularly when people writing extremely performance
>>> sensitive code probably won’t find it good enough anyway.
>>>
>>>> Final details are still in exploration. If the construction and
>>>> interpretation of small bit patterns can remain behind a resilience
>>>> barrier, new forms could be added in the future. However, the performance
>>>> impact of this would need to be carefully evaluated.
>>>
>>> I’d love to see performance numbers on this. Have you considered a design
>>> where you have exactly one small string representation: a sequence of 15
>>> UTF-8 bytes? This holds your 15 bytes of ASCII, probably more non-ASCII
>>> characters on average than 7 UTF-16 code units, and only needs one
>>> discriminator branch on the entry point of hot functions that want to
>>> touch the bytes. If you have a lot of algorithms that are sped up by
>>> knowing they have ASCII, you could go with two bits: “is small” and “is
>>> ASCII”, where “is ASCII” is set in both the small string and normal
>>> string cases.
>>>
>>
>> We have not yet done the investigation, so all details could change. This is
>> my (perhaps flawed!) reasoning as to why, in general, it may be useful to
>> have multiple small forms. Again, this will change as we learn more.
>>
>> Strings are created and copied (i.e. a value-copy/retain) more often than
>> they are read, and they are read far more often than they are modified.
>> The “high-order bit” for performance is whether a string is small or not,
>> as that avoids an allocation and, just as importantly, skips managing the
>> memory with ARC. Beyond that, we’re debating smaller performance deltas,
>> which are still important. This reasoning also influences where we choose
>> to stick our resilience barriers, which can be relaxed in the future.
>>
>> One of the nice things about having a small representation for each of our
>> backing storage kinds (ASCII/UTF-8 vs UTF-16) is that it would give us a
>> very fast way of conservatively checking whether we can form a small
>> string. E.g., if we’re forming a Substring over a portion of a
>> UTF-16-backed string, we can check whether the range holds 7 or fewer code
>> units and skip ARC entirely. I don’t know whether it would be worth
>> attempting to detect that the contents would fit in 15 transcoded UTF-8
>> code units versus just bumping the ref count. Avoiding the extra packing
>> attempt may help with branch mispredictions, though I’m *way* out of my
>> depth when it comes to reasoning about mispredictions as they emerge in
>> the wild. There is a “temporal locality of Unicode-ness” in string
>> processing, though.
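>>
>> To make the two checks concrete, a sketch (hypothetical helpers, not real
>> API) of conservative-vs-speculative small-form detection for a slice of a
>> UTF-16-backed string:
>>
>> func fitsSmallUTF16(_ units: ArraySlice<UInt16>) -> Bool {
>>   return units.count <= 7  // cheap: the count alone decides
>> }
>>
>> func fitsSmallUTF8(_ units: ArraySlice<UInt16>) -> Bool {
>>   // Speculative: measures the transcoded size, which costs a scan
>>   // that may not beat just bumping the reference count.
>>   var utf8Count = 0
>>   for unit in units {
>>     switch unit {
>>     case 0..<0x80:        utf8Count += 1  // ASCII
>>     case 0x80..<0x800:    utf8Count += 2
>>     case 0xD800..<0xE000: utf8Count += 2  // surrogate: half of a 4-byte pair
>>     default:              utf8Count += 3  // latter-BMP scalars
>>     }
>>     if utf8Count > 15 { return false }
>>   }
>>   return true
>> }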
>>
>> As for which strings a 7-code-unit UTF-16 small form could hold that
>> couldn’t fit in 15 code units of UTF-8, the biggest cases involve
>> latter-BMP scalars, which cost 3 bytes each in UTF-8, so a small UTF-8
>> string could only store 5 of them. This may not be a big deal.
>>
>> As to a 5-bit encoding, again, all details pending more experimentation. Its
>> importance may be diminished by more efficient, low-level String
>> construction interfaces. The difference between 15 and 24 characters could
>> be big, considering that formatted addresses and Doubles fit in that range.
>> We’ll see.
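>>
>> As a toy illustration of the arithmetic only (the alphabet below is made
>> up; the real choice would be data-driven): a 32-symbol alphabet covers
>> what formatted numbers need, and 120 bits / 5 bits = 24 symbols. The
>> single-UInt64 version below holds 12; the length would live in flag bits.
>>
>> let alphabet = Array("0123456789abcdef.-+ex ")
>>
>> func pack5Bit(_ s: String) -> UInt64? {
>>   guard s.count <= 12 else { return nil }  // 12 * 5 = 60 bits
>>   var bits: UInt64 = 0
>>   for (i, c) in s.enumerated() {
>>     guard let idx = alphabet.firstIndex(of: c) else { return nil }
>>     bits |= UInt64(idx) << (UInt64(i) * 5)
>>   }
>>   return bits
>> }
>>
>> func unpack5Bit(_ bits: UInt64, count: Int) -> String {
>>   var s = ""
>>   for i in 0..<count {
>>     s.append(alphabet[Int((bits >> (UInt64(i) * 5)) & 0x1F)])
>>   }
>>   return s
>> }
>>
>> let packed = pack5Bit("3.14159e+0")!   // fits with room to spare
>> print(unpack5Bit(packed, count: 10))   // 3.14159e+0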
>>
>>
>>> Finally, what tradeoffs do you see between a 1-word vs 2-word string? Are
>>> we really destined to have 2-words? That’s still much better than the 3
>>> words we have now, but for some workloads it is a significant bloat.
>>
>> <repeat disclaimer about final details being down to real data>. Some
>> arguments in favor of 2-word, presented roughly in order of impact:
>>
>> 1. This allows the String type to accommodate llvm::StringRef-style usages.
>> This is pretty broad usage: “mmap a file and treat its contents as a
>> String”, “store all my contents in an llvm::BumpPtr which outlives uses”,
>> un-owned slices, etc. A one-word String would greatly limit this to only
>> whole-string, nul-terminated cases.
>>
>> 2. Two-word String fits more small strings. Exactly where 7 UTF-16 vs 15
>> UTF-8 code units lie along the diminishing-returns curve depends on the
>> data set. One example is NSString, which (according to reasoning at
>> https://www.mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html)
>> considered it important enough to have 6- and 5-bit reduced ASCII
>> character sets to squeeze strings of up to 11 characters into a word. 15
>> code unit small strings would be a super-set of tagged NSStrings, meaning
>> we could bridge them eagerly in-line, while 7 code unit small strings
>> would be a subset (and also a strong argument against eagerly bridging
>> them).
>>
>> If you have access to any interesting data sets and can report back some
>> statistics, that would be immensely helpful! (A quick measurement harness
>> is sketched after this list.)
>>
>> 3. More bits available to reserve for future-proofing, etc., though many of
>> these could be stored in the header.
>>
>> 4. The second word can cache useful information from large strings.
>> `endIndex` is a very frequently requested computed property and it could be
>> stored directly in-line rather than loaded from memory (though perhaps a
>> load happens anyway in a subsequent read of the string). Alternatively, we
>> could store the grapheme count or some other piece of information that we’d
>> otherwise have to recompute. More experimentation needed here.
>>
>> 5. (vague and hand-wavy) Two words fit into a nice groove that three words
>> don’t: 2 words is a rule-of-thumb size for very small buffers. It’s a
>> common heap alignment, stack alignment, vector width, double-word-load
>> width, etc. One-word Strings may under-utilize available resources; that
>> is, the second word will often be there for use anyway. The main case
>> where this is not true, and where 1-word shines, is aggregates of String.
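>>
>> Here is that promised harness: something as simple as this, run over your
>> own strings, would be informative.
>>
>> func smallStringStats(_ strings: [String]) -> (utf8: Int, utf16: Int) {
>>   var utf8Small = 0, utf16Small = 0
>>   for s in strings {
>>     if s.utf8.count <= 15 { utf8Small += 1 }
>>     if s.utf16.count <= 7 { utf16Small += 1 }
>>   }
>>   return (utf8Small, utf16Small)
>> }
>>
>> let sample = ["swift", "String", "日本語", "a somewhat longer string"]
>> let (u8, u16) = smallStringStats(sample)
>> print("≤15 UTF-8: \(u8)/\(sample.count), ≤7 UTF-16: \(u16)/\(sample.count)")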
>>
>> I’m seeking some kind of ruthlessly-pragmatic balance here, and I think
>> 2-word is currently winning. Again, this is up to further investigation
>> and debate. FWIW, I believe Rust’s String is 3 words, i.e. the
>> std::vector-style pointer+length+capacity, but I haven’t looked into
>> their String vs OsString vs OsStr vs … model.
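>>
>> For concreteness, the two shapes under discussion as toy structs (neither
>> is the actual representation, and the field meanings are hypothetical):
>>
>> struct OneWordString {
>>   var pointerOrSmallBits: UInt  // everything in a single word
>> }
>>
>> struct TwoWordString {
>>   var pointerOrSmallLow: UInt   // storage pointer, or low half of payload
>>   var countAndFlags: UInt       // cached endIndex/count, perf flags,
>>                                 // discriminator, or high half of payload
>> }
>>
>> print(MemoryLayout<OneWordString>.size)  // 8 on 64-bit
>> print(MemoryLayout<TwoWordString>.size)  // 16 on 64-bit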
>>
>>
>>>
>>>> ### Unmanaged Strings
>>>>
>>>> Such unmanaged strings can be used when the lifetime of the storage is
>>>> known to outlive all uses.
>>>
>>> Just like StringRef! +1, this concept can be a huge performance win… but
>>> can also be a huge source of UB if used wrong… :-(
>>>
>>
>> When it comes to naming, I’m more prone to argue for “UnsafeString” rather
>> than “UnmanagedString”, but that’s a later debate for SE ;-)
>>
>>>> As such, they don’t need to participate in reference counting. In the
>>>> future, perhaps with ownership annotations or sophisticated static
>>>> analysis, Swift could convert managed strings into unmanaged ones as an
>>>> optimization when it knows the contents would not escape. Similarly in the
>>>> future, a String constructed from a Substring that is known to not outlive
>>>> the Substring’s owner can use an unmanaged form to skip copying the
>>>> contents. This would benefit String’s status as the recommended
>>>> common-currency type for API design.
>>>
>>> This could also have implications for StaticString.
>>
>> Yup!
>>
>>>
>>>> Some kind of unmanaged string construct is an often-requested feature for
>>>> highly performance-sensitive code. We haven’t thought too deeply about how
>>>> best to surface this as API and it can be sliced and diced many ways. Some
>>>> considerations:
>>>
>>> Other questions/considerations:
>>> - here and now, could we get the vast majority of this benefit by having
>>> the ABI pass string arguments as +0 and guarantee their lifetime across
>>> calls? What is the tradeoff of that?
>>
>> Michael Gottesman (+CC) is doing this sort of investigation right now,
>> evaluating the tradeoffs of more widespread use of the +0 convention. My
>> current understanding is that even with +0, many retain/release pairs cannot
>> be eliminated without more knowledge of the callee, but I don’t know the
>> reasons why. Hopefully he can elaborate.
>>
>>> - does this concept even matter if/when we can write an argument as
>>> “borrowed String”? I suppose this is a bit more general in that it should
>>> be usable as properties and other things that the (initial) ownership
>>> design isn’t scoped to support since it doesn’t have lifetimes.
>>
>> Right, it’s a little more general.
>>
>>> - array has exactly the same sort of concern, why is this a string-specific
>>> problem?
>>
>> I think it’s interesting as slices may emerge more often in practice on
>> strings than arrays, e.g. for presenting parsing results. Also, applications
>> that care about this level of performance may custom-allocate all their
>> string data contiguously (e.g. llvm::BumpPtr). I believe that projects such
>> as llbuild have done this, forcing themselves into an API schism between
>> String and UnsafeBufferPointer<UInt8>, often duplicating functionality. In
>> theory all of this could also apply to array, but it seems to happen more
>> for strings.
>>
>>> - how does this work with mutation? Wouldn’t we really have to have
>>> something like MutableArrayRef since its talking about the mutability of
>>> the elements, not something that var/let can support?
>>>
>>
>> Good point. If this is more of an “UnsafeString”-like concept, there’s an
>> analogy with Unsafe[Mutable]BufferPointer.
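>>
>> Sketching that analogy (hypothetical types, not a proposal):
>>
>> // A non-owning view of UTF-8 contents whose lifetime the programmer
>> // guarantees, with element mutability in the type rather than var/let.
>> struct UnsafeString {
>>   var utf8: UnsafeBufferPointer<UInt8>
>> }
>>
>> struct UnsafeMutableString {
>>   var utf8: UnsafeMutableBufferPointer<UInt8>
>> }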
>>
>>> When I think about this, it seems like an “non-owning *slice*” of some
>>> sort. If you went that direction, then you wouldn’t have to build it into
>>> the String currency type, just like it isn’t in LLVM.
>>>
>>
>> Yes, though you would lose the ability to give them to APIs expecting a
>> String when you know it’s safe to do so, e.g. when the contents are
>> effectively immortal relative to the API call and its effects. Careful use
>> of “unsafe” or “unmanaged” in type names and argument labels would be
>> needed here.
>>
>>>> ### Performance Flags
>>>
>>> Nice.
>>>
>>>> ### Miscellaneous
>>>>
>>>> There are many other tweaks and improvements possible under the new ABI
>>>> prior to stabilization, such as:
>>>>
>>>> * Guaranteed nul-termination for String’s storage classes for faster C
>>>> bridging.
>>>
>>> This is probably best as another perf flag.
>>>
>>
>> I was thinking that if we always zero-fill our storage classes and reserve
>> one extra character, then the flag would be redundant with the discriminator
>> bits. However, that’s assuming the string doesn’t store a nul character in
>> the middle. I agree this might need a separate perf flag.
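>>
>> The interior-nul caveat is easy to demonstrate:
>>
>> let s = "abc\0def"
>> print(s.utf8.count)  // 7: the nul is a real code unit to Swift
>> s.withCString { cstr in
>>   var len = 0
>>   while cstr[len] != 0 { len += 1 }
>>   print(len)         // 3: a C consumer stops at the first nul
>> }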
>>
>>>> * Unification of String and Substring’s various Views.
>>>> * Some form of opaque string, such as an existential, which can be used as
>>>> an extension point.
>>>> * More compact/performant indices.
>>>
>>> What is the current theory on eager vs lazy bridging with Objective-C? Can
>>> we get to an eager bridging model? That alone would dramatically simplify
>>> and speed up every operation on a Swift string.
>>>
>>
>> I don’t think that will be feasible in general, although perhaps it could
>> happen for most common occurrences.
>>
>>>> ## String Ergonomics
>>>
>>> I need to run now, but I’ll read and comment on this section when I have a
>>> chance.
>>>
>>> That said, just today I had to write the code below and the ‘charToByte’
>>> part of it is absolutely tragic. Are there any thoughts about how to
>>> improve character literal processing?
>>>
>>> -Chris
>>>
>>> func decodeHex(_ string: String) -> [UInt8] {
>>>   func charToByte(_ c: String) -> UInt8 {
>>>     return c.utf8.first! // swift makes char processing grotty. :-(
>>>   }
>>>
>>>   func hexToInt(_ c: UInt8) -> UInt8 {
>>>     switch c {
>>>     case charToByte("0")...charToByte("9"): return c - charToByte("0")
>>>     case charToByte("a")...charToByte("f"): return c - charToByte("a") + 10
>>>     case charToByte("A")...charToByte("F"): return c - charToByte("A") + 10
>>>     default: fatalError("invalid hexadecimal character")
>>>     }
>>>   }
>>>
>>
>> Yeah, trying to do ASCII value ranges is currently pretty rough. Here’s what
>> I did recently when I wanted something similar, using Unicode scalar literal
>> ranges:
>>
>> extension Character {
>>   public var isASCII: Bool {
>>     guard let scalar = unicodeScalars.first, unicodeScalars.count == 1
>>       else { return false }
>>     return scalar.value <= 0x7f
>>   }
>>   public var isASCIIAlphaNumeric: Bool {
>>     guard isASCII else { return false }
>>     switch unicodeScalars.first! {
>>     case "0"..."9": return true
>>     case "A"..."Z": return true
>>     case "a"..."z": return true
>>     default: return false
>>     }
>>   }
>> }
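>>
>> Used like so:
>>
>> print(Character("a").isASCIIAlphaNumeric)  // true
>> print(Character("é").isASCIIAlphaNumeric)  // false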
>>
>> +1 to UnicodeScalars. I've found that in the cases where you can't
>> guarantee/don't know whether the underlying string storage is ASCII or
>> UTF-16, if performance is a concern, accessing the scalars gives you the
>> lowest-cost conversions (ASCII -> Scalar is always a simple promotion, and
>> UTF-16 -> Scalar is a simple promotion in all but the relatively rare
>> cases where you have a surrogate pair). On the other hand, transcoding to
>> UTF-8 can be more costly if all you want to do is compare the Unicode code
>> points numerically anyway.
>>
>> Using UnicodeScalars here also has the advantage of looking cleaner in the
>> source code (like those ranges above) thanks to
>> ExpressibleByUnicodeScalarLiteral, which UInt8 doesn't have.
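>>
>> For example, something like this compares by code point without any
>> transcoding:
>>
>> func codePointLess(_ a: String, _ b: String) -> Bool {
>>   return a.unicodeScalars.lexicographicallyPrecedes(b.unicodeScalars) {
>>     $0.value < $1.value
>>   }
>> }
>>
>> print(codePointLess("z", "é"))  // true: U+007A < U+00E9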
>>
>>
>> For a more general solution, I think a `var numericValue: Int? { get }` on
>> Character would make sense. Unicode defines (at least one) semantics for
>> this and ICU provides this functionality already.
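>>
>> A sketch of the shape that property might take; this toy version only
>> handles ASCII digits, whereas the real Unicode semantics (what ICU's
>> u_getNumericValue exposes) also cover digits in other scripts, numeric
>> ideographs, and so on:
>>
>> extension Character {
>>   var numericValueSketch: Int? {
>>     guard let scalar = unicodeScalars.first, unicodeScalars.count == 1,
>>       ("0"..."9").contains(scalar)
>>       else { return nil }
>>     return Int(scalar.value - ("0" as Unicode.Scalar).value)
>>   }
>> }
>>
>> print("7".first!.numericValueSketch as Any)  // Optional(7)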
>>
>> I threw together a small library recently that adds extensions to
>> UnicodeScalar for most of the properties exposed by ICU, and I would *love*
>> to see these become first-class:
>> https://github.com/allevato/icu-swift/tree/master/Sources/ICU. In other
>> words, I'm happy to turn bits of this into proposals and refactor them into
>> implementation PRs :)
>>
>
> That looks awesome!
>
> I think a few general APIs for querying categories/properties would satisfy
> most of the expert-level needs, and it would be highly valuable to provide
> frequently used (and easily explained) properties. For the stdlib, these may
> not be as exhaustive as your package, balancing usefulness and
> discoverability against API bloat. There would always be a place for an
> exhaustive/convenience package for experts and enthusiasts.
>
I forgot to say: Yes, please make a pitch and drive a proposal! :-)
>
>>
>>>   var result: [UInt8] = []
>>>
>>>   assert(string.count & 1 == 0, "must get a pair of hexadecimal characters")
>>>   var it = string.utf8.makeIterator()
>>>   while let byte1 = it.next(),
>>>         let byte2 = it.next() {
>>>     result.append((hexToInt(byte1) << 4) | hexToInt(byte2))
>>>   }
>>>
>>>   return result
>>> }
>>>
>>> print(decodeHex("01FF"))
>>> print(decodeHex("8001"))
>>> print(decodeHex("80a1bcd3"))
>>>
>>
>>>
>>>
_______________________________________________
swift-dev mailing list
[email protected]
https://lists.swift.org/mailman/listinfo/swift-dev