> On Jan 11, 2018, at 2:17 PM, Michael Ilseman via swift-dev
> <swift-dev@swift.org> wrote:
>
>> On Jan 11, 2018, at 2:06 PM, Tony Allevato <tony.allev...@gmail.com> wrote:
>>
>> On Thu, Jan 11, 2018 at 12:32 PM Michael Ilseman via swift-dev
>> <swift-dev@swift.org> wrote:
>> Hi Chris!
>>
>> +CC Michael Gottesman, as I veer into talking about ARC.
>>
>>> On Jan 10, 2018, at 9:29 PM, Chris Lattner <clatt...@nondot.org> wrote:
>>>
>>> On Jan 10, 2018, at 11:55 AM, Michael Ilseman via swift-dev
>>> <swift-dev@swift.org> wrote:
>>>> (A gist-formatted version of this email can be found at
>>>> https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f)
>>>
>>> I’m very very excited for this, thank you for the detailed writeup and
>>> consideration of the effects and tradeoffs involved.
>>>
>>>> Given that ordering is not fit for human consumption, but rather machine
>>>> processing, it might as well be fast. The current ordering differs on
>>>> Darwin and Linux platforms, with Linux in particular suffering from poor
>>>> performance due to choice of ordering (UCA with DUCET) and older
>>>> versions of ICU. Instead, [String Comparison
>>>> Prototype](https://github.com/apple/swift/pull/12115) provides a simpler
>>>> ordering that allows for many common-case fast-paths and optimizations.
>>>> For all the Unicode enthusiasts out there, this is the lexicographical
>>>> ordering of NFC-normalized UTF-16 code units.
>>>
>>> Thank you for fixing this. Your tradeoffs make perfect sense to me.
>>>
>>>> ### Small String Optimization
>>> ..
>>>> For example, assuming a 16-byte String struct and 8 bits used for flags
>>>> and discriminators (including discriminators indicating which small form
>>>> a String is in), 120 bits are available for a small string payload. 120
>>>> bits can hold 7 UTF-16 code units, which is sufficient for most
>>>> graphemes and many common words and separators. 120 bits can also fit 15
>>>> ASCII/UTF-8 code units without any packing, which suffices for many
>>>> programmer/system strings (which have a strong skew towards ASCII).
>>>>
>>>> We may also want a compact 5-bit encoding for formatted numbers, such as
>>>> 64-bit memory addresses in hex, `Int.max` in base-10, and `Double` in
>>>> base-10, which would require 18, 19, and 24 characters respectively. 120
>>>> bits with a 5-bit encoding would fit all of these. This would speed up
>>>> the creation and interpolation of many strings containing numbers.
>>>
>>> I think it is important to consider that having more special cases and
>>> different representations slows down nearly *every* operation on string,
>>> because they have to check and detangle all of the possible
>>> representations.
>>
>> nit: Multiple small representations would slow down nearly every operation
>> on *small* strings. That is, we would initially branch on an isSmall
>> check, and then small strings would pay the cost of further inspection.
>> This could also slow down String construction, depending on how
>> aggressively we try to pack/transcode and whether we also have
>> specialized inits.
>>
>>> Given the ability to hold 15 digits of ascii, I don’t see why it would
>>> be worthwhile to worry about a 5-bit representation for digits. String
>>> should be an Any!
>>>
>>> The tradeoff here is that you’d be biasing the design to favor creation
>>> and interpolation of many strings containing *large* numbers, at the
>>> cost of general string performance anywhere.
>>> This doesn’t sound like a good tradeoff to me, particularly when people
>>> writing extremely performance-sensitive code probably won’t find it good
>>> enough anyway.
>>>
>>>> Final details are still in exploration. If the construction and
>>>> interpretation of small bit patterns can remain behind a resilience
>>>> barrier, new forms could be added in the future. However, the
>>>> performance impact of this would need to be carefully evaluated.
>>>
>>> I’d love to see performance numbers on this. Have you considered a
>>> design where you have exactly one small string representation: a
>>> sequence of 15 UTF-8 bytes? This holds your 15 bytes of ASCII, probably
>>> more non-ASCII characters on average than 7 UTF-16 code units, and only
>>> needs one discriminator branch on the entry point of hot functions that
>>> want to touch the bytes. If you have a lot of algorithms that are sped
>>> up by knowing they have ASCII, you could go with two bits: “is small”
>>> and “is ASCII”, where “is ASCII” is set in both the small string and
>>> normal string cases.
>>>
>>
>> We have not yet done the investigation, so all details could change. This
>> is my (perhaps flawed!) reasoning as to why, in general, it may be useful
>> to have multiple small forms. Again, this will change as we learn more.
>>
>> Strings are created and copied (i.e. a value-copy/retain) more often than
>> they are read (and strings are read far more often than they are
>> modified). The “high-order bit” for performance is whether a string is
>> small or not, as that avoids an allocation and, just as importantly,
>> skips managing the memory with ARC. Beyond that, we’re debating smaller
>> performance deltas, which are still important. This reasoning also
>> influences where we choose to place our resilience barriers, which can be
>> relaxed in the future.
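[Editor's note] As a back-of-envelope check of the capacities discussed above, a sketch under the email's assumed layout (16-byte struct, 8 bits of flags/discriminators) — hypothetical, not shipping ABI:

```swift
// Hypothetical layout from the thread: 16-byte struct minus 8 bits of
// flags/discriminators leaves a 120-bit small-string payload.
let payloadBits = 128 - 8                      // 120

print(payloadBits / 16)                        // UTF-16 code units that fit: 7
print(payloadBits / 8)                         // UTF-8 code units that fit: 15
print(payloadBits / 5)                         // chars in a 5-bit encoding: 24

// Character counts the 5-bit form would need to cover:
print(String(UInt64.max, radix: 16).count + 2) // "0x" + 16 hex digits = 18
print(String(Int.max).count)                   // Int.max in base-10: 19 digits
```

This is where the thread's 18/19/24 figures come from: only the 5-bit form's 24 characters comfortably covers all three, while 15 UTF-8 code units does not.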
>>
>> One of the nice things about having a small representation for each of
>> our backing storage kinds (ASCII/UTF-8 vs UTF-16) is that it would give
>> us a very fast way of conservatively checking whether we can form a
>> small string. E.g., if we’re forming a Substring over a portion of a
>> UTF-16-backed string, we can check whether the range holds 7 or fewer
>> code units to avoid the ARC. I don’t know if it would be worth the
>> attempt to detect whether it would fit in 15 transcoded UTF-8 code units
>> vs bumping the ref count. Avoiding the extra packing attempt may help
>> with branch mis-predicts, though I’m *way* out of my depth when it comes
>> to reasoning about mis-predicts as they emerge in the wild. There is a
>> “temporal locality of Unicode-ness” in string processing, though.
>>
>> As for what a 7-code-unit UTF-16 small form could hold that couldn’t fit
>> in 15 code units of UTF-8, the biggest cases are latter-BMP scalars, of
>> which a small UTF-8 string could only store 5. This may not be a big
>> deal.
>>
>> As to a 5-bit encoding, again, all details are pending more
>> experimentation. Its importance may be diminished by more efficient,
>> low-level String construction interfaces. The difference between 15 and
>> 24 characters could be big, considering that formatted addresses and
>> Doubles fit in that range. We’ll see.
>>
>>> Finally, what tradeoffs do you see between a 1-word vs 2-word string?
>>> Are we really destined to have 2 words? That’s still much better than
>>> the 3 words we have now, but for some workloads it is a significant
>>> bloat.
>>
>> <repeat disclaimer about final details being down to real data>. Some
>> arguments in favor of 2-word, presented roughly in order of impact:
>>
>> 1. This allows the String type to accommodate llvm::StringRef-style
>> usages. This is pretty broad usage: “mmap a file and treat its contents
>> as a String”, “store all my contents in an llvm::BumpPtr which outlives
>> uses”, un-owned slices, etc.
>> One-word String would greatly limit this to only whole-string,
>> nul-terminated cases.
>>
>> 2. Two-word String fits more small strings. Exactly where 7 vs 15 UTF-8
>> code units lie along the diminishing-returns curve depends on the data
>> set. One example is NSString, which (according to reasoning at
>> https://www.mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html)
>> considered it important enough to have 6- and 5-bit reduced ASCII
>> character sets to squeeze strings of up to 11 characters into a word.
>> 15-code-unit small strings would be a superset of tagged NSStrings,
>> meaning we could bridge them eagerly in-line, while 7-code-unit small
>> strings would be a subset (and also a strong argument against eagerly
>> bridging them).
>>
>> If you have access to any interesting data sets and can report back some
>> statistics, that would be immensely helpful!
>>
>> 3. More bits available to reserve for future-proofing, etc., though many
>> of these could be stored in the header.
>>
>> 4. The second word can cache useful information from large strings.
>> `endIndex` is a very frequently requested computed property, and it
>> could be stored directly in-line rather than loaded from memory (though
>> perhaps a load happens anyway in a subsequent read of the string).
>> Alternatively, we could store the grapheme count or some other piece of
>> information that we’d otherwise have to recompute. More experimentation
>> is needed here.
>>
>> 5. (vague and hand-wavy) Two words fit into a nice groove that three
>> words don’t: 2 words is a rule-of-thumb size for very small buffers.
>> It’s a common heap alignment, stack alignment, vector width,
>> double-word-load width, etc. One-word Strings may be under-utilizing
>> available resources; that is, the second word will often be there for
>> use anyway.
>> The main case where this is not true, and 1-word shines, is aggregates
>> of String.
>>
>> I’m seeking some kind of ruthlessly pragmatic balance here, and I think
>> 2-word is currently winning. Again, this is up to further investigation
>> and debate. FWIW, I believe Rust’s String is 3 words, i.e. it’s the
>> std::vector-style pointer+length+capacity, but I haven’t looked into
>> their String vs OsString vs OsStr vs … model.
>>
>>>> ### Unmanaged Strings
>>>>
>>>> Such unmanaged strings can be used when the lifetime of the storage is
>>>> known to outlive all uses.
>>>
>>> Just like StringRef! +1, this concept can be a huge performance win…
>>> but can also be a huge source of UB if used wrong. :-(
>>>
>>
>> When it comes to naming, I’m more prone to argue for “UnsafeString”
>> rather than “UnmanagedString”, but that’s a later debate for SE ;-)
>>
>>>> As such, they don’t need to participate in reference counting. In the
>>>> future, perhaps with ownership annotations or sophisticated static
>>>> analysis, Swift could convert managed strings into unmanaged ones as
>>>> an optimization when it knows the contents would not escape. Similarly,
>>>> in the future, a String constructed from a Substring that is known to
>>>> not outlive the Substring’s owner can use an unmanaged form to skip
>>>> copying the contents. This would benefit String’s status as the
>>>> recommended common-currency type for API design.
>>>
>>> This could also have implications for StaticString.
>>
>> Yup!
>>
>>>> Some kind of unmanaged string construct is an often-requested feature
>>>> for highly performance-sensitive code. We haven’t thought too deeply
>>>> about how best to surface this as API, and it can be sliced and diced
>>>> many ways. Some considerations:
>>>
>>> Other questions/considerations:
>>> - Here and now, could we get the vast majority of this benefit by
>>> having the ABI pass string arguments as +0 and guarantee their lifetime
>>> across calls? What is the tradeoff of that?
>>
>> Michael Gottesman (+CC) is doing this sort of investigation right now,
>> evaluating the tradeoffs of more widespread use of the +0 convention. My
>> current understanding is that even with +0, many retain/release pairs
>> cannot be eliminated without more knowledge of the callee, but I don’t
>> know the reasons why. Hopefully he can elaborate.
>>
>>> - Does this concept even matter if/when we can write an argument as
>>> “borrowed String”? I suppose this is a bit more general, in that it
>>> should be usable as properties and other things that the (initial)
>>> ownership design isn’t scoped to support, since it doesn’t have
>>> lifetimes.
>>
>> Right, it’s a little more general.
>>
>>> - Array has exactly the same sort of concern; why is this a
>>> string-specific problem?
>>
>> I think it’s interesting because slices may emerge more often in
>> practice on strings than on arrays, e.g. for presenting parsing results.
>> Also, applications that care about this level of performance may
>> custom-allocate all their string data contiguously (e.g. llvm::BumpPtr).
>> I believe that projects such as llbuild have done this, forcing
>> themselves into an API schism between String and
>> UnsafeBufferPointer<UInt8>, often duplicating functionality. In theory
>> all of this could also apply to Array, but it seems to happen more for
>> strings.
>>
>>> - How does this work with mutation? Wouldn’t we really have to have
>>> something like MutableArrayRef, since it’s talking about the mutability
>>> of the elements, not something that var/let can support?
>>>
>>
>> Good point. If this is more of an “UnsafeString”-like concept, there’s
>> an analogy with Unsafe[Mutable]BufferPointer.
>>
>>> When I think about this, it seems like a “non-owning *slice*” of some
>>> sort. If you went that direction, then you wouldn’t have to build it
>>> into the String currency type, just like it isn’t in LLVM.
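[Editor's note] A minimal sketch of the kind of non-owning view being debated here, using a hypothetical `UnmanagedUTF8String` type (illustrative only, not proposed API):

```swift
// Illustrative sketch (not proposed API): a StringRef-style non-owning
// view over UTF-8 bytes. The caller must guarantee the underlying
// storage outlives every use -- misuse is undefined behavior, hence the
// thread's preference for "Unsafe" in the eventual name.
struct UnmanagedUTF8String {
    let bytes: UnsafeBufferPointer<UInt8>

    var count: Int { return bytes.count }

    // Materialize a managed String (copying) when one is required;
    // safe only while the underlying storage is still alive.
    func asString() -> String {
        return String(decoding: bytes, as: UTF8.self)
    }
}
```

Whether something like this stays a separate type (as Chris suggests, mirroring LLVM) or folds into the String currency type is exactly the open design question in this exchange.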
>>
>> Yes, though you would lose the ability to give them to APIs expecting a
>> String when you know it’s safe to do so, e.g. when the contents are
>> effectively immortal relative to the API call and its effects. Careful
>> use of “unsafe” or “unmanaged” in type names and argument labels is
>> needed here.
>>
>>>> ### Performance Flags
>>>
>>> Nice.
>>>
>>>> ### Miscellaneous
>>>>
>>>> There are many other tweaks and improvements possible under the new
>>>> ABI prior to stabilization, such as:
>>>>
>>>> * Guaranteed nul-termination for String’s storage classes for faster C
>>>> bridging.
>>>
>>> This is probably best as another perf flag.
>>>
>>
>> I was thinking that if we always zero-fill our storage classes and
>> reserve one extra character, then the flag would be redundant with the
>> discriminator bits. However, that’s assuming the string doesn’t store a
>> nul character in the middle. I agree this might need a separate perf
>> flag.
>>
>>>> * Unification of String and Substring’s various Views.
>>>> * Some form of opaque string, such as an existential, which can be
>>>> used as an extension point.
>>>> * More compact/performant indices.
>>>
>>> What is the current theory on eager vs lazy bridging with Objective-C?
>>> Can we get to an eager bridging model? That alone would dramatically
>>> simplify and speed up every operation on a Swift string.
>>>
>>
>> I don’t think that will be feasible in general, although perhaps it
>> could happen for the most common occurrences.
>>
>>>> ## String Ergonomics
>>>
>>> I need to run now, but I’ll read and comment on this section when I
>>> have a chance.
>>>
>>> That said, just today I had to write the code below, and the
>>> ‘charToByte’ part of it is absolutely tragic. Are there any thoughts
>>> about how to improve character literal processing?
>>>
>>> -Chris
>>>
>>> func decodeHex(_ string: String) -> [UInt8] {
>>>   func charToByte(_ c: String) -> UInt8 {
>>>     return c.utf8.first! // swift makes char processing grotty. :-(
>>>   }
>>>
>>>   func hexToInt(_ c: UInt8) -> UInt8 {
>>>     switch c {
>>>     case charToByte("0")...charToByte("9"): return c - charToByte("0")
>>>     case charToByte("a")...charToByte("f"): return c - charToByte("a") + 10
>>>     case charToByte("A")...charToByte("F"): return c - charToByte("A") + 10
>>>     default: fatalError("invalid hexadecimal character")
>>>     }
>>>   }
>>>
>>
>> Yeah, trying to do ASCII value ranges is currently pretty rough. Here’s
>> what I did recently when I wanted something similar, using Unicode
>> scalar literal ranges:
>>
>> extension Character {
>>   public var isASCII: Bool {
>>     guard let scalar = unicodeScalars.first, unicodeScalars.count == 1
>>       else { return false }
>>     return scalar.value <= 0x7f
>>   }
>>   public var isASCIIAlphaNumeric: Bool {
>>     guard isASCII else { return false }
>>     switch unicodeScalars.first! {
>>     case "0"..."9": return true
>>     case "A"..."Z": return true
>>     case "a"..."z": return true
>>     default: return false
>>     }
>>   }
>> }
>>
>> +1 to UnicodeScalars. I've found that in cases where you can't
>> guarantee/don't know whether the underlying string storage is ASCII or
>> UTF-16, if performance is a concern, accessing the scalars gives you the
>> lowest-cost conversions (ASCII -> Scalar is always a simple promotion,
>> and UTF-16 -> Scalar is a simple promotion in all but the relatively
>> rare cases where you have a surrogate pair). On the other hand,
>> transcoding to UTF-8 can be more costly if all you want to do is compare
>> the Unicode code points numerically anyway.
>>
>> Using UnicodeScalars here also has the advantage of looking cleaner in
>> the source code (like those ranges above) thanks to
>> ExpressibleByUnicodeScalarLiteral, which UInt8 doesn't have.
>>
>> For a more general solution, I think a `var numericValue: Int? { get }`
>> on Character would make sense. Unicode defines (at least one) semantics
>> for this, and ICU provides this functionality already.
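[Editor's note] A hand-rolled sketch of the `numericValue`-style property suggested above, restricted to ASCII digits for illustration; a real implementation would consult Unicode's full Numeric_Value property, which ICU exposes:

```swift
// Illustrative only: a digit-value property on Character limited to
// ASCII digits. Unicode's Numeric_Value covers far more (other scripts'
// digits, numeric ideographs, fractions, etc.).
extension Character {
    var asciiDigitValue: Int? {
        // Exactly one scalar, and it must be an ASCII digit.
        guard unicodeScalars.count == 1 else { return nil }
        let scalar = unicodeScalars.first!
        switch scalar {
        case "0"..."9":
            return Int(scalar.value - 0x30) // 0x30 is the scalar value of "0"
        default:
            return nil
        }
    }
}
```

Usage would look like `Character("7").asciiDigitValue`, yielding an optional that is nil for non-digits, which sidesteps the `fatalError` fallback in the hex example above.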
>>
>> I threw together a small library recently that adds extensions to
>> UnicodeScalar for most of the properties exposed by ICU, and I would
>> *love* to see these become first-class:
>> https://github.com/allevato/icu-swift/tree/master/Sources/ICU. In other
>> words, I'm happy to turn bits of this into proposals and refactor them
>> into implementation PRs :)
>
> That looks awesome!
>
> I think a few general APIs for querying categories/properties would
> satisfy most of the expert-level needs, and it would be highly valuable
> to provide frequently used (and easily explained) properties. For the
> stdlib, these may not be as exhaustive as your package, balancing
> usefulness and discoverability against API bloat. There would always be a
> place for an exhaustive/convenience package for experts and enthusiasts.
>
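[Editor's note] For readers following along later: first-class scalar property queries of roughly this shape did eventually land in the standard library as `Unicode.Scalar.Properties` (SE-0211). A usage sketch, assuming a Swift 5 or later stdlib rather than anything in this thread:

```swift
// Usage sketch of stdlib Unicode scalar properties in the spirit of
// this discussion (assumes the SE-0211 API, which postdates the thread).
let scalar: Unicode.Scalar = "9"

// Unicode's Numeric_Value property, surfaced as an optional Double.
print(scalar.properties.numericValue as Any)

// General category / boolean property queries, backed by ICU data.
print(scalar.properties.isAlphabetic)
```

The stdlib ended up striking roughly the balance described above: a curated set of frequently used properties, with exhaustive coverage left to packages.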
I forgot to say: Yes, please make a pitch and drive a proposal! :-)
>
>>> var result: [UInt8] = []
>>>
>>> assert(string.count & 1 == 0, "must get a pair of hexadecimal characters")
>>> var it = string.utf8.makeIterator()
>>> while let byte1 = it.next(),
>>>       let byte2 = it.next() {
>>>   result.append((hexToInt(byte1) << 4) | hexToInt(byte2))
>>> }
>>>
>>> return result
>>> }
>>>
>>> print(decodeHex("01FF"))
>>> print(decodeHex("8001"))
>>> print(decodeHex("80a1bcd3"))
>>
>> _______________________________________________
>> swift-dev mailing list
>> swift-dev@swift.org
>> https://lists.swift.org/mailman/listinfo/swift-dev
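[Editor's note] For completeness, the `charToByte` shim in Chris's example can be dropped today by matching on UTF-8 code units with `UInt8(ascii:)` — a sketch of the same algorithm, not proposed API:

```swift
// decodeHex without the charToByte shim: UInt8(ascii:) gives the ASCII
// code unit of a scalar literal directly, so ranges stay readable.
func decodeHex(_ string: String) -> [UInt8] {
    func hexToInt(_ c: UInt8) -> UInt8 {
        switch c {
        case UInt8(ascii: "0")...UInt8(ascii: "9"): return c - UInt8(ascii: "0")
        case UInt8(ascii: "a")...UInt8(ascii: "f"): return c - UInt8(ascii: "a") + 10
        case UInt8(ascii: "A")...UInt8(ascii: "F"): return c - UInt8(ascii: "A") + 10
        default: fatalError("invalid hexadecimal character")
        }
    }

    precondition(string.utf8.count % 2 == 0,
                 "must get a pair of hexadecimal characters")
    var result: [UInt8] = []
    var it = string.utf8.makeIterator()
    while let hi = it.next(), let lo = it.next() {
        result.append((hexToInt(hi) << 4) | hexToInt(lo))
    }
    return result
}

print(decodeHex("01FF")) // [1, 255]
```

This keeps the exact structure of the original while removing the `c.utf8.first!` detour that prompted the "grotty" complaint.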