On Jan 10, 2018, at 11:55 AM, Michael Ilseman via swift-dev 
<swift-dev@swift.org> wrote:
> (A gist-formatted version of this email can be found at 
> https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f)

I’m very, very excited about this. Thank you for the detailed writeup and 
for the consideration of the effects and tradeoffs involved.

> Given that ordering is not fit for human consumption, but rather machine 
> processing, it might as well be fast. The current ordering differs on Darwin 
> and Linux platforms, with Linux in particular suffering from poor performance 
> due to choice of ordering (UCA with DUCET) and older versions of ICU. 
> Instead, [String Comparison 
> Prototype](https://github.com/apple/swift/pull/12115) provides a simpler ordering 
> that allows for many common-case fast-paths and optimizations. For all the 
> Unicode enthusiasts out there, this is the lexicographical ordering of 
> NFC-normalized UTF-16 code units.

Thank you for fixing this.  Your tradeoffs make perfect sense to me.
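
To make the semantics concrete, here is a quick illustration of canonical 
equivalence (my own example, not taken from the proposal): the composed and 
decomposed spellings of “é” compare equal even though their code units 
differ, which is exactly what makes already-normalized fast paths attractive.

let composed = "caf\u{00E9}"     // "é" as a single scalar
let decomposed = "cafe\u{0301}"  // "e" followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)    // true: both normalize to the same NFC form
print(Array(composed.utf16) == Array(decomposed.utf16))  // false: the code units differ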

> ### Small String Optimization
..
> For example, assuming a 16-byte String struct and 8 bits used for flags and 
> discriminators (including discriminators for which small form a String is 
> in), 120 bits are available for a small string payload. 120 bits can hold 7 
> UTF-16 code units, which is sufficient for most graphemes and many common 
> words and separators. 120 bits can also fit 15 ASCII/UTF-8 code units without 
> any packing, which suffices for many programmer/system strings (which have a 
> strong skew towards ASCII).
> 
> We may also want a compact 5-bit encoding for formatted numbers, such as 
> 64-bit memory addresses in hex, `Int.max` in base-10, and `Double` in 
> base-10, which would require 18, 19, and 24 characters respectively. 120 bits 
> with a 5-bit encoding would fit all of these. This would speed up the 
> creation and interpolation of many strings containing numbers.

I think it is important to consider that having more special cases and 
different representations slows down nearly *every* operation on String, 
because each operation has to check for and untangle all of the possible 
representations.  Given the ability to hold 15 digits of ASCII, I don’t see 
why it would be worthwhile to worry about a 5-bit representation for digits.  
String should be an Any!

The tradeoff here is that you’d be biasing the design to favor the creation 
and interpolation of strings containing *large* numbers, at the cost of 
general string performance everywhere.  That doesn’t sound like a good 
tradeoff to me, particularly when people writing extremely 
performance-sensitive code probably won’t find it good enough anyway.
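
To make that cost concrete, here is a minimal sketch (hypothetical names, 
nothing from the prototype) of the discrimination every hot entry point 
would have to perform; each additional small form is one more case in this 
switch:

enum StringForm {
  case smallASCII    // up to 15 ASCII/UTF-8 code units inline
  case smallUTF16    // up to 7 UTF-16 code units inline
  case smallFiveBit  // up to 24 characters of a 5-bit alphabet inline
  case large         // out-of-line storage class
}

func utf16Count(form: StringForm, inlineCount: Int, heapCount: Int) -> Int {
  switch form {
  case .smallASCII:   return inlineCount  // one UTF-16 unit per ASCII byte
  case .smallUTF16:   return inlineCount  // stored directly
  case .smallFiveBit: return inlineCount  // a 5-bit alphabet is all-ASCII anyway
  case .large:        return heapCount    // read from the heap header
  }
}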

> Final details are still in exploration. If the construction and 
> interpretation of small bit patterns can remain behind a resilience barrier, 
> new forms could be added in the future. However, the performance impact of 
> this would need to be carefully evaluated.

I’d love to see performance numbers on this.  Have you considered a design 
where you have exactly one small string representation: a sequence of 15 
UTF-8 bytes?  This holds your 15 bytes of ASCII, probably holds more 
non-ASCII characters on average than 7 UTF-16 code units would, and needs 
only one discriminator branch at the entry point of hot functions that want 
to touch the bytes.  If you have a lot of algorithms that are sped up by 
knowing they have ASCII, you could go with two bits, “is small” and “is 
ASCII”, where the ASCII bit is set in both the small string and normal 
string cases.
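
Concretely, I’m imagining something like the following rough sketch; the 
bit layout and all names are made up for illustration, not a worked-out 
design:

struct StringSketch {
  var low: UInt64   // payload bytes 0...7
  var high: UInt64  // payload bytes 8...14, plus count and flags in the top byte

  var isSmall: Bool { return high >> 63 != 0 }            // bit 63: small form
  var isASCII: Bool { return (high >> 62) & 1 != 0 }      // bit 62: set for small *and* large ASCII
  var smallCount: Int { return Int((high >> 56) & 0xF) }  // 0...15 UTF-8 code units inline

  var utf8Count: Int {
    if isSmall { return smallCount }  // the single discriminator branch
    return largeUTF8Count()           // otherwise consult the storage class
  }

  func largeUTF8Count() -> Int {
    fatalError("large-form storage elided from this sketch")
  }
}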

Finally, what tradeoffs do you see between a one-word and a two-word 
String?  Are we really destined to have two words?  That’s still much better 
than the three words we have now, but for some workloads it is significant 
bloat.


> ### Unmanaged Strings
> 
> Such unmanaged strings can be used when the lifetime of the storage is known 
> to outlive all uses.

Just like StringRef!  +1, this concept can be a huge performance win… but it 
can also be a huge source of UB if used wrong… :-(

> As such, they don’t need to participate in reference counting. In the future, 
> perhaps with ownership annotations or sophisticated static analysis, Swift 
> could convert managed strings into unmanaged ones as an optimization when it 
> knows the contents would not escape. Similarly in the future, a String 
> constructed from a Substring that is known to not outlive the Substring’s 
> owner can use an unmanaged form to skip copying the contents. This would 
> benefit String’s status as the recommended common-currency type for API 
> design.

This could also have implications for StaticString.

> Some kind of unmanaged string construct is an often-requested feature for 
> highly performance-sensitive code. We haven’t thought too deeply about how 
> best to surface this as API and it can be sliced and diced many ways. Some 
> considerations:

Other questions/considerations:
- Here and now, could we get the vast majority of this benefit by having the 
ABI pass String arguments as +0 and guaranteeing their lifetime across 
calls?  What is the tradeoff of that?
- Does this concept even matter if/when we can write an argument as “borrowed 
String”?  I suppose this is a bit more general, in that it should be usable 
for properties and other things that the (initial) ownership design isn’t 
scoped to support, since it doesn’t have lifetimes.
- Array has exactly the same sort of concern; why is this a String-specific 
problem?
- How does this work with mutation?  Wouldn’t we really have to have 
something like MutableArrayRef, since it’s talking about the mutability of 
the elements, not something that var/let can support?

When I think about this, it seems like a “non-owning *slice*” of some sort.  
If you went that direction, then you wouldn’t have to build it into the 
String currency type, just as StringRef isn’t built into LLVM’s.
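
In Swift terms it might look something like this.  It is just a sketch with 
made-up names, and it carries StringRef’s exact safety contract: the caller 
guarantees the bytes outlive every use.

struct UnmanagedUTF8Slice {
  let bytes: UnsafeBufferPointer<UInt8>  // not owned, not retained, never copied

  var count: Int { return bytes.count }

  func hasPrefix(_ prefix: UnmanagedUTF8Slice) -> Bool {
    guard bytes.count >= prefix.bytes.count else { return false }
    // zip stops at the end of the shorter sequence, i.e. the prefix.
    return !zip(bytes, prefix.bytes).contains(where: { $0.0 != $0.1 })
  }

  func dropFirst(_ n: Int) -> UnmanagedUTF8Slice {
    let k = Swift.min(n, bytes.count)
    return UnmanagedUTF8Slice(
      bytes: UnsafeBufferPointer(start: bytes.baseAddress.map { $0 + k },
                                 count: bytes.count - k))
  }
}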

> ### Performance Flags

Nice.

> ### Miscellaneous
> 
> There are many other tweaks and improvements possible under the new ABI prior 
> to stabilization, such as:
> 
> * Guaranteed nul-termination for String’s storage classes for faster C 
> bridging.

This is probably best as another perf flag.
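
That is, something with roughly this shape, where the flag lets bridging 
skip the copy.  MyString and its members here are hypothetical stand-ins for 
the real internals:

struct MyString {
  var isNulTerminated: Bool       // the proposed perf flag
  var base: UnsafePointer<CChar>  // start of the UTF-8 storage
  var asString: String            // fallback for the copying path
}

func withCString<R>(_ s: MyString, _ body: (UnsafePointer<CChar>) -> R) -> R {
  if s.isNulTerminated {
    return body(s.base)                // zero-copy: storage already ends in NUL
  }
  return s.asString.withCString(body)  // slow path: copy out and append a NUL
}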

> * Unification of String and Substring’s various Views.
> * Some form of opaque string, such as an existential, which can be used as an 
> extension point.
> * More compact/performant indices.

What is the current theory on eager vs. lazy bridging with Objective-C?  Can 
we get to an eager bridging model?  That alone would dramatically simplify 
and speed up every operation on a Swift String.

> ## String Ergonomics

I need to run now, but I’ll read and comment on this section when I have a 
chance.

That said, just today I had to write the code below, and the ‘charToByte’ 
part of it is absolutely tragic.  Are there any thoughts about how to 
improve character literal processing?

-Chris

func decodeHex(_ string: String) -> [UInt8] {
  func charToByte(_ c: String) -> UInt8 {
    return c.utf8.first! // swift makes char processing grotty. :-(
  }

  func hexToInt(_ c: UInt8) -> UInt8 {
    switch c {
    case charToByte("0")...charToByte("9"): return c - charToByte("0")
    case charToByte("a")...charToByte("f"): return c - charToByte("a") + 10
    case charToByte("A")...charToByte("F"): return c - charToByte("A") + 10
    default: fatalError("invalid hexadecimal character")
    }
  }

  var result: [UInt8] = []

  assert(string.count & 1 == 0, "must have an even number of hexadecimal characters")
  var it = string.utf8.makeIterator()
  while let byte1 = it.next(),
        let byte2 = it.next() {
    result.append((hexToInt(byte1) << 4) | hexToInt(byte2))
  }

  return result
}

print(decodeHex("01FF"))
print(decodeHex("8001"))
print(decodeHex("80a1bcd3"))
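
For reference, the standard library’s UInt8(ascii:) initializer tightens up 
the helper a little, though it is still clunky next to being able to match 
character literals directly:

func hexToInt(_ c: UInt8) -> UInt8 {
  switch c {
  case UInt8(ascii: "0")...UInt8(ascii: "9"): return c - UInt8(ascii: "0")
  case UInt8(ascii: "a")...UInt8(ascii: "f"): return c - UInt8(ascii: "a") + 10
  case UInt8(ascii: "A")...UInt8(ascii: "F"): return c - UInt8(ascii: "A") + 10
  default: fatalError("invalid hexadecimal character")
  }
}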



