Re: [swift-evolution] Strings in Swift 4

Dave Abrahams via swift-evolution Tue, 24 Jan 2017 14:51:16 -0800

on Tue Jan 24 2017, Olivier Tardieu <tardieu-AT-us.ibm.com> wrote:

> Thanks for the great write-up!
>
> The manifesto recognizes the importance of machine processing and 
> performance.
> I am surprised that there is no mention of any kind of "unsafe" strings or 
> string processing.
> In general, Swift does an amazing job at incorporating unsafe mechanisms 
> into a safe-by-default programming paradigm.
> But for some reason, Strings seem to be left out of the unsafe
> discussion.


Maybe it wasn't clear from the document, but the intention is that
String would be able to use any model of Unicode as a backing store, and
that you could easily build unsafe models of Unicode... but also that
you could use your unsafe model of Unicode directly, in string-ish ways.

> A lot of machine processing of strings continues to deal with 8-bit 
> quantities (even 7-bit quantities, not UTF-8).
> Swift strings are not very good at that. I see progress in the manifesto 
> but nothing to really close the performance gap with C.
> That's where "unsafe" mechanisms could come into play.

extendedASCII is supposed to address that.  Given a smart enough
optimizer, it should be possible to become competitive with C even
without using unsafe constructs.  However, we recognize the importance
of being able to squeeze out that last bit of performance by dropping
down to unsafe storage.
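
To illustrate the kind of code-unit-level processing this is aimed at,
here is a minimal sketch.  It does not use extendedASCII itself (that is
a proposed name from the manifesto and is not assumed to exist here); it
relies on the existing `utf8` view instead, and `isASCIIDigits` is a
hypothetical helper name:

```swift
// Hypothetical helper: true when `s` is non-empty and consists only of
// ASCII digits.  It works on UTF-8 code units, so no grapheme breaking
// and no unsafe pointers are involved.
func isASCIIDigits(_ s: String) -> Bool {
    let zero = UInt8(ascii: "0")
    let nine = UInt8(ascii: "9")
    return !s.utf8.isEmpty && s.utf8.allSatisfy { $0 >= zero && $0 <= nine }
}
```

Whether this gets close to C depends on the optimizer and on how the
string is stored, so treat it as a shape for the code rather than a
performance claim.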

> To guarantee Unicode correctness, a C string must be validated or 
> transformed to be considered a Swift string.

Not really.  You can do error-correction on the fly.  However, I think
pre-validation is often worthwhile because once you know something is
valid it's much cheaper to decode correctly (especially for UTF-8).
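
As a rough illustration of the two approaches, here is a sketch that
assumes a toolchain providing `String(decoding:as:)` and
`String(validatingUTF8:)` (newer toolchains may offer a differently
named validating initializer).  The byte values are made up for the
example:

```swift
// 0xFF can never appear in well-formed UTF-8.
let bytes: [UInt8] = [0x68, 0x69, 0xFF]   // "h", "i", then an invalid byte

// Error-correction on the fly: ill-formed input is replaced by U+FFFD.
let repaired = String(decoding: bytes, as: UTF8.self)

// Pre-validation: a nil result means the input was not valid UTF-8.
let cString = bytes.map { CChar(truncatingIfNeeded: $0) } + [0]
let validated = cString.withUnsafeBufferPointer {
    String(validatingUTF8: $0.baseAddress!)
}
```

Here `repaired` is a usable string containing a replacement character,
while `validated` is nil because the bytes fail validation.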

> If I understand the C String interop section correctly, in Swift 4,
> this should not force a copy, but traversing the string is still
> required.  

*What* should not force a copy?

> I hope I am correct about the no-copy thing, and I would also like to
> permit promoting C strings to Swift strings without validation.  This
> is obviously unsafe in general, but I know my strings... and I care
> about performance. ;)

We intend to support that use-case.  That's part of the reason for the
ValidUTF8 and ValidUTF16 encodings you see here:
https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L598
and here:
https://github.com/apple/swift/blob/unicode-rethink/stdlib/public/core/Unicode2.swift#L862

> More importantly, it is not possible to mutate bytes in a Swift string
> at will.  Again it makes sense from the point of view of always
> correct Unicode sequences.  But it does not for machine processing of
> C strings with C-like performance.  Today, I can cheat using a
> "_public" API for this, i.e., myString._core._baseAddress!.  This
> should be doable from an official "unsafe" API.

We intend to support that use-case.
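
Until such an API exists, one memory-safe approximation is to mutate a
byte array and convert it to a String at the end.  This is a sketch
only, not the unsafe API under discussion: `httpTime` is a hypothetical
helper, every write is bounds-checked, and the final conversion still
validates and copies the bytes.

```swift
// Hypothetical helper: formats "HH:MM:SS GMT" by overwriting bytes in a
// fixed-size template, in the spirit of the memcpy version quoted below.
func httpTime(hour: Int, min: Int, sec: Int) -> String {
    precondition([hour, min, sec].allSatisfy { (0..<100).contains($0) },
                 "each field must fit in two digits")
    var buffer = Array("XX:XX:XX GMT".utf8)
    let zero = UInt8(ascii: "0")

    // Writes `value` as two ASCII digits starting at `offset`.
    func put(_ value: Int, at offset: Int) {
        buffer[offset] = zero + UInt8(value / 10)
        buffer[offset + 1] = zero + UInt8(value % 10)
    }

    put(hour, at: 0)
    put(min, at: 3)
    put(sec, at: 6)
    return String(decoding: buffer, as: UTF8.self)
}
```

For example, `httpTime(hour: 9, min: 5, sec: 42)` yields "09:05:42 GMT".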

> Memory safety is also at play here, as well as ownership.  A proper
> API could guarantee the backing store is writable for instance, that
> it is not shared.  A memory-safe but not unicode-safe API could do
> bounds checks.
>
> While low-level C string processing can be done using unsafe memory
> buffers with performance, the lack of bridging with "real" Swift
> strings kills the deal.  No literals syntax (or costly coercions),
> none of the many useful string APIs.
>
> To illustrate these points here is a simple experiment: code written
> to synthesize an http date string from a bunch of integers.  There are
> four versions of the code going from nice high-level Swift code to
> low-level C-like code.  (Some of this code is also about avoiding ARC
> overheads, and string interpolation overheads, hence the four
> versions.)
>
> On my macbook pro (swiftc -O), the performance is as follows:
>
> interpolation + func:  2.303032365s
> interpolation + array: 1.224858418s
> append:                0.918512377s
> memcpy:                0.182104674s
>
> While the benchmarking could be done more carefully, I think the main
> observation is valid.  The nice code is more than 10x slower than the
> C-like code.  Moreover, the ugly-but-still-valid-Swift code is still
> about 5x slower than the C-like code.  For some applications, e.g. web
> servers, these kinds of numbers matter...
>
> Some of the proposed improvements would help with this, e.g., small
> strings optimization, and maybe changes to the concatenation
> semantics.  But it seems to me that a big performance gap will remain.
> (Concatenation even with strncat is significantly slower than memcpy
> for fixed-size strings.)
>
> I believe there is a need and an opportunity for a fast "less safe"
> String API.  I hope it will be on the roadmap soon.

I think it's already in the roadmap...the one that's in my head.  If you
want to submit a PR with amendments to the manifesto, that'd be great.
Also thanks very much for the example below; we'll definitely
be referring to it as we proceed forward.

>
>
> Best,
>
> Olivier
>
> import Foundation
>
> // get current date as a series of integers
> // (could be done differently... faster... not the topic)
>
> var theTime = time(nil)
> var timeStruct = tm()
> gmtime_r(&theTime, &timeStruct)
> let wday = Int(timeStruct.tm_wday)
> let mday = Int(timeStruct.tm_mday)
> let mon = Int(timeStruct.tm_mon)
> let year = Int(timeStruct.tm_year) + 1900
> let hour = Int(timeStruct.tm_hour)
> let min = Int(timeStruct.tm_min)
> let sec = Int(timeStruct.tm_sec)
>
> let months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
>               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
>
> let days = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
>
> func twoDigit(_ num: Int) -> String {
>     return (num < 10 ? "0" : "") + String(num)
> }
>
> let twoDigit = ["00", "01", "02", "03", "04", "05", "06", "07", "08", "09",
>                 "10", "11", "12", "13", "14", "15", "16", "17", "18", "19",
>                 "20", "21", "22", "23", "24", "25", "26", "27", "28", "29",
>                 "30", "31", "32", "33", "34", "35", "36", "37", "38", "39",
>                 "40", "41", "42", "43", "44", "45", "46", "47", "48", "49",
>                 "50", "51", "52", "53", "54", "55", "56", "57", "58", "59",
>                 "60", "61", "62", "63", "64", "65", "66", "67", "68", "69",
>                 "70", "71", "72", "73", "74", "75", "76", "77", "78", "79",
>                 "80", "81", "82", "83", "84", "85", "86", "87", "88", "89",
>                 "90", "91", "92", "93", "94", "95", "96", "97", "98", "99"]
>
> // interpolation + func
>
> func httpDate() -> String {
>     return "\(days[wday]), \(twoDigit(mday)) \(months[mon]) \(year) \(twoDigit(hour)):\(twoDigit(min)):\(twoDigit(sec)) GMT"
> }
>
> // interpolation + array
>
> func httpDate1() -> String {
>     return "\(days[wday]), \(twoDigit[mday]) \(months[mon]) \(year) \(twoDigit[hour]):\(twoDigit[min]):\(twoDigit[sec]) GMT"
> }
>
> // append + array
>
> func httpDate2() -> String {
>     var s = days[wday]
>     s.append(", ")
>     s.append(twoDigit[mday])
>     s.append(" ")
>     s.append(months[mon])
>     s.append(" ")
>     s.append(twoDigit[year/100])
>     s.append(twoDigit[year%100])
>     s.append(" ")
>     s.append(twoDigit[hour])
>     s.append(":")
>     s.append(twoDigit[min])
>     s.append(":")
>     s.append(twoDigit[sec])
>     s.append(" GMT")
>     return s
> }
>
> // memcpy + array
>
> func httpDate3() -> String {
>     var s = "XXX, XX XXX XXXX XX:XX:XX GMT"
>     s.append("") // force alloc
>     let ptr = s._core._baseAddress!
>     memcpy(ptr, days[wday]._core._baseAddress!, 3)
>     memcpy(ptr.advanced(by: 8), months[mon]._core._baseAddress!, 3)
>     memcpy(ptr.advanced(by: 5), twoDigit[mday]._core._baseAddress!, 2)
>     memcpy(ptr.advanced(by: 12), twoDigit[year/100]._core._baseAddress!, 2)
>     memcpy(ptr.advanced(by: 14), twoDigit[year%100]._core._baseAddress!, 2)
>     memcpy(ptr.advanced(by: 17), twoDigit[hour]._core._baseAddress!, 2)
>     memcpy(ptr.advanced(by: 20), twoDigit[min]._core._baseAddress!, 2)
>     memcpy(ptr.advanced(by: 23), twoDigit[sec]._core._baseAddress!, 2)
>     return s
> }
>
> var s = ""
>
> var now = mach_absolute_time()
> for _ in 0..<1000000 {
>     s = httpDate()
> }
> print(s)
> print("interpolation + func: \(Double(mach_absolute_time() - now) / 1e9)s\n")
>
> now = mach_absolute_time()
> for _ in 0..<1000000 {
>     s = httpDate1()
> }
> print(s)
> print("interpolation + array: \(Double(mach_absolute_time() - now) / 1e9)s\n")
>
> now = mach_absolute_time()
> for _ in 0..<1000000 {
>     s = httpDate2()
> }
> print(s)
> print("append: \(Double(mach_absolute_time() - now) / 1e9)s\n")
>
> now = mach_absolute_time()
> for _ in 0..<1000000 {
>     s = httpDate3()
> }
> print(s)
> print("memcpy: \(Double(mach_absolute_time() - now) / 1e9)s\n")
>
> From:   Ben Cohen via swift-evolution
> To:     swift-evolution
> Cc:     Dave Abrahams
> Date:   01/19/2017 09:56 PM
> Subject:        [swift-evolution] Strings in Swift 4
>
> Hi all,
>
> Below is our take on a design manifesto for Strings in Swift 4 and beyond.
>
> Probably best read in rendered markdown on GitHub:
> https://github.com/apple/swift/blob/master/docs/StringManifesto.md
>
> We’re eager to hear everyone’s thoughts.
>
> Regards,
> Ben and Dave
>
> # String Processing For Swift 4
>
> * Authors: [Dave Abrahams](https://github.com/dabrahams), [Ben Cohen](https://github.com/airspeedswift)
>
> The goal of re-evaluating Strings for Swift 4 has been fairly ill-defined thus
> far, with just this short blurb in the
> [list of goals](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html):
>
>> **String re-evaluation**: String is one of the most important fundamental
>> types in the language.  The standard library leads have numerous ideas of how
>> to improve the programming model for it, without jeopardizing the goals of
>> providing a unicode-correct-by-default model.  Our goal is to be better at
>> string processing than Perl!
>
> For Swift 4 and beyond we want to improve three dimensions of text 
> processing:
>
>   1. Ergonomics
>   2. Correctness
>   3. Performance
>
> This document is meant to both provide a sense of the long-term vision 
> (including undecided issues and possible approaches), and to define the 
> scope of
> work that could be done in the Swift 4 timeframe.
>
> ## General Principles
>
> ### Ergonomics
>
> It's worth noting that ergonomics and correctness are 
> mutually-reinforcing.  An
> API that is easy to use—but incorrectly—cannot be considered an ergonomic
> success.  Conversely, an API that's simply hard to use is also hard to use
> correctly.  Achieving optimal performance without compromising ergonomics or
> correctness is a greater challenge.
>
> Consistency with the Swift language and idioms is also important for
> ergonomics. There are several places both in the standard library and in 
> the
> foundation additions to `String` where patterns and practices found 
> elsewhere
> could be applied to improve usability and familiarity.
>
> ### API Surface Area
>
> Primary data types such as `String` should have APIs that are easily 
> understood
> given a signature and a one-line summary.  Today, `String` fails that 
> test.  As
> you can see, the Standard Library and Foundation both contribute 
> significantly to
> its overall complexity.
>
> **Method Arity** | **Standard Library** | **Foundation**
> ---|:---:|:---:
> 0: `ƒ()` | 5 | 7
> 1: `ƒ(:)` | 19 | 48
> 2: `ƒ(::)` | 13 | 19
> 3: `ƒ(:::)` | 5 | 11
> 4: `ƒ(::::)` | 1 | 7
> 5: `ƒ(:::::)` | - | 2
> 6: `ƒ(::::::)` | - | 1
>
> **API Kind** | **Standard Library** | **Foundation**
> ---|:---:|:---:
> `init` | 41 | 18
> `func` | 42 | 55
> `subscript` | 9 | 0
> `var` | 26 | 14
>
> **Total: 205 APIs**
>
> By contrast, `Int` has 80 APIs, none with more than two parameters.[0] 
> String processing is complex enough; users shouldn't have
> to press through physical API sprawl just to get started.
>
> Many of the choices detailed below contribute to solving this problem,
> including:
>
>   * Restoring `Collection` conformance and dropping the `.characters` view.
>   * Providing a more general, composable slicing syntax.
>   * Altering `Comparable` so that parameterized
>     (e.g. case-insensitive) comparison fits smoothly into the basic syntax.
>   * Clearly separating language-dependent operations on text produced
>     by and for humans from language-independent
>     operations on text produced by and for machine processing.
>   * Relocating APIs that fall outside the domain of basic string processing and
>     discouraging the proliferation of ad-hoc extensions.
>
> ### Batteries Included
>
> While `String` is available to all programs out-of-the-box, crucial APIs 
> for
> basic string processing tasks are still inaccessible until `Foundation` is
> imported.  While it makes sense that `Foundation` is needed for domain-specific
> jobs such as
> [linguistic tagging](https://developer.apple.com/reference/foundation/nslinguistictagger),
> one should not need to import anything to, for example, do case-insensitive
> comparison.
>
> ### Unicode Compliance and Platform Support
>
> The Unicode standard provides a crucial objective reference point for what
> constitutes correct behavior in an extremely complex domain, so
> Unicode-correctness is, and will remain, a fundamental design principle 
> behind
> Swift's `String`.  That said, the Unicode standard is an evolving 
> document, so
> this objective reference-point is not fixed.[1] While
> many of the most important operations—e.g. string hashing, equality, and
> non-localized comparison—will be stable, the semantics
> of others, such as grapheme breaking and localized comparison and case
> conversion, are expected to change as platforms are updated, so programs 
> should
> be written so their correctness does not depend on precise stability of 
> these
> semantics across OS versions or platforms.  Although it may be possible to
> imagine static and/or dynamic analysis tools that will help users find 
> such
> errors, the only sure way to deal with this fact of life is to educate 
> users.
>
> ## Design Points
>
> ### Internationalization
>
> There is strong evidence that developers cannot determine how to use
> internationalization APIs correctly.  Although documentation could and 
> should be
> improved, the sheer size, complexity, and diversity of these APIs is a 
> major
> contributor to the problem, causing novices to tune out, and more 
> experienced
> programmers to make avoidable mistakes.
>
> The first step in improving this situation is to regularize all localized
> operations as invocations of normal string operations with extra
> parameters. Among other things, this means:
>
> 1. Doing away with `localizedXXX` methods 
> 2. Providing a terse way to name the current locale as a parameter
> 3. Automatically adjusting defaults for options such
>    as case sensitivity based on whether the operation is localized.
> 4. Removing correctness traps like `localizedCaseInsensitiveCompare` (see
>     guidance in the
>     [Internationalization and Localization Guide](https://developer.apple.com/library/content/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html)).
>
> Along with appropriate documentation updates, these changes will make 
> localized
> operations more teachable, comprehensible, and approachable, thereby 
> lowering a
> barrier that currently leads some developers to ignore localization issues
> altogether.
>
> ####  The Default Behavior of `String`
>
> Although this isn't well-known, the most accessible forms of many operations on
> Swift `String` (and `NSString`) are really only appropriate for text that is
> intended to be processed for, and consumed by, machines.  The semantics of the
> operations with the simplest spellings are always non-localized and
> language-agnostic.
>
> Two major factors play into this design choice:
>
> 1. Machine processing of text is important, so we should have first-class,
>    accessible functions appropriate to that use case.
>
> 2. The most general localized operations require a locale parameter not required
>    by their un-localized counterparts.  This naturally skews complexity towards
>    localized operations.
>
> Reaffirming that `String`'s simplest APIs have
> language-independent/machine-processed semantics has the benefit of 
> clarifying
> the proper default behavior of operations such as comparison, and allows 
> us to
> make [significant optimizations](#collation-semantics) that were 
> previously
> thought to conflict with Unicode.
>
> #### Future Directions
>
> One of the most common internationalization errors is the unintentional
> presentation to users of text that has not been localized, but 
> regularizing APIs
> and improving documentation can go only so far in preventing this error.
> Combined with the fact that `String` operations are non-localized by 
> default,
> the environment for processing human-readable text may still be somewhat
> error-prone in Swift 4.
>
> For an audience of mostly non-experts, it is especially important that 
> naïve
> code is very likely to be correct if it compiles, and that more 
> sophisticated
> issues can be revealed progressively.  For this reason, we intend to
> specifically and separately target localization and internationalization
> problems in the Swift 5 timeframe.
>
> ### Operations With Options
>
> There are three categories of common string operations that often need to be
> tuned in various dimensions:
>
> **Operation**|**Applicable Options**
> ---|---
> sort ordering | locale, case/diacritic/width-insensitivity
> case conversion | locale
> pattern matching | locale, case/diacritic/width-insensitivity
>
> The defaults for case-, diacritic-, and width-insensitivity are different 
> for
> localized operations than for non-localized operations, so for example a
> localized sort should be case-insensitive by default, and a non-localized 
> sort
> should be case-sensitive by default.  We propose a standard “language” of
> defaulted parameters to be used for these purposes, with usage roughly 
> like this:
>
> ```swift
>   x.compared(to: y, case: .sensitive, in: swissGerman)
>
>   x.lowercased(in: .currentLocale)
>
>   x.allMatches(
>     somePattern, case: .insensitive, diacritic: .insensitive)
> ```
>
> This usage might be supported by code like this:
>
> ```swift
> enum StringSensitivity {
>   case sensitive
>   case insensitive
> }
>
> extension Locale {
>   static var currentLocale: Locale { ... }
> }
>
> extension Unicode {
>   // An example of the option language in declaration context,
>   // with nil defaults indicating unspecified, so defaults can be
>   // driven by the presence/absence of a specific Locale
>   func frobnicated(
>     case caseSensitivity: StringSensitivity? = nil,
>     diacritic diacriticSensitivity: StringSensitivity? = nil,
>     width widthSensitivity: StringSensitivity? = nil,
>     in locale: Locale? = nil
>   ) -> Self { ... }
> }
> ```
>
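
The nil-means-unspecified defaults above could be resolved along these
lines.  This is a minimal sketch under assumptions: `SketchLocale` is a
hypothetical stand-in for `Locale` (to keep the example self-contained),
and `resolvedCaseSensitivity` is an illustrative helper, not a proposed
API.

```swift
enum StringSensitivity { case sensitive, insensitive }

// Hypothetical stand-in for Foundation's Locale.
struct SketchLocale { let identifier: String }

// An explicit choice always wins.  Otherwise the default follows the
// rule stated in the text: localized operations are case-insensitive,
// non-localized operations are case-sensitive.
func resolvedCaseSensitivity(
    _ requested: StringSensitivity?,
    in locale: SketchLocale?
) -> StringSensitivity {
    if let requested = requested { return requested }
    return locale == nil ? .sensitive : .insensitive
}
```

With no arguments specified the result is `.sensitive`; passing only a
locale flips the default to `.insensitive`, and an explicit `case:`
argument overrides either default.
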
> ### Comparing and Hashing Strings
>
> #### Collation Semantics
>
> What Unicode says about collation—which is used in `<`, `==`, and hashing— 
> turns
> out to be quite interesting, once you pick it apart.  The full Unicode 
> Collation
> Algorithm (UCA) works like this:
>
> 1. Fully normalize both strings
> 2. Convert each string to a sequence of numeric triples to form a collation key
> 3. “Flatten” the key by concatenating the sequence of first elements to the
>    sequence of second elements to the sequence of third elements
> 4. Lexicographically compare the flattened keys 
>
> While step 1 can usually
> be [done quickly](http://unicode.org/reports/tr15/#Description_Norm) and
> incrementally, step 2 uses a collation table that maps matching 
> *sequences* of
> unicode scalars in the normalized string to *sequences* of triples, which 
> get
> accumulated into a collation key.  Predictably, this is where the real 
> costs
> lie.
>
> *However*, there are some bright spots to this story.  First, as it turns 
> out,
> string sorting (localized or not) should be done down to what's called the
> [“identical” level](http://unicode.org/reports/tr10/#Multi_Level_Comparison),
> which adds a step 3a: append the string's normalized form to the flattened
> collation key.  At first blush this just adds work, but consider what it 
> does
> for equality: two strings that normalize the same, naturally, will collate 
> the
> same.  But also, *strings that normalize differently will always collate
> differently*.  In other words, for equality, it is sufficient to compare 
> the
> strings' normalized forms and see if they are the same.  We can therefore
> entirely skip the expensive part of collation for equality comparison.
>
> Next, naturally, anything that applies to equality also applies to 
> hashing: it
> is sufficient to hash the string's normalized form, bypassing collation 
> keys.
> This should provide significant speedups over the current implementation.
> Perhaps more importantly, since comparison down to the “identical” level 
> applies
> even to localized strings, it means that hashing and equality can be 
> implemented
> exactly the same way for localized and non-localized text, and hash tables 
> with
> localized keys will remain valid across current-locale changes.
>
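
The observable effect of comparing and hashing by normalized form can be
seen with two canonically equivalent spellings of the same text.  This
sketch only demonstrates the behavior of `String` equality and hashing;
it says nothing about how either is implemented internally.

```swift
let precomposed = "\u{E9}"     // é as a single scalar
let decomposed = "e\u{301}"    // e followed by COMBINING ACUTE ACCENT

// The scalar sequences differ...
let scalarsDiffer =
    !precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)

// ...but the strings are equal, hash alike, and collapse to one Set entry.
let stringsEqual = precomposed == decomposed
let hashesEqual = precomposed.hashValue == decomposed.hashValue
let uniqueKeys = Set([precomposed, decomposed]).count
```
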
> Finally, once it is agreed that the *default* role for `String` is to 
> handle
> machine-generated and machine-readable text, the default ordering of 
> `String`s
> need no longer use the UCA at all.  It is sufficient to order them in any 
> way
> that's consistent with equality, so `String` ordering can simply be a
> lexicographical comparison of normalized forms,[4]
> (which is equivalent to lexicographically comparing the sequences of 
> grapheme
> clusters), again bypassing step 2 and offering another speedup.
>
> This leaves us executing the full UCA *only* for localized sorting, and 
> ICU's
> implementation has apparently been very well optimized.
>
> Following this scheme everywhere would also allow us to make sorting 
> behavior
> consistent across platforms.  Currently, we sort `String` according to the 
> UCA,
> except that—*only on Apple platforms*—pairs of ASCII characters are 
> ordered by
> unicode scalar value.
>
> #### Syntax
>
> Because the current `Comparable` protocol expresses all comparisons with 
> binary
> operators, string comparisons—which may require
> additional [options](#operations-with-options)—do not fit smoothly into 
> the
> existing syntax.  At the same time, we'd like to solve other problems with
> comparison, as outlined in
> [this proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e)
> (implemented by changes at the head of
> [this branch](https://github.com/CodaFi/swift/commits/space-the-final-frontier)).
> We should adopt a modification of that proposal that uses a method rather than
> an operator `<=>`:
>
> ```swift
> enum SortOrder { case before, same, after }
>
> protocol Comparable : Equatable {
>  func compared(to: Self) -> SortOrder
>  ...
> }
> ```
>
> This change will give us a syntactic platform on which to implement 
> methods with
> additional, defaulted arguments, thereby unifying and regularizing 
> comparison
> across the library.
>
> ```swift
> extension String {
>  func compared(to: Self) -> SortOrder
>
> }
> ```
>
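
A rough sketch of what a method-based three-way comparison could look
like in use.  To avoid clashing with the real `Comparable`, it uses a
hypothetical protocol name, `ThreeWayComparable`, and implements the
`String` conformance on top of the existing `==` and `<` operators; the
proposal itself may differ in every detail.

```swift
enum SortOrder { case before, same, after }

// Hypothetical protocol name, standing in for the modified Comparable.
protocol ThreeWayComparable {
    func compared(to other: Self) -> SortOrder
}

extension String: ThreeWayComparable {
    // Built on the existing operators purely for illustration.
    func compared(to other: String) -> SortOrder {
        if self == other { return .same }
        return self < other ? .before : .after
    }
}
```

Because the comparison is an ordinary method, extra defaulted parameters
(case sensitivity, locale, and so on) could be added to it later without
inventing new operators.
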
> **Note:** `SortOrder` should bridge to `NSComparisonResult`.  It's also 
> possible
> that the standard library simply adopts Foundation's `ComparisonResult` as 
> is,
> but we believe the community should at least consider alternate naming 
> before
> that happens.  There will be an opportunity to discuss the choices in 
> detail
> when the modified
> [Comparison Proposal](https://gist.github.com/CodaFi/f0347bd37f1c407bf7ea0c429ead380e) comes
> up for review.
>
> ### `String` should be a `Collection` of `Character`s Again
>
> In Swift 2.0, `String`'s `Collection` conformance was dropped, because we
> convinced ourselves that its semantics differed from those of `Collection` 
> too
> significantly.
>
> It was always well understood that if strings were treated as sequences of
> `UnicodeScalar`s, algorithms such as `lexicographicalCompare`, 
> `elementsEqual`,
> and `reversed` would produce nonsense results. Thus, in Swift 1.0, 
> `String` was
> a collection of `Character` (extended grapheme clusters). During 2.0
> development, though, we realized that correct string concatenation could
> occasionally merge distinct grapheme clusters at the start and end of 
> combined
> strings.
>
> This quirk aside, every aspect of strings-as-collections-of-graphemes 
> appears to
> comport perfectly with Unicode. We think the concatenation problem is 
> tolerable,
> because the cases where it occurs all represent partially-formed 
> constructs. The
> largest class—isolated combining characters such as ◌́ (U+0301 COMBINING 
> ACUTE
> ACCENT)—are explicitly called out in the Unicode standard as
> “[degenerate](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)” or
> “[defective](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf)”. The other
> cases—such as a string ending in a zero-width joiner or half of a regional
> indicator—appear to be equally transient and unlikely outside of a text 
> editor.
>
> Admitting these cases encourages exploration of grapheme composition and 
> is
> consistent with what appears to be an overall Unicode philosophy that “no
> special provisions are made to get marginally better behavior for… cases 
> that
> never occur in practice.”[2] Furthermore, it seems
> unlikely to disturb the semantics of any plausible algorithms. We can 
> handle
> these cases by documenting them, explicitly stating that the elements of a
> `String` are an emergent property based on Unicode rules.
>
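
The concatenation quirk is easy to reproduce.  In this sketch an
isolated combining accent counts as one element on its own, yet
appending it to a base letter produces a single grapheme, so the counts
do not add up:

```swift
let base = "e"
let accent = "\u{301}"          // isolated COMBINING ACUTE ACCENT
let combined = base + accent    // renders as é

// Each operand has one element, but so does the concatenation.
let separateCount = base.count + accent.count
let combinedCount = combined.count
```

Here `separateCount` is 2 while `combinedCount` is 1.
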
> The benefits of restoring `Collection` conformance are substantial: 
>
>   * Collection-like operations encourage experimentation with strings to
>     investigate and understand their behavior. This is useful for teaching 
> new
>     programmers, but also good for experienced programmers who want to
>     understand more about strings/unicode.
>
>   * Extended grapheme clusters form a natural element boundary for Unicode
>     strings.  For example, searching and matching operations will always 
> produce
>     results that line up on grapheme cluster boundaries.
>
>   * Character-by-character processing is a legitimate thing to do in many 
> real
>     use-cases, including parsing, pattern matching, and language-specific
>     transformations such as transliteration.
>
>   * `Collection` conformance makes a wide variety of powerful operations
>     available that are appropriate to `String`'s default role as the 
> vehicle for
>     machine processed text.
>
>     The methods `String` would inherit from `Collection`, where similar to
>     higher-level string algorithms, have the right semantics.  For 
> example,
>     grapheme-wise `lexicographicalCompare`, `elementsEqual`, and 
> application of
>     `flatMap` with case-conversion, produce the same results one would 
> expect
>     from whole-string ordering comparison, equality comparison, and
>     case-conversion, respectively.  `reverse` operates correctly on 
> graphemes,
>     keeping diacritics moored to their base characters and leaving emoji 
> intact.
>     Other methods such as `indexOf` and `contains` make obvious sense. A 
> few
>     `Collection` methods, like `min` and `max`, may not be particularly 
> useful
>     on `String`, but we don't consider that to be a problem worth solving, 
> in
>     the same way that we wouldn't try to suppress `min` and `max` on a
>     `Set([UInt8])` that was used to store IP addresses.
>
>   * Many of the higher-level operations that we want to provide for 
> `String`s,
>     such as parsing and pattern matching, should apply to any 
> `Collection`, and
>     many of the benefits we want for `Collections`, such
>     as unified slicing, should accrue
>     equally to `String`.  Making `String` part of the same protocol 
> hierarchy
>     allows us to write these operations once and not worry about keeping 
> the
>     benefits in sync.
>
>   * Slicing strings into substrings is a crucial part of the vocabulary of
>     string processing, and all other sliceable things are `Collection`s.
>     Because of its collection-like behavior, users naturally think of 
> `String`
>     in collection terms, but run into frustrating limitations where it 
> fails to
>     conform and are left to wonder where all the differences lie.  Many 
> simply
>     “correct” this limitation by declaring a trivial conformance:
>
>     ```swift
>       extension String : BidirectionalCollection {}
>     ```
>
>     Even if we removed indexing-by-element from `String`, users could 
> still do
>     this:
>
>     ```swift
>       extension String : BidirectionalCollection {
>         subscript(i: Index) -> Character { return characters[i] }
>       }
>     ```
>
>     It would be much better to legitimize the conformance to `Collection` 
> and
>     simply document the oddity of any concatenation corner-cases, than to 
> deny
>     users the benefits on the grounds that a few cases are confusing.
>
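
A short sketch of the grapheme-wise behavior described in the list
above, assuming a toolchain in which `String` is a `Collection` of
`Character`:

```swift
let word = "cafe\u{301}"                 // "café" with a decomposed é
let precomposedWord = "caf\u{E9}"        // same text, precomposed é

// Reversal keeps the accent attached to its base letter.
let reversed = String(word.reversed())

// Element-wise operations agree with whole-string semantics.
let accented: Character = "\u{E9}"
let hasAccented = word.contains(accented)
let sameElements = word.elementsEqual(precomposedWord)
```

The word has four elements rather than five, reversing it gives "éfac",
and element-wise comparison treats the two spellings of é as the same
`Character`.
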
> Note that the fact that `String` is a collection of graphemes does *not* 
> mean
> that string operations will necessarily have to do grapheme boundary
> recognition.  See the Unicode protocol section for details.
>
> ### `Character` and `CharacterSet`
>
> `Character`, which represents a Unicode
> [extended grapheme cluster](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries),
> is a bit of a black box, requiring conversion to `String` in order to
> do any introspection, including interoperation with ASCII.  To fix this, we should:
>
>  - Add a `unicodeScalars` view much like `String`'s, so that the sub-structure
>    of grapheme clusters is discoverable.
>  - Add a failable `init` from sequences of scalars (returning nil for sequences
>    that contain 0 or 2+ graphemes).
>  - (Lower priority) expose some operations, such as `func uppercase() ->
>    String`, `var isASCII: Bool`, and, to the extent they can be sensibly
>    generalized, queries of unicode properties that should also be exposed on
>    `UnicodeScalar` such as `isAlphabetic` and `isGraphemeBase`.
>
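
The failable initializer in the second bullet could behave roughly like
the free function below.  `makeCharacter(from:)` is a hypothetical name
used for illustration; it goes through `String` to count graphemes,
which is exactly the detour the proposal wants to make unnecessary.

```swift
// Hypothetical helper: returns a Character only when the scalars form
// exactly one extended grapheme cluster, and nil for 0 or 2+ graphemes.
func makeCharacter(from scalars: [Unicode.Scalar]) -> Character? {
    var view = String.UnicodeScalarView()
    view.append(contentsOf: scalars)
    let string = String(view)
    return string.count == 1 ? string.first : nil
}

let eAcute = makeCharacter(from: ["e", "\u{301}"])   // one grapheme
let none = makeCharacter(from: [])                    // zero graphemes
let tooMany = makeCharacter(from: ["a", "b"])         // two graphemes
```
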
> Despite its name, `CharacterSet` currently operates on the Swift 
> `UnicodeScalar`
> type. This means it is usable on `String`, but only by going through the 
> unicode
> scalar view. To deal with this clash in the short term, `CharacterSet` 
> should be
> renamed to `UnicodeScalarSet`.  In the longer term, it may be appropriate 
> to
> introduce a `CharacterSet` that provides similar functionality for 
> extended
> grapheme clusters.[5]
>
> ### Unification of Slicing Operations
>
> Creating substrings is a basic part of String processing, but the slicing
> operations that we have in Swift are inconsistent in both their spelling 
> and
> their naming: 
>
>   * Slices with two explicit endpoints are done with subscript, and support
>     in-place mutation:
>
>     ```swift
>         s[i..<j].mutate()
>     ```
>
>   * Slicing from an index to the end, or from the start to an index, is done
>     with a method and does not support in-place mutation:
>
>     ```swift
>         s.prefix(upTo: i).readOnly()
>     ```
>
> Prefix and suffix operations should be migrated to be subscripting operations
> with one-sided ranges i.e. `s.prefix(upTo: i)` should become `s[..<i]`, as in
> [this proposal](https://github.com/apple/swift-evolution/blob/9cf2685293108ea3efcbebb7ee6a8618b83d4a90/proposals/0132-sequence-end-ops.md).
> With generic subscripting in the language, that will allow us to collapse 
> a wide
> variety of methods and subscript overloads into a single implementation, 
> and
> give users an easy-to-use and composable way to describe subranges.
>
> Further extending this EDSL to integrate use-cases like 
> `s.prefix(maxLength: 5)`
> is an ongoing research project that can be considered part of the 
> potential
> long-term vision of text (and collection) processing.
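>
> A short sketch of how the unified spelling might read, assuming a toolchain
> in which one-sided range subscripts are available on `String`. The variable
> names are illustrative only.
>
> ```swift
> let s = "Hello, world"
> let comma = s.firstIndex(of: ",")!  // known to exist in this literal
>
> let head = s[..<comma]     // instead of s.prefix(upTo: comma)
> let through = s[...comma]  // instead of s.prefix(through: comma)
> let tail = s[comma...]     // instead of s.suffix(from: comma)
> ```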
>
> ### Substrings
>
> When implementing substring slicing, languages are faced with three 
> options:
>
> 1. Make the substrings the same type as string, and share storage.
> 2. Make the substrings the same type as string, and copy storage when 
> making the substring.
> 3. Make substrings a different type, with a storage copy on conversion to 
> string.
>
> We think number 3 is the best choice. A walk-through of the tradeoffs 
> follows.
>
> #### Same type, shared storage
>
> In Swift 3.0, slicing a `String` produces a new `String` that is a view 
> into a
> subrange of the original `String`'s storage. This is why `String` is 3 
> words in
> size (the start, length and buffer owner), unlike the similar `Array` type
> which is only one.
>
> This is a simple model with big efficiency gains when chopping up strings 
> into
> multiple smaller strings. But it does mean that a stored substring keeps 
> the
> entire original string buffer alive even after it would normally have been
> released.
>
> This arrangement has proven to be problematic in other programming 
> languages,
> because applications sometimes extract small strings from large ones and 
> keep
> those small strings long-term. That is considered a memory leak and was 
> enough
> of a problem in Java that they changed from substrings sharing storage to
> making a copy in 1.7.
>
> #### Same type, copied storage
>
> Copying of substrings is also the choice made in C#, and in the default
> `NSString` implementation. This approach avoids the memory leak issue, but 
> has
> obvious performance overhead in performing the copies.
>
> This in turn encourages trafficking in string/range pairs instead of in
> substrings, for performance reasons, leading to API challenges. For 
> example:
>
> ```swift
> foo.compare(bar, range: start..<end)
> ```
>
> Here, it is not clear whether `range` applies to `foo` or `bar`. This
> relationship is better expressed in Swift as a slicing operation:
>
> ```swift
> foo[start..<end].compare(bar)
> ```
>
> Not only does this clarify to which string the range applies, it also 
> brings
> this sub-range capability to any API that operates on `String` "for free". 
> So
> these other combinations also work equally well:
>
> ```swift
> // apply range on argument rather than target
> foo.compare(bar[start..<end])
> // apply range on both
> foo[start..<end].compare(bar[start1..<end1])
> // compare two strings ignoring first character
> foo.dropFirst().compare(bar.dropFirst())
> ```
>
> In all three cases, an explicit range argument need not appear on the 
> `compare`
> method itself. The implementation of `compare` does not need to know 
> anything
> about ranges. Methods need only take range arguments when that was an
> integral part of their purpose (for example, setting the start and end of 
> a
> user's current selection in a text box).
>
> #### Different type, shared storage
>
> The desire to share underlying storage while preventing accidental memory 
> leaks
> occurs with slices of `Array`. For this reason we have an `ArraySlice` 
> type.
> The inconvenience of a separate type is mitigated by most operations used 
> on
> `Array` from the standard library being generic over `Sequence` or 
> `Collection`.
>
> We should apply the same approach for `String` by introducing a distinct
> `SubSequence` type, `Substring`. Similar advice given for `ArraySlice` 
> would apply to `Substring`:
>
>> Important: Long-term storage of `Substring` instances is discouraged. A
>> substring holds a reference to the entire storage of a larger string, not
>> just to the portion it presents, even after the original string's lifetime
>> ends. Long-term storage of a `Substring` may therefore prolong the lifetime
>> of large strings that are no longer otherwise accessible, which can appear
>> to be memory leakage.
>
> When assigning a `Substring` to a longer-lived variable (usually a stored
> property) explicitly of type `String`, a type conversion will be 
> performed, and
> at this point the substring buffer is copied and the original string's 
> storage
> can be released.
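>
> The following sketch shows the pattern with the conversion written out
> explicitly via `String(_:)`, rather than relying on the implicit conversion
> proposed here. `Header` and `parseHeader` are hypothetical names used only
> for illustration.
>
> ```swift
> struct Header {
>   var name: String  // long-lived storage owns its own buffer
> }
>
> func parseHeader(_ line: String) -> Header {
>   let end = line.firstIndex(of: ":") ?? line.endIndex
>   let slice = line[..<end]            // a Substring sharing storage with `line`
>   return Header(name: String(slice))  // copies only the characters we keep
> }
>
> let header = parseHeader("Content-Type: text/plain")
> ```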
>
> A `String` that was not its own `Substring` could be one word—a single 
> tagged
> pointer—without requiring additional allocations. `Substring`s would be a 
> view
> onto a `String`, so are 3 words - pointer to owner, pointer to start, and 
> a
> length. The small string optimization for `Substring` would take advantage 
> of
> the larger size, probably with a less compressed encoding for speed.
>
> The downside of having two types is the inconvenience of sometimes having 
> a
> `Substring` when you need a `String`, and vice-versa. It is likely this 
> would
> be a significantly bigger problem than with `Array` and `ArraySlice`, as
> slicing of `String` is such a common operation. It is especially relevant 
> to
> existing code that assumes `String` is the currency type. To ease the pain 
> of
> type mismatches, `Substring` should be a subtype of `String` in the same 
> way
> that `Int` is a subtype of `Optional<Int>`. This would give users an 
> implicit
> conversion from `Substring` to `String`, as well as the usual implicit
> conversions such as `[Substring]` to `[String]` that other subtype
> relationships receive.
>
> In most cases, type inference combined with the subtype relationship 
> should
> make the type difference a non-issue and users will not care which type 
> they
> are using. For flexibility and optimizability, most operations from the
> standard library will traffic in generic models of
> [`Unicode`](#the--code-unicode--code--protocol).
>
> ##### Guidance for API Designers
>
> In this model, **if a user is unsure about which type to use, `String` is 
> always
> a reasonable default**. A `Substring` passed where `String` is expected 
> will be
> implicitly copied. When compared to the “same type, copied storage” model, 
> we
> have effectively deferred the cost of copying from the point where a 
> substring
> is created until it must be converted to `String` for use with an API.
>
> A user who needs to optimize away copies altogether should use this 
> guideline:
> if for performance reasons you are tempted to add a `Range` argument to 
> your
> method as well as a `String` to avoid unnecessary copies, you should 
> instead
> use `Substring`.
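>
> As a sketch of that guideline, an API that only inspects its argument can
> take a `Substring` directly instead of a `String` plus a `Range`. The
> function name `vowelCount(in:)` is an assumption for this example.
>
> ```swift
> /// Hypothetical API that only looks at its argument.
> func vowelCount(in text: Substring) -> Int {
>   return text.filter { "aeiouAEIOU".contains($0) }.count
> }
>
> let sentence = "Swift strings are collections of characters"
> let firstWord = sentence.prefix(while: { $0 != " " })  // Substring, no copy
> let inFirstWord = vowelCount(in: firstWord)
> let inAll = vowelCount(in: sentence[...])              // whole string as a slice
> ```
>
> The caller decides which part of the string to pass, so the function needs
> no range parameter of its own.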
>
> ##### The “Empty Subscript”
>
> To make it easy to call such an optimized API when you only have a 
> `String` (or
> to call any API that takes a `Collection`'s `SubSequence` when all you 
> have is
> the `Collection`), we propose the following “empty subscript” operation,
>
> ```swift
> extension Collection {
>   subscript() -> SubSequence { 
>     return self[startIndex..<endIndex] 
>   }
> }
> ```
>
> which allows the following usage:
>
> ```swift
> funcThatIsJustLooking(at: person.name[]) // pass person.name as Substring
> ```
>
> The `[]` syntax can be offered as a fixit when needed, similar to `&` for an
> `inout` argument. While it doesn't help a user to convert `[String]` to
> `[Substring]`, the need for such conversions is extremely rare, and they can be
> done with a simple `map` (which could also be offered by a fixit):
>
> ```swift
> takesAnArrayOfSubstring(arrayOfString.map { $0[] })
> ```
>
> #### Other Options Considered
>
> As we have seen, all three options above have downsides, but it's possible
> these downsides could be eliminated/mitigated by the compiler. We are proposing
> one such mitigation—implicit conversion—as part of the "different type,
> shared storage" option, to help avoid the cognitive load on developers of
> having to deal with a separate `Substring` type.
>
> To avoid the memory leak issues of a "same type, shared storage" substring
> option, we considered whether the compiler could perform an implicit copy 
> of
> the underlying storage when it detects the string is being "stored" for 
> long
> term usage, say when it is assigned to a stored property. The trouble with 
> this
> approach is it is very difficult for the compiler to distinguish between
> long-term storage versus short-term in the case of abstractions that rely 
> on
> stored properties. For example, should the storing of a substring inside 
> an
> `Optional` be considered long-term? Or the storing of multiple substrings
> inside an array? The latter would not work well in the case of a
> `components(separatedBy:)` implementation that intended to return an array 
> of
> substrings. It would also be difficult to distinguish intentional 
> medium-term
> storage of substrings, say by a lexer. There does not appear to be an 
> effective
> consistent rule that could be applied in the general case for detecting 
> when a
> substring is truly being stored long-term.
>
> To avoid the cost of copying substrings under "same type, copied storage", the
> optimizer could be enhanced to reduce the impact of some of those copies.
> For example, this code could be optimized to pull the invariant substring 
> out
> of the loop:
>
> ```swift
> for _ in 0..<lots { 
>   someFunc(takingString: bigString[bigRange]) 
> }
> ```
>
> It's worth noting that a similar optimization is needed to avoid an 
> equivalent
> problem with implicit conversion in the "different type, shared storage" 
> case:
>
> ```swift
> let substring = bigString[bigRange]
> for _ in 0..<lots { someFunc(takingString: substring) }
> ```
>
> However, in the case of "same type, copied storage" there are many use 
> cases
> that cannot be optimized as easily. Consider the following simple 
> definition of
> a recursive `contains` algorithm, which when substring slicing is linear 
> makes
> the overall algorithm quadratic:
>
> ```swift
> extension String {
>     func containsChar(_ x: Character) -> Bool {
>         return !isEmpty && (first == x || dropFirst().containsChar(x))
>     }
> }
> ```
>
> For the optimizer to eliminate this problem is unrealistic, forcing the 
> user to
> remember to optimize the code to not use string slicing if they want it to 
> be
> efficient (assuming they remember):
>
> ```swift
> extension String {
>     // add optional argument tracking progress through the string
>     func containsCharacter(_ x: Character, atOrAfter idx: Index? = nil) -> Bool {
>         let idx = idx ?? startIndex
>         return idx != endIndex
>             && (self[idx] == x || containsCharacter(x, atOrAfter: index(after: idx)))
>     }
> }
> ```
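>
> For comparison, here is a sketch of the same naive recursion under the
> "different type, shared storage" model. It assumes a `Substring` type whose
> own slices are again `Substring`s over the same storage, so `dropFirst()`
> does not copy the remaining characters.
>
> ```swift
> extension Substring {
>   func containsChar(_ x: Character) -> Bool {
>     // Each recursive step slices the shared storage; no characters are copied.
>     return !isEmpty && (first == x || dropFirst().containsChar(x))
>   }
> }
>
> extension String {
>   func containsChar(_ x: Character) -> Bool {
>     return self[...].containsChar(x)  // view the whole string as a slice
>   }
> }
>
> let found = "hello".containsChar("l")
> let missing = "hello".containsChar("z")
> ```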
>
> #### Substrings, Ranges and Objective-C Interop
>
> The pattern of passing a string/range pair is common in several 
> Objective-C
> APIs, and is made especially awkward in Swift by the 
> non-interchangeability of
> `Range<String.Index>` and `NSRange`. 
>
> ```swift
> s2.find(s2, sourceRange: NSRange(j..<s2.endIndex, in: s2))
> ```
>
> In general, however, the Swift idiom for operating on a sub-range of a
> `Collection` is to *slice* the collection and operate on that:
>
> ```swift
> s2.find(s2[j..<s2.endIndex])
> ```
>
> Therefore, APIs that operate on an `NSString`/`NSRange` pair should be 
> imported
> without the `NSRange` argument.  The Objective-C importer should be 
> changed to
> give these APIs special treatment so that when a `Substring` is passed, 
> instead
> of being converted to a `String`, the full `NSString` and range are passed 
> to
> the Objective-C method, thereby avoiding a copy.
>
> As a result, you would never need to pass an `NSRange` to these APIs, 
> which
> solves the impedance problem by eliminating the argument, resulting in 
> more
> idiomatic Swift code while retaining the performance benefit.  To help 
> users
> manually handle any cases that remain, Foundation should be augmented to 
> allow
> the following syntax for converting to and from `NSRange`:
>
> ```swift
> let nsr = NSRange(i..<j, in: s) // An NSRange corresponding to s[i..<j]
> let iToJ = Range(nsr, in: s)    // Equivalent to i..<j
> ```
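>
> A small round-trip sketch, assuming a Foundation that provides initializers
> with these spellings. In the toolchains where they exist, `Range(_:in:)` is
> failable, which the example accounts for.
>
> ```swift
> import Foundation
>
> let s = "na\u{EF}ve caf\u{E9}"            // "naïve café", precomposed
> let i = s.firstIndex(of: "c")!
> let nsr = NSRange(i..<s.endIndex, in: s)  // measured in UTF-16 code units
> let back = Range(nsr, in: s)              // optional: nil if nsr does not fit s
> let word = back.map { s[$0] }
> ```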
>
> ### The `Unicode` protocol
>
> With `Substring` and `String` being distinct types and sharing almost all
> interface and semantics, and with the highest-performance string 
> processing
> requiring knowledge of encoding and layout that the currency types can't
> provide, it becomes important to capture the common “string API” in a 
> protocol.
> Since Unicode conformance is a key feature of string processing in Swift, we
> call that protocol `Unicode`:
>
> **Note:** The following assumes several features that are planned but not 
> yet implemented in
>   Swift, and should be considered a sketch rather than a final design.
>
> ```swift
> protocol Unicode 
>   : Comparable, BidirectionalCollection where Element == Character {
>
>   associatedtype Encoding : UnicodeEncoding
>   var encoding: Encoding { get }
>
>   associatedtype CodeUnits 
>     : RandomAccessCollection where Element == Encoding.CodeUnit
>   var codeUnits: CodeUnits { get }
>
>   associatedtype UnicodeScalars 
>     : BidirectionalCollection  where Element == UnicodeScalar
>   var unicodeScalars: UnicodeScalars { get }
>
>   associatedtype ExtendedASCII 
>     : BidirectionalCollection where Element == UInt32
>   var extendedASCII: ExtendedASCII { get }
> }
>
> extension Unicode {
>   // ... define high-level non-mutating string operations, e.g. search ...
>
>   func compared<Other: Unicode>(
>     to rhs: Other,
>     case caseSensitivity: StringSensitivity? = nil,
>     diacritic diacriticSensitivity: StringSensitivity? = nil,
>     width widthSensitivity: StringSensitivity? = nil,
>     in locale: Locale? = nil
>   ) -> SortOrder { ... }
> }
>
> extension Unicode : RangeReplaceableCollection where CodeUnits :
>   RangeReplaceableCollection {
>     // Satisfy protocol requirement
>     mutating func replaceSubrange<C : Collection>(_: Range<Index>, with: C)
>       where C.Element == Element
>
>   // ... define high-level mutating string operations, e.g. replace ...
> }
>
> ```
>
> The goal is that `Unicode` exposes the underlying encoding and code units 
> in
> such a way that for types with a known representation (e.g. a 
> high-performance
> `UTF8String`) that information can be known at compile-time and can be 
> used to
> generate a single path, while still allowing types like `String` that 
> admit
> multiple representations to use runtime queries and branches to fast path
> specializations.
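>
> To illustrate the compile-time side of that goal, here is a heavily reduced
> sketch. `CodeUnitStorage`, `UTF8Text` and `UTF16Text` are hypothetical
> stand-ins invented for this example; they are not the proposed `Unicode`
> protocol or its models.
>
> ```swift
> /// Hypothetical, much smaller stand-in for the proposed protocol.
> protocol CodeUnitStorage {
>   associatedtype CodeUnits: RandomAccessCollection
>     where CodeUnits.Element: FixedWidthInteger
>   var codeUnits: CodeUnits { get }
> }
>
> struct UTF8Text: CodeUnitStorage { var codeUnits: [UInt8] }
> struct UTF16Text: CodeUnitStorage { var codeUnits: [UInt16] }
>
> extension CodeUnitStorage {
>   /// Written once; the code unit type is known statically for each model.
>   var isASCII: Bool { return codeUnits.allSatisfy { $0 < 0x80 } }
> }
>
> let plain = UTF8Text(codeUnits: Array("plain".utf8))
> let accented = UTF16Text(codeUnits: Array("caf\u{E9}".utf16))
> ```
>
> Because each model fixes its code unit type, generic code over the protocol
> needs no runtime branch on the representation.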
>
> **Note:** `Unicode` would make a fantastic namespace for much of
> what's in this proposal if we could get the ability to nest types and
> protocols in protocols.
>
> ### Scanning, Matching, and Tokenization
>
> #### Low-Level Textual Analysis
>
> We should provide convenient APIs for processing strings by character.  For example,
> it should be easy to cleanly express, “if this string starts with `"f"`, 
> process
> the rest of the string as follows…”  Swift is well-suited to expressing 
> this
> common pattern beautifully, but we need to add the APIs.  Here are two 
> examples
> of the sort of code that might be possible given such APIs:
>
> ```swift
> if let firstLetter = input.droppingPrefix(alphabeticCharacter) {
>   somethingWith(input) // process the rest of input
> }
>
> if let (number, restOfInput) = input.parsingPrefix(Int.self) {
>    ...
> }
> ```
>
> The specific spelling and functionality of APIs like this are TBD.  The 
> larger
> point is to make sure matching-and-consuming jobs are well-supported.
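>
> One way such matching-and-consuming helpers could be approximated with
> existing APIs is sketched below. The method names and signatures are
> assumptions for illustration and differ from the strawman spellings above.
>
> ```swift
> extension Substring {
>   /// Hypothetical: the rest of the text if it starts with `prefix`, else nil.
>   func droppingPrefix(_ prefix: String) -> Substring? {
>     guard hasPrefix(prefix) else { return nil }
>     return dropFirst(prefix.count)
>   }
>
>   /// Hypothetical: parses leading ASCII digits as an `Int` and returns the
>   /// value together with the unconsumed remainder.
>   func parsingIntPrefix() -> (value: Int, rest: Substring)? {
>     let digits = prefix(while: { $0.isASCII && $0.isNumber })
>     guard let value = Int(digits) else { return nil }
>     return (value, self[digits.endIndex...])
>   }
> }
>
> let input: Substring = "f42 apples"
> let afterF = input.droppingPrefix("f")
> let parsed = afterF?.parsingIntPrefix()
> ```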
>
> #### Unified Pattern Matcher Protocol
>
> Many of the current methods that do matching are overloaded to do the same
> logical operations in different ways, with the following axes:
>
> - Logical Operation: `find`, `split`, `replace`, match at start
> - Kind of pattern: `CharacterSet`, `String`, a regex, a closure
> - Options, e.g. case/diacritic sensitivity, locale.  Sometimes a part of
>   the method name, and sometimes an argument
> - Whole string or subrange.
>
> We should represent these aspects as orthogonal, composable components,
> abstracting pattern matchers into a protocol like
> [this one](https://github.com/apple/swift/blob/master/test/Prototypes/PatternMatching.swift#L33)
> that allows us to define logical operations once, without introducing
> overloads, massively reducing API surface area.
>
> For example, using the strawman prefix `%` syntax to turn string literals 
> into
> patterns, the following pairs would all invoke the same generic methods:
>
> ```swift
> if let found = s.firstMatch(%"searchString") { ... }
> if let found = s.firstMatch(someRegex) { ... }
>
> for m in s.allMatches((%"searchString"), case: .insensitive) { ... }
> for m in s.allMatches(someRegex) { ... }
>
> let items = s.split(separatedBy: ", ")
> let tokens = s.split(separatedBy: CharacterSet.whitespace)
> ```
>
> Note that, because Swift requires the indices of a slice to match the 
> indices of
> the range from which it was sliced, operations like `firstMatch` can 
> return a
> `Substring?` in lieu of a `Range<String.Index>?`: the indices of the match 
> in
> the string being searched, if needed, can easily be recovered as the
> `startIndex` and `endIndex` of the `Substring`.
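>
> A minimal sketch of that point, assuming Foundation's `range(of:)` as the
> search primitive. The `firstMatch` below is a hypothetical stand-in that
> accepts only a plain `String` pattern.
>
> ```swift
> import Foundation
>
> extension String {
>   /// Hypothetical `firstMatch`, built here on Foundation's `range(of:)`.
>   func firstMatch(_ pattern: String) -> Substring? {
>     return range(of: pattern).map { self[$0] }
>   }
> }
>
> let s = "key = value"
> let match = s.firstMatch("value")
> // The slice's indices are valid positions in `s` itself.
> let offset = match.map { s.distance(from: s.startIndex, to: $0.startIndex) }
> ```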
>
> Note also that matching operations are useful for collections in general, 
> and
> would fall out of this proposal:
>
> ```swift
> // replace subsequences of contiguous NaNs with zero
> forces.replace(oneOrMore([Float.nan]), [0.0])
> ```
>
> #### Regular Expressions
>
> Addressing regular expressions is out of scope for this proposal.
> That said, it is important to note that the pattern matching protocol mentioned
> above provides a suitable foundation for regular expressions, and types such as
> `NSRegularExpression` can easily be retrofitted to conform to it.  In the
> future, support for regular expression literals in the compiler could 
> allow for
> compile-time syntax checking and optimization.
>
> ### String Indices
>
> `String` currently has four views—`characters`, `unicodeScalars`, `utf8`, 
> and
> `utf16`—each with its own opaque index type.  The APIs used to translate 
> indices
> between views add needless complexity, and the opacity of indices makes 
> them
> difficult to serialize.
>
> The index translation problem has two aspects:
>
>   1. `String` views cannot consume one another's indices without a cumbersome
>     conversion step.  An index into a `String`'s `characters` must be 
> translated
>     before it can be used as a position in its `unicodeScalars`.  Although 
> these
>     translations are rarely needed, they add conceptual and API 
> complexity.
>   2. Many APIs in the core libraries and other frameworks still expose 
> `String`
>     positions as `Int`s and regions as `NSRange`s, which can only 
> reference a
>     `utf16` view and interoperate poorly with `String` itself.
>
> #### Index Interchange Among Views
>
> String's need for flexible backing storage and reasonably-efficient 
> indexing
> (i.e. without dynamically allocating and reference-counting the indices
> themselves) means indices need an efficient underlying storage type. Although
> we do not wish to expose `String`'s indices *as* integers, `Int` offsets into
> underlying code unit storage make a good underlying storage type, provided
> `String`'s underlying storage supports random-access.  We think 
> random-access
> *code-unit storage* is a reasonable requirement to impose on all `String`
> instances.
>
> Making these `Int` code unit offsets conveniently accessible and 
> constructible
> solves the serialization problem:
>
> ```swift
> clipboard.write(s.endIndex.codeUnitOffset)
> let offset = clipboard.read(Int.self)
> let i = String.Index(codeUnitOffset: offset)
> ```
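>
> For a runnable approximation, the sketch below uses the UTF-16 offset
> accessors found in later toolchains (`utf16Offset(in:)` and
> `String.Index(utf16Offset:in:)`) as a stand-in for the proposed
> `codeUnitOffset`. Both the substitution and the variable names are
> assumptions for this example.
>
> ```swift
> let s = "h\u{E9}llo"                   // "héllo", precomposed
> let i = s.firstIndex(of: "l")!
>
> let offset = i.utf16Offset(in: s)      // a plain Int that can be serialized
> let restored = String.Index(utf16Offset: offset, in: s)
> ```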
>
> Index interchange between `String` and its `unicodeScalars`, `codeUnits`,
> and [`extendedASCII`](#parsing-ascii-structure) views can be made entirely
> seamless by having them share an index type (semantics of indexing a 
> `String`
> between grapheme cluster boundaries are TBD—it can either trap or be 
> forgiving).
> Having a common index allows easy traversal into the interior of 
> graphemes,
> something that is often needed, without making it likely that someone will 
> do it
> by accident.
>
>  - `String.index(after:)` should advance to the next grapheme, even when 
> the
>    index points partway through a grapheme.
>
>  - `String.index(before:)` should move to the start of the grapheme before
>    the current position.
>
> Seamless index interchange between `String` and its UTF-8 or UTF-16 views 
> is not
> crucial, as the specifics of encoding should not be a concern for most use
> cases, and would impose needless costs on the indices of other views. That
> said, we can make translation much more straightforward by exposing simple
> bidirectional converting `init`s on both index types:
>
> ```swift
> let u8Position = String.UTF8.Index(someStringIndex)
> let originalPosition = String.Index(u8Position)
> ```
>
> #### Index Interchange with Cocoa
>
> We intend to address `NSRange`s that denote substrings in Cocoa APIs as
> described [earlier in this document](#substrings--ranges-and-objective-c-interop).
> That leaves the interchange of bare indices with Cocoa APIs trafficking in
> `Int`.  Hopefully such APIs will be rare, but when needed, the following
> extension, which would be useful for all `Collection`s, can help:
>
> ```swift
> extension Collection {
>   func index(offset: IndexDistance) -> Index {
>     return index(startIndex, offsetBy: offset)
>   }
>   func offset(of i: Index) -> IndexDistance {
>     return distance(from: startIndex, to: i)
>   }
> }
> ```
>
> Then integers can easily be translated into offsets into a `String`'s 
> `utf16`
> view for consumption by Cocoa:
>
> ```swift
> let cocoaIndex = s.utf16.offset(of: String.UTF16Index(i))
> let swiftIndex = s.utf16.index(offset: cocoaIndex)
> ```
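>
> A self-contained variant of the round trip is sketched here. It assumes a
> toolchain where collection distances are plain `Int` (used in place of
> `IndexDistance`) and where `String` and its `utf16` view share an index type.
>
> ```swift
> extension Collection {
>   func index(offset: Int) -> Index {
>     return index(startIndex, offsetBy: offset)
>   }
>   func offset(of i: Index) -> Int {
>     return distance(from: startIndex, to: i)
>   }
> }
>
> let s = "a\u{1F600}b"                               // the emoji is two UTF-16 units
> let i = s.firstIndex(of: "b")!
> let cocoaIndex = s.utf16.offset(of: i)              // Int suitable for Cocoa
> let swiftIndex = s.utf16.index(offset: cocoaIndex)  // back to a Swift index
> ```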
>
> ### Formatting
>
> A full treatment of formatting is out of scope of this proposal, but
> we believe it's crucial for completing the text processing picture.  This
> section details some of the existing issues and thinking that may guide 
> future
> development.
>
> #### Printf-Style Formatting
>
> `String.format` is designed on the `printf` model: it takes a format 
> string with
> textual placeholders for substitution, and an arbitrary list of other 
> arguments.
> The syntax and meaning of these placeholders has a long history in
> C, but for anyone who doesn't use them regularly they are cryptic and 
> complex,
> as the `printf (3)` man page attests.
>
> Aside from complexity, this style of API has two major problems: First, 
> the
> spelling of these placeholders must match up to the types of the 
> arguments, in
> the right order, or the behavior is undefined.  Some limited support for
> compile-time checking of this correspondence could be implemented, but 
> only for
> the cases where the format string is a literal. Second, there's no 
> reasonable
> way to extend the formatting vocabulary to cover the needs of new types: 
> you are
> stuck with what's in the box.
>
> #### Foundation Formatters
>
> The formatters supplied by Foundation are highly capable and versatile, 
> offering
> both formatting and parsing services.  When used for formatting, though, 
> the
> design pattern demands more from users than it should:
>
>   * Matching the type of data being formatted to a formatter type
>   * Creating an instance of that type
>   * Setting stateful options (`currency`, `dateStyle`) on the type.  Note: 
> the
>     need for this step prevents the instance from being used and discarded 
> in
>     the same expression where it is created.
>   * Overall, introduction of needless verbosity into source
>
> These may seem like small issues, but the experience of Apple localization
> experts is that the total drag of these factors on programmers is such 
> that they
> tend to reach for `String.format` instead.
>
> #### String Interpolation
>
> Swift string interpolation provides a user-friendly alternative to printf's
> domain-specific language (just write ordinary Swift code!) and its type safety
> problems (put the data right where it belongs!) but the following issues prevent
> it from being useful for localized formatting (among other jobs):
>
>   * [SR-2303](https://bugs.swift.org/browse/SR-2303) We are unable to 
> restrict
>     types used in string interpolation.
>   * [SR-1260](https://bugs.swift.org/browse/SR-1260) String interpolation 
> can't
>     distinguish (fragments of) the base string from the string 
> substitutions.
>
> In the long run, we should improve Swift string interpolation to the point 
> where
> it can participate in most any formatting job.  Mostly this centers around
> fixing the interpolation protocols per the previous item, and supporting
> localization.
>
> To be able to use formatting effectively inside interpolations, it needs 
> to be
> both lightweight (because it all happens in-situ) and discoverable.  One 
> approach would be to standardize on `format` methods, e.g.:
>
> ```swift
> "Column 1: \(n.format(radix:16, width:8)) *** \(message)"
>
> "Something with leading zeroes: \(x.format(fill: zero, width:8))"
> ```
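>
> A minimal sketch of what such a `format` method could look like for
> integers. The parameter names and defaults are assumptions, and the padding
> logic is deliberately naive (for instance, it pads to the left of a minus
> sign).
>
> ```swift
> extension BinaryInteger {
>   /// Hypothetical lightweight formatter for use inside interpolations.
>   func format(radix: Int = 10, width: Int = 0, fill: Character = " ") -> String {
>     let digits = String(self, radix: radix)
>     let padding = Swift.max(0, width - digits.count)
>     return String(repeating: fill, count: padding) + digits
>   }
> }
>
> let n = 255
> let zero: Character = "0"
> let hexColumn = "Column 1: \(n.format(radix: 16, width: 8)) ***"
> let zeroFilled = "\(n.format(width: 8, fill: zero))"
> ```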
>
> ### C String Interop
>
> Our support for interoperation with nul-terminated C strings is scattered and
> incoherent, with six ways to transform a C string into a `String` and four ways to
> do the inverse.  These APIs should be replaced with the following:
>
> ```swift
> extension String {
>   /// Constructs a `String` having the same contents as `nulTerminatedUTF8`.
>   ///
>   /// - Parameter nulTerminatedUTF8: a sequence of contiguous UTF-8 encoded
>   ///   bytes ending just before the first zero byte (NUL character).
>   init(cString nulTerminatedUTF8: UnsafePointer<CChar>)
>
>   /// Constructs a `String` having the same contents as
>   /// `nulTerminatedCodeUnits`.
>   ///
>   /// - Parameter nulTerminatedCodeUnits: a sequence of contiguous code units in
>   ///   the given `encoding`, ending just before the first zero code unit.
>   /// - Parameter encoding: describes the encoding in which the code units
>   ///   should be interpreted.
>   init<Encoding: UnicodeEncoding>(
>     cString nulTerminatedCodeUnits: UnsafePointer<Encoding.CodeUnit>,
>     encoding: Encoding)
>
>   /// Invokes the given closure on the contents of the string, represented as a
>   /// pointer to a nul-terminated sequence of UTF-8 code units.
>   func withCString<Result>(
>     _ body: (UnsafePointer<CChar>) throws -> Result) rethrows -> Result
> }
> ```
>
> In both of the construction APIs, any invalid encoding sequence detected 
> will
> have its longest valid prefix replaced by U+FFFD, the Unicode replacement
> character, per Unicode specification.  This covers the common case.  The
> replacement is done *physically* in the underlying storage and the 
> validity of
> the result is recorded in the `String`'s `encoding` such that future 
> accesses
> need not be slowed down by possible error repair separately.
>
> Construction that is aborted when encoding errors are detected can be
> accomplished using APIs on the `encoding`.  String types that retain their
> physical encoding even in the presence of errors and are repaired 
> on-the-fly can
> be built as different instances of the `Unicode` protocol.
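>
> The sketch below exercises the UTF-8 entry points using APIs that exist in
> current toolchains (`String(cString:)`, `String(decoding:as:)` and
> `withCString`). It is meant only to illustrate the repair behavior described
> above, under the assumption that decoding substitutes U+FFFD for ill-formed
> input.
>
> ```swift
> // "Hi" followed by the terminating NUL.
> let bytes: [CChar] = [72, 105, 0]
> let fromC = bytes.withUnsafeBufferPointer { String(cString: $0.baseAddress!) }
>
> // 0xFF can never appear in well-formed UTF-8, so it is repaired on decode.
> let repaired = String(decoding: [0x48, 0xFF, 0x69] as [UInt8], as: UTF8.self)
>
> // Borrow a nul-terminated UTF-8 buffer for the duration of the closure.
> let length = "Hi".withCString { pointer -> Int in
>   var count = 0
>   while pointer[count] != 0 { count += 1 }
>   return count
> }
> ```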
>
> ### Unicode 9 Conformance
>
> Unicode 9 (and MacOS 10.11) brought us support for family emoji, which 
> changes
> the process of properly identifying `Character` boundaries.  We need to 
> update
> `String` to account for this change.
>
> ### High-Performance String Processing
>
> Many strings are short enough to store in 64 bits, many can be stored 
> using only
> 8 bits per unicode scalar, others are best encoded in UTF-16, and some 
> come to
> us already in some other encoding, such as UTF-8, that would be costly to
> translate.  Supporting these formats while maintaining usability for
> general-purpose APIs demands that a single `String` type can be backed by 
> many
> different representations.
>
> That said, the highest performance code always requires static knowledge 
> of the
> data structures on which it operates, and for this code, dynamic selection 
> of
> representation comes at too high a cost.  Heavy-duty text processing 
> demands a
> way to opt out of dynamism and directly use known encodings.  Having this
> ability can also make it easy to cleanly specialize code that handles 
> dynamic
> cases for maximal efficiency on the most common representations.
>
> To address this need, we can build models of the `Unicode` protocol that 
> encode
> representation information into the type, such as 
> `NFCNormalizedUTF16String`.
>
> ### Parsing ASCII Structure
>
> Although many machine-readable formats support the inclusion of arbitrary
> Unicode text, it is also common that their fundamental structure lies 
> entirely
> within the ASCII subset (JSON, YAML, many XML formats).  These formats are 
> often
> processed most efficiently by recognizing ASCII structural elements as 
> ASCII,
> and capturing the arbitrary sections between them in more-general strings. 
>  The
> current String API offers no way to efficiently recognize ASCII and skip 
> past
> everything else without the overhead of full decoding into unicode 
> scalars.
>
> For these purposes, strings should supply an `extendedASCII` view that is 
> a
> collection of `UInt32`, where values less than `0x80` represent the
> corresponding ASCII character, and other values represent data that is 
> specific
> to the underlying encoding of the string.
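>
> Until such a view exists, the idea can be approximated on UTF-8 storage,
> where every code unit below `0x80` is an ASCII character. The sketch below
> splits on an ASCII comma by scanning the existing `utf8` view, assuming a
> toolchain in which `String` and its `utf8` view share an index type. It is a
> stand-in for `extendedASCII`, not the proposed API.
>
> ```swift
> let line = "caf\u{E9},na\u{EF}ve,z"
> let comma = UInt8(ascii: ",")
>
> var fields: [Substring] = []
> var fieldStart = line.startIndex
> // Only ASCII structure is recognized; non-ASCII bytes are skipped undecoded.
> for i in line.utf8.indices where line.utf8[i] == comma {
>   fields.append(line[fieldStart..<i])
>   fieldStart = line.utf8.index(after: i)
> }
> fields.append(line[fieldStart...])
> ```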
>
> ## Language Support
>
> This proposal depends on two new features in the Swift language:
>
> 1. **Generic subscripts**, to
>    enable unified slicing syntax.
>
> 2. **A subtype relationship** between
>    `Substring` and `String`, enabling framework APIs to traffic solely in
>    `String` while still making it possible to avoid copies by handling
>    `Substring`s where necessary.
>
> Additionally, **the ability to nest types and protocols inside
> protocols** could significantly shrink the footprint of this proposal
> on the top-level Swift namespace.
>
> ## Open Questions
>
> ### Must `String` be limited to storing UTF-16 subset encodings?
>
> - The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is 
> not in
>   question here; this is about what encodings must be storable, without
>   transcoding, in the common currency type called “`String`”.
> - ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets.  UTF-8 is not.
> - If we have a way to get at a `String`'s code units, we need a concrete type in
>   which to express them in the API of `String`, which is a concrete type.
> - If String needs to be able to represent UTF-32, presumably the code 
> units need
>   to be `UInt32`.
> - Not supporting UTF-32-encoded text seems like one reasonable design 
> choice.
> - Maybe we can allow UTF-8 storage in `String` and expose its code units 
> as
>   `UInt16`, just as we would for Latin-1.
> - Supporting only UTF-16-subset encodings would imply that `String` 
> indices can
>   be serialized without recording the `String`'s underlying encoding.
>
> ### Do we need a type-erasable base protocol for UnicodeEncoding?
>
> UnicodeEncoding has an associated type, but it may be important to be able 
> to
> traffic in completely dynamic encoding values, e.g. for “tell me the most
> efficient encoding for this string.”
>
> ### Should there be a string “facade?”
>
> One possible design alternative makes `Unicode` a vehicle for expressing
> the storage and encoding of code units, but does not attempt to give it an API
> appropriate for `String`.  Instead, string APIs would be provided by a generic
> wrapper around an instance of `Unicode`:
>
> ```swift
> struct StringFacade<U: Unicode> : BidirectionalCollection {
>
>   // ...APIs for high-level string processing here...
>
>   var unicode: U // access to lower-level unicode details
> }
>
> typealias String = StringFacade<StringStorage>
> typealias Substring = StringFacade<StringStorage.SubSequence>
> ```
>
> This design would allow us to de-emphasize lower-level `String` APIs such as
> access to the specific encoding, by putting them behind a `.unicode` property.
> A similar effect in a facade-less design would require a new top-level
> `StringProtocol` playing the role of the facade, with an
> `associatedtype Storage : Unicode`.
>
> An interesting variation on this design is possible if defaulted generic
> parameters are introduced to the language:
>
> ```swift
> struct String<U: Unicode = StringStorage> 
>   : BidirectionalCollection {
>
>   // ...APIs for high-level string processing here...
>
>   var unicode: U // access to lower-level unicode details
> }
>
> typealias Substring = String<StringStorage.SubSequence>
> ```
>
> One advantage of such a design is that naïve users will always extend “the right
> type” (`String`) without thinking, and the new APIs will show up on `Substring`,
> `MyUTF8String`, etc.  That said, it also has downsides that should not be
> overlooked, not least of which is the confusability of the meaning of the word
> “string.”  Is it referring to the generic or the concrete type?
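>
> The “extend the right type” effect can be seen in a reduced sketch that
> compiles today.  `UnicodeSketch`, the two storage types, and
> `StringFacadeSketch` are hypothetical stand-ins, much simpler than the
> real `Unicode` protocol would be:
>
> ```swift
> // Hypothetical, reduced stand-in for the `Unicode` protocol.
> protocol UnicodeSketch {
>   var scalars: [Unicode.Scalar] { get }
> }
>
> struct UTF8StorageSketch: UnicodeSketch {
>   var bytes: [UInt8]
>   var scalars: [Unicode.Scalar] {
>     return Array(String(decoding: bytes, as: UTF8.self).unicodeScalars)
>   }
> }
>
> struct UTF16StorageSketch: UnicodeSketch {
>   var units: [UInt16]
>   var scalars: [Unicode.Scalar] {
>     return Array(String(decoding: units, as: UTF16.self).unicodeScalars)
>   }
> }
>
> struct StringFacadeSketch<U: UnicodeSketch> {
>   var unicode: U  // access to lower-level details
> }
>
> // A user extends the one generic type...
> extension StringFacadeSketch {
>   var scalarCount: Int { return unicode.scalars.count }
> }
>
> // ...and the API appears on every instantiation.
> let viaUTF8 = StringFacadeSketch(
>   unicode: UTF8StorageSketch(bytes: Array("h\u{E9}llo".utf8)))
> let viaUTF16 = StringFacadeSketch(
>   unicode: UTF16StorageSketch(units: Array("h\u{E9}llo".utf16)))
> ```
>
> Both `viaUTF8.scalarCount` and `viaUTF16.scalarCount` are available
> from the single extension, regardless of storage.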
>
> ### `TextOutputStream` and `TextOutputStreamable`
>
> `TextOutputStreamable` is intended to provide a vehicle for
> efficiently transporting formatted representations to an output stream
> without forcing the allocation of storage.  Its use of `String`, a
> type with multiple representations, at the lowest-level unit of
> communication, conflicts with this goal.  It might be sufficient to
> change `TextOutputStream` and `TextOutputStreamable` to traffic in an
> associated type conforming to `Unicode`, but that is not yet clear.
> This area will require some design work.
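>
> For reference, a small sketch against the existing protocols shows
> where `String` enters the picture.  `Point` and `ChunkRecorder` are
> hypothetical types for illustration; the two protocols and their
> `write` requirements are the ones in today's standard library:
>
> ```swift
> // Streams its pieces one at a time instead of building one big String.
> struct Point: TextOutputStreamable {
>   var x: Int
>   var y: Int
>
>   func write<Target: TextOutputStream>(to target: inout Target) {
>     target.write("(")
>     target.write(String(x))
>     target.write(", ")
>     target.write(String(y))
>     target.write(")")
>   }
> }
>
> // A stream that records how many String chunks it was handed.
> struct ChunkRecorder: TextOutputStream {
>   var chunkCount = 0
>   var text = ""
>
>   mutating func write(_ string: String) {
>     chunkCount += 1
>     text += string
>   }
> }
>
> var recorder = ChunkRecorder()
> Point(x: 1, y: 2).write(to: &recorder)
> ```
>
> Every fragment, however small, crosses the stream boundary as a
> `String` value, which is the lowest-level unit of communication the
> paragraph above refers to.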
>
> ### `description` and `debugDescription`
>
> * Should these be creating localized or non-localized representations?
> * Is returning a `String` efficient enough?
> * Is `debugDescription` pulling the weight of the API surface area it adds?
>
> ### `StaticString`
>
> `StaticString` was added as a byproduct of standard library development and kept
> around because it seemed useful, but it was never truly *designed* for client
> programmers.  We need to decide what happens with it.  Presumably *something*
> should fill its role, and that should conform to `Unicode`.
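>
> As a reminder of the role in question, here is a brief sketch of what
> the existing `StaticString` offers today (the variable names are
> illustrative):
>
> ```swift
> // The literal's storage is baked into the binary; no heap allocation.
> let greeting: StaticString = "hello"
>
> let byteCount = greeting.utf8CodeUnitCount
> let asciiOnly = greeting.isASCII
>
> // Direct, copy-free access to the UTF-8 code units.
> let bytes = greeting.withUTF8Buffer { buffer in
>   return Array(buffer)
> }
> ```
>
> Whatever replaces it would presumably need to keep this kind of
> allocation-free access to the code units.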
>
> ## Footnotes
>
> <b id="f0">0</b> The integers rewrite currently underway is expected to
>     substantially reduce the scope of `Int`'s API by using more
>     generics. [↩](#a0)
>
> <b id="f1">1</b> In practice, these semantics will usually be tied to the
> version of the installed [ICU](http://icu-project.org) library, which
> programmatically encodes the most complex rules of the Unicode Standard and its
> de-facto extension, CLDR.[↩](#a1)
>
> <b id="f2">2</b>
> See
> [http://unicode.org/reports/tr29/#Notation](http://unicode.org/reports/tr29/#Notation). Note
> that inserting Unicode scalar values to prevent merging of grapheme clusters would
> also constitute a kind of misbehavior (one of the clusters at the boundary would
> not be found in the result), so would be relatively costly to implement, with
> little benefit. [↩](#a2)
>
> <b id="f4">4</b> The use of non-UCA-compliant ordering is fully sanctioned by
>   the Unicode standard for this purpose.  In fact there's
>   a [whole chapter](http://www.unicode.org/versions/Unicode9.0.0/ch05.pdf)
>   dedicated to it.  In particular, §5.17 says:
>
>   > When comparing text that is visible to end users, a correct linguistic sort
>   > should be used, as described in _Section 5.16, Sorting and
>   > Searching_. However, in many circumstances the only requirement is for a
>   > fast, well-defined ordering. In such cases, a binary ordering can be used.
>
>   [↩](#a4)
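>
> A minimal sketch of such a binary ordering, using only APIs that exist
> in the standard library today.  Comparing UTF-8 code units
> lexicographically is one possible binary ordering; it is not
> necessarily the one `String` would adopt:
>
> ```swift
> let words = ["b", "a", "Z", "\u{E9}"]  // last element is "é"
>
> // Order purely by code unit values; no linguistic knowledge involved.
> let binarySorted = words.sorted { lhs, rhs in
>   lhs.utf8.lexicographicallyPrecedes(rhs.utf8)
> }
> ```
>
> The result places "Z" (0x5A) before "a" (0x61) and "é" (0xC3 0xA9)
> last.  That is fast and well-defined, but not what an end user would
> expect from a dictionary-style sort.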
>
> <b id="f5">5</b> The queries supported by `NSCharacterSet` map directly onto
> properties in a table that's indexed by Unicode scalar value.  This table is
> part of the Unicode standard.  Some of these queries (e.g., “is this an
> uppercase character?”) may have fairly obvious generalizations to grapheme
> clusters, but exactly how to do it is a research topic and *ideally* we'd either
> establish the existing practice that the Unicode committee would standardize, or
> the Unicode committee would do the research and we'd implement their
> result.[↩](#a5)
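>
> A short sketch of why the generalization is not automatic.  It assumes
> a later toolchain that exposes the scalar property table as
> `Unicode.Scalar.Properties` (Swift 5); the two candidate rules shown
> are illustrative guesses, not proposals:
>
> ```swift
> // Per-scalar queries are well defined by the Unicode property table.
> let upperA: Unicode.Scalar = "A"
> let lowerA: Unicode.Scalar = "a"
> let upperIsUpper = upperA.properties.isUppercase
> let lowerIsUpper = lowerA.properties.isUppercase
>
> // One grapheme cluster made of two scalars: "E" + combining acute.
> let cluster: Character = "E\u{301}"
> let scalars = Array(cluster.unicodeScalars)
>
> // Two plausible generalizations give different answers.
> let allUppercase = scalars.allSatisfy { $0.properties.isUppercase }
> let anyUppercase = scalars.contains { $0.properties.isUppercase }
> ```
>
> The combining mark is not itself uppercase, so the “all scalars” rule
> rejects a cluster that most readers would call uppercase, while the
> “any scalar” rule accepts it.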
>
> _______________________________________________
> swift-evolution mailing list
> [email protected]
> https://lists.swift.org/mailman/listinfo/swift-evolution
>

-- 
-Dave
