[swift-dev] State of String: ABI, Performance, Ergonomics, and You!

Michael Ilseman via swift-dev Wed, 10 Jan 2018 11:56:01 -0800

(A gist-formatted version of this email can be found at 
https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f 
<https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f>)


# State of String: ABI, Performance, Ergonomics, and You!

Hello, I’ve been working on implementing, optimizing, and improving String in 
preparation for ABI stability, and I thought I’d share the current status of 
String in Swift 5 and some potential directions to go. This is the product of 
conversations with open source contributors and my colleagues on the Swift 
standard library team at Apple. 

The primary area of focus is stabilizing String’s ABI, but we’d be remiss if we 
couldn’t also fit in performance and ergonomic improvements. String’s 
ergonomics in particular is one area where we think the status quo is woefully 
inadequate, and over half of this email is devoted to that topic. At the end, 
there’s a section about a community initiative that we hope can help users of 
String as well as guide future development.

(Note: I’m sending this to swift-dev because much of the contents revolve 
around implementation concerns. I’ll also cross-reference on swift-evolution 
and swift-users. See also the 
[StringManifesto](https://github.com/apple/swift/blob/master/docs/StringManifesto.md
 <https://github.com/apple/swift/blob/master/docs/StringManifesto.md>) from 
Swift 4 era, which elaborates more on the philosophy and justification of many 
of the things discussed here.)

## String ABI

String’s ABI consists of its in-memory layout, any methods that are public or 
callable from inlineable methods, and any interpretation of its bit pattern by 
inlineable methods.  The challenge in designing String’s ABI is determining 
what avenues are worth keeping open to pursue in the future and what should be 
closed off for good to ensure performance.

In a non-ABI-stable world, the Swift optimizer is able to peer deeply into the 
bowels of String’s implementation to perform analyses and optimizations. But, 
ABI stability introduces resilience barriers through which the optimizer cannot 
peer. Keeping reasonable performance while stabilizing an ABI requires careful 
crafting of the String type as well as reevaluating many implementation 
choices. This area is critical to String’s long term health, even though it has 
little user-visibility in the short term.

For example, as an implementation detail, String uses UTF-16 as its underlying 
encoding. For values that fit within the first 7-bit subset of UTF-16 (that is, 
ASCII), String uses 1-byte backing storage to save space. UTF-16 provides the 
best interop story for application development and interacting with ICU. But, 
other domains such as scripting, systems programming, and server-side 
development strongly prefer UTF-8 and transcoding may not serve their 
performance needs. If this implementation detail can remain behind a resilience 
barrier, that is if the performance cost is not too high, then String could 
evolve in the future to natively support backing UTF-8 storage.

The details are still in flux, and likely not of interest to most readers. Much 
of the re-gutting work is happening on the appropriately-named [StringGuts 
branch](https://github.com/apple/swift/pull/12442 
<https://github.com/apple/swift/pull/12442>).

## String Performance

Strings are pervasive in any system and their performance matters. We’re 
planning on two major efforts to improve performance this release: comparison 
improvements and small-string optimizations. Additionally, internal to the 
standard library, we’re introducing and using unmanaged strings and some 
performance flags, which may be worth surfacing as API for 
highly-performance-sensitive uses.

### String Comparison

**Strings should compare consistently**, with all the nice properties we know 
and love such as transitivity. String honors Unicode canonical equivalence, 
i.e. comparison is modulo-normalization. This means that ự (U+1EF1: LATIN SMALL 
LETTER U WITH HORN AND DOT BELOW) compares equivalent to ư ◌̣ (U+01B0 U+0323: 
LATIN SMALL LETTER U WITH HORN, COMBINING DOT BELOW) and u ◌̣ ◌̛ (U+0075 U+0323 
U+031B: LATIN SMALL LETTER U, COMBINING DOT BELOW, COMBINING HORN), and all the 
other wonderful wacky permutations thereof.

**String’s ordering is not sufficient for human presentation**, as there is no 
ordering that’s appropriate for all languages or uses. For example, `ö` comes 
before `z` in German, but not in Swedish. String’s comparison operators (`==` 
and `<`) do not attempt to provide locale-specific orderings, but instead a 
*universal* ordering, that is one which is useful for maintaining programmer 
invariants such as the sorted-ness of a data structure. String’s choice to 
honor Unicode canonical equivalence strikes a balance between universality and 
efficiency while helping most programmers avoid common pitfalls in dealing with 
Unicode.

(Quick note on localization: Localization is not within the purview of the 
standard library as it’s better provided by the platform, i.e. Foundation. If 
you’re presenting an ordering to a user, use a locale-aware method from 
Foundation such as `localizedStandardCompare()`.)

Given that ordering is not fit for human consumption, but rather machine 
processing, it might as well be fast. The current ordering differs on Darwin 
and Linux platforms, with Linux in particular suffering from poor performance 
due to choice of ordering (UCA with DUCET) and older versions of ICU. Instead, 
[String Comparison Prototype](https://github.com/apple/swift/pull/12115 
<https://github.com/apple/swift/pull/12115>)  provides a simpler ordering that 
allows for many common-case fast-paths and optimizations. For all the Unicode 
enthusiasts out there, this is the lexicographical ordering of NFC-normalized 
UTF-16 code units.

### Small String Optimization

Many strings are small. It is wasteful and inefficient to store them on the 
heap and track their lifetimes when the String struct could store it inline, 
i.e. encoded directly in the struct’s bit pattern.  Substrings, if they cover a 
small span of a potentially-large String, can avoid retaining ownership of the 
storage and instead encode their contents inline with an offset to map indices.

NSString has leveraged this kind of optimization on 64 bit platforms to great 
effect through tagged-pointer representations. Swift’s String has the advantage 
of being a struct and, depending on the final ABI, having more bits and flags 
available. We will probably have multiple small string representations.

For example, assuming a 16-byte String struct and 8 bits used for flags and 
discriminators (including discriminators for which small form a String is in), 
120 bits are available for a small string payload. 120 bits can hold 7 UTF-16 
code units, which is sufficient for most graphemes and many common words and 
separators. 120 bits can also fit 15 ASCII/UTF-8 code units without any 
packing, which suffices for many programmer/system strings (which have a strong 
skew towards ASCII).

We may also want a compact 5-bit encoding for formatted numbers, such as 64-bit 
memory addresses in hex, `Int.max` in base-10, and `Double` in base-10, which 
would require 18, 19, and 24 characters respectively. 120 bits with a 5-bit 
encoding would fit all of these. This would speed up the creation and 
interpolation of many strings containing numbers.

Final details are still in exploration. If the construction and interpretation 
of small bit patterns can remain behind a resilience barrier, new forms could 
be added in the future. However, the performance impact of this would need to 
be carefully evaluated.

### Unmanaged Strings

One of the ways that the standard library internally is dealing with the 
upcoming optimization barriers is by restructuring much of String’s 
implementation to be on top of an unmanaged form. These are rough analogues to 
UnsafeBufferPointer, i.e. a pointer and length with flags and string-semantic 
operations defined on them. Like small strings, these still have type String as 
they are just a different form encoded in the struct’s bit pattern.

Such unmanaged strings can be used when the lifetime of the storage is known to 
outlive all uses. As such, they don’t need to participate in reference 
counting. In the future, perhaps with ownership annotations or sophisticated 
static analysis, Swift could convert managed strings into unmanaged ones as an 
optimization when it knows the contents would not escape. Similarly in the 
future, a String constructed from a Substring that is known to not outlive the 
Substring’s owner can use an unmanaged form to skip copying the contents. This 
would benefit String’s status as the recommended common-currency type for API 
design.

Some kind of unmanaged string construct is an often-requested feature for 
highly performance-sensitive code. We haven’t thought too deeply about how best 
to surface this as API and it can be sliced and diced many ways. Some 
considerations:

* Does it make sense to introduce a new type (or types) here, or is it better 
to base everything on a sort of  `withUnsafeCodeUnits`-like construct that 
supplies UnsafeBufferPointers?
* How does 1-byte or 2-byte backing storage surface? Some options:
        * Closures receive optional arguments. E.g. `withUnsafeUTF16CodeUnits` 
passes an  `UnsafeBufferPointer<UInt16>?` to its closure. String would also 
have to be queryable as to the nature of the underlying storage, and those 
queries also become new API and ABI.
        * The backing storage is abstracted away sufficiently. I.e. the 1-byte 
or 2-byte decision is tracked by the type provided by e.g. 
`withUnsafeCodeUnits`. This trades a little performance for simpler usage. This 
would probably require a new type which in turn is new API and ABI.
        * Operations are given an enum of the different forms. That enum also 
becomes API as well as ABI.
                * Is the enum open or closed? Can this evolve over time?
        * Note that whether the underlying storage is 1-byte or 2-byte is not 
necessarily stable version-to-version, e.g. if String utilized UTF-8 storage 
for mostly-ASCII Strings in the future.
* Many Strings are effectively immutable and immortal, as in, the String’s 
lifetime will definitely outlive all potential uses (e.g. String literals). 
Would this work also encompass uses of StaticString? Can they be unified?
* What about a low-level API for efficiently building Strings? Could the user 
call `reserveCapacity` and then get a buffer pointer to excess capacity, 
updating count along the way?
* If this is niche and specifically for performance, should it be UnsafeString 
instead and skip bounds checks in release configurations?
* The APIs need to be designed and named. UnmanagedString’s subscript loads raw 
code units, but it also provides the functionality to perform grapheme breaking 
(as String is layered on top of it). How can this be designed in a way that 
implies semantic symmetry with String?
* How would small strings operate? Should the APIs fail on them, or receive a 
scratch buffer to spill them into, or be abstracted by the type?

### Performance Flags

Many operations such as decoding, comparison (under the new model), and 
grapheme breaking have fast-paths for common situations where the portion of 
the String under consideration does not exhibit the full complexity of Unicode. 
We plan on introducing flags denoting whether the *entire* String has these 
properties, allowing the implementation to just execute these fast-paths 
without repeatedly querying the underlying code units and with better 
instruction locality. Unsetting a flag will always preserve semantics, just 
slightly regress performance (the fast-paths will remain). Some flags include:

* **isTriviallyEncoded**: If all scalars are comprised of a single code unit, 
then decoding is trivial: just zero-extend.
* **isCanonical**: If a String is already in our preferred canonical form for 
comparison purposes, then honoring canonical equivalence is trivial.
* **isSingleScalarGrapheme**: If all graphemes are composed of a single scalar, 
then grapheme breaking is trivial.

String is a struct and these flags are stored inline, so they can only be set 
upon creation or inside a mutable method. In general, it’s not worth expensive 
checks to set these flags or onerous bookkeeping to maintain them, but a 
best-effort can be made when it’s cheap to do so. Since unsetting a flag is 
semantics-preserving, flags can be reserved for future use by always setting 
them to 0 in the current version. In this way, flags can be added in a 
backwards-compatible way, where older versions ignore them.

For highly performance-sensitive code, it might be worth having a mutating 
method to perform a linear scan and update these flags. This should be a very 
fast check to update inline flags, and exposing it as API should probably also 
expose getters to query flags. We haven’t thought deeply about how best to 
surface this for judicious use without creating an easily [mis-applied 
API](http://perldoc.perl.org/functions/study.html 
<http://perldoc.perl.org/functions/study.html>) littered throughout code bases.

### Miscellaneous

There are many other tweaks and improvements possible under the new ABI prior 
to stabilization, such as:

* Guaranteed nul-termination for String’s storage classes for faster C bridging.
* Unification of String and Substring’s various Views.
* Some form of opaque string, such as an existential, which can be used as an 
extension point.
* More compact/performant indices.

## String Ergonomics

Ergonomics is an area that’s not been well-served by String. Collection 
conformance and multi-line literals in Swift 4 were a welcome first step. But, 
trying to write a script, solve a classic whiteboard interview question, or 
participate in a coding competition can quickly become an exercise in 
frustration.

String ergonomics is a long term, multi-release endeavor. It has to continually 
compete for focus against near-term critical needs such as ABI stability and 
performance. The below provides some potential focus areas in the Swift 5 
timeframe, though we probably won’t be able to focus on them all.

### API Gaps

String’s APIs have a lot of holes in them, and we’re hoping the effort 
mentioned later in this email can help expose them. Filling these gaps is 
always in-scope for any release.

#### Character Properties

One immediately obvious gap is Character, which lacks basic queries. For 
example, it should be possible to write something like `myStr.filter { 
!$0.isWhitespace }` to get a string with all whitespace removed.

While this can quickly get bogged down in the details (Q: “Does the Mongolian 
vowel separator count as whitespace?” A: Depends on the version of Unicode!), 
there should be obvious, well-understood, generally useful queries that 
Character can answer. This area has been well-trodden by CharacterSet and its 
static properties can provide guidance on useful queries. Some potential 
queries would include isWhitespace, isNewline, isPunctuation, isAlphaNumeric, 
etc. These should probably also be present on UnicodeScalar.

These might be useful on other types such as Substring, where they would apply 
to the whole slice. E.g., it should be possible to write 
`myStr.split(separator: ";").filter { $0.isAlphaNumeric }` rather than 
`myStr.split(separator: ";").filter { !$0.contains { !$0.isAlphaNumeric } }`

Note: In no way would these properties obviate the need for CharacterSet, as 
CharacterSet API is independently useful and the right model for many 
whole-string operations.

### Interpolation: Building Strings Up

String interpolation, especially when used in conjunction with other language 
features such as multi-line string literals, present a compelling, visual way 
to construct strings.  For example, the following theoretical printout of a 
benchmark result:

```swift
print("""
 Result: \(benchmark.result)
 Execution time:
   user(ms): \(benchmark.userTime.milliseconds)
   system(ms): \(benchmark.systemTime.milliseconds)
   wall(ms): \(benchmark.wallTime.milliseconds)
 """)
```

String interpolation has many issues and limitations:

* Current interface is inefficient. Individual segments are turned into Strings 
before interpolation. While small string optimizations alleviate some of this, 
larger segments incur unneeded heap traffic and tracking.
* While it is [fairly 
customizable](https://oleb.net/blog/2017/01/fun-with-string-interpolation/ 
<https://oleb.net/blog/2017/01/fun-with-string-interpolation/>), the interface 
is clunky and deprecated. Brent Royal-Gordon has clearly articulated the 
problems and provided potential solutions in [Fix 
ExpressibleByStringInterpolation](https://github.com/brentdax/swift-evolution/blob/230e3ab5fe4a4acaabf52806f69ae3dd1ff9f538/proposals/NNNN-fix-expressible-by-string-interpolation.md
 
<https://github.com/brentdax/swift-evolution/blob/230e3ab5fe4a4acaabf52806f69ae3dd1ff9f538/proposals/NNNN-fix-expressible-by-string-interpolation.md>).
 This approach does need to be adjusted to address the performance issue above.
* Interpolated expressions are supplied inside the literal, meaning they cannot 
be passed around like format strings without extra boilerplate (e.g. a wrapping 
function).

Additionally, some concept of string formatters would greatly aid the 
ergonomics of constructing strings, whether inside interpolations or not.

### Regex: Tearing Strings Apart

Perhaps the most pressing area in need of ergonomic aid is tearing strings 
apart to extract information embedded within them. This can take many forms, 
and we would like to focus the discussion on language-level support for string 
matching literals, aka “regexes”. (Obligatory note that regexes here may or may 
not have a loose relationship to regular expressions in formal language theory, 
depending on details.)

This wouldn’t be worth doing if all it afforded was replacing `“foo”` with 
`/foo/` inside calls to NSRegularExpression. Regexes become compelling when 
viewed in conjunction with other language constructs such as switches (i.e. a 
new kind of 
[Pattern](https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/Patterns.html
 
<https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/Patterns.html>)),
 an ability to bind to variables, interpolate other matchers, a rich type 
system, etc.

Before delving into details on feature set, semantics, and implementation, 
here’s a quick tour of how they could fit into Swift.

#### One Possible Approach

This is one, brief and hand-wavy, approach to adding regexes to Swift. This is 
not a proposal, and **all syntax is a total strawman** meant for illustration 
purposes only. The regex literal syntax below is an adaptation of  [Chris 
Lattner’s strawman 
syntax](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20171120/041619.html
 
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20171120/041619.html>).
 In this syntax, whitespace is insignificant.

###### Compile Regex Literals into `Regex<T>`

```swift
let usPhoneNumber = /
 (let area: Int? <- \d{3}?) -
 (let routing: Int <- \d{3}) -
 (let local: Int <- \d{4}) /
print(type(of: usPhoneNumber)) // => Regex<(area: Int?, routing: Int, local: 
Int)>
...
let (_, routing, local) = try usPhoneNumber.match(myStr)!
print(type(of: routing)) // => Int
```

If no type is specified, the default is `Substring` covering the match.

###### Types in Captures Conform to Some Protocol

```swift
extension Int: RegexSubmatchableiblewobble {
 init?(regexSubmatch: Substring)
}
```

###### Captures and Interpolation Propagated Through Type System

```swift
let countryCode = / \+ (let country: Int <- \d+) /
print(type(of: countryCode)) // => Regex<(country: Int)>

let betterUSPhoneNumber = /
 (let country: Int? <- countryCode?)
 (let (area, routing, local) <- usPhoneNumber) /
print(type(of: betterUSPhoneNumber)) 
 // => Regex<(country: Int?, area: Int?, routing: Int, local: Int)>

let nestedCapture = /
 (let country: Int? <- countryCode?)
 (let number <- usPhoneNumber) /
print(type(of: nestedCapture)) 
 // => Regex<(country: Int?, number: (area: Int?, routing: Int, local: Int))> 
```

This alone is not sufficient to use regexes in pattern matching, as `~=` 
returns a bool rather than a tuple describing a match. For that, we need to 
**generalize pattern matching in Swift**.

###### Pattern Matching Through Conformance to `Pattern`

```swift
protocol Pattern {
 associatedtype In
 associatedtype Out
 func match(_: In) -> Out?
}
```

`~=` can still be checked if needed for compatibility. Alternatively, a default 
implementation of `match` could be provided in terms of `~=`, where `Out` is 
`Void`, i.e. only match success or failure. We may want a default 
implementation of match generic over any collection that `In` is a SubSequence 
of, to address the `Slice` vs `Collection` or `Substring` vs `String` divide.

###### Syntax for Destructing Matching

Introduce some kind of syntax for a new 
[value-binding-pattern](https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/Patterns.html#//apple_ref/swift/grammar/value-binding-pattern
 
<https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/Patterns.html#//apple_ref/swift/grammar/value-binding-pattern>)
 that performs destructing. E.g.

```swift
let myValue: T = ...
let pattern: MyTPattern<(U,U)> = ... // matches a T to (U, U)
let otherPattern: OtherTPattern<V> = ... // matches a T to V
switch myValue {
 case (let a: Int, let b: Int) <- pattern: ...
 case let pair <- pattern: ... // pair has type (U,U)
 case let d: Double <- otherPattern: ...
 case let num <- otherPattern: // num has type V
}
```

which also (hand-wavy timey-wimey) invokes some failable init on `Int` taking a 
`U`, `Double` taking a `V`, etc., guided by some protocol. 

###### Regexes are Just Patterns

```swift
struct Regex<T> {
 var program: CompiledRegex<T>
}
extension Regex: Pattern {
 typealias In = Substring
 typealias Out = T
 func match(_ s: In) -> Out? {
   // magic goes here
 }
}
```

Regex *literals* may have special syntax, but otherwise regexes are not a 
special case in the language and libraries can take advantage of generalized 
destructing matching. Similarly, character classes (which may be user defined, 
e.g. with a closure), could be Patterns over Characters.

###### Regexes Provide Variants

```swift
extension Regex {
 var caseInsensitive: Regex<T> { get }
 var allMatches: Regex<[T]> { get }
 var lineBased: Regex<[T]> { get }
 // ... matches with ranges, maybe even a Match object, etc.
 // ... lazy allMatches variant, etc.
}
```

###### Final Contrived Example (for Illustrative Purposes)

```swift
switch myStringlyTypedNumber {
case let (_, area?, routing, local) <- betterUSPhoneNumber: // ...
case let hexNumStr <- 
        / 0 x (let num <- [0-9 A-F]+) /.caseInsensitive: // ...
case let (mantissa: Double, exponent: Int) <- scientificNumber: // ...
}
```

#### Feature Set

The examples above merely illustrate how regex literals could appear, but we 
need to carefully design the feature set in accordance with Swift’s goals and 
principles as a language.

One goal of Swift is to eventually [be better at string processing than 
Perl](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html
 
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html>),
 so let’s take a look at Perl. Perl’s history is full of interesting insights. 
Perl organically evolved into the de-facto standard for regex-based string 
processing culminating in Perl 5. More importantly, they reflected on the 
[“mess”](https://perl6.org/archive/doc/design/apo/A05.html 
<https://perl6.org/archive/doc/design/apo/A05.html>) and cleaned it up in [Perl 
6](https://docs.perl6.org/language/regexes 
<https://docs.perl6.org/language/regexes>).  Perl 6 also has a 
[PEG](https://dl.acm.org/citation.cfm?id=964001.964011 
<https://dl.acm.org/citation.cfm?id=964001.964011>)-like [grammar 
system](https://docs.perl6.org/language/grammars 
<https://docs.perl6.org/language/grammars>) and we think similar approaches may 
be a part of Swift’s long-term future. For now, we’re focusing on regexes, 
although future considerations do impact the desired feature set.

Swift contrasts with Perl in that Swift places greater importance on clarity 
and simplicity over raw expressive power and extreme configurability. However, 
these are some features Swift should outright steal from Perl 6.

* **Free form syntax**. Whitespace is insignificant, comments can occur inside 
literals, etc., all of which aids readability
* **Named Captures**. If something’s worth capturing, it’s worth giving it a 
name.
* **Multi-line matching**. A new line is just another character, not a match 
terminator. 
* **Regex interpolation**. Refactor complex regexes into simpler ones (with 
names!) and interpolate them. This is perhaps the most Swifty aspect of Perl 6 
regexes!
* Miscellaneous features common to modern regex systems, such as:
        * Quantifier modifiers, e.g. frugal (“lazy”), ratcheting (“possessive”)
        * Non-capturing grouping, ratcheting (“atomic”) grouping
        * Separator-quantification (“[modified 
quantifier](https://docs.perl6.org/language/regexes#Modified_quantifier 
<https://docs.perl6.org/language/regexes#Modified_quantifier>:_%,_%%)”)
        * String literals
        * Rich set of character classes (more on that next)

##### Character Classes

We will want a broad array of builtin pre-defined character classes for use in 
regexes. Additionally, users should be able to construct and define their own, 
whether from CharacterSets (i.e. enumeration) or through an arbitrary closure 
of type `(Character) -> Bool`. There should be syntax to use user-defined 
character classes in a regex literal.

Character classes should provide many set-algebraic operations such as union, 
intersection, inversion, mutual-exclusion (“xor”), and set difference. (Note 
that for our purposes, the standard library’s `SetAlgebra` protocol is not 
appropriate for character classes, as `SetAlgebra` requires conformances to be 
able to check for subset relationships which is intractable, if not 
undecidable, for closures over graphemes.)

##### Arguable Features

Some literal syntactic features common in other regex systems may be 
unnecessary or a bad fit for Swift.

1. Substitution literals (e.g. sed-like `s/regex/replacement/g`)
        * If String had better substitution APIs, such as one that took a 
`Regex<T>` and a closure from `(T) -> String` alongside options such as `.all` 
or `.firstOnly`, that might be sufficient and/or clearer anyways.
2. Adverbs (e.g. the `i` meaning case-insensitive outside the literal’s 
delimiters)
        * These could be Regex variants accessible via computed properties, 
which may be clearer
3. Anonymous (numbered) captures
        * In general, if something’s worth capturing it’s worth giving it a 
name.
        * A construct like `(let _ = \d+)` could produce an unlabeled tuple 
element.
        * Tuples (the representation for literal captures) support 0-based 
access already, e.g. `myCaptures.3`.
4. Enumerated character class ranges for non-ASCII graphemes (the `-` in  
`[A-Z]`)
        * Convenient for ASCII, but meaningless for graphemes.
        * Good built-in classes and convenient ways for users to combine and 
enumerate their own may even obviate the need for ranges for ASCII.

##### Forbidden Features (with Alternatives)

There are some regex features that we argue should be disallowed, or at least 
carefully constrained. Regexes are powerful, specifying simple textual analysis 
in a concise syntax, but are easily misused in ways that add 
exponentially-compounding complexity in non-obvious ways. This misuse contrasts 
with Swift’s values of balancing expressivity with clarity, where complex 
behavior should be *explicitly expressed in code*.

However, we think there’s so much flexibility in what can accomplished without 
the most general form of these features that it might not be an issue in 
practice.

###### 1. Arbitrarily-Complex Look-Around Assertions

Assertions are the ability for a regex to “peek” ahead or behind at surrounding 
characters without consuming them as part of a match. They can be positive or 
negative. Arbitrarily-complex assertions include the ability to do captures as 
part of an assertion, unknown-length assertions satisfied by subpatterns, etc., 
which can have non-intuitively complex interactions with other parts of the 
regex.

**Alternative:** The following constrained assertions could be permissible:

1. Anchors (zero-width assertions), which test for positions within a string, 
e.g. `^` and `$`.

2. Fixed-length character-class or closure-based assertions, which help avoid 
[common pitfalls in 
negation](https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word
 
<https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word>)
 as well as aid readability. This would permit fixed-length arbitrary 
computation *explicitly expressed in code*. These might also encompass 
boundaries.

3. Invoking a closure on the string prefix/suffix prior/after the assertion 
point. This would permit variable-length arbitrary computation, but *explicitly 
expressed in code*.

###### 2. Arbitrary Uses of Backreferences

Backreferences allow testing a later portion of a string against a 
previously-matched portion. Unfortunately, in their most general form, they 
require exploration of the entire search space and [make matching 
NP-complete](https://perl.plover.com/NPC/NPC-3SAT.html 
<https://perl.plover.com/NPC/NPC-3SAT.html>), that is take exponential time. 
Very simple uses of backreferences just want to check for an unambiguous 
capture after the fact, for which a `where` clause suffices. Very complex uses 
are better served with an exponential loop *explicitly expressed in code*.

**Alternative:** There is a range of convenient uses between these two 
extremes, which we may be able to tractably address. The feasibility of these 
is still under investigation, but some potential constrained backreferences may 
include:

1. A “ratcheting” backreference, where upon matching (i.e. the current section 
of the string satisfies the backreference), the referred-to capture is “frozen” 
in place, so that alternative assignments need not be explored.
2. A delimiter-checking construct. While eventually we want to encourage the 
user to write a real grammar/parser (assuming some future grammar/parser 
description system in Swift), simple delimiter matching can greatly aid 
readability. This might even be general and powerful enough to define tasks 
such as pairing XML tags and ignoring escaped delimiters or delimiters inside 
literals. Ideally, this would be done in a direction conducive to a future 
parsing solution.

#### Semantics

Matching semantics should roughly corresponding to [UTS #18 Level 
2](https://unicode.org/reports/tr18/#Extended_Unicode_Support 
<https://unicode.org/reports/tr18/#Extended_Unicode_Support>) of Unicode 
support. That is, “any character” would be any Swift Character (a grapheme), 
matching obeys canonical equivalence, etc. This is consistent with String’s 
overall philosophy of universality and harm reduction. This does require that 
the implementation use Unicode rules and tables provided at run time, e.g. for 
grapheme breaking.

Regex interpolation should effectively “inline” the interpolated regex, i.e. 
[formal 
concatenation](https://en.m.wikipedia.org/wiki/Regular_expression#Formal_definition
 <https://en.m.wikipedia.org/wiki/Regular_expression#Formal_definition>). We 
will probably also want different combination semantics for combining patterns 
in general (including regexes), possibly in conjunction with a future parsing 
solution. For example, a form of alternation that is short-circuiting and some 
form of concatenation that is 
[“ratcheting”](https://docs.perl6.org/language/regexes#Ratchet 
<https://docs.perl6.org/language/regexes#Ratchet>), where something like  
`(myRegex1 || myRegex2) ++ myRegex3` tries myRegex1 first, and if successful 
myRegex2 is *never* tried regardless of whether myRegex3 succeeds or not. This 
is similar to the semantics of 
[PEGs](https://en.wikipedia.org/wiki/Parsing_expression_grammar 
<https://en.wikipedia.org/wiki/Parsing_expression_grammar>) or 
parser-combinators and much more closely reflects programmer intuition for 
parsing.

We’ll need to nail down a lot of important details:

* How are alternations matched? 
[LTM](https://perlgeek.de/en/article/longest-token-matching 
<https://perlgeek.de/en/article/longest-token-matching>)?
* What are the built in character classes and what do they cover?
        * What Unicode properties to expose?
        * Should there be API symmetry with Character queries mentioned above?
* What are the built in anchors, fixed-length assertions, and can they be 
customized?
* What is the type of a quantified capture? An array of the capture’s declared 
type?
* What guarantees about runtime and space complexity do we provide, if any?
* While we may have variants for whole-string, partial-string, line-by-line, 
etc., what should the defaults be?

#### Implementation

With static knowledge, many regexes can be compiled into [efficient, minimal 
state machines](https://swtch.com/%7Ersc/regexp/regexp1.html 
<https://swtch.com/%7Ersc/regexp/regexp1.html>) (albeit with Unicode rules and 
tables as input at run time). [Virtual machine 
approaches](https://swtch.com/%7Ersc/regexp/regexp2.html 
<https://swtch.com/%7Ersc/regexp/regexp2.html>) can make it easier to specify 
and guarantee behavior while also being much more extensible. They also may 
alleviate some ABI-stability concerns, as it’s easier to keep a bytecode stable 
while growing new features.

Delivering ergonomic regexes in Swift 5 should not be blocked on having an 
efficient implementation available immediately. However, long term, it is 
important to have a [predictable and consistent performance 
model](https://github.com/google/re2/wiki/WhyRE2 
<https://github.com/google/re2/wiki/WhyRE2>) to prevent [unanticipated 
exponential behavior](https://en.wikipedia.org/wiki/ReDoS 
<https://en.wikipedia.org/wiki/ReDoS>). Thus, implementation concerns are 
relevant in scoping the initial feature set (e.g. regarding backreferences). 

Some regexes are amenable to task-parallel strategies (i.e. threads), 
data-parallel strategies (i.e. vector or GPU), or both. Swift will probably 
grow new strategies and combinations of strategies over time. Implementations 
can also take advantage of the aforementioned performance flags (e.g. 
isCanonical) to greatly speed up matches.

The interface to a regex engine will be part of Swift’s ABI, including whatever 
intermediate representation regexes are compiled into. This could be a simple 
regex AST, a virtual machine bytecode, etc., but it needs to be extensible to 
accommodate any potential future additions and implementation strategies.

### Recap: Potential Additions for Swift 5

* Some form of unmanaged or unsafe Strings, and corresponding APIs
* Exposing performance flags, and some way to request a scan to populate them
* API gaps
* Character and UnicodeScalar properties, such as isNewline
* Generalizing, and optimizing, String interpolation
* Regex literals, Regex type, and generalized pattern match destructuring
* Substitution APIs, in conjunction with Regexes.

## Community

With Swift’s mailing lists [moving to 
Discourse](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20171211/042127.html
 
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20171211/042127.html>),
 this allows the opportunity for better community engagement and 
cross-referencing posts. We plan on starting an area under the “Using Swift” 
category focusing on idiomatic and/or performant solutions to string 
programming needs. These might be anything from simple programming problems to 
make-my-parser-faster challenges.

While this provides a community benefit by showing how best to use String, our 
ulterior motive is to mine posts looking for benchmarks, API gaps, and 
representative code that new features could improve. We’ll also, of course, 
want to bikeshed the name! My vote is for “String Dojo”, but there’s probably a 
reason I don’t normally name things.

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev

[swift-dev] State of String: ABI, Performance, Ergonomics, and You!

Reply via email to