(A gist-formatted version of this email can be found at
https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f
<https://gist.github.com/milseman/bb39ef7f170641ae52c13600a512782f>)
# State of String: ABI, Performance, Ergonomics, and You!
Hello, I’ve been working on implementing, optimizing, and improving String in
preparation for ABI stability, and I thought I’d share the current status of
String in Swift 5 and some potential directions to go. This is the product of
conversations with open source contributors and my colleagues on the Swift
standard library team at Apple.
The primary area of focus is stabilizing String’s ABI, but we’d be remiss if we
couldn’t also fit in performance and ergonomic improvements. String’s
ergonomics in particular is one area where we think the status quo is woefully
inadequate, and over half of this email is devoted to that topic. At the end,
there’s a section about a community initiative that we hope can help users of
String as well as guide future development.
(Note: I’m sending this to swift-dev because much of the contents revolve
around implementation concerns. I’ll also cross-reference on swift-evolution
and swift-users. See also the
[StringManifesto](https://github.com/apple/swift/blob/master/docs/StringManifesto.md
<https://github.com/apple/swift/blob/master/docs/StringManifesto.md>) from
Swift 4 era, which elaborates more on the philosophy and justification of many
of the things discussed here.)
## String ABI
String’s ABI consists of its in-memory layout, any methods that are public or
callable from inlineable methods, and any interpretation of its bit pattern by
inlineable methods. The challenge in designing String’s ABI is determining
what avenues are worth keeping open to pursue in the future and what should be
closed off for good to ensure performance.
In a non-ABI-stable world, the Swift optimizer is able to peer deeply into the
bowels of String’s implementation to perform analyses and optimizations. But,
ABI stability introduces resilience barriers through which the optimizer cannot
peer. Keeping reasonable performance while stabilizing an ABI requires careful
crafting of the String type as well as reevaluating many implementation
choices. This area is critical to String’s long term health, even though it has
little user-visibility in the short term.
For example, as an implementation detail, String uses UTF-16 as its underlying
encoding. For values that fit within the first 7-bit subset of UTF-16 (that is,
ASCII), String uses 1-byte backing storage to save space. UTF-16 provides the
best interop story for application development and interacting with ICU. But,
other domains such as scripting, systems programming, and server-side
development strongly prefer UTF-8 and transcoding may not serve their
performance needs. If this implementation detail can remain behind a resilience
barrier, that is if the performance cost is not too high, then String could
evolve in the future to natively support backing UTF-8 storage.
The details are still in flux, and likely not of interest to most readers. Much
of the re-gutting work is happening on the appropriately-named [StringGuts
branch](https://github.com/apple/swift/pull/12442
<https://github.com/apple/swift/pull/12442>).
## String Performance
Strings are pervasive in any system and their performance matters. We’re
planning on two major efforts to improve performance this release: comparison
improvements and small-string optimizations. Additionally, internal to the
standard library, we’re introducing and using unmanaged strings and some
performance flags, which may be worth surfacing as API for
highly-performance-sensitive uses.
### String Comparison
**Strings should compare consistently**, with all the nice properties we know
and love such as transitivity. String honors Unicode canonical equivalence,
i.e. comparison is modulo-normalization. This means that ự (U+1EF1: LATIN SMALL
LETTER U WITH HORN AND DOT BELOW) compares equivalent to ư ◌̣ (U+01B0 U+0323:
LATIN SMALL LETTER U WITH HORN, COMBINING DOT BELOW) and u ◌̣ ◌̛ (U+0075 U+0323
U+031B: LATIN SMALL LETTER U, COMBINING DOT BELOW, COMBINING HORN), and all the
other wonderful wacky permutations thereof.
**String’s ordering is not sufficient for human presentation**, as there is no
ordering that’s appropriate for all languages or uses. For example, `ö` comes
before `z` in German, but not in Swedish. String’s comparison operators (`==`
and `<`) do not attempt to provide locale-specific orderings, but instead a
*universal* ordering, that is one which is useful for maintaining programmer
invariants such as the sorted-ness of a data structure. String’s choice to
honor Unicode canonical equivalence strikes a balance between universality and
efficiency while helping most programmers avoid common pitfalls in dealing with
Unicode.
(Quick note on localization: Localization is not within the purview of the
standard library as it’s better provided by the platform, i.e. Foundation. If
you’re presenting an ordering to a user, use a locale-aware method from
Foundation such as `localizedStandardCompare()`.)
Given that ordering is not fit for human consumption, but rather machine
processing, it might as well be fast. The current ordering differs on Darwin
and Linux platforms, with Linux in particular suffering from poor performance
due to choice of ordering (UCA with DUCET) and older versions of ICU. Instead,
[String Comparison Prototype](https://github.com/apple/swift/pull/12115
<https://github.com/apple/swift/pull/12115>) provides a simpler ordering that
allows for many common-case fast-paths and optimizations. For all the Unicode
enthusiasts out there, this is the lexicographical ordering of NFC-normalized
UTF-16 code units.
### Small String Optimization
Many strings are small. It is wasteful and inefficient to store them on the
heap and track their lifetimes when the String struct could store it inline,
i.e. encoded directly in the struct’s bit pattern. Substrings, if they cover a
small span of a potentially-large String, can avoid retaining ownership of the
storage and instead encode their contents inline with an offset to map indices.
NSString has leveraged this kind of optimization on 64 bit platforms to great
effect through tagged-pointer representations. Swift’s String has the advantage
of being a struct and, depending on the final ABI, having more bits and flags
available. We will probably have multiple small string representations.
For example, assuming a 16-byte String struct and 8 bits used for flags and
discriminators (including discriminators for which small form a String is in),
120 bits are available for a small string payload. 120 bits can hold 7 UTF-16
code units, which is sufficient for most graphemes and many common words and
separators. 120 bits can also fit 15 ASCII/UTF-8 code units without any
packing, which suffices for many programmer/system strings (which have a strong
skew towards ASCII).
We may also want a compact 5-bit encoding for formatted numbers, such as 64-bit
memory addresses in hex, `Int.max` in base-10, and `Double` in base-10, which
would require 18, 19, and 24 characters respectively. 120 bits with a 5-bit
encoding would fit all of these. This would speed up the creation and
interpolation of many strings containing numbers.
Final details are still in exploration. If the construction and interpretation
of small bit patterns can remain behind a resilience barrier, new forms could
be added in the future. However, the performance impact of this would need to
be carefully evaluated.
### Unmanaged Strings
One of the ways that the standard library internally is dealing with the
upcoming optimization barriers is by restructuring much of String’s
implementation to be on top of an unmanaged form. These are rough analogues to
UnsafeBufferPointer, i.e. a pointer and length with flags and string-semantic
operations defined on them. Like small strings, these still have type String as
they are just a different form encoded in the struct’s bit pattern.
Such unmanaged strings can be used when the lifetime of the storage is known to
outlive all uses. As such, they don’t need to participate in reference
counting. In the future, perhaps with ownership annotations or sophisticated
static analysis, Swift could convert managed strings into unmanaged ones as an
optimization when it knows the contents would not escape. Similarly in the
future, a String constructed from a Substring that is known to not outlive the
Substring’s owner can use an unmanaged form to skip copying the contents. This
would benefit String’s status as the recommended common-currency type for API
design.
Some kind of unmanaged string construct is an often-requested feature for
highly performance-sensitive code. We haven’t thought too deeply about how best
to surface this as API and it can be sliced and diced many ways. Some
considerations:
* Does it make sense to introduce a new type (or types) here, or is it better
to base everything on a sort of `withUnsafeCodeUnits`-like construct that
supplies UnsafeBufferPointers?
* How does 1-byte or 2-byte backing storage surface? Some options:
* Closures receive optional arguments. E.g. `withUnsafeUTF16CodeUnits`
passes an `UnsafeBufferPointer<UInt16>?` to its closure. String would also
have to be queryable as to the nature of the underlying storage, and those
queries also become new API and ABI.
* The backing storage is abstracted away sufficiently. I.e. the 1-byte
or 2-byte decision is tracked by the type provided by e.g.
`withUnsafeCodeUnits`. This trades a little performance for simpler usage. This
would probably require a new type which in turn is new API and ABI.
* Operations are given an enum of the different forms. That enum also
becomes API as well as ABI.
* Is the enum open or closed? Can this evolve over time?
* Note that whether the underlying storage is 1-byte or 2-byte is not
necessarily stable version-to-version, e.g. if String utilized UTF-8 storage
for mostly-ASCII Strings in the future.
* Many Strings are effectively immutable and immortal, as in, the String’s
lifetime will definitely outlive all potential uses (e.g. String literals).
Would this work also encompass uses of StaticString? Can they be unified?
* What about a low-level API for efficiently building Strings? Could the user
call `reserveCapacity` and then get a buffer pointer to excess capacity,
updating count along the way?
* If this is niche and specifically for performance, should it be UnsafeString
instead and skip bounds checks in release configurations?
* The APIs need to be designed and named. UnmanagedString’s subscript loads raw
code units, but it also provides the functionality to perform grapheme breaking
(as String is layered on top of it). How can this be designed in a way that
implies semantic symmetry with String?
* How would small strings operate? Should the APIs fail on them, or receive a
scratch buffer to spill them into, or be abstracted by the type?
### Performance Flags
Many operations such as decoding, comparison (under the new model), and
grapheme breaking have fast-paths for common situations where the portion of
the String under consideration does not exhibit the full complexity of Unicode.
We plan on introducing flags denoting whether the *entire* String has these
properties, allowing the implementation to just execute these fast-paths
without repeatedly querying the underlying code units and with better
instruction locality. Unsetting a flag will always preserve semantics, just
slightly regress performance (the fast-paths will remain). Some flags include:
* **isTriviallyEncoded**: If all scalars are comprised of a single code unit,
then decoding is trivial: just zero-extend.
* **isCanonical**: If a String is already in our preferred canonical form for
comparison purposes, then honoring canonical equivalence is trivial.
* **isSingleScalarGrapheme**: If all graphemes are composed of a single scalar,
then grapheme breaking is trivial.
String is a struct and these flags are stored inline, so they can only be set
upon creation or inside a mutable method. In general, it’s not worth expensive
checks to set these flags or onerous bookkeeping to maintain them, but a
best-effort can be made when it’s cheap to do so. Since unsetting a flag is
semantics-preserving, flags can be reserved for future use by always setting
them to 0 in the current version. In this way, flags can be added in a
backwards-compatible way, where older versions ignore them.
For highly performance-sensitive code, it might be worth having a mutating
method to perform a linear scan and update these flags. This should be a very
fast check to update inline flags, and exposing it as API should probably also
expose getters to query flags. We haven’t thought deeply about how best to
surface this for judicious use without creating an easily [mis-applied
API](http://perldoc.perl.org/functions/study.html
<http://perldoc.perl.org/functions/study.html>) littered throughout code bases.
### Miscellaneous
There are many other tweaks and improvements possible under the new ABI prior
to stabilization, such as:
* Guaranteed nul-termination for String’s storage classes for faster C bridging.
* Unification of String and Substring’s various Views.
* Some form of opaque string, such as an existential, which can be used as an
extension point.
* More compact/performant indices.
## String Ergonomics
Ergonomics is an area that’s not been well-served by String. Collection
conformance and multi-line literals in Swift 4 were a welcome first step. But,
trying to write a script, solve a classic whiteboard interview question, or
participate in a coding competition can quickly become an exercise in
frustration.
String ergonomics is a long term, multi-release endeavor. It has to continually
compete for focus against near-term critical needs such as ABI stability and
performance. The below provides some potential focus areas in the Swift 5
timeframe, though we probably won’t be able to focus on them all.
### API Gaps
String’s APIs have a lot of holes in them, and we’re hoping the effort
mentioned later in this email can help expose them. Filling these gaps is
always in-scope for any release.
#### Character Properties
One immediately obvious gap is Character, which lacks basic queries. For
example, it should be possible to write something like `myStr.filter {
!$0.isWhitespace }` to get a string with all whitespace removed.
While this can quickly get bogged down in the details (Q: “Does the Mongolian
vowel separator count as whitespace?” A: Depends on the version of Unicode!),
there should be obvious, well-understood, generally useful queries that
Character can answer. This area has been well-trodden by CharacterSet and its
static properties can provide guidance on useful queries. Some potential
queries would include isWhitespace, isNewline, isPunctuation, isAlphaNumeric,
etc. These should probably also be present on UnicodeScalar.
These might be useful on other types such as Substring, where they would apply
to the whole slice. E.g., it should be possible to write
`myStr.split(separator: ";").filter { $0.isAlphaNumeric }` rather than
`myStr.split(separator: ";").filter { !$0.contains { !$0.isAlphaNumeric } }`
Note: In no way would these properties obviate the need for CharacterSet, as
CharacterSet API is independently useful and the right model for many
whole-string operations.
### Interpolation: Building Strings Up
String interpolation, especially when used in conjunction with other language
features such as multi-line string literals, present a compelling, visual way
to construct strings. For example, the following theoretical printout of a
benchmark result:
```swift
print("""
Result: \(benchmark.result)
Execution time:
user(ms): \(benchmark.userTime.milliseconds)
system(ms): \(benchmark.systemTime.milliseconds)
wall(ms): \(benchmark.wallTime.milliseconds)
""")
```
String interpolation has many issues and limitations:
* Current interface is inefficient. Individual segments are turned into Strings
before interpolation. While small string optimizations alleviate some of this,
larger segments incur unneeded heap traffic and tracking.
* While it is [fairly
customizable](https://oleb.net/blog/2017/01/fun-with-string-interpolation/
<https://oleb.net/blog/2017/01/fun-with-string-interpolation/>), the interface
is clunky and deprecated. Brent Royal-Gordon has clearly articulated the
problems and provided potential solutions in [Fix
ExpressibleByStringInterpolation](https://github.com/brentdax/swift-evolution/blob/230e3ab5fe4a4acaabf52806f69ae3dd1ff9f538/proposals/NNNN-fix-expressible-by-string-interpolation.md
<https://github.com/brentdax/swift-evolution/blob/230e3ab5fe4a4acaabf52806f69ae3dd1ff9f538/proposals/NNNN-fix-expressible-by-string-interpolation.md>).
This approach does need to be adjusted to address the performance issue above.
* Interpolated expressions are supplied inside the literal, meaning they cannot
be passed around like format strings without extra boilerplate (e.g. a wrapping
function).
Additionally, some concept of string formatters would greatly aid the
ergonomics of constructing strings, whether inside interpolations or not.
### Regex: Tearing Strings Apart
Perhaps the most pressing area in need of ergonomic aid is tearing strings
apart to extract information embedded within them. This can take many forms,
and we would like to focus the discussion on language-level support for string
matching literals, aka “regexes”. (Obligatory note that regexes here may or may
not have a loose relationship to regular expressions in formal language theory,
depending on details.)
This wouldn’t be worth doing if all it afforded was replacing `“foo”` with
`/foo/` inside calls to NSRegularExpression. Regexes become compelling when
viewed in conjunction with other language constructs such as switches (i.e. a
new kind of
[Pattern](https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/Patterns.html
<https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/Patterns.html>)),
an ability to bind to variables, interpolate other matchers, a rich type
system, etc.
Before delving into details on feature set, semantics, and implementation,
here’s a quick tour of how they could fit into Swift.
#### One Possible Approach
This is one, brief and hand-wavy, approach to adding regexes to Swift. This is
not a proposal, and **all syntax is a total strawman** meant for illustration
purposes only. The regex literal syntax below is an adaptation of [Chris
Lattner’s strawman
syntax](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20171120/041619.html
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20171120/041619.html>).
In this syntax, whitespace is insignificant.
###### Compile Regex Literals into `Regex<T>`
```swift
let usPhoneNumber = /
(let area: Int? <- \d{3}?) -
(let routing: Int <- \d{3}) -
(let local: Int <- \d{4}) /
print(type(of: usPhoneNumber)) // => Regex<(area: Int?, routing: Int, local:
Int)>
...
let (_, routing, local) = try usPhoneNumber.match(myStr)!
print(type(of: routing)) // => Int
```
If no type is specified, the default is `Substring` covering the match.
###### Types in Captures Conform to Some Protocol
```swift
extension Int: RegexSubmatchableiblewobble {
init?(regexSubmatch: Substring)
}
```
###### Captures and Interpolation Propagated Through Type System
```swift
let countryCode = / \+ (let country: Int <- \d+) /
print(type(of: countryCode)) // => Regex<(country: Int)>
let betterUSPhoneNumber = /
(let country: Int? <- countryCode?)
(let (area, routing, local) <- usPhoneNumber) /
print(type(of: betterUSPhoneNumber))
// => Regex<(country: Int?, area: Int?, routing: Int, local: Int)>
let nestedCapture = /
(let country: Int? <- countryCode?)
(let number <- usPhoneNumber) /
print(type(of: nestedCapture))
// => Regex<(country: Int?, number: (area: Int?, routing: Int, local: Int))>
```
This alone is not sufficient to use regexes in pattern matching, as `~=`
returns a bool rather than a tuple describing a match. For that, we need to
**generalize pattern matching in Swift**.
###### Pattern Matching Through Conformance to `Pattern`
```swift
protocol Pattern {
associatedtype In
associatedtype Out
func match(_: In) -> Out?
}
```
`~=` can still be checked if needed for compatibility. Alternatively, a default
implementation of `match` could be provided in terms of `~=`, where `Out` is
`Void`, i.e. only match success or failure. We may want a default
implementation of match generic over any collection that `In` is a SubSequence
of, to address the `Slice` vs `Collection` or `Substring` vs `String` divide.
###### Syntax for Destructing Matching
Introduce some kind of syntax for a new
[value-binding-pattern](https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/Patterns.html#//apple_ref/swift/grammar/value-binding-pattern
<https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/Patterns.html#//apple_ref/swift/grammar/value-binding-pattern>)
that performs destructing. E.g.
```swift
let myValue: T = ...
let pattern: MyTPattern<(U,U)> = ... // matches a T to (U, U)
let otherPattern: OtherTPattern<V> = ... // matches a T to V
switch myValue {
case (let a: Int, let b: Int) <- pattern: ...
case let pair <- pattern: ... // pair has type (U,U)
case let d: Double <- otherPattern: ...
case let num <- otherPattern: // num has type V
}
```
which also (hand-wavy timey-wimey) invokes some failable init on `Int` taking a
`U`, `Double` taking a `V`, etc., guided by some protocol.
###### Regexes are Just Patterns
```swift
struct Regex<T> {
var program: CompiledRegex<T>
}
extension Regex: Pattern {
typealias In = Substring
typealias Out = T
func match(_ s: In) -> Out? {
// magic goes here
}
}
```
Regex *literals* may have special syntax, but otherwise regexes are not a
special case in the language and libraries can take advantage of generalized
destructing matching. Similarly, character classes (which may be user defined,
e.g. with a closure), could be Patterns over Characters.
###### Regexes Provide Variants
```swift
extension Regex {
var caseInsensitive: Regex<T> { get }
var allMatches: Regex<[T]> { get }
var lineBased: Regex<[T]> { get }
// ... matches with ranges, maybe even a Match object, etc.
// ... lazy allMatches variant, etc.
}
```
###### Final Contrived Example (for Illustrative Purposes)
```swift
switch myStringlyTypedNumber {
case let (_, area?, routing, local) <- betterUSPhoneNumber: // ...
case let hexNumStr <-
/ 0 x (let num <- [0-9 A-F]+) /.caseInsensitive: // ...
case let (mantissa: Double, exponent: Int) <- scientificNumber: // ...
}
```
#### Feature Set
The examples above merely illustrate how regex literals could appear, but we
need to carefully design the feature set in accordance with Swift’s goals and
principles as a language.
One goal of Swift is to eventually [be better at string processing than
Perl](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20160725/025676.html>),
so let’s take a look at Perl. Perl’s history is full of interesting insights.
Perl organically evolved into the de-facto standard for regex-based string
processing culminating in Perl 5. More importantly, they reflected on the
[“mess”](https://perl6.org/archive/doc/design/apo/A05.html
<https://perl6.org/archive/doc/design/apo/A05.html>) and cleaned it up in [Perl
6](https://docs.perl6.org/language/regexes
<https://docs.perl6.org/language/regexes>). Perl 6 also has a
[PEG](https://dl.acm.org/citation.cfm?id=964001.964011
<https://dl.acm.org/citation.cfm?id=964001.964011>)-like [grammar
system](https://docs.perl6.org/language/grammars
<https://docs.perl6.org/language/grammars>) and we think similar approaches may
be a part of Swift’s long-term future. For now, we’re focusing on regexes,
although future considerations do impact the desired feature set.
Swift contrasts with Perl in that Swift places greater importance on clarity
and simplicity over raw expressive power and extreme configurability. However,
these are some features Swift should outright steal from Perl 6.
* **Free form syntax**. Whitespace is insignificant, comments can occur inside
literals, etc., all of which aids readability
* **Named Captures**. If something’s worth capturing, it’s worth giving it a
name.
* **Multi-line matching**. A new line is just another character, not a match
terminator.
* **Regex interpolation**. Refactor complex regexes into simpler ones (with
names!) and interpolate them. This is perhaps the most Swifty aspect of Perl 6
regexes!
* Miscellaneous features common to modern regex systems, such as:
* Quantifier modifiers, e.g. frugal (“lazy”), ratcheting (“possessive”)
* Non-capturing grouping, ratcheting (“atomic”) grouping
* Separator-quantification (“[modified
quantifier](https://docs.perl6.org/language/regexes#Modified_quantifier
<https://docs.perl6.org/language/regexes#Modified_quantifier>:_%,_%%)”)
* String literals
* Rich set of character classes (more on that next)
##### Character Classes
We will want a broad array of builtin pre-defined character classes for use in
regexes. Additionally, users should be able to construct and define their own,
whether from CharacterSets (i.e. enumeration) or through an arbitrary closure
of type `(Character) -> Bool`. There should be syntax to use user-defined
character classes in a regex literal.
Character classes should provide many set-algebraic operations such as union,
intersection, inversion, mutual-exclusion (“xor”), and set difference. (Note
that for our purposes, the standard library’s `SetAlgebra` protocol is not
appropriate for character classes, as `SetAlgebra` requires conformances to be
able to check for subset relationships which is intractable, if not
undecidable, for closures over graphemes.)
##### Arguable Features
Some literal syntactic features common in other regex systems may be
unnecessary or a bad fit for Swift.
1. Substitution literals (e.g. sed-like `s/regex/replacement/g`)
* If String had better substitution APIs, such as one that took a
`Regex<T>` and a closure from `(T) -> String` alongside options such as `.all`
or `.firstOnly`, that might be sufficient and/or clearer anyways.
2. Adverbs (e.g. the `i` meaning case-insensitive outside the literal’s
delimiters)
* These could be Regex variants accessible via computed properties,
which may be clearer
3. Anonymous (numbered) captures
* In general, if something’s worth capturing it’s worth giving it a
name.
* A construct like `(let _ = \d+)` could produce an unlabeled tuple
element.
* Tuples (the representation for literal captures) support 0-based
access already, e.g. `myCaptures.3`.
4. Enumerated character class ranges for non-ASCII graphemes (the `-` in
`[A-Z]`)
* Convenient for ASCII, but meaningless for graphemes.
* Good built-in classes and convenient ways for users to combine and
enumerate their own may even obviate the need for ranges for ASCII.
##### Forbidden Features (with Alternatives)
There are some regex features that we argue should be disallowed, or at least
carefully constrained. Regexes are powerful, specifying simple textual analysis
in a concise syntax, but are easily misused in ways that add
exponentially-compounding complexity in non-obvious ways. This misuse contrasts
with Swift’s values of balancing expressivity with clarity, where complex
behavior should be *explicitly expressed in code*.
However, we think there’s so much flexibility in what can accomplished without
the most general form of these features that it might not be an issue in
practice.
###### 1. Arbitrarily-Complex Look-Around Assertions
Assertions are the ability for a regex to “peek” ahead or behind at surrounding
characters without consuming them as part of a match. They can be positive or
negative. Arbitrarily-complex assertions include the ability to do captures as
part of an assertion, unknown-length assertions satisfied by subpatterns, etc.,
which can have non-intuitively complex interactions with other parts of the
regex.
**Alternative:** The following constrained assertions could be permissible:
1. Anchors (zero-width assertions), which test for positions within a string,
e.g. `^` and `$`.
2. Fixed-length character-class or closure-based assertions, which help avoid
[common pitfalls in
negation](https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word
<https://stackoverflow.com/questions/406230/regular-expression-to-match-a-line-that-doesnt-contain-a-word>)
as well as aid readability. This would permit fixed-length arbitrary
computation *explicitly expressed in code*. These might also encompass
boundaries.
3. Invoking a closure on the string prefix/suffix prior/after the assertion
point. This would permit variable-length arbitrary computation, but *explicitly
expressed in code*.
###### 2. Arbitrary Uses of Backreferences
Backreferences allow testing a later portion of a string against a
previously-matched portion. Unfortunately, in their most general form, they
require exploration of the entire search space and [make matching
NP-complete](https://perl.plover.com/NPC/NPC-3SAT.html
<https://perl.plover.com/NPC/NPC-3SAT.html>), that is take exponential time.
Very simple uses of backreferences just want to check for an unambiguous
capture after the fact, for which a `where` clause suffices. Very complex uses
are better served with an exponential loop *explicitly expressed in code*.
**Alternative:** There is a range of convenient uses between these two
extremes, which we may be able to tractably address. The feasibility of these
is still under investigation, but some potential constrained backreferences may
include:
1. A “ratcheting” backreference, where upon matching (i.e. the current section
of the string satisfies the backreference), the referred-to capture is “frozen”
in place, so that alternative assignments need not be explored.
2. A delimiter-checking construct. While eventually we want to encourage the
user to write a real grammar/parser (assuming some future grammar/parser
description system in Swift), simple delimiter matching can greatly aid
readability. This might even be general and powerful enough to define tasks
such as pairing XML tags and ignoring escaped delimiters or delimiters inside
literals. Ideally, this would be done in a direction conducive to a future
parsing solution.
#### Semantics
Matching semantics should roughly corresponding to [UTS #18 Level
2](https://unicode.org/reports/tr18/#Extended_Unicode_Support
<https://unicode.org/reports/tr18/#Extended_Unicode_Support>) of Unicode
support. That is, “any character” would be any Swift Character (a grapheme),
matching obeys canonical equivalence, etc. This is consistent with String’s
overall philosophy of universality and harm reduction. This does require that
the implementation use Unicode rules and tables provided at run time, e.g. for
grapheme breaking.
Regex interpolation should effectively “inline” the interpolated regex, i.e.
[formal
concatenation](https://en.m.wikipedia.org/wiki/Regular_expression#Formal_definition
<https://en.m.wikipedia.org/wiki/Regular_expression#Formal_definition>). We
will probably also want different combination semantics for combining patterns
in general (including regexes), possibly in conjunction with a future parsing
solution. For example, a form of alternation that is short-circuiting and some
form of concatenation that is
[“ratcheting”](https://docs.perl6.org/language/regexes#Ratchet
<https://docs.perl6.org/language/regexes#Ratchet>), where something like
`(myRegex1 || myRegex2) ++ myRegex3` tries myRegex1 first, and if successful
myRegex2 is *never* tried regardless of whether myRegex3 succeeds or not. This
is similar to the semantics of
[PEGs](https://en.wikipedia.org/wiki/Parsing_expression_grammar
<https://en.wikipedia.org/wiki/Parsing_expression_grammar>) or
parser-combinators and much more closely reflects programmer intuition for
parsing.
We’ll need to nail down a lot of important details:
* How are alternations matched?
[LTM](https://perlgeek.de/en/article/longest-token-matching
<https://perlgeek.de/en/article/longest-token-matching>)?
* What are the built in character classes and what do they cover?
* What Unicode properties to expose?
* Should there be API symmetry with Character queries mentioned above?
* What are the built in anchors, fixed-length assertions, and can they be
customized?
* What is the type of a quantified capture? An array of the capture’s declared
type?
* What guarantees about runtime and space complexity do we provide, if any?
* While we may have variants for whole-string, partial-string, line-by-line,
etc., what should the defaults be?
#### Implementation
With static knowledge, many regexes can be compiled into [efficient, minimal
state machines](https://swtch.com/%7Ersc/regexp/regexp1.html
<https://swtch.com/%7Ersc/regexp/regexp1.html>) (albeit with Unicode rules and
tables as input at run time). [Virtual machine
approaches](https://swtch.com/%7Ersc/regexp/regexp2.html
<https://swtch.com/%7Ersc/regexp/regexp2.html>) can make it easier to specify
and guarantee behavior while also being much more extensible. They also may
alleviate some ABI-stability concerns, as it’s easier to keep a bytecode stable
while growing new features.
Delivering ergonomic regexes in Swift 5 should not be blocked on having an
efficient implementation available immediately. However, long term, it is
important to have a [predictable and consistent performance
model](https://github.com/google/re2/wiki/WhyRE2
<https://github.com/google/re2/wiki/WhyRE2>) to prevent [unanticipated
exponential behavior](https://en.wikipedia.org/wiki/ReDoS
<https://en.wikipedia.org/wiki/ReDoS>). Thus, implementation concerns are
relevant in scoping the initial feature set (e.g. regarding backreferences).
Some regexes are amenable to task-parallel strategies (i.e. threads),
data-parallel strategies (i.e. vector or GPU), or both. Swift will probably
grow new strategies and combinations of strategies over time. Implementations
can also take advantage of the aforementioned performance flags (e.g.
isCanonical) to greatly speed up matches.
The interface to a regex engine will be part of Swift’s ABI, including whatever
intermediate representation regexes are compiled into. This could be a simple
regex AST, a virtual machine bytecode, etc., but it needs to be extensible to
accommodate any potential future additions and implementation strategies.
### Recap: Potential Additions for Swift 5
* Some form of unmanaged or unsafe Strings, and corresponding APIs
* Exposing performance flags, and some way to request a scan to populate them
* API gaps
* Character and UnicodeScalar properties, such as isNewline
* Generalizing, and optimizing, String interpolation
* Regex literals, Regex type, and generalized pattern match destructuring
* Substitution APIs, in conjunction with Regexes.
## Community
With Swift’s mailing lists [moving to
Discourse](https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20171211/042127.html
<https://lists.swift.org/pipermail/swift-evolution/Week-of-Mon-20171211/042127.html>),
this allows the opportunity for better community engagement and
cross-referencing posts. We plan on starting an area under the “Using Swift”
category focusing on idiomatic and/or performant solutions to string
programming needs. These might be anything from simple programming problems to
make-my-parser-faster challenges.
While this provides a community benefit by showing how best to use String, our
ulterior motive is to mine posts looking for benchmarks, API gaps, and
representative code that new features could improve. We’ll also, of course,
want to bikeshed the name! My vote is for “String Dojo”, but there’s probably a
reason I don’t normally name things.
_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev