> On Mar 29, 2017, at 5:32 PM, Ben Cohen via swift-evolution
> <[email protected]> wrote:
>
> Hi Swift Evolution,
>
> Below is a pitch for the first part of the String revision. This covers a
> number of changes that would allow the basic internals to be overhauled.
>
> Online version here:
> https://github.com/airspeedswift/swift-evolution/blob/3a822c799011ace682712532cfabfe32e9203fbb/proposals/0161-StringRevision1.md
Really great stuff, guys. Thanks for your work on this!
> In order to be able to write extensions across both String and Substring, a
> new Unicode protocol to which the two types will conform will be introduced.
> For the purposes of this proposal, Unicode will be defined as a protocol to
> be used whenever you would previously extend String. It should be possible to
> substitute extension Unicode { ... } in Swift 4 wherever extension String {
> ... } was written in Swift 3, with one exception: any passing of self into an
> API that takes a concrete String will need to be rewritten as String(self).
> If Self is a String then this should effectively optimize to a no-op, whereas
> if Self is a Substring then this will force a copy, helping to avoid the
> “memory leak” problems described above.
I continue to feel that `Unicode` is the wrong name for this protocol,
essentially because it sounds like a protocol for, say, a version of Unicode or
some kind of encoding machinery instead of a Unicode string. I won't rehash
that argument since I made it already in the manifesto thread, but I would like
to make a couple of new suggestions in this area.
Later on, you note that it would be nice to namespace many of these types:
> Several of the types related to String, such as the encodings, would ideally
> reside inside a namespace rather than live at the top level of the standard
> library. The best namespace for this is probably Unicode, but this is also
> the name of the protocol. At some point if we gain the ability to nest enums
> and types inside protocols, they should be moved there. Putting them inside
> String or some other enum namespace is probably not worthwhile in the
> mean-time.
Perhaps we should use an empty enum to create a `Unicode` namespace and then
nest the protocol within it via typealias. If we do that, we can consider names
like `Unicode.Collection` or even `Unicode.String` which would shadow existing
types if they were top-level.
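As a rough sketch of that idea (every name here is hypothetical, and the namespace is called `UnicodeNamespace` only so the example cannot clash with anything else named `Unicode`), the protocol lives at the top level under an underscored name and is exposed through a typealias inside an empty enum:

```swift
// Hypothetical top-level protocol; the underscored name is an assumption.
protocol _UnicodeCollection: BidirectionalCollection where Element == Character {}

// An empty enum has no cases, so it can only act as a namespace.
enum UnicodeNamespace {
    // Exposes the protocol under a nested name that would otherwise
    // shadow the top-level `Collection`.
    typealias Collection = _UnicodeCollection
}

extension String: _UnicodeCollection {}
extension Substring: _UnicodeCollection {}

// Generic code names the protocol through the namespace.
func firstWord<S: UnicodeNamespace.Collection>(of text: S) -> String {
    // Copies the prefix into a fresh String, whether S is String or Substring.
    return String(text.prefix(while: { $0 != " " }))
}
```

Both `firstWord(of: "hello world")` and `firstWord(of: "hello world".dropFirst(2))` go through the same generic function, which is the kind of code the proposal wants to make easy to write.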
If not, then given this:
> The exact nature of the protocol – such as which methods should be protocol
> requirements vs which can be implemented as protocol extensions, are
> considered implementation details and so not covered in this proposal.
We may simply want to wait to choose a name. As the protocol develops, we may
discover a theme in its requirements which would suggest a good name. For
instance, we may realize that the core of what the protocol abstracts is
grouping code units into characters, which might suggest a name like
`Characters`, or `Unicode.Characters`, or `CharacterCollection`, or
what-have-you.
(By the way, I hope that the eventual protocol requirements will be put through
the review process, if only as an amendment, once they're determined.)
> Unicode will conform to BidirectionalCollection. RangeReplaceableCollection
> conformance will be added directly onto the String and Substring types, as it
> is possible future Unicode-conforming types might not be range-replaceable
> (e.g. an immutable type that wraps a const char *).
I'm a little worried about this because it seems to imply that the protocol
cannot include any mutation operations that aren't in
`RangeReplaceableCollection`. For instance, it won't be possible to include an
in-place `applyTransform` method in the protocol. Do you anticipate that being
an issue? Might it be a good idea to define a parallel `Mutable` or
`RangeReplaceable` protocol?
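To make the question concrete, here is a minimal sketch of what a parallel mutable protocol might look like. The protocol names and the closure-based `applyTransform` signature are assumptions made for illustration, not anything the proposal specifies:

```swift
// Hypothetical read-only protocol standing in for the proposed `Unicode`.
protocol ReadableText: BidirectionalCollection where Element == Character {}

// Hypothetical refinement that carries the in-place operations.
protocol MutableText: ReadableText {
    // Assumed signature: rewrites every character in place.
    mutating func applyTransform(_ transform: (Character) -> Character)
}

extension String: MutableText {
    mutating func applyTransform(_ transform: (Character) -> Character) {
        let mapped: [Character] = self.map(transform)
        self = String(mapped)
    }
}

// Generic code can now mutate without knowing the concrete type.
func replaceDashes<T: MutableText>(in text: inout T) {
    text.applyTransform { $0 == "-" ? " " : $0 }
}
```

An immutable conforming type would adopt only `ReadableText`, so it would never be asked to implement the mutating requirement.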
> The C string interop methods will be updated to those described here: a
> single withCString operation and two init(cString:) constructors, one for
> UTF8 and one for arbitrary encodings.
Sorry if I'm repeating something that was already discussed, but is there a
reason you don't include a `withCString` variant for arbitrary encodings? It
seems like an odd asymmetry.
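For illustration only, this is roughly the shape such a variant could take for one fixed encoding. The method name `withNullTerminatedUTF16` is made up, and the sketch hard-codes UTF-16 rather than being generic over the proposed `UnicodeEncoding`:

```swift
extension String {
    // Hypothetical helper: calls `body` with a pointer to a temporary,
    // null-terminated UTF-16 copy of the string. The pointer is only
    // valid for the duration of the call.
    func withNullTerminatedUTF16<R>(
        _ body: (UnsafePointer<UInt16>) throws -> R
    ) rethrows -> R {
        var units = Array(self.utf16)
        units.append(0) // terminator, so the buffer is never empty
        return try units.withUnsafeBufferPointer { buffer in
            try body(buffer.baseAddress!)
        }
    }
}
```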
> The standard library currently lacks a Latin1 codec, so an enum Latin1:
> UnicodeEncoding type will be added.
Nice. I wrote one of those once; I'll enjoy deleting it.
> A new protocol, UnicodeEncoding, will be added to replace the current
> UnicodeCodec protocol:
>
> public enum UnicodeParseResult<T, Index> {
Either `T` should be given a more specific name, or the enum should be given a
less specific one, becoming `ParseResult` and being oriented towards
incremental parsing of anything from any kind of collection.
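A small sketch of the second option, with the enum renamed and used for something that has nothing to do with Unicode. The `Output` parameter name, `parseDigit`, and `digits(in:)` are all hypothetical, and only the two cases quoted in this message are modelled:

```swift
// Generalized form of the proposed enum; `Output` replaces `T`.
enum ParseResult<Output, Index> {
    case valid(Output, resumptionPoint: Index)
    case error(resumptionPoint: Index)
}

// Toy parser: reads one decimal digit, or reports an error for anything else.
// Returns nil when there is no input left.
func parseDigit<C: Collection>(
    _ input: C, from start: C.Index
) -> ParseResult<Int, C.Index>? where C.Element == Character {
    guard start != input.endIndex else { return nil }
    let next = input.index(after: start)
    if let value = Int(String(input[start])) {
        return .valid(value, resumptionPoint: next)
    }
    return .error(resumptionPoint: next)
}

// Driver: keeps parsing from each resumption point and counts the errors.
func digits(in text: String) -> (values: [Int], errorCount: Int) {
    var values: [Int] = []
    var errorCount = 0
    var position = text.startIndex
    while let result = parseDigit(text, from: position) {
        switch result {
        case let .valid(value, resumptionPoint: next):
            values.append(value)
            position = next
        case let .error(resumptionPoint: next):
            errorCount += 1
            position = next
        }
    }
    return (values, errorCount)
}
```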
> /// Indicates valid input was recognized.
> ///
> /// `resumptionPoint` is the end of the parsed region
> case valid(T, resumptionPoint: Index) // FIXME: should these be reordered?
No, I think this is the right order. The thing that's valid is the code point.
> /// Indicates invalid input was recognized.
> ///
> /// `resumptionPoint` is the next position at which to continue parsing after
> /// the invalid input is repaired.
> case error(resumptionPoint: Index)
I know this is abbreviated documentation, but I hope the full version includes
a good usage example demonstrating, among other things, how to detect partial
characters and defer processing of them instead of rejecting them as erroneous.
> /// An encoding for text with UnicodeScalar as a common currency type
> public protocol UnicodeEncoding {
> /// The maximum number of code units in an encoded unicode scalar value
> static var maxLengthOfEncodedScalar: Int { get }
>
> /// A type that can represent a single UnicodeScalar as it is encoded in
> /// this encoding.
> associatedtype EncodedScalar : EncodedScalarProtocol
There's an `EncodedScalarProtocol`-shaped hole in this proposal. What does it
do? What are its semantics? How does `EncodedScalar` relate to the old
`CodeUnit`?
> @discardableResult
> public static func parseForward<C: Collection>(
> _ input: C,
> repairingIllFormedSequences makeRepairs: Bool = true,
> into output: (EncodedScalar) throws->Void
> ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
>
> @discardableResult
> public static func parseReverse<C: BidirectionalCollection>(
> _ input: C,
> repairingIllFormedSequences makeRepairs: Bool = true,
> into output: (EncodedScalar) throws->Void
> ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
> where C.SubSequence : BidirectionalCollection,
> C.SubSequence.SubSequence == C.SubSequence,
> C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
> }
Are there constraints missing on `parseForward`?
What do these do if `makeRepairs` is false? Would it be clearer if we made an
enum that described the behaviors and changed the label to something like
`ifIllFormed:`?
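Something along these lines, perhaps. The enum, its case names, and `decodeUTF8` are invented for the example. It is built on the existing `UTF8` codec rather than the proposed protocol, and it assumes that "no repairs" means stopping at the first ill-formed sequence, which is exactly the behavior the proposal leaves unstated:

```swift
// Hypothetical replacement for the `makeRepairs: Bool` parameter.
enum IllFormedPolicy {
    case repair // substitute U+FFFD and keep going
    case stop   // assumed meaning of `makeRepairs == false`
}

func decodeUTF8(
    _ bytes: [UInt8],
    ifIllFormed policy: IllFormedPolicy = .repair
) -> (text: String, errorCount: Int) {
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    var text = ""
    var errorCount = 0
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            text.unicodeScalars.append(scalar)
        case .emptyInput:
            return (text, errorCount)
        case .error:
            errorCount += 1
            if policy == .stop { return (text, errorCount) }
            text.unicodeScalars.append("\u{FFFD}")
        }
    }
}
```

At the call site, `decodeUTF8(bytes, ifIllFormed: .stop)` says what will happen, where `repairingIllFormedSequences: false` only says what will not.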
> Due to the change in internal implementation, this means that these
> operations will be O(n) rather than O(1). This is not expected to be a major
> concern, based on experiences from a similar change made to Java, but
> projects will be able to work around performance issues without upgrading to
> Swift 4 by explicitly typing slices as Substring, which will call the Swift 4
> variant, and which will be available but not invoked by default in Swift 3
> mode.
Will there be a way to make this also work with a real Swift 3 compiler? For
instance, can you define `typealias Substring = String` in such a way that real
Swift 3 will parse and use it, but Swift 4 in Swift 3 mode will ignore it?
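One possible shape, assuming the Swift 4 compiler reports a language version of at least 3.2 when running in Swift 3 mode, so that a version check can tell it apart from a real Swift 3 compiler. The `describe` function is just a placeholder for code that wants to name `Substring`:

```swift
// Only a real Swift 3 compiler (language version below 3.2, by assumption)
// sees this alias; newer compilers skip the block and use the real Substring.
#if !swift(>=3.2)
typealias Substring = String
#endif

// Code written against `Substring` then compiles under either compiler.
func describe(_ slice: Substring) -> String {
    return "slice: \(slice)"
}
```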
> This proposal does not yet introduce an implicit conversion from Substring to
> String. The decision on whether to add this will be deferred pending feedback
> on the initial implementation. The intention is to make a preview toolchain
> available for feedback, including on whether this implicit conversion is
> necessary, prior to the release of Swift 4.
This is a sensible approach.
Thank you for developing this into a full proposal. I discussed the plans for
Swift 4 with a local group of programmers recently, and everyone was pleased to
hear that `String` would get an overhaul, that the `characters` view would be
integrated into the string, etc. We even talked a little about `Substring` and
people thought it was a good idea. This proposal is shaping up to impact a lot
of people, but in a good way!
--
Brent Royal-Gordon
Architechies