Re: [swift-evolution] [Pitch] String revision proposal #1

Brent Royal-Gordon via swift-evolution Thu, 30 Mar 2017 02:49:37 -0700

> On Mar 29, 2017, at 5:32 PM, Ben Cohen via swift-evolution 
> <[email protected]> wrote:
> 
> Hi Swift Evolution,
> 
> Below is a pitch for the first part of the String revision. This covers a 
> number of changes that would allow the basic internals to be overhauled.
> 
> Online version here: 
> https://github.com/airspeedswift/swift-evolution/blob/3a822c799011ace682712532cfabfe32e9203fbb/proposals/0161-StringRevision1.md


Really great stuff, guys. Thanks for your work on this!

> In order to be able to write extensions across both String and Substring, a 
> new Unicode protocol to which the two types will conform will be introduced. 
> For the purposes of this proposal, Unicode will be defined as a protocol to 
> be used whenever you would previously extend String. It should be possible to 
> substitute extension Unicode { ... } in Swift 4 wherever extension String { 
> ... } was written in Swift 3, with one exception: any passing of self into an 
> API that takes a concrete String will need to be rewritten as String(self). 
> If Self is a String then this should effectively optimize to a no-op, whereas 
> if Self is a Substring then this will force a copy, helping to avoid the 
> “memory leak” problems described above.
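
To make the `String(self)` rewrite concrete, here is a minimal sketch. The
protocol name `UnicodeText` and the helper `countUTF8Bytes(in:)` are
hypothetical stand-ins, not names from the proposal; the only point is that
an extension on the shared protocol must convert `self` before handing it to
an API that takes a concrete `String`.

```swift
// Hypothetical stand-in for the proposed protocol (the pitch calls it `Unicode`).
protocol UnicodeText: BidirectionalCollection where Element == Character {}
extension String: UnicodeText {}
extension Substring: UnicodeText {}

// Hypothetical existing API that only accepts a concrete String.
func countUTF8Bytes(in text: String) -> Int {
    return text.utf8.count
}

extension UnicodeText {
    // `self` may be a String or a Substring here, so it is converted
    // explicitly. For a Substring this makes a copy of just the slice.
    var utf8Length: Int {
        return countUTF8Bytes(in: String(self))
    }
}
```

The same extension then serves both a whole string and a slice of it, for
example `whole.utf8Length` and `whole.prefix(5).utf8Length`.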

I continue to feel that `Unicode` is the wrong name for this protocol, 
essentially because it sounds like a protocol for, say, a version of Unicode or 
some kind of encoding machinery instead of a Unicode string. I won't rehash 
that argument since I made it already in the manifesto thread, but I would like 
to make a couple new suggestions in this area.

Later on, you note that it would be nice to namespace many of these types:

> Several of the types related to String, such as the encodings, would ideally 
> reside inside a namespace rather than live at the top level of the standard 
> library. The best namespace for this is probably Unicode, but this is also 
> the name of the protocol. At some point if we gain the ability to nest enums 
> and types inside protocols, they should be moved there. Putting them inside 
> String or some other enum namespace is probably not worthwhile in the 
> mean-time.

Perhaps we should use an empty enum to create a `Unicode` namespace and then 
nest the protocol within it via typealias. If we do that, we can consider names 
like `Unicode.Collection` or even `Unicode.String` which would shadow existing 
types if they were top-level.
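
A rough sketch of that arrangement follows. Every name in it is an
assumption made for illustration: the namespace is spelled
`UnicodeNamespace` so the example cannot clash with any existing symbol, and
`_UnicodeCollection` stands in for whatever the real protocol would be
called.

```swift
// Hypothetical top-level protocol; the underscored name is an assumption.
protocol _UnicodeCollection: BidirectionalCollection where Element == Character {}

// A caseless enum cannot be instantiated, so it serves purely as a namespace.
enum UnicodeNamespace {
    // The protocol cannot be declared inside the enum, but a typealias can.
    typealias Collection = _UnicodeCollection
}

extension String: _UnicodeCollection {}
extension Substring: _UnicodeCollection {}

// Client code only ever spells the nested name.
func characterCount<S: UnicodeNamespace.Collection>(_ text: S) -> Int {
    return text.count
}
```

Because the nested name is qualified, it does not shadow the top-level
`Collection` protocol.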

If not, then given this:

> The exact nature of the protocol – such as which methods should be protocol 
> requirements vs which can be implemented as protocol extensions, are 
> considered implementation details and so not covered in this proposal.

We may simply want to wait to choose a name. As the protocol develops, we may 
discover a theme in its requirements which would suggest a good name. For 
instance, we may realize that the core of what the protocol abstracts is 
grouping code units into characters, which might suggest a name like 
`Characters`, or `Unicode.Characters`, or `CharacterCollection`, or 
what-have-you.

(By the way, I hope that the eventual protocol requirements will be put through 
the review process, if only as an amendment, once they're determined.)

> Unicode will conform to BidirectionalCollection. RangeReplaceableCollection 
> conformance will be added directly onto the String and Substring types, as it 
> is possible future Unicode-conforming types might not be range-replaceable 
> (e.g. an immutable type that wraps a const char *).

I'm a little worried about this because it seems to imply that the protocol 
cannot include any mutation operations that aren't in 
`RangeReplaceableCollection`. For instance, it won't be possible to include an 
in-place `applyTransform` method in the protocol. Do you anticipate that being 
an issue? Might it be a good idea to define a parallel `Mutable` or 
`RangeReplaceable` protocol?
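
As a sketch of what such a parallel protocol could look like (all names here
are hypothetical, and `transformCharacters` is only a stand-in for something
like an in-place `applyTransform`):

```swift
// Hypothetical read-only protocol, standing in for the proposed `Unicode`.
protocol UnicodeText: BidirectionalCollection where Element == Character {}

// Hypothetical refinement that is allowed to declare mutating requirements.
protocol MutableUnicodeText: UnicodeText, RangeReplaceableCollection {
    mutating func transformCharacters(_ transform: (Character) -> Character)
}

extension MutableUnicodeText {
    // Default implementation: rebuild the text from the mapped characters.
    // A conforming type could override this with a true in-place version.
    mutating func transformCharacters(_ transform: (Character) -> Character) {
        self = Self(self.map(transform))
    }
}

extension String: UnicodeText {}
extension String: MutableUnicodeText {}
```

An immutable wrapper type would conform only to the read-only protocol, so
generic code that needs mutation would constrain on the refinement instead.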

> The C string interop methods will be updated to those described here: a 
> single withCString operation and two init(cString:) constructors, one for 
> UTF8 and one for arbitrary encodings.

Sorry if I'm repeating something that was already discussed, but is there a 
reason you don't include a `withCString` variant for arbitrary encodings? It 
seems like an odd asymmetry.

> The standard library currently lacks a Latin1 codec, so an enum Latin1: 
> UnicodeEncoding type will be added.

Nice. I wrote one of those once; I'll enjoy deleting it.

> A new protocol, UnicodeEncoding, will be added to replace the current 
> UnicodeCodec protocol:
> 
> public enum UnicodeParseResult<T, Index> {

Either `T` should be given a more specific name, or the enum should be given a 
less specific one, becoming `ParseResult` and being oriented towards 
incremental parsing of anything from any kind of collection.
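
To illustrate the second option, here is a sketch of a generic `ParseResult`
driving a parser that has nothing to do with Unicode. The enum keeps only
the two cases quoted from the proposal; `parseDigits` and `parseAll` are
hypothetical helpers invented for this example.

```swift
// Generic version of the quoted enum, with `T` renamed to `Value`.
enum ParseResult<Value, Index> {
    case valid(Value, resumptionPoint: Index)
    case error(resumptionPoint: Index)
}

// Hypothetical parser: reads one run of ASCII digits starting at `start`.
func parseDigits(_ input: [UInt8], from start: Int) -> ParseResult<Int, Int> {
    var position = start
    var value = 0
    while position < input.count, input[position] >= 48, input[position] <= 57 {
        value = value * 10 + Int(input[position] - 48)
        position += 1
    }
    if position == start {
        // No digit here: report an error and resume after the offending byte.
        return .error(resumptionPoint: min(start + 1, input.count))
    }
    return .valid(value, resumptionPoint: position)
}

// Hypothetical driver: keeps parsing from each resumption point.
func parseAll(_ input: [UInt8]) -> (values: [Int], errors: Int) {
    var values: [Int] = []
    var errors = 0
    var position = 0
    while position < input.count {
        switch parseDigits(input, from: position) {
        case .valid(let value, resumptionPoint: let next):
            values.append(value)
            position = next
        case .error(resumptionPoint: let next):
            errors += 1
            position = next
        }
    }
    return (values, errors)
}
```

Nothing in the enum itself depends on the parsed value being a code point,
which is the argument for the less specific name.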

> /// Indicates valid input was recognized.
> ///
> /// `resumptionPoint` is the end of the parsed region
> case valid(T, resumptionPoint: Index)  // FIXME: should these be reordered?

No, I think this is the right order. The thing that's valid is the code point.

> /// Indicates invalid input was recognized.
> ///
> /// `resumptionPoint` is the next position at which to continue parsing after
> /// the invalid input is repaired.
> case error(resumptionPoint: Index)

I know this is abbreviated documentation, but I hope the full version includes 
a good usage example demonstrating, among other things, how to detect partial 
characters and defer processing of them instead of rejecting them as erroneous.

> /// An encoding for text with UnicodeScalar as a common currency type
> public protocol UnicodeEncoding {
>   /// The maximum number of code units in an encoded unicode scalar value
>   static var maxLengthOfEncodedScalar: Int { get }
>   
>   /// A type that can represent a single UnicodeScalar as it is encoded in
>   /// this encoding.
>   associatedtype EncodedScalar : EncodedScalarProtocol

There's an `EncodedScalarProtocol`-shaped hole in this proposal. What does it 
do? What are its semantics? How does `EncodedScalar` relate to the old 
`CodeUnit`?

>   @discardableResult
>   public static func parseForward<C: Collection>(
>     _ input: C,
>     repairingIllFormedSequences makeRepairs: Bool = true,
>     into output: (EncodedScalar) throws->Void
>   ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
>   
>   @discardableResult    
>   public static func parseReverse<C: BidirectionalCollection>(
>     _ input: C,
>     repairingIllFormedSequences makeRepairs: Bool = true,
>     into output: (EncodedScalar) throws->Void
>   ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
>   where C.SubSequence : BidirectionalCollection,
>         C.SubSequence.SubSequence == C.SubSequence,
>         C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
> }

Are there constraints missing on `parseForward`?

What do these do if `makeRepairs` is false? Would it be clearer if we made an 
enum that described the behaviors and changed the label to something like 
`ifIllFormed:`?
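
A sketch of how that might read at the call site. The enum
`IllFormedPolicy`, its two cases, and the `decodeUTF8` function are all
hypothetical; the decoding itself uses the existing `UTF8` codec's `decode`
method rather than the proposed `parseForward`.

```swift
// Hypothetical policy enum replacing the `makeRepairs` Bool.
enum IllFormedPolicy {
    case repair  // substitute U+FFFD and keep going
    case stop    // stop at the first ill-formed sequence
}

// Hypothetical helper built on the current UnicodeCodec API.
func decodeUTF8(_ bytes: [UInt8],
                ifIllFormed policy: IllFormedPolicy) -> (text: String, errorCount: Int) {
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    var text = ""
    var errorCount = 0
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            text.unicodeScalars.append(scalar)
        case .emptyInput:
            return (text, errorCount)
        case .error:
            errorCount += 1
            switch policy {
            case .repair:
                text.unicodeScalars.append("\u{FFFD}")
            case .stop:
                return (text, errorCount)
            }
        }
    }
}
```

A call such as `decodeUTF8(bytes, ifIllFormed: .stop)` states the behavior
directly, where `repairingIllFormedSequences: false` leaves the reader to
guess what happens instead of a repair.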

> Due to the change in internal implementation, this means that these 
> operations will be O(n) rather than O(1). This is not expected to be a major 
> concern, based on experiences from a similar change made to Java, but 
> projects will be able to work around performance issues without upgrading to 
> Swift 4 by explicitly typing slices as Substring, which will call the Swift 4 
> variant, and which will be available but not invoked by default in Swift 3 
> mode.

Will there be a way to make this also work with a real Swift 3 compiler? For 
instance, can you define `typealias Substring = String` in such a way that real 
Swift 3 will parse and use it, but Swift 4 in Swift 3 mode will ignore it?
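
One possible shape for this, as a sketch only: it assumes the Swift 4
compiler in Swift 3 mode reports a language version that `#if swift(...)`
can tell apart from a real Swift 3 compiler. The `3.2` threshold below is
that assumption, not something the proposal states.

```swift
// Only a compiler older than the assumed threshold sees this alias; there,
// slicing a String yields a String, so the alias makes the annotation valid.
#if !swift(>=3.2)
typealias Substring = String
#endif

let text = "hello"
let start = text.index(after: text.startIndex)

// Explicitly typing the slice as Substring, as the proposal suggests.
let tail: Substring = text[start..<text.endIndex]
```

Under that assumption the same source line would compile everywhere, and
would pick up the cheaper slicing wherever a real `Substring` exists.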

> This proposal does not yet introduce an implicit conversion from Substring to 
> String. The decision on whether to add this will be deferred pending feedback 
> on the initial implementation. The intention is to make a preview toolchain 
> available for feedback, including on whether this implicit conversion is 
> necessary, prior to the release of Swift 4.

This is a sensible approach.

Thank you for developing this into a full proposal. I discussed the plans for 
Swift 4 with a local group of programmers recently, and everyone was pleased to 
hear that `String` would get an overhaul, that the `characters` view would be 
integrated into the string, etc. We even talked a little about `Substring` and 
people thought it was a good idea. This proposal is shaping up to impact a lot 
of people, but in a good way!

-- 
Brent Royal-Gordon
Architechies

_______________________________________________
swift-evolution mailing list
[email protected]
https://lists.swift.org/mailman/listinfo/swift-evolution
