Re: [swift-evolution] [Pitch] String revision proposal #1

Félix Cloutier via swift-evolution Thu, 30 Mar 2017 09:35:48 -0700

I don't have many non-nitpick issues that I care much about; I'm in favor of
this.


My only request: it's currently painful to create a String from a fixed-size C 
array. For instance, if I have a pointer to a `struct foo { char name[16]; }` 
in Swift where the last character doesn't have to be a NUL, it's hard to create 
a String from it. Real-world examples of this are Mach-O LC_SEGMENT and 
LC_SEGMENT_64 commands.

The generally-accepted wisdom <http://stackoverflow.com/a/27456220/251153> is 
that you take a pointer to the CChar tuple that represents the fixed-size 
array, but this still requires the string to be NUL-terminated. What do we 
think of an additional init(cString:) overload that takes an 
UnsafeBufferPointer and reads up to the first NUL or the end of the buffer, 
whichever comes first?
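
To make the request concrete, here is a rough sketch of what such an overload
could look like. None of this is in the proposal: the initializer, the
four-element tuple, and the `name(of:)` helper are hypothetical, and I'm
assuming the bytes are UTF-8, with invalid sequences repaired.

```swift
extension String {
    /// Hypothetical overload (not part of the proposal): decodes UTF-8 up to
    /// the first NUL or the end of the buffer, whichever comes first.
    init(cString buffer: UnsafeBufferPointer<CChar>) {
        // Keep only the bytes before the first NUL, if there is one.
        let bytes = buffer.prefix(while: { $0 != 0 }).map { UInt8(bitPattern: $0) }
        // String(decoding:as:) repairs ill-formed UTF-8 instead of failing.
        self = String(decoding: bytes, as: UTF8.self)
    }
}

/// Hypothetical helper standing in for a fixed-size `char name[4]` field,
/// which Swift imports as a homogeneous tuple.
func name(of field: (CChar, CChar, CChar, CChar)) -> String {
    return withUnsafeBytes(of: field) { raw in
        String(cString: raw.bindMemory(to: CChar.self))
    }
}

let full = name(of: (97, 98, 99, 100))   // all four bytes used, no NUL
let short = name(of: (97, 98, 0, 100))   // stops at the embedded NUL
print(full, short)
```

With something along these lines, the Mach-O segment name case becomes a
single call that no longer depends on a trailing NUL being present.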

> On 30 March 2017 at 02:48, Brent Royal-Gordon via swift-evolution
> <[email protected]> wrote:
> 
>> On Mar 29, 2017, at 5:32 PM, Ben Cohen via swift-evolution 
>> <[email protected]> wrote:
>> 
>> Hi Swift Evolution,
>> 
>> Below is a pitch for the first part of the String revision. This covers a 
>> number of changes that would allow the basic internals to be overhauled.
>> 
>> Online version here: 
>> https://github.com/airspeedswift/swift-evolution/blob/3a822c799011ace682712532cfabfe32e9203fbb/proposals/0161-StringRevision1.md
> 
> Really great stuff, guys. Thanks for your work on this!
> 
>> In order to be able to write extensions across both String and Substring, a 
>> new Unicode protocol to which the two types will conform will be introduced. 
>> For the purposes of this proposal, Unicode will be defined as a protocol to 
>> be used whenever you would previously extend String. It should be possible to 
>> substitute extension Unicode { ... } in Swift 4 wherever extension String { 
>> ... } was written in Swift 3, with one exception: any passing of self into 
>> an API that takes a concrete String will need to be rewritten as 
>> String(self). If Self is a String then this should effectively optimize to a 
>> no-op, whereas if Self is a Substring then this will force a copy, helping 
>> to avoid the “memory leak” problems described above.
> 
> I continue to feel that `Unicode` is the wrong name for this protocol, 
> essentially because it sounds like a protocol for, say, a version of Unicode 
> or some kind of encoding machinery instead of a Unicode string. I won't 
> rehash that argument since I made it already in the manifesto thread, but I 
> would like to make a couple new suggestions in this area.
> 
> Later on, you note that it would be nice to namespace many of these types:
> 
>> Several of the types related to String, such as the encodings, would ideally 
>> reside inside a namespace rather than live at the top level of the standard 
>> library. The best namespace for this is probably Unicode, but this is also 
>> the name of the protocol. At some point if we gain the ability to nest enums 
>> and types inside protocols, they should be moved there. Putting them inside 
>> String or some other enum namespace is probably not worthwhile in the 
>> mean-time.
> 
> Perhaps we should use an empty enum to create a `Unicode` namespace and then 
> nest the protocol within it via typealias. If we do that, we can consider 
> names like `Unicode.Collection` or even `Unicode.String` which would shadow 
> existing types if they were top-level.
> 
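
For anyone unfamiliar with the trick, the empty-enum namespace suggested above
might look roughly like the sketch below. The names `UnicodeNS` and
`_UnicodeCollection` are made up for illustration, and the protocol's single
constraint is a placeholder, not the proposal's actual requirement list.

```swift
// Hypothetical top-level protocol with a name that clashes with nothing.
protocol _UnicodeCollection: BidirectionalCollection where Element == Character {}

// A caseless enum cannot be instantiated, so it works as a pure namespace.
enum UnicodeNS {
    // Inside the namespace the protocol can reuse a short name such as
    // `Collection` without shadowing the top-level `Collection`.
    typealias Collection = _UnicodeCollection
}

extension String: _UnicodeCollection {}
extension Substring: _UnicodeCollection {}

// Generic code refers to the protocol through its namespaced alias.
func characterCount<S: UnicodeNS.Collection>(_ text: S) -> Int {
    return text.count
}

let greeting = "héllo"
print(characterCount(greeting), characterCount(greeting.dropFirst()))
```

The alias only changes how the protocol is spelled at use sites; conformances
are still declared against the underlying top-level protocol.
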
> If not, then given this:
> 
>> The exact nature of the protocol – such as which methods should be protocol 
>> requirements vs which can be implemented as protocol extensions, are 
>> considered implementation details and so not covered in this proposal.
> 
> We may simply want to wait to choose a name. As the protocol develops, we may 
> discover a theme in its requirements which would suggest a good name. For 
> instance, we may realize that the core of what the protocol abstracts is 
> grouping code units into characters, which might suggest a name like 
> `Characters`, or `Unicode.Characters`, or `CharacterCollection`, or 
> what-have-you.
> 
> (By the way, I hope that the eventual protocol requirements will be put 
> through the review process, if only as an amendment, once they're determined.)
> 
>> Unicode will conform to BidirectionalCollection. RangeReplaceableCollection 
>> conformance will be added directly onto the String and Substring types, as 
>> it is possible future Unicode-conforming types might not be 
>> range-replaceable (e.g. an immutable type that wraps a const char *).
> 
> I'm a little worried about this because it seems to imply that the protocol 
> cannot include any mutation operations that aren't in 
> `RangeReplaceableCollection`. For instance, it won't be possible to include 
> an in-place `applyTransform` method in the protocol. Do you anticipate that 
> being an issue? Might it be a good idea to define a parallel `Mutable` or 
> `RangeReplaceable` protocol?
> 
>> The C string interop methods will be updated to those described here: a 
>> single withCString operation and two init(cString:) constructors, one for 
>> UTF8 and one for arbitrary encodings.
> 
> Sorry if I'm repeating something that was already discussed, but is there a 
> reason you don't include a `withCString` variant for arbitrary encodings? It 
> seems like an odd asymmetry.
> 
>> The standard library currently lacks a Latin1 codec, so an enum Latin1: 
>> UnicodeEncoding type will be added.
> 
> Nice. I wrote one of those once; I'll enjoy deleting it.
> 
>> A new protocol, UnicodeEncoding, will be added to replace the current 
>> UnicodeCodec protocol:
>> 
>> public enum UnicodeParseResult<T, Index> {
> 
> Either `T` should be given a more specific name, or the enum should be given 
> a less specific one, becoming `ParseResult` and being oriented towards 
> incremental parsing of anything from any kind of collection.
> 
>> /// Indicates valid input was recognized.
>> ///
>> /// `resumptionPoint` is the end of the parsed region
>> case valid(T, resumptionPoint: Index)  // FIXME: should these be reordered?
> 
> No, I think this is the right order. The thing that's valid is the code point.
> 
>> /// Indicates invalid input was recognized.
>> ///
>> /// `resumptionPoint` is the next position at which to continue parsing after
>> /// the invalid input is repaired.
>> case error(resumptionPoint: Index)
> 
> I know this is abbreviated documentation, but I hope the full version 
> includes a good usage example demonstrating, among other things, how to 
> detect partial characters and defer processing of them instead of rejecting 
> them as erroneous.
> 
>> /// An encoding for text with UnicodeScalar as a common currency type
>> public protocol UnicodeEncoding {
>>  /// The maximum number of code units in an encoded unicode scalar value
>>  static var maxLengthOfEncodedScalar: Int { get }
>> 
>>  /// A type that can represent a single UnicodeScalar as it is encoded in
>>  /// this encoding.
>>  associatedtype EncodedScalar : EncodedScalarProtocol
> 
> There's an `EncodedScalarProtocol`-shaped hole in this proposal. What does it 
> do? What are its semantics? How does `EncodedScalar` relate to the old 
> `CodeUnit`?
> 
>>  @discardableResult
>>  public static func parseForward<C: Collection>(
>>    _ input: C,
>>    repairingIllFormedSequences makeRepairs: Bool = true,
>>    into output: (EncodedScalar) throws->Void
>>  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
>> 
>>  @discardableResult    
>>  public static func parseReverse<C: BidirectionalCollection>(
>>    _ input: C,
>>    repairingIllFormedSequences makeRepairs: Bool = true,
>>    into output: (EncodedScalar) throws->Void
>>  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
>>  where C.SubSequence : BidirectionalCollection,
>>        C.SubSequence.SubSequence == C.SubSequence,
>>        C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
>> }
> 
> Are there constraints missing on `parseForward`?
> 
> What do these do if `makeRepairs` is false? Would it be clearer if we made an 
> enum that described the behaviors and changed the label to something like 
> `ifIllFormed:`?
> 
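
As an illustration of the enum-instead-of-Bool idea, here is a small sketch.
It does not use the proposed `parseForward`; it leans on the existing
`UnicodeCodec` decoding API as a stand-in, and `IllFormedPolicy`, its cases,
and `decodeUTF8(_:ifIllFormed:)` are hypothetical names. Stopping at the first
error is only one guess at what "no repairs" could mean.

```swift
/// Hypothetical policy enum replacing the `makeRepairs: Bool` parameter.
enum IllFormedPolicy {
    case repair   // substitute U+FFFD and keep going
    case stop     // give up at the first ill-formed sequence
}

/// Decodes UTF-8 bytes into scalars using the existing `UTF8` codec.
func decodeUTF8(
    _ bytes: [UInt8],
    ifIllFormed policy: IllFormedPolicy
) -> (scalars: [Unicode.Scalar], errorCount: Int) {
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    var scalars: [Unicode.Scalar] = []
    var errorCount = 0
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            scalars.append(scalar)
        case .emptyInput:
            return (scalars, errorCount)
        case .error:
            errorCount += 1
            switch policy {
            case .repair:
                scalars.append("\u{FFFD}")
            case .stop:
                return (scalars, errorCount)
            }
        }
    }
}

// 0xFF can never appear in well-formed UTF-8.
let input: [UInt8] = [0x41, 0xFF, 0x42]
let repaired = decodeUTF8(input, ifIllFormed: .repair)
let stopped = decodeUTF8(input, ifIllFormed: .stop)
print(repaired.scalars.count, stopped.scalars.count)
```

At the call site, `ifIllFormed: .stop` states the behavior outright, whereas
`repairingIllFormedSequences: false` leaves the reader to guess what happens
instead.
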
>> Due to the change in internal implementation, this means that these 
>> operations will be O(n) rather than O(1). This is not expected to be a major 
>> concern, based on experiences from a similar change made to Java, but 
>> projects will be able to work around performance issues without upgrading to 
>> Swift 4 by explicitly typing slices as Substring, which will call the Swift 
>> 4 variant, and which will be available but not invoked by default in Swift 3 
>> mode.
> 
> Will there be a way to make this also work with a real Swift 3 compiler? For 
> instance, can you define `typealias Substring = String` in such a way that 
> real Swift 3 will parse and use it, but Swift 4 in Swift 3 mode will ignore 
> it?
> 
>> This proposal does not yet introduce an implicit conversion from Substring 
>> to String. The decision on whether to add this will be deferred pending 
>> feedback on the initial implementation. The intention is to make a preview 
>> toolchain available for feedback, including on whether this implicit 
>> conversion is necessary, prior to the release of Swift 4.
> 
> This is a sensible approach.
> 
> Thank you for developing this into a full proposal. I discussed the plans for 
> Swift 4 with a local group of programmers recently, and everyone was pleased 
> to hear that `String` would get an overhaul, that the `characters` view would 
> be integrated into the string, etc. We even talked a little about `Substring` 
> and people thought it was a good idea. This proposal is shaping up to impact 
> a lot of people, but in a good way!
> 
> -- 
> Brent Royal-Gordon
> Architechies
> 
> _______________________________________________
> swift-evolution mailing list
> [email protected]
> https://lists.swift.org/mailman/listinfo/swift-evolution

