Re: [swift-evolution] [Pitch] String revision proposal #1

Ben Cohen via swift-evolution Fri, 31 Mar 2017 08:05:56 -0700

You could even argue that what we need is a Collection wrapper that turns a 
pointer + a terminating sigil into a Collection… but from-C-string-creation is 
such a common operation that it deserves a dedicated shorthand. 
Non-null-terminated creation probably doesn’t.
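
As a rough sketch of why the pointer + length case needs no dedicated C-interop
shorthand: the block below wraps the pointer in an UnsafeBufferPointer and hands
it to a Collection-based initializer. It assumes that initializer is spelled
String(decoding:as:), the form that shipped in Swift 4 and later. The helper
name makeString(utf8:count:) is hypothetical and not part of the proposal.

```swift
// A minimal sketch, not the proposal's API. It assumes a String initializer
// that decodes any Collection of code units, spelled String(decoding:as:).

/// Hypothetical helper: builds a String from a pointer plus a length.
/// No NUL terminator is required or looked for.
func makeString(utf8 start: UnsafePointer<UInt8>, count: Int) -> String {
    // UnsafeBufferPointer is a full Collection over the `count` code units.
    let buffer = UnsafeBufferPointer(start: start, count: count)
    // Ill-formed UTF-8 is repaired with U+FFFD rather than rejected.
    return String(decoding: buffer, as: UTF8.self)
}

// Usage: the UTF-8 bytes of "hello, world" with no trailing NUL.
// Only the first 5 code units are decoded.
let bytes: [UInt8] = Array("hello, world".utf8)
let greeting = bytes.withUnsafeBufferPointer { buf in
    makeString(utf8: buf.baseAddress!, count: 5)
}
print(greeting)  // prints "hello"
```

The String copies the code units it decodes, so the buffer only has to stay
valid for the duration of the call.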


> On Mar 31, 2017, at 8:03 AM, Ben Cohen <[email protected]> wrote:
> 
> 
> When you have a pointer and a length, you can create a fully functional 
> Collection using UnsafeBufferPointer. This means you don't need something 
> that’s C interop-specific any more – just the ability to create a String from 
> a Collection of code units of some encoding.
> 
> We’ll add something to the proposal making it clear this will be possible.
> 
>> On Mar 31, 2017, at 4:01 AM, Jean-Daniel via swift-evolution 
>> <[email protected]> wrote:
>> 
>> I’m with you on a C interop API that supports taking a non-null-terminated 
>> string. I often work with APIs that support efficient parsing by providing a 
>> pointer to a global buffer + length to report parsed strings.
>> 
>> Without a way to create a Swift string from buffer + length, interop with 
>> such APIs will be difficult for no good reason, as Swift strings don’t even 
>> have to be null terminated.
>> 
>>> On Mar 30, 2017, at 6:35 PM, Félix Cloutier via swift-evolution 
>>> <[email protected]> wrote:
>>> 
>>> I don't have many non-nitpick issues that I greatly care about; I'm in 
>>> favor of this.
>>> 
>>> My only request: it's currently painful to create a String from a 
>>> fixed-size C array. For instance, if I have a pointer to a `struct foo { 
>>> char name[16]; }` in Swift where the last character doesn't have to be a 
>>> NUL, it's hard to create a String from it. Real-world examples of this are 
>>> Mach-O LC_SEGMENT and LC_SEGMENT_64 commands.
>>> 
>>> The generally-accepted wisdom <http://stackoverflow.com/a/27456220/251153> 
>>> is that you take a pointer to the CChar tuple that represents the 
>>> fixed-size array, but this still requires the string to be NUL-terminated. 
>>> What do we think of an additional init(cString:) overload that takes an 
>>> UnsafeBufferPointer and reads up to the first NUL or the end of the buffer, 
>>> whichever comes first?
>>> 
>>>> On Mar 30, 2017, at 2:48 AM, Brent Royal-Gordon via swift-evolution 
>>>> <[email protected]> wrote:
>>>> 
>>>>> On Mar 29, 2017, at 5:32 PM, Ben Cohen via swift-evolution 
>>>>> <[email protected]> wrote:
>>>>> 
>>>>> Hi Swift Evolution,
>>>>> 
>>>>> Below is a pitch for the first part of the String revision. This covers a 
>>>>> number of changes that would allow the basic internals to be overhauled.
>>>>> 
>>>>> Online version here: 
>>>>> https://github.com/airspeedswift/swift-evolution/blob/3a822c799011ace682712532cfabfe32e9203fbb/proposals/0161-StringRevision1.md
>>>> 
>>>> Really great stuff, guys. Thanks for your work on this!
>>>> 
>>>>> In order to be able to write extensions across both String and 
>>>>> Substring, a new Unicode protocol to which the two types will conform 
>>>>> will be introduced. For the purposes of this proposal, Unicode will be 
>>>>> defined as a protocol to be used whenever you would previously extend 
>>>>> String. It should be possible to substitute extension Unicode { ... } in 
>>>>> Swift 4 wherever extension String { ... } was written in Swift 3, with 
>>>>> one exception: any passing of self into an API that takes a concrete 
>>>>> String will need to be rewritten as String(self). If Self is a String 
>>>>> then this should effectively optimize to a no-op, whereas if Self is a 
>>>>> Substring then this will force a copy, helping to avoid the “memory leak” 
>>>>> problems described above.
>>>> 
>>>> I continue to feel that `Unicode` is the wrong name for this protocol, 
>>>> essentially because it sounds like a protocol for, say, a version of 
>>>> Unicode or some kind of encoding machinery instead of a Unicode string. I 
>>>> won't rehash that argument since I made it already in the manifesto 
>>>> thread, but I would like to make a couple new suggestions in this area.
>>>> 
>>>> Later on, you note that it would be nice to namespace many of these types:
>>>> 
>>>>> Several of the types related to String, such as the encodings, would 
>>>>> ideally reside inside a namespace rather than live at the top level of 
>>>>> the standard library. The best namespace for this is probably Unicode, 
>>>>> but this is also the name of the protocol. At some point if we gain the 
>>>>> ability to nest enums and types inside protocols, they should be moved 
>>>>> there. Putting them inside String or some other enum namespace is 
>>>>> probably not worthwhile in the mean-time.
>>>> 
>>>> Perhaps we should use an empty enum to create a `Unicode` namespace and 
>>>> then nest the protocol within it via typealias. If we do that, we can 
>>>> consider names like `Unicode.Collection` or even `Unicode.String` which 
>>>> would shadow existing types if they were top-level.
>>>> 
>>>> If not, then given this:
>>>> 
>>>>> The exact nature of the protocol – such as which methods should be 
>>>>> protocol requirements vs which can be implemented as protocol extensions – 
>>>>> is considered an implementation detail and so not covered in this proposal.
>>>> 
>>>> We may simply want to wait to choose a name. As the protocol develops, we 
>>>> may discover a theme in its requirements which would suggest a good name. 
>>>> For instance, we may realize that the core of what the protocol abstracts 
>>>> is grouping code units into characters, which might suggest a name like 
>>>> `Characters`, or `Unicode.Characters`, or `CharacterCollection`, or 
>>>> what-have-you.
>>>> 
>>>> (By the way, I hope that the eventual protocol requirements will be put 
>>>> through the review process, if only as an amendment, once they're 
>>>> determined.)
>>>> 
>>>>> Unicode will conform to BidirectionalCollection. 
>>>>> RangeReplaceableCollection conformance will be added directly onto the 
>>>>> String and Substring types, as it is possible future Unicode-conforming 
>>>>> types might not be range-replaceable (e.g. an immutable type that wraps a 
>>>>> const char *).
>>>> 
>>>> I'm a little worried about this because it seems to imply that the 
>>>> protocol cannot include any mutation operations that aren't in 
>>>> `RangeReplaceableCollection`. For instance, it won't be possible to 
>>>> include an in-place `applyTransform` method in the protocol. Do you 
>>>> anticipate that being an issue? Might it be a good idea to define a 
>>>> parallel `Mutable` or `RangeReplaceable` protocol?
>>>> 
>>>>> The C string interop methods will be updated to those described here: a 
>>>>> single withCString operation and two init(cString:) constructors, one for 
>>>>> UTF8 and one for arbitrary encodings.
>>>> 
>>>> Sorry if I'm repeating something that was already discussed, but is there 
>>>> a reason you don't include a `withCString` variant for arbitrary 
>>>> encodings? It seems like an odd asymmetry.
>>>> 
>>>>> The standard library currently lacks a Latin1 codec, so an enum Latin1: 
>>>>> UnicodeEncoding type will be added.
>>>> 
>>>> Nice. I wrote one of those once; I'll enjoy deleting it.
>>>> 
>>>>> A new protocol, UnicodeEncoding, will be added to replace the current 
>>>>> UnicodeCodec protocol:
>>>>> 
>>>>> public enum UnicodeParseResult<T, Index> {
>>>> 
>>>> Either `T` should be given a more specific name, or the enum should be 
>>>> given a less specific one, becoming `ParseResult` and being oriented 
>>>> towards incremental parsing of anything from any kind of collection.
>>>> 
>>>>> /// Indicates valid input was recognized.
>>>>> ///
>>>>> /// `resumptionPoint` is the end of the parsed region
>>>>> case valid(T, resumptionPoint: Index)  // FIXME: should these be reordered?
>>>> 
>>>> No, I think this is the right order. The thing that's valid is the code 
>>>> point.
>>>> 
>>>>> /// Indicates invalid input was recognized.
>>>>> ///
>>>>> /// `resumptionPoint` is the next position at which to continue parsing
>>>>> /// after the invalid input is repaired.
>>>>> case error(resumptionPoint: Index)
>>>> 
>>>> I know this is abbreviated documentation, but I hope the full version 
>>>> includes a good usage example demonstrating, among other things, how to 
>>>> detect partial characters and defer processing of them instead of 
>>>> rejecting them as erroneous.
>>>> 
>>>>> /// An encoding for text with UnicodeScalar as a common currency type
>>>>> public protocol UnicodeEncoding {
>>>>>  /// The maximum number of code units in an encoded unicode scalar value
>>>>>  static var maxLengthOfEncodedScalar: Int { get }
>>>>> 
>>>>>  /// A type that can represent a single UnicodeScalar as it is encoded in
>>>>>  /// this encoding.
>>>>>  associatedtype EncodedScalar : EncodedScalarProtocol
>>>> 
>>>> There's an `EncodedScalarProtocol`-shaped hole in this proposal. What does 
>>>> it do? What are its semantics? How does `EncodedScalar` relate to the old 
>>>> `CodeUnit`?
>>>> 
>>>>>  @discardableResult
>>>>>  public static func parseForward<C: Collection>(
>>>>>    _ input: C,
>>>>>    repairingIllFormedSequences makeRepairs: Bool = true,
>>>>>    into output: (EncodedScalar) throws->Void
>>>>>  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
>>>>> 
>>>>>  @discardableResult    
>>>>>  public static func parseReverse<C: BidirectionalCollection>(
>>>>>    _ input: C,
>>>>>    repairingIllFormedSequences makeRepairs: Bool = true,
>>>>>    into output: (EncodedScalar) throws->Void
>>>>>  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
>>>>>  where C.SubSequence : BidirectionalCollection,
>>>>>        C.SubSequence.SubSequence == C.SubSequence,
>>>>>        C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
>>>>> }
>>>> 
>>>> Are there constraints missing on `parseForward`?
>>>> 
>>>> What do these do if `makeRepairs` is false? Would it be clearer if we made 
>>>> an enum that described the behaviors and changed the label to something 
>>>> like `ifIllFormed:`?
>>>> 
>>>>> Due to the change in internal implementation, this means that these 
>>>>> operations will be O(n) rather than O(1). This is not expected to be a 
>>>>> major concern, based on experiences from a similar change made to Java, 
>>>>> but projects will be able to work around performance issues without 
>>>>> upgrading to Swift 4 by explicitly typing slices as Substring, which will 
>>>>> call the Swift 4 variant, and which will be available but not invoked by 
>>>>> default in Swift 3 mode.
>>>> 
>>>> Will there be a way to make this also work with a real Swift 3 compiler? 
>>>> For instance, can you define `typealias Substring = String` in such a way 
>>>> that real Swift 3 will parse and use it, but Swift 4 in Swift 3 mode will 
>>>> ignore it?
>>>> 
>>>>> This proposal does not yet introduce an implicit conversion from 
>>>>> Substring to String. The decision on whether to add this will be deferred 
>>>>> pending feedback on the initial implementation. The intention is to make 
>>>>> a preview toolchain available for feedback, including on whether this 
>>>>> implicit conversion is necessary, prior to the release of Swift 4.
>>>> 
>>>> This is a sensible approach.
>>>> 
>>>> Thank you for developing this into a full proposal. I discussed the plans 
>>>> for Swift 4 with a local group of programmers recently, and everyone was 
>>>> pleased to hear that `String` would get an overhaul, that the `characters` 
>>>> view would be integrated into the string, etc. We even talked a little 
>>>> about `Substring` and people thought it was a good idea. This proposal is 
>>>> shaping up to impact a lot of people, but in a good way!
>>>> 
>>>> -- 
>>>> Brent Royal-Gordon
>>>> Architechies
>>>> 
>>>> _______________________________________________
>>>> swift-evolution mailing list
>>>> [email protected] <mailto:[email protected]>
>>>> https://lists.swift.org/mailman/listinfo/swift-evolution 
>>>> <https://lists.swift.org/mailman/listinfo/swift-evolution>
>>> 
>> 
>
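
For the fixed-size C array case raised in the quoted thread (a `char name[16]`
whose last character need not be a NUL), the same Collection-based approach can
be sketched as follows. Everything named here is illustrative: `Segment` stands
in for an imported C struct, `stringUpToNUL` and `makeSegment` are hypothetical
helpers, and String(decoding:as:) is assumed to be the Collection-based
initializer (the spelling that shipped in Swift 4 and later). The code also
assumes the bytes are UTF-8.

```swift
// Sketch only. Swift imports `char name[16]` as a homogeneous 16-element
// tuple; `Segment` is a stand-in for such an imported struct.
typealias CName16 = (CChar, CChar, CChar, CChar, CChar, CChar, CChar, CChar,
                     CChar, CChar, CChar, CChar, CChar, CChar, CChar, CChar)

struct Segment {
    var name: CName16 = (0, 0, 0, 0, 0, 0, 0, 0,
                         0, 0, 0, 0, 0, 0, 0, 0)
}

/// Hypothetical helper: decodes UTF-8 up to the first NUL or the end of the
/// buffer, whichever comes first.
func stringUpToNUL(_ buffer: UnsafeRawBufferPointer) -> String {
    let end = buffer.firstIndex(of: 0) ?? buffer.endIndex
    return String(decoding: buffer[..<end], as: UTF8.self)
}

/// Hypothetical helper for the demo: fills the fixed-size field from a
/// Swift string, truncating to the field width and leaving the rest NUL.
func makeSegment(named name: String) -> Segment {
    var segment = Segment()
    withUnsafeMutableBytes(of: &segment.name) { raw in
        raw.copyBytes(from: name.utf8.prefix(raw.count))
    }
    return segment
}

let short = makeSegment(named: "__TEXT")           // NUL-padded
let full = makeSegment(named: "0123456789abcdef")  // 16 bytes, no NUL at all

let shortName = withUnsafeBytes(of: short.name) { stringUpToNUL($0) }
let fullName = withUnsafeBytes(of: full.name) { stringUpToNUL($0) }
print(shortName)  // prints "__TEXT"
print(fullName)   // prints "0123456789abcdef"
```

Because the scan is bounded by the buffer's own length, the unterminated
16-byte name is read safely, which is the case a NUL-terminated
init(cString:) cannot handle.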

