Re: [swift-evolution] [Pitch] String revision proposal #1

Ben Cohen via swift-evolution Fri, 31 Mar 2017 08:03:25 -0700

When you have a pointer and a length, you can create a fully functional 
Collection using UnsafeBufferPointer. This means you no longer need something 
that’s C interop-specific – just the ability to create a String from a 
Collection of code units of some encoding.


We’ll add something to the proposal making it clear this will be possible.
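
A minimal sketch of that idea, assuming the `String(decoding:as:)` initializer that later Swift releases provide (the byte values and variable names here are made up for illustration):

```swift
// Hypothetical input: UTF-8 bytes with an explicit length and no trailing NUL.
let bytes: [UInt8] = [0x53, 0x77, 0x69, 0x66, 0x74]  // "Swift"

let text = bytes.withUnsafeBufferPointer { storage -> String in
    // Pretend a C API handed us just these two values.
    let pointer = storage.baseAddress
    let length = storage.count

    // Pointer + length become a Collection of UTF-8 code units.
    let buffer = UnsafeBufferPointer(start: pointer, count: length)
    return String(decoding: buffer, as: UTF8.self)
}
print(text)
```

Invalid UTF-8 in the buffer is repaired with U+FFFD rather than rejected, so this may not suit every caller.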

> On Mar 31, 2017, at 4:01 AM, Jean-Daniel via swift-evolution 
> <[email protected]> wrote:
> 
> I’m with you on a C interop API that supports taking a non-null-terminated 
> string. I often work with APIs that support efficient parsing by providing a 
> pointer to a global buffer + length to report parsed strings.
> 
> Without a way to create a Swift string from buffer + length, interop with 
> such APIs will be difficult for no good reason, as Swift strings don’t even 
> have to be null-terminated.
> 
>> On Mar 30, 2017, at 6:35 PM, Félix Cloutier via swift-evolution 
>> <[email protected] <mailto:[email protected]>> wrote:
>> 
>> I don't have many non-nitpick issues that I greatly care about; I'm in favor 
>> of this.
>> 
>> My only request: it's currently painful to create a String from a fixed-size 
>> C array. For instance, if I have a pointer to a `struct foo { char name[16]; 
>> }` in Swift where the last character doesn't have to be a NUL, it's hard to 
>> create a String from it. Real-world examples of this are Mach-O LC_SEGMENT 
>> and LC_SEGMENT_64 commands.
>> 
>> The generally-accepted wisdom <http://stackoverflow.com/a/27456220/251153> 
>> is that you take a pointer to the CChar tuple that represents the fixed-size 
>> array, but this still requires the string to be NUL-terminated. What do we 
>> think of an additional init(cString:) overload that takes an 
>> UnsafeBufferPointer and reads up to the first NUL or the end of the buffer, 
>> whichever comes first?
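
One way the "first NUL or end of buffer" behaviour could look today is sketched below. The helper `string(fromFixedSizeField:)` and the `CName16` alias are hypothetical names, not standard library API; the sketch assumes the field holds UTF-8 and relies on `withUnsafeBytes(of:_:)` and `String(decoding:as:)` from later Swift releases.

```swift
// A C `char name[16]` is imported into Swift as a 16-element homogeneous tuple.
typealias CName16 = (CChar, CChar, CChar, CChar, CChar, CChar, CChar, CChar,
                     CChar, CChar, CChar, CChar, CChar, CChar, CChar, CChar)

/// Hypothetical helper: decodes UTF-8 up to the first NUL or the end of the
/// field's storage, whichever comes first.
func string<T>(fromFixedSizeField field: T) -> String {
    return withUnsafeBytes(of: field) { raw -> String in
        // `raw` is a Collection of UInt8 covering exactly the field's bytes.
        let end = raw.firstIndex(of: 0) ?? raw.endIndex
        return String(decoding: raw[..<end], as: UTF8.self)
    }
}

// "__TEXT" padded with NULs, as in a Mach-O segment name.
let padded: CName16 = (95, 95, 84, 69, 88, 84, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
// All 16 bytes used, so there is no terminating NUL at all.
let full: CName16 = (48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 97, 98, 99, 100, 101, 102)

print(string(fromFixedSizeField: padded))
print(string(fromFixedSizeField: full))
```

Because the helper is generic over `T`, it reads whatever bytes the value occupies, so it should only be handed plain byte-array fields.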
>> 
>>> On Mar 30, 2017, at 2:48 AM, Brent Royal-Gordon via swift-evolution 
>>> <[email protected] <mailto:[email protected]>> wrote:
>>> 
>>>> On Mar 29, 2017, at 5:32 PM, Ben Cohen via swift-evolution 
>>>> <[email protected] <mailto:[email protected]>> wrote:
>>>> 
>>>> Hi Swift Evolution,
>>>> 
>>>> Below is a pitch for the first part of the String revision. This covers a 
>>>> number of changes that would allow the basic internals to be overhauled.
>>>> 
>>>> Online version here: 
>>>> https://github.com/airspeedswift/swift-evolution/blob/3a822c799011ace682712532cfabfe32e9203fbb/proposals/0161-StringRevision1.md
>>> 
>>> Really great stuff, guys. Thanks for your work on this!
>>> 
>>>> In order to be able to write extensions across both String and Substring, 
>>>> a new Unicode protocol to which the two types will conform will be 
>>>> introduced. For the purposes of this proposal, Unicode will be defined as 
>>>> a protocol to be used whenever you would previously extend String. It 
>>>> should be possible to substitute extension Unicode { ... } in Swift 4 
>>>> wherever extension String { ... } was written in Swift 3, with one 
>>>> exception: any passing of self into an API that takes a concrete String 
>>>> will need to be rewritten as String(self). If Self is a String then this 
>>>> should effectively optimize to a no-op, whereas if Self is a Substring 
>>>> then this will force a copy, helping to avoid the “memory leak” problems 
>>>> described above.
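
A small sketch of the pattern described above. It uses `StringProtocol`, the name this protocol eventually shipped under, in place of the pitched `Unicode`; `byteCount(of:)` is a made-up stand-in for any API that takes a concrete String.

```swift
/// Stand-in for any existing API that takes a concrete String.
func byteCount(of s: String) -> Int {
    return s.utf8.count
}

// One extension that serves both String and Substring.
extension StringProtocol {
    var utf8ByteCount: Int {
        // `self` cannot be passed directly; it has to be converted explicitly.
        // For a Substring this copies the characters into fresh storage.
        return byteCount(of: String(self))
    }
}

let greeting = "Hello, world"
let slice = greeting.prefix(5)   // a Substring sharing greeting's storage

print(greeting.utf8ByteCount)    // 12
print(slice.utf8ByteCount)       // 5
```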
>>> 
>>> I continue to feel that `Unicode` is the wrong name for this protocol, 
>>> essentially because it sounds like a protocol for, say, a version of 
>>> Unicode or some kind of encoding machinery instead of a Unicode string. I 
>>> won't rehash that argument since I made it already in the manifesto thread, 
>>> but I would like to make a couple new suggestions in this area.
>>> 
>>> Later on, you note that it would be nice to namespace many of these types:
>>> 
>>>> Several of the types related to String, such as the encodings, would 
>>>> ideally reside inside a namespace rather than live at the top level of the 
>>>> standard library. The best namespace for this is probably Unicode, but 
>>>> this is also the name of the protocol. At some point if we gain the 
>>>> ability to nest enums and types inside protocols, they should be moved 
>>>> there. Putting them inside String or some other enum namespace is probably 
>>>> not worthwhile in the meantime.
>>> 
>>> Perhaps we should use an empty enum to create a `Unicode` namespace and 
>>> then nest the protocol within it via typealias. If we do that, we can 
>>> consider names like `Unicode.Collection` or even `Unicode.String` which 
>>> would shadow existing types if they were top-level.
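
The namespace trick might look roughly like this. Every name is hypothetical: `UnicodeSketch` stands in for the `Unicode` namespace so the sketch does not collide with anything in the standard library, and `TextProtocolSketch` stands in for the protocol being pitched.

```swift
/// Hypothetical top-level protocol standing in for the one under discussion.
protocol TextProtocolSketch: BidirectionalCollection where Element == Character {}

/// An enum with no cases cannot be instantiated, so it acts purely as a namespace.
enum UnicodeSketch {
    /// Only shadows `Swift.Collection` inside this namespace.
    typealias Collection = TextProtocolSketch
}

extension String: UnicodeSketch.Collection {}
extension Substring: UnicodeSketch.Collection {}

/// Generic code can then spell the constraint through the namespace.
func countVowels<S: UnicodeSketch.Collection>(in text: S) -> Int {
    return text.filter { "aeiou".contains($0) }.count
}

print(countVowels(in: "unicode"))
print(countVowels(in: "unicode".dropFirst(2)))
```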
>>> 
>>> If not, then given this:
>>> 
>>>> The exact nature of the protocol – such as which methods should be 
>>>> protocol requirements vs which can be implemented as protocol extensions, 
>>>> are considered implementation details and so not covered in this proposal.
>>> 
>>> We may simply want to wait to choose a name. As the protocol develops, we 
>>> may discover a theme in its requirements which would suggest a good name. 
>>> For instance, we may realize that the core of what the protocol abstracts 
>>> is grouping code units into characters, which might suggest a name like 
>>> `Characters`, or `Unicode.Characters`, or `CharacterCollection`, or 
>>> what-have-you.
>>> 
>>> (By the way, I hope that the eventual protocol requirements will be put 
>>> through the review process, if only as an amendment, once they're 
>>> determined.)
>>> 
>>>> Unicode will conform to BidirectionalCollection. 
>>>> RangeReplaceableCollection conformance will be added directly onto the 
>>>> String and Substring types, as it is possible future Unicode-conforming 
>>>> types might not be range-replaceable (e.g. an immutable type that wraps a 
>>>> const char *).
>>> 
>>> I'm a little worried about this because it seems to imply that the protocol 
>>> cannot include any mutation operations that aren't in 
>>> `RangeReplaceableCollection`. For instance, it won't be possible to include 
>>> an in-place `applyTransform` method in the protocol. Do you anticipate that 
>>> being an issue? Might it be a good idea to define a parallel `Mutable` or 
>>> `RangeReplaceable` protocol?
>>> 
>>>> The C string interop methods will be updated to those described here: a 
>>>> single withCString operation and two init(cString:) constructors, one for 
>>>> UTF8 and one for arbitrary encodings.
>>> 
>>> Sorry if I'm repeating something that was already discussed, but is there a 
>>> reason you don't include a `withCString` variant for arbitrary encodings? 
>>> It seems like an odd asymmetry.
>>> 
>>>> The standard library currently lacks a Latin1 codec, so an enum Latin1: 
>>>> UnicodeEncoding type will be added.
>>> 
>>> Nice. I wrote one of those once; I'll enjoy deleting it.
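
For context, hand-rolled Latin-1 decoding is short because every Latin-1 byte maps to the Unicode scalar with the same numeric value. The sketch below shows that mapping only; `decodeLatin1` is a hypothetical helper, not the `Latin1` type from the proposal.

```swift
/// Hypothetical helper: each Latin-1 byte is the Unicode scalar with the same value.
func decodeLatin1<C: Collection>(_ bytes: C) -> String where C.Element == UInt8 {
    var scalars = String.UnicodeScalarView()
    for byte in bytes {
        // Unicode.Scalar(UInt8) cannot fail: 0x00...0xFF are all valid scalars.
        scalars.append(Unicode.Scalar(byte))
    }
    return String(scalars)
}

// 0xE9 is "é" in Latin-1; as a lone byte it would be ill-formed UTF-8.
print(decodeLatin1([0x63, 0x61, 0x66, 0xE9]))
```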
>>> 
>>>> A new protocol, UnicodeEncoding, will be added to replace the current 
>>>> UnicodeCodec protocol:
>>>> 
>>>> public enum UnicodeParseResult<T, Index> {
>>> 
>>> Either `T` should be given a more specific name, or the enum should be 
>>> given a less specific one, becoming `ParseResult` and being oriented 
>>> towards incremental parsing of anything from any kind of collection.
>>> 
>>>> /// Indicates valid input was recognized.
>>>> ///
>>>> /// `resumptionPoint` is the end of the parsed region
>>>> case valid(T, resumptionPoint: Index)  // FIXME: should these be reordered?
>>> 
>>> No, I think this is the right order. The thing that's valid is the code 
>>> point.
>>> 
>>>> /// Indicates invalid input was recognized.
>>>> ///
>>>> /// `resumptionPoint` is the next position at which to continue parsing
>>>> /// after the invalid input is repaired.
>>>> case error(resumptionPoint: Index)
>>> 
>>> I know this is abbreviated documentation, but I hope the full version 
>>> includes a good usage example demonstrating, among other things, how to 
>>> detect partial characters and defer processing of them instead of rejecting 
>>> them as erroneous.
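
As a rough illustration of how the two quoted cases could drive a parsing loop, here is a toy ASCII-only sketch. It is not the proposal's API: `SketchParseResult`, `parseOneASCII`, and `decodeASCII` are hypothetical names, end of input is modelled with `nil`, and partial characters are not handled at all.

```swift
/// Hypothetical mirror of the two cases quoted above.
enum SketchParseResult<T, Index> {
    case valid(T, resumptionPoint: Index)
    case error(resumptionPoint: Index)
}

/// Toy parser: bytes below 0x80 are valid ASCII, anything else is an error.
/// Returns nil when the input is empty.
func parseOneASCII<C: Collection>(
    _ input: C
) -> SketchParseResult<Unicode.Scalar, C.Index>? where C.Element == UInt8 {
    guard let byte = input.first else { return nil }
    let next = input.index(after: input.startIndex)
    if byte < 0x80 {
        return .valid(Unicode.Scalar(byte), resumptionPoint: next)
    }
    return .error(resumptionPoint: next)
}

/// Drives the parser, substituting U+FFFD for each invalid byte.
func decodeASCII(_ bytes: [UInt8]) -> (text: String, errorCount: Int) {
    let replacement: Unicode.Scalar = "\u{FFFD}"
    var scalars = String.UnicodeScalarView()
    var errorCount = 0
    var rest = bytes[...]
    while let result = parseOneASCII(rest) {
        switch result {
        case .valid(let scalar, resumptionPoint: let next):
            scalars.append(scalar)
            rest = rest[next...]      // continue after the parsed region
        case .error(resumptionPoint: let next):
            scalars.append(replacement)
            errorCount += 1
            rest = rest[next...]      // continue after the repaired region
        }
    }
    return (String(scalars), errorCount)
}

let decoded = decodeASCII([0x48, 0x69, 0xFF, 0x21])
print(decoded.text, decoded.errorCount)
```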
>>> 
>>>> /// An encoding for text with UnicodeScalar as a common currency type
>>>> public protocol UnicodeEncoding {
>>>>  /// The maximum number of code units in an encoded unicode scalar value
>>>>  static var maxLengthOfEncodedScalar: Int { get }
>>>> 
>>>>  /// A type that can represent a single UnicodeScalar as it is encoded in
>>>>  /// this encoding.
>>>>  associatedtype EncodedScalar : EncodedScalarProtocol
>>> 
>>> There's an `EncodedScalarProtocol`-shaped hole in this proposal. What does 
>>> it do? What are its semantics? How does `EncodedScalar` relate to the old 
>>> `CodeUnit`?
>>> 
>>>>  @discardableResult
>>>>  public static func parseForward<C: Collection>(
>>>>    _ input: C,
>>>>    repairingIllFormedSequences makeRepairs: Bool = true,
>>>>    into output: (EncodedScalar) throws -> Void
>>>>  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
>>>> 
>>>>  @discardableResult    
>>>>  public static func parseReverse<C: BidirectionalCollection>(
>>>>    _ input: C,
>>>>    repairingIllFormedSequences makeRepairs: Bool = true,
>>>>    into output: (EncodedScalar) throws -> Void
>>>>  ) rethrows -> (remainder: C.SubSequence, errorCount: Int)
>>>>  where C.SubSequence : BidirectionalCollection,
>>>>        C.SubSequence.SubSequence == C.SubSequence,
>>>>        C.SubSequence.Iterator.Element == EncodedScalar.Iterator.Element
>>>> }
>>> 
>>> Are there constraints missing on `parseForward`?
>>> 
>>> What do these do if `makeRepairs` is false? Would it be clearer if we made 
>>> an enum that described the behaviors and changed the label to something 
>>> like `ifIllFormed:`?
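
To make the enum suggestion concrete, here is one possible shape. It is written against the existing `UTF8` codec's `decode(_:)` method rather than the pitched `parseForward`, and `IllFormedPolicy` and `decodeUTF8(_:ifIllFormed:)` are hypothetical names.

```swift
/// Hypothetical replacement for the Bool: each behaviour gets a name.
enum IllFormedPolicy {
    case repair   // substitute U+FFFD and keep going
    case reject   // give up at the first ill-formed sequence
}

/// Sketch built on the existing UnicodeCodec-style decoder.
func decodeUTF8(_ bytes: [UInt8], ifIllFormed policy: IllFormedPolicy) -> String? {
    let replacement: Unicode.Scalar = "\u{FFFD}"
    var decoder = UTF8()
    var iterator = bytes.makeIterator()
    var scalars = String.UnicodeScalarView()
    while true {
        switch decoder.decode(&iterator) {
        case .scalarValue(let scalar):
            scalars.append(scalar)
        case .emptyInput:
            return String(scalars)
        case .error:
            switch policy {
            case .repair: scalars.append(replacement)
            case .reject: return nil
            }
        }
    }
}

let illFormed: [UInt8] = [0x41, 0xFF, 0x42]   // "A", an invalid byte, "B"
print(decodeUTF8(illFormed, ifIllFormed: .repair) ?? "nil")
print(decodeUTF8(illFormed, ifIllFormed: .reject) ?? "nil")
```

A call site then reads `ifIllFormed: .reject`, which says what happens without the reader having to know what `false` means.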
>>> 
>>>> Due to the change in internal implementation, this means that these 
>>>> operations will be O(n) rather than O(1). This is not expected to be a 
>>>> major concern, based on experiences from a similar change made to Java, 
>>>> but projects will be able to work around performance issues without 
>>>> upgrading to Swift 4 by explicitly typing slices as Substring, which will 
>>>> call the Swift 4 variant, and which will be available but not invoked by 
>>>> default in Swift 3 mode.
>>> 
>>> Will there be a way to make this also work with a real Swift 3 compiler? 
>>> For instance, can you define `typealias Substring = String` in such a way 
>>> that real Swift 3 will parse and use it, but Swift 4 in Swift 3 mode will 
>>> ignore it?
>>> 
>>>> This proposal does not yet introduce an implicit conversion from Substring 
>>>> to String. The decision on whether to add this will be deferred pending 
>>>> feedback on the initial implementation. The intention is to make a preview 
>>>> toolchain available for feedback, including on whether this implicit 
>>>> conversion is necessary, prior to the release of Swift 4.
>>> 
>>> This is a sensible approach.
>>> 
>>> Thank you for developing this into a full proposal. I discussed the plans 
>>> for Swift 4 with a local group of programmers recently, and everyone was 
>>> pleased to hear that `String` would get an overhaul, that the `characters` 
>>> view would be integrated into the string, etc. We even talked a little 
>>> about `Substring` and people thought it was a good idea. This proposal is 
>>> shaping up to impact a lot of people, but in a good way!
>>> 
>>> -- 
>>> Brent Royal-Gordon
>>> Architechies
>>> 
>>> _______________________________________________
>>> swift-evolution mailing list
>>> [email protected] <mailto:[email protected]>
>>> https://lists.swift.org/mailman/listinfo/swift-evolution 
>>> <https://lists.swift.org/mailman/listinfo/swift-evolution>