> On Jan 19, 2017, at 6:56 PM, Ben Cohen via swift-evolution
> <[email protected]> wrote:
>
> ### Formatting
>
> A full treatment of formatting is out of scope of this proposal, but
> we believe it's crucial for completing the text processing picture. This
> section details some of the existing issues and thinking that may guide future
> development.
>
Filesystem paths are Strings on Apple platforms but not on Linux. How are we
going to square that circle? What about Swift on the server, where
distinguishing HTML and JavaScript is security-critical? There are huge
security implications to string processing, often because platforms make it
easy to do the wrong thing carelessly and promote ad-hoc formatting,
serialization, and parsing. That’s a huge area to consider, of course, but it
might be worth thinking about how an ergonomic API for a few example cases
would work.

I guess my point is that formatting and interpolation are far more than “just
formatting”; whether the API makes the right thing easy or difficult will
directly determine whether we end up with exploitable security vulnerabilities.
(To be clear, I’m not saying the follow-on proposals from this need to solve
those problems; maybe just give them some consideration.)
> ## Open Questions
>
> ### Must `String` be limited to storing UTF-16 subset encodings?
>
> - The ability to handle `UTF-8`-encoded strings (models of `Unicode`) is not
>   in question here; this is about what encodings must be storable, without
>   transcoding, in the common currency type called “`String`”.
> - ASCII, Latin-1, UCS-2, and UTF-16 are UTF-16 subsets. UTF-8 is not.
Depending on who you believe, UTF-8 is the encoding of ~65-88% of all text
content transmitted over the web. JSON and XML represent the lion’s share of
REST and non-REST APIs in use, and both are almost exclusively transmitted as
UTF-8. As you point out with extendedASCII, a lot of markup and structure is
ASCII even if the content is not, so UTF-8 represents a significant size
savings even on Chinese/Japanese web pages that require 3 bytes to represent
many characters (the savings on markup outweigh the loss on textual content).
Any model that makes UTF-8-backed Strings difficult or cumbersome to use can
have a negative performance and memory impact. I don’t have a good idea of the
actual cost, but it might be worth running some tests to determine that.
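
As a rough sketch of the size argument (not a real benchmark), the snippet
below compares the byte counts of one string in UTF-8 and UTF-16. The sample
markup is made up for illustration, and a single short string says nothing
about real-world pages; it only shows the markup-versus-content trade-off.

```swift
// Hypothetical sample: ASCII markup wrapped around Japanese content.
let page = "<div class=\"post\"><p>日本語のテキスト</p></div>"

// UTF-8 uses 1 byte per ASCII scalar and 3 bytes per scalar in this CJK text.
let utf8Bytes = page.utf8.count

// UTF-16 uses 2 bytes per code unit, including for the ASCII markup.
let utf16Bytes = page.utf16.count * 2

print("UTF-8: \(utf8Bytes) bytes, UTF-16: \(utf16Bytes) bytes")
```

Here the ASCII markup outnumbers the CJK characters, so UTF-8 comes out
smaller; a string that is almost entirely CJK text would tip the other way.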
Is NSString interop the only reason to not just use UTF-8 as the default
storage? If so, is that a solvable problem? Could one choose by typealias or a
compiler flag which default storage they wanted?
> - If we have a way to get at a `String`'s code units, we need a concrete type
>   in which to express them in the API of `String`, which is a concrete type
> - If String needs to be able to represent UTF-32, presumably the code units
>   need to be `UInt32`.
> - Not supporting UTF-32-encoded text seems like one reasonable design choice.
> - Maybe we can allow UTF-8 storage in `String` and expose its code units as
>   `UInt16`, just as we would for Latin-1.
> - Supporting only UTF-16-subset encodings would imply that `String` indices
>   can be serialized without recording the `String`'s underlying encoding.
I suppose you could be clever on 64-bit platforms by stealing some bits to
indicate the encoding… not that I recommend that :D
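
For the curious, the bit-stealing idea could look roughly like the sketch
below. Everything here is hypothetical: `EncodingTag`, `TaggedIndex`, and the
2-bit/62-bit split are invented for illustration and are not part of the
proposal or of any existing `String` index layout.

```swift
// Hypothetical encoding tag; the cases and raw values are illustrative only.
enum EncodingTag: UInt64 {
    case utf16 = 0
    case utf8 = 1
    case utf32 = 2
}

// Hypothetical serialized index: the top 2 bits hold the encoding tag and
// the low 62 bits hold the code-unit offset.
struct TaggedIndex {
    static let tagShift: UInt64 = 62
    static let offsetMask: UInt64 = (1 << tagShift) - 1

    let rawValue: UInt64

    init(offset: UInt64, encoding: EncodingTag) {
        precondition(offset <= TaggedIndex.offsetMask, "offset does not fit in 62 bits")
        rawValue = (encoding.rawValue << TaggedIndex.tagShift) | offset
    }

    // Mask off the tag bits to recover the offset.
    var offset: UInt64 { rawValue & TaggedIndex.offsetMask }

    // Shift the tag bits down; the initializer guarantees a valid tag.
    var encoding: EncodingTag { EncodingTag(rawValue: rawValue >> TaggedIndex.tagShift)! }
}

let index = TaggedIndex(offset: 42, encoding: .utf8)
print(index.offset, index.encoding)
```

The cost is visible in the precondition: offsets lose two bits of range, and
every consumer of the serialized value has to know about the packing.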
>
> ### Do we need a type-erasable base protocol for UnicodeEncoding?
>
> UnicodeEncoding has an associated type, but it may be important to be able to
> traffic in completely dynamic encoding values, e.g. for “tell me the most
> efficient encoding for this string.”
Generalized Existentials
tis but happiness by another name
For we who live
in The Land of Protocols and Faeries
>
> ### Should there be a string “facade?”
>
> One possible design alternative makes `Unicode` a vehicle for expressing
> the storage and encoding of code units, but does not attempt to give it an API
> appropriate for `String`. Instead, string APIs would be provided by a generic
> wrapper around an instance of `Unicode`:
>
> ```swift
> struct StringFacade<U: Unicode> : BidirectionalCollection {
>
>   // ...APIs for high-level string processing here...
>
>   var unicode: U // access to lower-level unicode details
> }
>
> typealias String = StringFacade<StringStorage>
> typealias Substring = StringFacade<StringStorage.SubSequence>
> ```
>
> This design would allow us to de-emphasize lower-level `String` APIs such as
> access to the specific encoding, by putting them behind a `.unicode` property.
> A similar effect in a facade-less design would require a new top-level
> `StringProtocol` playing the role of the facade with an `associatedtype
> Storage : Unicode`.
>
> An interesting variation on this design is possible if defaulted generic
> parameters are introduced to the language:
>
> ```swift
> struct String<U: Unicode = StringStorage>
>   : BidirectionalCollection {
>
>   // ...APIs for high-level string processing here...
>
>   var unicode: U // access to lower-level unicode details
> }
>
> typealias Substring = String<StringStorage.SubSequence>
> ```
>
> One advantage of such a design is that naïve users will always extend “the
> right type” (`String`) without thinking, and the new APIs will show up on
> `Substring`, `MyUTF8String`, etc. That said, it also has downsides that should
> not be overlooked, not least of which is the confusability of the meaning of
> the word “string.” Is it referring to the generic or the concrete type?
Fair point, but I do like the idea of separating the two and encouraging people
to extend String while automatically extending all the String-ish types. This
would compose well with a hypothetical HTMLString, JavaScriptString, etc.
(assuming one could design a model where those things compose well, e.g.
appending MyUTF8String to HTMLString performs automatic HTML-escaping, whereas
appending HTMLString to HTMLString does not).
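
A minimal sketch of that appending behavior, using today's plain `String` in
place of MyUTF8String. The `HTMLString` type, its `markup` property, and the
`escape` helper are all hypothetical names made up for this example, and the
escaping covers only the five most common characters.

```swift
// Hypothetical type: holds text that is already safe to emit as HTML.
struct HTMLString {
    private(set) var markup: String

    // Trusted markup supplied by the programmer is stored as-is.
    init(markup: String) { self.markup = markup }

    // Appending a plain String escapes it first.
    mutating func append(_ text: String) {
        markup += HTMLString.escape(text)
    }

    // Appending an HTMLString does not escape again.
    mutating func append(_ html: HTMLString) {
        markup += html.markup
    }

    // Minimal escaping; a real implementation would need more care.
    static func escape(_ text: String) -> String {
        var result = ""
        for character in text {
            switch character {
            case "&": result += "&amp;"
            case "<": result += "&lt;"
            case ">": result += "&gt;"
            case "\"": result += "&quot;"
            case "'": result += "&#39;"
            default: result.append(character)
            }
        }
        return result
    }
}

var page = HTMLString(markup: "<p>")
page.append("<script>alert(1)</script>")  // untrusted String: escaped
page.append(HTMLString(markup: "</p>"))   // trusted HTMLString: appended verbatim
print(page.markup)
```

The overloads make the safe path the default: the only way to skip escaping is
to construct an `HTMLString` explicitly, which is easy to spot in review.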
Anything that avoids forcing the average app or library author to stop and
think about which String type to use is probably a net win if the performance
isn’t horrible; someone writing a web server pipeline will need to write their
own String-ish type for performance reasons anyway so a slight perf hit may be
no great loss.
Thanks to you and Ben for the hard work so far; I can’t even imagine taking on
such a task!
Russ