> On 25 Jan 2017, at 13:08, Ben Cohen <[email protected]> wrote:
>> Okay, so I'm serializing two strings "a" and "b", and later on I want to 
>> deserialize them. I control "a", and the user controls "b". I know that I'll 
>> never have a comma in "a", so one obvious way to serialize the two strings 
>> is with "\(a),\(b)", and the most obvious way to deserialize them is with 
>> string.split(maxSplits: 1) { $0 == "," }.
>> 
>> For the example, string "a" is "hello", and the user put in "\u{0301}screw 
>> you" for "b". This makes the result "hello,́screw you". Now split misses the 
>> comma.
>> 
>> How do I fix it?
>> 
> 
> One option (once Character acquires a unicodeScalars view similar to 
> String’s) would be:
> 
> s.split { $0.unicodeScalars.first == "," }

My two main objections to this are that (1) this drops the acute accent 
(although that's probably an acceptable sacrifice in the face of purposefully 
bad input); and (2) it's annoying to me that you have to drop below the 
Character level to safely perform a task this simple.
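To make both the failure and the workaround concrete, here's a sketch (assuming Character has gained the unicodeScalars view discussed above):

```swift
let a = "hello"
let b = "\u{0301}screw you" // adversarial input: leading U+0301 combining acute

let serialized = "\(a),\(b)"

// Character-level split: U+0301 combines with the comma into the single
// grapheme cluster ",́", so no Character ever equals "," and nothing splits.
let broken = serialized.split(maxSplits: 1) { $0 == "," }
// broken.count == 1

// Scalar-predicate split: matches any Character whose *first* scalar is ",",
// but the entire separator cluster — accent included — is consumed.
let parts = serialized.split(maxSplits: 1) { $0.unicodeScalars.first == "," }
// parts == ["hello", "screw you"]; the accent is silently dropped
```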

> There’s probably also a case to be made for a String-specific overload 
> split(separator: UnicodeScalar) in which case you’d pass in the scalar of 
> “,”. This would replicate similar behavior to languages that use code points 
> as their “character”.

Given the way they're being built, I'm leaning towards the opinion that Strings aren't the right tool for serializing anything. Unfortunately, in a world of XML, JSON, YAML, Markdown and such, they're also a very obvious choice.
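For what it's worth, the overload suggested above can be sketched today as an extension; the name `split(scalarSeparator:)` is made up, and this is not part of the standard library:

```swift
// Hypothetical String overload: split at a Unicode scalar, operating on the
// scalar view so that combining marks in the pieces are preserved.
extension String {
    func split(scalarSeparator separator: Unicode.Scalar,
               maxSplits: Int = Int.max) -> [String] {
        return unicodeScalars
            .split(separator: separator,
                   maxSplits: maxSplits,
                   omittingEmptySubsequences: false)
            .map { String($0) }
    }
}

let parts = "hello,\u{0301}screw you".split(scalarSeparator: ",", maxSplits: 1)
// parts == ["hello", "\u{0301}screw you"] — the accent survives in parts[1]
```

Unlike the scalar-predicate split, this keeps the user's U+0301 intact, replicating the "code point as character" behavior of other languages.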

> Alternatively, the right solution is to sanitize your input before the 
> interpolation. Sanitization is a big topic, of which this is just one 
> example. Essentially, you are asking for this kind of sanitization to be 
> automatically applied for all range-replaceable operations on strings for 
> this specific use case. I’m not sure that’s a good precedent to set. There 
> are other ways in which Unicode can be abused that wouldn’t be covered, 
> should we be sanitizing for those too on all low-level operations?

I agree that the general Unicode abuse problem cannot be solved. The novel 
thing here is that Swift is one of the first languages to bring 
grapheme-cluster-aware strings to a wide audience, and doing so, it introduces 
a class of bugs that have essentially no precedent. I feel like this should 
worry people a little bit. People have been able to abuse RTL overrides for 
several years now, and we found that while it's a problem for users, machines are
pretty good at dealing with it. However, if you'll allow me to dramatize, these 
are characters that basically eat their neighbor.

> This would also have pretty far-reaching implications across lots of 
> different types and operations. For example, it’s not just on append:
> 
> var s = "pokemon"
> let i = s.index(of: "m")!
> // insert not just \u{0301} but also a separator?
> s.insert("\u{0301}", at: i)
> 
> It also would apply to in-place mutation on slices, given you can do this:
> 
> var a = [1,2,3,4]
> a[0...2].append(99)
> a // [1,2,3,99,4]
> 
> In this case, suppose you appended "e" to a slice that ended between "m" and 
> "\u{0301}". The append operation on the substring would need to look into the
> outer string, see that the next scalar is a combining character, and then 
> insert a spacer element in between them.
> 
> We would still need the ability to append modifiers to characters 
> legitimately. If users could not do this by inserting/appending these 
> modifiers into String, we would have to put this logic onto Character, which 
> would need to have the ability to range-replace within its scalars, which 
> adds a lot to the complexity of that type. It would also be fiddly to use,
> given that String is not going to conform to MutableCollection (because 
> mutation on an element cannot be done in constant time). So you couldn’t do 
> it in-place i.e. s[i].unicodeScalars.append("\u{0301}") wouldn’t work.

I'd argue that no one should feel particularly great about writing code points 
to a collection that exposes Characters in return. Have any alternatives for 
modifying a String through its Unicode scalar view been explored? I have no 
problem with making it impossible to add a Character-that-is-not-a-Character to 
a String's Character view, so long as you can opt in to Unicode scalars when 
you mean it.
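One opt-in shape of this kind already exists: String's unicodeScalars view is range-replaceable, so deliberate scalar-level edits can go through it explicitly rather than by appending a bare combining mark as a Character. A sketch:

```swift
// Deliberate, explicit scalar-level mutation via the unicodeScalars view.
var s = "pokemon"
if let m = s.unicodeScalars.firstIndex(of: "m") {
    // Insert U+0301 before the "m"; it combines with the preceding "e".
    s.unicodeScalars.insert("\u{0301}", at: m)
}
// s is canonically equivalent to "pokémon", and s.count == 7:
// the inserted scalar merged into an existing grapheme cluster.
```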

Félix

_______________________________________________
swift-evolution mailing list
[email protected]
https://lists.swift.org/mailman/listinfo/swift-evolution
