> On 25 Jan 2017, at 13:08, Ben Cohen <[email protected]> wrote:
>> Okay, so I'm serializing two strings "a" and "b", and later on I want to
>> deserialize them. I control "a", and the user controls "b". I know that I'll
>> never have a comma in "a", so one obvious way to serialize the two strings
>> is with "\(a),\(b)", and the most obvious way to deserialize them is with
>> string.split(maxSplits: 1) { $0 == "," }.
>>
>> For the example, string "a" is "hello", and the user put in "\u{0301}screw
>> you" for "b". This makes the result "hello,́screw you". Now split misses the
>> comma.
>>
>> How do I fix it?
>>
>
> One option (once Character acquires a unicodeScalars view similar to
> String’s) would be:
>
> s.split { $0.unicodeScalars.first == "," }

I have two main objections to this: (1) it drops the acute accent
(although that's probably an acceptable sacrifice in the face of deliberately
bad input); and (2) it's annoying that you have to drop below the
Character level to safely perform a task this simple.
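
For what it's worth, here is a minimal sketch of a scalar-level split that
keeps the accent. It assumes a recent Swift toolchain, and
splitAtFirstCommaScalar is a hypothetical helper name, not an existing API:

```swift
/// Hypothetical helper: splits `s` at the first "," *scalar* rather than the
/// first "," Character, so a combining mark after the comma cannot hide it.
func splitAtFirstCommaScalar(_ s: String) -> (head: String, tail: String)? {
    let scalars = s.unicodeScalars
    guard let comma = scalars.firstIndex(of: ",") else { return nil }
    // String.Index is shared between a String and its scalar view, so the
    // scalar index can be used to slice the String directly.
    let head = String(s[..<comma])
    let tail = String(s[scalars.index(after: comma)...])
    return (head, tail)
}

let a = "hello"
let b = "\u{0301}screw you"   // user input starting with COMBINING ACUTE ACCENT
let joined = "\(a),\(b)"

// Character-level split: "," and U+0301 form one grapheme cluster, so no
// Character compares equal to "," and nothing is split.
let byCharacter = joined.split(separator: ",", maxSplits: 1)

// Scalar-level split still sees the comma.
let byScalar = splitAtFirstCommaScalar(joined)
```

Because the cut happens between scalars, "b" comes back with its leading
U+0301 intact instead of being dropped along with the separator.
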
> There’s probably also a case to be made for a String-specific overload
> split(separator: UnicodeScalar) in which case you’d pass in the scalar of
> “,”. This would replicate similar behavior to languages that use code points
> as their “character”.

Given the way they're being built, I'm leaning towards the opinion that Strings
aren't the right tool for serializing anything. Unfortunately, in a world of
XML, JSON, YAML, Markdown and such, they're also a very obvious choice.
> Alternatively, the right solution is to sanitize your input before the
> interpolation. Sanitization is a big topic, of which this is just one
> example. Essentially, you are asking for this kind of sanitization to be
> automatically applied for all range-replaceable operations on strings for
> this specific use case. I’m not sure that’s a good precedent to set. There
> are other ways in which Unicode can be abused that wouldn’t be covered,
> should we be sanitizing for those too on all low-level operations?

I agree that the general Unicode abuse problem cannot be solved. The novel
thing here is that Swift is one of the first languages to bring
grapheme-cluster-aware strings to a wide audience, and in doing so, it
introduces a class of bugs that have essentially no precedent. I feel like this
should worry people a little bit. People have been able to abuse RTL overrides
for several years now, and we found that it's a problem for users, but machines
are pretty good at dealing with it. However, if you'll allow me to dramatize,
these are characters that basically eat their neighbor.
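
To make the sanitization route concrete, here is a rough sketch of what
caller-side sanitizing could look like for this one case. It assumes a recent
Swift toolchain; strippingLeadingCombiners is a hypothetical name, and
discarding the offending scalars (rather than escaping them) is just one
possible policy:

```swift
/// Hypothetical helper: removes leading scalars of `input` that would merge
/// into `separator` and form a single grapheme cluster with it.
func strippingLeadingCombiners(_ input: String, separator: Character = ",") -> String {
    let scalars = input.unicodeScalars
    var start = scalars.startIndex
    while start < scalars.endIndex {
        // Probe: put the candidate scalar right after the separator and check
        // whether the separator survives as a Character of its own.
        var probe = String(separator)
        probe.unicodeScalars.append(scalars[start])
        if probe.first == separator { break }   // no merge, safe to keep
        start = scalars.index(after: start)     // merged, so drop this scalar
    }
    return String(input[start...])
}

let userInput = "\u{0301}screw you"
let safeInput = strippingLeadingCombiners(userInput)
let serialized = "hello,\(safeInput)"

// The comma is a Character of its own again, so the naive split works.
let fields = serialized.split(separator: ",", maxSplits: 1).map { String($0) }
```

The probe lets the string's own segmentation decide what counts as "eating"
the separator, instead of hard-coding a list of combining code points.
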
> This would also have pretty far-reaching implications across lots of
> different types and operations. For example, it’s not just on append:
>
> var s = "pokemon"
> let i = s.index(of: "m")!
> // insert not just \u{0301} but also a separator?
> s.insert("\u{0301}", at: i)
>
> It also would apply to in-place mutation on slices, given you can do this:
>
> var a = [1,2,3,4]
> a[0...2].append(99)
> a // [1,2,3,99,4]
>
> In this case, suppose you appended "e" to a slice that ended between "m" and
> "\u{0301}". The append operation on the substring would need to look into the
> outer string, see that the next scalar is a combining character, and then
> insert a spacer element in between them.
>
> We would still need the ability to append modifiers to characters
> legitimately. If users could not do this by inserting/appending these
> modifiers into String, we would have to put this logic onto Character, which
> would need to have the ability to range-replace within its scalars, which
> adds a lot to the complexity of that type. It would also be fiddly to use,
> given that String is not going to conform to MutableCollection (because
> mutation on an element cannot be done in constant time). So you couldn’t do
> it in place, i.e. s[i].unicodeScalars.append("\u{0301}") wouldn’t work.
I'd argue that no one should feel particularly great about writing code points
to a collection that exposes Characters in return. Have any alternatives around
modifying a Unicode scalar view been explored? I don't have any problem with
making it impossible to add a Character-that-is-not-a-Character to a String's
Character view if you can opt in to Unicode scalars when you mean it.
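
As an illustration of what opting in already looks like, here is a small
sketch that attaches a modifier through the scalar view only. It assumes a
recent Swift toolchain and is not a proposal for new API:

```swift
var s = "pokemon"

// Explicitly drop to the Unicode scalar view to attach COMBINING ACUTE ACCENT
// to the "e"; the Character-level API is never handed a bare combining mark.
if let e = s.unicodeScalars.firstIndex(of: "e") {
    let afterE = s.unicodeScalars.index(after: e)
    s.unicodeScalars.insert("\u{0301}", at: afterE)
}

// Still 7 Characters, but now 8 scalars; "e" + U+0301 reads as one Character.
let characterCount = s.count
let scalarCount = s.unicodeScalars.count
```
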
Félix
_______________________________________________
swift-evolution mailing list
[email protected]
https://lists.swift.org/mailman/listinfo/swift-evolution