> On Jun 9, 2017, at 9:24 PM, Dave Abrahams via swift-evolution
> <[email protected]> wrote:
> on Fri Jun 09 2017, Kevin Ballard <[email protected]> wrote:
>> On Tue, Jun 6, 2017, at 10:57 AM, Dave Abrahams via swift-evolution wrote:
<snip>
>>
>> Ah, right. So a String.Index is actually something similar to
>>
>>     public struct Index {
>>         public var encodedOffset: Int
>>         private var byteOffset: Int // UTF-8 offset into the UTF-8 code unit
>>     }
>
> Similar. I'd write it this way:
>
>     public struct Index {
>         public var encodedOffset: Int
>
>         // Offset into a UnicodeScalar represented in an encoding other
>         // than the String's underlying encoding
>         private var transcodedOffset: Int
>     }
I *think* the following is what the proposal is saying, but let me walk through
it. My understanding is:
- An index manipulated at the string level points to the start of a grapheme
cluster, which is also the start of a particular code point and of a code unit
in the underlying string backing data
- The unicodeScalars view index can be intra-grapheme-cluster, pointing at a
code point
- The UTF-16 index can be intra-code-point, since some code points are
represented by two code units
- The UTF-8 index can be intra-code-point as well, since code points are
represented by up to four code units
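As a rough illustration of those four levels (my own sketch, not something
from the proposal; the sample string is arbitrary), the same string yields a
different count in each view:

```swift
// "e" + COMBINING ACUTE ACCENT, followed by U+1F600 (outside the BMP).
let s = "e\u{301}\u{1F600}"

let characters = s.count               // grapheme clusters
let scalars = s.unicodeScalars.count   // code points
let utf16Units = s.utf16.count         // UTF-16 code units
let utf8Units = s.utf8.count           // UTF-8 code units

print(characters, scalars, utf16Units, utf8Units) // 2 3 4 7
```

The first grapheme spans two code points, and the second code point spans two
UTF-16 code units (four in UTF-8), so each finer-grained view has positions
that the coarser ones cannot address.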
So is the idea of the Index struct that encodedOffset is an offset in the
native representation of the string (byte offset, word offset, etc.) to the
start of a grapheme, and that transcodedOffset is data for the Unicode scalar,
UTF-16 and UTF-8 views to represent an offset within a grapheme to a code point
or code unit?
My feeling is that ‘encoded’ is not enough to distinguish whether encodedOffset
is meant to indicate an offset in graphemes, code points, or code units, or to
specify that an index to the same character in two normalized strings may be
different if one is backed by UTF-8 and the other UTF-16.
“encodedCharacterOffset” may be better.
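To make the encoding dependence concrete, here is a small sketch. It measures
distances in the utf16 and utf8 views as a stand-in for the offset into the
backing storage, which is an assumption on my part about how encodedOffset
would behave for differently backed strings:

```swift
// U+1F600 followed by "x"; we look at the position of "x".
let s = "\u{1F600}x"
let x = s.firstIndex(of: "x")!

// The same character position, measured in code units of each encoding.
let utf16Offset = s.utf16.distance(from: s.utf16.startIndex, to: x)
let utf8Offset = s.utf8.distance(from: s.utf8.startIndex, to: x)

print(utf16Offset, utf8Offset) // 2 4
```

The character is the same, but the code unit offset is 2 in one encoding and 4
in the other, which is the ambiguity I think the name should address.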
This index struct does limit some sorts of imagined string implementations,
such as a string maintained piecewise across multiple allocation units or
strings using a stateful character encoding like ISO/IEC 2022.
-DW
P.S. I’m also curious why the methods fail by returning an optional, rather
than retaining the current API and having them trap with a fatal error.
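For reference, this is the kind of optional-returning conversion I have in
mind, sketched with samePosition(in:); I am assuming it behaves as documented,
returning nil for a position with no exact counterpart in the target view:

```swift
let s = "\u{1F600}" // one code point, two UTF-16 code units

// Position of the trailing surrogate: valid in the UTF-16 view only.
let mid = s.utf16.index(after: s.utf16.startIndex)

// Converting to the unicodeScalars view fails with nil instead of trapping.
let scalarPos = mid.samePosition(in: s.unicodeScalars)

// A position on a code point boundary converts successfully.
let startPos = s.utf16.startIndex.samePosition(in: s.unicodeScalars)

print(scalarPos == nil, startPos != nil) // true true
```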