Re: [swift-evolution] [Review] SE-0180: String Index Overhaul

David Waite via swift-evolution Mon, 12 Jun 2017 11:11:49 -0700

> On Jun 9, 2017, at 9:24 PM, Dave Abrahams via swift-evolution 
> <[email protected]> wrote:
> on Fri Jun 09 2017, Kevin Ballard <[email protected] 
> <mailto:[email protected]>> wrote:
>> On Tue, Jun 6, 2017, at 10:57 AM, Dave Abrahams via swift-evolution wrote:
<snip>
>> 
>> Ah, right. So a String.Index is actually something similar to
>> 
>> public struct Index {
>>    public var encodedOffset: Int
>>    private var byteOffset: Int // UTF-8 offset into the UTF-8 code unit
>> }
> 
> Similar.  I'd write it this way:
> 
> public struct Index {
>   public var encodedOffset: Int
> 
>   // Offset into a UnicodeScalar represented in an encoding other
>   // than the String's underlying encoding
>   private var transcodedOffset: Int 
> }


I *think* the following is what the proposal is saying, but let me walk through 
it:

My understanding is:
- An index manipulated at the string level points to the start of a grapheme 
cluster, which is also the start of a particular code point and of a code unit 
in the string's underlying backing data
- A unicodeScalars index can be intra-grapheme-cluster, pointing at a code point
- A UTF-16 index can be intra-code-point, since some code points are 
represented by two code units
- A UTF-8 index can be intra-code-point as well, since code points are 
represented by up to four code units
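
To make those four granularities concrete, here is a small sketch of my own 
(not taken from the proposal). The sample string is an arbitrary choice: an 
"e" followed by a combining acute accent, then an emoji outside the BMP.

```swift
// Illustration only: "e" + U+0301 COMBINING ACUTE ACCENT, then U+1F600.
let s = "e\u{301}\u{1F600}"

// Grapheme clusters: "e" plus its accent form one Character, the emoji another.
print(s.count)                 // 2

// Code points: "e", U+0301, U+1F600.
print(s.unicodeScalars.count)  // 3

// UTF-16 code units: 1 + 1 + 2 (the emoji needs a surrogate pair).
print(s.utf16.count)           // 4

// UTF-8 code units: 1 + 2 + 4.
print(s.utf8.count)            // 7
```

Every position in a coarser view is also a position in each finer view, but 
not the other way around, which is why the finer views need a way to point 
inside a grapheme cluster or inside a code point.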

So is the idea of the Index struct that encodedOffset is an offset in the 
native representation of the string (byte offset, word offset, etc.) to the 
start of a grapheme, and transcodedOffset is data that lets the Unicode scalar, 
UTF-16 and UTF-8 views represent an offset within a grapheme to a code point or 
code unit?
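
For what it's worth, here is how I would picture the quoted struct if I take 
its comment literally, i.e. transcodedOffset counts code units within a single 
scalar when a view's encoding differs from the storage encoding. This is only 
my sketch: SketchIndex and utf8Positions(ofUTF16Backed:) are made-up names, and 
the assumption that the storage is UTF-16 is mine, not something the proposal 
states for every string.

```swift
// Hypothetical stand-in for the quoted Index struct (not the real String.Index).
struct SketchIndex: Equatable {
    var encodedOffset: Int      // offset in the storage encoding (assumed UTF-16 here)
    var transcodedOffset: Int   // offset inside one scalar, in the view's encoding

    static func == (lhs: SketchIndex, rhs: SketchIndex) -> Bool {
        return lhs.encodedOffset == rhs.encodedOffset
            && lhs.transcodedOffset == rhs.transcodedOffset
    }
}

// Enumerates every UTF-8 view position for a string assumed to be stored as UTF-16.
func utf8Positions(ofUTF16Backed s: String) -> [SketchIndex] {
    var result: [SketchIndex] = []
    var storageOffset = 0
    for scalar in s.unicodeScalars {
        let asString = String(scalar)
        // One position per UTF-8 code unit of this scalar.
        for i in 0..<asString.utf8.count {
            result.append(SketchIndex(encodedOffset: storageOffset,
                                      transcodedOffset: i))
        }
        // Advance by the scalar's width in the (assumed) storage encoding.
        storageOffset += asString.utf16.count
    }
    return result
}

let positions = utf8Positions(ofUTF16Backed: "a\u{1F600}")
for p in positions {
    print(p.encodedOffset, p.transcodedOffset)
}
```

Under this reading "a" gets one position at storage offset 0, and the emoji 
gets four positions that all share storage offset 1 and differ only in 
transcodedOffset.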

My feeling is that ‘encoded’ is not enough to distinguish whether encodedOffset 
is meant to indicate an offset in graphemes, code points, or code units, or to 
specify that an index to the same character in two normalized strings may be 
different if one is backed by UTF-8 and the other UTF-16. 
“encodedCharacterOffset” may be better.

This index struct does limit some sorts of imagined string implementations, 
such as a string maintained piecewise across multiple allocation units or 
strings using a stateful character encoding like ISO/IEC 2022.

-DW

P.S. I’m also curious why the methods are failable (returning an optional) 
rather than retaining the current API and having them trap with a fatal error.
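
To illustrate the distinction I mean, here is a short sketch using the 
failable String.Index(_:within:) initializer. The sample string is my own 
choice, and the force-unwrap at the end is only there to show how a caller 
could opt back into trapping.

```swift
// A single emoji outside the BMP: one Character, two UTF-16 code units.
let s = "\u{1F600}"
let utf16 = s.utf16

let high = utf16.startIndex            // start of the surrogate pair
let low = utf16.index(after: high)     // middle of the surrogate pair

// Failable conversion: succeeds on a Character boundary, nil otherwise.
let atStart = String.Index(high, within: s)
let inMiddle = String.Index(low, within: s)

print(atStart != nil)    // true
print(inMiddle == nil)   // true

// A caller who wants the trapping behaviour can still force-unwrap:
// let forced = String.Index(low, within: s)!   // would trap at runtime
```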

