Hi Dave,
Oops! Yes, you’re right!
I read up again, more thoroughly, on Unicode
and how Unicode is handled within Swift...
(I should have done that before writing anything, sorry.)
Nevertheless:
How about this solution (assuming I am not overlooking something else in my
thinking again):
- Store the string as a collection of fixed-width 32-bit UTF-32 values
anyway.
- However, if the Unicode character is a grapheme cluster (2..n Unicode
scalars), then
store a pointer to a hidden child string containing the actual grapheme
cluster instead, like so:
1: [UTF32, UTF32, UTF32, 1pointer, UTF32, UTF32, 1pointer, UTF32, UTF32]
                            |                       |
                            |                       |
2:                   [UTF32, UTF32]          [UTF32, UTF32, UTF32, ...]
where (1) is aString as seen by the programmer,
and (2) are hidden child strings, each containing one grapheme cluster.
To distinguish a “plain” single UTF-32 value from a grapheme
cluster,
set the most significant bit of the 32-bit value to 1 and use the other 31 bits
as a pointer to another (hidden) String instance containing the grapheme
cluster.
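
The tagging scheme above could be sketched roughly as follows. This is only an illustration under a few assumptions: the type name TaggedUTF32String and its members are made up for this example, the 31-bit "pointer" is modelled as an index into a side array rather than a real memory address, and the sketch leans on Swift's own Character iteration to find the cluster boundaries. The top bit is free for tagging because Unicode scalar values never exceed 0x10FFFF.

```swift
// Hypothetical sketch: `TaggedUTF32String` is an illustrative name,
// not a type from the Swift standard library.
struct TaggedUTF32String {
    // If bit 31 is set, the low 31 bits are an index into `clusters`.
    static let tagBit: UInt32 = 0x8000_0000

    // (1) One fixed-width 32-bit unit per user-perceived character.
    private(set) var units: [UInt32] = []
    // (2) Hidden child strings, stored as arrays of scalar values.
    private(set) var clusters: [[UInt32]] = []

    init(_ string: String) {
        for character in string {
            let scalars = character.unicodeScalars.map { $0.value }
            if scalars.count == 1 {
                // A lone scalar is at most 0x10FFFF, so bit 31 is clear.
                units.append(scalars[0])
            } else {
                // Multi-scalar cluster: store it aside and keep a tagged index.
                clusters.append(scalars)
                units.append(Self.tagBit | UInt32(clusters.count - 1))
            }
        }
    }

    var count: Int { units.count }

    // Constant-time access to the i-th user-perceived character.
    subscript(i: Int) -> Character {
        let unit = units[i]
        let scalars = (unit & Self.tagBit) == 0
            ? [unit]
            : clusters[Int(unit & ~Self.tagBit)]
        var view = String.UnicodeScalarView()
        for value in scalars {
            view.append(Unicode.Scalar(value)!)
        }
        return Character(String(view))
    }
}

// "a", "e" + combining acute accent, "b", and a two-scalar flag sequence.
let tagged = TaggedUTF32String("a\u{65}\u{301}b\u{1F1F3}\u{1F1F1}")
print(tagged.count, tagged[1], tagged[3])
```

Note that this version appends every cluster it meets, even one it has stored before, and that building the structure still requires a full grapheme segmentation pass over the input.
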
In this way, one could then also nest graphemes within graphemes,
but that is probably not desirable. Another solution is to store the grapheme
clusters
in a dedicated “grapheme pool”, containing the unique (as in aSet) grapheme
clusters
encountered whenever a Unicode string (in whatever format) is read in or
defined at runtime.
But then again, seeing how hard it is to recognise grapheme clusters in the
first place...
I don’t know. Unicode is complicated.
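
For what it is worth, the "grapheme pool" idea might look something like the sketch below. The name GraphemePool and its intern method are hypothetical, invented for this example. Uniqueness here means an identical sequence of scalars; canonically equivalent but differently encoded clusters would get separate entries. Recognising the clusters is again left to Swift's own Character iteration.

```swift
// Hypothetical sketch: `GraphemePool` is an illustrative name, not an existing API.
struct GraphemePool {
    // Pool index -> cluster, each stored exactly once.
    private(set) var clusters: [[Unicode.Scalar]] = []
    // Cluster -> pool index, used to detect clusters seen before.
    private var indexByCluster: [[Unicode.Scalar]: Int] = [:]

    // Returns the pool index for `cluster`, adding it only if it is new.
    mutating func intern(_ cluster: [Unicode.Scalar]) -> Int {
        if let existing = indexByCluster[cluster] {
            return existing
        }
        clusters.append(cluster)
        indexByCluster[cluster] = clusters.count - 1
        return clusters.count - 1
    }
}

var pool = GraphemePool()
var poolIndices: [Int] = []
// "e" + combining acute accent appears twice, with a plain "t" in between.
for character in "e\u{301}te\u{301}" where character.unicodeScalars.count > 1 {
    poolIndices.append(pool.intern(Array(character.unicodeScalars)))
}
print(poolIndices, pool.clusters.count)
```

Both occurrences of the accented letter map to the same pool entry, so the pool grows with the number of distinct clusters rather than with the length of the text.
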
Kind regards
TedvG.
www.tedvg.com <http://www.tedvg.com/>
www.ravelnotes.com <http://www.ravelnotes.com/>
> On 6 Feb 2017, at 05:15, Dave Abrahams <[email protected]> wrote:
>
>
>
>> On Feb 5, 2017, at 2:57 PM, Ted F.A. van Gaalen <[email protected]>
>> wrote:
>>
>> However, that is not the case with UTF-32, because with UTF-32 encoding
>> each character has a fixed-width and always occupies exactly 4 bytes, 32
>> bit.
>> Ergo: the problem can be easily solved: The simple solution is to always
>> and without exception use UTF-32 encoding as Swift's internal
>> string format because it only contains fixed width Unicode characters.
>
> Those are not (user-perceived) Characters; they are Unicode Scalar Values
> (often called "characters" by the Unicode standard). Characters as defined in
> Swift (a.k.a. extended grapheme clusters) have no fixed-width encoding, and
> Unicode scalar values are an inappropriate unit for most string processing.
> Please read the manifesto for details.
>
> Sent from my iPad
_______________________________________________
swift-evolution mailing list
[email protected]
https://lists.swift.org/mailman/listinfo/swift-evolution