Re: [swift-evolution] Strings in Swift 4

Ted F.A. van Gaalen via swift-evolution Mon, 06 Feb 2017 09:40:15 -0800

Hi Dave,
Oops! Yes, you’re right!
I read again, more thoroughly, about Unicode
and how Unicode is handled within Swift...
(I should have done that before writing anything, sorry.)

Nevertheless: 

How about this solution (if I am not overlooking something again):
- Store the string as a collection of fixed-width 32-bit UTF-32 code units
  anyway.
- However, if a user-perceived character is a grapheme cluster (2..n Unicode
  scalar values), then store a pointer to a hidden child string containing
  the actual grapheme cluster, like so:

1: [UTF32, UTF32, UTF32, 1pointer, UTF32, UTF32, 1pointer, UTF32, UTF32]
                            |                       |
2:                   [UTF32, UTF32]    [UTF32, UTF32, UTF32, ...]

whereby (1) is a String as seen by the programmer,
and (2) are hidden child strings, each containing a grapheme cluster.
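To make the layout in (1) and (2) concrete, here is a minimal sketch in Swift. The names Slot and makeSlots are hypothetical and not part of the standard library, and the enum case holding an array merely stands in for the "pointer to a hidden child string"; it relies on Swift's existing String iteration to find the cluster boundaries.

```swift
// Hypothetical two-level layout: one slot per user-perceived character.
enum Slot {
    case scalar(Unicode.Scalar)      // a "plain" character made of one scalar value
    case cluster([Unicode.Scalar])   // stands in for the pointer to a hidden child string
}

// Build the slot array (1); multi-scalar clusters become child arrays (2).
func makeSlots(_ s: String) -> [Slot] {
    return s.map { ch -> Slot in
        let scalars = Array(ch.unicodeScalars)
        return scalars.count == 1 ? .scalar(scalars[0]) : .cluster(scalars)
    }
}

// "café!" written with a decomposed é ("e" + U+0301 COMBINING ACUTE ACCENT).
let slots = makeSlots("cafe\u{301}!")
```

With this input, slots has five entries, and the fourth one is a cluster holding two scalar values, so indexing by user-perceived character becomes a plain array lookup.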

To distinguish a “plain” single UTF-32 character from a grapheme cluster,
set the most significant bit of the 32-bit value to 1 and use the other 31 bits
as a pointer to another (hidden) String instance containing the grapheme
cluster.
In this way, one could also nest graphemes within graphemes,
but that is probably not desired? Another solution is to store the grapheme
clusters in a dedicated “grapheme pool”, containing the unique (as in a Set)
grapheme clusters encountered whenever a Unicode string (in whatever format)
is read in or defined at runtime.
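The tag bit and the grapheme pool could be combined roughly as follows. This is only a sketch under stated assumptions: TaggedString, clusterFlag, units, pool and character(at:) are hypothetical names, and the lower 31 bits hold an index into the pool rather than a real pointer, since a pointer would not fit in 31 bits on a 64-bit platform.

```swift
// Hypothetical sketch: 32-bit units whose top bit marks a grapheme-cluster reference.
struct TaggedString {
    static let clusterFlag: UInt32 = 0x8000_0000

    private(set) var units: [UInt32] = []                 // (1) one unit per user-perceived character
    private(set) var pool: [[Unicode.Scalar]] = []        // (2) unique multi-scalar clusters
    private var poolIndex: [[Unicode.Scalar]: UInt32] = [:] // exact scalar sequence -> pool index

    init(_ s: String) {
        for ch in s {
            let scalars = Array(ch.unicodeScalars)
            if scalars.count == 1 {
                // Scalar values never exceed 0x10FFFF, so the top bit is free.
                units.append(scalars[0].value)
            } else {
                let index: UInt32
                if let existing = poolIndex[scalars] {
                    index = existing                       // cluster already in the pool
                } else {
                    index = UInt32(pool.count)
                    pool.append(scalars)
                    poolIndex[scalars] = index
                }
                units.append(TaggedString.clusterFlag | index)
            }
        }
    }

    // Array lookup of the i-th user-perceived character.
    func character(at i: Int) -> Character {
        let unit = units[i]
        if (unit & TaggedString.clusterFlag) == 0 {
            return Character(Unicode.Scalar(unit)!)
        }
        let scalars = pool[Int(unit & ~TaggedString.clusterFlag)]
        var view = String.UnicodeScalarView()
        view.append(contentsOf: scalars)
        return Character(String(view))
    }
}

// Two occurrences of a decomposed é share a single pool entry.
let tagged = TaggedString("e\u{301}xe\u{301}")
```

Note that the sketch still depends on Swift's own String iteration to find the cluster boundaries; it only changes how the result is stored.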

But then again, seeing how hard it is to recognise grapheme clusters in the
first place...
I don’t know. Unicode is complicated.
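For what it is worth, Swift's String already performs this recognition when it is iterated as Characters. A small illustration (the variable names are only for this example):

```swift
// "e" followed by U+0301 COMBINING ACUTE ACCENT: two scalar values, one grapheme cluster.
let decomposed = "e\u{301}"

let characterCount = decomposed.count               // counts user-perceived Characters
let scalarCount = decomposed.unicodeScalars.count   // counts Unicode scalar values
```

Here characterCount is 1 while scalarCount is 2, which is exactly the difference between a Character and a fixed-width UTF-32 value.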

Kind regards 
TedvG. 

www.tedvg.com <http://www.tedvg.com/>
www.ravelnotes.com <http://www.ravelnotes.com/>

> On 6 Feb 2017, at 05:15, Dave Abrahams <[email protected]> wrote:
> 
> 
> 
>> On Feb 5, 2017, at 2:57 PM, Ted F.A. van Gaalen <[email protected]> 
>> wrote:
>> 
>> However, that is not the case with UTF-32, because with UTF-32 encoding
>> each character has a fixed-width and always occupies exactly 4 bytes, 32 
>> bit. 
>> Ergo: the problem can be easily solved: The simple solution is to always 
>> and without exception use UTF-32 encoding as Swift's internal 
>> string format because it only contains fixed width Unicode characters. 
> 
> Those are not (user-perceived) Characters; they are Unicode Scalar Values 
> (often called "characters" by the Unicode standard).  Characters as defined in 
> Swift (a.k.a. extended grapheme clusters) have no fixed-width encoding, and 
> Unicode scalar values are an inappropriate unit for most string processing. 
> Please read the manifesto for details.
> 
> Sent from my iPad

_______________________________________________
swift-evolution mailing list
[email protected]
https://lists.swift.org/mailman/listinfo/swift-evolution
