Re: [swift-evolution] Strings in Swift 4

Dave Abrahams via swift-evolution Wed, 22 Feb 2017 10:21:18 -0800

Ted, that sort of implementation grows many common strings by a factor of 8 and 
makes some less common strings require multiple memory allocations. Considering 
that our research has shown it is a big performance and energy-use win to 
heroically compress 
<https://www.mikeash.com/pyblog/friday-qa-2012-07-27-lets-build-tagged-pointers.html>
strings to avoid both kinds of bloat (plenty of actual data was gathered 
before tagged pointer strings were added to Cocoa), a scheme like the one 
you're proposing is pretty much a non-starter as far as I'm concerned.
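
As a rough illustration of the size difference, here is a minimal sketch. It assumes Swift 4 or later, where String is itself a Collection of Character. The sample string and the variable names are made up for this example, and the per-slot size it reports depends on the Swift version and platform, so treat the numbers as indicative only.

```swift
// Sketch: compare the bytes needed by contiguous code-unit storage with
// the bytes needed by an array holding one Character per grapheme.
let text = "Koala 🐨"

// Contiguous storage: one entry per code unit.
let utf8Bytes = text.utf8.count          // 1-byte code units
let utf16Bytes = text.utf16.count * 2    // 2-byte code units

// Array<Character>: one fixed-size slot per grapheme, not counting any
// extra heap allocation a large grapheme might need.
let characters = Array(text)
let slotBytes = MemoryLayout<Character>.stride
let arrayBytes = characters.count * slotBytes

print("graphemes: \(characters.count)")
print("UTF-8 bytes: \(utf8Bytes), UTF-16 bytes: \(utf16Bytes)")
print("[Character] bytes: \(arrayBytes) (\(slotBytes) per slot)")
```

The array pays the full slot size for every grapheme, including plain ASCII letters, which is where the growth for common strings comes from.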


Sent from my moss-covered three-handled family gradunza

> On Feb 22, 2017, at 5:56 AM, Ted F.A. van Gaalen <[email protected]> 
> wrote:
> 
> Hi Ben,
> thank you, yes, I know all that by now. 
> 
> I have seen that one goes to great lengths to optimise, not only for storage 
> but also for speed. But how far does this need to go? In any case, 
> optimisation should not be used
> as an argument for restricting a programming language's functionality, that 
> is, for leaving out language elements which are common and useful.
> 
> I wouldn’t worry so much about storage (unless one wants to load a complete 
> book into memory). In iOS, the average app is about 15-50 MB, and String data 
> is mostly a fraction of that. In macOS or similar I’d think it is even less 
> significant.
> 
> I wonder how much performance and memory consumption would differ from 
> the current contiguous memory implementation if a String were just a plain 
> row of (references to) Character (extended grapheme cluster) objects, 
> Array<Character>. That would simplify the basic logic and (sub)string 
> handling significantly, because one then has direct access to the String’s 
> elements, using the reasonably fast access methods of a Swift 
> Collection/Array. 
> 
> I have experimented with an alternative String struct based upon 
> Array<Character>, and saw how easy it was to implement most popular string 
> handling functions, as one can work with the Character array directly. 
> 
> Currently at deep-dive-depth in the standard lib sources, especially String & 
> Co.
> 
> Kind Regards
> TedvG
> 
> 
>> On 21 Feb 2017, at 01:31, Ben Cohen <[email protected]> wrote:
>> 
>> Hi Ted,
>> 
>> While Character is the Element type for String, it would be unsuitable for a 
>> String’s implementation to actually use Character for storage. Character is 
>> fairly large (currently 9 bytes), very little of which is used for most 
>> values. For unusual graphemes that require more storage, it allocates more 
>> memory on the heap. By contrast, String’s actual storage is a buffer of 1- 
>> or 2-byte elements, and all graphemes (what we expose as Characters) are 
>> held in that contiguous memory no matter how many code points they comprise. 
>> When you iterate over the string, the graphemes are unpacked into a 
>> Character on the fly. This gives you a user interface of a collection that 
>> superficially appears to resemble [Character], but this does not mean that 
>> this would be a workable implementation.
>> 
>>> On Feb 20, 2017, at 12:59 PM, Ted F.A. van Gaalen <[email protected]> 
>>> wrote:
>>> 
>>> Hi Ben, Dave (you should not read this now, you’re on vacation :o)  & Others
>>> 
>>> As described in the Swift Standard Library API Reference:
>>> 
>>> The Character type represents a character made up of one or more Unicode 
>>> scalar values, 
>>> grouped by a Unicode boundary algorithm. Generally, a Character instance 
>>> matches what 
>>> the reader of a string will perceive as a single character. The number of 
>>> visible characters is 
>>> generally the most natural way to count the length of a string.
>>> The smallest discrete unit we (app programmers) are mostly working with is 
>>> this
>>> perceived visible character, what else? 
>>> 
>>> If that is the case, my reasoning is that Strings could (or should?) be 
>>> relatively simple, 
>>> because most, if not all, complexity of Unicode is confined within the 
>>> Character object and
>>> completely hidden** from the average application programmer, who normally 
>>> only needs
>>> to work with Strings which contain these visible Characters, right? 
>>> It then makes no difference at all what is in the Character 
>>> (excellent implementation btw), 
>>> be it Unicode, ASCII, EBCDIC, Elvish, KlingonIV, IntergalacticV.2, or whatever,
>>> because we rely in sublime oblivion on miraculous font processors, hidden 
>>> in the dark depths of the OS, for the visual representation of whatever 
>>> is in the Character. 
>>> 
>>> Then, in this perspective, my question is: why is String not implemented 
>>> directly on top of an array [Character]? In that case one can refer to 
>>> the Characters of the
>>> String directly, for direct subscripting as well as other String 
>>> functionality, in an efficient way. 
>>> (I do have a scope of independent Swift here, that is, interaction with 
>>> libraries should be 
>>> solved by the compiler, so as not to be restricted by legacy ObjC etc.) 
>>> 
>>> **   (except if one needs to e.g. access individual elements and/or 
>>> compose graphics directly,
>>>       but for this purpose the Character’s properties are accessible) 
>>> 
>>> For the sake of convenience, based upon the above reasoning, I now 
>>> “emulate” this in 
>>> a string extension, thereby ignoring the rare cases in which a visible 
>>> character could be based 
>>> upon more than a single Character (extended grapheme cluster). If that 
>>> were to occur, 
>>> they should be merged into one extended grapheme cluster, a single 
>>> Character that is. 
>>> 
>>> //: Playground - implement direct subscripting using a Character array
>>> // Of course, if String were defined as a directly accessible array of
>>> // Characters, this would be more efficient than these extension functions.
>>> extension String
>>> {
>>>     var count: Int
>>>     {
>>>         get
>>>         {
>>>             return self.characters.count
>>>         }
>>>     }
>>> 
>>>     subscript (n: Int) -> String
>>>     {
>>>         return String(Array(self.characters)[n])
>>>     }
>>>     
>>>     subscript (r: Range<Int>) -> String
>>>     {
>>>         return String(Array(self.characters)[r])
>>>     }
>>>     
>>>     subscript (r: ClosedRange<Int>) -> String
>>>     {
>>>         return String(Array(self.characters)[r])
>>>     }
>>> }
>>> 
>>> func test()
>>> {
>>>     let zoo = "Koala 🐨, Snail 🐌, Penguin 🐧, Dromedary 🐪"
>>>     print("zoo has \(zoo.count) characters (discrete extended graphemes):")
>>>     for i in 0..<zoo.count
>>>     {
>>>         print(i,zoo[i],separator: "=", terminator:" ")
>>>     }
>>>     print("\n")
>>>     print(zoo[0..<7])
>>>     print(zoo[9..<16])
>>>     print(zoo[18...26])
>>>     print(zoo[29...39])
>>>     print("images:" + zoo[6] + zoo[15] + zoo[26] + zoo[39])
>>> }
>>> 
>>> test()
>>> 
>>> This works as intended and generates the following output:
>>> 
>>> zoo has 40 characters (discrete extended graphemes):
>>> 0=K 1=o 2=a 3=l 4=a 5=  6=🐨 7=, 8=  9=S 10=n 11=a 12=i 13=l 14=  15=🐌 16=, 
>>> 17=  
>>> 18=P 19=e 20=n 21=g 22=u 23=i 24=n 25=  26=🐧 27=, 28=  29=D 30=r 31=o 32=m 
>>> 33=e 34=d 35=a 36=r 37=y 38=  39=🐪 
>>> 
>>> Koala 🐨
>>> Snail 🐌
>>> Penguin 🐧
>>> Dromedary 🐪
>>> images:🐨🐌🐧🐪
>>> 
>>> I don’t know how (in)efficient this method is, 
>>> but in many cases this is not as important as it is with e.g. numerical 
>>> computation.
>>> 
>>> I still fail to understand why direct subscripting of strings would be 
>>> unnecessary,
>>> and would like to see this built into Swift asap. 
>>> 
>>> Btw, I do share the concern as expressed by Rien regarding the increasing 
>>> complexity of the language.
>>> 
>>> Kind Regards, 
>>> 
>>> TedvG
>>> 
>>> 
>>>  
>> 
>
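
For comparison with the integer subscripts in the quoted extension, here is a minimal sketch of the same lookup on top of contiguous storage, using only the standard String.Index API. It assumes Swift 4 or later. The helper `character(in:atOffset:)` is a hypothetical name invented for this example and is not part of the standard library. Each call walks the string from its start, so the cost is linear in the offset rather than constant.

```swift
// Hypothetical helper, not part of the standard library: returns the
// grapheme at an integer offset, or nil when the offset is out of range.
func character(in text: String, atOffset offset: Int) -> Character? {
    guard offset >= 0,
          let index = text.index(text.startIndex,
                                 offsetBy: offset,
                                 limitedBy: text.endIndex),
          index < text.endIndex
    else { return nil }
    return text[index]
}

let zoo = "Koala 🐨, Snail 🐌"
if let animal = character(in: zoo, atOffset: 6) {
    print(animal)
}
```

Returning an optional instead of trapping is a design choice made for this sketch; the quoted extension traps on an out-of-range index, as Array does.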

_______________________________________________
swift-evolution mailing list
[email protected]
https://lists.swift.org/mailman/listinfo/swift-evolution
