Ted, that sort of implementation grows many common strings by a factor of 8 and makes some less common strings require multiple memory allocations. Considering that our research has shown it is a big performance and energy-use win to heroically compress <https://www.mikeash.com/pyblog/friday-qa-2012-07-27-lets-build-tagged-pointers.html> strings to avoid both kinds of bloat (plenty of actual data was gathered before tagged pointer strings were added to Cocoa), a scheme like the one you're proposing is pretty much a non-starter as far as I'm concerned.
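To make the size argument a little more concrete, here is a rough sketch that compares the bytes needed for a string's contiguous UTF-8 code units with the element storage of a hypothetical [Character] holding the same text. The sample string and the variable names are illustrative assumptions, and the per-Character figure is whatever the toolchain in use reports (it has differed between Swift releases), so no specific numbers are claimed here.

```swift
// Rough size comparison: contiguous UTF-8 storage vs. one Character per element.
// The per-Character figure is whatever this toolchain reports; it is not fixed.
let text = "Hello, world"                                  // illustrative sample string

let utf8Bytes = text.utf8.count                            // bytes needed as contiguous UTF-8
let bytesPerCharacter = MemoryLayout<Character>.stride     // size of one array element
let characterArrayBytes = text.count * bytesPerCharacter   // element storage only, no array header

print("UTF-8 code units: \(utf8Bytes) bytes")
print("[Character] elements: \(characterArrayBytes) bytes (\(bytesPerCharacter) per element)")
```

For mostly-ASCII text the second figure is a multiple of the first, which is the kind of growth described above; this sketch does not measure the extra heap allocations for larger graphemes.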
Sent from my moss-covered three-handled family gradunza

> On Feb 22, 2017, at 5:56 AM, Ted F.A. van Gaalen <[email protected]> wrote:
>
> Hi Ben,
> thank you, yes, I know all that by now.
>
> I have seen that one goes to great lengths to optimise, not only for storage
> but also for speed. But how far does this need to go? In any case,
> optimisation should not be used as an argument for restricting a PL’s
> functionality, that is, for refraining from PL elements which are common
> and useful.
>
> I wouldn’t worry so much over storage (unless one wants to load a complete
> book into memory…). In iOS, the average app is about 15-50 MB; String data is
> mostly a fraction of that. In macOS or similar I’d think it is even less
> significant…
>
> I wonder how much performance and memory consumption would differ from
> the current contiguous memory implementation if a String were just a plain
> row of (references to) Character (extended grapheme cluster) objects,
> Array<Character>, which would simplify the basic logic and (sub)string
> handling significantly, because then one has direct access to the String’s
> elements, using the reasonably fast access methods of a Swift
> Collection/Array.
>
> I have experimented with an alternative String struct based upon
> Array<Character>, seeing how easy it was to implement most popular string
> handling functions, as one can work with the Character array directly.
>
> Currently at deep-dive depth in the standard lib sources, especially String &
> Co.
>
> Kind Regards
> TedvG
>
>> On 21 Feb 2017, at 01:31, Ben Cohen <[email protected]> wrote:
>>
>> Hi Ted,
>>
>> While Character is the Element type for String, it would be unsuitable for a
>> String’s implementation to actually use Character for storage. Character is
>> fairly large (currently 9 bytes), very little of which is used for most
>> values. For unusual graphemes that require more storage, it allocates more
>> memory on the heap.
>> By contrast, String’s actual storage is a buffer of 1- or 2-byte elements,
>> and all graphemes (what we expose as Characters) are held in that contiguous
>> memory no matter how many code points they comprise. When you iterate over
>> the string, the graphemes are unpacked into a Character on the fly. This
>> gives you a user interface of a collection that superficially appears to
>> resemble [Character], but this does not mean that this would be a workable
>> implementation.
>>
>>> On Feb 20, 2017, at 12:59 PM, Ted F.A. van Gaalen <[email protected]> wrote:
>>>
>>> Hi Ben, Dave (you should not read this now, you’re on vacation :o) & Others
>>>
>>> As described in the Swift Standard Library API Reference:
>>>
>>> The Character type represents a character made up of one or more Unicode
>>> scalar values, grouped by a Unicode boundary algorithm. Generally, a
>>> Character instance matches what the reader of a string will perceive as a
>>> single character. The number of visible characters is generally the most
>>> natural way to count the length of a string.
>>> The smallest discrete unit we (app programmers) are mostly working with is
>>> this perceived visible character, what else?
>>>
>>> If that is the case, my reasoning is that Strings (could / should?) be
>>> relatively simple, because most, if not all, complexity of Unicode is
>>> confined within the Character object and completely hidden** from the
>>> average application programmer, who normally only needs to work with
>>> Strings which contain these visible Characters, right?
>>> It then makes no difference at all what is in the Character
>>> (excellent implementation btw)
>>> (Unicode, ASCII, EBCDIC, Elvish, KlingonIV, IntergalacticV.2, whatever)
>>> because we rely in sublime oblivion, for the visual representation of
>>> whatever is in the Character, on miraculous font processors hidden in the
>>> dark depths of the OS.
>>>
>>> Then, in this perspective, my question is: why is String not implemented
>>> directly upon an array [Character]? In that case one can refer to the
>>> Characters of the String directly, not only for direct subscripting but
>>> also for other String functionality, in an efficient way.
>>> (I do have a scope of independent Swift here, that is, interaction with
>>> libraries should be solved by the compiler, so as not to be restricted by
>>> legacy ObjC etc.)
>>>
>>> ** (except if one needs to e.g. access individual elements and/or
>>> compose graphics directly?
>>> but for this purpose the Character’s properties are accessible)
>>>
>>> For the sake of convenience, based upon the above reasoning, I now
>>> “emulate” this in a string extension, thereby ignoring the rare cases that
>>> a visible character could be based upon more than a single Character
>>> (extended grapheme cluster). If that would occur, they should be merged
>>> into one extended grapheme cluster, a single Character that is.
>>>
>>> //: Playground - implement direct subscripting using a Character array
>>> // of course, when the String is defined as an array of Characters, directly
>>> // accessible, it would be more efficient than in these extension functions.
>>> extension String
>>> {
>>>     var count: Int
>>>     {
>>>         get
>>>         {
>>>             return self.characters.count
>>>         }
>>>     }
>>>
>>>     subscript (n: Int) -> String
>>>     {
>>>         return String(Array(self.characters)[n])
>>>     }
>>>
>>>     subscript (r: Range<Int>) -> String
>>>     {
>>>         return String(Array(self.characters)[r])
>>>     }
>>>
>>>     subscript (r: ClosedRange<Int>) -> String
>>>     {
>>>         return String(Array(self.characters)[r])
>>>     }
>>> }
>>>
>>> func test()
>>> {
>>>     let zoo = "Koala 🐨, Snail 🐌, Penguin 🐧, Dromedary 🐪"
>>>     print("zoo has \(zoo.count) characters (discrete extended graphemes):")
>>>     for i in 0..<zoo.count
>>>     {
>>>         print(i,zoo[i],separator: "=", terminator:" ")
>>>     }
>>>     print("\n")
>>>     print(zoo[0..<7])
>>>     print(zoo[9..<16])
>>>     print(zoo[18...26])
>>>     print(zoo[29...39])
>>>     print("images:" + zoo[6] + zoo[15] + zoo[26] + zoo[39])
>>> }
>>>
>>> test()
>>>
>>> This works as intended and generates the following output:
>>>
>>> zoo has 40 characters (discrete extended graphemes):
>>> 0=K 1=o 2=a 3=l 4=a 5= 6=🐨 7=, 8= 9=S 10=n 11=a 12=i 13=l 14= 15=🐌 16=, 17=
>>> 18=P 19=e 20=n 21=g 22=u 23=i 24=n 25= 26=🐧 27=, 28= 29=D 30=r 31=o 32=m
>>> 33=e 34=d 35=a 36=r 37=y 38= 39=🐪
>>>
>>> Koala 🐨
>>> Snail 🐌
>>> Penguin 🐧
>>> Dromedary 🐪
>>> images:🐨🐌🐧🐪
>>>
>>> I don’t know how (in)efficient this method is, but in many cases this is
>>> not as important as it is with e.g. numerical computation.
>>>
>>> I still fail to understand why direct subscripting of strings would be
>>> unnecessary, and would like to see this built into Swift asap.
>>>
>>> Btw, I do share the concern as expressed by Rien regarding the increasing
>>> complexity of the language.
>>>
>>> Kind Regards,
>>>
>>> TedvG
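
On the efficiency question in the quoted playground code: each of those Int subscripts builds a fresh Array of Characters, so a loop over every position does work proportional to the square of the string's length. Code that really needs repeated random access can opt in by converting once and indexing the result, or it can use String.Index for a one-off slice. The following is only a sketch under assumptions: it uses the later Swift spelling `Array(zoo)` rather than the `characters` view from the thread, and the names `chars`, `images`, `start` and `end` are illustrative.

```swift
// Sketch: pay the O(n) conversion once, then index the array in O(1).
let zoo = "Koala 🐨, Snail 🐌, Penguin 🐧, Dromedary 🐪"

let chars = Array(zoo)                     // one pass over the string, one [Character]
let images = String([chars[6], chars[15], chars[26], chars[39]])
print(images)                              // prints 🐨🐌🐧🐪

// One-off slice without building an array: walk to the offsets with String.Index.
let start = zoo.index(zoo.startIndex, offsetBy: 9)
let end = zoo.index(start, offsetBy: 7)
print(zoo[start..<end])                    // prints Snail 🐌
```

This keeps the conversion cost visible at the call site, while the String itself stays in its compact contiguous form.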
_______________________________________________
swift-evolution mailing list
[email protected]
https://lists.swift.org/mailman/listinfo/swift-evolution
