> On Jan 11, 2018, at 9:46 PM, Chris Lattner via swift-dev 
> <swift-dev@swift.org> wrote:
> 
>>> 
>>> Finally, what tradeoffs do you see between a 1-word vs 2-word string?  Are 
>>> we really destined to have 2-words?  That’s still much better than the 3 
>>> words we have now, but for some workloads it is a significant bloat.
>> 
>> <repeat disclaimer about final details being down to real data>. Some 
>> arguments in favor of 2-word, presented roughly in order of impact:
> 
> Understood.  I don’t have a strong opinion on 1 vs 2 words; either is 
> dramatically better than 3 :-).  I’m glad you’re carefully evaluating the 
> tradeoff.
> 
>> 1. This allows the String type to accommodate llvm::StringRef-style usages. 
>> This is pretty broad usage: “mmap a file and treat its contents as a 
>> String”, “store all my contents in an llvm::BumpPtr which outlives uses”, 
>> un-owned slices, etc. One word String would greatly limit this to only 
>> whole-string nul-terminated cases.
> 
> Yes, StringRef style algorithms are a big deal, as I mentioned in my previous 
> email, but it is also unclear if this will really be a win when shoehorned 
> into String.  The benefit of StringRef is that it is a completely trivial 
> type (both in the SIL sense but also in the implementation sense) and all the 
> primitive ops get inlined.  Given the “all things to all people” design of 
> String, I’m very much afraid that trying to shoehorn this into the String 
> currency type will fail to provide significant wins and thus lead to having a 
> separate StringRef style type anyway.  Providing a StringRef style projection 
> type that is trivial (in the Swift sense) that knows in its static type that 
> it never owns memory seems like the best path.
> 
> By point of comparison, C++ has std::string (yes, sure, with lots of issues) 
> but they still introduced StringRef nee std::string_view instead of wedging 
> it in.
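
To make the "trivial, never owns memory" property concrete, here is a minimal C++ sketch of a StringRef-style view. `BorrowedText` and `slice` are hypothetical names invented for this illustration; this is neither llvm::StringRef nor any proposed Swift API, and the size check assumes a mainstream platform where a pointer and a `size_t` are one word each.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical illustration type: a pointer plus a length, nothing else.
// It borrows bytes owned elsewhere (an mmap'd file, an arena, a literal).
struct BorrowedText {
    const char *data;   // first byte of storage owned by someone else
    std::size_t size;   // length in bytes; no nul terminator is required

    // A sub-slice shares the same storage and never allocates.
    BorrowedText slice(std::size_t start, std::size_t count) const {
        assert(start <= size);  // precondition: start lies within the view
        std::size_t avail = size - start;
        return BorrowedText{data + start, count < avail ? count : avail};
    }
};

// The static type alone tells the compiler there is nothing to retain or release.
static_assert(std::is_trivially_copyable<BorrowedText>::value,
              "copying is a plain bit copy");
static_assert(std::is_trivially_destructible<BorrowedText>::value,
              "destruction does nothing");
// Assumption: no padding, so the view is exactly pointer + length (two words).
static_assert(sizeof(BorrowedText) == sizeof(const char *) + sizeof(std::size_t),
              "pointer + length");
```

Because an interior slice needs its own length rather than a nul terminator, this kind of view cannot be packed into a single word, which is the limitation raised in point 1.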
> 
>> 2. Two-word String fits more small strings. Exactly where along the 
>> diminishing-returns curve 7 vs 15 UTF-8 code units lie is dependent on the 
>> data set. One example is NSString, which (according to reasoning at 
>> https://www.mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html)
>> considered it important enough to have 6- and 5-bit reduced ASCII 
>> character sets to squeeze up to 11-length strings in a word. 15 code unit 
>> small strings would be a super-set of tagged NSStrings, meaning we could 
>> bridge them eagerly in-line, while 7 code unit small strings would be a 
>> subset (and also a strong argument against eagerly bridging them). 
> 
> Agreed, this is a big deal.
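
The 7 vs 15 figures in point 2 can be reproduced with a little arithmetic. This is a rough C++ sketch, assuming a 64-bit target and assuming one byte of the representation is reserved for a tag or length; `kWordBytes`, `smallCapacity` and `fitsInline` are hypothetical names, not part of any real String implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: 64-bit target, so one word is 8 bytes.
constexpr std::size_t kWordBytes = 8;

// Assumption: one byte is reserved for a discriminator/length,
// leaving the remaining bytes for UTF-8 code units.
constexpr std::size_t smallCapacity(std::size_t words) {
    return words * kWordBytes - 1;
}

static_assert(smallCapacity(1) == 7, "1-word small string: 7 code units");
static_assert(smallCapacity(2) == 15, "2-word small string: 15 code units");

// True when the UTF-8 bytes held in `utf8` would fit inline in `words` words.
inline bool fitsInline(const std::string &utf8, std::size_t words) {
    assert(words >= 1);  // zero words would underflow smallCapacity
    return utf8.size() <= smallCapacity(words);
}
```

Under these assumptions `fitsInline("swift-dev", 1)` is false (9 code units) while `fitsInline("swift-dev", 2)` is true. Where real workloads fall between those two limits is the open data question raised above.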
> 
>> If you have access to any interesting data sets and can report back some 
>> statistics, that would be immensely helpful!
> 
> Sadly, I don’t. I’m only an opinionated hobbyist in this domain, one who has 
> coded a lot of string processing over the years and understands at least some 
> of the tradeoffs.
> 
>> 3. More bits available to reserve for future-proofing, etc., though many of 
>> these could be stored in the header.
>> 
>> 4. The second word can cache useful information from large strings. 
>> `endIndex` is a very frequently requested computed property and it could be 
>> stored directly in-line rather than loaded from memory (though perhaps a 
>> load happens anyways in a subsequent read of the string). Alternatively, we 
>> could store the grapheme count or some other piece of information that we’d 
>> otherwise have to recompute. More experimentation needed here.
> 
> This seems weakly motivated: large strings can store end index in the heap 
> allocation.
> 
>> 5. (vague and hand-wavy) Two words fits into a nice groove that 3 words 
>> doesn’t: 2 words is a rule-of-thumb size for very small buffers. It’s a 
>> common heap alignment, stack alignment, vector width, double-word-load 
>> width, etc. 1-word Strings may be under-utilizing available resources; that 
>> is, the second word will often be there for use anyways. The main case where 
>> this is not true and 1-word shines is aggregates of String.
> 
> What is the expected existential inline buffer size going to wind up being?  
> We sized it to 3 words specifically to fit string and array.  It would be 
> great to shrink that to 2 or 1 words.
> 

We are planning to reevaluate the size of the inline buffer based on 
experimental performance data, but we can’t do that in a useful way until the 
size of String has been settled.