> On Jan 11, 2018, at 9:46 PM, Chris Lattner via swift-dev
> <swift-dev@swift.org> wrote:
>
>>>
>>> Finally, what tradeoffs do you see between a 1-word vs 2-word string? Are
>>> we really destined to have 2-words? That’s still much better than the 3
>>> words we have now, but for some workloads it is a significant bloat.
>>
>> <repeat disclaimer about final details being down to real data>. Some
>> arguments in favor of 2-word, presented roughly in order of impact:
>
> Understood. I don’t have a strong opinion on 1 vs 2 words; either is
> dramatically better than 3 :-). I’m glad you’re carefully evaluating the
> tradeoff.
>
>> 1. This allows the String type to accommodate llvm::StringRef-style usages.
>> This is pretty broad usage: “mmap a file and treat its contents as a
>> String”, “store all my contents in an llvm::BumpPtr which outlives uses”,
>> un-owned slices, etc. One word String would greatly limit this to only
>> whole-string nul-terminated cases.
>
> Yes, StringRef style algorithms are a big deal, as I mentioned in my previous
> email, but it is also unclear if this will really be a win when shoehorned
> into String. The benefit of StringRef is that it is a completely trivial
> type (both in the SIL sense but also in the implementation sense) and all the
> primitive ops get inlined. Given the “all things to all people” design of
> String, I’m very much afraid that trying to shoehorn this into the String
> currency type will fail to provide significant wins and thus lead to having a
> separate StringRef style type anyway. Providing a StringRef style projection
> type that is trivial (in the Swift sense) that knows in its static type that
> it never owns memory seems like the best path.
>
> By point of comparison, C++ has std::string (yes, sure, with lots of issues)
> but they still introduced StringRef née std::string_view instead of wedging
> it in.
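
To make the "trivial projection type" idea concrete, here is a minimal sketch,
assuming a hypothetical `UnownedUTF8Slice` (not a standard library type, and
not a proposed design). It is two words, a base address and a count, and its
static type says it never owns memory, so slicing is only pointer arithmetic:

```swift
// Hypothetical non-owning view over UTF-8 bytes. The caller must keep the
// underlying memory alive for as long as the view is used.
struct UnownedUTF8Slice {
    // Base address plus count; no reference-counted storage.
    let bytes: UnsafeBufferPointer<UInt8>

    var count: Int { return bytes.count }

    // Slicing adjusts the pointer and length; nothing is copied or retained.
    func dropFirst(_ n: Int) -> UnownedUTF8Slice {
        precondition(n >= 0 && n <= bytes.count)
        return UnownedUTF8Slice(bytes: UnsafeBufferPointer(rebasing: bytes[n...]))
    }
}

let text: [UInt8] = Array("hello world".utf8)
let tailCount = text.withUnsafeBufferPointer { buffer -> Int in
    let whole = UnownedUTF8Slice(bytes: buffer)
    return whole.dropFirst(6).count   // view over "world"
}
print(tailCount)
```

The view is only valid inside the `withUnsafeBufferPointer` closure, which is
exactly the lifetime obligation a StringRef-style type pushes onto its user.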
>
>> 2. Two-word String fits more small strings. Exactly where along the
>> diminishing-returns curve 7 vs 15 UTF-8 code units lie is dependent on the
>> data set. One example is NSString, which (according to reasoning at
>> https://www.mikeash.com/pyblog/friday-qa-2015-07-31-tagged-pointer-strings.html)
>> considered it important enough to have 6- and 5- bit reduced ASCII
>> character sets to squeeze up to 11-length strings in a word. 15 code unit
>> small strings would be a super-set of tagged NSStrings, meaning we could
>> bridge them eagerly in-line, while 7 code unit small strings would be a
>> subset (and also a strong argument against eagerly bridging them).
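
The 7 vs 15 figures follow from reserving one byte of the payload. A rough
sketch, assuming a hypothetical layout in which the last byte holds the count
(the real layout and discriminator bits are not settled in this thread), and
with `packSmall` as an illustrative helper name:

```swift
// Hypothetical layout: the final payload byte stores the count, so `words`
// 8-byte words hold up to words * 8 - 1 UTF-8 code units.
func packSmall(_ s: String, words: Int) -> [UInt64]? {
    let capacity = words * 8 - 1
    let utf8 = Array(s.utf8)
    guard utf8.count <= capacity else { return nil }   // would need heap storage

    var payload = [UInt8](repeating: 0, count: words * 8)
    payload.replaceSubrange(0..<utf8.count, with: utf8)
    payload[words * 8 - 1] = UInt8(utf8.count)

    // Assemble the bytes into words, first byte in the low bits.
    var result = [UInt64](repeating: 0, count: words)
    for (i, byte) in payload.enumerated() {
        result[i / 8] |= UInt64(byte) << UInt64(8 * (i % 8))
    }
    return result
}

print(packSmall("swift", words: 1) != nil)      // 5 code units, fits in 7
print(packSmall("swift-dev", words: 1) != nil)  // 9 code units, too long for 7
print(packSmall("swift-dev", words: 2) != nil)  // fits in 15
```

Where real data falls between those two capacities is the open question.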
>
> Agreed, this is a big deal.
>
>> If you have access to any interesting data sets and can report back some
>> statistics, that would be immensely helpful!
>
> Sadly, I don’t. I’m only an opinionated hobbyist in this domain, one who has
> coded a lot of string processing over the years and understands at least some
> of the tradeoffs.
>
>> 3. More bits available to reserve for future-proofing, etc., though many of
>> these could be stored in the header.
>>
>> 4. The second word can cache useful information from large strings.
>> `endIndex` is a very frequently requested computed property and it could be
>> stored directly in-line rather than loaded from memory (though perhaps a
>> load happens anyways in a subsequent read of the string). Alternatively, we
>> could store the grapheme count or some other piece of information that we’d
>> otherwise have to recompute. More experimentation needed here.
>
> This seems weakly motivated: large strings can store end index in the heap
> allocation.
>
>> 5. (vague and hand-wavy) Two-words fits into a nice groove that 3-words
>> doesn’t: 2 words is a rule-of-thumb size for very small buffers. It’s a
>> common heap alignment, stack alignment, vector-width, double-word-load
>> width, etc. 1-word Strings may be under-utilizing available resources; that
>> is, the second word will often be there for use anyways. The main case where
>> this is not true and 1-word shines is aggregates of String.
>
> What is the expected existential inline buffer size going to wind up being?
> We sized it to 3 words specifically to fit string and array. It would be
> great to shrink that to 2 or 1 words.
>
We are planning to reevaluate the size of the inline buffer based on
experimental performance data, but we can’t do that in a useful way until the
size of String has been settled.
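
For anyone who wants to check the current numbers on their own toolchain, a
small sketch using `MemoryLayout`. The `Person` struct and `inWords` helper
are illustrative assumptions, and the results depend on the Swift version and
platform, so no expected values are given:

```swift
let word = MemoryLayout<Int>.size

// Convert a byte count to machine words, rounding up.
func inWords(_ bytes: Int) -> Int {
    return (bytes + word - 1) / word
}

// Hypothetical aggregate, used only to show how String size multiplies.
struct Person {
    var first: String
    var last: String
}

print("String:", inWords(MemoryLayout<String>.size), "words")
print("Person (two Strings):", inWords(MemoryLayout<Person>.size), "words")
// An `Any` existential holds the inline buffer plus a type metadata pointer.
print("Any:", inWords(MemoryLayout<Any>.size), "words")
```

Subtracting one word from the `Any` figure gives the inline buffer size under
discussion.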