Re: [swift-dev] Quick pitch: Change Linux’s string comparison to match Darwin’s

Michael Ilseman via swift-dev Tue, 25 Jul 2017 20:00:09 -0700

Unfortunately, after some investigation and discussion, the situation seems to
be murkier. This approach would break transitivity of String comparison on
Linux, at least with any implementation of UCA using the normal collation
weights. A < B and B < C should imply A < C. But if both A and B are known-ASCII
while C is UTF16, A and B are compared by code unit while each is compared to
C by UCA, so transitivity can be broken for any pair of characters that UCA
orders differently than their code units do (e.g. “#” vs “&”). On Darwin, the
comparison implementation happens to preserve transitivity as the platform (in
effect) weights ASCII characters relative to one another by code unit value.
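
To make the failure concrete, here is a minimal sketch of the hybrid
comparison. It is a toy model, not the stdlib code: isASCII, toyCollationKey
and hybridLess are hypothetical names, and the weights are made up rather than
real DUCET values. The only property borrowed from UCA is that “&” collates
before “#”, the reverse of their code unit order.

```swift
// Toy stand-in for the "known-ASCII" storage bit (assumption: derived from
// content here, whereas the real flag is a property of the storage).
func isASCII(_ s: String) -> Bool {
    return s.utf8.allSatisfy { $0 < 0x80 }
}

// Hypothetical collation weights, NOT real DUCET values. They only reproduce
// the relative order "&" < "#"; every other scalar sorts after both.
let toyWeights: [Unicode.Scalar: UInt32] = ["&": 1, "#": 2]

func toyCollationKey(_ s: String) -> [UInt32] {
    return s.unicodeScalars.map { toyWeights[$0] ?? (1000 + $0.value) }
}

// The pitched behavior: code unit order when both sides are ASCII,
// collation order otherwise.
func hybridLess(_ lhs: String, _ rhs: String) -> Bool {
    if isASCII(lhs) && isASCII(rhs) {
        return lhs.utf8.lexicographicallyPrecedes(rhs.utf8)
    }
    return toyCollationKey(lhs).lexicographicallyPrecedes(toyCollationKey(rhs))
}

let a = "#"          // ASCII
let b = "&"          // ASCII
let c = "&\u{E9}"    // "&é", not ASCII

print(hybridLess(a, b))  // true:  0x23 < 0x26 by code unit
print(hybridLess(b, c))  // true:  "&" is a prefix of "&é" in collation order
print(hybridLess(a, c))  // false: "#" collates after "&", so C < A
```

With these inputs A < B and B < C hold but A < C does not, so a sort or a
sorted collection using this predicate would be relying on an ordering that is
not a strict weak order.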


While I would like to land some performance improvements for Linux in time, I
don’t think this approach is viable for Swift 4.0. Unless anyone has ideas for
another minimally invasive approach, my recommendation is to implement the
long-term solution (lexicographical order of normalized code units) immediately
after Swift 4.0.
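
For reference, the long-term ordering could look roughly like the sketch
below. This is only an illustration under assumptions of my own: it uses
Foundation's precomposedStringWithCanonicalMapping (NFC) and compares UTF-8
code units, whereas the actual normal form and code unit width are not decided
here. normalizedLess is a hypothetical name.

```swift
import Foundation

// Sketch: order strings by the lexicographical order of their normalized
// code units. NFC and UTF-8 are assumptions made for this example.
func normalizedLess(_ lhs: String, _ rhs: String) -> Bool {
    let l = lhs.precomposedStringWithCanonicalMapping
    let r = rhs.precomposedStringWithCanonicalMapping
    return l.utf8.lexicographicallyPrecedes(r.utf8)
}

let decomposed = "e\u{301}"   // "e" followed by a combining acute accent
let precomposed = "\u{E9}"    // precomposed "é"

// Canonically equivalent strings normalize to the same code units,
// so neither orders before the other.
print(normalizedLess(decomposed, precomposed))  // false
print(normalizedLess(precomposed, decomposed))  // false

// The same rule applies to ASCII and non-ASCII strings alike.
print(normalizedLess("#", "&"))         // true
print(normalizedLess("&", "&\u{E9}"))   // true
print(normalizedLess("#", "&\u{E9}"))   // true
```

Because one rule is applied regardless of how a string is stored, the result
is a plain lexicographical order over a single key, which is transitive by
construction.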


> On Jul 25, 2017, at 2:01 PM, Michael Ilseman via swift-dev 
> <swift-dev@swift.org> wrote:
> 
> On Darwin, known-ASCII strings are sorted according to the lexicographical 
> ordering of their code units. All non-known-ASCII strings are otherwise 
> ordered based on the UCA[1]. On Linux, however, even known-ASCII strings are 
> ordered based on UCA. I propose to unify these by changing Linux’s string 
> sort order to match Darwin’s in Swift 4.0.
> 
> Background
> 
> Swift’s default ordering for strings is appropriate for machine consumption 
> (e.g. implementing sorted collections). It obeys Unicode canonical 
> equivalence[2]; that is, strings compare the same modulo normalization. 
> However, it is not meant to be sufficient for presenting a meaningful 
> ordering to human consumers, as that requires incorporating reader-specific 
> information (e.g. [3]).
> 
> Known-ASCII strings are a trivial case for the described sort order 
> semantics: pure ASCII is unaffected by normalization. Thus, lexicographical 
> ordering of code units is a valid machine ordering for ASCII strings. On 
> Darwin, this is used to order known-ASCII strings while Linux uses UCA even 
> for known-ASCII strings.
> 
> Long term, the plan is to switch String’s sort order to be the 
> lexicographical ordering of normalized code units (or perhaps scalar values), 
> as mentioned in the String Manifesto[4]. This is a more efficient ordering 
> than that provided by UCA. However, this will not make it in time for Swift 
> 4.0. 
> 
> Changes
> 
> I propose to change Linux’s sort order for known-ASCII strings to be the same 
> as it is on Darwin. This will be accomplished by dropping the relevant #if 
> guards in StringCompare.swift. An example implementation can be found at [5].
> 
> In addition to unifying sort order semantics across platforms, this will also 
> deliver significant performance boosts to pure ASCII strings on Linux.
> 
> [1] UTS #10: Unicode Collation Algorithm <http://unicode.org/reports/tr10/>
> [2] Canonical Equivalence in Applications <http://unicode.org/notes/tn5/>
> [3] UCA: Contextual Sensitivity 
> <http://unicode.org/reports/tr10/#Contextual_Sensitivity>
> [4] String Manifesto: Comparing and Hashing Strings 
> <https://github.com/apple/swift/blob/master/docs/StringManifesto.md#comparing-and-hashing-strings>
> [5] Unifying Linux/Darwin ASCII sort order semantics - github 
> <https://github.com/milseman/swift/commit/5560e13198d5cc284f46bf190f59a2edf7ed747b>

_______________________________________________
swift-dev mailing list
swift-dev@swift.org
https://lists.swift.org/mailman/listinfo/swift-dev
