I think they affect the implementation and, to a small extent, the exposed semantics. It’s also possible that, to provide some of this functionality, it’s necessary to use an NFA implementation where a traditional regex might be able to use a DFA.
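To make the "exposed semantics" point concrete, here is a minimal sketch of the length-changing casing issue described in the quoted ICU text below. It uses Python's standard library rather than ICU, so it only illustrates the general problem and says nothing about how ICU or a Swift regex would be implemented. The helper names `caseless_equal` and `per_code_point_search` are made up for this example.

```python
import re

# Full Unicode case conversion can change a string's length:
# the single code point "ß" uppercases to the two code points "SS".
print("ß".upper())                  # SS
print(len("ß"), len("ß".upper()))   # 1 2


def caseless_equal(a: str, b: str) -> bool:
    """Whole-string caseless comparison using full case folding."""
    return a.casefold() == b.casefold()


def per_code_point_search(pattern: str, text: str) -> bool:
    """Case-insensitive search with Python's re module, which matches
    each literal pattern character against a single code point."""
    return re.search(pattern, text, re.IGNORECASE) is not None


# Full case folding treats "ß" and "ss" as equivalent...
print(caseless_equal("fussball", "fußball"))         # True

# ...but a matcher that works one code point at a time cannot line up
# the 8-character pattern with the 7-character text.
print(per_code_point_search("fussball", "fußball"))  # False
```

The two results differ because one side folds case over the whole string while the other matches position by position, which is the kind of gap a Unicode-aware matcher has to close somewhere in its implementation.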
They certainly impact the implementation to the extent that a traditional regex wouldn’t exhibit Unicode-compliant behavior in some cases.

Debbie

> On Jan 26, 2017, at 7:31 PM, Dave Abrahams via swift-evolution
> <[email protected]> wrote:
>
>
> on Thu Jan 26 2017, Deborah Goldsmith <[email protected]> wrote:
>
>> To throw another ingredient into the mix, there are issues for Unicode regex
>> that don’t appear in more “traditional” regex implementations. See:
>>
>> http://userguide.icu-project.org/strings/regexp
>>
>> For example:
>>
>>> Case insensitive matching is specified by the
>>> UREGEX_CASE_INSENSITIVE flag during pattern compilation, or by the
>>> (?i) flag within a pattern itself. Unicode case insensitive
>>> matching is complicated by the fact that changing the case of a
>>> string may change its length. See
>>> http://unicode.org/faq/casemap_charprop.html for more information on
>>> Unicode casing operations.
>>>
>>> Examples:
>>> • pattern "fussball" will match "fußball" or "fussball"
>>> • pattern "fu(s)(s)ball" or "fus{2}ball" will match "fussball" or
>>>   "FUSSBALL" but not "fußball".
>>> • pattern "ß" will find occurrences of "ss" or "ß"
>>> • pattern "s+" will not find "ß"
>>>
>
> These all appear to be issues for users to consider rather than design
> issues for the regex implementation. Am I mistaken?
>
>> and
>>
>>>
>>> w UREGEX_UWORD Controls the behavior of \b in a pattern. If set,
>>> word boundaries are found according to the definitions of word found
>>> in Unicode UAX 29, Text Boundaries. By default, word boundaries are
>>> identified by means of a simple classification of characters as
>>> either “word” or “non-word”, which approximates traditional regular
>>> expression behavior. The results obtained with the two options can
>>> be quite different in runs of spaces and other non-word characters.
>>>
>>
>> If regexes are going to be used on human language text, these are all
>> important considerations.
>
> Yup, but I don't see how they affect the design, other than that maybe
> matching on a LocalizedString type would use UREGEX_UWORD by default.
>
> --
> -Dave
>
> _______________________________________________
> swift-evolution mailing list
> [email protected]
> https://lists.swift.org/mailman/listinfo/swift-evolution
