on Thu Jan 26 2017, Deborah Goldsmith <[email protected]> wrote:
> To throw another ingredient into the mix, there are issues for Unicode regex
> that don’t appear in more “traditional” regex implementations. See:
>
> http://userguide.icu-project.org/strings/regexp
>
> For example:
>
>> Case insensitive matching is specified by the
>> UREGEX_CASE_INSENSITIVE flag during pattern compilation, or by the
>> (?i) flag within a pattern itself. Unicode case insensitive
>> matching is complicated by the fact that changing the case of a
>> string may change its length. See
>> http://unicode.org/faq/casemap_charprop.html for more information on
>> Unicode casing operations.
>>
>> Examples:
>> • pattern "fussball" will match "fußball" or "fussball"
>> • pattern "fu(s)(s)ball" or "fus{2}ball" will match "fussball" or "FUSSBALL"
>>   but not "fußball".
>> • pattern "ß" will find occurrences of "ss" or "ß"
>> • pattern "s+" will not find "ß"
>>

These all appear to be issues for users to consider rather than design
issues for the regex implementation. Am I mistaken?

> and
>
>>
>> w UREGEX_UWORD Controls the behavior of \b in a pattern. If set,
>> word boundaries are found according to the definitions of word found
>> in Unicode UAX 29, Text Boundaries. By default, word boundaries are
>> identified by means of a simple classification of characters as
>> either “word” or “non-word”, which approximates traditional regular
>> expression behavior. The results obtained with the two options can
>> be quite different in runs of spaces and other non-word characters.
>>
>
> If regexes are going to be used on human language text, these are all
> important considerations.

Yup, but I don't see how they affect the design, other than that maybe
matching on a LocalizedString type would use UREGEX_UWORD by default.

-- 
-Dave

_______________________________________________
swift-evolution mailing list
[email protected]
https://lists.swift.org/mailman/listinfo/swift-evolution
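
For readers who want to see the length-changing case mapping from the quoted
ICU text in action, here is a minimal sketch in Python. It does not use ICU.
It relies on Python's built-in str.casefold() (full case folding) and on the
standard re module, whose IGNORECASE flag matches character by character and
so does not treat "ß" as "ss". The helper names caseless_equal and
simple_regex_matches are hypothetical, chosen only for this illustration.

```python
import re


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using full Unicode case folding."""
    # casefold() may change the length: "ß" folds to "ss".
    return a.casefold() == b.casefold()


def simple_regex_matches(pattern: str, text: str) -> bool:
    """Whole-string, case-insensitive match with Python's re module.

    re.IGNORECASE compares one character at a time, so a single "ß"
    in the text cannot satisfy two "s" characters in the pattern.
    """
    return re.fullmatch(pattern, text, re.IGNORECASE) is not None


if __name__ == "__main__":
    # Upper-casing "ß" yields two characters.
    print(len("ß"), len("ß".upper()))                  # 1 2
    # Full case folding makes the two spellings compare equal.
    print(caseless_equal("fußball", "FUSSBALL"))       # True
    # Per-character matching does not, unlike the ICU behavior quoted above.
    print(simple_regex_matches("fussball", "fußball"))   # False
    print(simple_regex_matches("fussball", "FUSSBALL"))  # True
```

The difference between the last two results is the point of the quoted
examples: whether "fussball" finds "fußball" depends on whether the engine
applies full or simple case folding.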
