On Thu, Feb 19, 2015 at 12:17 PM, Eli Zaretskii <[email protected]> wrote:
> Sorry, I disagree. First, collation data is overkill for search,
> since the order information is not required, so the weights are simply
> wasting storage. Second, people do want to find, e.g., "²" when they
> search for "2" etc.

Depends on what you do. "the weights are simply wasting storage" is not really true: you do have to encode something that says which characters are the same or different, and it turns out that this comes close to defining a sort order. Some people also want to ignore accents, others don't.

As to your original question, Unicode collation would give you primary-equal "mem" and "sigma" characters.

05DE; [63 1E, 05, 05] # Hebr Lo [1F81.0020.0002] * HEBREW LETTER MEM
FB26; [63 1E, 05, 20] # Hebr Lo [1F81.0020.0005] * HEBREW LETTER WIDE FINAL MEM
05DD; [63 1E, 05, 2E] # Hebr Lo [1F81.0020.0019] * HEBREW LETTER FINAL MEM
FB3E; [63 1E, 05, 05][, E5 B1, 05] # Hebr Lo [1F81.0020.0002][0000.005F.0002] * HEBREW LETTER MEM WITH DAGESH

03C3; [5F 42, 05, 05] # Grek Ll [1C95.0020.0002] * GREEK SMALL LETTER SIGMA
03F2; [5F 42, 05, 10] # Grek Ll [1C95.0020.0004] * GREEK LUNATE SIGMA SYMBOL
1D6D3; [5F 42, 05, 17] # Zyyy Ll [1C95.0020.0005] * MATHEMATICAL BOLD SMALL FINAL SIGMA
...
03C2; [5F 42, 05, 33] # Grek Ll [1C95.0020.0019] * GREEK SMALL LETTER FINAL SIGMA

You can certainly simplify a few things when you don't care about the order, which is why CLDR defines "search" tailorings. Some popular browsers use collation-based search for ctrl-F in-page search, either with strength=primary (ignoring accent/case/etc. variants) or with asymmetric search. ICU implements those algorithms and carries the CLDR tailorings.

See http://www.unicode.org/reports/tr10/#Searching

Best regards,
markus
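
As a rough illustration of what "primary-equal" means for the data lines quoted in the message, here is a minimal Python sketch. It is not the collation algorithm and not ICU's API: it only parses the bracketed weights on the left of each "#" and compares them. The names SAMPLE, parse_line, build_table, primary_equal and identical are hypothetical helpers made up for this example, and the line elided with "..." in the message is left out.

```python
import re

# Data lines copied from the message (the elided "..." line is omitted).
SAMPLE = """\
05DE; [63 1E, 05, 05] # Hebr Lo [1F81.0020.0002] * HEBREW LETTER MEM
FB26; [63 1E, 05, 20] # Hebr Lo [1F81.0020.0005] * HEBREW LETTER WIDE FINAL MEM
05DD; [63 1E, 05, 2E] # Hebr Lo [1F81.0020.0019] * HEBREW LETTER FINAL MEM
FB3E; [63 1E, 05, 05][, E5 B1, 05] # Hebr Lo [1F81.0020.0002][0000.005F.0002] * HEBREW LETTER MEM WITH DAGESH
03C3; [5F 42, 05, 05] # Grek Ll [1C95.0020.0002] * GREEK SMALL LETTER SIGMA
03F2; [5F 42, 05, 10] # Grek Ll [1C95.0020.0004] * GREEK LUNATE SIGMA SYMBOL
1D6D3; [5F 42, 05, 17] # Zyyy Ll [1C95.0020.0005] * MATHEMATICAL BOLD SMALL FINAL SIGMA
03C2; [5F 42, 05, 33] # Grek Ll [1C95.0020.0019] * GREEK SMALL LETTER FINAL SIGMA
"""

# One bracketed collation element, e.g. "[63 1E, 05, 05]" or "[, E5 B1, 05]".
CE_PATTERN = re.compile(r"\[([^\[\]]*)\]")


def parse_line(line):
    """Return (character, primary_key, full_key) for one data line."""
    code, rest = line.split(";", 1)
    elements = rest.split("#", 1)[0]  # ignore the comment part after '#'
    primary, full = [], []
    for ce in CE_PATTERN.findall(elements):
        # Each element lists primary, secondary, tertiary weights as hex bytes.
        levels = tuple(
            tuple(int(byte, 16) for byte in level.split())
            for level in ce.split(",")
        )
        if levels[0]:  # an empty primary (as in "[, E5 B1, 05]") is ignorable
            primary.append(levels[0])
        full.append(levels)
    return chr(int(code, 16)), tuple(primary), tuple(full)


def build_table(text):
    """Map each character to its (primary_key, full_key) pair."""
    table = {}
    for line in text.splitlines():
        if line.strip():
            ch, primary, full = parse_line(line)
            table[ch] = (primary, full)
    return table


TABLE = build_table(SAMPLE)


def primary_equal(a, b):
    """True if the two characters have the same primary weights."""
    return TABLE[a][0] == TABLE[b][0]


def identical(a, b):
    """True if the two characters agree on all three listed levels."""
    return TABLE[a][1] == TABLE[b][1]


if __name__ == "__main__":
    mem, final_mem, sigma = "\u05DE", "\u05DD", "\u03C3"
    print(primary_equal(mem, final_mem), identical(mem, final_mem))
    print(primary_equal(mem, sigma))
```

With these lines, MEM and FINAL MEM compare equal at the primary level but differ once the tertiary weight is taken into account, which is the distinction a strength=primary search ignores. For real text, a collation library such as ICU is the right tool; this sketch only reads the weights that happen to be listed in the message.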
_______________________________________________
Unicode mailing list
[email protected]
http://unicode.org/mailman/listinfo/unicode

