Hello,

On Jun 8, 2010, at 11:22 PM, Gerard Meijssen wrote:
> The difference is that it actually does sort according to the CLDR. It
> would be really nice if we did that.

It does not; it sorts according to the partial UCA implementation. We have discussed CLDR in the past - it is a huge collection of distinct collations, and even though it is possible to use LDMLs from the CLDR project, it is a PITA, due to both partial UCA support and the continuous effort needed to rebuild indexes, resolve conflicts, and deal with all sorts of obscure "linguists are not computer scientists" problems :)

On Jun 8, 2010, at 5:28 PM, Paul Houle wrote:

> As a person who has labored mightily to make sense of dbpedia, I
> think that one reason why varbinary is preferable to varchar in many
> applications in wikimedia is that varchar() string comparisons are case
> insensitive and varbinary comparisons are case sensitive.

varchar with a case-insensitive collation is case insensitive; varchar with a binary or case-sensitive collation is case sensitive. varbinary(), otoh, is varchar with the 'binary' character set (if you define the default server charset to be binary, as we do on our 5.x boxes, every varchar you create will be varbinary).

> There are 10,000 or so articles in the english wikipedia that have
> titles that vary only by case. Load those into a varchar(255) and put a
> primary key on them and mysql just won't let you do it.

Depends on the collation, but yes, you are right. There are more concerns there than just case sensitivity, though. Different collations can fold different digraphs or diacritics onto different base characters, causing quite some confusion. Like in my language, ą = a, but š > s :)

> I looked at a sample of those article and came to the conclusion
> that the semantic relations between them are complicated enough that
> they cannot be autosquashed.

Indeed.
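To make the uniqueness point concrete, here is a minimal sketch. It uses SQLite through Python's sqlite3 module as a stand-in for MySQL (my substitution, purely so it runs anywhere), and the "TOY" collation is a hypothetical hand-rolled one that folds case and treats ą as a while keeping š distinct from s. It is not a real CLDR or MySQL collation, and the titles are made up for illustration.

```python
import sqlite3

def toy_key(s):
    # Hypothetical folding: ignore case, treat "ą" as "a", leave "š" alone.
    return s.lower().replace("ą", "a")

def toy_collation(a, b):
    # SQLite expects a negative, zero, or positive result, like strcmp.
    ka, kb = toy_key(a), toy_key(b)
    return (ka > kb) - (ka < kb)

def rejected_titles(collation, titles):
    """Insert titles under a primary key using `collation`;
    return the ones rejected as duplicates."""
    conn = sqlite3.connect(":memory:")
    conn.create_collation("TOY", toy_collation)
    conn.execute(
        f"CREATE TABLE page (title TEXT COLLATE {collation} PRIMARY KEY)"
    )
    rejected = []
    for title in titles:
        try:
            conn.execute("INSERT INTO page (title) VALUES (?)", (title,))
        except sqlite3.IntegrityError:
            rejected.append(title)
    conn.close()
    return rejected

# Made-up titles: one pair differs by case, one by ą/a, one by š/s.
titles = ["Red meat", "Red Meat", "ąžuolas", "ažuolas", "šuo", "suo"]

binary_rejected = rejected_titles("BINARY", titles)
nocase_rejected = rejected_titles("NOCASE", titles)
toy_rejected = rejected_titles("TOY", titles)

print("BINARY:", binary_rejected)
print("NOCASE:", nocase_rejected)
print("TOY:   ", toy_rejected)
```

With the binary collation all six titles fit under the key. SQLite's built-in NOCASE (which only folds ASCII letters) rejects "Red Meat", and the toy collation additionally rejects "ažuolas" while still accepting both "šuo" and "suo". Which rows collide depends entirely on the collation, which is the whole problem.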
If you go for CLDR-like national collations, though, you expose yourself not just to case sensitivity but also to all the different digraph and accented-character mappings, which add even more confusion to your uniqueness constraints.

On Jun 8, 2010, at 3:58 PM, Ryan Chan wrote:

> obviously, varchar(255) binary does not support character outside of BMP.

It does, if you use the very, very horrible hack of the latin1 character set (but I'd always say that is a bad idea, and that the binary charset, aka varbinary, should be used instead).

Domas

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
