Hello,

On Jun 8, 2010, at 11:22 PM, Gerard Meijssen wrote:

> The difference is that it actually does sort according to the CLDR. It
> would be really nice if we did that.

It does not; it sorts according to a partial UCA implementation.

We have discussed CLDR in the past. It is a huge collection of distinct
collations, and even though it is possible to use the LDML files from the CLDR
project, it is a PITA, due to both the partial UCA support and the continuous
effort needed to rebuild indexes, resolve conflicts, and hit all sorts of
obscure "linguists are not computer scientists" problems :)

On Jun 8, 2010, at 5:28 PM, Paul Houle wrote:

>     As a person who has labored mightily to make sense of dbpedia,  I 
> think that one reason why varbinary is preferable to varchar in many 
> applications in wikimedia is that varchar() string comparisons are case 
> insensitive and varbinary comparisons are case sensitive.

varchar with a case-insensitive collation is case insensitive; varchar with a
binary or case-sensitive collation is case sensitive.

varbinary(), otoh, is varchar with the 'binary' character set (if you define
the default server charset to be binary, as we do on our 5.x boxes, every
varchar you create will be a varbinary).

>    There are 10,000 or so articles in the english wikipedia that have 
> titles that vary only by case.  Load those into a varchar(255) and put a 
> primary key on them and mysql just won't let you do it.

Depends on the collation, but yes, you are right. There are more concerns
there than just case sensitivity, though.
Different collations can map different digraphs or different diacritics to 
different codepoints, causing quite some confusion. 

Like in my language, ą = a, but š > s :) 
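
To make the uniqueness problem concrete, here is a minimal Python sketch. It
is not MySQL's collation code: the three key functions are hypothetical
stand-ins for a binary collation, a case-insensitive one, and a toy
"Lithuanian-like" one that folds the ogonek (ą = a) but keeps the caron
(š stays distinct from s). The sample titles are made up.

```python
import unicodedata

OGONEK = "\u0328"  # combining ogonek: NFD turns ą into a + ogonek


def binary_key(title: str) -> bytes:
    # Stand-in for a binary collation: compare the raw UTF-8 bytes.
    return title.encode("utf-8")


def case_insensitive_key(title: str) -> str:
    # Stand-in for a case-insensitive collation.
    return title.casefold()


def toy_lt_key(title: str) -> str:
    # Toy rule (an assumption, not a real CLDR tailoring): ignore case and
    # drop the ogonek, but leave every other diacritic in place.
    decomposed = unicodedata.normalize("NFD", title.casefold())
    return unicodedata.normalize("NFC", decomposed.replace(OGONEK, ""))


def collisions(titles, key):
    # Return pairs of titles that a unique index built on `key` would reject.
    seen = {}
    clashes = []
    for title in titles:
        k = key(title)
        if k in seen:
            clashes.append((seen[k], title))
        else:
            seen[k] = title
    return clashes


if __name__ == "__main__":
    sample = ["Red Meat", "Red meat", "\u0105\u017euolas", "a\u017euolas",
              "\u0161alis", "salis"]
    for name, key in [("binary", binary_key),
                      ("case-insensitive", case_insensitive_key),
                      ("toy lt", toy_lt_key)]:
        print(name, collisions(sample, key))
```

With a binary key every title stays unique; each looser key rejects more
rows, and which rows get rejected depends entirely on the collation chosen.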

> I looked at a sample of those article and came to the conclusion 
> that the semantic relations between them are complicated enough that 
> they cannot be autosquashed.

Indeed. If you go for CLDR-like national collations, you expose yourself not
just to case-insensitivity issues, but also to all the different digraph and
accented-character mappings, which add even more confusion to your uniqueness
constraints.

On Jun 8, 2010, at 3:58 PM, Ryan Chan wrote:

> obviously, varchar(255) binary does not support character outside of BMP.

It does, if you use the very, very horrible hack of the latin1 character set
(but I'd always say that is a bad idea, and the binary charset, aka varbinary,
should be used instead).
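
A rough Python sketch of why this works, and why it is a hack. It assumes the
3-byte-per-character limit of MySQL 5.x's utf8 charset; the function names are
made up for illustration, and Python's latin-1 codec only approximates how a
latin1 column treats each stored byte as one character.

```python
def utf8_width(ch: str) -> int:
    # Number of bytes UTF-8 needs for a single character.
    return len(ch.encode("utf-8"))


def fits_three_byte_utf8(text: str) -> bool:
    # True if every character fits a 3-byte-max utf8 charset (BMP only).
    return all(utf8_width(ch) <= 3 for ch in text)


def store_as_bytes(text: str) -> bytes:
    # What a binary column keeps: the UTF-8 bytes, untouched.
    return text.encode("utf-8")


def latin1_view(raw: bytes) -> str:
    # What the same bytes look like when each byte is read as one character.
    return raw.decode("latin-1")


if __name__ == "__main__":
    clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP
    raw = store_as_bytes(clef)
    print(fits_three_byte_utf8(clef), len(raw), len(latin1_view(raw)))
    # The bytes survive the round trip, but the "string" is 4 junk characters.
    print(latin1_view(raw).encode("latin-1").decode("utf-8") == clef)
```

The bytes come back intact either way, but in the latin1 view lengths and
comparisons operate on mojibake, which is why the binary charset is the
saner choice.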

Domas
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l