Case insensitivity shouldn't be a problem for any language, as long as you do it properly.
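
As a rough illustration of what "properly" could mean, here is a minimal Python sketch (illustrative only; MediaWiki itself is PHP and none of these helpers exist there). It layers a Turkic exception for dotted and dotless i on top of the default Unicode case mappings. The names `upper`, `lower`, `title_key` and the `TURKIC` language set are assumptions made up for this example.

```python
import unicodedata

# Turkic exceptions to the default Unicode case mappings.
# U+0130 is capital dotted I (İ); U+0131 is lowercase dotless i (ı).
_TR_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})  # i -> İ, ı -> I
_TR_LOWER = str.maketrans({"\u0130": "i", "I": "\u0131"})  # İ -> i, I -> ı

# Hypothetical set of language codes that need the special rule.
TURKIC = {"tr", "az"}


def upper(text, lang="en"):
    """Uppercase text, applying the Turkic i rule when lang calls for it."""
    if lang in TURKIC:
        text = text.translate(_TR_UPPER)
    return text.upper()


def lower(text, lang="en"):
    """Lowercase text, applying the Turkic i rule when lang calls for it."""
    if lang in TURKIC:
        text = text.translate(_TR_LOWER)
    return text.lower()


def title_key(text, lang="en"):
    """Hypothetical normalized key for case-insensitive title comparison."""
    text = unicodedata.normalize("NFC", text)
    if lang in TURKIC:
        # Map İ and I to their Turkic lowercase forms before folding.
        text = text.translate(_TR_LOWER)
    # casefold() applies the default Unicode case folding (e.g. ß -> ss).
    return unicodedata.normalize("NFC", text.casefold())
```

With a key like this, "groß", "GROSS" and "gross" all compare equal, while the displayed title can stay exactly as the user typed it. The language would have to come from the wiki's configuration; guessing it from the text itself is not something this sketch attempts.
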
Turkish and other languages that use dotless ı, for example, need a special rule: lowercase dotted i capitalizes to capital dotted İ, while lowercase dotless ı capitalizes to the regular dotless I.

skype: node.ue

On Tue, Jul 28, 2009 at 9:26 AM, Aryeh Gregor<[email protected]> wrote:
> On Tue, Jul 28, 2009 at 11:53 AM, Paul Houle<[email protected]> wrote:
>> I've been looking at the id structure of dbpedia and wikipedia and
>> finally found an example where case sensitivity issues really bite.
>
> We should keep in mind that case isn't so clear-cut if you move away
> from English, though -- is "groß" the same as "GROSS" and thus the
> same as "gross"? How about languages that don't even have bijections
> between uppercase and lowercase if you stick to the same dialect?
> (I'm pretty sure there are some; don't some languages strip diacritics
> from uppercase letters?) There's probably some Unicode standard on
> normalization with respect to case, but it's not actually so simple in
> an international context.
>
> That said, I think case-insensitivity would be a good thing to support
> in the long run, optionally, and that it would probably be suitable
> for all Wikipedias. Or at least almost all, if there are languages
> out there where case insensitivity is a real headache -- hopefully
> not, since most languages don't have letter case at all. At any rate
> it would be good on enwiki.
>
> But it would require a lot of tedious and error-prone conversion of
> old code. Everything tends to assume that a)
> $title->getPrefixedText() is what should be displayed to the user, but
> b) two titles are equal if and only if their
> $title->getPrefixedText()s are equal. Likewise for
> $title->getPrefixedDbKey(). Those would need to be systematically and
> thoroughly fixed. We'd also have to add a field to the page table or
> such to store the normalized form of the title, and fiddle with the
> indexes appropriately, and update all other tables to use the
> normalized form. A lot of work.
>
> (But at least we could get rid of the silly Text/DbKey distinction
> while we're doing this. I've heard recent MySQL versions actually
> support storage of ASCII space characters in text fields!)
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
