Re: [Wikitech-l] More aggressive DEFAULTSORT

Domas Mituzas Thu, 14 May 2009 04:39:38 -0700

Hi!

me and Gerard had some chat about this stuff today, I'll post  
transcript if anyone is interested:


[11:47:03]      GerardM-        Domas, could you explain to me why it would 
make  
a full time job to implement proper collations ? Would it not make  
sense when it is part of the standard MySQL ?
[11:47:37]      domas   well, a) someone would have to actually fix CLDR stuff
[11:47:56]      GerardM-        what is wrong with it ?
[11:47:56]      domas   see, if proposed/draft is better than 'stable', etc
[11:47:59]      domas   it has errors
[11:48:10]      domas   I looked at random collation, and could see errors
[11:48:48]      domas   example: 
http://unicode.org/cldr/data/common/collation/lt.xml?rev=1.26&content-type=text/vnd.viewcvs-markup
[11:49:36]      GerardM-        what is wrong with it ? and also is there a  
better approach to this ?
[11:49:52]      GerardM-        it seems to me that having a standard for this 
is  
the way to go
[11:50:03]      domas   well, status quo is people maintaining this separately
[11:50:07]      domas   we at mysql maintain it separately
[11:50:10]      domas   oracle maintain it separately
[11:50:10]      domas   etc
[11:50:14]      domas   CLDR is not a standard yet
[11:50:25]      domas   it is a good source of information, yes
[11:50:37]      GerardM-        you mean it is not THE standard that everyone  
adheres to
[11:50:44]      domas   it is not standard per se
[11:50:48]      domas   CLDR is a project, a collection
[11:50:59]      domas   a repository
[11:51:16]      GerardM-        it is treated with the same care as other  
standards .. eg ISO-639
[11:51:29]      domas   well, probably the care is similar, yes, just the  
amount of data there is huge
[11:51:38]      domas   and it is a bit sloppy
[11:51:45]      domas   I'm sure ISO639 has way less data
[11:51:46]      domas   :)
[11:51:48]      domas   anyway
[11:51:52]      domas   b)
[11:51:56]      domas   actual merging of this
[11:52:14]      domas   this includes validating if the collation is  
_indexable_
[11:52:32]      domas   see, this defines comparison rules
[11:52:50]      domas   to simplify
[11:52:59]      domas   this is possible: a>b, b>c, c>a
[11:53:09]      GerardM-        given that we are talking about UTF-8 
collation,  
there should be no problem ordering them
[11:53:19]      domas   no, there is always problem
[11:53:30]      domas   say, in some collation 'll' is same as 'l'
[11:53:35]      domas   (thats real fact ;-)
[11:53:55]      GerardM-        in proper Dutch collation ij is one character
[11:53:58]      domas   right
[11:54:05]      domas   how to index that?
[11:54:48]      domas   I've opened quite a few bugs on issues like that in  
past few years :)
[11:54:52]      domas   in mysql
[11:54:56]      GerardM-        the UTF-8 solution would be that there is a  
seperate character for the ij
[11:55:15]      domas   yes, but if people enter 'ij' it has to be treated  
in same way as that special character
[11:55:20]      GerardM-        there are many such issues
[11:55:22]      GerardM-        yes
[11:55:24]      domas   and how does prefixindex work for 'i%' ?
[11:55:30]      domas   should it include ij's or not?
[11:55:32]      domas   :)
[11:55:38]      domas   (it won't)
[11:55:41]      domas   oh well.
[11:55:44]      domas   or it may
[11:55:46]      domas   but it will break indexing
[11:56:00]      domas   c)
[11:56:04]      GerardM-        ok so the question is how can we have MySQL and 
 
the CLDR people work together
[11:56:25]      GerardM-        because it is in everyone's interest to get to 
a  
standard solution
[11:57:09]      domas   well, mysql supports LDMLs or whichever was the format
[11:57:14]      domas   here we come to c)
[11:57:18]      domas   and c) is
[11:57:23]      domas   deployment
[11:57:47]      domas   there're two major problems here
[11:58:19]      domas   during conversion to different charset one would  
have diverging datasets in different orders
[11:58:45]      domas   so, replication can cause data drift
[11:58:53]      domas   another is uniqueness
[11:59:00]      domas   e.g. the change I showed you in lt.xml
[11:59:09]      domas   it is changing unique constraints
[11:59:21]      domas   'i' and 'y' were separate in 'standard', and are  
together in 'draft'
[12:00:06]      domas   so, there would have to be a way to resolve all the  
collisions in page titles
[12:00:15]      domas   which is no longer a technology problem, as  
community has to be involved
[12:00:50]      GerardM-        sorting and constraints are imho separate things
[12:00:57]      domas   you think so :))
[12:02:16]      GerardM-        when i and y are equal, you show them next to  
each other
[12:02:17]      GerardM-        when the constraint has it that there can only 
be  
one i or y it is a separate rule that can be implemented in a  
different way
[12:02:24]      domas   well
[12:02:31]      domas   you cannot have different constraints for ordering/ 
uniqueness
[12:02:33]      GerardM-        I am sure that it can be implemented in a 
stored  
procedure
[12:02:36]      domas   as in, different rulesets
[12:03:51]      domas   tertiary weights would help here, probably, but on  
the other hand
[12:03:56]      domas   'i' and 'y' _are_ same :)
[12:04:09]      domas   or, well, who is authority on uniqueness then
[12:04:13]      domas   if we have one on sorting
[12:04:36]      GerardM-        the issue you raised was how to deal with  
existing data where i and y both exist
[12:05:06]      domas   well, or 'i' and 'I' as this would lead to case  
insensitive unique constraints
[12:05:31]      GerardM-        so allowing for existing garbage means you have 
 
to be able to sort. uniqueness rules would then exist for new content
[12:05:49]      domas   you can't
[12:05:56]      domas   uniqueness rules are enforced by indexing
[12:06:11]      GerardM-        that is one way of doing it
[12:06:25]      GerardM-        technically the best
[12:06:35]      domas   I understand the other way
[12:06:39]      domas   and I don't want to hear about it ;-)
[12:06:44]      GerardM-        <grin>
[12:07:04]      domas   it would probably mean converting all the indexing  
into non-unique
[12:07:12]      domas   not having any varchar PKs
[12:07:31]      domas   building another layer for multiple-weight comparisons
[12:07:51]      domas   etc
[12:07:57]      domas   complexity is insane
[12:08:01]      domas   we don't do insanely complex things :))
[12:08:09]      GerardM-        So the question is, how do we get to an open  
standard for this ?
[12:08:23]      GerardM-        and how do we get some proper sorting for  
MediaWiki ?
[12:08:24]      domas   well, let's say CLDR _is_ an open standard
[12:08:38]      domas   even then, there's huge work to migrate all that  
data to MW's data store
[12:08:40]      domas   and maintain it there
[12:08:48]      domas   as by a), b), c)
[12:08:50]      domas   let me add d)
[12:09:01]      GerardM-        CLDR data IS already implemented in some of the 
 
extensions !!
[12:09:03]      domas   amount of DBA work, with staging
[12:09:12]      domas   CLDR data is easy to implement
[12:09:20]      domas   some of CLDR data is implemented as mysql  
collations, or will be
[12:09:27]      domas   maybe all mysql collations will move to CLDR data in  
future
[12:09:46]      domas   bugs like that get opened: 
http://bugs.mysql.com/bug.php?id=37898
[12:10:27]      GerardM-        That is cool ..
[12:10:47]      GerardM-        so people can ask MySQL to implement particular 
 
collations ?
[12:10:48]      domas   and... collation changes are _incompatible_changes_  
on datasets
[12:10:54]      domas   of course
[12:11:04]      GerardM-        do they get implemented ?
[12:11:08]      domas   takes a while
[12:11:14]      domas   it goes to next alpha/development release
[12:11:28]      GerardM-        right but that is normal in standards as well
[12:11:50]      GerardM-        so how long would it take .. given a positive  
approach
[12:12:55]      domas   make mediawiki support CLDR - easy
[12:12:59]      domas   as in, the stack
[12:13:12]      domas   make the stack support CLDR as a process suitable  
for Wikimedia - extreme pain
[12:14:25]      GerardM-        on the phone
[12:14:28]      domas   okie
[12:14:35]      domas   (btw, can I publish this transcript on mailing  
list? :)
[12:14:45]      domas   it is quite informative
[12:15:14]      GerardM-        <grin> we are not done talking and I was to ask 
 
you if I could blog about this :)
[12:15:34]      domas   I'm just asking in advance!
[12:15:43]      domas   btw, as a fun note
[12:15:59]      domas   CLDR does not resolve problems, where dictionaries  
are needed for sorting
[12:16:07]      domas   my favorite example, provided by River: Welsh.
[12:16:13]      domas   \o/
[12:17:10]      domas   gonna go grab some coffee too
[12:20:29]      domas   hold on, switching back from GPRS to DSL
[12:20:39]      *       Disconnected ().
[12:20:50]      *       Disconnected ().
[12:21:00]      *       Disconnected ().
[12:21:46]      *       domas_ sets mode -e domas_
[12:21:46]      *       services. sets mode +e domas
[12:21:49]      domas   back!
[12:45:19]      GerardM-        Hoi
[12:45:25]      GerardM-        back from the phone
[12:46:25]      GerardM-        given that there are over 7000 languages and 
that  
there are over 25000 linguistic entities (including dialects and  
orthographies) it is certainly a "growth area
[12:48:28]      GerardM-        What I am looking for is that we at least 
support  
the WMF languages properly and that is already enough of a challenge
[13:04:59]      domas   hehe
[13:05:02]      domas   as I've said
[13:05:03]      domas   for now
[13:05:06]      domas   it is huge huge effort
[13:05:35]      GerardM-        we can do one language at a time <grin> and 
start  
with English </grin>
[13:07:40]      GerardM-        What you do not say is that there is an  
alternative to this ... so it is a job that needs doing
[13:08:11]      GerardM-        and imho making sure that it fits in with what 
is  
the standard makes it a best effort
[13:08:32]      domas   english is easiest
[13:08:35]      domas   kind of
[13:08:44]      domas   because we sort english properly even now %)
[13:09:18]      domas   and to have this going forward, we need multiple  
people to work on this project
[13:09:34]      domas   DBA, engine development, content conflict testing,  
collation evaluation, etc
[13:09:44]      domas   + more hardware for regular staging :)
[13:09:59]      domas   + rethinking of case sensitivity issues
[13:10:15]      GerardM-        if we would do it TOGETHER with the Unicode  
people, we could get funding for it
[13:10:37]      domas   I personally am too lazy to work on this project :))
[13:11:38]      GerardM-        there are other things where you can make a  
difference as well
[13:12:19]      GerardM-        I learned from Siebrand at Berlin about 
something  
that would make our servers more efficient and thereby allow for  
growth for a couple of months ...
[13:12:28]      GerardM-        had to do with loading messages
[13:12:42]      domas   thats for appservers
[13:12:43]      domas   :)
[13:12:46]      domas   thats my plan!
[13:12:49]      domas   my project!
[13:13:12]      GerardM-        That is certainly important
[13:13:32]      GerardM-        and <grin> there is certain to be much more 
that  
needs doing
[13:13:37]      domas   yes
[13:13:49]      domas   all this CLDR stuff is _very_ complex, when it comes  
to actual implementation
[13:14:00]      domas   do note, we're already way ahead of curve of others  
when it comes to unicode
[13:14:11]      domas   when I had to talk about certain charset/collation  
issues in a conference
[13:14:19]      domas   the only public implementation of unicode  
normalization in software was mediawiki
[13:14:20]      domas   :)
[13:14:30]      GerardM-        <grin> you know my rants about Lingala ? We 
sure  
have a long way to go !! </grin>
[13:22:03]      GerardM-        Would MySQL be interested in this project ?
[13:22:55]      domas   file a bug, and see! :)
[13:23:20]      GerardM-        this would be to implement CLDR collation on 
the  
Wikipedia projects
[13:23:48]      GerardM-        for them it would be a research project ?
[13:24:00]      domas   why research project
[13:24:03]      domas   we know all the problems already :)
[13:24:19]      GerardM-        the problems of implementing CLDR ?
[13:24:29]      domas   as I've said, MySQL supports LDML
[13:24:48]      domas   
http://dev.mysql.com/doc/refman/5.0/en/adding-collation-unicode-uca.html
[13:25:07]      domas   as I've said
[13:25:09]      domas   CLDR is just a repo
[13:25:13]      GerardM-        skyoing
[13:25:15]      domas   LDML is spec, standard
[13:30:04]      GerardM-        reading about LDML to find the relation to the  
collation
[13:32:41]      domas   lol
[13:32:47]      domas   Each collation must have a unique ID. To add a new  
collation, you must choose an ID value that is not currently used. The  
value must be in the range from 1 to 254. The collation ID that you  
choose will show up in these contexts:
[13:32:48]      domas   hahahahaha
[13:32:51]      domas   thats awesome limitation
[13:33:15]      domas   As of MySQL 6.0.8, the range of IDs from 1024 to  
2047 is reserved for user-defined collations.
[13:33:18]      domas   thats better
[13:33:23]      GerardM-        yes
[13:33:24]      GerardM-        indeed
[13:36:06]      GerardM-        so how do we currently deal with the languages  
from India where the order of Unicode is almost certainly to be wrong
[13:36:17]      domas   well, currently we're using byte order
[13:36:24]      domas   it is not any kind of unicode order
[13:36:35]      GerardM-        so there is no proper sorting
[13:36:36]      domas   as utf8 is variable length, offsets of character  
starts are different
[13:43:57]      GerardM-        fyi Martin Benjamin is working on the Afrigen  
project, a project to get CLDR data for African languages
[13:44:20]      domas   as I've said
[13:44:22]      domas   using that data statically
[13:44:28]      domas   on a mediawiki deployment
[13:44:30]      domas   is very easy
[13:44:34]      domas   the process is difficult
[13:45:36]      GerardM-        is having this deployed in different ways per  
database instance going to have an impact ?
[13:46:03]      domas   it isn't problem with an instance
[13:46:06]      domas   it is problem with data
[13:46:45]      GerardM-        right and this get you in either conversions or 
 
in inabilties to deal with them
[13:47:13]      domas   yup
[13:47:18]      domas   it is both problematic and labor-intensive work
[13:47:32]      domas   anyway
[13:47:38]      domas   switching to generic unicode sorting
[13:47:41]      domas   would be huge step ahead
[13:47:55]      domas   but we're not ready for that atm, simpy because  
until mysql6.0, there's no unicode support outside BMP
[13:48:45]      GerardM-        so when would mysql6 be an option ?
[13:49:19]      domas   *shrug*, it is alpha now
[13:49:50]      GerardM-        <grin> it would be an option when you say.. I  
will support it </grin>
[13:50:10]      domas   yes, kind of


_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] More aggressive DEFAULTSORT

Reply via email to