https://bugzilla.wikimedia.org/show_bug.cgi?id=164
--- Comment #186 from Philippe Verdy <[email protected]> 2010-07-24 05:53:30 UTC ---
(In reply to comment #183)
> Upgrading the collation can be done in-place. The worst case is that
> categories sort weirdly for a few hours. Also, we would only realistically
> have to change the collation often on smaller wikis, since the largest wikis
> should have high-quality collations that cover almost all their pages to begin
> with. I don't think we need to adjust the schema to prepare for this.

That's a bad assumption: even the highest-quality collations will need to be updated from time to time:

- Unicode evolves and new characters get encoded (new versions are published about every 1-2 years, after synchronization and final balloting at both ISO WG2 and the UTC).
- The content of the Unicode DUCET is NOT stable: characters are inserted into the sequence, so all collation weights after each insertion point have to be shifted.
- Collations for languages get corrected. We should be able to upgrade these rules when the CLDR project produces new tailorings (CLDR updates are published separately, about every 1-2 years).

These corrections may be rare (every few months), but when they occur, an upgrade could take many hours and could force the site to go offline while the sortkeys are recomputed; otherwise NO correction will be possible.

Upgrading "in place" is effectively what I proposed, but how will you track which pages need to be reindexed? A collation ID in the stored index really helps to determine which collation rule was used to generate each stored sortkey; in addition, it makes it possible to support multiple collations. This is the means by which the "in place" recomputation can safely be done.

Note: truncating the sortkeys will ALWAYS be needed, simply because the database column will still have a length limit.
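To make the collation ID idea above concrete, here is a minimal sketch in Python with an in-memory SQLite table. Everything in it is an assumption made for illustration: the table name "links", its columns, the integer IDs and the two toy collation rules are hypothetical and are not the MediaWiki schema or a real UCA/CLDR implementation. It only shows how a stored ID lets stale rows be found and recomputed in small batches while the site stays online.

```python
import sqlite3

# Hypothetical collation rules, keyed by an integer collation ID.
# Real rules would come from a UCA/CLDR library; these are toy stand-ins.
COLLATIONS = {
    1: lambda text: text.encode("utf-8"),             # old rule: raw bytes
    2: lambda text: text.casefold().encode("utf-8"),  # new rule: ignores case
}


def reindex_batch(db, current_id, batch_size=2):
    """Recompute up to batch_size rows whose stored collation ID is stale.

    Returns the number of rows updated; 0 means the upgrade is complete.
    """
    rows = db.execute(
        "SELECT rowid, page_title FROM links WHERE collation_id <> ? LIMIT ?",
        (current_id, batch_size),
    ).fetchall()
    for rowid, title in rows:
        db.execute(
            "UPDATE links SET sortkey = ?, collation_id = ? WHERE rowid = ?",
            (COLLATIONS[current_id](title), current_id, rowid),
        )
    db.commit()
    return len(rows)


def demo():
    """Store three rows under rule 1, then upgrade them to rule 2 in batches."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE links (page_title TEXT, sortkey BLOB, collation_id INTEGER)"
    )
    for title in ["banana", "Apple", "Cherry"]:
        db.execute(
            "INSERT INTO links VALUES (?, ?, 1)", (title, COLLATIONS[1](title))
        )
    batches = 0
    while reindex_batch(db, current_id=2):
        batches += 1
    titles = [t for (t,) in db.execute("SELECT page_title FROM links ORDER BY sortkey")]
    return batches, titles
```

Between batches the table holds a mix of old and new sortkeys, which is the "sorts weirdly for a few hours" window; the stored ID is what tells the job which rows are still pending.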
Truncating is not so bad anyway, because:

- the compact binary sequence of primary collation weights comes at the beginning of the sort key. The remaining length is used to store the compacted sequence of secondary collation weights, then the sequence of tertiary collation weights.
- if truncation occurs, the only effect is that the smallest differences will not be represented.

But even if you decide to store only non-truncated sort keys, you'll still have the case of pages with long names, plus the case where someone has specified for a page a very long {{DEFAULTSORT:sortkey}} or a very long text in the second parameter of [[category:...|sortkey]]. To avoid this:

- page names already have a length limit. This also limits the length of the sort keys computed from them alone.
- we should truncate the string given in {{DEFAULTSORT:sortkey}} or [[category:...|sortkey]] so that the concatenation of this string and of the page name can be used to compute the binary sortkey.

If you can accept arbitrary lengths, go with it, but it will be unsafe and your schema will not be able to put that in a sortable column (you'll only be able to put it in a referenced BLOB, just like the text of articles, and databases can never sort external BLOBs).

Anyway, you did not reply to the idea of first developing the parser functions and testing them. Development of the SQL schema extension should not be attempted before at least the first function, {{SORTKEY:text|locale|level}}, has been fully developed and tested on specific pages (it can be tested easily in tables). And with just this function, it should be possible on specific wikis to use it immediately to sort specific categories (for example through templates that use the function).

The second function, {{COLLATIONMAP:text|locale|level|clusters}}, is not needed immediately to develop the schema, but will be useful to restore the functionality of headings.
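The truncation argument above can be illustrated with a toy three-level sort key in Python. This is only a sketch under loud assumptions: it is NOT the UCA and not what {{SORTKEY:...}} would produce. The function name toy_sortkey, the 0x00 level separator and the weights (base letter, accent flag, case flag) are all invented here to show why cutting the tail of a layered key loses only the finest distinctions.

```python
import unicodedata


def toy_sortkey(text, max_len=None):
    """Build a toy three-level sort key and optionally truncate it.

    Layout: primary weights, 0x00, secondary weights, 0x00, tertiary weights.
    Primary = case-folded base letters, secondary = accent flags,
    tertiary = case flags. A stand-in for real collation weights.
    """
    primary, secondary, tertiary = bytearray(), bytearray(), bytearray()
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            if secondary:
                secondary[-1] = 2  # accent on the previous base letter
            continue
        primary += ch.casefold().encode("utf-8")
        secondary.append(1)
        tertiary.append(2 if ch.isupper() else 1)
    key = bytes(primary) + b"\x00" + bytes(secondary) + b"\x00" + bytes(tertiary)
    return key if max_len is None else key[:max_len]
```

With full keys, "cote", "Cote" and "c\u00f4te" are all distinguished (case at the third level, the accent at the second). Truncated to 4 bytes, the three keys become identical, but "cote" still sorts before "cotz", because the primary weights sit at the start of the key.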
Headings don't need to be stored, as they can be computed on the fly, directly by reading the sorted result set of the SQL query sequentially: you can compute headings from the returned page names, or from the existing stored "cl_sortkey", which should now be used ONLY to store the plain text specified in articles with {{DEFAULTSORT:sortkey}} and [[category:...|sortkey]].

The existing cl_sortkey is just a forced "hint"; it does not make the sort order unique. When no such hint is specified, it should remain completely empty in the new schema. It will always be locale-neutral and will take precedence over the page name: to sort the pages effectively, the content of cl_sortkey and the page name should always be concatenated internally to compute the binary sortkey for the various locales.
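Computing headings on the fly from an already-sorted result set, as described above, could look roughly like the following Python sketch. The helper names are hypothetical, the rows are assumed to arrive as (cl_sortkey hint, page name) pairs in sorted order, and the heading rule (first character, upper-cased) is only a placeholder for whatever a function like {{COLLATIONMAP:...}} would return.

```python
from itertools import groupby


def effective_sort_text(hint, page_name):
    """Concatenate the optional cl_sortkey hint with the page name.

    An empty hint leaves the page name alone, so the hint takes
    precedence only when an editor actually supplied one.
    """
    return hint + page_name


def heading_for(sort_text):
    """Toy heading rule: first character, upper-cased (placeholder only)."""
    return sort_text[:1].upper()


def with_headings(sorted_rows):
    """Group already-sorted (hint, page_name) rows under computed headings.

    Nothing is stored: headings are derived while reading rows in order.
    """
    keyed = (
        (heading_for(effective_sort_text(hint, name)), name)
        for hint, name in sorted_rows
    )
    return [
        (head, [name for _, name in group])
        for head, group in groupby(keyed, key=lambda pair: pair[0])
    ]
```

For example, the rows ("", "Apple"), ("", "avocado"), ("Beatles", "The Beatles"), ("", "Cherry") yield the headings A, B and C, with "The Beatles" filed under B because of its hint.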
