https://bugzilla.wikimedia.org/show_bug.cgi?id=164
--- Comment #178 from Philippe Verdy <verd...@wanadoo.fr> 2010-07-22 10:48:39 UTC ---

Because his specification is really incomplete, and because he said that Bug#164 was useless (despite the fact that I had described my solution extensively in Bug#164 long before Ariyeh started working on it).

And yes, before any attempt to change the schema, I support first developing and extensively testing builtin parser functions, backed by PHP functions that will later be shared to support the updated SQL schema. This development alone will face significant difficulties:

* notably integrating ICU in a PHP module installation, or
* rewriting the collation algorithms entirely in PHP;
* having to support the DUCET updates caused by new Unicode versions or corrections;
* having to support multiple collation orders through per-locale tailorings (coming from CLDR or from other sources).

The need to support upgraded collation orders is also an important decision factor for the schema if sortkeys are stored in a SQL backend, which is why I raise it so early:

* collations supported by SQL backends have very strong limitations: any upgrade would require shutting down the servers for hours or days to rebuild the collated indexes;
* they are also missing a full ISO 10646 "level 3 implementation" for the support of supplementary planes.

All of this can be avoided completely by using ICU instead of depending on SQL backends to support the many collation locales that we need in our international projects:

* the schema just needs to be able to store multiple sortkeys, so that newer sortkeys (computed with the new rules) can be progressively computed in the background by a bot or server script, or by upgrades occurring on the fly when processing articles.
* older sortkeys that were using an older generation of a rule can be deleted in a simple DELETE operation after the new collation rule for a corrected locale has been made the default one, or can be deleted one by one each time a new-generation sortkey has been recomputed and inserted (there is not even a need to perform the two successive operations in a transaction, as long as the first INSERT with the new rule has been successful).

Because we now have multiple sortkeys per indexed page in a category, we can conveniently support multiple sortkeys for different locales and offer a good experience to users who want alternate sort orders (notably Chinese users who want presentations in radical/stroke order, or in pinyin order).

----

Another note about how to serialize the opaque sortkeys: the builtin function {{SORTKEY:text|locale|level}} described above will not limit the length of the generated binary sortkey; however, it should serialize it as valid Unicode text that can be used in tables. A convenient serialization of bytes to characters that will sort correctly is Base-36 using the alphabet [0-9A-Z] (no padding necessary) or Base-32 (it avoids modular arithmetic but serializes into longer strings).

If sortkeys are to be stored in and retrieved from the SQL schema, and sorted by the SQL clause "ORDER BY...sortkey...", then:

- either the SQL backend allows storing and sorting binary sequences of bytes as VARBINARY(N): then no extra serialization is needed; store the opaque sortkey directly, after truncating it to the maximum length (N) indicated in the SQL type of the "sortkey" table column.
- or the SQL backend does not support sortable binary sequences of arbitrary bytes and can only sort VARCHAR(N): then use a similar Base-32 or Base-36 conversion to create compatible sortkeys, and store the converted string after truncating it to the maximum length (N) indicated in the SQL type of the "sortkey" table column.
- in both cases, the stored sortkeys will NEVER be exposed to users; their sole purpose is to make the SQL "ORDER BY" clause work properly. To start listing a category from a given arbitrary Unicode text, use the "start=" HTTP query parameter and internally compute the sortkey associated with it, to generate the value used in the SQL clause "WHERE sortkey >= 'value'".
- Section headings in categories never need to be stored: they are generated on the fly from the page names retrieved in the SQL result set, using the {{COLLATIONMAP:}} function with the locale specified in the "uselang=" HTTP query parameter and the specified (or default) "clusters=" parameter (whose default will be 1 or 0 as indicated above). They will be directly readable by users and do not require decoding anything from the stored sortkey.
- the readable collation mappings and the opaque sortkeys should be coherent within the same locale, but they are clearly different: page names that are collation-mapped should sort in the same natural order as the section headings generated from them, so there is absolutely no need to generate sortkeys from collation-mapped headings computed on the fly.
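
To illustrate the VARCHAR(N) case above, here is a minimal sketch (in Python rather than PHP, purely for illustration) of a Base-32 serialization that keeps the byte order of the binary sortkeys. It rests on several assumptions that are not part of the proposal itself: the function name serialize_sortkey and its max_len parameter are hypothetical, the alphabet is the "base32hex" one from RFC 4648 (0-9 then A-V, chosen because its characters are in ascending ASCII order), and the VARCHAR column is assumed to be compared with a binary or ASCII collation. The sample byte strings merely stand in for real collation keys, which would come from ICU.

```python
# Hypothetical sketch: order-preserving Base-32 serialization of an opaque
# binary sortkey, for backends that can only sort VARCHAR(N) columns.
# Assumption: the RFC 4648 "base32hex" alphabet, whose characters appear in
# ascending ASCII order, so comparing the text matches comparing the bytes.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def serialize_sortkey(key, max_len=None):
    """Return `key` (bytes) as unpadded Base-32 text, truncated to max_len."""
    bits = 0    # bit accumulator
    nbits = 0   # number of pending bits in the accumulator
    out = []
    for byte in key:
        bits = (bits << 8) | byte
        nbits += 8
        # Emit every complete 5-bit group, most significant bits first.
        while nbits >= 5:
            nbits -= 5
            out.append(ALPHABET[(bits >> nbits) & 0x1F])
        bits &= (1 << nbits) - 1  # keep only the bits not yet emitted
    if nbits:
        # Left-align the remaining bits in a final group (zero-filled).
        out.append(ALPHABET[(bits << (5 - nbits)) & 0x1F])
    text = "".join(out)
    return text if max_len is None else text[:max_len]


if __name__ == "__main__":
    # Stand-ins for binary collation keys (real ones would come from ICU).
    binary_keys = [b"\x2a\x30", b"\x2a", b"\xff\x01", b"\x00\x7f"]
    by_bytes = sorted(binary_keys)
    by_text = sorted(binary_keys, key=serialize_sortkey)
    print(by_bytes == by_text)  # both orderings agree
    # A "start=" value is serialized the same way before being used in
    # WHERE sortkey >= 'value'.
    print(serialize_sortkey(b"\x2a", max_len=255))
```

Truncating to N characters can make two distinct long sortkeys compare equal, but it never reverses their order, so a secondary ORDER BY column would still be needed to break such ties.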