https://bugzilla.wikimedia.org/show_bug.cgi?id=164
--- Comment #219 from Philippe Verdy <[email protected]> 2011-07-20 11:20:08 UTC --- I note the following comment in r80443: * Changed Title::getCategorySortkey() to separate its parts with a line break instead of a null character. All collations supported by the intl extension ignore the null character, i.e. "ab" == "a\0b". It would have required a lot of hacking to make it work. If you had read my comments above, I had already spoken since long about the choice of the separator to use between the prefix and the rest of the full page name: this had to be a non-ignorable character, that is clearly lower than any character usable in the full pagename. I had also stated which character to use, namely the TAB character that is the first non-ignorable (i.e. with non-all-zeroes weights). See my Comment #190 for the rationale (one year ago, much longer before your recent fix to use LF, because before yu had tried with NULL, which was wrong, and even without any separator which was also wrong). You could have avoided these bugs, as I had analyzed it long before. OK, LF (U+000A) is OK (because it is not allowed in full page names), even it is not the first one. --- One word for performance and space: you have put too many fields in table "categorylinks". Only one essential think is needed there for referencial constraints and counting: the UNIQUE index on (cl_from, cl_to), which is architectural. As this index is unique, it should share the same store as the table itself. But SQL engines will never use more than one index per table to read it, because it requires additional random-access cursors to access the table. As the current unique index uses less fields than the table, it implicitly creates a random-access reference from this index to the table (basically, using internal rowids stored in the index, instead of reading directly from the table). To avoid this problem, you have to make sure that the table is formatted so that its first fields match exactly the same order as the unique index. This is the case for the first unique index created just after that. So the table and the index share the same store, however the number of rows in the index per storage page will be limited (the table can use up to 1KB per row, this does not give a very compact and fast search in the index, as it will load too many storage pages in memory, even when using a "very" selective search). You may optimize searches (at the price of storage space), by indicating that the unique index should still use a separate store. Normally this is the role of the specifier "PRIMARY" which indicates to not use a separate store (it is still UNIQUE). For now you don't have any PRIMARY index, only a UNIQUE index which only uses a separate index store. Now comes the two main kind of requests performed by MediaWiki: (1) getting the list of categories in which a page is currently listed (including when saving a page because this list must be flushed out and replaced by the categories found when parsing the page). Ideally, this should require at least an index on the cl_from. The UNIQUE index on (cl_from,cl_to) works perfectly there because it is in separate store. Other attributes in the table will not even be read, unless the SQL query wants to get these attributes (if this is the case, both the index and table store will be read, the index being read very rapidly and joined with the table using the internal table row id, stored in the index with the two keys). In all what I have seen in the MediaWiki code, this does not occur (including when rendering a page, to display the list of categories, it just needs this list but gets it from the page's wiki code parsing before saving it, and this gets saved in the page cache). (2) getting the list of pages in a category. There you need to use collation as well because this list is sorted. Ideally, this should use an index and table that just contain the necessary fields, and that the SQL SORT BY clause will use those same fields, in the same order (and either all in ASCENDING, or all in DESCENDING order). This is the second index: CREATE INDEX /*i*/cl_sortkey ON /*_*/categorylinks (cl_to, cl_type, cl_sortkey, cl_from); Obviously, it cannot use the same index store as the table. The introduction of the "cl_type" field is recent. It supposes that you will be paging the category views by page type, in the order set by cl_type, which is: ENUM ('page', 'subcat', 'file'). I wonder if this is correct and if it's the expected order. But anyway this design means that the current code will actually perform three separate requests, each one will be paged separately: one request to get the list of normal pages, another to get the list of subcats, and another to get the list of files. This will be inefficient, and hard to manage for the paging, unless these three types can be effectively paged separately in the MediaWiki HTML UI (this is not the case, as of today). So if we can only do paging on all categorized pages, whatever their type, the "cl_type" field in the index causes the "SORT BY cl_sortkey" clause in the SQL query to build a clostly temporary index, instead of reading directly from the index where it will finally get the "cl_from" field to display. So as long as we cannot have separate paging by type, I really suggest that you either drop the "cl_type" field from this index, or that you place it after "cl_sortkey" needed for fast "ORDER BY". I.e. : CREATE INDEX /*i*/cl_sortkey ON /*_*/categorylinks (cl_to, cl_sortkey, cl_from); My opinion if that "cl_type" is not even needed there, the page id stored in "cl_from" already has this information, because it will be always be used as a reference to the "pages" table to get the actual page name to display, as well as its namespace, none of them being stored in the "categorylinks" table itself. Note that because this is a secondary index (in a separate store), it is still good to include 'cl_from' there, even if it's not needed for the selectibity of this index, just to avoid an additional random-access cursor to the "categorylinks" table to be used by the SQL engine using table row ids also stored in this index (SQL engines prefer to use referencial cursors, if possible, instead of creating temporary table/index stores for cross joins between tables and/or indexes, except if both joined parts are not very selective ; as this is SQL-engine specific to their internal query optimizer, I won't discuss more about the tricky details ; it's just simpler to avoid the selectivy problem here by including the small "cl_from" field in this index which does not fill up much space in the encoded index row and still allows a good compaction of many rows in the index pages). So why "cl_type" is there (in the index) if we still cannot perform paging separately on these types when viewing a category content ? Even if you do so later, it will require separate SELECT requests, with separate sorts, and separate paging every 200 items. In this case only, "cl_type" will be specified as a constant and will become very selective, and probably better viewed with separate indexes (so even "cl_type" will not be needed if you have separate tables by splitting the "categorylinks" table for each type). If you really want to maintain the "cl_type" field, for me it's just an informational denormalization of something really represented by "cl_from". Beware of the effects of such denormalizations (I hope that a page id will never change its type, including when a page is renamed to another namespace; but I won't bet it; I think this may be the cause of future database inconsistencies). For me the rest of the datashema is not prepared to handle such separation (notably between cl_type="page" and cl_type="file"). And in fact if you really want to maintain it in the index, as a convenience to simplify the MediaWiki client code perorming the SQL queries, it should better be, for now, only the last field of this index: CREATE INDEX /*i*/cl_sortkey ON /*_*/categorylinks (cl_to, cl_sortkey, cl_from, cl_type /* denormalized: see cl_from */); This will still avoid the ORDER BY clause of your queries to create an expensive temporary index when viewing and paging the (sorted) contents of a category (notably for very populated categories, notably on Wiktionnaries that contain categories for large collections of words per language, with 200 000 entries or more). -- Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email ------- You are receiving this mail because: ------- You are on the CC list for the bug. _______________________________________________ Wikibugs-l mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
