[Bug 164] Support collation by a certain locale (sorting order of characters)

bugzilla-daemon Wed, 20 Jul 2011 04:21:09 -0700

https://bugzilla.wikimedia.org/show_bug.cgi?id=164


--- Comment #219 from Philippe Verdy <[email protected]> 2011-07-20 11:20:08 
UTC ---
I note the following comment in r80443:
* Changed Title::getCategorySortkey() to separate its parts with a line break
instead of a null character. All collations supported by the intl extension
ignore the null character, i.e. "ab" == "a\0b". It would have required a lot of
hacking to make it work.

If you had read my comments above, I had already spoken since long about the
choice of the separator to use between the prefix and the rest of the full page
name: this had to be a non-ignorable character, that is clearly lower than any
character usable in the full pagename.

I had also stated which character to use, namely the TAB character that is the
first non-ignorable (i.e. with non-all-zeroes weights). See my Comment #190 for
the rationale (one year ago, much longer before your recent fix to use LF,
because before yu had tried with NULL, which was wrong, and even without any
separator which was also wrong).

You could have avoided these bugs, as I had analyzed it long before.

OK, LF (U+000A) is OK (because it is not allowed in full page names), even it
is not the first one.

---

One word for performance and space: you have put too many fields in table
"categorylinks". Only one essential think is needed there for referencial
constraints and counting: the UNIQUE index on (cl_from, cl_to), which is
architectural. As this index is unique, it should share the same store as the
table itself. But SQL engines will never use more than one index per table to
read it, because it requires additional random-access cursors to access the
table. As the current unique index uses less fields than the table, it
implicitly creates a random-access reference from this index to the table
(basically, using internal rowids stored in the index, instead of reading
directly from the table).

To avoid this problem, you have to make sure that the table is formatted so
that its first fields match exactly the same order as the unique index. This is
the case for the first unique index created just after that. So the table and
the index share the same store, however the number of rows in the index per
storage page will be limited (the table can use up to 1KB per row, this does
not give a very compact and fast search in the index, as it will load too many
storage pages in memory, even when using a "very" selective search).

You may optimize searches (at the price of storage space), by indicating that
the unique index should still use a separate store. Normally this is the role
of the specifier "PRIMARY" which indicates to not use a separate store (it is
still UNIQUE). For now you don't have any PRIMARY index, only a UNIQUE index
which only uses a separate index store.

Now comes the two main kind of requests performed by MediaWiki:

(1) getting the list of categories in which a page is currently listed
(including when saving a page because this list must be flushed out and
replaced by the categories found when parsing the page). Ideally, this should
require at least an index on the cl_from. The UNIQUE index on (cl_from,cl_to)
works perfectly there because it is in separate store. Other attributes in the
table will not even be read, unless the SQL query wants to get these attributes
(if this is the case, both the index and table store will be read, the index
being read very rapidly and joined with the table using the internal table row
id, stored in the index with the two keys). In all what I have seen in the
MediaWiki code, this does not occur (including when rendering a page, to
display the list of categories, it just needs this list but gets it from the
page's wiki code parsing before saving it, and this gets saved in the page
cache).

(2) getting the list of pages in a category. There you need to use collation as
well because this list is sorted. Ideally, this should use an index and table
that just contain the necessary fields, and that the SQL SORT BY clause will
use those same fields, in the same order (and either all in ASCENDING, or all
in DESCENDING order). This is the second index:

CREATE INDEX /*i*/cl_sortkey ON /*_*/categorylinks (cl_to, cl_type, cl_sortkey,
cl_from);

Obviously, it cannot use the same index store as the table. The introduction of
the "cl_type" field is recent. It supposes that you will be paging the category
views by page type, in the order set by cl_type, which is:
ENUM ('page', 'subcat', 'file').

I wonder if this is correct and if it's the expected order. But anyway this
design means that the current code will actually perform three separate
requests, each one will be paged separately: one request to get the list of
normal pages, another to get the list of subcats, and another to get the list
of files. This will be inefficient, and hard to manage for the paging, unless
these three types can be effectively paged separately in the MediaWiki HTML UI
(this is not the case, as of today).

So if we can only do paging on all categorized pages, whatever their type, the
"cl_type" field in the index causes the "SORT BY cl_sortkey" clause in the SQL
query to build a clostly temporary index, instead of reading directly from the
index where it will finally get the "cl_from" field to display.

So as long as we cannot have separate paging by type, I really suggest that you
either drop the "cl_type" field from this index, or that you place it after
"cl_sortkey" needed for fast "ORDER BY". I.e. :

CREATE INDEX /*i*/cl_sortkey ON /*_*/categorylinks (cl_to, cl_sortkey,
cl_from);

My opinion if that "cl_type" is not even needed there, the page id stored in
"cl_from" already has this information, because it will be always be used as a
reference to the "pages" table to get the actual page name to display, as well
as its namespace, none of them being stored in the "categorylinks" table
itself. 

Note that because this is a secondary index (in a separate store), it is still
good to include 'cl_from' there, even if it's not needed for the selectibity of
this index, just to avoid an additional random-access cursor to the
"categorylinks" table to be used by the SQL engine using table row ids also
stored in this index (SQL engines prefer to use referencial cursors, if
possible, instead of creating temporary table/index stores for cross joins
between tables and/or indexes, except if both joined parts are not very
selective ; as this is SQL-engine specific to their internal query optimizer, I
won't discuss more about the tricky details ; it's just simpler to avoid the
selectivy problem here by including the small "cl_from" field in this index
which does not fill up much space in the encoded index row and still allows a
good compaction of many rows in the index pages).

So why "cl_type" is there (in the index) if we still cannot perform paging
separately on these types when viewing a category content ? Even if you do so
later, it will require  separate SELECT requests, with separate sorts, and
separate paging every 200 items. In this case only, "cl_type" will be specified
as a constant and will become very selective, and probably better viewed with
separate indexes (so even "cl_type" will not be needed if you have separate
tables by splitting the "categorylinks" table for each type).

If you really want to maintain the "cl_type" field, for me it's just an
informational denormalization of something really represented by "cl_from".
Beware of the effects of such denormalizations (I hope that a page id will
never  change its type, including when a page is renamed to another namespace;
but I won't bet it; I think this may be the cause of future database
inconsistencies). For me the rest of the datashema is not prepared to handle
such separation (notably between cl_type="page" and cl_type="file").

And in fact if you really want to maintain it in the index, as a convenience to
simplify the MediaWiki client code perorming the SQL queries, it should better
be, for now, only the last field of this index:

CREATE INDEX /*i*/cl_sortkey ON /*_*/categorylinks (cl_to, cl_sortkey, cl_from,
cl_type /* denormalized: see cl_from */);

This will still avoid the ORDER BY clause of your queries to create an
expensive temporary index when viewing and paging the (sorted) contents of a
category (notably for very populated categories, notably on Wiktionnaries that
contain categories for large collections of words per language, with 200 000
entries or more).

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are on the CC list for the bug.

_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l

[Bug 164] Support collation by a certain locale (sorting order of characters)

Reply via email to