daniel added a comment.

In light of the above comments regarding prefixes and namespaces, here are a few thoughts about the database schema for connecting the pages. It seems we need the following fields:

  • cgnt_wiki: the wiki ID
  • cgnt_title: the page title (including namespace)
  • cgnt_key: a normalized version of the title (including namespace)

Here, cgnt_wiki+cgnt_title are unique; cgnt_wiki+cgnt_key are also unique. Pages to link are all rows with the same cgnt_key, one per wiki.
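
As a rough illustration of this first layout, here is a minimal sketch using Python's sqlite3 module. The table name cognate_titles, the helper names and the sample rows are made up for the example. It also assumes that "pages to link" means rows on different wikis that share a key.

```python
import sqlite3


def build_simple_schema():
    """Create the three-column table with both uniqueness constraints."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE cognate_titles (  -- hypothetical table name
            cgnt_wiki  TEXT NOT NULL,
            cgnt_title TEXT NOT NULL,
            cgnt_key   TEXT NOT NULL,
            UNIQUE (cgnt_wiki, cgnt_title),
            UNIQUE (cgnt_wiki, cgnt_key)
        )
    """)
    return conn


def linked_pages(conn, key):
    """Return (wiki, title) for every row sharing the given key."""
    rows = conn.execute(
        "SELECT cgnt_wiki, cgnt_title FROM cognate_titles "
        "WHERE cgnt_key = ? ORDER BY cgnt_wiki",
        (key,),
    )
    return rows.fetchall()


conn = build_simple_schema()
# Sample rows are invented for illustration only.
conn.executemany("INSERT INTO cognate_titles VALUES (?, ?, ?)", [
    ("enwiktionary", "Berlin", "Berlin"),
    ("dewiktionary", "Berlin", "Berlin"),
    ("frwiktionary", "Paris", "Paris"),
])
print(linked_pages(conn, "Berlin"))
```

With the two UNIQUE constraints in place, a second page on the same wiki that normalizes to an existing key is rejected at insert time rather than silently linked.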

Note that multiple titles could potentially get normalized to the same key, creating a conflict. This should be rare and would very likely be the result of a mistake, but the software needs to recover from such a situation gracefully, particularly when one of the conflicting pages gets renamed or deleted.
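
To make the conflict case concrete, here is a small sketch that groups titles by (wiki, key) and reports keys claimed by more than one title. The normalization rules in normalize_title are an assumption for the example (Unicode NFC plus folding the typographic apostrophe and ellipsis); the actual rules are not settled in this discussion.

```python
import unicodedata
from collections import defaultdict


def normalize_title(title):
    """ASSUMED normalization: NFC, then fold U+2019 and U+2026 to ASCII."""
    title = unicodedata.normalize("NFC", title)
    return title.replace("\u2019", "'").replace("\u2026", "...")


def find_conflicts(pages):
    """Map (wiki, key) to the sorted titles that collide on that key."""
    by_key = defaultdict(list)
    for wiki, title in pages:
        by_key[(wiki, normalize_title(title))].append(title)
    return {k: sorted(v) for k, v in by_key.items() if len(v) > 1}


# Invented sample data: two spellings of the same word on one wiki.
pages = [
    ("enwiktionary", "don't"),
    ("enwiktionary", "don\u2019t"),
    ("dewiktionary", "don\u2019t"),
]
conflicts = find_conflicts(pages)
print(conflicts)
```

A check like this could run before inserting, so that a conflicting page is logged and skipped instead of failing the whole update.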

However, this table will become very tall, because it has roughly one entry per content page on all Wiktionary projects combined. So we should try to make the rows narrower and remove redundant information. For instance:

  • cgnt_wiki: the wiki ID (int ID referencing another table)
  • cgnt_namespace: a "virtual" namespace id (int referencing another table that has namespace names and IDs for each virtual namespace, for each wiki)
  • cgnt_title: the page title (no namespace)
  • cgnt_key: a normalized version of the title (no namespace)

Here, cgnt_wiki+cgnt_namespace+cgnt_title are unique; cgnt_wiki+cgnt_namespace+cgnt_key are also unique. Pages to link are all rows with the same cgnt_namespace+cgnt_key, one per wiki. In order to construct the titles to link to, the actual per-wiki namespace names and IDs need to be looked up.
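
The narrower layout could look roughly like the following sqlite3 sketch. The lookup tables wikis and virtual_namespaces, their columns, and all sample values (including the namespace names and IDs) are hypothetical and not checked against the real wikis. It again assumes that pages to link share a virtual namespace and key across wikis.

```python
import sqlite3


def build_compact_schema():
    """Create the compact table plus two hypothetical lookup tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE wikis (
            wiki_id   INTEGER PRIMARY KEY,
            wiki_name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE virtual_namespaces (
            vns_id        INTEGER NOT NULL,  -- virtual namespace id
            wiki_id       INTEGER NOT NULL REFERENCES wikis (wiki_id),
            local_ns_id   INTEGER NOT NULL,  -- namespace id on that wiki
            local_ns_name TEXT NOT NULL,     -- empty for the main namespace
            PRIMARY KEY (vns_id, wiki_id)
        );
        CREATE TABLE cognate_titles (
            cgnt_wiki      INTEGER NOT NULL REFERENCES wikis (wiki_id),
            cgnt_namespace INTEGER NOT NULL,
            cgnt_title     TEXT NOT NULL,
            cgnt_key       TEXT NOT NULL,
            UNIQUE (cgnt_wiki, cgnt_namespace, cgnt_title),
            UNIQUE (cgnt_wiki, cgnt_namespace, cgnt_key)
        );
    """)
    return conn


def full_titles(conn, namespace, key):
    """Return (wiki name, prefixed title) for all pages to link."""
    rows = conn.execute("""
        SELECT w.wiki_name, n.local_ns_name, t.cgnt_title
        FROM cognate_titles AS t
        JOIN wikis AS w ON w.wiki_id = t.cgnt_wiki
        JOIN virtual_namespaces AS n
          ON n.vns_id = t.cgnt_namespace AND n.wiki_id = t.cgnt_wiki
        WHERE t.cgnt_namespace = ? AND t.cgnt_key = ?
        ORDER BY w.wiki_name
    """, (namespace, key))
    return [(wiki, f"{ns}:{title}" if ns else title) for wiki, ns, title in rows]


conn = build_compact_schema()
# All values below are illustrative only.
conn.executemany("INSERT INTO wikis VALUES (?, ?)",
                 [(1, "enwiktionary"), (2, "dewiktionary")])
conn.executemany("INSERT INTO virtual_namespaces VALUES (?, ?, ?, ?)", [
    (0, 1, 0, ""), (0, 2, 0, ""),
    (1, 1, 100, "Appendix"), (1, 2, 102, "Anhang"),
])
conn.executemany("INSERT INTO cognate_titles VALUES (?, ?, ?, ?)", [
    (1, 1, "Colours", "Colours"),
    (2, 1, "Colours", "Colours"),
    (1, 0, "Berlin", "Berlin"),
])
print(full_titles(conn, 1, "Colours"))
```

The join on virtual_namespaces is the extra lookup cost this layout pays in exchange for the smaller rows.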

Also note that cgnt_title and cgnt_key are usually the same. We can potentially save a lot of room by setting cgnt_title to null unless it is different from cgnt_key. This means that we can't have a unique key on cgnt_wiki+cgnt_namespace+cgnt_title, so we can only query by key, not by title.

This approach makes constructing the actual page title more complex (use the title if not null, key otherwise), and updates for page deletion and renaming need to be based on the key, not the title. This may be problematic if there are multiple conflicting pages with the same key.
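
A minimal sketch of the nullable-title variant, again with sqlite3 and made-up helper names and sample data: the title is stored only when it differs from the key, reads fall back to the key via COALESCE, and deletion is keyed on cgnt_key.

```python
import sqlite3


def build_sparse_schema():
    """Create the table with a nullable title and a key-only unique index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE cognate_titles (
            cgnt_wiki      INTEGER NOT NULL,
            cgnt_namespace INTEGER NOT NULL,
            cgnt_title     TEXT,              -- NULL when equal to cgnt_key
            cgnt_key       TEXT NOT NULL,
            UNIQUE (cgnt_wiki, cgnt_namespace, cgnt_key)
        )
    """)
    return conn


def store_page(conn, wiki, namespace, title, key):
    """Insert a page, storing the title only if it differs from the key."""
    stored_title = None if title == key else title
    conn.execute("INSERT INTO cognate_titles VALUES (?, ?, ?, ?)",
                 (wiki, namespace, stored_title, key))


def page_title(conn, wiki, namespace, key):
    """Reconstruct the title: the stored title if present, else the key."""
    row = conn.execute(
        "SELECT COALESCE(cgnt_title, cgnt_key) FROM cognate_titles "
        "WHERE cgnt_wiki = ? AND cgnt_namespace = ? AND cgnt_key = ?",
        (wiki, namespace, key),
    ).fetchone()
    return row[0] if row else None


def delete_page(conn, wiki, namespace, key):
    """Delete by key (not by title) and return the number of rows removed."""
    cur = conn.execute(
        "DELETE FROM cognate_titles "
        "WHERE cgnt_wiki = ? AND cgnt_namespace = ? AND cgnt_key = ?",
        (wiki, namespace, key),
    )
    return cur.rowcount


conn = build_sparse_schema()
store_page(conn, 1, 0, "don\u2019t", "don't")  # title differs: stored
store_page(conn, 2, 0, "Berlin", "Berlin")     # title equals key: NULL
print(page_title(conn, 1, 0, "don't"), page_title(conn, 2, 0, "Berlin"))
```

Because delete_page only sees the key, it cannot tell two conflicting pages apart, which is the problem case described above.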


TASK DETAIL
https://phabricator.wikimedia.org/T145412

To: Addshore, daniel
Cc: Darkdadaah, WMDE-leszek, Lydia_Pintscher, gabriel-wmde, JAnD, daniel, Addshore, Aklapper, Lewizho99, Maathavan, D3r1ck01, Izno, Wikidata-bugs, aude, Mbch331