[Wikidata-bugs] [Maniphest] [Edited] T187775: Investigation: Constraints for a database schema to store representations of a Lexeme

thiemowmde Tue, 20 Feb 2018 05:38:29 -0800

thiemowmde updated the task description.

CHANGES TO TASK DESCRIPTION

Two new secondary database tables are proposed:

* wbl_lemmas stores the lemma (a text value) for each Lexeme.

* wbl_item_references (or similar) stores two Item references (lexical category and language) for each Lexeme.

== Indexes

* For the current use case both tables need a primary index on the Lexeme ID.

* A fulltext index on the Lemma column is not needed for the current use case, but it is suggested to design the table so one can be added any time later.

* Indexes on the two Item reference columns are probably never needed. Backreferences from the Items to Lexemes that use these Items will be done via MediaWiki core's link table, and Wikibase usage tracking.
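
The index requirements above can be sketched roughly as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for the production database, and all column names (lexeme_id, lemma, language_item, lexical_category_item) as well as the sample IDs are assumptions, not part of the proposal.

```python
import sqlite3

# Hypothetical schema sketch; column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE wbl_lemmas (
    lexeme_id TEXT PRIMARY KEY,   -- primary index on the Lexeme ID
    lemma     TEXT NOT NULL       -- no fulltext index for now
);
CREATE TABLE wbl_item_references (
    lexeme_id             TEXT PRIMARY KEY,  -- primary index on the Lexeme ID
    language_item         TEXT NOT NULL,     -- deliberately not indexed
    lexical_category_item TEXT NOT NULL      -- deliberately not indexed
);
"""


def create_tables(conn):
    """Create both proposed secondary tables."""
    conn.executescript(SCHEMA)


def lookup(conn, lexeme_id):
    """Fetch the lemma and both Item references for one Lexeme by its ID."""
    return conn.execute(
        "SELECT l.lemma, r.language_item, r.lexical_category_item "
        "FROM wbl_lemmas AS l "
        "JOIN wbl_item_references AS r ON l.lexeme_id = r.lexeme_id "
        "WHERE l.lexeme_id = ?",
        (lexeme_id,),
    ).fetchone()


conn = sqlite3.connect(":memory:")
create_tables(conn)
# Made-up sample row, purely for demonstration.
conn.execute("INSERT INTO wbl_lemmas VALUES (?, ?)", ("L1", "example"))
conn.execute(
    "INSERT INTO wbl_item_references VALUES (?, ?, ?)", ("L1", "Q100", "Q200")
)
```

Both lookups go through the primary key only, which matches the current use case; the Item reference columns stay unindexed because backreferences are expected to come from the link table and usage tracking.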

== Table size

* WIP: The number of rows in each proposed table is going to be identical to the total number of Lexemes. There is only one lemma per Lexeme (in only one language), only one Item reference for the lexical category, and only one Item reference for the language. This makes the two database tables proposed here significantly different from wb_terms, where each Item can have labels and descriptions in 300+ languages, and users are able to enter as many aliases as they want.

== Item references

* Item IDs are currently limited to 32 bits (signed). The maximum ID is Q2147483647, which is 11 characters.

* Some day we might need to switch to 64 bits (still signed). The maximum is then Q9223372036854775807. That's 20 characters.

* WIP: Since we know we are dealing exclusively with Item references, and no other entity types, we could store the references as integers. Is this worth it with regard to performance? Or is an indexed VARCHAR column as efficient as an indexed INT column?

* The suggestion is to go with VARCHAR(20) for both Item reference columns.
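
The character counts for Item IDs can be checked with a short calculation. The sketch below also shows what storing a reference as an integer would involve; the function names are hypothetical and chosen only for this illustration.

```python
import re


def max_item_id_length(bits):
    """Length of the longest Item ID string ("Q" + digits) for a signed integer of the given width."""
    max_numeric_id = 2 ** (bits - 1) - 1
    return len("Q" + str(max_numeric_id))


def item_id_to_int(item_id):
    """Strip the "Q" prefix so the reference could be stored in an integer column."""
    if re.fullmatch(r"Q[1-9][0-9]*", item_id) is None:
        raise ValueError("not an Item ID: " + repr(item_id))
    return int(item_id[1:])


print(max_item_id_length(32))  # 11
print(max_item_id_length(64))  # 20
```

A column of 20 characters therefore covers the 64-bit case as well, so a later switch would not require widening the column.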

== Lemma column length

* In contrast to Item labels, the lemma of a Lexeme is (by definition) a single word only....
I talked to PM (@Lydia_Pintscher) and we established the limit should **not** be 255, but 768.
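
A length check against this limit could look like the sketch below. The task does not say whether 768 counts characters or bytes, so the helper reports both; the function names and the count_bytes switch are assumptions made for illustration.

```python
LEMMA_LIMIT = 768  # limit named in the task; its unit is not stated there


def lemma_lengths(lemma):
    """Return (number of characters, number of UTF-8 bytes) for a lemma."""
    return len(lemma), len(lemma.encode("utf-8"))


def fits(lemma, limit=LEMMA_LIMIT, count_bytes=False):
    """Check a lemma against the limit, counting either characters or UTF-8 bytes."""
    chars, nbytes = lemma_lengths(lemma)
    return (nbytes if count_bytes else chars) <= limit
```

For lemmas with non-ASCII characters the two counts differ, so the unit matters when choosing the column type.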

TASK DETAIL

https://phabricator.wikimedia.org/T187775

To: thiemowmde
Cc: daniel, Lucas_Werkmeister_WMDE, Ladsgroup, WMDE-leszek, thiemowmde, Aklapper, Lydia_Pintscher, Lahi, Gq86, Cinemantique, GoranSMilovanovic, QZanden, LawExplorer, Wikidata-bugs, aude, Darkdadaah, Mbch331
