thiemowmde updated the task description.
CHANGES TO TASK DESCRIPTION
Two new secondary database tables are proposed:
* wbl_lemmas stores the lemma (a text value) for each Lexeme.
* wbl_item_references (or similar) stores two Item references (lexical category and language) for each Lexeme.
== Indexes
* For the current use case both tables need a primary index on the Lexeme ID.
* A fulltext index on the Lemma column is not needed for the current use case, but it is suggested to design the table so one can be added any time later.
* Indexes on the two Item reference columns are probably never needed. Backreferences from the Items to Lexemes that use these Items will be done via MediaWiki core's link table, and Wikibase usage tracking.
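A minimal sketch of the two proposed tables, using Python's built-in `sqlite3` purely for illustration (the production database would be MySQL/MariaDB). Only the table names come from the description; the column names `lexeme_id`, `lemma`, `lexical_category` and `language`, the column types, and the sample values are assumptions made for this example.

```python
import sqlite3

# Hypothetical column names and types; the task description only names the tables.
SCHEMA = """
CREATE TABLE wbl_lemmas (
    lexeme_id INTEGER PRIMARY KEY,  -- primary index on the Lexeme ID, one row per Lexeme
    lemma     TEXT NOT NULL         -- no fulltext index for now
);
CREATE TABLE wbl_item_references (
    lexeme_id        INTEGER PRIMARY KEY,   -- primary index on the Lexeme ID
    lexical_category VARCHAR(20) NOT NULL,  -- Item ID, deliberately not indexed
    language         VARCHAR(20) NOT NULL   -- Item ID, deliberately not indexed
);
"""


def create_tables(conn: sqlite3.Connection) -> None:
    """Create both secondary tables in the given connection."""
    conn.executescript(SCHEMA)


def lookup(conn: sqlite3.Connection, lexeme_id: int):
    """Fetch lemma, lexical category and language for one Lexeme via the primary keys."""
    return conn.execute(
        "SELECT l.lemma, r.lexical_category, r.language "
        "FROM wbl_lemmas AS l "
        "JOIN wbl_item_references AS r ON r.lexeme_id = l.lexeme_id "
        "WHERE l.lexeme_id = ?",
        (lexeme_id,),
    ).fetchone()


conn = sqlite3.connect(":memory:")
create_tables(conn)
# Sample row with made-up values, for illustration only.
conn.execute("INSERT INTO wbl_lemmas VALUES (1, 'water')")
conn.execute("INSERT INTO wbl_item_references VALUES (1, 'Q1084', 'Q1860')")
print(lookup(conn, 1))
```

Because the Lexeme ID is the primary key in both tables, a second row for the same Lexeme is rejected, which matches the one-lemma-per-Lexeme assumption in the description.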
== Table size
* WIP: The number of rows in each proposed table is going to be identical to the total number of Lexemes. There is only one lemma per Lexeme (in only one language), only one Item reference for the lexical category, and only one Item reference for the language. This makes the two database tables proposed here significantly different from wb_terms, where each Item can have labels and descriptions in 300+ languages, and users are able to enter as many aliases as they want.
== Item references
* Item IDs are currently limited to 32 bits (signed). The maximum ID is Q2147483647, which is 11 characters.
* Some day we might need to switch to 64 bits (still signed). The maximum is then Q9223372036854775807. That's 20 characters.
* WIP: Since we know we are dealing exclusively with Item references, and no other entity types, we could store the references as integers. Is this worth it with regard to performance? Or is an indexed VARCHAR column as efficient as an indexed INT column?
* The suggestion is to go with VARCHAR(20) for both columns.
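The ID lengths quoted in the Item references section can be checked with a few lines of Python. The sketch also shows what converting between the string form and a plain integer could look like if the integer storage option were chosen. The helper names `item_id_to_int` and `int_to_item_id` are hypothetical and not part of Wikibase.

```python
import re

INT32_MAX = 2**31 - 1  # largest signed 32-bit value
INT64_MAX = 2**63 - 1  # largest signed 64-bit value

# An Item ID is the letter Q followed by a positive decimal number.
ITEM_ID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def item_id_to_int(item_id: str) -> int:
    """Strip the 'Q' prefix so the reference could be stored in an integer column."""
    if not ITEM_ID_PATTERN.fullmatch(item_id):
        raise ValueError(f"not an Item ID: {item_id!r}")
    return int(item_id[1:])


def int_to_item_id(number: int) -> str:
    """Rebuild the string form of an Item ID from its numeric part."""
    if number < 1:
        raise ValueError("Item numbers must be positive")
    return f"Q{number}"


print(int_to_item_id(INT32_MAX), len(int_to_item_id(INT32_MAX)))  # Q2147483647 11
print(int_to_item_id(INT64_MAX), len(int_to_item_id(INT64_MAX)))  # Q9223372036854775807 20
```

A VARCHAR(20) column therefore holds any Item ID up to the signed 64-bit maximum, including the `Q` prefix.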
== Lemma column length
* In contrast to Item labels, the lemma of a Lexeme is (by definition) a single word only.
...
I talked to PM (@Lydia_Pintscher) and we established the limit should **not** be 255, but 768.
To: thiemowmde
Cc: daniel, Lucas_Werkmeister_WMDE, Ladsgroup, WMDE-leszek, thiemowmde, Aklapper, Lydia_Pintscher, Lahi, Gq86, Cinemantique, GoranSMilovanovic, QZanden, LawExplorer, Wikidata-bugs, aude, Darkdadaah, Mbch331
_______________________________________________ Wikidata-bugs mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
