thiemowmde created this task. thiemowmde triaged this task as "Normal" priority. thiemowmde added projects: Wikidata, Lexicographical data, Wikidata-Sprint-2018-02-14. Herald added a subscriber: Aklapper.
TASK DESCRIPTION
Table size
- WIP
Item references
- WIP
Lemma column length
- The number of rows in the proposed tables is going to be identical to the total number of Lexemes. There is only one lemma per Lexeme (in only one language), only one Item reference for the lexical category, and only one Item reference for the language. This makes the two database tables proposed here significantly different from wb_terms, where each Item can have labels and descriptions in 300+ languages, and users are able to enter as many aliases as they want.
- In contrast to Item labels, the lemma of a Lexeme is (by definition) a single word only.
- One of the longest words in an English dictionary is "Supercalifragilisticexpialidocious" (34 characters).
- Some chemicals have names with tens of thousands, even hundreds of thousands of characters. What such extreme examples basically mean is: Whatever limit we choose, it will be arbitrary. There will always be exceptions. We must always think about truncation.
- Rendering lemmas with thousands of characters untruncated in contexts that reference the Lexeme (but are not meant to represent the Lexeme like the Lexeme page itself does) certainly does not make sense. When a lemma is used in the visible text or tooltip of a link, some truncation must happen. Otherwise a single link would span multiple lines or even paragraphs. A trivial truncation algorithm that has proved to be sufficient many times (e.g. in TwoColConflict) is to hard truncate via the database, and apply a CSS ellipsis to hide the hard truncation.
- If we want to make sure MySQL can index all characters in a VARCHAR column, we should not go beyond VARCHAR(768). See https://phabricator.wikimedia.org/T154660#2936497 for a very closely related discussion.
- Labels, descriptions, and aliases are currently limited in two ways: to 250 Unicode characters via a setting "multilang-limits", as well as 255 bytes via a VARCHAR(255) in the wb_terms table.
- If we make sure we are able to expand the table structure later, we could start with VARCHAR(255), and later expand to VARCHAR(768) or further if needed.
I talked to PM (@Lydia_Pintscher) and we established the limit should not be 255, but 768.
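The difference between a character limit and a byte limit (as with the 250 characters and 255 bytes mentioned above for wb_terms) can be illustrated with a small sketch. This is not code from Wikibase. The function names `fits_limits` and `truncate_utf8` are hypothetical, and the sketch assumes that text is stored as UTF-8 and that "Unicode characters" means code points.

```python
def fits_limits(text: str, max_chars: int, max_bytes: int) -> bool:
    """Check a string against both a code point limit and a UTF-8 byte limit."""
    return len(text) <= max_chars and len(text.encode("utf-8")) <= max_bytes


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Hard truncate to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cutting the byte string may leave an incomplete multi-byte sequence
    # at the end; errors="ignore" drops those trailing bytes.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")
```

For example, a string of 250 "ä" characters satisfies a 250 character limit but needs 500 bytes in UTF-8, so a 255 byte limit would cut it down to 127 characters. A truncation like this only protects against broken byte sequences. It can still separate a combining mark from its base character, which is one more reason to hide the cut behind a CSS ellipsis.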
To: thiemowmde
Cc: daniel, Lucas_Werkmeister_WMDE, Ladsgroup, WMDE-leszek, thiemowmde, Aklapper, Lydia_Pintscher, Lahi, Gq86, Cinemantique, GoranSMilovanovic, QZanden, LawExplorer, Wikidata-bugs, aude, Darkdadaah, Mbch331
