[Wikidata-bugs] [Maniphest] [Commented On] T187775: Investigation: Constraints for a database schema to store representations of a Lexeme

2018-02-23 Thread thiemowmde
thiemowmde added a comment. Estimated table sizes: wbl_lexemes: The latest Item ID is currently Q49977198. That's 9 bytes. 9 * 3 = 27 bytes per row. 27 * 1 million Lexemes = 26 megabytes. wbl_lemmata: Lexeme IDs will be similar to Item IDs, so 9 bytes again. Let's say language codes are 5 bytes on
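
The wbl_lexemes estimate in the comment above can be reproduced with a few lines of arithmetic. This is only a sketch: the function name and its parameters are made up for illustration, and it counts raw ID bytes only (no row, index, or length-prefix overhead), which is the same simplification the comment makes.

```python
def estimate_table_bytes(id_bytes: int, id_columns: int, rows: int) -> int:
    """Raw payload size of a table whose rows hold `id_columns` IDs of `id_bytes` each.

    Hypothetical helper: ignores storage-engine overhead such as row headers,
    VARBINARY length prefixes and indexes.
    """
    return id_bytes * id_columns * rows


# "Q49977198" is the Item ID quoted in the comment; it is 9 ASCII bytes long.
ID_BYTES = len("Q49977198".encode("ascii"))

row_bytes = estimate_table_bytes(ID_BYTES, id_columns=3, rows=1)
total_bytes = estimate_table_bytes(ID_BYTES, id_columns=3, rows=1_000_000)
total_mib = total_bytes / 2**20  # binary megabytes

print(f"{row_bytes} bytes per row, roughly {round(total_mib)} MiB for 1 million Lexemes")
```

The "26 megabytes" in the comment matches the binary-megabyte reading (27,000,000 bytes divided by 2^20); in decimal units the same figure is 27 MB.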

[Wikidata-bugs] [Maniphest] [Commented On] T187775: Investigation: Constraints for a database schema to store representations of a Lexeme

2018-02-23 Thread thiemowmde
thiemowmde added a comment. wb_terms is plural. Most MediaWiki core tables are plural. I also like plural names for tables more. But in the end it really does not matter. I used VARBINARY and VARCHAR BINARY as they currently are on other Wikibase tables. From

[Wikidata-bugs] [Maniphest] [Commented On] T187775: Investigation: Constraints for a database schema to store representations of a Lexeme

2018-02-23 Thread Lucas_Werkmeister_WMDE
Lucas_Werkmeister_WMDE added a comment. (Minor comment – the MediaWiki database coding conventions prefer singular table names, i. e. wbl_lexeme and wbl_lemma. But I don’t know if there’s a different convention within Wikibase.) (Also, is the use of VARCHAR BINARY instead of VARBINARY for

[Wikidata-bugs] [Maniphest] [Commented On] T187775: Investigation: Constraints for a database schema to store representations of a Lexeme

2018-02-21 Thread thiemowmde
thiemowmde added a comment. @WMDE-leszek, something like this would be my draft: CREATE TABLE IF NOT EXISTS wbl_lexemes ( lex_lexeme_id VARBINARY(20) NOT NULL PRIMARY KEY, lex_lexical_category_id VARBINARY(20) NOT NULL, lex_language_item_id VARBINARY(20) NOT NULL ); CREATE TABLE IF NOT

[Wikidata-bugs] [Maniphest] [Commented On] T187775: Investigation: Constraints for a database schema to store representations of a Lexeme

2018-02-20 Thread Ladsgroup
Ladsgroup added a comment. Our idea would be to have it as memcached.

[Wikidata-bugs] [Maniphest] [Commented On] T187775: Investigation: Constraints for a database schema to store representations of a Lexeme

2018-02-20 Thread thiemowmde
thiemowmde added a comment. Personally, I'm totally fine with using any kind of cache, might it be an in-memory one or something else. My worst-case scenario is as follows: Let's say we have 10 million Lexemes, 2 lemmas per Lexeme, 20 bytes per lemma. The cache would need to hold about 0.4
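
The worst-case scenario in the comment above is a simple product of three numbers. The sketch below recomputes it; the function name is hypothetical, and because the comment is cut off after "about 0.4", the unit of the author's own figure is not confirmed here. The sketch only reports what the stated inputs give.

```python
def worst_case_cache_bytes(lexemes: int, lemmas_per_lexeme: int, bytes_per_lemma: int) -> int:
    """Payload bytes needed to cache every lemma (hypothetical helper).

    Counts lemma text only; cache keys and per-entry overhead are not included.
    """
    return lexemes * lemmas_per_lexeme * bytes_per_lemma


# Inputs taken from the comment: 10 million Lexemes, 2 lemmas each, 20 bytes per lemma.
cache_bytes = worst_case_cache_bytes(10_000_000, 2, 20)
cache_gb = cache_bytes / 10**9  # decimal gigabytes

print(f"{cache_bytes:,} bytes, i.e. {cache_gb} GB in decimal units")
```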

[Wikidata-bugs] [Maniphest] [Commented On] T187775: Investigation: Constraints for a database schema to store representations of a Lexeme

2018-02-20 Thread WMDE-leszek
WMDE-leszek added a comment. I just had a chat with @Ladsgroup and he suggested the following regarding the wbl_lemmas table: what about not putting this stuff in the database table but storing all lemmas for display in the cache (or caching them when they're used). I am a bit ignorant, but as wbl_lemmas

[Wikidata-bugs] [Maniphest] [Commented On] T187775: Investigation: Constraints for a database schema to store representations of a Lexeme

2018-02-20 Thread WMDE-leszek
WMDE-leszek added a comment. Regarding number of lemmas per lexeme, @Lucas_Werkmeister_WMDE makes a good point. As far as I remember, @thiemowmde and I talked IRL last week about the number there, and we said something like that the security guesstimate would be to say the total number of lemmas

[Wikidata-bugs] [Maniphest] [Commented On] T187775: Investigation: Constraints for a database schema to store representations of a Lexeme

2018-02-20 Thread thiemowmde
thiemowmde added a comment. We should fix https://commons.wikimedia.org/wiki/File:Lexeme_data_model.png then, because it very prominently says there is only "one" lemma. It could be this is meant to be interpreted as "one" value that can somehow contain multiple values. I wonder what the benefit

[Wikidata-bugs] [Maniphest] [Commented On] T187775: Investigation: Constraints for a database schema to store representations of a Lexeme

2018-02-20 Thread WMDE-leszek
WMDE-leszek added a comment. One of the longest words in an English dictionary is "Supercalifragilisticexpialidocious" (34 characters). General note: English is probably not the best language to look at in the context of long words (even German beats it easily). In contrast to Item labels, the

[Wikidata-bugs] [Maniphest] [Commented On] T187775: Investigation: Constraints for a database schema to store representations of a Lexeme

2018-02-20 Thread daniel
daniel added a comment. On 20.02.2018 at 15:44, Lucas_Werkmeister_WMDE wrote: Lucas_Werkmeister_WMDE added a comment. There is only one lemma per Lexeme (in only one language) Don’t we have something to support e. g. “color” and “colour” for the same lexeme? I’m not sure if that’s two lemmas

[Wikidata-bugs] [Maniphest] [Commented On] T187775: Investigation: Constraints for a database schema to store representations of a Lexeme

2018-02-20 Thread Lucas_Werkmeister_WMDE
Lucas_Werkmeister_WMDE added a comment. There is only one lemma per Lexeme (in only one language) Don’t we have something to support e. g. “color” and “colour” for the same lexeme? I’m not sure if that’s two lemmas or one lemma (multilingual text) with two spellings, but there seems to be some