[Wikidata-bugs] [Maniphest] [Updated] T215902: Investigate normalization of data stored in wb_terms table

Addshore Wed, 13 Feb 2019 00:58:55 -0800

Addshore added a comment.

My first investigation into table normalization went for full normalization:

F28207723: image.png

The queries to get there from the wb_terms table can be found @ P8075 and have been running on the test servers over the past week.

term_type, small table holding the types of strings to be indexed in the db, right now this would be labels, descriptions and aliases, but this would scale to allowing more similar terms into the index (if desired) It might be the case not having a table here would be better and just keep INT ids in code.
languages, table giving language codes numeric IDs.
strings, the strings used in terms
langstring, representing a string in a set language
langstringtype, representing a langstring as a given type
page_entity, one row for each entity, providing a direct lookup to a pageid
entity_terms, representing a term on an entity (using reference to page_id)

Results

	Data	Index	Compressed Data	Compressed Index	rows
term_type	16KB	16KB	8KB	8KB	3
languages	16KB	16KB	24KB	8KB	432
strings	6.5GB	5GB	3.7GB	2.6GB	108,059,306
langstring	9.8GB	9.1GB	6.6GB	2.7GB	310,983,369
langstringtype	9.8GB	8.9GB	6.1GB	2.7GB	313,554,459
page entity	1.6GB	1GB	0.95GB	0.5GB	50,276,496
entity terms	58.5GB	53.1GB	37.5GB	16.4GB	1,809,036,849
TOTAL	86GB	77GB	54.8GB	24.9GB

Down from the current size of wb_terms on the test server, 107GB data, 288GB index = 395GB total

Things to note

There may be little point in normalizing the term_type for example. This thing only has 3 rows. (languages is also pretty small)
The page table would be used for joins as part of this normalization, and this normalization currently has the assumption that there is a 1:1 relationship between page and entity.
entity_type is in old wb_terms table, but not in this new normalization, given the assumption of 1:1 with pages and 1 NS per entity type right now it would be safe to do queries joined on the page table when looking or entities of a given type.
The insertion logic for terms is more complex than we currently have, deletion logic to be decided, do we even have to remove values from tables such as langstring and langstring type when an entity term is removed? probably not...
There is only likely to be duplication between aliases and labels in terms of string usage currently, descriptions and labels are less likely to have this space saving duplication
langstring has 310983369 rows, langstringtype has 313554459 rows. This might indicate slightly too much normalization here (at least for now) and we could remove the complexity by having a single langstring type table with id, lang_id, string_id, type_id

Next steps

Investigate what the queries would look like with this schema for reading in the various ways that we do, insertions and deletions.
Would we be able to maintain a wb_terms view from this data for labs? Would we even want to do that?
#DBA review of this comment
What would the migration look like?
TBA

TASK DETAIL

https://phabricator.wikimedia.org/T215902

EMAIL PREFERENCES

https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Addshore
Cc: Aklapper, Addshore, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, Jonas, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331

_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Updated] T215902: Investigate normalization of data stored in wb_terms table

Reply via email to