Marostegui added a comment.
In T215902#4950293, @Addshore wrote:

My first investigation into table normalization went for full normalization:
Thanks for working on this! :-)
The queries to get there from the wb_terms table can be found @ P8075 and have been running on the test servers over the past week.
- term_type, small table holding the types of strings to be indexed in the db, right now this would be labels, descriptions and aliases, but this would scale to allowing more similar terms into the index (if desired) It might be the case not having a table here would be better and just keep INT ids in code.
- languages, table giving language codes numeric IDs.
- strings, the strings used in terms
- langstring, representing a string in a set language
- langstringtype, representing a langstring as a given type
- page_entity, one row for each entity, providing a direct lookup to a pageid
- entity_terms, representing a term on an entity (using reference to page_id)
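To make the proposal concrete, here is a rough DDL sketch of those tables. This is MySQL-flavoured and all column names and types are my guesses for illustration, not the actual definitions from P8075:

```sql
-- Hypothetical sketch of the proposed schema; names and types are
-- assumptions, not the real P8075 definitions.
CREATE TABLE term_type (
  tt_id   TINYINT UNSIGNED PRIMARY KEY,
  tt_name VARBINARY(32) NOT NULL UNIQUE   -- 'label', 'description', 'alias'
);

CREATE TABLE languages (
  lang_id   SMALLINT UNSIGNED PRIMARY KEY,
  lang_code VARBINARY(32) NOT NULL UNIQUE
);

CREATE TABLE strings (
  str_id   BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  str_text VARBINARY(255) NOT NULL,
  UNIQUE KEY (str_text)
);

CREATE TABLE langstring (            -- a string in a given language
  ls_id      BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  ls_lang_id SMALLINT UNSIGNED NOT NULL,   -- -> languages.lang_id
  ls_str_id  BIGINT UNSIGNED NOT NULL,     -- -> strings.str_id
  UNIQUE KEY (ls_lang_id, ls_str_id)
);

CREATE TABLE langstringtype (        -- a langstring used as a given type
  lst_id    BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  lst_ls_id BIGINT UNSIGNED NOT NULL,      -- -> langstring.ls_id
  lst_tt_id TINYINT UNSIGNED NOT NULL,     -- -> term_type.tt_id
  UNIQUE KEY (lst_ls_id, lst_tt_id)
);

CREATE TABLE page_entity (           -- 1:1 page <-> entity assumption
  pe_page_id   INT UNSIGNED PRIMARY KEY,   -- -> page.page_id
  pe_entity_id VARBINARY(32) NOT NULL UNIQUE
);

CREATE TABLE entity_terms (          -- a term on an entity, via page_id
  et_page_id INT UNSIGNED NOT NULL,        -- -> page_entity.pe_page_id
  et_lst_id  BIGINT UNSIGNED NOT NULL,     -- -> langstringtype.lst_id
  PRIMARY KEY (et_page_id, et_lst_id)
);
```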
Results
| Table | Data | Index | Compressed Data | Compressed Index | Rows |
|---|---|---|---|---|---|
| term_type | 16KB | 16KB | 8KB | 8KB | 3 |
| languages | 16KB | 16KB | 24KB | 8KB | 432 |
| strings | 6.5GB | 5GB | 3.7GB | 2.6GB | 108,059,306 |
| langstring | 9.8GB | 9.1GB | 6.6GB | 2.7GB | 310,983,369 |
| langstringtype | 9.8GB | 8.9GB | 6.1GB | 2.7GB | 313,554,459 |
| page_entity | 1.6GB | 1GB | 0.95GB | 0.5GB | 50,276,496 |
| entity_terms | 58.5GB | 53.1GB | 37.5GB | 16.4GB | 1,809,036,849 |
| TOTAL | 86GB | 77GB | 54.8GB | 24.9GB | |

Down from the current size of wb_terms on the test server: 107GB data + 288GB index = 395GB total.
Things to note
- There may be little point in normalizing term_type, for example; that table only has 3 rows. (languages is also pretty small.)
Agreed, probably no need to do that for either term_type or languages.
- The page table would be used for joins as part of this normalization, which currently assumes a 1:1 relationship between page and entity.
- entity_type is in the old wb_terms table but not in this new normalization. Given the 1:1 assumption with pages, and one namespace per entity type right now, it would be safe to join on the page table when looking for entities of a given type.
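For example, restricting to items by namespace could look something like the query below. The namespace value and all column names are illustrative assumptions, not the actual schema:

```sql
-- Hypothetical example: fetch the English label of Q42 by joining
-- through the page table; namespace 0 stands in for "item" per the
-- one-namespace-per-entity-type assumption.
SELECT s.str_text
FROM page p
JOIN entity_terms   et  ON et.et_page_id = p.page_id
JOIN langstringtype lst ON lst.lst_id    = et.et_lst_id
JOIN term_type      tt  ON tt.tt_id      = lst.lst_tt_id
JOIN langstring     ls  ON ls.ls_id      = lst.lst_ls_id
JOIN languages      l   ON l.lang_id     = ls.ls_lang_id
JOIN strings        s   ON s.str_id      = ls.ls_str_id
WHERE p.page_namespace = 0
  AND p.page_title = 'Q42'
  AND tt.tt_name = 'label'
  AND l.lang_code = 'en';
```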
- The insertion logic for terms is more complex than what we currently have. The deletion logic is still to be decided: do we even have to remove rows from tables such as langstring and langstringtype when an entity term is removed? Probably not...
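To illustrate why insertion gets more complex: each write has to find-or-create a row at every level of the normalization before the entity_terms row can be written. A MySQL-flavoured sketch, using hypothetical column names:

```sql
-- Hypothetical find-or-create at the string level, using the MySQL
-- LAST_INSERT_ID(expr) idiom so an existing row's id is returned on
-- duplicate. The same pattern repeats for langstring and
-- langstringtype before the final entity_terms insert.
INSERT INTO strings (str_text)
VALUES ('Douglas Adams')
ON DUPLICATE KEY UPDATE str_id = LAST_INSERT_ID(str_id);
SET @str_id = LAST_INSERT_ID();

INSERT INTO langstring (ls_lang_id, ls_str_id)
VALUES (@en_lang_id, @str_id)
ON DUPLICATE KEY UPDATE ls_id = LAST_INSERT_ID(ls_id);
SET @ls_id = LAST_INSERT_ID();
-- ...and so on for langstringtype, then finally entity_terms.
```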
Why is that? (just trying to understand the logic)
- In terms of string usage, duplication is currently only likely between aliases and labels; descriptions are less likely to provide this space-saving duplication.
- langstring has 310,983,369 rows and langstringtype has 313,554,459 rows. This might indicate slightly too much normalization here (at least for now), and we could remove the complexity by having a single langstringtype table with id, lang_id, string_id, type_id.
Why not combine strings, langstring and langstringtype into the same table?
Is that what you mean by "we could remove the complexity by having a single langstringtype table with id, lang_id, string_id, type_id"?
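If I read the suggested single table (id, lang_id, string_id, type_id) right, it would keep strings as its own table but merge the two link tables into one; fully inlining strings as well would mean repeating each string once per language/type use and losing the dedup savings. A sketch with hypothetical names:

```sql
-- Hypothetical combined replacement for langstring + langstringtype:
-- strings stay normalized, but the two link tables collapse into one.
CREATE TABLE term_in_lang (
  til_id      BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  til_lang_id SMALLINT UNSIGNED NOT NULL,  -- -> languages
  til_str_id  BIGINT UNSIGNED NOT NULL,    -- -> strings
  til_type_id TINYINT UNSIGNED NOT NULL,   -- -> term_type
  UNIQUE KEY (til_lang_id, til_str_id, til_type_id)
);
```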
Keep in mind that I could be talking nonsense! I don't have the full context of how all the logic behind wb_terms works.
Next steps
- Investigate what the queries would look like with this schema for reading in the various ways that we do, insertions and deletions.
Indeed, that is a key point. We'd need to check their query plans and all that beforehand.
- Would we be able to maintain a wb_terms view from this data for labs? Would we even want to do that?
What do you mean? Create a view to query all the new tables to produce the same results as we currently have with wb_terms?
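If so, a rough sketch of such a compatibility view might look like the following (all column names on both sides are guesses; whether the labs replicas could serve this many joins efficiently is a separate question):

```sql
-- Hypothetical view rebuilding the old wb_terms shape from the new
-- tables, so existing labs queries could keep working.
CREATE VIEW wb_terms_compat AS
SELECT
  pe.pe_entity_id AS term_entity_id,
  tt.tt_name      AS term_type,
  l.lang_code     AS term_language,
  s.str_text      AS term_text
FROM entity_terms   et
JOIN page_entity    pe  ON pe.pe_page_id = et.et_page_id
JOIN langstringtype lst ON lst.lst_id    = et.et_lst_id
JOIN term_type      tt  ON tt.tt_id      = lst.lst_tt_id
JOIN langstring     ls  ON ls.ls_id      = lst.lst_ls_id
JOIN languages      l   ON l.lang_id     = ls.ls_lang_id
JOIN strings        s   ON s.str_id      = ls.ls_str_id;
```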
- #DBA review of this comment
- What would the migration look like?
I guess the easiest approach is to migrate while writing to both, and at some point, once both sets of tables are in sync, switch the writes to the new tables only?
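The backfill half of that could be batched reads from wb_terms feeding the new tables while live writes go to both schemas. A hypothetical batch, windowed on term_row_id:

```sql
-- Hypothetical backfill batch: copy distinct strings from one slice of
-- wb_terms into the new strings table. Similar batches would populate
-- the other new tables; the window bounds would advance per batch.
INSERT IGNORE INTO strings (str_text)
SELECT DISTINCT term_text
FROM wb_terms
WHERE term_row_id BETWEEN 1 AND 10000;
```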
Cc: Marostegui, Aklapper, Addshore, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, Jonas, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331
_______________________________________________ Wikidata-bugs mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
