Marostegui added a comment.
In T215902#4950293, @Addshore wrote:

My first investigation into table normalization went for full normalization:
Thanks for working on this! :-)
The queries to get there from the wb_terms table can be found @ P8075 and have been running on the test servers over the past week.
- term_type, small table holding the types of strings to be indexed in the db, right now this would be labels, descriptions and aliases, but this would scale to allowing more similar terms into the index (if desired) It might be the case not having a table here would be better and just keep INT ids in code.
- languages, table giving language codes numeric IDs.
- strings, the strings used in terms
- langstring, representing a string in a set language
- langstringtype, representing a langstring as a given type
- page_entity, one row for each entity, providing a direct lookup to a pageid
- entity_terms, representing a term on an entity (using reference to page_id)
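To make the proposal concrete, here is a rough DDL sketch of those tables. This is MySQL-flavoured and all column names and types are my guesses for illustration, not the actual definitions from P8075:

```sql
-- Hypothetical sketch of the proposed schema; names and types are
-- assumptions, not the real P8075 definitions.
CREATE TABLE term_type (
  tt_id   TINYINT UNSIGNED PRIMARY KEY,
  tt_name VARBINARY(32) NOT NULL UNIQUE   -- 'label', 'description', 'alias'
);

CREATE TABLE languages (
  lang_id   SMALLINT UNSIGNED PRIMARY KEY,
  lang_code VARBINARY(32) NOT NULL UNIQUE
);

CREATE TABLE strings (
  str_id   BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  str_text VARBINARY(255) NOT NULL,
  UNIQUE KEY (str_text)
);

CREATE TABLE langstring (            -- a string in a given language
  ls_id      BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  ls_lang_id SMALLINT UNSIGNED NOT NULL,   -- -> languages.lang_id
  ls_str_id  BIGINT UNSIGNED NOT NULL,     -- -> strings.str_id
  UNIQUE KEY (ls_lang_id, ls_str_id)
);

CREATE TABLE langstringtype (        -- a langstring used as a given type
  lst_id    BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  lst_ls_id BIGINT UNSIGNED NOT NULL,      -- -> langstring.ls_id
  lst_tt_id TINYINT UNSIGNED NOT NULL,     -- -> term_type.tt_id
  UNIQUE KEY (lst_ls_id, lst_tt_id)
);

CREATE TABLE page_entity (           -- 1:1 page <-> entity assumption
  pe_page_id   INT UNSIGNED PRIMARY KEY,   -- -> page.page_id
  pe_entity_id VARBINARY(32) NOT NULL UNIQUE
);

CREATE TABLE entity_terms (          -- a term on an entity, via page_id
  et_page_id INT UNSIGNED NOT NULL,        -- -> page_entity.pe_page_id
  et_lst_id  BIGINT UNSIGNED NOT NULL,     -- -> langstringtype.lst_id
  PRIMARY KEY (et_page_id, et_lst_id)
);
```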
Results
| Table | Data | Index | Compressed Data | Compressed Index | Rows |
|---|---|---|---|---|---|
| term_type | 16KB | 16KB | 8KB | 8KB | 3 |
| languages | 16KB | 16KB | 24KB | 8KB | 432 |
| strings | 6.5GB | 5GB | 3.7GB | 2.6GB | 108,059,306 |
| langstring | 9.8GB | 9.1GB | 6.6GB | 2.7GB | 310,983,369 |
| langstringtype | 9.8GB | 8.9GB | 6.1GB | 2.7GB | 313,554,459 |
| page_entity | 1.6GB | 1GB | 0.95GB | 0.5GB | 50,276,496 |
| entity_terms | 58.5GB | 53.1GB | 37.5GB | 16.4GB | 1,809,036,849 |
| TOTAL | 86GB | 77GB | 54.8GB | 24.9GB | |

Down from the current size of wb_terms on the test server: 107GB data + 288GB index = 395GB total.
Things to note
- There may be little point in normalizing term_type, for example; that table only has 3 rows. (languages is also pretty small.)
Agreed, probably no need to do that for either term_type or languages.
- The page table would be used for joins as part of this normalization, which currently assumes a 1:1 relationship between page and entity.
- entity_type is in the old wb_terms table but not in this new normalization. Given the 1:1 assumption with pages, and one namespace per entity type right now, it would be safe to join on the page table when looking for entities of a given type.
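For example, restricting to items by namespace could look something like the query below. The namespace value and all column names are illustrative assumptions, not the actual schema:

```sql
-- Hypothetical example: fetch the English label of Q42 by joining
-- through the page table; namespace 0 stands in for "item" per the
-- one-namespace-per-entity-type assumption.
SELECT s.str_text
FROM page p
JOIN entity_terms   et  ON et.et_page_id = p.page_id
JOIN langstringtype lst ON lst.lst_id    = et.et_lst_id
JOIN term_type      tt  ON tt.tt_id      = lst.lst_tt_id
JOIN langstring     ls  ON ls.ls_id      = lst.lst_ls_id
JOIN languages      l   ON l.lang_id     = ls.ls_lang_id
JOIN strings        s   ON s.str_id      = ls.ls_str_id
WHERE p.page_namespace = 0
  AND p.page_title = 'Q42'
  AND tt.tt_name = 'label'
  AND l.lang_code = 'en';
```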
- The insertion logic for terms is more complex than what we currently have. The deletion logic is still to be decided: do we even have to remove rows from tables such as langstring and langstringtype when an entity term is removed? Probably not...
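To illustrate why insertion gets more complex: each write has to find-or-create a row at every level of the normalization before the entity_terms row can be written. A MySQL-flavoured sketch, using hypothetical column names:

```sql
-- Hypothetical find-or-create at the string level, using the MySQL
-- LAST_INSERT_ID(expr) idiom so an existing row's id is returned on
-- duplicate. The same pattern repeats for langstring and
-- langstringtype before the final entity_terms insert.
INSERT INTO strings (str_text)
VALUES ('Douglas Adams')
ON DUPLICATE KEY UPDATE str_id = LAST_INSERT_ID(str_id);
SET @str_id = LAST_INSERT_ID();

INSERT INTO langstring (ls_lang_id, ls_str_id)
VALUES (@en_lang_id, @str_id)
ON DUPLICATE KEY UPDATE ls_id = LAST_INSERT_ID(ls_id);
SET @ls_id = LAST_INSERT_ID();
-- ...and so on for langstringtype, then finally entity_terms.
```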
Why is that? (just trying to understand the logic)
- In terms of string usage, duplication is currently only likely between aliases and labels; descriptions are less likely to provide this space-saving duplication.
- langstring has 310,983,369 rows and langstringtype has 313,554,459 rows. This might indicate slightly too much normalization here (at least for now), and we could remove the complexity by having a single langstringtype table with id, lang_id, string_id, type_id.
Why not combine strings, langstring and langstringtype into the same table?
Is that what you mean by "we could remove the complexity by having a single langstringtype table with id, lang_id, string_id, type_id"?
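If I read the suggested single table (id, lang_id, string_id, type_id) right, it would keep strings as its own table but merge the two link tables into one; fully inlining strings as well would mean repeating each string once per language/type use and losing the dedup savings. A sketch with hypothetical names:

```sql
-- Hypothetical combined replacement for langstring + langstringtype:
-- strings stay normalized, but the two link tables collapse into one.
CREATE TABLE term_in_lang (
  til_id      BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  til_lang_id SMALLINT UNSIGNED NOT NULL,  -- -> languages
  til_str_id  BIGINT UNSIGNED NOT NULL,    -- -> strings
  til_type_id TINYINT UNSIGNED NOT NULL,   -- -> term_type
  UNIQUE KEY (til_lang_id, til_str_id, til_type_id)
);
```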
Keep in mind that I could be talking nonsense! I don't have the full context of how all the logic behind wb_terms works.
Next steps
- Investigate what the queries would look like with this schema for reading in the various ways that we do, insertions and deletions.
Indeed, that is a key point. We'd need to check their query plans and all that beforehand.
- Would we be able to maintain a wb_terms view from this data for labs? Would we even want to do that?
What do you mean? Create a view to query all the new tables to produce the same results as we currently have with wb_terms?
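If so, a rough sketch of such a compatibility view might look like the following (all column names on both sides are guesses; whether the labs replicas could serve this many joins efficiently is a separate question):

```sql
-- Hypothetical view rebuilding the old wb_terms shape from the new
-- tables, so existing labs queries could keep working.
CREATE VIEW wb_terms_compat AS
SELECT
  pe.pe_entity_id AS term_entity_id,
  tt.tt_name      AS term_type,
  l.lang_code     AS term_language,
  s.str_text      AS term_text
FROM entity_terms   et
JOIN page_entity    pe  ON pe.pe_page_id = et.et_page_id
JOIN langstringtype lst ON lst.lst_id    = et.et_lst_id
JOIN term_type      tt  ON tt.tt_id      = lst.lst_tt_id
JOIN langstring     ls  ON ls.ls_id      = lst.lst_ls_id
JOIN languages      l   ON l.lang_id     = ls.ls_lang_id
JOIN strings        s   ON s.str_id      = ls.ls_str_id;
```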
- #DBA review of this comment
- What would the migration look like?
I guess the easiest approach is to migrate while writing to both, and at some point, once both sets of tables are in sync, switch the writes to the new tables only?
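The backfill half of that could be batched reads from wb_terms feeding the new tables while live writes go to both schemas. A hypothetical batch, windowed on term_row_id:

```sql
-- Hypothetical backfill batch: copy distinct strings from one slice of
-- wb_terms into the new strings table. Similar batches would populate
-- the other new tables; the window bounds would advance per batch.
INSERT IGNORE INTO strings (str_text)
SELECT DISTINCT term_text
FROM wb_terms
WHERE term_row_id BETWEEN 1 AND 10000;
```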
Cc: Marostegui, Aklapper, Addshore, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, Jonas, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331
_______________________________________________ Wikidata-bugs mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
