| Addshore added a comment. |
My first investigation into table normalization went for full normalization:
The queries to get there from the wb_terms table can be found @ P8075 and have been running on the test servers over the past week.
- term_type, small table holding the types of strings to be indexed in the db, right now this would be labels, descriptions and aliases, but this would scale to allowing more similar terms into the index (if desired) It might be the case not having a table here would be better and just keep INT ids in code.
- languages, table giving language codes numeric IDs.
- strings, the strings used in terms
- langstring, representing a string in a set language
- langstringtype, representing a langstring as a given type
- page_entity, one row for each entity, providing a direct lookup to a pageid
- entity_terms, representing a term on an entity (using reference to page_id)
Results
| Data | Index | Compressed Data | Compressed Index | rows | |
|---|---|---|---|---|---|
| term_type | 16KB | 16KB | 8KB | 8KB | 3 |
| languages | 16KB | 16KB | 24KB | 8KB | 432 |
| strings | 6.5GB | 5GB | 3.7GB | 2.6GB | 108,059,306 |
| langstring | 9.8GB | 9.1GB | 6.6GB | 2.7GB | 310,983,369 |
| langstringtype | 9.8GB | 8.9GB | 6.1GB | 2.7GB | 313,554,459 |
| page entity | 1.6GB | 1GB | 0.95GB | 0.5GB | 50,276,496 |
| entity terms | 58.5GB | 53.1GB | 37.5GB | 16.4GB | 1,809,036,849 |
| TOTAL | 86GB | 77GB | 54.8GB | 24.9GB | |
Down from the current size of wb_terms on the test server, 107GB data, 288GB index = 395GB total
Things to note
- There may be little point in normalizing the term_type for example. This thing only has 3 rows. (languages is also pretty small)
- The page table would be used for joins as part of this normalization, and this normalization currently has the assumption that there is a 1:1 relationship between page and entity.
- entity_type is in old wb_terms table, but not in this new normalization, given the assumption of 1:1 with pages and 1 NS per entity type right now it would be safe to do queries joined on the page table when looking or entities of a given type.
- The insertion logic for terms is more complex than we currently have, deletion logic to be decided, do we even have to remove values from tables such as langstring and langstring type when an entity term is removed? probably not...
- There is only likely to be duplication between aliases and labels in terms of string usage currently, descriptions and labels are less likely to have this space saving duplication
- langstring has 310983369 rows, langstringtype has 313554459 rows. This might indicate slightly too much normalization here (at least for now) and we could remove the complexity by having a single langstring type table with id, lang_id, string_id, type_id
Next steps
- Investigate what the queries would look like with this schema for reading in the various ways that we do, insertions and deletions.
- Would we be able to maintain a wb_terms view from this data for labs? Would we even want to do that?
- #DBA review of this comment
- What would the migration look like?
- TBA
TASK DETAIL
EMAIL PREFERENCES
To: Addshore
Cc: Aklapper, Addshore, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, Jonas, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331
Cc: Aklapper, Addshore, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, Jonas, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331
_______________________________________________ Wikidata-bugs mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
