Addshore added a comment.

  • The insertion logic for terms is more complex than what we currently have; the deletion logic is still to be decided. Do we even have to remove values from tables such as langstring and langstringtype when an entity term is removed? Probably not...

Why is that? (just trying to understand the logic)

"Why is that?", is that referring to insertion or deletion logic?

Well, with the wb_terms table, we are just adding a row for each term on an entity.
The normalization of course introduces more tables, so for a single term entry up to 7 inserts may be needed (at least with the initial draft first introduced in this ticket).
Using a hashing approach for languages and term types brings this down from 7 to 5.
The mapping from page_id to entity ID could also be seen as not relating to the terms index specifically, and could then be assumed to already be there.
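To make the insert-count difference concrete, here is a rough sketch (in Python with SQLite; the table and column names are my own assumptions, not the actual draft schema) of the "get or create" writes a single new term could require in a normalized layout:

```python
import sqlite3

# Hypothetical normalized schema, loosely modelled on the draft discussed
# in this ticket. Real table and column names will differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE string        (id INTEGER PRIMARY KEY, text TEXT UNIQUE);
CREATE TABLE langstring    (id INTEGER PRIMARY KEY, lang TEXT, string_id INTEGER,
                            UNIQUE (lang, string_id));
CREATE TABLE langstringtype(id INTEGER PRIMARY KEY, langstring_id INTEGER, type TEXT,
                            UNIQUE (langstring_id, type));
CREATE TABLE entity_terms  (entity_id INTEGER, langstringtype_id INTEGER);
""")

def get_or_create(table, unique_cols, values):
    """Return the row id, inserting only if the row does not exist yet."""
    where = " AND ".join(f"{c} = ?" for c in unique_cols)
    row = conn.execute(f"SELECT id FROM {table} WHERE {where}", values).fetchone()
    if row:
        return row[0]
    cols = ", ".join(unique_cols)
    marks = ", ".join("?" for _ in values)
    return conn.execute(
        f"INSERT INTO {table} ({cols}) VALUES ({marks})", values
    ).lastrowid

def insert_term(entity_id, lang, term_type, text):
    # Up to one insert per table; shared rows are reused.
    string_id = get_or_create("string", ["text"], [text])
    langstring_id = get_or_create("langstring", ["lang", "string_id"],
                                  [lang, string_id])
    lst_id = get_or_create("langstringtype", ["langstring_id", "type"],
                           [langstring_id, term_type])
    conn.execute("INSERT INTO entity_terms (entity_id, langstringtype_id) "
                 "VALUES (?, ?)", (entity_id, lst_id))

insert_term(42, "en", "label", "potato")
insert_term(42, "en", "alias", "potato")  # reuses the string and langstring rows
```

The second call only touches langstringtype and entity_terms, which is where the space saving from deduplicated strings comes from.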

For deletions the same applies, just deleting rows.
For the normalized version we will likely only remove entries from wb_entity_terms in real time, which has the potential to leave "dead data" in all of the other tables.
Initially this is not something that we will need to think about, but at some stage we might want to consider cleanup measures (this could just be a maintenance script, etc.).

  • There is currently only likely to be duplication between aliases and labels in terms of string usage; descriptions are less likely to have this space-saving duplication
  • langstring has 310983369 rows and langstringtype has 313554459 rows. This might indicate slightly too much normalization here (at least for now), and we could remove the complexity by having a single langstringtype table with id, lang_id, string_id, type_id

Why not combine langstring and langstringtype into the same table?
Is that what you mean by "we could remove the complexity by having a single langstringtype table with id, lang_id, string_id, type_id"?
Keep in mind that I could be talking nonsense! I don't have the full context of all the logic behind wb_terms.

It looks like it will make sense to merge the langstring and langstringtype tables together, resulting in a single table that looks something like the following (the example below also includes hashing of types and language codes rather than relating to another table):

id | stringId | langHashInt | stringTypeInt
---+----------+-------------+--------------
 1 |        1 |      421809 |        511251
 2 |        3 |       21424 |        511251
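As an illustration only (neither the concrete hash function nor the example integers above are specified in this discussion), one possible way to derive such short integers from a language code or term type, using CRC32 as a stand-in:

```python
from zlib import crc32

# Illustrative stand-in hash: turn a language code or term type string
# into a short integer without needing a lookup table. The actual choice
# of hash function is still open.
def short_hash(value: str) -> int:
    return crc32(value.encode("utf-8"))

# A merged langstringtype-style row with hashed lang and type columns
# (id, stringId, langHashInt, stringTypeInt):
row = (1, 1, short_hash("en"), short_hash("label"))
```

Because the hash is deterministic, the same language code always maps to the same integer, so equality lookups keep working without a join.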



Next steps

  • Investigate what the queries would look like with this schema for reading in the various ways that we do, insertions and deletions.

Indeed, that is a key point. We'd need to check their query plans and all that beforehand, right?

Yes, we should probably check the query plans beforehand.
In theory all query plans should be straightforward enough for MySQL not to mess up, and should also be the "most efficient".

  • Would we be able to maintain a wb_terms view from this data for labs? Would we even want to do that?

What do you mean? Create a view to query all the new tables to produce the same results as we currently have with wb_terms?

Yes. This could be useful for an easier migration, but also we might want to leave this view around for a while for users of the wb_terms table on the labs db replicas.
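A compatibility view along those lines might look roughly like this (SQLite sketch; the normalized table names, and the subset of wb_terms columns shown, are assumptions for illustration):

```python
import sqlite3

# Sketch of a wb_terms compatibility view over hypothetical normalized
# tables. This uses a simplified, unhashed variant of the schema so the
# view can expose plain language/type strings.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE string       (id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE langstring   (id INTEGER PRIMARY KEY, lang TEXT,
                           string_id INTEGER, type TEXT);
CREATE TABLE entity_terms (entity_id INTEGER, langstring_id INTEGER);

CREATE VIEW wb_terms AS
SELECT et.entity_id AS term_entity_id,
       ls.lang      AS term_language,
       ls.type      AS term_type,
       s.text       AS term_text
FROM entity_terms et
JOIN langstring ls ON ls.id = et.langstring_id
JOIN string s      ON s.id  = ls.string_id;
""")

conn.execute("INSERT INTO string VALUES (1, 'potato')")
conn.execute("INSERT INTO langstring VALUES (1, 'en', 1, 'label')")
conn.execute("INSERT INTO entity_terms VALUES (42, 1)")

rows = conn.execute("SELECT * FROM wb_terms").fetchall()
# -> [(42, 'en', 'label', 'potato')]
```

On the labs replicas, existing tools querying wb_terms could then keep working against the view during (and for a while after) the migration.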

  • What would the migration look like?

I guess the easiest is to migrate things while writing to both, and at some point, once both sets of tables are in sync, switch the writes to the new tables only?

That would be the easiest IMO.
Something more complex could involve creating a set of filled tables at a point in time (perhaps in the alternate DC), then switching traffic, then running a maintenance script to catch the tables back up to reality, and then starting to read from them (this would likely be faster, but probably more active work?).
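The dual-write phase of either approach could be wrapped in something like the following (the class and method names are hypothetical; real Wikibase code would differ):

```python
class ListStore:
    """Trivial in-memory stand-in for a real term store."""
    def __init__(self):
        self.rows = []

    def insert_term(self, entity_id, lang, term_type, text):
        self.rows.append((entity_id, lang, term_type, text))

    def get_terms(self, entity_id):
        return [r for r in self.rows if r[0] == entity_id]


class DualWriteTermStore:
    """Write to both stores; read from whichever side is active.

    During migration every write goes to the old wb_terms store and the
    new normalized store, so the two stay in sync until reads (and
    eventually writes) are switched over to the new tables only.
    """
    def __init__(self, old_store, new_store, read_from_new=False):
        self.old = old_store
        self.new = new_store
        self.read_from_new = read_from_new

    def insert_term(self, entity_id, lang, term_type, text):
        self.old.insert_term(entity_id, lang, term_type, text)
        self.new.insert_term(entity_id, lang, term_type, text)

    def get_terms(self, entity_id):
        store = self.new if self.read_from_new else self.old
        return store.get_terms(entity_id)
```

Flipping `read_from_new` is then the cheap cut-over step, and dropping the old store (and wb_terms itself) can happen well after that.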

We could also turn term types and languages into short IDs via a hash function: as far as I'm aware, Wikibase only needs the string→ID direction (the hash function), and if we need the ID→string direction (e.g. during manual investigation) we can hash all the known term types / language codes and look for the value we have.

Yup, removing these 2 tables would be possible with some simple hashing.
This would make the tables slightly harder for people to work with on the labsdb replicas however (something to consider).
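The ID→string direction mentioned above can be sketched like this (CRC32 is just an illustrative choice of hash function, and the candidate lists are abbreviated):

```python
from zlib import crc32

# Since the sets of term types and language codes are small and known,
# we can hash every candidate and look for the integer we have in hand.
KNOWN_TYPES = ["label", "description", "alias"]
KNOWN_LANGS = ["en", "de", "fr", "ar"]  # in reality, all supported language codes

def reverse_lookup(hash_int, candidates):
    """Return the string whose hash matches, or None if nothing matches."""
    matches = [c for c in candidates if crc32(c.encode("utf-8")) == hash_int]
    return matches[0] if matches else None

lang = reverse_lookup(crc32(b"de"), KNOWN_LANGS)  # -> "de"
```

This is the manual-investigation path only; normal queries would go string→hash and never need the reverse scan, which is part of why dropping the two lookup tables is feasible.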

Do the DB servers have enough storage space for this?

The old ones certainly do not (i.e. the master or codfw) - we are currently discussing how to proceed with regard to those old servers, but we are at an early stage of that.

I should have been clearer though: my point was that, while you move stuff from one table to the other(s), you keep deleting rows from wb_terms - we might need to optimize the table along the way to actually reclaim all that disk space on the servers.

Deleting rows from wb_terms as we go could be possible, but in the perfect situation we would avoid touching this table at all until the point that we can drop it, to avoid anything odd happening with query plans and index decisions that could throw a spanner in the works.

How much disk does the old codfw master have?


TASK DETAIL
https://phabricator.wikimedia.org/T215902

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Addshore
Cc: Lucas_Werkmeister_WMDE, alaa_wmde, JeroenDeDauw, Ladsgroup, Marostegui, Aklapper, Addshore, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, Jonas, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331
_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
