[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-10-26 Thread gerritbot
gerritbot added a comment. Change 386588 merged by Marostegui: [operations/puppet@production] mediawiki: Disable rebuildTermSqlIndex https://gerrit.wikimedia.org/r/386588

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-10-26 Thread gerritbot
gerritbot added a comment. Change 386588 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani): [operations/puppet@production] mediawiki: Disable rebuildTermSqlIndex https://gerrit.wikimedia.org/r/386588

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-10-25 Thread Ladsgroup
Ladsgroup added a comment. It should; there is a maintenance script running (via cron) to de-duplicate the rows, and it's 90% done.

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-10-11 Thread daniel
daniel added a comment. Do we know where the dupes come from?

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-10-10 Thread Ladsgroup
Ladsgroup added a comment. With the deduplication now past Q11M, the number of duplicate rows has not vanished, but it is close to zero:

mysql:research@analytics-store.eqiad.wmnet [wikidatawiki]> SELECT COUNT(*) FROM wb_terms AS t1 WHERE term_type != 'alias' AND EXISTS(SELECT 1 FROM wb_terms AS t2
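The archived message cuts the query off here. A plausible reconstruction of the full duplicate-count query follows; the exact column list and the use of term_row_id as the tie-breaker are assumptions based on the wb_terms schema, not the verbatim query from the task:

SELECT COUNT(*)
FROM wb_terms AS t1
WHERE t1.term_type != 'alias'
  AND EXISTS (
    SELECT 1
    FROM wb_terms AS t2
    WHERE t2.term_full_entity_id = t1.term_full_entity_id
      AND t2.term_language = t1.term_language
      AND t2.term_type = t1.term_type
      AND t2.term_text = t1.term_text
      -- count t1 only if an older identical copy exists
      AND t2.term_row_id < t1.term_row_id
  );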

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-10-02 Thread gerritbot
gerritbot added a comment. Change 381433 merged by Marostegui: [operations/puppet@production] mediawiki: Maintenance script to clean up duplicates in wb_terms https://gerrit.wikimedia.org/r/381433

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-09-29 Thread gerritbot
gerritbot added a comment. Change 381433 had a related patch set uploaded (by Ladsgroup; owner: Amir Sarabadani): [operations/puppet@production] mediawiki: Maintenance script to clean up duplicates in wb_terms https://gerrit.wikimedia.org/r/381433

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-09-17 Thread Ladsgroup
Ladsgroup added a comment. I meant you as you :) I will make the patch when the current one (populating term_full_entity_id) is done.

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-09-17 Thread jcrespo
jcrespo added a comment. Not sure if with "you", you mean me, but if it is safe, yes. We may have to defragment the table later to reclaim disk space, but that can be done later and it is not a blocker.

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-09-17 Thread Ladsgroup
Ladsgroup added a comment. The maintenance script that we have can do it; it only takes some time. Is that okay with you?

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-09-17 Thread jcrespo
jcrespo added a comment. We identify and delete the duplicate rows (not trivial, but not difficult either), then we add a UNIQUE restriction over that combination of columns so that this never happens again.
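A rough sketch of that plan in SQL; the column combination, the key name wb_terms_no_dupes, and the use of term_row_id as a tie-breaker are assumptions, and in production the delete would run in small batches rather than as one statement:

-- 1. Delete every row for which an older, identical row exists.
DELETE t1
FROM wb_terms AS t1
JOIN wb_terms AS t2
  ON  t2.term_full_entity_id = t1.term_full_entity_id
  AND t2.term_language = t1.term_language
  AND t2.term_type = t1.term_type
  AND t2.term_text = t1.term_text
  AND t2.term_row_id < t1.term_row_id;

-- 2. Add a UNIQUE restriction so the same combination cannot be inserted twice again.
ALTER TABLE wb_terms
  ADD UNIQUE KEY wb_terms_no_dupes (term_full_entity_id, term_language, term_type, term_text);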

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-09-16 Thread Ladsgroup
Ladsgroup added a comment. There are two problems right now. 1- We have duplicate rows:

mysql:wikiadmin@db1087 [wikidatawiki]> select * from wb_terms where term_full_entity_id = 'Q134092' and term_language = 'en';

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-09-16 Thread Krinkle
Krinkle added a comment. In T163551#3612317, @Lydia_Pintscher wrote: "That's definitely not how it should be. There should only ever be one description for a given item and language. Is this also caused by the duplicate rows?" I'm not sure. I was going to file a new task, but found this existing

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-09-16 Thread Lydia_Pintscher
Lydia_Pintscher added a comment. That's definitely not how it should be. There should only ever be one description for a given item and language. Is this also caused by the duplicate rows?

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-09-16 Thread Krinkle
Krinkle added a comment. When editing a page in VisualEditor, I sometimes notice that the description (from Wikidata's pageterms API) is duplicated. It results in the following rendering: F9586347: Screen Shot 2017-09-16 at 20.51.29.png The joining by a single comma also looks odd, but is

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-06-23 Thread Ladsgroup
Ladsgroup added a comment. I confirm that with this patch merged, when I edit terms that are duplicated, the new ones don't get duplicated; so if we wait for a while, most of the duplicates will probably get fixed automagically.

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-06-07 Thread gerritbot
gerritbot added a comment. Change 354550 merged by jenkins-bot: [mediawiki/extensions/Wikibase@master] Prevent saving duplicate Terms in TermSqlIndex::insertTerms https://gerrit.wikimedia.org/r/354550
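The actual fix lives in Wikibase's PHP (TermSqlIndex::insertTerms), but the idea can be sketched in SQL: only insert a term row if an identical one is not already there. The values and the abridged column list below are made up for illustration; this is not the code from the patch:

-- Hypothetical values; the column list is abridged for the example.
INSERT INTO wb_terms (term_full_entity_id, term_language, term_type, term_text)
SELECT 'Q134092', 'en', 'label', 'example label'
FROM DUAL
WHERE NOT EXISTS (
  SELECT 1 FROM wb_terms
  WHERE term_full_entity_id = 'Q134092'
    AND term_language = 'en'
    AND term_type = 'label'
    AND term_text = 'example label'
);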

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-05-31 Thread hoo
hoo added a comment. My suggestion is, once the fix for not introducing new duplicates (https://gerrit.wikimedia.org/r/354550) is in, to list all duplicate row ids in a text file (generated via a query similar to the one I already mentioned). Then we can delete these in batches using a maintenance script.
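One way to generate that list and run the batched deletes (this GROUP BY variant is only an illustration, not necessarily the query hoo refers to; batch size and pacing would be tuned against replication lag):

-- List groups that have more than one identical row; keep the lowest row id per group.
SELECT MIN(term_row_id) AS keep_id, COUNT(*) AS copies
FROM wb_terms
WHERE term_type != 'alias'
GROUP BY term_full_entity_id, term_language, term_type, term_text
HAVING COUNT(*) > 1;

-- Then delete the redundant row ids from the text file in small batches, e.g.:
DELETE FROM wb_terms WHERE term_row_id IN (/* next ~1000 ids from the file */);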

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-05-19 Thread gerritbot
gerritbot added a comment. Change 354550 had a related patch set uploaded (by Hoo man; owner: Hoo man): [mediawiki/extensions/Wikibase@master] Prevent saving duplicate Terms in TermSqlIndex::insertTerms https://gerrit.wikimedia.org/r/354550

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-05-19 Thread jcrespo
jcrespo added a comment. To clarify: reads on a slave are not a big concern for MySQL. Of course, if you get better latency in the end, that is cool (and I normally ping because it means there is an inefficiency that could be solved); but reads are easy to scale in the large order of things ("just

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-04-28 Thread jcrespo
jcrespo added a comment. This is a bit off-topic for T163551, but with the latest schema changes wb_terms has become the largest table on any wiki (with the exception of revision on enwiki and image on commons), and I think it will get bigger once the new column (I assume) gets populated with actual

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-04-24 Thread hoo
hoo added a comment. Probably at least 10% of the table are duplicates (at least 129M/2 rows).

mysql:wikiadmin@db1070 [wikidatawiki]> SELECT COUNT(*) FROM wb_terms;
+-----------+
| COUNT(*)  |
+-----------+
| 613038548 |
+-----------+
1 row in set (2 min 41.65 sec)

[Wikidata-bugs] [Maniphest] [Commented On] T163551: Huge number of duplicate rows in wb_terms

2017-04-24 Thread hoo
hoo added a comment. Number of rows which occur at least twice (excludes aliases, for performance reasons):

mysql:wikiadmin@db1070 [wikidatawiki]> SELECT COUNT(*) FROM wb_terms AS t1 WHERE term_type != 'alias' AND EXISTS(SELECT 1 FROM wb_terms AS t2 USE INDEX(wb_terms_entity_id) WHERE