Addshore created this task.
Addshore added projects: Wikidata, Wikibase-Quality, TechCom-RFC, wikidata-tech-focus.
Restricted Application added a subscriber: Aklapper.

TASK DESCRIPTION

This RFC is a result of T204024, and specifically of the request for an RFC in T204024#4891344.

Vocabulary

WBQC: WikibaseQualityConstraints, a MediaWiki extension deployed on wikidata.org
WDQS: The Wikidata Query Service, https://query.wikidata.org

Current situation

WBQC runs checks on Wikidata entities on demand from users.
Check results are stored in memcached with a default TTL of 86400 seconds (1 day).

WBQC checks are accessible via 3 methods: the special page, the API, and the RDF action.

The special page and the API can be used by users directly. The API is also called whenever a logged-in user visits an entity page, so that the results can be displayed on that page.
API requests trigger a constraint check run if the stored data for the entity is out of date, was never stored, or has been evicted.
The special page currently always re-runs the constraint checks; it neither loads from the cache nor stores to it.
The RDF action exists for WDQS and never triggers a constraint check run; it only returns the RDF representation of the currently stored constraint check results.
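
The behaviour of the three access paths described above can be sketched roughly as follows. This is a minimal Python illustration, not WBQC code: `ResultCache`, `run_checks`, and the three handler names are hypothetical, an in-memory dict stands in for memcached, and the check for changed dependencies is left out.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

TTL_SECONDS = 24 * 60 * 60  # one day, matching the default TTL described above

Results = List[str]


class ResultCache:
    """In-memory stand-in for memcached (hypothetical, for illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[float, Results]] = {}

    def get(self, entity_id: str) -> Optional[Results]:
        entry = self._data.get(entity_id)
        if entry is None:
            return None  # never stored, or evicted
        expires_at, results = entry
        if self._clock() >= expires_at:
            del self._data[entity_id]  # TTL elapsed
            return None
        return results

    def set(self, entity_id: str, results: Results) -> None:
        self._data[entity_id] = (self._clock() + TTL_SECONDS, results)


def api_request(entity_id: str, cache: ResultCache,
                run_checks: Callable[[str], Results]) -> Results:
    """API: serve stored results, running and storing checks on a miss."""
    results = cache.get(entity_id)
    if results is None:
        results = run_checks(entity_id)
        cache.set(entity_id, results)
    return results


def special_page_request(entity_id: str,
                         run_checks: Callable[[str], Results]) -> Results:
    """Special page: always re-run; never read from or write to the cache."""
    return run_checks(entity_id)


def rdf_request(entity_id: str, cache: ResultCache) -> Optional[Results]:
    """RDF action: return only what is stored; never trigger a check run."""
    return cache.get(entity_id)
```

In this sketch only `api_request` ever populates the cache, which is why the RDF action can come back empty for an entity that no logged-in user has visited recently.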

When a result is retrieved from the cache, the WBQC extension uses built-in logic to determine whether the stored result needs to be updated (because something in the dependency graph has changed).
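
One way such a freshness check could work is sketched below. The actual WBQC mechanism is not described here, so everything in this example is an assumption: the stored result is imagined to carry the revision ID of every entity the checks read, and `StoredResult` and `needs_update` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Mapping


@dataclass
class StoredResult:
    entity_id: str
    results: List[str]
    # Assumed metadata: revision ID of each entity the checks depended on
    # at the time they ran (including the checked entity itself).
    dependency_revisions: Dict[str, int] = field(default_factory=dict)


def needs_update(stored: StoredResult,
                 latest_revisions: Mapping[str, int]) -> bool:
    """Return True if any dependency changed or vanished since the checks ran."""
    for dep_id, seen_revision in stored.dependency_revisions.items():
        # A missing entry yields None, which never equals a stored revision,
        # so a deleted dependency also marks the result as stale.
        if latest_revisions.get(dep_id) != seen_revision:
            return True
    return False
```

For example, a result stored with `{"Q1": 10, "P31": 5}` would be considered stale as soon as either Q1 or P31 has a newer revision.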

We are in the process of rolling out a job that will run constraint checks for an entity after each edit, rather than only on demand by a user (T204031).

Once constraint checks are stored more persistently, we will be able to expose an event queue of newly generated check results for ingestion into WDQS (T201147).
Loading or reloading data into WDQS will also require a dump of all constraint checks.

5644 out of 5767 properties on Wikidata currently have constraints that need to be checked.
Roughly 1.85 million items currently have no statements, leaving 52.05 million items that do have statements and need to have constraint checks run.
Constraint checks also run on Properties and Lexemes, but their number is negligible compared with Items.

Constraint checks on an item can take widely varying amounts of time to execute, depending on the constraints used. Full constraint checks are logged if they take longer than 5 seconds (INFO) or 55 seconds (WARNING), and the performance of all constraint checks is monitored on Grafana.
Some full constraint checks reach the current interactive PHP time limit while being generated for special pages or the API.
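
The two logging thresholds can be illustrated with a short Python sketch. This is not the extension's code: the logger name, the function names, and the message format are made up for the example; only the 5 second and 55 second limits come from the description above.

```python
import logging
from typing import Optional

INFO_THRESHOLD_SECONDS = 5.0      # longer than this: logged at INFO
WARNING_THRESHOLD_SECONDS = 55.0  # longer than this: logged at WARNING

logger = logging.getLogger("constraint_checks")  # hypothetical logger name


def log_level_for(duration_seconds: float) -> Optional[int]:
    """Map a full check's duration to a log level, or None if not logged."""
    if duration_seconds > WARNING_THRESHOLD_SECONDS:
        return logging.WARNING
    if duration_seconds > INFO_THRESHOLD_SECONDS:
        return logging.INFO
    return None


def log_slow_check(entity_id: str, duration_seconds: float) -> None:
    """Log a full constraint check only if it exceeded a threshold."""
    level = log_level_for(duration_seconds)
    if level is not None:
        logger.log(level, "Full constraint check for %s took %.1f s",
                   entity_id, duration_seconds)
```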

Problem statement

Primary problem statement:

  • Constraint check results need to be loaded into WDQS, but there is currently no full set of constraint check results for all entities on Wikidata.

Secondary problem statements:

  • Generating constraint reports only when the user requests them leads to a bad user experience, as users must wait a prolonged amount of time.
  • Users can flood the API, generating constraint checks for many entities and putting unnecessary load on app servers.

Solution proposal

  • Rather than defaulting to running constraint checks upon a user's request, primarily pre-generate constraint check results after each edit using the job queue (T204031).
  • Rather than storing constraint check results in memcached, store them in a more permanent storage solution.
  • When new constraint check results are stored, fire an event for WDQS to listen to, so that it can load the new constraint check data.
  • Dump constraint check data from the persistent storage, to allow for dumping to file and loading into WDQS.
  • Use the same logic that currently exists to determine whether the stored constraint check data needs updating when it is retrieved.
  • Possibly alter the special page to load from the cache and to show the timestamp of when the checks were run, with a button on the page to manually purge the stored checks and re-run them (to get the latest results).
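
The proposed flow (post-edit job, persistent storage, event for WDQS, dump) can be sketched end to end as below. This is only an illustration under stated assumptions: `PersistentStore`, `post_edit_job`, and `run_checks` are hypothetical names, a dict stands in for the unspecified storage backend, plain callbacks stand in for the event queue, and JSON lines stand in for whatever dump format (presumably RDF) WDQS would actually consume.

```python
import json
from typing import Callable, Dict, Iterator, List

Results = List[str]
Listener = Callable[[str, Results], None]


class PersistentStore:
    """Dict-backed stand-in for the 'more permanent storage' (hypothetical)."""

    def __init__(self) -> None:
        self._rows: Dict[str, Results] = {}
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        """Register a consumer of 'results stored' events (e.g. a WDQS updater)."""
        self._listeners.append(listener)

    def save(self, entity_id: str, results: Results) -> None:
        """Store results, then fire one event per registered listener."""
        self._rows[entity_id] = results
        for listener in self._listeners:
            listener(entity_id, results)

    def dump(self) -> Iterator[str]:
        """Yield one serialized line per entity, for a full load or reload."""
        for entity_id in sorted(self._rows):
            yield json.dumps({"entity": entity_id,
                              "results": self._rows[entity_id]})


def post_edit_job(entity_id: str, store: PersistentStore,
                  run_checks: Callable[[str], Results]) -> None:
    """Job queued after an edit: run the checks and persist the results."""
    store.save(entity_id, run_checks(entity_id))
```

Firing the event from inside `save` keeps the incremental feed and the dump consistent with each other in this sketch, because both are derived from the same stored rows.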

Note: Even when constraint checks are run after every entity edit, the persistently stored data will slowly become out of date (and therefore so will the data stored by WDQS). The issue of a single edit needing to trigger constraint checks on multiple entities is considered a separate issue and is not in the scope of this RFC.


TASK DETAIL
https://phabricator.wikimedia.org/T214362

To: Addshore
Cc: Aklapper, Addshore, Nandana, kostajh, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, merbst, LawExplorer, _jensen, D3r1ck01, SBisson, Wikidata-bugs, aude, GWicke, jayvdb, fbstj, santhosh, Jdforrester-WMF, Mbch331, Rxy, Jay8g, Ltrlg, bd808, Legoktm