Addshore created this task. Addshore added projects: Analytics, Dumps-Generation, Wikidata, wdwb-tech.
TASK DESCRIPTION
Wikidata dumps currently come directly from the SQL servers. The general process is to iterate through all pages and slowly write all content to files (possibly in multiple threads).

An alternative solution could be for Wikidata to produce two event streams, one of RDF output and one of JSON output, to Hadoop, if T120242: Consistent MediaWiki state change events | MediaWiki events as source of truth <https://phabricator.wikimedia.org/T120242> and T215001: Revisions missing from mediawiki_revision_create <https://phabricator.wikimedia.org/T215001> are complete.

To avoid waiting for T120242 <https://phabricator.wikimedia.org/T120242> or T215001 <https://phabricator.wikimedia.org/T215001>, this could be implemented differently: a service would take a reliable and consistent input (such as MediaWiki recent changes) and populate a reliable stream of content in Kafka by making requests to Wikidata for that content.

Dumps could then be created directly from Hadoop, which I imagine would take far less time, allowing users to get fresher data and also benefiting services such as #wikidata-query-service <https://phabricator.wikimedia.org/tag/wikidata-query-service/>, which sometimes have to reload from dumps.

If we could quickly push this data to Kafka too, we would likely see some reduction in load on the s8 db servers, as dump generation would no longer need to run. I'm sure #dba <https://phabricator.wikimedia.org/tag/dba/> would appreciate this.

The new query service Flink updater could also make use of the RDF stream, instead of using mediawiki revision create events and then requesting Special:EntityData.
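As a rough illustration of the service idea above, here is a minimal Python sketch of the step that turns a change notification into content records for the two streams. Everything named here is an assumption for illustration only: the `Change` and `Record` types, the `content_records` function, and the topic names `wikidata.entity.json` and `wikidata.entity.rdf` are hypothetical, and the fetcher is passed in so that no particular HTTP or Kafka client is implied. The only existing piece it relies on is Special:EntityData, requested per entity and revision.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple

ENTITY_DATA = "https://www.wikidata.org/wiki/Special:EntityData"

# Hypothetical topic names, one per output format.
DEFAULT_FORMATS: Tuple[Tuple[str, str], ...] = (
    ("json", "wikidata.entity.json"),
    ("ttl", "wikidata.entity.rdf"),
)


@dataclass(frozen=True)
class Change:
    """One entry from the input feed (e.g. recent changes). Hypothetical shape."""
    entity_id: str
    revision: int


@dataclass(frozen=True)
class Record:
    """One message destined for a content stream. Hypothetical shape."""
    topic: str
    key: str
    revision: int
    body: str


def entity_data_url(entity_id: str, revision: int, fmt: str) -> str:
    """Build the Special:EntityData URL for a specific revision and format."""
    return f"{ENTITY_DATA}/{entity_id}.{fmt}?revision={revision}"


def content_records(
    changes: Iterable[Change],
    fetch: Callable[[str], str],
    formats: Tuple[Tuple[str, str], ...] = DEFAULT_FORMATS,
) -> Iterator[Record]:
    """For each change, fetch the content in every format and yield one record per format."""
    for change in changes:
        for fmt, topic in formats:
            url = entity_data_url(change.entity_id, change.revision, fmt)
            yield Record(topic, change.entity_id, change.revision, fetch(url))
```

In a real service, `fetch` would be an HTTP GET against Wikidata and each yielded record would be handed to a Kafka producer, keyed by entity ID so that the latest revision per entity can be picked out when building a dump. Neither of those parts is shown here.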
TASK DETAIL https://phabricator.wikimedia.org/T291089
