GoranSMilovanovic added a comment.
@Lydia_Pintscher @Lea_WMDE @WMDE-leszek The data that you are looking for are **extremely** difficult to obtain. The only way that works - or at least the only one that I was able to discover - is to parse the revisions from the Mediawiki wikitext history <https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Content/Mediawiki_wikitext_history> table in the Data Lake, which represents the "//... the full-historical-revision wikitext history of WMF's wikis, as provided through monthly XML Dumps//".

However, the data there are not structured, so I will be parsing revisions with regular expressions to figure out when the constraints specified in the ticket description are met. Adding a further layer of complexity, some useful regex functions (e.g. regexp_extract_all <https://spark.apache.org/docs/latest/api/sql/#regexp_extract_all>) are not available in the version of Apache Spark currently running on our Analytics Cluster. That means that I need to work partly in the Analytics Cluster (PySpark to extract the data with some basic filtering) and partly on the Analytics Clients (Python or R to process the data to meet the definitions of the constraints that you have specified). At this point, even figuring out the correct repartitioning of the dataset, just in order to be able to store it efficiently to HDFS and then process it in-memory from the Analytics Clients, turns out to be very complicated.

That being said: I am focused on this very much, but I cannot promise that this will be finished as soon as I had expected.

TASK DETAIL
https://phabricator.wikimedia.org/T278698

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: GoranSMilovanovic
Cc: WMDE-leszek, Aklapper, GoranSMilovanovic, Lea_WMDE, Lydia_Pintscher, Invadibot, maantietaja, Akuckartz, Nandana, Lahi, Gq86, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
_______________________________________________ Wikidata-bugs mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
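
For illustration only: the client-side step mentioned in the comment (collecting all regex matches from a revision's text in plain Python, since `regexp_extract_all` is not available on the cluster's Spark version) might look roughly like the sketch below. The pattern, the helper name `extract_all`, and the sample text are all hypothetical; the actual constraints are defined in the ticket description and are not reproduced here.

```python
import re

# Hypothetical pattern: matches property-ID-like tokens such as "P31".
# The real patterns would encode the constraints from the ticket description.
PROPERTY_ID = re.compile(r"\bP\d+\b")


def extract_all(pattern, text):
    """Return every match of `pattern` in `text`, in order of appearance.

    Similar in spirit to Spark SQL's regexp_extract_all, but run on the
    client after the revisions have been exported from the cluster.
    """
    if text is None:  # revisions with missing text yield no matches
        return []
    return [m.group(0) for m in pattern.finditer(text)]


# Made-up revision text, used only to show the call.
revision_text = "claims: P31 -> Q5; P569 -> 1952; P31 again"
print(extract_all(PROPERTY_ID, revision_text))  # ['P31', 'P569', 'P31']
```

Duplicates are kept on purpose, so that counting or de-duplicating matches stays a separate decision for the later processing step.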
