GoranSMilovanovic added a comment.
@WMDE-leszek @Lydia_Pintscher Thank you for your suggestions. Please let me first try to accomplish this by relying on the approach described in T278698#6986358 <https://phabricator.wikimedia.org/T278698#6986358>: it seems doable and I have already invested quite some time into it.

@WMDE-leszek

> As I got it, you are kind of reconstructing the state of each lexeme from the wikitext history table to have its JSON structure at the requested point in time?

Yes, and that is the reason why the dumps do not work here. The formulation of the problem says "//...as of Jan 1st 2021//" and that means that I should not take into account any changes made prior to 2021/01/01. As a dump is a snapshot of the current state at some point in time, there is no way to figure out from it what happened before and what happened after a certain date. Comparing the dumps would then do, but that already sounds like processing overkill.

> Would be significantly less efficient to use JSON dump of Wikidata lexemes and items (in this case historical one, as we are looking for data on the past state),

Unfortunately yes, even if the dumps could help here. Processing the JSON dump from Python or R takes forever. Processing the dump with Apache Spark in the cluster is efficient, but the current snapshot of the wmf.wikidata_entity <https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Content/Wikidata_entity> - the hdfs import of the JSON dump - //does not encompass lexemes//.

@Lydia_Pintscher

> Ah wait that'll be better: https://archive.org/details/wikibase-wikidatawiki-20210101 The previous one might not contain Lexemes. I hope this one does.

Please see my previous response to Leszek.

@Lydia_Pintscher @WMDE-leszek I have seriously underestimated this task by relying on the false assumption that the lexemes would be found in the hdfs copy of the JSON dump in the WMF Data Lake. If that were the case, comparing successive snapshots would solve the problem. However...
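
For illustration only, here is a minimal sketch of the "state at the requested point in time" idea quoted above. It is not the code from T278698#6986358. It assumes the revision history is already available as plain records, and that the reference date acts as an upper bound (assumed to be midnight UTC). The function name and the keys `entity_id`, `rev_id` and `timestamp` are hypothetical and not taken from any Wikimedia table.

```python
from datetime import datetime, timezone

# Reference date from the task formulation; midnight UTC is an assumption.
CUTOFF = datetime(2021, 1, 1, tzinfo=timezone.utc)


def latest_revision_as_of(revisions, cutoff=CUTOFF):
    """Map each entity id to its newest revision not later than `cutoff`.

    `revisions` is an iterable of dicts with the hypothetical keys
    'entity_id', 'rev_id' and 'timestamp' (a timezone-aware datetime).
    """
    latest = {}
    for rev in revisions:
        if rev["timestamp"] > cutoff:
            continue  # revision is newer than the reference date: skip it
        best = latest.get(rev["entity_id"])
        # Order by timestamp, breaking ties with the revision id.
        if best is None or (rev["timestamp"], rev["rev_id"]) > (
            best["timestamp"],
            best["rev_id"],
        ):
            latest[rev["entity_id"]] = rev
    return latest
```

The JSON content of each selected revision would then stand in for the lexeme's state at the reference date. Lexemes with no revision on or before the cutoff are simply absent from the result.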

TASK DETAIL
  https://phabricator.wikimedia.org/T278698

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: GoranSMilovanovic
Cc: WMDE-leszek, Aklapper, GoranSMilovanovic, Lea_WMDE, Lydia_Pintscher, Invadibot, maantietaja, Akuckartz, Nandana, Lahi, Gq86, QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
