GoranSMilovanovic added a comment.

  @WMDE-leszek @Lydia_Pintscher
  
  Thank you for your suggestions. Please let me first try to accomplish this by 
relying on the approach described in T278698#6986358 
<https://phabricator.wikimedia.org/T278698#6986358>: it seems doable and I have 
already invested quite some time into it.
  
  @WMDE-leszek
  
  > As I got it, you are kind of reconstructing the state of each lexeme from 
the wikitext history table to have its JSON structure at the requested point in 
time?
  
  Yes, and that is why the dumps do not work here. The problem statement 
says "//...as of Jan 1st 2021//", which means that I should not take into 
account any changes made after 2021/01/01. Since a dump is a snapshot of the 
current state at some point in time, there is no way to tell from it what 
happened before and what happened after a certain date. Comparing successive 
dumps would do, but that already sounds like processing overkill.
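  
  To illustrate the idea of picking the state "as of" a date from a revision 
history, here is a minimal sketch. It is not the actual pipeline: the record 
layout (`page_id`, `timestamp`, `text`), the function name `state_as_of` and 
the choice to include revisions made exactly at the cutoff instant are all 
assumptions made for the example.

```python
from datetime import datetime, timezone

# Assumed cutoff: the very start of Jan 1st 2021 (UTC).
CUTOFF = datetime(2021, 1, 1, tzinfo=timezone.utc)


def parse_ts(value):
    """Parse an ISO-like UTC timestamp such as '2020-12-31T23:59:59Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def state_as_of(revisions, cutoff=CUTOFF):
    """Return {page_id: revision} holding, for each page, the latest
    revision made at or before the cutoff. Pages whose first revision is
    later than the cutoff are left out entirely."""
    latest = {}
    for rev in revisions:
        ts = parse_ts(rev["timestamp"])
        if ts > cutoff:
            continue  # ignore anything that happened after the cutoff
        best = latest.get(rev["page_id"])
        if best is None or ts > best[0]:
            latest[rev["page_id"]] = (ts, rev)
    return {page_id: rev for page_id, (_, rev) in latest.items()}
```

  Called on a list of such (hypothetical) records, this returns one revision 
per lexeme page, whose content would then still need to be parsed as JSON.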
  
  > Would be significantly less efficient to use JSON dump of Wikidata lexemes 
and items (in this case historical one, as we are looking for data on the past 
state),
  
  Unfortunately yes, even if the dumps could help here. Processing the JSON 
dump from Python or R takes forever. Processing the dump with Apache Spark in 
the cluster is efficient, but the current snapshot of wmf.wikidata_entity 
<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Content/Wikidata_entity>
 - the HDFS import of the JSON dump - //does not encompass lexemes//.
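  
  For reference, a minimal sketch of what single-process handling of such a 
dump could look like in Python. It assumes the usual layout of the Wikidata 
JSON dumps (one big JSON array with one entity per line) and that lexeme 
entities carry `"type": "lexeme"`; the function name `iter_lexemes` is made up 
for the example.

```python
import json


def iter_lexemes(lines):
    """Yield lexeme entities from an iterable of dump lines, one entity
    per line, without loading the whole array into memory."""
    for line in lines:
        line = line.strip().rstrip(",")  # each entity line ends with a comma
        if not line or line in ("[", "]"):
            continue  # skip blank lines and the array's opening/closing brackets
        entity = json.loads(line)
        if entity.get("type") == "lexeme":
            yield entity
```

  The `lines` argument can be an open file object, for example one returned 
by `gzip.open(path, "rt")` for a compressed dump. This only keeps memory use 
flat; it still reads the dump sequentially in a single process.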
  
  @Lydia_Pintscher
  
  > Ah wait that'll be better: 
https://archive.org/details/wikibase-wikidatawiki-20210101 The previous one 
might not contain Lexemes. I hope this one does.
  
  Please see my previous response to Leszek.
  
  @Lydia_Pintscher @WMDE-leszek 
  I have seriously underestimated this task by relying on the false assumption 
that the lexemes would be found in the HDFS copy of the JSON dump in the WMF 
Data Lake. If that were the case, comparing successive snapshots would solve 
the problem. However...

TASK DETAIL
  https://phabricator.wikimedia.org/T278698
