GoranSMilovanovic added a comment.

  @JAllemandou Thank you. I have already considered using `stat.crosstab`; see 
T214897#5024647 <https://phabricator.wikimedia.org/T214897#5024647>. Spark 
handled the ETL part here and produced the data successfully.
  
  >   I don't have a good understanding of what you're after
  
  In a nutshell:
  
  - 250M rows = item x property pairs = two columns;
  - build a contingency table `unique(items)` x `unique(properties)`;
  - it will be binary, since every property considered occurs on an item either 
zero times or once;
  - compute a `property x property` Jaccard distance matrix from the binary 
contingencies.
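  
  For illustration only, the steps above can be sketched on a single machine 
with a sparse matrix. This is a minimal sketch, not the pipeline used in this 
task: the function name `jaccard_from_pairs` and the toy input are 
hypothetical, and it assumes the number of distinct properties is small enough 
for a dense `property x property` result. Whether it fits in memory at 250M 
rows is not something this sketch establishes.

```python
import numpy as np
from scipy import sparse


def jaccard_from_pairs(items, properties):
    """Hypothetical helper: Jaccard similarity between properties.

    `items` and `properties` are equal-length sequences, one entry per
    (item, property) pair. Returns (property_labels, similarity_matrix).
    """
    # Map labels to integer indices; the inverse arrays are the row/column ids.
    item_labels, item_idx = np.unique(items, return_inverse=True)
    prop_labels, prop_idx = np.unique(properties, return_inverse=True)

    # Sparse items x properties contingency table. Duplicate pairs are summed
    # on construction, so clamp the stored values back to 1 to keep it binary.
    ones = np.ones(len(item_idx), dtype=np.int64)
    table = sparse.csr_matrix(
        (ones, (item_idx, prop_idx)),
        shape=(len(item_labels), len(prop_labels)),
    )
    table.data[:] = 1

    # Entry (i, j) counts the items that carry both property i and property j.
    intersection = (table.T @ table).toarray()
    per_property = intersection.diagonal()

    # |A or B| = |A| + |B| - |A and B|; never zero, because every property
    # in the table occurs on at least one item.
    union = per_property[:, None] + per_property[None, :] - intersection
    return prop_labels, intersection / union


# Toy data (made up): P1 is on Q1, Q2, Q3; P2 is on Q1, Q2.
labels, similarity = jaccard_from_pairs(
    ["Q1", "Q1", "Q2", "Q2", "Q3"],
    ["P1", "P2", "P1", "P2", "P1"],
)
distance = 1.0 - similarity  # Jaccard distance
```

  The dense contingency table is never materialized here; only the sparse 
table and the `property x property` product are held in memory.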
  
  I will have to sample; at this point I don't see a workaround.

TASK DETAIL
  https://phabricator.wikimedia.org/T214897
