GoranSMilovanovic added a comment.
@JAllemandou Thank you. I have already considered using `stat.crosstab`; see T214897#5024647 <https://phabricator.wikimedia.org/T214897#5024647>. Spark did the ETL part here and produced the data successfully.

> I don't have a good understanding of what you're after

In a nutshell:
- 250M rows = item x property pairs = two columns;
- build a contingency table `unique(items)` x `unique(properties)`;
- it will be binary, since every considered property matches an item either zero times or exactly once;
- compute a `property x property` Jaccard similarity (distance) matrix from the binary contingencies.

And I will have to sample; at this point I don't see a workaround.

TASK DETAIL
https://phabricator.wikimedia.org/T214897

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: GoranSMilovanovic
Cc: RazShuty, Addshore, JAllemandou, Aklapper, GoranSMilovanovic, Lydia_Pintscher, alaa_wmde, Nandana, Lahi, Gq86, QZanden, LawExplorer, _jensen, rosalieper, Wikidata-bugs, aude, Mbch331
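
For illustration only, the steps listed in the comment above (item x property pairs, a binary contingency, then a property x property Jaccard matrix) can be sketched on a toy scale. This is not the Spark job discussed in the task: it is plain Python, the function name `jaccard_matrix` and the sample item/property labels are hypothetical, and a pairwise loop like this would not scale to 250M rows without sampling or a sparse/distributed approach.

```python
from collections import defaultdict
from itertools import combinations


def jaccard_matrix(pairs):
    """Build a property x property Jaccard similarity matrix.

    `pairs` is an iterable of (item, property) tuples. Keeping a set of
    items per property is equivalent to one column of the binary
    contingency table, because a property matches an item 0 or 1 times.
    """
    items_by_property = defaultdict(set)
    for item, prop in pairs:
        items_by_property[prop].add(item)

    props = sorted(items_by_property)
    # Start with zeros everywhere and ones on the diagonal.
    sim = {p: {q: 0.0 for q in props} for p in props}
    for p in props:
        sim[p][p] = 1.0

    # Jaccard similarity = |A intersect B| / |A union B|; it is symmetric.
    for p, q in combinations(props, 2):
        a, b = items_by_property[p], items_by_property[q]
        score = len(a & b) / len(a | b)
        sim[p][q] = sim[q][p] = score
    return props, sim


# Hypothetical toy data: (item, property) pairs.
pairs = [
    ("Q1", "P31"), ("Q1", "P21"),
    ("Q2", "P31"), ("Q2", "P21"),
    ("Q3", "P31"),
    ("Q4", "P569"),
]
props, sim = jaccard_matrix(pairs)
for p in props:
    print(p, [round(sim[p][q], 3) for q in props])
```

The corresponding Jaccard distance is `1 - sim[p][q]`, so either the similarity or the distance form can be derived from the same sets.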
_______________________________________________ Wikidata-bugs mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
