GoranSMilovanovic added a comment.
@RazShuty @Lydia_Pintscher @JAllemandou Our approach here will be to use an approx. 1M-sized sample of WD items to produce the identifier x identifier Jaccard distance matrix.

Rationale:
- the dataset as produced in Spark has 250M rows x two columns (item-identifier pairs);
- the desired binary contingency matrix to compute the Jaccard distances is of approx. dimension 26M x 1000K+;
- due to internal constraints, Spark `stat.crosstab()` cannot produce the binary contingency matrix that we need to compute the Jaccard distances;
- while R {data.table} can manage the dataset, it still cannot produce the desired contingency matrix;
- moreover, even if we could have the contingency matrix produced in an efficient manner, it is questionable what procedure could deliver the Jaccard distances efficiently.

The results of the following experiment indicate that we can safely proceed with sampling:
- take 10 random samples from the ~250M items x identifiers pairs
  - by sampling identifiers proportionally (i.e. compute p(identifier), weight the identifier sample appropriately)
  - and including one observation for each identifier with p = 0 (due to rounding, not due to the absence of the identifier);
- for each sample, produce a binary contingency matrix;
- from each contingency matrix, compute all pair-wise identifier-identifier Jaccard distances and store them as a vector;
- compute Pearson correlation coefficients between the distance vectors obtained from the 10 random samples.

Here's the correlation matrix; evidently, ~1M sized proportional random samples of item-property pairs are quite representative of the approx.
26M item-property pairs dataset:

| | sample1 | sample2 | sample3 | sample4 | sample5 | sample6 | sample7 | sample8 | sample9 | sample10 |
| sample1 | 1 | 0.9992 | 0.9992 | 0.9989 | 0.9991 | 0.9991 | 0.9992 | 0.9992 | 0.999 | 0.9992 |
| sample2 | 0.9992 | 1 | 0.9995 | 0.9992 | 0.9996 | 0.9992 | 0.9994 | 0.9995 | 0.9991 | 0.9994 |
| sample3 | 0.9992 | 0.9995 | 1 | 0.9993 | 0.9996 | 0.9993 | 0.9994 | 0.9996 | 0.9992 | 0.9994 |
| sample4 | 0.9989 | 0.9992 | 0.9993 | 1 | 0.9992 | 0.9991 | 0.9992 | 0.9991 | 0.999 | 0.9991 |
| sample5 | 0.9991 | 0.9996 | 0.9996 | 0.9992 | 1 | 0.9994 | 0.9994 | 0.9995 | 0.9992 | 0.9994 |
| sample6 | 0.9991 | 0.9992 | 0.9993 | 0.9991 | 0.9994 | 1 | 0.9993 | 0.9992 | 0.9991 | 0.9991 |
| sample7 | 0.9992 | 0.9994 | 0.9994 | 0.9992 | 0.9994 | 0.9993 | 1 | 0.9994 | 0.9992 | 0.9994 |
| sample8 | 0.9992 | 0.9995 | 0.9996 | 0.9991 | 0.9995 | 0.9992 | 0.9994 | 1 | 0.9992 | 0.9995 |
| sample9 | 0.999 | 0.9991 | 0.9992 | 0.999 | 0.9992 | 0.9991 | 0.9992 | 0.9992 | 1 | 0.9991 |
| sample10 | 0.9992 | 0.9994 | 0.9994 | 0.9991 | 0.9994 | 0.9991 | 0.9994 | 0.9995 | 0.9991 | 1 |

Next steps:
- (1) proceed to produce the dataset;
- (2) resolve the ticket and proceed to visualization: T204440 <https://phabricator.wikimedia.org/T204440>.

TASK DETAIL
  https://phabricator.wikimedia.org/T214897

To: GoranSMilovanovic
Cc: RazShuty, Addshore, JAllemandou, Aklapper, GoranSMilovanovic, Lydia_Pintscher, alaa_wmde, Nandana, Lahi, Gq86, QZanden, LawExplorer, _jensen, rosalieper, Wikidata-bugs, aude, Mbch331
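
For readers who want to try the sampling check from the comment on a small scale, here is a minimal Python sketch of the described steps (proportional sample, binary contingency structure, pair-wise Jaccard distances, Pearson correlation between samples). It is not the code used for the task, which ran in Spark and R. The function names, the toy data, and the "at least one pair per identifier" rule are assumptions made for illustration. The binary contingency matrix is represented here as one set of items per identifier, which carries the same information as a 0/1 column.

```python
import random
from collections import defaultdict
from itertools import combinations
from math import sqrt


def proportional_sample(pairs, fraction, rng):
    """Sample (item, identifier) pairs per identifier, proportionally to its frequency.

    Assumption: an identifier whose share rounds to zero still gets one pair.
    """
    by_identifier = defaultdict(list)
    for item, identifier in pairs:
        by_identifier[identifier].append((item, identifier))
    sample = []
    for rows in by_identifier.values():
        k = max(1, round(len(rows) * fraction))
        sample.extend(rng.sample(rows, k))
    return sample


def jaccard_distances(pairs, identifiers):
    """Return all pair-wise identifier-identifier Jaccard distances as a flat vector."""
    # One set of items per identifier: equivalent to a column of a binary matrix.
    items_by_identifier = {p: set() for p in identifiers}
    for item, identifier in pairs:
        items_by_identifier[identifier].add(item)
    distances = []
    for a, b in combinations(identifiers, 2):
        union = items_by_identifier[a] | items_by_identifier[b]
        shared = items_by_identifier[a] & items_by_identifier[b]
        # Convention chosen here: two empty identifiers get distance 0.
        distances.append((1.0 - len(shared) / len(union)) if union else 0.0)
    return distances


def pearson(x, y):
    """Pearson correlation of two equal-length vectors (nan if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical toy data: four identifiers with different usage rates.
    usage = {"P1": 0.6, "P2": 0.4, "P3": 0.1, "P4": 0.02}
    identifiers = sorted(usage)
    pairs = [(f"Q{i}", p) for i in range(2000) for p in identifiers
             if rng.random() < usage[p]]
    vectors = [jaccard_distances(proportional_sample(pairs, 0.2, rng), identifiers)
               for _ in range(3)]
    for (i, u), (j, v) in combinations(enumerate(vectors, start=1), 2):
        print(f"sample{i} vs sample{j}: r = {pearson(u, v):.4f}")
```

Keeping a set of items per identifier avoids materialising the full items x identifiers matrix, which is the step that the comment reports as infeasible at full scale.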
_______________________________________________ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs