GoranSMilovanovic added a comment.
@RazShuty @Lydia_Pintscher @JAllemandou
Our approach here will be to use an approx. 1M sized sample of WD items to
produce the identifier x identifier Jaccard distance matrix.
- Rationale:
- the dataset as produced in Spark has 250M rows x two columns
(item-identifier pairs);
- the desired binary contingency matrix to compute the Jaccard distances is
of approx. dimension 26M x 1000K+;
- due to internal constraints, Spark `stat.crosstab()` cannot produce the
binary contingency matrix that we need to compute the Jaccard distances;
- while R {data.table} can manage the dataset, it still cannot produce the
desired contingency matrix;
- moreover, even if we could have the contingency matrix produced in an
efficient manner, it is questionable what procedure could deliver the Jaccard
distances efficiently.
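
For reference, here is a minimal Python sketch of what the Jaccard distance between two identifiers means in terms of the item-identifier pairs. It is only an illustration on toy data, not the Spark or R code used for this task: the function name `jaccard_distances` and the item and property IDs are assumptions made for the example. Each identifier is represented by the set of items that carry it, which holds the same information as one column of the binary contingency matrix, and the distance is 1 - |A ∩ B| / |A ∪ B|.

```python
from collections import defaultdict
from itertools import combinations


def jaccard_distances(pairs):
    """Return {(id_a, id_b): distance} for every pair of identifiers.

    `pairs` is an iterable of (item, identifier) tuples, i.e. the
    two-column layout of the dataset. Hypothetical helper, for
    illustration only.
    """
    # One set of items per identifier: equivalent to one column of the
    # binary contingency matrix, stored sparsely.
    items_by_identifier = defaultdict(set)
    for item, identifier in pairs:
        items_by_identifier[identifier].add(item)

    distances = {}
    for a, b in combinations(sorted(items_by_identifier), 2):
        items_a, items_b = items_by_identifier[a], items_by_identifier[b]
        shared = len(items_a & items_b)
        union = len(items_a) + len(items_b) - shared
        # Every identifier has at least one item, so union is never 0.
        distances[(a, b)] = 1.0 - shared / union
    return distances


# Toy data: the item and property IDs are illustrative only.
toy_pairs = [
    ("Q1", "P214"), ("Q1", "P227"),
    ("Q2", "P214"), ("Q2", "P227"),
    ("Q3", "P214"),
    ("Q4", "P496"),
]
toy_distances = jaccard_distances(toy_pairs)
```

In the toy data, P214 and P227 share two of three items, so their distance is 1/3, while P496 shares no item with either and sits at distance 1. The loop over all identifier pairs is quadratic in the number of identifiers, so this shows the definition rather than a way around the scaling problem described above.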
The results of the following experiment testify that we can safely
proceed with sampling:
- Take 10 random samples from the ~250M item x identifier pairs
- by sampling identifiers proportionally (i.e. compute p(identifier), weight
the identifier sample appropriately)
- and including one observation for each identifier with p = 0 (due to
rounding, not due to the absence of the identifier);
- for each sample, produce a binary contingency matrix;
- from each contingency matrix compute all pair-wise identifier-identifier
Jaccard distances, store as vector;
- compute Pearson correlation coefficients between the distance vectors
obtained from the 10 random samples.
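
The sampling and comparison steps above can be sketched in Python as follows. This is one possible reading of the procedure, not the code that was run: the helper names `allocate`, `draw_sample` and `pearson` are hypothetical, and the sketch assumes that "one observation for each identifier with p = 0" means raising every rounded-down quota to at least 1. The distance computation itself is left out; `pearson` takes two distance vectors that list the identifier pairs in the same order.

```python
import math
import random
from collections import Counter, defaultdict


def allocate(pairs, sample_size):
    """Per-identifier draw counts, proportional to p(identifier), minimum 1."""
    counts = Counter(identifier for _, identifier in pairs)
    total = sum(counts.values())
    # Assumption: an identifier whose quota rounds to 0 still gets 1 draw.
    return {
        identifier: max(1, round(sample_size * n / total))
        for identifier, n in counts.items()
    }


def draw_sample(pairs, sample_size, rng):
    """Draw one proportional sample of (item, identifier) pairs."""
    by_identifier = defaultdict(list)
    for item, identifier in pairs:
        by_identifier[identifier].append(item)
    sample = []
    for identifier, k in allocate(pairs, sample_size).items():
        items = by_identifier[identifier]
        # Sample without replacement, capped at the items available.
        for item in rng.sample(items, min(k, len(items))):
            sample.append((item, identifier))
    return sample


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy data: one frequent identifier "A" and two rare ones, "B" and "C".
toy_pairs = [(f"Q{i}", "A") for i in range(98)] + [("Q1", "B"), ("Q2", "C")]
toy_quota = allocate(toy_pairs, 10)
toy_sample = draw_sample(toy_pairs, 10, random.Random(0))
```

Because of the minimum of one draw per identifier, the realized sample can be slightly larger than the requested size (12 pairs instead of 10 in the toy data).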
Here's the correlation matrix; with every pairwise correlation at 0.9989 or
higher, ~1M sized proportional random samples of item-property pairs are
quite representative of the approx. 26M item-property pairs dataset:
| | sample1 | sample2 | sample3 | sample4 | sample5 | sample6 | sample7 | sample8 | sample9 | sample10 |
| sample1 | 1 | 0.9992 | 0.9992 | 0.9989 | 0.9991 | 0.9991 | 0.9992 | 0.9992 | 0.999 | 0.9992 |
| sample2 | 0.9992 | 1 | 0.9995 | 0.9992 | 0.9996 | 0.9992 | 0.9994 | 0.9995 | 0.9991 | 0.9994 |
| sample3 | 0.9992 | 0.9995 | 1 | 0.9993 | 0.9996 | 0.9993 | 0.9994 | 0.9996 | 0.9992 | 0.9994 |
| sample4 | 0.9989 | 0.9992 | 0.9993 | 1 | 0.9992 | 0.9991 | 0.9992 | 0.9991 | 0.999 | 0.9991 |
| sample5 | 0.9991 | 0.9996 | 0.9996 | 0.9992 | 1 | 0.9994 | 0.9994 | 0.9995 | 0.9992 | 0.9994 |
| sample6 | 0.9991 | 0.9992 | 0.9993 | 0.9991 | 0.9994 | 1 | 0.9993 | 0.9992 | 0.9991 | 0.9991 |
| sample7 | 0.9992 | 0.9994 | 0.9994 | 0.9992 | 0.9994 | 0.9993 | 1 | 0.9994 | 0.9992 | 0.9994 |
| sample8 | 0.9992 | 0.9995 | 0.9996 | 0.9991 | 0.9995 | 0.9992 | 0.9994 | 1 | 0.9992 | 0.9995 |
| sample9 | 0.999 | 0.9991 | 0.9992 | 0.999 | 0.9992 | 0.9991 | 0.9992 | 0.9992 | 1 | 0.9991 |
| sample10 | 0.9992 | 0.9994 | 0.9994 | 0.9991 | 0.9994 | 0.9991 | 0.9994 | 0.9995 | 0.9991 | 1 |
- Next steps: (1) proceed to produce the dataset; (2) resolve this ticket and
proceed to visualization: T204440 <https://phabricator.wikimedia.org/T204440>.
TASK DETAIL
https://phabricator.wikimedia.org/T214897
To: GoranSMilovanovic
Cc: RazShuty, Addshore, JAllemandou, Aklapper, GoranSMilovanovic,
Lydia_Pintscher, alaa_wmde, Nandana, Lahi, Gq86, QZanden, LawExplorer, _jensen,
rosalieper, Wikidata-bugs, aude, Mbch331