GoranSMilovanovic added a comment.
- UNESCO and Ethnologue Language Status: **solved**.
- Number of speakers: **solved**.
TASK DETAIL
https://phabricator.wikimedia.org/T223118
EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/
To: GoranSMilovanovic
GoranSMilovanovic added a comment.
- Script variants: **solved**.
To: GoranSMilovanovic
Cc: Aklapper, Lydia_Pintscher, RazShuty, GoranSMilovanovic,
GoranSMilovanovic added a comment.
- the Jaccard similarity and distance matrices: testing; the procedure is
memory-efficient but slow (subsetting the `dgCMatrix` class matrix...):
- **DONE.** We can have the Jaccard distances here too.
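
A minimal sketch of the idea behind the Jaccard similarity and distance matrices, computed one batch of rows at a time so that only a block of the full matrix is built per step. This is a plain-Python illustration, not the R procedure used in this task: the toy data, the set-per-language representation (standing in for a sparse `dgCMatrix` column), and the names `jaccard_similarity`, `batched`, and `jaccard_in_batches` are all assumptions made for the example.

```python
from itertools import islice


def jaccard_similarity(a, b):
    """Jaccard similarity of two sets: |A & B| / |A | B|."""
    union = len(a | b)
    # Assumed convention: two empty sets get similarity 0.0.
    return len(a & b) / union if union else 0.0


def batched(names, size):
    """Yield consecutive lists of at most `size` names."""
    it = iter(names)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def jaccard_in_batches(items_by_language, batch_size=2):
    """Yield blocks of rows of the similarity matrix, one batch at a time."""
    names = sorted(items_by_language)
    for batch in batched(names, batch_size):
        yield {
            row: {
                col: jaccard_similarity(items_by_language[row],
                                        items_by_language[col])
                for col in names
            }
            for row in batch
        }


# Hypothetical toy data: language code -> set of item ids that use it.
toy = {"en": {1, 2, 3, 4}, "de": {2, 3, 4}, "sr": {4, 5}}

# Assemble the full similarity matrix from the row blocks.
similarity = {}
for block in jaccard_in_batches(toy, batch_size=2):
    similarity.update(block)

# Jaccard distance is 1 minus Jaccard similarity.
distance = {
    row: {col: 1.0 - s for col, s in cols.items()}
    for row, cols in similarity.items()
}
```

In a real run each block would be written out or reduced before the next one is computed; here the blocks are merged only to keep the example short.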
GoranSMilovanovic added a comment.
- Batch processing over sparse matrices (`dgCMatrix` class) is now employed
to compute
- the co-occurrence data set: **success**, using approximately an order of
magnitude fewer resources than the previously employed procedure, and
- the Jaccard similarity and
GoranSMilovanovic added a comment.
- given how often `stat1007` is used by analysts,
- it barely has the resources for the computations that we need here (the
languages x languages contingency table takes at least ~25 GB to compute);
- a fail-safe, batch processing procedure to compute