[Wikidata-bugs] [Maniphest] [Commented On] T223118: WD Languages Landscape: fundamental data sets
GoranSMilovanovic added a comment.

- UNESCO and Ethnologue Language Status: **solved**.
- Number of speakers: **solved**.

TASK DETAIL
  https://phabricator.wikimedia.org/T223118

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: GoranSMilovanovic
Cc: Aklapper, Lydia_Pintscher, RazShuty, GoranSMilovanovic, darthmon_wmde, DannyS712, Nandana, Lahi, Gq86, QZanden, LawExplorer, _jensen, rosalieper, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] [Commented On] T223118: WD Languages Landscape: fundamental data sets
GoranSMilovanovic added a comment.

- Script variants: **solved**.
[Wikidata-bugs] [Maniphest] [Commented On] T223118: WD Languages Landscape: fundamental data sets
GoranSMilovanovic added a comment.

- the Jaccard similarity and distance matrices: testing, the procedure is memory efficient but slow (subsetting the `dgCMatrix` class matrix...):
  - **DONE.** We can have the Jaccard distances here too.
[Wikidata-bugs] [Maniphest] [Commented On] T223118: WD Languages Landscape: fundamental data sets
GoranSMilovanovic added a comment.

- Batch processing over sparse matrices (`dgCMatrix` class) is now employed to compute
  - the co-occurrence data set: **success**, using approximately an order of magnitude fewer resources than the previously employed procedure, and
  - the Jaccard similarity and distance matrices: **testing**, the procedure is memory efficient but slow (subsetting the `dgCMatrix` class matrix...).
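
For readers following along, here is a minimal sketch of what batch processing over a `dgCMatrix` could look like. It is not the code used for this task. The toy items x languages incidence matrix, the function name `batched_cooccurrence`, and the `batch_size` parameter are all assumptions made for illustration. The sketch accumulates the languages x languages co-occurrence counts from row batches, then derives the Jaccard similarity and distance matrices from those counts.

```r
library(Matrix)

# Hypothetical toy data: rows are items, columns are languages,
# and a 1 means the item has a label in that language.
m <- sparseMatrix(
  i = c(1, 2, 3, 1, 2, 2, 3, 4),
  j = c(1, 1, 1, 2, 2, 3, 3, 3),
  x = rep(1, 8),
  dims = c(4, 3),
  dimnames = list(paste0("Q", 1:4), c("en", "de", "sr"))
)

# Accumulate the languages x languages co-occurrence counts batch by batch.
# Only one batch of rows is multiplied at a time; the accumulator is a small
# dense matrix because the number of languages is modest.
batched_cooccurrence <- function(m, batch_size = 2L) {
  n_lang <- ncol(m)
  cooc <- matrix(0, n_lang, n_lang,
                 dimnames = list(colnames(m), colnames(m)))
  for (s in seq(1L, nrow(m), by = batch_size)) {
    rows <- s:min(s + batch_size - 1L, nrow(m))
    batch <- m[rows, , drop = FALSE]
    # crossprod(batch) is t(batch) %*% batch: shared items per language pair.
    cooc <- cooc + as.matrix(crossprod(batch))
  }
  cooc
}

cooc <- batched_cooccurrence(m)

# Jaccard similarity: |A and B| / |A or B|, with |A or B| = |A| + |B| - |A and B|.
# A language with no items at all would yield NaN here.
sizes <- diag(cooc)
jaccard_sim <- cooc / (outer(sizes, sizes, "+") - cooc)
jaccard_dist <- 1 - jaccard_sim
```

In this toy case `en` and `de` share 2 items out of 3 in their union, so their similarity is 2/3. Note that the sketch takes row subsets of a column-compressed matrix, which is a comparatively expensive operation in this storage format.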
[Wikidata-bugs] [Maniphest] [Commented On] T223118: WD Languages Landscape: fundamental data sets
GoranSMilovanovic added a comment.

- Given how often `stat1007` is used by analysts, it barely has the resources for the computations that we need here (the languages x languages contingency table takes at least ~25 GB to compute).
- A fail-safe batch-processing procedure to compute large contingency matrices in R will be developed.
- It will rely on `base` and/or `data.table` R functions, but it will be less demanding in terms of memory resources.
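
As an illustration of the idea only, a batched contingency computation with `data.table` might be sketched as follows. The long-format input table `pairs`, its column names, the function name `batched_contingency`, and the `batch_size` parameter are assumptions, not the procedure developed for this task. The fail-safe part (for example, checkpointing or retrying a failed batch) is not covered here.

```r
library(data.table)

# Hypothetical long-format input: one row per (item, language) pair.
pairs <- data.table(
  item     = c("Q1", "Q1", "Q2", "Q2", "Q2", "Q3", "Q3", "Q4"),
  language = c("en", "de", "en", "de", "sr", "en", "sr", "sr")
)

batched_contingency <- function(dt, batch_size = 2L) {
  items <- unique(dt$item)
  # Split the items into batches so only one batch is self-joined at a time.
  batches <- split(items, ceiling(seq_along(items) / batch_size))
  partial <- lapply(batches, function(batch_items) {
    chunk <- dt[item %in% batch_items]
    # Self-join within the batch: every pair of languages sharing an item.
    joined <- merge(chunk, chunk, by = "item", allow.cartesian = TRUE)
    joined[, .(n = .N), by = .(language.x, language.y)]
  })
  # Sum the per-batch counts into one long-format table of pair counts.
  rbindlist(partial)[, .(n = sum(n)), by = .(language.x, language.y)]
}

counts <- batched_contingency(pairs)

# Reshape the long counts into a languages x languages table.
contingency <- dcast(counts, language.x ~ language.y,
                     value.var = "n", fill = 0L)
```

Peak memory is then driven by the largest single batch rather than by the full self-join, at the cost of more passes over the data.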