Menemenetekelufarsim added a comment.
Hello, I read that you were interested in corpora other than Wikipedia. I think Swedish Wikipedia is a skewed source, since so many of its articles were started by bots, and the frequency of odd formulations remains high even after they are manually cleaned up. The Swedish Gigaword Corpus contains one billion words from 1950-2015, analyzed with NLP tools and stored in XML format: https://spraakbanken.gu.se/en/resources/gigaword

A presentation: http://www.ep.liu.se/ecp/126/002/ecp16126002.pdf

The license is CC-BY, which is incompatible with Wikidata's CC0 licensing. But, as with the Leipzig Corpora Collection, it would still be possible to extract missing word forms from it.

TASK DETAIL
  https://phabricator.wikimedia.org/T273221
