Menemenetekelufarsim added a comment.

  Hello, I read that you were interested in corpora other than Wikipedia. I 
think that Swedish Wikipedia is a skewed source, since so many articles are 
started by robots and the frequency of odd formulations remains high even after 
they are manually cleaned up. The Swedish Gigaword Corpus contains one billion 
words from 1950-2015, analyzed with NLP and stored in XML format: 
https://spraakbanken.gu.se/en/resources/gigaword 
  A presentation: http://www.ep.liu.se/ecp/126/002/ecp16126002.pdf
  
  The license is CC-BY, which is incompatible with Wikidata's CC0 license. But, 
as with the Leipzig Corpora Collection, it would still be possible to extract 
missing word forms.

TASK DETAIL
  https://phabricator.wikimedia.org/T273221

To: Menemenetekelufarsim
Cc: Menemenetekelufarsim, ArthurPSmith, Scott_WUaS, Quiddity, Jdforrester-WMF, 
Invadibot, maantietaja, NavinRizwi, Akuckartz, Nandana, Lahi, Gq86, 
GoranSMilovanovic, Mahir256, QZanden, LawExplorer, _jensen, rosalieper, 
Bodhisattwa, Nikki, VIGNERON, Wikidata-bugs, aude, Dinoguy1000, 
Lydia_Pintscher, Mbch331