[Wikidata-bugs] [Maniphest] T273221: Measure and indicate Lexeme language completeness, and prompt editors with what more might need doing

Menemenetekelufarsim Thu, 22 Apr 2021 08:46:30 -0700

Menemenetekelufarsim added a comment.


  Hello, I read that you were interested in corpora other than Wikipedia. I 
think that Swedish Wikipedia is a skewed source since so many articles are 
started by robots, and the frequency of odd formulations remains high even after 
they are manually cleaned up. The Swedish Gigaword Corpus contains one billion 
words from 1950-2015 analyzed with NLP and stored in XML format: 
https://spraakbanken.gu.se/en/resources/gigaword 
  A presentation: http://www.ep.liu.se/ecp/126/002/ecp16126002.pdf
  
  The license is CC-BY, which is incompatible with Wikidata's CC0 license. But, 
as with the Leipzig Corpora Collection, it would still be possible to use the 
corpus to identify missing word forms.
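
  As a rough illustration of what extracting missing word forms could look 
like, here is a minimal Python sketch. It assumes the corpus has already been 
reduced to a plain list of tokens and that the forms already present as Lexeme 
forms are available as a set of strings. The function name missing_forms, its 
parameters and the sample data are hypothetical, not part of any existing tool, 
and nothing here reads the actual Gigaword XML.

```python
from collections import Counter


def missing_forms(corpus_tokens, known_forms, min_count=2):
    """Return (token, count) pairs for tokens not in known_forms.

    Hypothetical helper: corpus_tokens is any iterable of strings and
    known_forms is an iterable of forms that already exist as Lexeme forms.
    Results are sorted by descending count, then alphabetically.
    """
    # Count alphabetic tokens only, case-insensitively.
    counts = Counter(t.lower() for t in corpus_tokens if t.isalpha())
    known = {f.lower() for f in known_forms}
    missing = [
        (token, n)
        for token, n in counts.items()
        if token not in known and n >= min_count
    ]
    return sorted(missing, key=lambda pair: (-pair[1], pair[0]))


# Made-up sample data, for illustration only.
tokens = "hunden jagar katten och katten jagar musen".split()
known = {"hunden", "och"}
for token, count in missing_forms(tokens, known):
    print(token, count)
```

  Only the resulting list of candidate forms would be used, so no corpus text 
would need to be copied. The min_count threshold is an arbitrary assumption 
meant to filter out rare typos.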

TASK DETAIL
  https://phabricator.wikimedia.org/T273221

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Menemenetekelufarsim
Cc: Menemenetekelufarsim, ArthurPSmith, Scott_WUaS, Quiddity, Jdforrester-WMF, 
Invadibot, maantietaja, NavinRizwi, Akuckartz, Nandana, Lahi, Gq86, 
GoranSMilovanovic, Mahir256, QZanden, LawExplorer, _jensen, rosalieper, 
Bodhisattwa, Nikki, VIGNERON, Wikidata-bugs, aude, Dinoguy1000, 
Lydia_Pintscher, Mbch331

_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
