There's also Wikispeech <https://meta.wikimedia.org/wiki/Wikispeech>, a TTS tool that we (Wikimedia Sverige) are developing. It's currently on the back burner, but hopefully we will have more resources for development soon.
*Sebastian Berlin*
Utvecklare/*Developer*
Wikimedia Sverige (WMSE)
E-post/*E-Mail*: sebastian.ber...@wikimedia.se
Telefon/*Phone*: (+46) 0707 - 92 03 84

On Fri, 24 Jun 2022 at 03:06, Gabriel Altay <gabriel.al...@gmail.com> wrote:

> Hello Ilario,
>
> You might find this blog post I wrote a while back interesting:
> https://blog.kensho.com/announcing-the-kensho-derived-wikimedia-dataset-5d1197d72bcf
>
> In it you can find a brief (and definitely not comprehensive) review of NLP with Wiki*, along with links to an open-source Kaggle dataset I built connecting the plain text of Wikipedia, the anchor links between pages, and the links to Wikidata. There are a few notebooks that demonstrate its use ... my favorites are probably:
>
> * Pointwise Mutual Information embeddings (a rough sketch of the idea is at the end of this message)
> https://www.kaggle.com/code/kenshoresearch/kdwd-pmi-word-vectors
> * Analyzing the "subclass of" graph from Wikidata
> https://www.kaggle.com/code/gabrielaltay/kdwd-subclass-path-ner
> * Explicit topic modeling
> https://www.kaggle.com/code/kenshoresearch/kdwd-explicit-topic-models
>
> and if you are still looking for more after that, this is the query that gives more every time you use it :)
>
> https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=wikipedia&terms-0-field=abstract&terms-1-operator=OR&terms-1-term=wikidata&terms-1-field=abstract&classification-computer_science=y&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first
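> For intuition, the PMI (pointwise mutual information) behind that first notebook just compares how often two words actually co-occur with how often they would co-occur by chance. A toy, self-contained sketch of the idea (a made-up three-"article" corpus, not the notebook's actual code):

    from collections import Counter
    from itertools import combinations
    from math import log

    # Toy stand-in for Wikipedia plain text; the real notebook works over the KDWD data.
    docs = [
        "anarchism is a political philosophy",
        "political philosophy studies government",
        "the philosophy of science studies knowledge",
    ]

    tokenized = [doc.split() for doc in docs]
    word_counts = Counter(w for doc in tokenized for w in doc)
    pair_counts = Counter(
        tuple(sorted(p)) for doc in tokenized for p in combinations(set(doc), 2)
    )
    total_words = sum(word_counts.values())
    total_pairs = sum(pair_counts.values())

    def pmi(w1: str, w2: str) -> float:
        """log[ p(w1, w2) / (p(w1) * p(w2)) ]: positive means the pair
        co-occurs more often than the individual word frequencies predict."""
        p_joint = pair_counts[tuple(sorted((w1, w2)))] / total_pairs
        if p_joint == 0:
            return float("-inf")
        p1 = word_counts[w1] / total_words
        p2 = word_counts[w2] / total_words
        return log(p_joint / (p1 * p2))

    print(pmi("political", "philosophy"))  # positive: they always appear together here
    print(pmi("anarchism", "government"))  # -inf: never co-occur in this toy corpus

> The notebook builds on the same quantity, just computed at Wikipedia scale, to derive the word vectors.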
> best,
> -G

> On Thu, Jun 23, 2022 at 4:17 PM Isaac Johnson <is...@wikimedia.org> wrote:

>> Chiming in as a member of the Wikimedia Foundation Research team <https://research.wikimedia.org/> (so you'll see that likely biases the examples I'm aware of). I'd say that the most common type of NLP that shows up in our applications is tokenization / language analysis -- i.e., splitting wikitext into words/sentences. As Trey said, this tokenization is non-trivial for English and gets much harder in other languages that have more complex constructions or don't use spaces to delimit words <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Spaceless_Writing_Systems_and_Wiki-Projects>. These tokens often then become inputs into other types of models that aren't necessarily NLP. There are also a number of more complex NLP technologies that don't just identify words but try to identify similarities between them, translate them, etc.
>>
>> Some examples below. Additionally, I indicated whether each application is rule-based (follows a series of deterministic heuristics) or ML (a learned, probabilistic model) in case that's of interest:
>>
>> - Copyedit <https://meta.wikimedia.org/wiki/Research:Copyediting_as_a_structured_task>: identifying potential grammar/spelling issues in articles (rule-based). I believe there are a number of volunteer-run bots on the wikis as well as the under-development tool I linked to, which is a collaboration between the Wikimedia Foundation Research team <https://research.wikimedia.org/> and Growth team <https://www.mediawiki.org/wiki/Growth> that builds on an open-source tool <https://meta.wikimedia.org/wiki/Research:Copyediting_as_a_structured_task/LanguageTool>.
>> - Link recommendation <https://www.mediawiki.org/wiki/Growth/Personalized_first_day/Structured_tasks/Add_a_link#Link_recommendation_algorithm>: detecting links that could be added to Wikipedia articles. The NLP aspect mainly involves accurately parsing wikitext into sentences/words (rule-based) and comparing the similarity of the source article and pages that are potential link targets (ML). Also a collaboration between the Research team and the Growth team.
>> - Content similarity: various tools such as SuggestBot <https://en.wikipedia.org/wiki/User:SuggestBot>, the RelatedArticles Extension <https://www.mediawiki.org/wiki/Extension:RelatedArticles>, or GapFinder <https://www.mediawiki.org/wiki/GapFinder> use the morelike functionality of the CirrusSearch <https://www.mediawiki.org/wiki/Help:CirrusSearch#Page_weighting> backend maintained by the Search team to find Wikipedia articles with similar topics -- this is largely finding keyword overlap between content, with clever pre-processing/weighting as described by Trey.
>> - Readability <https://meta.wikimedia.org/wiki/Research:Multilingual_Readability_Research>: scoring content based on its readability. Under development by the Research team.
>> - Topic classification: predicting what high-level topics are associated with Wikipedia articles. The current model for English Wikipedia <https://www.mediawiki.org/wiki/ORES#Topic_routing> uses word embeddings from the article to make predictions (ML), and a proposed model <https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language_agnostic_link-based_article_topic_model_card> from the Research team will use NLP models but with article links instead, to support more (all) language editions.
>> - Citation needed <https://meta.wikimedia.org/wiki/Citation_Detective>: detecting sentences in need of citations (ML). Prototype developed by the Research team.
>> - Edit Types <https://meta.wikimedia.org/wiki/Research:Wikipedia_Edit_Types>: summarizing how much text changed between two revisions of a Wikipedia article -- e.g., how many words/sentences changed (rule-based). Prototype developed by the Research team.
>> - Vandalism detection: the various approaches in use on the wikis generally have some form of "bad word" list (usually a mix of automatically and manually generated entries), extract words from new edits, compare those words to the list, and use this to help judge how likely the edit is to be vandalism (a rough sketch of that idea follows this list). Examples include many filters in AbuseFilter <https://www.mediawiki.org/wiki/Extension:AbuseFilter>, volunteer-led efforts such as ClueBot NG <https://en.wikipedia.org/wiki/User:ClueBot_NG#Bayesian_Classifiers> (English Wikipedia) and Salebot <https://fr.wikipedia.org/wiki/Utilisateur:Salebot> (French Wikipedia), as well as the Wikimedia Foundation ORES edit quality models <https://www.mediawiki.org/wiki/ORES/BWDS_review> (many wikis).
>> - Sockpuppet detection <https://www.mediawiki.org/wiki/User:Ladsgroup/masz>: finding editors who have similar stylistic patterns in their comments (volunteer tool).
>> - Content Translation was mentioned -- there are numerous potential translation models available <https://www.mediawiki.org/wiki/Content_translation/Machine_Translation/MT_Clients#Machine_translation_clients>, of which some are rule-based and some are ML. The tool is maintained by the Wikimedia Foundation Language team <https://www.mediawiki.org/wiki/Wikimedia_Language_engineering> but depends on several external APIs.
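>> To make the "bad word list" idea concrete, here is a deliberately tiny, illustrative sketch (the hand-picked word list and the scoring are made up for this example; AbuseFilter rules, ClueBot NG, and the ORES models are all far more sophisticated):

    import re

    # Hypothetical hand-picked list; real systems mix manually curated and
    # automatically derived terms.
    BAD_WORDS = {"stupid", "idiot", "poop"}

    def added_words(old_text: str, new_text: str) -> set:
        """Very rough proxy for 'words added by the edit'."""
        tokenize = lambda t: set(re.findall(r"[a-z']+", t.lower()))
        return tokenize(new_text) - tokenize(old_text)

    def vandalism_score(old_text: str, new_text: str) -> float:
        """Fraction of newly added words that are on the bad-word list."""
        added = added_words(old_text, new_text)
        if not added:
            return 0.0
        return len(added & BAD_WORDS) / len(added)

    old = "Paris is the capital of France."
    new = "Paris is the stupid capital of France, idiot."
    print(vandalism_score(old, new))  # 1.0 -> worth flagging for review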
>> I've also done some thinking that might be of interest about what a natural language modeling strategy looks like for Wikimedia -- one that balances the effectiveness of models with the equity/sustainability of supporting so many different language communities: https://meta.wikimedia.org/wiki/User:Isaac_(WMF)/Language_modeling
>>
>> Hope that helps.
>>
>> Best,
>> Isaac

>> On Wed, Jun 22, 2022, 10:43 Trey Jones <tjo...@wikimedia.org> wrote:

>>>> Do you have examples of projects using NLP in Wikimedia communities?
>>>
>>> I do! Defining NLP is something of a moving target; the most common definition, which I learned when I worked in industry, is that "NLP" is often used as a buzzword meaning "any language processing you do that your competitors don't". Getting away from profit-driven buzzwords, I have a pretty generous definition of NLP: any software that improves language-based interactions between people and computers.
>>>
>>> Guillaume mentioned CirrusSearch in general, but there are lots of specific parts within search. I work on a lot of NLP-type stuff for search, and I write a lot of documentation on MediaWiki, so this is biased towards stuff I have worked on or know about.
>>>
>>> Language analysis is the general process of converting text (say, of Wikipedia articles) into tokens (approximately "words" in English) to be stored in the search index. There are lots of different levels of complexity in the language analysis. We currently use Elasticsearch, and they provide a lot of language-specific analysis tools (Elastic language analyzers <https://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-lang-analyzer.html>), which we customize and build on.
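>>> Conceptually, an analysis chain is a pipeline: character-level fix-ups, then tokenization, then a series of token filters. A toy sketch of that flow (illustrative only -- a crude Python imitation, not how Elasticsearch actually implements any of these stages):

    import re

    STOP_WORDS = {"the", "of", "to", "a", "an", "and", "is"}

    def char_filter(text: str) -> str:
        # word_break_helper-ish: treat some uncommon separators as spaces
        return text.replace("_", " ").replace(".", " ")

    def tokenize(text: str) -> list:
        # very rough "standard"-style tokenizer for English: letters and apostrophes
        return re.findall(r"[A-Za-z']+", text)

    def token_filters(tokens):
        out = []
        for tok in tokens:
            tok = tok.lower()                      # (crude) normalization
            tok = re.sub(r"'s$", "", tok)          # possessive handling
            if not tok or tok in STOP_WORDS:       # stop word removal / drop empties
                continue
            tok = re.sub(r"(ing|ed|s)$", "", tok)  # laughably crude stand-in for a real stemmer
            out.append(tok)
        return out

    def analyze(text: str):
        return token_filters(tokenize(char_filter(text)))

    print(analyze("The Hobbit's unexpected_journey to the Misty Mountains"))
    # ['hobbit', 'unexpect', 'journey', 'misty', 'mountain']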
>>> Here is part of the config for English, reordered to be chronological rather than alphabetical, and annotated:

    "text": {
        "type": "custom",
        "char_filter": [
            "word_break_helper",  — break_up.words:with(uncommon)separators
            "kana_map"            — map Japanese Hiragana to Katakana (notes <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Hiragana_to_Katakana_Mapping_for_English_and_Japanese>)
        ],
        "tokenizer": "standard",  — break text into tokens/words; not trivial for English, very hard for other languages (blog post <https://wikimediafoundation.org/news/2018/08/07/anatomy-search-token-affection/>)
        "filter": [
            "aggressive_splitting",  — splitting of more likely *multi-part* *ComplexTokens*
            "homoglyph_norm",        — correct typos/vandalism which mix Latin and Cyrillic letters (notes <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Homoglyphs>)
            "possessive_english",    — special processing for *English's* possessive forms
            "icu_normalizer",        — normalization of text (blog post <https://wikimediafoundation.org/news/2018/09/13/anatomy-search-variation-under-nature/>)
            "stop",                  — removal of stop words (blog post <https://wikimediafoundation.org/news/2018/11/28/anatomy-search-the-root-of-the-problem/>, section "To be or not to be indexed")
            "icu_folding",           — more aggressive normalization
            "remove_empty",          — misc bookkeeping
            "kstem",                 — stemming (blog post <https://wikimediafoundation.org/news/2018/11/28/anatomy-search-the-root-of-the-problem/>)
            "custom_stem"            — more stemming
        ]
    },

>>> Tokenization, normalization, and stemming can vary wildly between languages. Some other elements (from Elasticsearch or custom-built by us):
>>>
>>> - Stemmers and stop words for specific languages, including some open-source ones that we ported and some developed with community help
>>> - Elision processing (*l'homme* == *homme*)
>>> - Normalization for digits (١ ٢ ٣ / १ २ ३ / ①②③ / 123)
>>> - Custom lowercasing — Greek, Irish, and Turkish have special processing (notes <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language-Specific_Lowercasing_and_ICU_Normalization>)
>>> - Normalization of written Khmer (blog post <https://techblog.wikimedia.org/2020/06/02/permuting-khmer-restructuring-khmer-syllables-for-search/>)
>>> - Notes on lots more <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#Elasticsearch_Analysis_Chain_Analysis> ...
>>>
>>> We also did some work improving "Did you mean" suggestions, which currently use both the built-in suggestions from Elasticsearch (not always great, but there are lots of them) and new suggestions from a module we called "Glent" <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#Glent_%22Did_You_Mean%22_Suggestions> (much better, but not as many suggestions).
>>>
>>> We have some custom language detection available on some Wikipedias, so that if you don't get very many results and your query looks like it is in another language, we show results from that other language. For example, searching for Том Хэнкс on English Wikipedia <https://en.wikipedia.org/w/index.php?search=%D0%A2%D0%BE%D0%BC+%D0%A5%D1%8D%D0%BD%D0%BA%D1%81&ns0=1> will show results from Russian Wikipedia. (too many notes <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#TextCat,_Language_ID,_Etc.>)
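>>> (The crudest first cut at "your query looks like another language" is just checking which writing system the characters belong to -- a toy sketch below. The TextCat-style identification linked above goes further, comparing the query against ranked character n-gram frequency profiles per language, which is what lets it separate languages that share a script.)

    import unicodedata

    def script_counts(text: str) -> dict:
        """Count alphabetic characters by writing system, via Unicode names."""
        counts = {}
        for ch in text:
            if not ch.isalpha():
                continue
            name = unicodedata.name(ch, "")
            script = name.split()[0] if name else "UNKNOWN"  # e.g. "LATIN", "CYRILLIC"
            counts[script] = counts.get(script, 0) + 1
        return counts

    print(script_counts("Том Хэнкс"))         # {'CYRILLIC': 8}
    print(script_counts("Tom Hanks movies"))  # {'LATIN': 14}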
>>> Outside of our search work, there are lots more. Some that come to mind:
>>>
>>> - Language Converter supports languages with multiple writing systems, which is sometimes easy and sometimes really hard. (blog post <https://diff.wikimedia.org/2018/03/12/supporting-languages-multiple-writing-systems/>)
>>> - There's a Wikidata gadget on French Wikipedia and others that appends results from Wikidata and generates descriptions in various languages based on the Wikidata information. For example, searching for Molenstraat Vught on French Wikipedia <https://fr.wikipedia.org/w/index.php?search=Molenstraat+Vught&title=Sp%C3%A9cial:Recherche&profile=advanced&fulltext=1&ns0=1> gives no local results, but shows two "Results from Wikidata" / "Résultats sur Wikidata" (if you are logged in you get results in your preferred language, if possible; otherwise the language of the project):
>>>   - Molenstraat ; hameau de la commune de Vught (in French, when I'm not logged in)
>>>   - Molenstraat ; street in Vught, the Netherlands (fallback to English for some reason)
>>> - The whole giant Content Translation project that uses machine translation to assist in translating articles across wikis. (blog post <https://wikimediafoundation.org/news/2019/01/09/you-can-now-use-google-translate-to-translate-articles-on-wikipedia/>)
>>>
>>> There's lots more out there, I'm sure—but I gotta run!
>>> —Trey
>>>
>>> Trey Jones
>>> Staff Computational Linguist, Search Platform
>>> Wikimedia Foundation
>>> UTC–4 / EDT
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/