There's also Wikispeech <https://meta.wikimedia.org/wiki/Wikispeech>, a TTS
tool that we (Wikimedia Sverige) are developing. It's currently on the back
burner, but hopefully we will have more resources for development soon.

*Sebastian Berlin*
Utvecklare/*Developer*
Wikimedia Sverige (WMSE)

E-post/*E-Mail*: sebastian.ber...@wikimedia.se
Telefon/*Phone*: (+46) 0707 - 92 03 84


On Fri, 24 Jun 2022 at 03:06, Gabriel Altay <gabriel.al...@gmail.com> wrote:

> Hello Ilario,
>
> You might find this blog post I wrote a while back interesting
>
> https://blog.kensho.com/announcing-the-kensho-derived-wikimedia-dataset-5d1197d72bcf
>
> In it you can find a brief (and definitely not comprehensive) review of
> NLP with Wiki* along with links to an open source Kaggle dataset I built
> connecting the plain text of wikipedia, the anchor links between pages, and
> the links to wikidata. There are a few notebooks that demonstrate its use
> ... my favorites are probably:
>
> * Pointwise Mutual Information embeddings (a toy PMI sketch follows below)
> https://www.kaggle.com/code/kenshoresearch/kdwd-pmi-word-vectors
> * Analyzing the "subclass of" graph from wikidata
> https://www.kaggle.com/code/gabrielaltay/kdwd-subclass-path-ner
> * Explicit topic modeling
> https://www.kaggle.com/code/kenshoresearch/kdwd-explicit-topic-models
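>
> To make the PMI idea concrete, here's a minimal toy sketch (not the
> notebook's code) that estimates PMI from co-occurrence counts on a tiny
> made-up corpus:
>
> import math
> from collections import Counter
> from itertools import combinations
>
> # PMI(w, c) = log( p(w, c) / (p(w) * p(c)) ), estimated from counts
> sentences = [
>     "the cat sat on the mat",
>     "the dog sat on the log",
>     "the cat chased the dog",
> ]
>
> word_counts = Counter()
> pair_counts = Counter()
> for sentence in sentences:
>     words = sentence.split()
>     word_counts.update(words)
>     # each unordered pair of distinct words co-occurring in the sentence
>     pair_counts.update(tuple(sorted(p)) for p in combinations(set(words), 2))
>
> total_words = sum(word_counts.values())
> total_pairs = sum(pair_counts.values())
>
> def pmi(w1, w2):
>     p_pair = pair_counts[tuple(sorted((w1, w2)))] / total_pairs
>     p_w1 = word_counts[w1] / total_words
>     p_w2 = word_counts[w2] / total_words
>     return math.log(p_pair / (p_w1 * p_w2)) if p_pair else float("-inf")
>
> print(pmi("cat", "mat"), pmi("cat", "log"))  # co-occurring pair vs. never together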
>
> and if you are still looking for more after that, this is the query that
> turns up more papers every time you run it :)
>
>
> https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=wikipedia&terms-0-field=abstract&terms-1-operator=OR&terms-1-term=wikidata&terms-1-field=abstract&classification-computer_science=y&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first
>
> best,
> -G
>
>
>
> On Thu, Jun 23, 2022 at 4:17 PM Isaac Johnson <is...@wikimedia.org> wrote:
>
>> Chiming in as a member of the Wikimedia Foundation Research team
>> <https://research.wikimedia.org/> (which likely biases the
>> examples I'm aware of). I'd say that the most common type of NLP that shows
>> up in our applications is tokenization / language analysis -- i.e. splitting
>> wikitext into words/sentences. As Trey said, this tokenization is
>> non-trivial for English and gets much harder in other languages that
>> have more complex constructions / don't use spaces to delimit words
>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Spaceless_Writing_Systems_and_Wiki-Projects>.
>> These tokens often then become inputs into other types of models that
>> aren't necessarily NLP. There are a number of more complex NLP technologies
>> too that don't just identify words but try to identify similarities between
>> them, translate them, etc.
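>>
>> As a toy illustration of why even English tokenization is not trivial (this
>> is just a sketch, not any of the production code -- abbreviations alone
>> break the naive rule):
>>
>> import re
>>
>> text = "Dr. Smith works at Acme Inc. in Berlin."
>>
>> # Naive rule: a sentence ends at a period followed by whitespace.
>> # This is one sentence, but the rule splits it into three pieces.
>> print(re.split(r"(?<=\.)\s+", text))
>> # ['Dr.', 'Smith works at Acme Inc.', 'in Berlin.']
>>
>> # A simple word tokenizer; real analyzers handle far more cases
>> # (apostrophes, hyphens, scripts that don't use spaces, ...).
>> print(re.findall(r"\w+(?:'\w+)?", text))
>> # ['Dr', 'Smith', 'works', 'at', 'Acme', 'Inc', 'in', 'Berlin']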
>>
>> Some examples below. Additionally, I indicated whether each application
>> was rule-based (follows a series of deterministic heuristics) or ML (a
>> learned, probabilistic model) in case that's of interest:
>>
>>    - Copyedit
>>    
>> <https://meta.wikimedia.org/wiki/Research:Copyediting_as_a_structured_task>:
>>    identifying potential grammar/spelling issues in articles (rule-based). I
>>    believe there are a number of volunteer-run bots on the wikis as well as
>>    the under-development tool I linked to, which is a collaboration between
>>    the Wikimedia Foundation Research team
>>    <https://research.wikimedia.org/> and Growth team
>>    <https://www.mediawiki.org/wiki/Growth> that builds on an open-source
>>    tool
>>    
>> <https://meta.wikimedia.org/wiki/Research:Copyediting_as_a_structured_task/LanguageTool>
>>    .
>>    - Link recommendation
>>    
>> <https://www.mediawiki.org/wiki/Growth/Personalized_first_day/Structured_tasks/Add_a_link#Link_recommendation_algorithm>:
>>    detecting links that could be added to Wikipedia articles. The NLP aspect
>>    mainly involves accurately parsing wikitext into sentences/words
>>    (rule-based) and comparing the similarity of the source article and pages
>>    that are potential target links (ML). Also a collaboration between the
>>    Research team and Growth team.
>>    - Content similarity: various tools such as SuggestBot
>>    <https://en.wikipedia.org/wiki/User:SuggestBot>, RelatedArticles
>>    Extension <https://www.mediawiki.org/wiki/Extension:RelatedArticles>,
>>    or GapFinder <https://www.mediawiki.org/wiki/GapFinder> use the morelike
>>    functionality of the CirrusSearch
>>    <https://www.mediawiki.org/wiki/Help:CirrusSearch#Page_weighting>
>>    backend maintained by the Search team to find Wikipedia articles with
>>    similar topics -- this is largely finding keyword overlap between content
>>    with clever pre-processing/weighting, as described by Trey (a TF-IDF-style
>>    sketch follows after this list).
>>    - Readability
>>    
>> <https://meta.wikimedia.org/wiki/Research:Multilingual_Readability_Research>:
>>    score content based on its readability. Under development by the
>>    Research team.
>>    - Topic classification: predict what high-level topics are associated
>>    with Wikipedia articles. The current model for English Wikipedia
>>    <https://www.mediawiki.org/wiki/ORES#Topic_routing> uses word
>>    embeddings from the article to make predictions (ML) and a proposed
>>    model
>>    
>> <https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language_agnostic_link-based_article_topic_model_card>
>>    from the Research team will use similar NLP models but with article links
>>    instead of words to support more (all) language editions.
>>    - Citation needed <https://meta.wikimedia.org/wiki/Citation_Detective>:
>>    detecting sentences in need of citations (ML). Prototype developed by the
>>    Research team.
>>    - Edit Types
>>    <https://meta.wikimedia.org/wiki/Research:Wikipedia_Edit_Types>:
>>    summarizing how much text changed between two revisions of a Wikipedia
>>    article -- e.g., how many words/sentences changed (rule-based; a rough diff
>>    sketch follows after this list). Prototype developed by the Research team.
>>    - Vandalism detection: a number of different approaches are in use on the
>>    wikis. Most have some form of "bad word" list (typically a mix of
>>    automatically and manually generated entries), extract the words added by a
>>    new edit, compare them against that list, and use the result to help judge
>>    how likely the edit is to be vandalism (a bare-bones sketch follows after
>>    this list). Examples include many filters in AbuseFilter
>>    <https://www.mediawiki.org/wiki/Extension:AbuseFilter>, volunteer-led
>>    efforts such as ClueBot NG
>>    <https://en.wikipedia.org/wiki/User:ClueBot_NG#Bayesian_Classifiers>
>>    (English Wikipedia) and Salebot
>>    <https://fr.wikipedia.org/wiki/Utilisateur:Salebot> (French
>>    Wikipedia) as well as the Wikimedia Foundation ORES edit quality
>>    models <https://www.mediawiki.org/wiki/ORES/BWDS_review> (many wikis).
>>    - Sockpuppet detection
>>    <https://www.mediawiki.org/wiki/User:Ladsgroup/masz>: finding editors
>>    who have similar stylistic patterns in their comments (volunteer tool).
>>    - Content Translation was mentioned -- there are numerous potential
>>    translation models available
>>    
>> <https://www.mediawiki.org/wiki/Content_translation/Machine_Translation/MT_Clients#Machine_translation_clients>,
>>    of which some are rule-based and some are ML. Tool maintained by Wikimedia
>>    Foundation Language team
>>    <https://www.mediawiki.org/wiki/Wikimedia_Language_engineering> but
>>    depends on several external APIs.
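>>
>> To make the "keyword overlap with weighting" idea above concrete, here is a
>> bare-bones TF-IDF + cosine similarity sketch. It is only an illustration of
>> the concept; morelike is implemented inside Elasticsearch/CirrusSearch and
>> differs considerably in the details:
>>
>> import math
>> from collections import Counter
>>
>> docs = {
>>     "Cat": "the cat is a small domesticated carnivorous mammal",
>>     "Dog": "the dog is a domesticated descendant of the wolf",
>>     "Paris": "paris is the capital and largest city of france",
>> }
>>
>> tokenized = {title: text.split() for title, text in docs.items()}
>> doc_freq = Counter(w for words in tokenized.values() for w in set(words))
>> n_docs = len(docs)
>>
>> def tfidf(words):
>>     tf = Counter(words)
>>     return {w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf}
>>
>> def cosine(a, b):
>>     num = sum(a[w] * b[w] for w in set(a) & set(b))
>>     norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
>>     return num / (norm(a) * norm(b)) if num else 0.0
>>
>> vectors = {title: tfidf(words) for title, words in tokenized.items()}
>> print(cosine(vectors["Cat"], vectors["Dog"]))    # share weighted keywords
>> print(cosine(vectors["Cat"], vectors["Paris"]))  # essentially no overlap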
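>>
>> The rough diff sketch promised under Edit Types: counting how many words an
>> edit inserted/removed can be approximated with a plain sequence diff (the
>> actual tool is considerably more detailed):
>>
>> import difflib
>>
>> old_rev = "Paris is the capital of France".split()
>> new_rev = "Paris is the capital and largest city of France".split()
>>
>> matcher = difflib.SequenceMatcher(a=old_rev, b=new_rev)
>> inserted = removed = 0
>> for op, i1, i2, j1, j2 in matcher.get_opcodes():
>>     if op in ("insert", "replace"):
>>         inserted += j2 - j1
>>     if op in ("delete", "replace"):
>>         removed += i2 - i1
>>
>> print({"words inserted": inserted, "words removed": removed})
>> # {'words inserted': 3, 'words removed': 0}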
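>>
>> And the bare-bones sketch promised under vandalism detection: extract the
>> words an edit adds and check them against a "bad word" list. The word list
>> here is made up, and real systems (ClueBot NG, ORES, AbuseFilter rules) are
>> far more sophisticated than this:
>>
>> import difflib
>> import re
>>
>> BAD_WORDS = {"stupid", "garbage"}  # hypothetical list
>>
>> def added_words(old_text, new_text):
>>     diff = difflib.ndiff(old_text.split(), new_text.split())
>>     return [tok[2:].lower() for tok in diff if tok.startswith("+ ")]
>>
>> def looks_like_vandalism(old_text, new_text, threshold=1):
>>     added = added_words(old_text, new_text)
>>     hits = sum(1 for w in added if re.sub(r"\W", "", w) in BAD_WORDS)
>>     return hits >= threshold
>>
>> print(looks_like_vandalism("Paris is the capital of France",
>>                            "Paris is a stupid city"))  # True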
>>
>> I've also done some thinking, which might be of interest, about what a
>> natural language modeling strategy for Wikimedia could look like -- one that
>> balances model effectiveness with the equity/sustainability of supporting so
>> many different language communities:
>> https://meta.wikimedia.org/wiki/User:Isaac_(WMF)/Language_modeling
>>
>> Hope that helps.
>>
>> Best,
>> Isaac
>>
>>
>> On Wed, Jun 22, 2022, 10:43 Trey Jones <tjo...@wikimedia.org> wrote:
>>
>>> Do you have examples of projects using NLP in Wikimedia communities.
>>>>
>>>
>>> I do! Defining NLP is something of a moving target. The most common
>>> definition I learned when I worked in industry is that "NLP" is often used as
>>> a buzzword meaning "any language processing you do that your competitors
>>> don't". Getting away from profit-driven buzzwords, I have a pretty generous
>>> definition of NLP: any software that improves language-based interactions
>>> between people and computers.
>>>
>>> Guillaume mentioned CirrusSearch in general, but there are lots of
>>> specific parts within search. I work on a lot of NLP-type stuff for search,
>>> and I write a lot of documentation on MediaWiki, so this is biased towards
>>> stuff I have worked on or know about.
>>>
>>> Language analysis is the general process of converting text (say, of
>>> Wikipedia articles) into tokens (approximately "words" in English) to be
>>> stored in the search index. There are lots of different levels of
>>> complexity in the language analysis. We currently use Elasticsearch, and
>>> they provide a lot of language-specific analysis tools (link to Elastic
>>> language analyzers
>>> <https://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-lang-analyzer.html>),
>>> which we customize and build on.
>>>
>>> Here is part of the config for English, reordered to be chronological,
>>> rather than alphabetical, and annotated:
>>>
>>> "text": {
>>>     "type": "custom",
>>>     "char_filter": [
>>>         "word_break_helper", — break_up.words:with(uncommon)separators
>>>         "kana_map" — map Japanese Hiragana to Katakana (notes
>>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Hiragana_to_Katakana_Mapping_for_English_and_Japanese>
>>> )
>>>     ],
>>>     "tokenizer": "standard" — break text into tokens/words; not trivial
>>> for English, very hard for other languages (blog post
>>> <https://wikimediafoundation.org/news/2018/08/07/anatomy-search-token-affection/>
>>> )
>>>     "filter": [
>>>         "aggressive_splitting", —splitting of more likely *multi-part*
>>> *ComplexTokens*
>>>         "homoglyph_norm", —correct typos/vandalization which mix Latin
>>> and Cyrillic letters (notes
>>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Homoglyphs>)
>>>         "possessive_english", —special processing for *English's*
>>> possessive forms
>>>         "icu_normalizer", —normalization of text (blog post
>>> <https://wikimediafoundation.org/news/2018/09/13/anatomy-search-variation-under-nature/>
>>> )
>>>         "stop", —removal of stop words (blog post
>>> <https://wikimediafoundation.org/news/2018/11/28/anatomy-search-the-root-of-the-problem/>,
>>> section "To be or not to be indexed")
>>>         "icu_folding", —more aggressive normalization
>>>         "remove_empty", —misc bookkeeping
>>>         "kstem", —stemming (blog post
>>> <https://wikimediafoundation.org/news/2018/11/28/anatomy-search-the-root-of-the-problem/>
>>> )
>>>         "custom_stem" —more stemming
>>>     ],
>>> },
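>>>
>>> (If you want to poke at a chain like this, Elasticsearch's _analyze API will
>>> show which tokens come out the other end. A rough sketch -- the URL and index
>>> name below are placeholders for a local test setup, and client/version
>>> details may differ:)
>>>
>>> import requests
>>>
>>> resp = requests.post(
>>>     "http://localhost:9200/enwiki_content/_analyze",  # placeholder index
>>>     json={"analyzer": "text", "text": "Dr. Smith's cats aren't home"},
>>> )
>>> print([t["token"] for t in resp.json()["tokens"]])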
>>>
>>> Tokenization, normalization, and stemming can vary wildly between
>>> languages. Some other elements (from Elasticsearch or custom-built by us):
>>>
>>>    - Stemmers and stop words for specific languages, including some
>>>    open-source ones that we ported, and some developed with community help.
>>>    - Elision processing (*l'homme* == *homme*)
>>>    - Normalization for digits (١ ٢ ٣ / १ २ ३ / ①②③ / 123) -- a toy sketch of
>>>    these two follows after this list
>>>    - Custom lowercasing—Greek, Irish, and Turkish have special
>>>    processing (notes
>>>    
>>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language-Specific_Lowercasing_and_ICU_Normalization>
>>>    )
>>>    - Normalization of written Khmer (blog post
>>>    
>>> <https://techblog.wikimedia.org/2020/06/02/permuting-khmer-restructuring-khmer-syllables-for-search/>
>>>    )
>>>    - Notes on lots more
>>>    
>>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#Elasticsearch_Analysis_Chain_Analysis>
>>>    ...
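>>>
>>> Elision and digit normalization are simple enough to sketch outside of
>>> Elasticsearch (toy illustration only, not the actual filters):
>>>
>>> import re
>>> import unicodedata
>>>
>>> def strip_elision(token):
>>>     # l'homme -> homme, d'abord -> abord (French-style elision)
>>>     return re.sub(r"^(?:l|d|j|t|s|m|n|qu|c)'", "", token, flags=re.IGNORECASE)
>>>
>>> def normalize_digits(text):
>>>     # map any Unicode digit (١, १, ①, ...) to its ASCII value
>>>     return "".join(
>>>         str(unicodedata.digit(ch)) if unicodedata.digit(ch, None) is not None
>>>         else ch
>>>         for ch in text
>>>     )
>>>
>>> print(strip_elision("l'homme"))          # homme
>>> print(normalize_digits("١٢٣ and ①②③"))   # 123 and 123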
>>>
>>> We also did some work improving "Did you mean" suggestions, which
>>> currently uses both the built-in suggestions from Elasticsearch (not always
>>> great, but there are lots of them) and new suggestions from a module we
>>> called "Glent
>>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#Glent_%22Did_You_Mean%22_Suggestions>"
>>> (much better, but not as many suggestions).
>>>
>>> We have some custom language detection available on some Wikipedias, so
>>> that if you don't get very many results and your query looks like it is
>>> another language, we show results from that other language. For example,
>>> searching for Том Хэнкс on English Wikipedia
>>> for Том Хэнкс on English Wikipedia
>>> <https://en.wikipedia.org/w/index.php?search=%D0%A2%D0%BE%D0%BC+%D0%A5%D1%8D%D0%BD%D0%BA%D1%81&ns0=1>
>>>  will
>>> show results from Russian Wikipedia. (too many notes
>>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#TextCat,_Language_ID,_Etc.>
>>> )
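>>>
>>> The rough idea behind that kind of query language detection (TextCat works on
>>> character n-grams) fits in a few lines; this toy version is nowhere near the
>>> real thing, and the two "training" samples are obviously far too small:
>>>
>>> from collections import Counter
>>>
>>> SAMPLES = {
>>>     "en": "the quick brown fox jumps over the lazy dog and runs away",
>>>     "ru": "быстрая коричневая лиса перепрыгивает через ленивую собаку",
>>> }
>>>
>>> def ngram_profile(text, sizes=(1, 2, 3)):
>>>     text = f" {text.lower()} "
>>>     grams = Counter()
>>>     for n in sizes:
>>>         grams.update(text[i:i + n] for i in range(len(text) - n + 1))
>>>     return grams
>>>
>>> PROFILES = {lang: ngram_profile(t) for lang, t in SAMPLES.items()}
>>>
>>> def guess_language(query):
>>>     q = ngram_profile(query)
>>>     # score = overlap between the query's n-grams and each language profile
>>>     scores = {lang: sum(min(q[g], p[g]) for g in q) for lang, p in PROFILES.items()}
>>>     return max(scores, key=scores.get)
>>>
>>> print(guess_language("Том Хэнкс"), guess_language("Tom Hanks"))  # ru en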
>>>
>>> Outside of our search work, there are lots more examples. Some that come to mind:
>>>
>>>    - Language Converter supports languages with multiple writing
>>>    systems, which is sometimes easy and sometimes really hard. (blog
>>>    post
>>>    
>>> <https://diff.wikimedia.org/2018/03/12/supporting-languages-multiple-writing-systems/>
>>>    )
>>>    - There's a Wikidata gadget on French Wikipedia and others that
>>>    appends results from Wikidata and generates descriptions in various
>>>    languages based on the Wikidata information. For example, searching for 
>>> Molenstraat
>>>    Vught on French Wikipedia
>>>    
>>> <https://fr.wikipedia.org/w/index.php?search=Molenstraat+Vught&title=Sp%C3%A9cial:Recherche&profile=advanced&fulltext=1&ns0=1>,
>>>    gives no local results, but shows two "Results from Wikidata" / 
>>> "Résultats
>>>    sur Wikidata" (if you are logged in you get results in your preferred
>>>    language, if possible, otherwise the language of the project):
>>>       - Molenstraat ; hameau de la commune de Vught (in French, when
>>>       I'm not logged in)
>>>       - Molenstraat ; street in Vught, the Netherlands (fallback to
>>>       English for some reason)
>>>    - The whole giant Content Translation project that uses machine
>>>    translation to assist translating articles across wikis. (blog post
>>>    
>>> <https://wikimediafoundation.org/news/2019/01/09/you-can-now-use-google-translate-to-translate-articles-on-wikipedia/>
>>>    )
>>>
>>> There's lots more out there, I'm sure—but I gotta run!
>>> —Trey
>>>
>>> Trey Jones
>>> Staff Computational Linguist, Search Platform
>>> Wikimedia Foundation
>>> UTC–4 / EDT
>>>
>>>
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
