In addition to the ones already mentioned, the ORES topic detection model
<https://github.com/wikimedia/drafttopic> (the one that says which
WikiProject an article belongs to; see this example
<https://ores.wikimedia.org/v3/scores/enwiki/?revids=1153214555>) has been
using word embeddings since around 2018.
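
For anyone who wants to look at that endpoint from a script, here is a
minimal sketch. The URL pattern is taken from the example link above. The
response layout (wiki -> "scores" -> revision ID -> model -> "score" ->
"probability"), the model name "articletopic", the helper names, and the
SAMPLE payload are assumptions made for illustration, not real output, so
check them against an actual response before relying on this.

```python
from urllib.parse import urlencode

ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def ores_url(wiki, revid):
    """Build a scores URL shaped like the example link in this thread."""
    return f"{ORES_BASE}/{wiki}/?{urlencode({'revids': revid})}"


def top_topics(payload, wiki, revid, model="articletopic", n=3):
    """Return the n most probable (topic, probability) pairs.

    Assumes the nested response layout described in the lead-in;
    "articletopic" as the model name is also an assumption.
    """
    probs = payload[wiki]["scores"][str(revid)][model]["score"]["probability"]
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hand-written sample in the assumed layout. Labels and numbers are invented.
SAMPLE = {
    "enwiki": {
        "scores": {
            "1153214555": {
                "articletopic": {
                    "score": {
                        "probability": {
                            "STEM.Technology": 0.9,
                            "Culture.Literature": 0.2,
                            "History_and_Society.History": 0.05,
                        }
                    }
                }
            }
        }
    }
}

if __name__ == "__main__":
    print(ores_url("enwiki", 1153214555))
    print(top_topics(SAMPLE, "enwiki", 1153214555, n=2))
```

To use it against the live service, fetch ores_url(...) with any HTTP
client, decode the JSON, and pass the result to top_topics().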

HTH

On Tue, May 9, 2023 at 22:10, Isaac Johnson <[email protected]> wrote:

> +1 to the suggestion to connect with the Search team. Also a few more
> thoughts about vector / natural-language search and its relevance to
> Wikimedia from my perspective in Research:
>
>    - The common critique of lexical / keyword-based search, and the reason
>    folks point to vector / embedding-based search, is how it handles more
>    natural-language queries (e.g., "What are the different objectives of the
>    United Nations
>    Sustainable Development Goals?" vs. "UN SDG"). The former has a lot of
>    words in it that lead to keyword overlap with less-relevant pages so
>    keyword-based search doesn't do as well. The latter is much more direct and
>    even matches an existing redirect on Wikipedia to the article on UN
>    Sustainable Development Goals, so our existing keyword-based search handles
>    it very well.
>    - Most existing users of Wikimedia's search are probably doing
>    something closer to the latter above -- i.e. using pretty exact keywords to
>    navigate to a specific page (or to check that it exists). This is backed up
>    by the data: 80% of searches on Wikipedia are auto-completed directly to
>    article pages
>    <https://upload.wikimedia.org/wikipedia/commons/8/87/Understanding_Search_Behavior_in_Wikipedia_-_Report_-_Bruno_Scarone.pdf#page=7>.
>    In that sense, the system is working quite well! The Search team also has
>    added quite a bit of normalization into the pipeline (see
>    https://diff.wikimedia.org/2023/04/28/language-harmony-and-unpacking-a-year-in-the-life-of-a-search-nerd/
>    for a fun overview). For the more complicated natural-language queries to
>    find relevant Wikipedia articles, my sense is that folks using natural
>    language searches are probably doing that within external search engines,
>    which have huge teams/infrastructure to support this, and then clicking
>    through to Wikipedia.
>    - That said, there are probably use-cases where natural-language
>    search would be more valuable. For example, within new interaction domains
>    such as chat-bots or for new editors / developers who don't yet know the
>    exact terminology to search for but want to do generic things like get
>    access to Toolforge or find out how to add a link to a page. I've been
>    putting together an example of this for Wikitech for the upcoming Hackathon
>    (details <https://phabricator.wikimedia.org/T333853>), and others have
>    proposed something similar for Project pages to help editors find answers
>    to questions about editing (details
>    <https://phabricator.wikimedia.org/T335013>).
>    - Finally, there's a second, related aspect to this which is the size
>    and diversity of a given document. Within the Wikipedia article namespace,
>    documents are generally about a single, constrained topic. So the fact that
>    lexical search systems like Elasticsearch operate at the document-level is
>    a very good fit -- i.e. index all the keywords for a given article
>    together. When thinking about other namespaces like Project/Help pages or
>    Wikitech documentation, a single page can be much larger and be about far
>    more diverse topics. This presents further challenges to finding good
>    keyword-overlap because often the search would ideally find a very specific
>    paragraph in a much larger document about many other things. Vector search
>    doesn't directly solve this but in practice, folks tend to learn embeddings
>    for smaller passages than an entire doc -- e.g., sections or even
>    paragraphs within the section. For that reason alone, I suspect vector
>    search will do better for namespaces outside of the article namespace on
>    Wikipedia. Whether it's worth the cost is a separate question as it also
>    introduces substantial new challenges in keeping the embeddings up-to-date
>    :)
>
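
The last point above, about document size and passage-level indexing, can
be illustrated with a small sketch. It uses bag-of-words count vectors as a
toy stand-in for learned embeddings (so it is still lexical matching, not
real vector search), and the help page text, the query, and all function
names are made up for the example. What it shows is only the granularity
effect: scoring each paragraph separately lets the one relevant paragraph
stand out, while scoring the whole page dilutes it with unrelated text.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Toy stand-in for an embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def split_passages(page):
    """Split a page into paragraphs on blank lines."""
    return [p.strip() for p in page.split("\n\n") if p.strip()]


def document_score(query, page):
    """Score the query against the page as one big document."""
    return cosine(vectorize(query), vectorize(page))


def best_passage(query, page):
    """Score each paragraph separately and return (score, paragraph)."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(p)), p) for p in split_passages(page)]
    return max(scored, key=lambda pair: pair[0])


# Hypothetical help page that covers several unrelated topics.
HELP_PAGE = """Uploading images requires a free license and a description of the source.

To add a link to a page, put double square brackets around the title of the target page.

Talk pages are for discussing improvements, and every comment should be signed."""

if __name__ == "__main__":
    query = "add a link to a page"
    score, passage = best_passage(query, HELP_PAGE)
    print("whole page:  ", round(document_score(query, HELP_PAGE), 3))
    print("best passage:", round(score, 3), "|", passage)
```

Swapping vectorize() for a real embedding model would leave the rest of
the structure unchanged, and it is also where the cost of keeping
per-passage vectors up to date would come in.
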
> Hope that helps.
>
> Best,
> Isaac
>
> On Tue, May 9, 2023 at 2:10 PM Dan Andreescu <[email protected]>
> wrote:
>
>> I encourage you to reach out to the search team, they're lovely folks and
>> even better engineers.
>>
>> On Tue, May 9, 2023 at 1:53 PM Lars Aronsson <[email protected]> wrote:
>>
>>> On 2023-05-09 09:27, Thiemo Kreuz wrote:
>>> > I'm curious what the actual question is. The basic concepts have been
>>> > studied for about 60 years, and have been in use for about 20 to 30 years.
>>>
>>> Sorry to hear that you're so negative. It's quite obvious that this is
>>> not currently used in Wikipedia, but is presented everywhere as a novelty
>>> that has not been around for 20 or 30 years.
>>>
>>> > https://www.elastic.co/de/blog/introducing-approximate-nearest-neighbor-search-in-elasticsearch-8-0
>>> > https://en.wikipedia.org/wiki/Special:Version
>>> > https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/Draft/Product_%26_Technology#Objectives
>>> > https://wikitech.wikimedia.org/wiki/Search_Platform/Contact#Office_Hours
>>>
>>>
>>> Thanks! This answers my question. It's particularly interesting to read
>>> the talk page to the plan. Part of the problem is that "word embedding"
>>> and "vector search" are not mentioned there, but a vector search could
>>> have found the "ML-enabled natural language search" that is mentioned.
>>> If and when this is tried, we will need to evaluate how well it works for
>>> various languages.
>>>
>>>
>>> --
>>>    Lars Aronsson ([email protected], user:LA2)
>>>    Linköping, Sweden
>>>
>>> _______________________________________________
>>> Wikitech-l mailing list -- [email protected]
>>> To unsubscribe send an email to [email protected]
>>>
>>> https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
>>
>
>
>
> --
> Isaac Johnson (he/him/his) -- Senior Research Scientist -- Wikimedia
> Foundation



-- 
Amir (he/him)