Hi Lars!

It's certainly not a new idea. I literally wrote my master's thesis on it: <https://nbn-resolving.org/urn:nbn:de:bsz:15-qucosa2-166373> (in German).

It's an interesting idea, but it's not easy to make it work well. There is a lot of noise in the data.

Here's a presentation I gave at Wikimania 2009 on applying it to image search:

 * https://commons.wikimedia.org/wiki/File:Wikimania2009-WikiWord-Paper.pdf
 * https://upload.wikimedia.org/wikipedia/commons/4/4d/Wikimania2009-WikiWord.pdf
 * https://commons.wikimedia.org/wiki/File:200908280921-Daniel_Kinzler-WikiWord_Multilingual_image_search_and_more.ogv

On 09.05.2023 at 22:36, Amir Sarabadani wrote:
On top of the ones mentioned, the ORES topic detection model <https://github.com/wikimedia/drafttopic> (the one that says which WikiProject an article belongs to; an example: <https://ores.wikimedia.org/v3/scores/enwiki/?revids=1153214555>) has been using word embeddings since 2018-ish.

HTH

On Tue, 9 May 2023 at 22:10, Isaac Johnson <[email protected]> wrote:

    +1 to the suggestion to connect with the Search team. Also a few more
    thoughts about vector / natural-language search and its relevance to
    Wikimedia from my perspective in Research:

      * The common critique of lexical / keyword-based search and why folks
        point to vector / embedding-based search is handling more
        natural-language queries (e.g., "What are the different objectives of
        the United Nations Sustainable Development Goals?" vs. "UN SDG"). The
        former has a lot of words in it that lead to keyword overlap with
        less-relevant pages so keyword-based search doesn't do as well. The
        latter is much more direct and even matches an existing redirect on
        Wikipedia to the article on UN Sustainable Development Goals, so our
        existing keyword-based search handles it very well.
      * Most existing users of Wikimedia's search are probably doing something
        closer to the latter above -- i.e. using pretty exact keywords to
        navigate to a specific page (or find it exists). This is backed up by
        the data: 80% of searches on Wikipedia are auto-completed directly to
        article pages
        <https://upload.wikimedia.org/wikipedia/commons/8/87/Understanding_Search_Behavior_in_Wikipedia_-_Report_-_Bruno_Scarone.pdf#page=7>.
        In that sense, the system is working quite well! The Search team also
        has added quite a bit of normalization into the pipeline (see
        https://diff.wikimedia.org/2023/04/28/language-harmony-and-unpacking-a-year-in-the-life-of-a-search-nerd/
        for a fun overview). For the more complicated natural-language queries
        to find relevant Wikipedia articles, my sense is that folks using
        natural language searches are probably doing that within external
        search engines, which have huge teams/infrastructure to support this,
        and then clicking through to Wikipedia.
      * That said, there are probably use-cases where natural-language search
        would be more valuable. For example, within new interaction domains
        such as chat-bots or for new editors / developers who don't yet know
        the exact terminology to search for but want to do generic things like
        get access to Toolforge or find out how to add a link to a page. I've
        been putting together an example of this for Wikitech for the upcoming
        Hackathon (details <https://phabricator.wikimedia.org/T333853>), and
        others have proposed the same for, e.g., Project pages to help editors
        find answers to questions about editing (details
        <https://phabricator.wikimedia.org/T335013>).
      * Finally, there's a second, related aspect to this which is the size
        and diversity of a given document. Within the Wikipedia article
        namespace, documents are generally about a single, constrained topic.
        So the fact that lexical search systems like Elasticsearch operate at
        the document-level is a very good fit -- i.e. index all the keywords
        for a given article together. When thinking about other namespaces
        like Project/Help pages or Wikitech documentation, a single page can
        be much larger and be about far more diverse topics. This presents
        further challenges to finding good keyword-overlap because often the
        search would ideally find a very specific paragraph in a much larger
        document about many other things. Vector search doesn't directly solve
        this but in practice, folks tend to learn embeddings for smaller
        passages than an entire doc -- e.g., sections or even paragraphs
        within the section. For that reason alone, I suspect vector search
        will do better for namespaces outside of the article namespace on
        Wikipedia. Whether it's worth the cost is a separate question as it
        also introduces substantial new challenges in keeping the embeddings
        up-to-date :)
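
The passage-level point in the last bullet can be illustrated with a small sketch. This is only a toy under stated assumptions, not what the Search team or any Wikimedia service runs: bag-of-words count vectors stand in for learned embeddings, and the help page text, the query, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_passages(page: str) -> list[str]:
    """Split a page into paragraph-sized passages on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", page) if p.strip()]


def best_passage(query: str, page: str) -> tuple[int, float]:
    """Return (index, score) of the passage most similar to the query."""
    passages = split_passages(page)
    if not passages:
        raise ValueError("page contains no passages")
    q = bag_of_words(query)
    scores = [cosine(q, bag_of_words(p)) for p in passages]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical help page that covers several unrelated topics.
HELP_PAGE = """Request a Toolforge account by filling in the membership form.

Add a link to a page by wrapping the target title in double square brackets.

Upload images through the upload wizard and choose a free license."""

query = "how do I add a link to a page"

# Score the query against each passage, and against the page as a whole.
index, passage_score = best_passage(query, HELP_PAGE)
page_score = cosine(bag_of_words(query), bag_of_words(HELP_PAGE))

print(f"best passage: {index}, passage score {passage_score:.2f}, "
      f"whole-page score {page_score:.2f}")
```

In this toy setup the matching paragraph scores higher on its own than the page does as a whole, because the unrelated paragraphs dilute the page-level vector. With real embeddings the same splitting step applies, at the cost of having to re-embed passages whenever a page is edited.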

    Hope that helps.

    Best,
    Isaac

    On Tue, May 9, 2023 at 2:10 PM Dan Andreescu <[email protected]> wrote:

        I encourage you to reach out to the search team, they're lovely folks
        and even better engineers.

        On Tue, May 9, 2023 at 1:53 PM Lars Aronsson <[email protected]> wrote:

            On 2023-05-09 09:27, Thiemo Kreuz wrote:
            > I'm curious what the actual question is. The basic concepts have
            > been studied for about 60 years, and have been in use for about
            > 20 to 30 years.

            Sorry to hear that you're so negative. It's quite obvious that this
            is not currently used in Wikipedia, but is presented everywhere as
            a novelty that has not been around for 20 or 30 years.

            > https://www.elastic.co/de/blog/introducing-approximate-nearest-neighbor-search-in-elasticsearch-8-0
            > https://en.wikipedia.org/wiki/Special:Version
            > https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/Draft/Product_%26_Technology#Objectives
            > https://wikitech.wikimedia.org/wiki/Search_Platform/Contact#Office_Hours


            Thanks! This answers my question. It's particularly interesting to
            read the talk page for the plan. Part of the problem is that "word
            embedding" and "vector search" are not mentioned there, but a
            vector search could have found the "ML-enabled natural language
            search" that is mentioned. If and when this is tried, we will need
            to evaluate how well it works for various languages.


            -- Lars Aronsson ([email protected], user:LA2)
               Linköping, Sweden

            _______________________________________________
            Wikitech-l mailing list -- [email protected]
            To unsubscribe send an email to [email protected]
            https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/




    -- Isaac Johnson (he/him/his) -- Senior Research Scientist -- Wikimedia Foundation



--
Amir (he/him)



--
Daniel Kinzler
Principal Software Engineer, Platform Engineering
Wikimedia Foundation
