[Wikitech-l] Re: Word embeddings / vector search

Daniel Kinzler Tue, 09 May 2023 14:17:47 -0700

Hi Lars!

It's certainly not a new idea; I literally wrote my master's thesis on it <https://nbn-resolving.org/urn:nbn:de:bsz:15-qucosa2-166373> (in German).

It's an interesting idea, but it's not easy to make it work well. There is a lot of noise in the data.


Here's a presentation I gave at Wikimania 2009 on applying it to image search:

 * https://commons.wikimedia.org/wiki/File:Wikimania2009-WikiWord-Paper.pdf
 * https://upload.wikimedia.org/wikipedia/commons/4/4d/Wikimania2009-WikiWord.pdf
 * https://commons.wikimedia.org/wiki/File:200908280921-Daniel_Kinzler-WikiWord_Multilingual_image_search_and_more.ogv

On 09.05.2023 at 22:36, Amir Sarabadani wrote:

On top of the ones mentioned, the ORES topic detection model <https://github.com/wikimedia/drafttopic> (the one that says which WikiProject an article belongs to; an example: <https://ores.wikimedia.org/v3/scores/enwiki/?revids=1153214555>) has been using word embeddings since 2018-ish.

HTH

On Tue, 9 May 2023 at 22:10, Isaac Johnson <[email protected]> wrote:

+1 to the suggestion to connect with the Search team. Also a few more
thoughts about vector / natural-language search and its relevance to
Wikimedia from my perspective in Research:

* The common critique of lexical / keyword-based search and why folks
point to vector / embedding-based search is handling more
natural-language queries (e.g., "What are the different objectives of
the United Nations Sustainable Development Goals?" vs. "UN SDG"). The
former has a lot of words in it that lead to keyword overlap with
less-relevant pages so keyword-based search doesn't do as well. The
latter is much more direct and even matches an existing redirect on
Wikipedia to the article on UN Sustainable Development Goals, so our
existing keyword-based search handles it very well.
* Most existing users of Wikimedia's search are probably doing something
closer to the latter above -- i.e. using fairly exact keywords to
navigate to a specific page (or to find out whether it exists). This is
backed up by the data: 80% of searches on Wikipedia are auto-completed
directly to article pages
<https://upload.wikimedia.org/wikipedia/commons/8/87/Understanding_Search_Behavior_in_Wikipedia_-_Report_-_Bruno_Scarone.pdf#page=7>.
In that sense, the system is working quite well! The Search team also
has added quite a bit of normalization into the pipeline (see
https://diff.wikimedia.org/2023/04/28/language-harmony-and-unpacking-a-year-in-the-life-of-a-search-nerd/
for a fun overview). For the more complicated natural-language queries
to find relevant Wikipedia articles, my sense is that folks using
natural language searches are probably doing that within external
search engines, which have huge teams/infrastructure to support this,
and then clicking through to Wikipedia.
* That said, there are probably use-cases where natural-language search
would be more valuable. For example, within new interaction domains
such as chat-bots or for new editors / developers who don't yet know
the exact terminology to search for but want to do generic things like
get access to Toolforge or find out how to add a link to a page. I've
been putting together an example of this for Wikitech for the upcoming
Hackathon (details <https://phabricator.wikimedia.org/T333853>), and
others have proposed something similar for Project pages to help editors
find answers to questions about editing (details
<https://phabricator.wikimedia.org/T335013>).
* Finally, there's a second, related aspect to this which is the size
and diversity of a given document. Within the Wikipedia article
namespace, documents are generally about a single, constrained topic.
So the fact that lexical search systems like Elasticsearch operate at
the document-level is a very good fit -- i.e. index all the keywords
for a given article together. When thinking about other namespaces
like Project/Help pages or Wikitech documentation, a single page can
be much larger and be about far more diverse topics. This presents
further challenges to finding good keyword-overlap because often the
search would ideally find a very specific paragraph in a much larger
document about many other things. Vector search doesn't directly solve
this but in practice, folks tend to learn embeddings for smaller
passages than an entire doc -- e.g., sections or even paragraphs
within the section. For that reason alone, I suspect vector search
will do better for namespaces outside of the article namespace on
Wikipedia. Whether it's worth the cost is a separate question as it
also introduces substantial new challenges in keeping the embeddings
up-to-date :)
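
To make the passage-level point in the last bullet concrete, here is a minimal sketch, not anyone's actual implementation. It indexes each paragraph of a page separately and returns the best-matching passage by cosine similarity. Everything in it is an assumption made for illustration: the sample page texts, the names `PAGES`, `build_index` and `search`, and above all `embed`, which is only a bag-of-words stand-in for a learned embedding model.

```python
import math
import re
from collections import Counter

# Made-up sample pages (hypothetical text, for illustration only).
PAGES = {
    "Help:Editing": (
        "To create an account, choose a username and password.\n\n"
        "To add a link to a page, wrap the target title in double square brackets.\n\n"
        "To upload an image, use the upload wizard."
    ),
    "Help:Toolforge": (
        "To get access to Toolforge, request membership with your developer account."
    ),
}


def embed(text):
    """Stand-in for a learned embedding: a sparse vector of term counts.

    A real system would call a trained model here instead.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(pages):
    """Embed every paragraph separately instead of each page as a whole."""
    index = []
    for title, text in pages.items():
        passages = [p.strip() for p in text.split("\n\n") if p.strip()]
        for position, passage in enumerate(passages):
            index.append((title, position, passage, embed(passage)))
    return index


def search(index, query, k=1):
    """Return the k passages most similar to the query (brute force)."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda e: cosine(query_vec, e[3]), reverse=True)
    return [(title, position, passage) for title, position, passage, _ in ranked[:k]]


if __name__ == "__main__":
    index = build_index(PAGES)
    for title, position, passage in search(index, "how do I add a link to a page"):
        print(f"{title} (paragraph {position}): {passage}")
```

With these sample pages, the query lands on the second paragraph of the made-up Help:Editing page rather than on the page as a whole. Swapping `embed` for a trained model would keep the same structure, but every edited passage would then need to be re-embedded, which is the upkeep cost mentioned above.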

Hope that helps.

Best,
Isaac

On Tue, May 9, 2023 at 2:10 PM Dan Andreescu <[email protected]>
wrote:

I encourage you to reach out to the search team, they're lovely folks
and even better engineers.

On Tue, May 9, 2023 at 1:53 PM Lars Aronsson <[email protected]> wrote:

On 2023-05-09 09:27, Thiemo Kreuz wrote:
> I'm curious what the actual question is. The basic concepts have
> been studied for about 60 years, and have been in use for about 20
> to 30 years.

Sorry to hear that you're so negative. It's quite obvious that this is
not currently used in Wikipedia, but is presented everywhere as a
novelty that has not been around for 20 or 30 years.

> https://www.elastic.co/de/blog/introducing-approximate-nearest-neighbor-search-in-elasticsearch-8-0
> https://en.wikipedia.org/wiki/Special:Version
> https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/Draft/Product_%26_Technology#Objectives
> https://wikitech.wikimedia.org/wiki/Search_Platform/Contact#Office_Hours

Thanks! This answers my question. It's particularly interesting to read
the talk page to the plan. Part of the problem is that "word embedding"
and "vector search" are not mentioned there, but a vector search could
have found the "ML-enabled natural language search" that is mentioned.
If and when this is tried, we will need to evaluate how well it works
for various languages.

-- Lars Aronsson ([email protected], user:LA2)
   Linköping, Sweden

            _______________________________________________
            Wikitech-l mailing list -- [email protected]
            To unsubscribe send an email to [email protected]
            
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/


-- Isaac Johnson (he/him/his) -- Senior Research Scientist -- Wikimedia Foundation



--
Amir (he/him)




--
Daniel Kinzler
Principal Software Engineer, Platform Engineering
Wikimedia Foundation

