On top of the ones mentioned, the ORES topic detection model <https://github.com/wikimedia/drafttopic> (the one that says which WikiProject an article belongs to; an example: <https://ores.wikimedia.org/v3/scores/enwiki/?revids=1153214555>) has been using word embeddings since 2018-ish.
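For anyone who wants to poke at the example link, here is a minimal Python sketch of how such a scores URL could be put together and how the topic probabilities might be read out of the JSON. The host and path come from the example URL above; everything else is an assumption for illustration: the `models` query parameter, joining several revision IDs with `|`, the nested response layout (`wiki -> "scores" -> rev id -> model -> "score" -> "probability"`), and the helper names `build_scores_url` and `top_topics`, which are hypothetical. It makes no network request.

```python
from urllib.parse import urlencode

# Host and path copied from the example link above.
ORES_BASE = "https://ores.wikimedia.org/v3/scores"


def build_scores_url(wiki, rev_ids, models=None):
    """Build a scores URL of the same form as the example link."""
    # Assumption: several revision IDs are joined with "|".
    params = {"revids": "|".join(str(r) for r in rev_ids)}
    if models:
        # Assumption: a `models` parameter restricts the response
        # to the named models (e.g. "drafttopic").
        params["models"] = "|".join(models)
    return f"{ORES_BASE}/{wiki}/?{urlencode(params)}"


def top_topics(response, wiki, rev_id, model="drafttopic", k=3):
    """Return the k most probable (topic, probability) pairs.

    Assumption: `response` is the parsed JSON, laid out as
    response[wiki]["scores"][rev_id][model]["score"]["probability"].
    """
    score = response[wiki]["scores"][str(rev_id)][model]["score"]
    ranked = sorted(score["probability"].items(),
                    key=lambda item: item[1], reverse=True)
    return ranked[:k]


print(build_scores_url("enwiki", [1153214555]))
```

With a single revision ID and no `models` argument, `build_scores_url` reproduces the example link exactly; `top_topics` only works if the response really has the assumed layout.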
HTH

On Tue, 9 May 2023 at 22:10, Isaac Johnson <[email protected]> wrote:

> +1 to the suggestion to connect with the Search team. Also a few more
> thoughts about vector / natural-language search and its relevance to
> Wikimedia from my perspective in Research:
>
>    - The common critique of lexical / keyword-based search and why folks
>    point to vector / embedding-based search is handling more
>    natural-language queries (e.g., "What are the different objectives of
>    the United Nations Sustainable Development Goals?" vs. "UN SDG"). The
>    former has a lot of words in it that lead to keyword overlap with
>    less-relevant pages so keyword-based search doesn't do as well. The
>    latter is much more direct and even matches an existing redirect on
>    Wikipedia to the article on UN Sustainable Development Goals, so our
>    existing keyword-based search handles it very well.
>    - Most existing users of Wikimedia's search are probably doing
>    something closer to the latter above -- i.e. using pretty exact
>    keywords to navigate to a specific page (or find it exists). This is
>    backed up by the data: 80% of searches on Wikipedia are auto-completed
>    directly to article pages
>    <https://upload.wikimedia.org/wikipedia/commons/8/87/Understanding_Search_Behavior_in_Wikipedia_-_Report_-_Bruno_Scarone.pdf#page=7>.
>    In that sense, the system is working quite well! The Search team also
>    has added quite a bit of normalization into the pipeline (see
>    https://diff.wikimedia.org/2023/04/28/language-harmony-and-unpacking-a-year-in-the-life-of-a-search-nerd/
>    for a fun overview). For the more complicated natural-language queries
>    to find relevant Wikipedia articles, my sense is that folks using
>    natural language searches are probably doing that within external
>    search engines, which have huge teams/infrastructure to support this,
>    and then clicking through to Wikipedia.
>    - That said, there are probably use-cases where natural-language
>    search would be more valuable.
>    For example, within new interaction domains such as chat-bots or for
>    new editors / developers who don't yet know the exact terminology to
>    search for but want to do generic things like get access to Toolforge
>    or find out how to add a link to a page. I've been putting together an
>    example of this for Wikitech for the upcoming Hackathon (details
>    <https://phabricator.wikimedia.org/T333853>) and others have proposed
>    e.g., this for Project pages to help editors find answers to questions
>    about editing (details <https://phabricator.wikimedia.org/T335013>).
>    - Finally, there's a second, related aspect to this which is the size
>    and diversity of a given document. Within the Wikipedia article
>    namespace, documents are generally about a single, constrained topic.
>    So the fact that lexical search systems like Elasticsearch operate at
>    the document-level is a very good fit -- i.e. index all the keywords
>    for a given article together. When thinking about other namespaces
>    like Project/Help pages or Wikitech documentation, a single page can
>    be much larger and be about far more diverse topics. This presents
>    further challenges to finding good keyword-overlap because often the
>    search would ideally find a very specific paragraph in a much larger
>    document about many other things. Vector search doesn't directly solve
>    this but in practice, folks tend to learn embeddings for smaller
>    passages than an entire doc -- e.g., sections or even paragraphs
>    within the section. For that reason alone, I suspect vector search
>    will do better for namespaces outside of the article namespace on
>    Wikipedia. Whether it's worth the cost is a separate question as it
>    also introduces substantial new challenges in keeping the embeddings
>    up-to-date :)
>
> Hope that helps.
>
> Best,
> Isaac
>
> On Tue, May 9, 2023 at 2:10 PM Dan Andreescu <[email protected]>
> wrote:
>
>> I encourage you to reach out to the search team, they're lovely folks and
>> even better engineers.
>>
>> On Tue, May 9, 2023 at 1:53 PM Lars Aronsson <[email protected]> wrote:
>>
>>> On 2023-05-09 09:27, Thiemo Kreuz wrote:
>>> > I'm curious what the actual question is. The basic concepts are
>>> > studied for about 60 years, and are in use for about 20 to 30 years.
>>>
>>> Sorry to hear that you're so negative. It's quite obvious that this is
>>> not currently used in Wikipedia, but is presented everywhere as a
>>> novelty that has not been around for 20 or 30 years.
>>>
>>> > https://www.elastic.co/de/blog/introducing-approximate-nearest-neighbor-search-in-elasticsearch-8-0
>>> > https://en.wikipedia.org/wiki/Special:Version
>>> > https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/Draft/Product_%26_Technology#Objectives
>>> > https://wikitech.wikimedia.org/wiki/Search_Platform/Contact#Office_Hours
>>>
>>> Thanks! This answers my question. It's particularly interesting to read
>>> the talk page to the plan. Part of the problem is that "word embedding"
>>> and "vector search" are not mentioned there, but a vector search could
>>> have found the "ML-enabled natural language search" that is mentioned.
>>> If and when this is tried, we will need to evaluate how well it works
>>> for various languages.
>>>
>>> --
>>> Lars Aronsson ([email protected], user:LA2)
>>> Linköping, Sweden
>>>
>>> _______________________________________________
>>> Wikitech-l mailing list -- [email protected]
>>> To unsubscribe send an email to [email protected]
>>> https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
>
> --
> Isaac Johnson (he/him/his) -- Senior Research Scientist -- Wikimedia
> Foundation

--
Amir (he/him)
