James Salsman wrote:

> How will the Foundation's approach to machine learning of search
> results ranking guard against overfitting?


Overfitting, for those who aren't familiar with the term, describes the
situation where a machine learning model inappropriately learns very
specific details about its training set that don't generalize to the real
world. From the point of view of training, the model seems to be getting
better and better, while real-world performance is actually decreasing. As
a somewhat silly example, a model could learn that queries that have
exactly 38 words in them are 100% about baseball—because there is only one
example of a query in the training set that is 38 words long, and it is
about baseball. For more on overfitting, see Wikipedia.[1]

We employ the usual safeguards against overfitting. Certain parameters that
control how a specific type of model is built can discourage overfitting.
For example, we can forbid the model from basing any internal decision on
too little data: rather than 1 or 2 examples, the model can be required to
see 5, or 50, or 500 before it is allowed to make that decision.

We also have separate training and testing data sets. So we build a model
on one set of data, then evaluate the model on another set. The estimate of
model performance from the training set will always be at least a bit
optimistic, but the testing set—which is large enough to be representative
and which does not overlap with the training set—gives a more realistic
estimate. We choose the model that performs the best on the testing set.
Overfitted models will do worse on the testing set, and we won't use them.
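
Schematically (again, just a sketch with made-up data rather than our
production code), that procedure looks something like this:

    # Toy illustration of train/test separation: fit on one slice of the
    # data, and trust the score from the slice the model never saw.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 5))
    y = X[:, 0] + rng.normal(size=10_000)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)

    model = DecisionTreeRegressor(min_samples_leaf=1).fit(X_train, y_train)
    print("training score:", model.score(X_train, y_train))  # optimistic
    print("testing score: ", model.score(X_test, y_test))    # the honest number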

We have other methods of validating our models as well.

We have a set of machines and software that we collectively call Relevance
Forge (a.k.a. RelForge) that we can use to run large sets of queries
against different versions of the same index. We can compare the before and
after results, both automatically and manually. RelForge lets us easily
gauge the *impact* of a change. For example, a 1% net improvement could
come from making 1% of queries a bit better, or from making 49% a bit worse
and 50% a bit better. So, we can easily see whether 1% or 99% of results
change. If we see a 2% improvement but a 99% impact, something weird is
happening, and we'd investigate more deeply.
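
A stripped-down, hypothetical version of that impact calculation, just to
show the idea:

    # Hypothetical sketch: impact = fraction of queries whose top results
    # change at all between the old and new ranking, for better or worse.
    def impact(before, after, top_k=3):
        """before/after map each query to its ranked list of page titles."""
        changed = sum(1 for q in before if before[q][:top_k] != after[q][:top_k])
        return changed / len(before)

    before = {"rent": ["Rent (musical)", "Rent", "Economic rent"]}
    after  = {"rent": ["Rent", "Rent (musical)", "Economic rent"]}
    print(impact(before, after))  # 1.0 -- this one query's top results changed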

We also have many definitions of "results change" that we can evaluate: the
#1 result changes; the top 3 results change (ordered or unordered); the
number of results changes; the number of queries getting zero results
changes. And for each of these we can manually inspect a random selection of
affected queries to decide whether the results are generally better or not.
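
As tiny, hypothetical illustrations (not RelForge's actual code), a few of
those definitions might look like this:

    # before/after are the ranked result lists for one query.
    def top1_changed(before, after):
        return before[:1] != after[:1]

    def top3_changed_ordered(before, after):
        return before[:3] != after[:3]

    def top3_changed_unordered(before, after):
        return set(before[:3]) != set(after[:3])

    def result_count_changed(before, after):
        return len(before) != len(after)

    def zero_results_changed(before, after):
        return (len(before) == 0) != (len(after) == 0)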

We also run A/B tests, where we let a small sample of users get the
proposed change, while a similar number get the standard results. We do
statistical analyses on user engagement with results and various other
click metrics that let us compare the control and experimental conditions.
For more on how we test search changes in general, see Testing Search on
mediawiki.org.[2]
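
For a flavor of the statistics involved, a simple two-proportion z-test on a
single clickthrough metric might look like the following (the numbers are
invented, and the real analyses cover many more metrics):

    # Hypothetical example: did the test bucket's clickthrough rate differ
    # from the control bucket's by more than chance would explain?
    from statsmodels.stats.proportion import proportions_ztest

    clicks   = [4200, 4420]      # sessions with a click: control, test
    sessions = [10000, 10000]    # total sessions in each bucket

    stat, p_value = proportions_ztest(clicks, sessions)
    print(f"z = {stat:.2f}, p = {p_value:.4f}")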

In both of these cases—RelForge testing and A/B testing in
production—overfitted models would perform poorly, and that would become
apparent.

> For example, if most searches on "rent" do not pertain to "rent
> seeking", then how will the machine learning approach to search
> results for "rent" guard against never presenting any results on "rent
> seeking"?


Your wording has left me a bit confused, and I'm not sure whether your
concern is (a) that a query of "rent" should never return "rent seeking",
and so the machine learning model should never present it, or (b) that we
should guard against building a model that *never* presents results on
"rent seeking" for a query of "rent". I'll briefly address each.

Case (a): "rent" should *never* return "rent seeking"

It's not clear to me that returning "rent seeking" for a query of "rent" is
necessarily a case of overfitting per se, but in general the click models
that we use would take note that users who search for "rent", say, click on
the musical 70% of the time and the disambiguation page 29% of the time.
Those would be the "good" results and the model would prioritize moving
them to the top of the list.
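
As a very rough sketch (the counts are invented, and real click models also
have to correct for position bias and similar effects), turning that click
data into training labels could look like this:

    # Hypothetical click counts for the query "rent", turned into
    # relative relevance labels for training the ranking model.
    clicks = {"Rent (musical)": 700, "Rent (disambiguation)": 290, "Rent seeking": 10}
    total = sum(clicks.values())
    labels = {page: count / total for page, count in clicks.items()}
    # {'Rent (musical)': 0.7, 'Rent (disambiguation)': 0.29, 'Rent seeking': 0.01}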

*Never* presenting results on "rent seeking" would be an error. The word
"rent" appears in the "rent seeking" article and in its title, so that
article should show up somewhere in the results. Moving it up or down the
results list is a question of ranking, which is what the machine learning
model is trying to figure out.

Case (b): "rent" should not be *prevented* from returning "rent seeking"

Our click data shows that about 80% of clicks on search results are on one
of the first two results, and more than 90% are on the top 10. Our click
models for scoring the order of results reflect that. All of the value,
then, from the machine learning model's point of view, comes from getting
the top 3 to 5 results in the best possible order. There's not a lot of
value in pushing any particular result much farther down than that. For a
single-word query like "rent", title matches are the best. There are only
138 results for intitle:rent, vs. over 44K for just rent; however, the first
page of results for both is the same.
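
One standard way to encode that emphasis on the top of the list, shown here
purely for illustration, is a position discount like the one used in
DCG-style metrics:

    # Discounted gain weights by rank: the first few positions dominate,
    # so little is gained or lost by reshuffling results far down the list.
    import math

    for rank in range(1, 11):
        print(rank, round(1 / math.log2(rank + 1), 2))
    # rank 1 -> 1.0, rank 2 -> 0.63, rank 3 -> 0.5, ... rank 10 -> 0.29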

We are interested in use cases other than searchers who are looking for a
particular article or particular information, though that use case tends to
predominate. Editors, for example, might want to find all the articles
containing a particular word (e.g., a misspelling); no result would be
excluded by the machine learning model, just possibly ranked lower.

Hope that helps,
—Trey

[1] https://en.wikipedia.org/wiki/Overfitting
[2] https://www.mediawiki.org/wiki/Wikimedia_Discovery/Search/Testing_Search


Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
*(via Deb Tankersley's email address, as Trey's original email got
moderated)*

On Tue, Jun 13, 2017 at 3:43 PM, James Salsman <jsals...@gmail.com> wrote:

> On Wed, Jun 14, 2017 at 5:25 AM, Deborah Tankersley
> <dtankers...@wikimedia.org> wrote:
> >
> > The Discovery team structure has now changed, but the new teams will
> > still work together to complete the goals as listed in the draft annual
> > plan.[2] A summary of their anticipated work, as we finalize these
> > changes, is below. We plan on doing a check-in at the end of the calendar
> > year to see how our goals are progressing with the new smaller and
> > separated team structure.
> >
> > Here is a list of the various projects under the Discovery umbrella,
> > along with the goals that they will be working on:
> >
> > Search Backend
> >
> > Improve search capabilities:
> >
> >    Implement ‘learning to rank’ [3] and other advanced machine learning
> >    methodologies
> >...
> > [3] https://en.wikipedia.org/wiki/Learning_to_rank
>
> How will the Foundation's approach to machine learning of search
> results ranking guard against overfitting?
>
> For example, if most searches on "rent" do not pertain to "rent
> seeking", then how will the machine learning approach to search
> results for "rent" guard against never presenting any results on "rent
> seeking"?
>