Determining the intent of a particular search is indeed very difficult, and
it is not really feasible to even attempt at the scale needed for machine
learning (unless you have an immense budget, like some for-profit search
engines).

For our machine learning training data, we use click models suggested by
academic research. These models allow us to score the results for a given
query based on which results users actually clicked on (and didn't click
on). The results aren't perfect, but they are good, and they can be
automatically generated for millions of training examples taken from real
user queries and clicks.
These scores serve as a proxy for user intent, without needing to actually
understand it. As an example, if 35% of people click on the first result
for a particular query, and 60% on the second result, the click scores
would indicate that the order should be swapped, even without knowing the
intent of the query or the content of the results.
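To make that concrete, here's a toy sketch of the idea (not our actual click model; real models like the ones in the academic literature also correct for position bias, since lower-ranked results get seen and clicked less often):

```python
# Toy illustration: estimate a relevance score per result from raw
# click-through rates, then reorder by score. Names and numbers are
# made up for the example.

def click_scores(impressions, clicks):
    """Fraction of impressions that led to a click, per result."""
    return {r: clicks.get(r, 0) / impressions[r] for r in impressions}

def rerank(results, scores):
    """Order results by descending click score."""
    return sorted(results, key=lambda r: scores[r], reverse=True)

# 1000 users saw both results; 35% clicked A (shown first), 60% clicked B.
impressions = {"article_A": 1000, "article_B": 1000}
clicks = {"article_A": 350, "article_B": 600}

scores = click_scores(impressions, clicks)
print(rerank(["article_A", "article_B"], scores))  # article_B moves to the top
```

Even this naive version captures the key point: the clicks alone tell us the order should change, without any understanding of the query.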
Swapping the top two results isn't really a big win, but the hope is that
by identifying features of the query (e.g., number of words), of the
articles (e.g., popularity), and of the relationship between them (e.g.,
number of words in common between the query and the article title) we will
learn something that is more generally true. If we do, then we may move a
result for a different query from, say, position 8 (where few people ever
click) to position 3 (where there is at least a chance of a click).
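A hypothetical feature extractor of the kind described above might look like this (the feature names and values are illustrative, not our actual feature set):

```python
# Sketch of learning-to-rank feature extraction: one feature of the
# query, one of the article, and one of the relationship between them.

def extract_features(query, article):
    q_words = set(query.lower().split())
    t_words = set(article["title"].lower().split())
    return {
        "query_num_words": len(q_words),              # feature of the query
        "article_popularity": article["popularity"],  # feature of the article
        "title_overlap": len(q_words & t_words),      # feature of the relationship
    }

article = {"title": "Economic rent", "popularity": 0.8}
print(extract_features("economic rents theory", article))
```

The model is then trained on (features, click score) pairs, so whatever it learns about, say, title overlap applies to queries it has never seen.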
Iterating the whole process will allow us to detect that the result newly
in position 3 is actually a really popular result so we should adjust the
model to boost it even more, or that it's not that great and we should
adjust the model to put something better in the #3 slot. Of course, all of
the "adjusting" of the model happens automatically during training.
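As a very loose sketch of the iterate-and-adjust idea (in the real pipeline the adjusting happens inside model training, not by hand as here, and the click evidence comes from logs, not a hard-coded table):

```python
# Toy feedback loop: each round we rank by current scores, "observe"
# clicks, and fold the new evidence into the scores. Scores converge
# toward what users actually prefer.

results = {"A": 0.5, "B": 0.5, "C": 0.5}      # initial scores
true_appeal = {"A": 0.2, "B": 0.7, "C": 0.4}  # hidden user preference (simulated)

for _ in range(20):
    ranking = sorted(results, key=results.get, reverse=True)
    for r in ranking:
        observed_ctr = true_appeal[r]  # pretend this was measured from logs
        results[r] += 0.3 * (observed_ctr - results[r])  # move toward evidence

print(sorted(results, key=results.get, reverse=True))  # → ['B', 'C', 'A']
```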
Through this iterative process of modeling, training, evaluation, and
deployment, we are attempting to take into account the relationship between
the user's intent and the search results—inferred from the user's
behavior—to improve the search results.
Software Engineer, Discovery
On Fri, Jun 16, 2017 at 10:26 AM, James Salsman <jsals...@gmail.com> wrote:
> Hi Trey,
> Thanks for your very detailed reply. I have a followup question.
> How do you determine search intents? For example, if you see someone
> searching for "rents" how do you know whether they are looking for
> economic or property rents when evaluating the quality of the search
> results? If you're training machine learning models from "5, 50, or
> 500" examples, you need to have labels on each of those examples
> indicating whether the results are good or not.
> Do you interview searchers after the fact? Ask people to search and
> record the terms they search on? What kind of infrastructure do you
> have to make sure you're getting correct intents robust enough to
> score the example results? Maybe surveys occurring on some small
> fraction of results asking users to describe in greater detail exactly
> what they were trying to find?
> Best regards,
> On Thu, Jun 15, 2017 at 10:40 PM, Deborah Tankersley
> <dtankers...@wikimedia.org> wrote:
> > James Salsman wrote:
> >> How will the Foundation's approach to machine learning of search
> >> results ranking guard against overfitting?
> > Overfitting, for those who aren't familiar with the term, describes the
> > situation where a machine learning model inappropriately learns very
> > specific details about its training set that don't generalize to the real
> > world. From the point of view of training, the model seems to be getting
> > better and better, while real-world performance is actually decreasing.
> > As a somewhat silly example, a model could learn that queries that have
> > exactly 38 words in them are 100% about baseball—because there is only
> > one example of a query in the training set that is 38 words long, and it
> > is about baseball. For more on overfitting, see Wikipedia.
> > We employ the usual safeguards against overfitting. Certain parameters
> > that control how a specific type of model is built can discourage
> > overfitting. For example, not allowing a decision inside the model to be
> > made on too little data—so rather than 1 or 2 examples to base a decision
> > on, the model can be told it needs to see 5, or 50, or 500.
> > We also have separate training and testing data sets. So we build a model
> > on one set of data, then evaluate the model on another set. The estimate
> > of model performance from the training set will always be at least a bit
> > optimistic, but the t