Re: Where Search Meets Machine Learning

2015-05-04 Thread Doug Turnbull
Awesome, I think I could learn a lot from you.

Do you have a decent amount of user data? Sounds like you have a ton.

I noticed that information retrieval problems fall into a sort-of layered
pyramid. At the topmost point is someone like Google, where there is so much
high-quality user behavior data that search truly is a machine learning
problem, much as you propose. As you move down the pyramid, the quality of
user data diminishes.

Eventually you get to a very thick layer of middle-class search
applications that value relevance but have very modest amounts of user
data, or none. For most of them, even if they tracked their searches over a
year, they *might* get good data on their top 50 searches. (I know because
they send me the spreadsheet and say "fix it!") The best use they can make
of analytics data is after-action troubleshooting. Actual user emails
complaining about the search can be more useful than behavior data!

So at this layer, the goal is to construct inverted indices that reflect
features likely to be important to users. In a sense this becomes more of a
programming task than a large-scale optimization task. You have content
experts who tell you, either precisely or vaguely, what the search solution
ought to do (presumably they represent users). If you're lucky, this will
be informed by some ad-hoc usability testing.

So you end up doing a mix of data modeling and intelligent use of queries,
and perhaps some targeted programming to develop specific scoring
functions, etc.
http://opensourceconnections.com/blog/2014/12/08/title-search-when-relevancy-is-only-skin-deep/
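
To make that concrete, here is a minimal sketch (mine, not from the thread)
of the kind of hand-built scoring I mean: an Elasticsearch function_score
query expressed as a Python dict. The index, field names ("title",
"description", "popularity") and boost values are all hypothetical.

# Hand-tuned ranking rules expressed as an Elasticsearch function_score query.
# Index/field names and boosts below are hypothetical, for illustration only.
query_body = {
    "query": {
        "function_score": {
            "query": {
                "multi_match": {
                    "query": "trail running shoes",
                    "fields": ["title^3", "description"],  # title matches weighted 3x
                }
            },
            "functions": [
                {
                    # gentle, expert-tunable boost from a popularity field
                    "field_value_factor": {
                        "field": "popularity",
                        "modifier": "log1p",
                        "factor": 0.5,
                    }
                }
            ],
            "boost_mode": "sum",  # final score = text score + popularity term
        }
    }
}
# es.search(index="products", body=query_body)  # given an Elasticsearch client `es`

Because every number sits in plain sight, a domain expert can nudge the 3x
title boost or the 0.5 popularity factor without retraining anything.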

One advantage of this approach is that, for many search applications, you
might be able to explain how the ranking function works in terms of a set
of specific rules. It also gives domain experts points where they can tweak
the overall ranking strategy, which keeps it somewhat predictable and
controllable for them.

Anyway, I'm forever curious about the boundary line between this sort of
work and the "search is truly a machine learning problem" sort of work. I
have seen a fair amount of gray area where user data might be decent or
possibly misleading, and you have to do a lot of data-janitor work to sort
it out.

Good stuff!
-Doug


On Fri, May 1, 2015 at 6:16 PM, J. Delgado joaquin.delg...@gmail.com
wrote:

 Doug,

 Thanks for your insights. We actually started with trying to build off of
 features and boosting weights combined with built-in relevance scoring
 http://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html.
 We also played around with replacing and/or combining the default score
 with other computations using the function_score
 http://www.elastic.co/guide/en/elasticsearch/guide/current/function-score-query.html
 query, but as you mentioned in your article, the crux of the problem is *how
 to figure out the weights that control each feature's influence*:

 *Once important features are placed in the search engine the final
 problem becomes balancing and regulating their influence. Should text-based
 factors matter more than sales based factors? Should exact text matches
 matter more than synonym-based matches? What about metadata we glean from
 machine learning – how much weight should this play*?

 Furthermore, this only covers cases where the scoring can be represented
 as a function of such weights! We felt that this approach was shortsighted,
 as some of the problems we are dealing with (e.g. product recommendations,
 response prediction, real-time bidding for advertising, etc.) have a very
 large feature space, sometimes requiring *dimensionality reduction* (e.g.
 Matrix Factorization techniques) or learning from past actions/feedback
 (e.g. clickthrough data, bidding win rates, remaining budget, etc.). All
 this seemed well suited for (supervised) machine learning tasks such as
 prediction based on past training data (classification or regression).
 These algorithms usually have an offline model-building phase and an online
 evaluator phase that uses the created model to perform the
 prediction/scoring during query evaluation. Additionally, some of the best
 algorithms in machine learning (Random Forest, Support Vector Machines,
 Deep Learning/Neural Networks, etc.) are not linear combinations of
 feature weights, and require additional data structures (e.g. trees,
 support vectors) to support the computation.

 Since there is no one-size-fits-all predictive algorithm, we architected
 the solution so any algorithm that implements our interface can be used. We
 tried this out with algorithms available in Weka
 http://www.cs.waikato.ac.nz/ml/weka/ and Spark MLlib
 https://spark.apache.org/docs/1.2.1/mllib-guide.html (only linear
 models for now) and it worked! In any case, nothing prevents us from
 leveraging the text-based analysis of features and the default scoring
 available within the plugin, which can be combined with the results of the
 prediction.

 To demonstrate its general utility 

Re: Where Search Meets Machine Learning

2015-05-04 Thread J. Delgado
Sorry, as I was saying, the machine learning approach is NOT limited to
having lots of user action data. In fact, having little or no user action
data is commonly referred to as the cold-start problem in recommender
systems. In that case, it is useful to exploit content-based similarities
as well as context (such as location, time of day, day of week, site
section, device type, etc.) to make predictions/scores. This can still be
combined with the usual IR-based scoring to keep semantics as the driving
force.
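
As a rough illustration of that idea (a sketch I'm adding, with made-up
data and an arbitrary blending weight), content-based similarity plus a
context prior can produce scores before any user actions exist:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["waterproof hiking boots", "trail running shoes", "leather office shoes"]
query = "boots for rainy hikes"

vec = TfidfVectorizer()
doc_matrix = vec.fit_transform(docs)
content_score = cosine_similarity(vec.transform([query]), doc_matrix).ravel()

# Hypothetical context prior, e.g. item affinity for (mobile, weekend); with
# no user-action history this could come from editorial rules or catalog data.
context_prior = np.array([0.30, 0.20, 0.05])

alpha = 0.8  # how much to trust content similarity vs. context; a tuning choice
final_score = alpha * content_score + (1 - alpha) * context_prior
ranking = np.argsort(-final_score)  # best match first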

-J

On Monday, May 4, 2015, J. Delgado joaquin.delg...@gmail.com wrote:

 BTW, as I mentioned, the machine learning

 On Monday, May 4, 2015, J. Delgado joaquin.delg...@gmail.com wrote:

 I totally agree that it depends on the task at hand and the
 amount/quality of the data that you can get hold of.

 The problem of relevance in traditional document/semantic information
 retrieval (IR) tasks is such a hard one because in most cases there is
 little or no source of truth you could use as training data (unless you use
 something like TREC for a limited set of documents to evaluate).
 Additionally, the feedback data you get from users, if it exists, is very
 noisy. In this case prior knowledge, encoded as attribute weights, crafted
 functions, and heuristics, is your best bet. You can however mine the
 content itself by leveraging clustering/topic modeling via LDA, which is an
 unsupervised learning algorithm, and use that as input. Or perhaps
 Labeled-LDA and Multi-Grain LDA (topic models for classification and
 sentiment analysis), which are supervised algorithms, in which case you can
 still use the approach I suggested.

 However, for search tasks that involve e-commerce, advertisements,
 recommendations, etc., there is typically more data that can be captured
 from users' interactions with the system/site and used as signals, and
 users' actions (adding things to wish lists, clicks for more info,
 conversions, etc.) are much more telling about the intention/value the
 user assigns to what is presented to them. Then viewing search as a
 machine learning/multi-objective optimization problem makes sense.

 My point is that search engines nowadays are used for all these use
 cases, so it is worth exploring all the avenues exposed in this thread.

 Cheers,

 -- Joaquin

 On Mon, May 4, 2015 at 2:31 PM, Tom Burton-West tburt...@umich.edu
 wrote:

 Hi Doug and Joaquin,

 This is a really interesting discussion.  Joaquin, I'm looking forward
 to taking your code for a test drive.  Thank you for making it publicly
 available.

 Doug, I'm interested in your pyramid observation. I work with academic
 search, which has some of the problems of unique queries/information needs
 and of data sparsity that you mention in your blog post.

 This article makes a similar argument that massive amounts of user data
 are so important for modern search engines that it is essentially a barrier
 to entry for new web search engines.
 Usage Data in Web Search: Benefits and Limitations. Ricardo Baeza-Yates and
 Yoelle Maarek.  In Proceedings of SSDBM'2012, Chania, Crete, June 2012.
 http://www.springerlink.com/index/58255K40151U036N.pdf

  Tom


 I noticed that information retrieval problems fall into a sort-of
 layered pyramid. At the topmost point is someone like Google, where there
 is so much high-quality user behavior data that search truly is a machine
 learning problem, much as you propose. As you move down the pyramid, the
 quality of user data diminishes.

 Eventually you get to a very thick layer of middle-class search
 applications that value relevance but have very modest amounts of user
 data, or none. For most of them, even if they tracked their searches over
 a year, they *might* get good data on their top 50 searches. (I know
 because they send me the spreadsheet and say "fix it!") The best use they
 can make of analytics data is after-action troubleshooting. Actual user
 emails complaining about the search can be more useful than behavior data!






Re: Where Search Meets Machine Learning

2015-05-04 Thread J. Delgado
I totally agree that it depends on the task at hand and the amount/quality
of the data that you can get hold of.

The problem of relevance in traditional document/semantic information
retrieval (IR) tasks is such a hard one because in most cases there is
little or no source of truth you could use as training data (unless you use
something like TREC for a limited set of documents to evaluate).
Additionally, the feedback data you get from users, if it exists, is very
noisy. In this case prior knowledge, encoded as attribute weights, crafted
functions, and heuristics, is your best bet. You can however mine the
content itself by leveraging clustering/topic modeling via LDA, which is an
unsupervised learning algorithm, and use that as input. Or perhaps
Labeled-LDA and Multi-Grain LDA (topic models for classification and
sentiment analysis), which are supervised algorithms, in which case you can
still use the approach I suggested.
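
For example, a bare-bones version of the unsupervised route (toy corpus and
parameters, using scikit-learn simply as a stand-in for whatever library you
actually use) might look like this:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "query ranking relevance score search",
    "neural network training gradient loss",
    "click log query ranking evaluation",
]
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # per-document topic proportions

# doc_topics rows can be indexed as extra fields, or fed as features to a
# downstream (supervised) scorer, precisely because no labels were needed here.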

However, for search tasks that involve e-commerce, advertisements,
recommendations, etc., there is typically more data that can be captured
from users' interactions with the system/site and used as signals, and
users' actions (adding things to wish lists, clicks for more info,
conversions, etc.) are much more telling about the intention/value the user
assigns to what is presented to them. Then viewing search as a machine
learning/multi-objective optimization problem makes sense.
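
One simple way to read "multi-objective" here, sketched with invented
numbers and hand-set weights, is a scalarized combination of a
text-relevance objective and behavioral objectives:

import numpy as np

# Hypothetical per-document objectives for one query (three candidate docs).
text_score = np.array([2.1, 1.4, 0.9])        # e.g. BM25 / TF*IDF relevance
ctr = np.array([0.02, 0.08, 0.05])            # click-through rate from logs
conversion = np.array([0.001, 0.010, 0.002])  # conversion rate from logs

# Weighted scalarization of the objectives; the weights are a business/tuning
# choice (and could themselves be learned from logged outcomes).
weights = {"text": 1.0, "ctr": 10.0, "conversion": 200.0}
score = (weights["text"] * text_score
         + weights["ctr"] * ctr
         + weights["conversion"] * conversion)
ranking = np.argsort(-score)  # indices of the candidate docs, best first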

My point is that search engines nowadays are used for all these use cases,
so it is worth exploring all the avenues exposed in this thread.

Cheers,

-- Joaquin

On Mon, May 4, 2015 at 2:31 PM, Tom Burton-West tburt...@umich.edu wrote:

 Hi Doug and Joaquin,

 This is a really interesting discussion.  Joaquin, I'm looking forward to
 taking your code for a test drive.  Thank you for making it publicly
 available.

 Doug, I'm interested in your pyramid observation. I work with academic
 search, which has some of the problems of unique queries/information needs
 and of data sparsity that you mention in your blog post.

 This article makes a similar argument that massive amounts of user data
 are so important for modern search engines that it is essentially a barrier
 to entry for new web search engines.
 Usage Data in Web Search: Benefits and Limitations. Ricardo Baeza-Yates and
 Yoelle Maarek.  In Proceedings of SSDBM'2012, Chania, Crete, June 2012.
 http://www.springerlink.com/index/58255K40151U036N.pdf

  Tom


 I noticed that information retrieval problems fall into a sort-of layered
 pyramid. At the topmost point is someone like Google, where there is so
 much high-quality user behavior data that search truly is a machine
 learning problem, much as you propose. As you move down the pyramid, the
 quality of user data diminishes.

 Eventually you get to a very thick layer of middle-class search
 applications that value relevance but have very modest amounts of user
 data, or none. For most of them, even if they tracked their searches over
 a year, they *might* get good data on their top 50 searches. (I know
 because they send me the spreadsheet and say "fix it!") The best use they
 can make of analytics data is after-action troubleshooting. Actual user
 emails complaining about the search can be more useful than behavior data!





Re: Where Search Meets Machine Learning

2015-05-04 Thread J. Delgado
BTW, as I mentioned, the machine learning

On Monday, May 4, 2015, J. Delgado joaquin.delg...@gmail.com wrote:

 I totally agree that it depends on the task at hand and the
 amount/quality of the data that you can get hold of.

 The problem of relevance in traditional document/semantic information
 retrieval (IR) tasks is such a hard one because in most cases there is
 little or no source of truth you could use as training data (unless you use
 something like TREC for a limited set of documents to evaluate).
 Additionally, the feedback data you get from users, if it exists, is very
 noisy. In this case prior knowledge, encoded as attribute weights, crafted
 functions, and heuristics, is your best bet. You can however mine the
 content itself by leveraging clustering/topic modeling via LDA, which is an
 unsupervised learning algorithm, and use that as input. Or perhaps
 Labeled-LDA and Multi-Grain LDA (topic models for classification and
 sentiment analysis), which are supervised algorithms, in which case you can
 still use the approach I suggested.

 However, for search tasks that involve e-commerce, advertisements,
 recommendations, etc., there is typically more data that can be captured
 from users' interactions with the system/site and used as signals, and
 users' actions (adding things to wish lists, clicks for more info,
 conversions, etc.) are much more telling about the intention/value the
 user assigns to what is presented to them. Then viewing search as a
 machine learning/multi-objective optimization problem makes sense.

 My point is that search engines nowadays are used for all these use
 cases, so it is worth exploring all the avenues exposed in this thread.

 Cheers,

 -- Joaquin

 On Mon, May 4, 2015 at 2:31 PM, Tom Burton-West tburt...@umich.edu wrote:

 Hi Doug and Joaquin,

 This is a really interesting discussion.  Joaquin, I'm looking forward to
 taking your code for a test drive.  Thank you for making it publicly
 available.

 Doug, I'm interested in your pyramid observation. I work with academic
 search, which has some of the problems of unique queries/information needs
 and of data sparsity that you mention in your blog post.

 This article makes a similar argument that massive amounts of user data
 are so important for modern search engines that it is essentially a barrier
 to entry for new web search engines.
 Usage Data in Web Search: Benefits and Limitations. Ricardo Baeza-Yates and
 Yoelle Maarek.  In Proceedings of SSDBM'2012, Chania, Crete, June 2012.
 http://www.springerlink.com/index/58255K40151U036N.pdf

  Tom


 I noticed that information retrieval problems fall into a sort-of
 layered pyramid. At the topmost point is someone like Google, where there
 is so much high-quality user behavior data that search truly is a machine
 learning problem, much as you propose. As you move down the pyramid, the
 quality of user data diminishes.

 Eventually you get to a very thick layer of middle-class search
 applications that value relevance but have very modest amounts of user
 data, or none. For most of them, even if they tracked their searches over
 a year, they *might* get good data on their top 50 searches. (I know
 because they send me the spreadsheet and say "fix it!") The best use they
 can make of analytics data is after-action troubleshooting. Actual user
 emails complaining about the search can be more useful than behavior data!






Re: Where Search Meets Machine Learning

2015-05-04 Thread Tom Burton-West
Hi Doug and Joaquin,

This is a really interesting discussion.  Joaquin, I'm looking forward to
taking your code for a test drive.  Thank you for making it publicly
available.

Doug, I'm interested in your pyramid observation. I work with academic
search, which has some of the problems of unique queries/information needs
and of data sparsity that you mention in your blog post.

This article makes a similar argument that massive amounts of user data are
so important for modern search engines that it is essentially a barrier to
entry for new web search engines.
Usage Data in Web Search: Benefits and Limitations. Ricardo Baeza-Yates and
Yoelle Maarek.  In Proceedings of SSDBM'2012, Chania, Crete, June 2012.
http://www.springerlink.com/index/58255K40151U036N.pdf

 Tom


 I noticed that information retrieval problems fall into a sort-of layered
 pyramid. At the topmost point is someone like Google, where there is so
 much high-quality user behavior data that search truly is a machine
 learning problem, much as you propose. As you move down the pyramid, the
 quality of user data diminishes.

 Eventually you get to a very thick layer of middle-class search
 applications that value relevance but have very modest amounts of user
 data, or none. For most of them, even if they tracked their searches over
 a year, they *might* get good data on their top 50 searches. (I know
 because they send me the spreadsheet and say "fix it!") The best use they
 can make of analytics data is after-action troubleshooting. Actual user
 emails complaining about the search can be more useful than behavior data!





Re: Where Search Meets Machine Learning

2015-05-02 Thread J. Delgado
Doug,

Thanks for your insights. We actually started with trying to build off of
features and boosting weights combined with built-in relevance scoring
http://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html.
We also played around with replacing and/or combining the default score
with other computations using the function_score
http://www.elastic.co/guide/en/elasticsearch/guide/current/function-score-query.html
query, but as you mentioned in your article, the crux of the problem is *how
to figure out the weights that control each feature's influence*:

*Once important features are placed in the search engine the final problem
becomes balancing and regulating their influence. Should text-based factors
matter more than sales based factors? Should exact text matches matter more
than synonym-based matches? What about metadata we glean from machine
learning – how much weight should this play*?

Furthermore, this only covers cases where the scoring can be represented as
a function of such weights! We felt that this approach was shortsighted, as
some of the problems we are dealing with (e.g. product recommendations,
response prediction, real-time bidding for advertising, etc.) have a very
large feature space, sometimes requiring *dimensionality reduction* (e.g.
Matrix Factorization techniques) or learning from past actions/feedback
(e.g. clickthrough data, bidding win rates, remaining budget, etc.). All
this seemed well suited for (supervised) machine learning tasks such as
prediction based on past training data (classification or regression).
These algorithms usually have an offline model-building phase and an online
evaluator phase that uses the created model to perform the
prediction/scoring during query evaluation. Additionally, some of the best
algorithms in machine learning (Random Forest, Support Vector Machines,
Deep Learning/Neural Networks, etc.) are not linear combinations of
feature weights, and require additional data structures (e.g. trees,
support vectors) to support the computation.
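
As a toy sketch of that offline/online split (invented features and labels,
with scikit-learn standing in for whatever library actually does the
training):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# --- Offline phase: build a (non-linear) model from past feedback.
# Feature columns might be [text_score, price, recency_days]; values invented.
X_train = np.array([[2.3, 19.0, 4], [0.8, 99.0, 40], [1.9, 25.0, 2], [0.4, 5.0, 300]])
y_train = np.array([1, 0, 1, 0])  # clicked / converted or not
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# --- Online phase: at query time, score each retrieved candidate with the model.
candidates = np.array([[2.0, 30.0, 10], [1.1, 15.0, 90]])
scores = model.predict_proba(candidates)[:, 1]  # P(positive) used as ranking score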

Since there is no one-size-fits-all predictive algorithm, we architected the
solution so any algorithm that implements our interface can be used. We
tried this out with algorithms available in Weka
http://www.cs.waikato.ac.nz/ml/weka/ and Spark MLlib
https://spark.apache.org/docs/1.2.1/mllib-guide.html (only linear models
for now) and it worked! In any case, nothing prevents us from leveraging the
text-based analysis of features and the default scoring available within
the plugin, which can be combined with the results of the prediction.
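
The interface itself could be as small as the following sketch; this is an
illustration of the idea, not the plugin's actual API, and the class names
are made up:

from abc import ABC, abstractmethod
from typing import Sequence

class Predictor(ABC):
    """Contract any scoring backend (a Weka wrapper, an MLlib export, ...) could meet."""

    @abstractmethod
    def predict(self, features: Sequence[float]) -> float:
        """Return a score/prediction for one document's feature vector."""

class LinearPredictor(Predictor):
    """Example backend: a plain linear model, e.g. weights exported offline."""

    def __init__(self, weights: Sequence[float], bias: float = 0.0):
        self.weights, self.bias = list(weights), bias

    def predict(self, features: Sequence[float]) -> float:
        return self.bias + sum(w * f for w, f in zip(self.weights, features))

# Query-time code only ever calls predictor.predict(features), so swapping in
# a tree ensemble or neural backend does not change the search side at all.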

To demonstrate its general utility we tested this with datasets available
at the UCI Machine Learning Repository http://archive.ics.uci.edu/ml/, but
I have been using this approach for real-life response prediction/bidding
problems in advertising and it's very powerful. Of course, this is not a
panacea, as there are still some issues with the approach, especially on the
operational side. Let's keep the conversation going, as I think we are on
to something useful.

-- Joaquin


On Thu, Apr 30, 2015 at 6:26 AM, Doug Turnbull 
dturnb...@opensourceconnections.com wrote:

 Hi Joaquin

 Very neat, thanks for sharing,

 Viewing search relevance as something akin to a classification problem is
 actually a driving narrative in Taming Search
 http://manning.com/turnbull. We generalize the relevance problem as one
 of measuring the similarity between features of content (locations of
 restaurants, price of a product, the words in the body of articles,
 expanded synonyms in articles, etc) and features of a query (the search
 terms, user usage history, any location, etc). What makes search
 interesting is that unlike other classification systems, search has built
 in similarity systems (largely TF*IDF).

 So we actually cut the other direction from your talk. It appears that you
 amend the search engine to change the underlying scoring to be based on
 machine learning constructs. In our book, we work the opposite way. We
 largely enable feature similarity classifications between document and
 query by massaging features into terms and using the built-in TF*IDF or
 another relevant similarity approach.

 We feel this plays to the advantages of a search engine. Search engines
 already have some basic text analysis built in. They've also been heavily
 optimized for most forms of text-based similarity. If you can massage text
 such that your TF*IDF similarity reflects a rough proportion of text-based
 features important to your users, this tends to reflect their intuitive
 notions of relevance. A lot of this work involves feature selection, or
 what we term in the book feature modeling: what features should you
 introduce to your documents that can be used to generate good signals at
 ranking time?

 You can read more about our thoughts here
 http://java.dzone.com/articles/solr-and-elasticsearch.

 That all being said, what makes your stuff interesting is when you have
 enough 

Re: Where Search Meets Machine Learning

2015-04-30 Thread Doug Turnbull
Hi Joaquin

Very neat, thanks for sharing,

Viewing search relevance as something akin to a classification problem is
actually a driving narrative in Taming Search http://manning.com/turnbull.
We generalize the relevance problem as one of measuring the similarity
between features of content (locations of restaurants, price of a product,
the words in the body of articles, expanded synonyms in articles, etc.) and
features of a query (the search terms, user usage history, any location,
etc.). What makes search interesting is that, unlike other classification
systems, search has built-in similarity systems (largely TF*IDF).

So we actually cut the other direction from your talk. It appears that you
amend the search engine to change the underlying scoring to be based on
machine learning constructs. In our book, we work the opposite way: we
largely enable feature similarity classifications between document and
query by massaging features into terms and using the built-in TF*IDF or
another relevant similarity approach.

We feel this plays to the advantages of a search engine. Search engines
already have some basic text analysis built in. They've also been heavily
optimized for most forms of text-based similarity. If you can massage text
such that your TF*IDF similarity reflects a rough proportion of text-based
features important to your users, this tends to reflect their intuitive
notions of relevance. A lot of this work involves feature selection, or
what we term in the book feature modeling: what features should you
introduce to your documents that can be used to generate good signals at
ranking time?
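
A tiny sketch of what "massaging features into terms" can mean in practice
(the bucket boundaries and field names here are invented for illustration):

def price_feature_terms(price: float) -> list[str]:
    """Map a numeric price onto coarse tokens an inverted index can match on.
    Bucket boundaries are arbitrary here; in practice they come from the domain."""
    if price < 20:
        return ["price_budget"]
    if price < 100:
        return ["price_mid"]
    return ["price_premium"]

doc = {"title": "wireless headphones", "price": 59.0}
doc["feature_terms"] = price_feature_terms(doc["price"])  # indexed like ordinary text

# A query can now match or boost on "price_mid" the same way it matches words,
# letting the engine's built-in term scoring carry the non-text feature.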

You can read more about our thoughts here
http://java.dzone.com/articles/solr-and-elasticsearch.

That all being said, what makes your stuff interesting is when you have
enough supervised training data over good-enough features. This can be hard
to do for a broad swath of middle-tier search applications, but it becomes
increasingly useful as scale goes up. I'd be interested to hear your
thoughts on this article
http://opensourceconnections.com/blog/2014/10/08/when-click-scoring-can-hurt-search-relevance-a-roadmap-to-better-signals-processing-in-search/
which I wrote about collecting click tracking and other relevance feedback
data.
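
For context, the raw aggregation that click tracking usually starts from
looks something like the sketch below (toy log, not anything from that
article, and with the kind of position-bias caveat its title hints at):

from collections import defaultdict

# Toy click log entries: (query, doc_id, position, clicked)
click_log = [
    ("boots", "d1", 1, True),
    ("boots", "d2", 2, False),
    ("boots", "d1", 1, False),
    ("boots", "d2", 2, True),
]

stats = defaultdict(lambda: [0, 0])  # (query, doc_id) -> [impressions, clicks]
for query, doc_id, position, clicked in click_log:
    stats[(query, doc_id)][0] += 1
    stats[(query, doc_id)][1] += int(clicked)

ctr = {key: clicks / imps for key, (imps, clicks) in stats.items()}

# Caveat: raw CTR like this is biased toward whatever already ranked high, so
# it needs position-bias correction (or interleaving tests) before being
# trusted as a relevance signal.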

Good stuff! Again, thanks for sharing,
-Doug



On Wed, Apr 29, 2015 at 6:58 PM, J. Delgado joaquin.delg...@gmail.com
wrote:

 Here is a presentation on the topic:

 http://www.slideshare.net/joaquindelgado1/where-search-meets-machine-learning04252015final

 Search can be viewed as a combination of: a) a constraint satisfaction
 problem, which is the process of finding a solution to a set of constraints
 (the query) that impose conditions the variables (fields) must satisfy,
 with a resulting object (document) being a solution in the feasible region
 (the result set); plus b) a scoring/ranking problem of assigning values to
 the different alternatives according to some convenient scale. This
 ultimately provides a mechanism to sort the various alternatives in the
 result set in order of importance, value or preference. In particular,
 scoring in search has evolved from being a document-centric calculation
 (e.g. TF-IDF), proper to its information retrieval roots, to a function
 that is more context-sensitive (e.g. including geo-distance ranking) or
 user-centric (e.g. taking user parameters for personalization), as well as
 incorporating other factors that depend on the domain and task at hand.
 However, most systems that incorporate machine learning techniques to
 perform classification or generate scores for these specialized tasks do so
 as a post-retrieval re-ranking function, outside of search! In this talk I
 show ways of incorporating advanced scoring functions, based on supervised
 learning and bid-scaling models, into popular search engines such as
 Elasticsearch and potentially Solr. I'll provide practical examples of how
 to construct such ML Scoring plugins in search, to generalize the
 application of a search engine as a model evaluator for supervised learning
 tasks. This will facilitate the building of systems that can do
 computational advertising, recommendations and specialized search,
 applicable to many domains.

 Code to support it (only Elasticsearch for now):
 https://github.com/sdhu/elasticsearch-prediction

 -- J







-- 
Doug Turnbull | Search Relevance Consultant | OpenSource Connections,
LLC | 240.476.9983 | http://www.opensourceconnections.com
Author: Taming Search http://manning.com/turnbull from Manning
Publications


Where Search Meets Machine Learning

2015-04-29 Thread J. Delgado
Here is a presentation on the topic:
http://www.slideshare.net/joaquindelgado1/where-search-meets-machine-learning04252015final

Search can be viewed as a combination of: a) a constraint satisfaction
problem, which is the process of finding a solution to a set of constraints
(the query) that impose conditions the variables (fields) must satisfy, with
a resulting object (document) being a solution in the feasible region (the
result set); plus b) a scoring/ranking problem of assigning values to the
different alternatives according to some convenient scale. This ultimately
provides a mechanism to sort the various alternatives in the result set in
order of importance, value or preference. In particular, scoring in search
has evolved from being a document-centric calculation (e.g. TF-IDF), proper
to its information retrieval roots, to a function that is more
context-sensitive (e.g. including geo-distance ranking) or user-centric
(e.g. taking user parameters for personalization), as well as incorporating
other factors that depend on the domain and task at hand. However, most
systems that incorporate machine learning techniques to perform
classification or generate scores for these specialized tasks do so as a
post-retrieval re-ranking function, outside of search! In this talk I show
ways of incorporating advanced scoring functions, based on supervised
learning and bid-scaling models, into popular search engines such as
Elasticsearch and potentially Solr. I'll provide practical examples of how
to construct such ML Scoring plugins in search, to generalize the
application of a search engine as a model evaluator for supervised learning
tasks. This will facilitate the building of systems that can do
computational advertising, recommendations and specialized search,
applicable to many domains.
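
As a minimal illustration of that a) + b) decomposition (plain Python
stand-ins, not the linked plugin's code): filter by hard constraints, then
order the feasible set with any scoring function, learned or not.

def search(docs, constraints, score_fn):
    """a) constraint satisfaction: keep documents meeting the query's hard
    conditions; b) scoring/ranking: order the feasible set by score_fn,
    which could be a learned model's prediction. Purely illustrative."""
    feasible = [d for d in docs
                if all(d.get(field) == value for field, value in constraints.items())]
    return sorted(feasible, key=score_fn, reverse=True)

docs = [
    {"id": 1, "in_stock": True, "predicted_ctr": 0.04},
    {"id": 2, "in_stock": False, "predicted_ctr": 0.09},
    {"id": 3, "in_stock": True, "predicted_ctr": 0.07},
]
results = search(docs, {"in_stock": True}, score_fn=lambda d: d["predicted_ctr"])
# -> docs 3 then 1; a supervised model's output would simply replace the lambda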

Code to support it (only Elasticsearch for now):
https://github.com/sdhu/elasticsearch-prediction

-- J