Re: Log-likelihood based correlation test?

Pat Ferrel Thu, 23 Nov 2017 09:39:36 -0800

Use the default. Tuning with a threshold is only for atypical data, and unless 
you have a harness for cross-validation you would not know whether you were 
making things worse or better. We have our own tools for this but have never 
had the need for threshold tuning.


Yes, llrDownsampled(PtP) is the “model”. Each doc put into Elasticsearch is a 
sparse representation of a row from it, along with the matching rows from PtV, 
PtC,… Each gets a “field” in the doc.
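
As a rough illustration of that document layout, here is a minimal Python 
sketch. The item IDs and the field names ("purchase", "view", "category-pref") 
are made up for the example and are not taken from the UR code; the point is 
only that each item becomes one doc, with one field per indicator holding the 
IDs that survived the LLR test.

```python
# Hypothetical sparse model rows: item ID -> correlated IDs that passed LLR.
ptp = {"ipad": ["iphone", "galaxy"], "iphone": ["ipad"]}
ptv = {"ipad": ["nexus", "iphone"]}
ptc = {"ipad": ["tablets"], "iphone": ["phones"]}


def build_docs(indicators):
    """Merge per-indicator rows into one doc per item, one field per indicator."""
    docs = {}
    for field, rows in indicators.items():
        for item_id, correlators in rows.items():
            doc = docs.setdefault(item_id, {"id": item_id})
            doc[field] = list(correlators)  # IDs only, no LLR scores
    return docs


docs = build_docs({"purchase": ptp, "view": ptv, "category-pref": ptc})
```

An item with no surviving correlators for some indicator simply has no such 
field in its doc, which is what keeps the representation sparse.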


On Nov 22, 2017, at 6:16 AM, Noelia Osés Fernández <[email protected]> wrote:

Thanks Pat!

How can I tune the threshold?

And when you say "compare to each item in the model", do you mean each row in 
PtP?

On 21 November 2017 at 19:56, Pat Ferrel <[email protected]> wrote:
No, PtP non-zero elements have LLR calculated. The highest scores in the row 
are kept, or the ones above some threshold; the rest are removed as “noise”. 
These are put into the Elasticsearch model without scores. 

Elasticsearch compares the similarity of the user history to each item in the 
model to find the K nearest neighbors (KNN). This uses Okapi BM25 from Lucene, 
which has several benefits over pure cosine similarity (it actually consists 
of adjustments to cosine) and we also use norms. With ES 5 we should see 
quality improvements due to this. 
https://www.elastic.co/guide/en/elasticsearch/guide/master/pluggable-similarites.html



On Nov 21, 2017, at 1:28 AM, Noelia Osés Fernández <[email protected]> wrote:

Pat,

If I understood your explanation correctly, you say that some elements of PtP 
are removed by the LLR (set to zero, to be precise). But the elements that 
survive are calculated by matrix multiplication. The final PtP is put into 
Elasticsearch and, when we query for user recommendations, ES uses KNN to find 
the items (the rows in PtP) that are most similar to the user's history.

If the non-zero elements of PtP have been calculated by straight matrix 
multiplication, and I'm assuming that the P matrix only has 0s and 1s to 
indicate which items have been purchased by which user, then the elements of 
PtP are either 0 or greater than or equal to 1. However, the scores I get are 
below 1.

So is the KNN using cosine similarity as a metric to calculate the closest 
neighbours? And is the result of this cosine similarity metric what is 
returned as a 'score'?

If it is, when the score is greater than 1, is this because the different 
cosine similarities are added together, i.e. PtP, PtL...?

Thank you for all your valuable help!

On 17 November 2017 at 19:52, Pat Ferrel <[email protected]> wrote:
Mahout builds the model by doing matrix multiplication (PtP) then calculating 
the LLR score for every non-zero value. We then keep the top K or use a 
threshold to decide whether to keep a value or not (both are supported in the 
UR). LLR is a metric for seeing how likely it is that 2 events in a large 
group are correlated. Therefore LLR is only used to remove weak data from the 
model.

So Mahout builds the model, then it is put into Elasticsearch, which is used 
as a KNN (K-nearest neighbors) engine. The LLR score is not put into the 
model, only an indicator that the item survived the LLR test.
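
The filtering step described above can be sketched in a few lines of Python. 
This is only an illustration under stated assumptions, not the Mahout code: 
the function name and the dict-based row are hypothetical, and the two 
parameters loosely mirror the maxCorrelatorsPerItem and minLLR settings 
mentioned elsewhere in this thread.

```python
def keep_strongest(llr_row, max_correlators=50, min_llr=None):
    """Return the IDs of correlators that survive the filter, strongest first.

    llr_row maps a correlated item ID to its LLR score (hypothetical layout).
    The scores are used for ranking and thresholding only and are then dropped.
    """
    candidates = [
        (item, score)
        for item, score in llr_row.items()
        if min_llr is None or score >= min_llr
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [item for item, _ in candidates[:max_correlators]]


row = {"iphone": 12.3, "galaxy": 5.0, "toaster": 0.1}
top_two = keep_strongest(row, max_correlators=2)
above_threshold = keep_strongest(row, min_llr=1.0)
```

Both calls return only item IDs, which matches the point that the model 
stores an indicator of survival rather than the score itself.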

The KNN is applied using the user’s history as the query and finding the items 
that most closely match it. Since PtP will have items in rows and each row 
will have correlating items, this “search” method works quite well to find 
items that were purchased along with items very similar to those in the user’s 
history.
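
To make the “history as query” idea concrete, here is a toy Python sketch. It 
deliberately swaps in a plain overlap count for the Lucene scoring that 
Elasticsearch actually applies, so the numbers it produces are not comparable 
to real UR scores. The model contents and function name are hypothetical.

```python
def recommend(model, history, k=3):
    """Rank items by how many of their correlators appear in the user's history.

    model maps an item ID to the list of item IDs that survived the LLR test.
    Overlap counting is a simplified stand-in for the search engine's scoring.
    """
    history = set(history)
    scored = []
    for item, correlators in model.items():
        overlap = len(history & set(correlators))
        if overlap > 0:
            scored.append((item, overlap))
    # Highest overlap first; ties broken by item ID for a stable order.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in scored[:k]]


model = {
    "ipad": ["iphone", "galaxy"],
    "nexus": ["galaxy"],
    "kindle": ["paperback"],
}
recs = recommend(model, history=["iphone", "galaxy"])
```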

=============== that is the simple explanation ===============

Item-based recs take the model items (correlated items by the LLR test) as the 
query and the results are the most similar items—the items with most similar 
correlating items.

The model is items in rows and items in columns if you are only using one 
event: PtP. If you think it through, it has all purchased items as the row 
keys and, in each row, the other items purchased along with the row key. LLR 
filters out the weakly correlating non-zero values (0 means no evidence of 
correlation anyway). If we didn’t do this it would be purely a “Cooccurrence” 
recommender, one of the first useful ones. But filtering based on cooccurrence 
strength (PtP values without LLR applied to them) produces much worse results 
than using LLR to filter for the most highly correlated cooccurrences. You get 
a similar effect with Matrix Factorization but you can only use one type of 
event for various reasons.
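
For readers who want to see what the raw PtP counts look like before any LLR 
filtering, here is a small pure-Python sketch. It assumes P is a binary 
user-by-item matrix given as a list of rows; the data is invented and the 
function is not how Mahout computes it at scale.

```python
def cooccurrence(p):
    """Compute PtP for a binary user-by-item matrix given as a list of rows."""
    n_items = len(p[0])
    ptp = [[0] * n_items for _ in range(n_items)]
    for user_row in p:
        bought = [i for i, value in enumerate(user_row) if value]
        # Every pair of items bought by the same user cooccurs once.
        for a in bought:
            for b in bought:
                ptp[a][b] += 1
    return ptp


# Three users, three items (columns): made-up purchase data.
P = [
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 1],
]
PtP = cooccurrence(P)
```

Each off-diagonal entry counts the users who bought both items, and each 
diagonal entry is simply the number of users who bought that item. These raw 
counts are the “cooccurrence strength” that the paragraph above says is a 
poor filter on its own.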

Since LLR is a probabilistic metric that only looks at counts, it can be 
applied equally well to PtV (purchase, view), PtS (purchase, search terms), PtC 
(purchase, category-preferences). We did an experiment using Mean Average 
Precision for the UR, comparing video “Likes” alone against “Likes” plus 
“Dislikes” (so LtL vs. LtL and LtD) scraped from rottentomatoes.com reviews, 
and got a 20% lift in the MAP@k score by including data for “Dislikes”. 
https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/
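
For reference, MAP@k can be computed along the following lines. This is a 
generic Python sketch, not the evaluation tool used for the experiment above. 
Several normalization conventions exist; this one divides by min(number of 
relevant items, k), which is an assumption on my part.

```python
def average_precision_at_k(recommended, relevant, k):
    """Average precision over the top-k recommendations for one user."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / min(len(relevant), k)


def map_at_k(all_recommended, all_relevant, k):
    """Mean of the per-user average precision values."""
    pairs = list(zip(all_recommended, all_relevant))
    return sum(average_precision_at_k(r, rel, k) for r, rel in pairs) / len(pairs)


# Two made-up users: ranked recommendations and held-out relevant items.
score = map_at_k(
    [["a", "b", "c"], ["x", "y"]],
    [["a", "c"], ["z"]],
    k=3,
)
```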

So the benefit and use of LLR is to filter weak data from the model and allow 
us to see if dislikes, and other events, correlate with likes. Adding this type 
of data, which is usually thrown away, is one of the most powerful reasons to 
use the algorithm. BTW the algorithm is called Correlated Cross-Occurrence 
(CCO).

The benefit of using Lucene (at the heart of Elasticsearch) to do the KNN query 
is that it is fast, taking the user’s realtime events into the query, but also 
that it is trivial to add all sorts of business rules, like give me recs 
based on user events but only ones from a certain category, or give me recs but 
only ones tagged as “in-stock”. In fact the business rules can have inclusion 
rules, exclusion rules, and be mixed with ANDs and ORs.
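
One way such a query could be expressed is as an Elasticsearch bool query. 
The Python sketch below only builds a request body as a dict and does not 
call Elasticsearch. The field names and values are hypothetical, and this is 
not the query the UR actually generates. The user's history goes into scored 
"should" clauses, while inclusion and exclusion rules go into "filter" and 
"must_not".

```python
def build_query(history_by_field, must_have=None, must_not_have=None, size=10):
    """Build a bool-query body: history as scored clauses, rules as filters."""
    should = [
        {"term": {field: item_id}}
        for field, ids in history_by_field.items()
        for item_id in ids
    ]
    bool_query = {"should": should, "minimum_should_match": 1}
    if must_have:
        # Inclusion rules: the doc must match, without affecting the score.
        bool_query["filter"] = [
            {"terms": {field: values}} for field, values in must_have.items()
        ]
    if must_not_have:
        # Exclusion rules: matching docs are dropped.
        bool_query["must_not"] = [
            {"terms": {field: values}} for field, values in must_not_have.items()
        ]
    return {"size": size, "query": {"bool": bool_query}}


body = build_query(
    {"purchase": ["iphone", "galaxy"], "view": ["nexus"]},
    must_have={"category": ["tablets"]},
    must_not_have={"availability": ["out-of-stock"]},
)
```

Setting minimum_should_match to 1 is a design choice here: without it, a bool 
query that has filter clauses may return docs that match none of the history 
terms.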

BTW there is a version ready for testing with PIO 0.12.0 and ES5 here: 
https://github.com/actionml/universal-recommender/tree/0.7.0-SNAPSHOT 
Instructions are in the readme; notice it is in the 0.7.0-SNAPSHOT branch.


On Nov 17, 2017, at 7:59 AM, Andrew Troemner <[email protected]> wrote:

I'll echo Dan here. He and I went through the raw Mahout libraries called by 
the Universal Recommender, and while Noelia's description is accurate for an 
intermediate step, the indexing via ElasticSearch generates some separate 
relevancy scores based on their Lucene indexing scheme. The raw LLR scores are 
used in building this process, but the final scores served up by the API's 
should be post-processed, and cannot be used to reconstruct the raw LLR's (to 
my understanding).

There are also some additional steps including down-sampling, which scrubs out 
very rare combinations (which otherwise would have very high LLR's for a single 
observation), which partially corrects for the statistical problem of multiple 
detection. But the underlying logic is per Ted Dunning's research and 
summarized by Noelia, and is a solid way to approach interaction effects for 
tens of thousands of items and including secondary indicators (like 
demographics, or implicit preferences).

ANDREW TROEMNER
Associate Principal Data Scientist | salesforce.com <http://salesforce.com/>
Office: 317.832.4404
Mobile: 317.531.0216




On Fri, Nov 17, 2017 at 9:55 AM, Daniel Gabrieli <[email protected]> wrote:
Maybe someone can correct me if I am wrong but in the code I believe 
Elasticsearch is used instead of "resulting LLR is what goes into the AB 
element in matrix PtP or PtL."

By default the strongest 50 LLR scores get set as searchable values in 
Elasticsearch per item-event pair.

You can configure the thresholds for significance using the configuration 
parameters: maxCorrelatorsPerItem or minLLR.  And this configuration is 
important because at the default of 50 you may end up treating all "indicator 
values" as significant.  More info here: http://actionml.com/docs/ur_config



On Fri, Nov 17, 2017 at 4:50 AM Noelia Osés Fernández <[email protected]> wrote:

Let's see if I've understood how LLR is used in UR. Let P be the matrix for the 
primary conversion indicator (say purchases) and Pt its transpose. 

Then, with a second matrix, which can be P again to make PtP or a matrix for a 
secondary indicator (say L for likes) to make PtL, we take a row from Pt (item 
A) and a column from the second matrix (either P or L, in this example) (item 
B) and we calculate the table that Ted Dunning explains on his webpage: the 
number of cooccurrences in which items A AND B have been purchased (or 
purchased AND liked), the number of times that item A OR B has been purchased 
(or purchased OR liked), and the number of times that neither item A nor B has 
been purchased (or purchased or liked). With these counts we calculate LLR 
following the formulas that Ted Dunning provides and the resulting LLR is what 
goes into the AB element in matrix PtP or PtL. Correct?   
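
For concreteness, the LLR computation can be sketched in Python as follows. 
It uses the entropy form of the test and the 2x2 table from Ted Dunning's 
post, whose four cells are: both A and B (k11), A without B (k12), B without 
A (k21), and neither (k22). The function names and the way the table is 
derived from counts are my own choices for the example, so treat this as a 
sketch rather than the Mahout implementation.

```python
import math


def x_log_x(x):
    """x * ln(x), with the usual convention that 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def entropy(*counts):
    """Unnormalized Shannon entropy of a set of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio score for a 2x2 contingency table."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    column_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * (row_entropy + column_entropy - matrix_entropy))


def contingency(cooccur, count_a, count_b, total):
    """Derive the table from a cooccurrence count and the marginal counts."""
    k11 = cooccur                               # A and B together
    k12 = count_a - cooccur                     # A without B
    k21 = count_b - cooccur                     # B without A
    k22 = total - count_a - count_b + cooccur   # neither
    return k11, k12, k21, k22


# Made-up example: 1000 users, A bought by 100, B by 50, both by 40.
score = llr(*contingency(40, 100, 50, 1000))
```

A table whose rows are proportional (no association) scores 0, and the score 
grows as the cooccurrence count departs from what independence would predict.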

Thank you!

On 16 November 2017 at 17:03, Noelia Osés Fernández <[email protected]> wrote:
Wonderful! Thanks Daniel!

Suneel, I'm still new to the Apache ecosystem and so I know that Mahout is used 
but only vaguely... I still don't know the different parts well enough to have 
a good understanding of what each of them does (Spark, MLLib, PIO, Mahout,...)

Thank you both!

On 16 November 2017 at 16:59, Suneel Marthi <[email protected]> wrote:
Indeed so. Ted Dunning is an Apache Mahout PMC member and committer and the 
whole idea of Search-based Recommenders stems from his work and insights.  If 
u didn't know, the PIO UR uses Apache Mahout under the hood and hence u see 
the LLR.

On Thu, Nov 16, 2017 at 3:49 PM, Daniel Gabrieli <[email protected]> wrote:
I am pretty sure the LLR stuff in UR is based off of this blog post and 
associated paper:

http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html

Accurate Methods for the Statistics of Surprise and Coincidence
by Ted Dunning

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962


On Thu, Nov 16, 2017 at 10:26 AM Noelia Osés Fernández <[email protected]> wrote:
Hi,

I've been trying to understand how the UR algorithm works and I think I have a 
general idea. But I would like to have a mathematical description of the step 
in which the LLR comes into play. In the CCO presentations I have found it says:

(PtP) compares column to column using log-likelihood based correlation test


I have searched for "log-likelihood based correlation test" on Google but no 
joy. All I get are explanations of the likelihood-ratio test to compare two 
models. 

I would very much appreciate a math explanation of log-likelihood based 
correlation test. Any pointers to papers or any other literature that explains 
this specifically are much appreciated.

Best regards,
Noelia












-- 
You received this message because you are subscribed to the Google Groups 
"actionml-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected] 
<mailto:[email protected]>.
To post to this group, send email to [email protected] 
<mailto:[email protected]>.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/actionml-user/CAA2BRS%2Boj%2BNYDmsNNd2mYM1ZC5CgWwC71W3%3DEhrO9qeOiKyWXA%40mail.gmail.com
 
<https://groups.google.com/d/msgid/actionml-user/CAA2BRS%2Boj%2BNYDmsNNd2mYM1ZC5CgWwC71W3%3DEhrO9qeOiKyWXA%40mail.gmail.com?utm_medium=email&utm_source=footer>.
For more options, visit https://groups.google.com/d/optout 
<https://groups.google.com/d/optout>.




-- 
 <http://www.vicomtech.org/>

Noelia Osés Fernández, PhD
Senior Researcher |
Investigadora Senior

[email protected] <mailto:[email protected]>
+[34] 943 30 92 30
Data Intelligence for Energy and
Industrial Processes | Inteligencia
de Datos para Energía y Procesos
Industriales

 <https://www.linkedin.com/company/vicomtech>  
<https://www.youtube.com/user/VICOMTech>  <https://twitter.com/@Vicomtech_IK4>

member of:  <http://www.graphicsmedia.net/>     <http://www.ik4.es/>

Legal Notice - Privacy policy <http://www.vicomtech.org/en/proteccion-datos>
