There is a REST client for Elasticsearch, and bindings in many popular languages, but to get started quickly I found these commands helpful:
List indices:

    curl -XGET 'localhost:9200/_cat/indices?v&pretty'

Get some documents from an index:

    curl -XGET 'localhost:9200/<INDEX>/_search?q=*&pretty'

Then look at the "_source" field in each document to see what values are
associated with it. More info here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-get.html#_source

This might also be helpful for working through a single specific query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html

On Mon, Nov 20, 2017 at 9:49 AM Noelia Osés Fernández <no...@vicomtech.org> wrote:

> Thanks Daniel!
>
> And excuse my ignorance but... how do you inspect the ES index?
>
> On 20 November 2017 at 15:29, Daniel Gabrieli <dgabri...@salesforce.com> wrote:
>
>> There is this CLI tool, and an article with more information, that does
>> produce scores:
>>
>> https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html
>>
>> But I don't know of any commands that return diagnostics about LLR from
>> the PIO framework / UR engine. That would be a nice feature if it doesn't
>> exist. The way I've gotten some insight into what the model is doing when
>> using PIO / UR is by inspecting the Elasticsearch index that gets created,
>> because it has the "significant" values populated in the documents (though
>> not the actual LLR scores).
>>
>> On Mon, Nov 20, 2017 at 7:22 AM Noelia Osés Fernández <no...@vicomtech.org> wrote:
>>
>>> This thread is very enlightening, thank you very much!
>>>
>>> Is there a way I can see what the P, PtP, and PtL matrices of an app
>>> are? In the handmade case, for example?
>>>
>>> Are there any pio calls I can use to get these?
>>>
>>> On 17 November 2017 at 19:52, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>
>>>> Mahout builds the model by doing matrix multiplication (PtP) then
>>>> calculating the LLR score for every non-zero value.
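The _search command above returns JSON whose hits carry the indexed fields under "_source". A minimal sketch of pulling those fields out in Python; the sample response body, item ids, and field names here are entirely invented for illustration (real field names depend on your engine configuration):

```python
import json

# Hypothetical example of the JSON returned by
#   curl 'localhost:9200/<INDEX>/_search?q=*&pretty'
# for a UR index; all ids and field names are made up for illustration.
sample_response = json.loads("""
{
  "hits": {
    "hits": [
      {"_id": "product-1",
       "_source": {"purchase": ["product-7", "product-42"],
                   "category": ["electronics"]}}
    ]
  }
}
""")

# Map each item document to the indicator fields stored in its _source.
fields = {hit["_id"]: sorted(hit["_source"])
          for hit in sample_response["hits"]["hits"]}
print(fields)
```

Each key under "_source" is one indicator type; the listed values are the "significant" correlated items mentioned earlier in the thread.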
>>>> We then keep the top K or use a threshold to decide whether to keep it
>>>> or not (both are supported in the UR). LLR is a metric for how likely it
>>>> is that 2 events in a large group are correlated. Therefore LLR is only
>>>> used to remove weak data from the model.
>>>>
>>>> So Mahout builds the model, then it is put into Elasticsearch, which is
>>>> used as a KNN (k-nearest neighbors) engine. The LLR score is not put into
>>>> the model, only an indicator that the item survived the LLR test.
>>>>
>>>> The KNN is applied using the user's history as the query, finding the
>>>> items that most closely match it. Since PtP will have items in rows and
>>>> each row will contain that item's correlating items, this "search" method
>>>> works quite well to find items whose co-purchased items closely match the
>>>> items in the user's history.
>>>>
>>>> =============================== that is the simple explanation
>>>> ========================================
>>>>
>>>> Item-based recs take the model items (items correlated by the LLR test)
>>>> as the query, and the results are the most similar items: the items with
>>>> the most similar correlating items.
>>>>
>>>> The model has items in rows and items in columns if you are only using
>>>> one event: PtP. If you think it through, the row key is a purchased item
>>>> and the row holds the other items purchased along with it. LLR filters
>>>> out the weakly correlating non-zero values (0 means no evidence of
>>>> correlation anyway). If we didn't do this it would be a pure
>>>> "cooccurrence" recommender, one of the first useful ones. But filtering
>>>> on raw cooccurrence strength (PtP values without LLR applied to them)
>>>> produces much worse results than using LLR to keep only the most highly
>>>> correlated cooccurrences. You get a similar effect with Matrix
>>>> Factorization, but there you can only use one type of event, for various
>>>> reasons.
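The PtP step Pat describes can be sketched in miniature. This toy Python version only counts co-purchases; the user/item data is invented, and the brute-force loops stand in for Mahout's distributed matrix multiply:

```python
from collections import defaultdict

# Toy purchase history: user -> set of items purchased (invented data).
purchases = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"B", "C"},
}

# PtP: for each ordered pair of distinct items, count how many users
# purchased both. The non-zero entries are the candidates that the LLR
# test later filters down to significant correlators.
ptp = defaultdict(int)
for items in purchases.values():
    for a in items:
        for b in items:
            if a != b:
                ptp[(a, b)] += 1

print(ptp[("A", "B")])  # A and B were co-purchased by u1 and u2 -> 2
```

Swapping the inner matrix for a secondary event (likes, views) gives PtL, PtV, and so on, which is the cross-occurrence part of CCO.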
>>>>
>>>> Since LLR is a probabilistic metric that only looks at counts, it can
>>>> be applied equally well to PtV (purchase, view), PtS (purchase, search
>>>> terms), or PtC (purchase, category-preferences). We did an experiment
>>>> using Mean Average Precision for the UR with video "Likes" vs. "Likes"
>>>> and "Dislikes" (so LtL vs. LtL and LtD) scraped from rottentomatoes.com
>>>> reviews, and got a 20% lift in the MAP@k score by including the data for
>>>> "Dislikes".
>>>> https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/
>>>>
>>>> So the benefit and use of LLR is to filter weak data from the model and
>>>> to let us see whether dislikes, and other events, correlate with likes.
>>>> Adding this type of data, which is usually thrown away, is one of the
>>>> most powerful reasons to use the algorithm. BTW, the algorithm is called
>>>> Correlated Cross-Occurrence (CCO).
>>>>
>>>> The benefit of using Lucene (at the heart of Elasticsearch) to do the
>>>> KNN query is that it is fast, taking the user's realtime events into the
>>>> query, but also that it is trivial to add all sorts of business rules:
>>>> like give me recs based on user events but only ones from a certain
>>>> category, or give me recs but only ones tagged as "in-stock". In fact
>>>> the business rules can have inclusion rules, exclusion rules, and be
>>>> mixed with ANDs and ORs.
>>>>
>>>> BTW there is a version ready for testing with PIO 0.12.0 and ES5 here:
>>>> https://github.com/actionml/universal-recommender/tree/0.7.0-SNAPSHOT
>>>> Instructions are in the readme; notice it is in the 0.7.0-SNAPSHOT branch.
>>>>
>>>> On Nov 17, 2017, at 7:59 AM, Andrew Troemner <atroem...@salesforce.com> wrote:
>>>>
>>>> I'll echo Dan here.
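A sketch of how such a query can combine the user's history with inclusion and exclusion rules in Elasticsearch's bool-query JSON. The field names ("purchase", "category", "stock_status") and values are invented, and the real UR builds its queries internally; this only illustrates the shape:

```python
import json

# User's recent history and business rules (all values invented).
user_history = ["product-7", "product-42"]

# A bool query: "should" terms score candidates by overlap with the user's
# history (the KNN part), while "filter" and "must_not" apply business
# rules without affecting the relevancy scores.
query = {
    "query": {
        "bool": {
            "should": [{"terms": {"purchase": user_history}}],
            "filter": [{"term": {"category": "electronics"}}],        # inclusion rule
            "must_not": [{"term": {"stock_status": "out-of-stock"}}], # exclusion rule
        }
    }
}

print(json.dumps(query, indent=2))
```

Because rules live in filter clauses, they can be combined freely (ANDed, ORed, negated) without disturbing the similarity ranking.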
He and I went through the raw Mahout libraries
>>>> called by the Universal Recommender, and while Noelia's description is
>>>> accurate for an intermediate step, the indexing via Elasticsearch
>>>> generates some separate relevancy scores based on its Lucene indexing
>>>> scheme. The raw LLR scores are used in building this process, but the
>>>> final scores served up by the APIs should be post-processed, and cannot
>>>> be used to reconstruct the raw LLRs (to my understanding).
>>>>
>>>> There are also some additional steps, including down-sampling, which
>>>> scrubs out very rare combinations (which would otherwise have very high
>>>> LLRs for a single observation) and which partially corrects for the
>>>> statistical problem of multiple detection. But the underlying logic is
>>>> per Ted Dunning's research, as summarized by Noelia, and is a solid way
>>>> to approach interaction effects for tens of thousands of items,
>>>> including secondary indicators (like demographics, or implicit
>>>> preferences).
>>>>
>>>> *ANDREW TROEMNER*
>>>> Associate Principal Data Scientist | salesforce.com
>>>> Office: 317.832.4404
>>>> Mobile: 317.531.0216
>>>>
>>>> On Fri, Nov 17, 2017 at 9:55 AM, Daniel Gabrieli <dgabri...@salesforce.com> wrote:
>>>>
>>>>> Maybe someone can correct me if I am wrong, but in the code I believe
>>>>> Elasticsearch is used instead of "resulting LLR is what goes into the
>>>>> AB element in matrix PtP or PtL."
>>>>>
>>>>> By default the strongest 50 LLR scores get set as searchable values in
>>>>> Elasticsearch per item-event pair.
>>>>>
>>>>> You can configure the thresholds for significance using the
>>>>> configuration parameters maxCorrelatorsPerItem and minLLR. And this
>>>>> configuration is important because at the default of 50 you may end up
>>>>> treating all "indicator values" as significant.
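A toy sketch of the thresholding and capping just described, with invented LLR scores standing in for real ones; the parameter names echo maxCorrelatorsPerItem and minLLR but this is not the UR's actual code:

```python
# Toy LLR scores for the correlators of a single item (invented values).
scores = {"B": 42.0, "C": 3.1, "D": 0.2, "E": 17.5}

MAX_CORRELATORS_PER_ITEM = 2   # plays the role of maxCorrelatorsPerItem
MIN_LLR = 1.0                  # plays the role of minLLR

# Keep correlators above the LLR threshold, then cap at the strongest N;
# only the surviving item ids (not their scores) would be indexed.
kept = sorted((c for c, s in scores.items() if s >= MIN_LLR),
              key=lambda c: scores[c],
              reverse=True)[:MAX_CORRELATORS_PER_ITEM]
print(kept)
```

With a generous cap, nearly everything above the threshold survives, which is why tuning these two parameters matters.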
>>>>> More info here:
>>>>> http://actionml.com/docs/ur_config
>>>>>
>>>>> On Fri, Nov 17, 2017 at 4:50 AM Noelia Osés Fernández <no...@vicomtech.org> wrote:
>>>>>
>>>>>> Let's see if I've understood how LLR is used in the UR. Let P be the
>>>>>> matrix for the primary conversion indicator (say purchases) and Pt its
>>>>>> transpose.
>>>>>>
>>>>>> Then, with a second matrix, which can be P again to make PtP or a
>>>>>> matrix for a secondary indicator (say L for likes) to make PtL, we
>>>>>> take a row from Pt (item A) and a column from the second matrix
>>>>>> (either P or L, in this example) (item B) and we build the table that
>>>>>> Ted Dunning explains on his webpage: the number of cooccurrences where
>>>>>> items A *AND* B have both been purchased (or purchased AND liked), the
>>>>>> number of times that item A *OR* B has been purchased (or purchased OR
>>>>>> liked), and the number of times that *neither* item A nor B has been
>>>>>> purchased (or liked). With these counts we calculate LLR following the
>>>>>> formulas that Ted Dunning provides, and the resulting LLR is what goes
>>>>>> into the AB element in matrix PtP or PtL. Correct?
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> On 16 November 2017 at 17:03, Noelia Osés Fernández <no...@vicomtech.org> wrote:
>>>>>>
>>>>>>> Wonderful! Thanks Daniel!
>>>>>>>
>>>>>>> Suneel, I'm still new to the Apache ecosystem, so I know that Mahout
>>>>>>> is used but only vaguely... I still don't know the different parts
>>>>>>> well enough to have a good understanding of what each of them does
>>>>>>> (Spark, MLlib, PIO, Mahout, ...).
>>>>>>>
>>>>>>> Thank you both!
>>>>>>>
>>>>>>> On 16 November 2017 at 16:59, Suneel Marthi <smar...@apache.org> wrote:
>>>>>>>
>>>>>>>> Indeed so. Ted Dunning is an Apache Mahout PMC member and committer,
>>>>>>>> and the whole idea of search-based recommenders stems from his work
>>>>>>>> and insights.
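The 2x2 contingency table Noelia describes can be turned into an LLR score with the entropy formulation from Dunning's paper, which is also the form Mahout implements. A small Python sketch with invented counts:

```python
from math import log

def x_log_x(x: float) -> float:
    return x * log(x) if x > 0 else 0.0

def entropy(*counts: float) -> float:
    # Unnormalized Shannon entropy, as in Mahout's LogLikelihood utility.
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11: float, k12: float, k21: float, k22: float) -> float:
    """Log-likelihood ratio for a 2x2 contingency table:
    k11 = events with both A and B, k12 = A without B,
    k21 = B without A,              k22 = neither A nor B."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

print(llr(1, 1, 1, 1))       # independent events score ~0
print(llr(100, 5, 5, 1000))  # strong cooccurrence scores high
```

Independence gives a score near zero, so thresholding on LLR keeps only pairs whose cooccurrence counts are surprising given the marginals.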
If you didn't know, the PIO UR uses Apache Mahout under the
>>>>>>>> hood, and hence you see the LLR.
>>>>>>>>
>>>>>>>> On Thu, Nov 16, 2017 at 3:49 PM, Daniel Gabrieli <dgabri...@salesforce.com> wrote:
>>>>>>>>
>>>>>>>>> I am pretty sure the LLR stuff in the UR is based off of this blog
>>>>>>>>> post and associated paper:
>>>>>>>>>
>>>>>>>>> http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
>>>>>>>>>
>>>>>>>>> "Accurate Methods for the Statistics of Surprise and Coincidence"
>>>>>>>>> by Ted Dunning
>>>>>>>>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962
>>>>>>>>>
>>>>>>>>> On Thu, Nov 16, 2017 at 10:26 AM Noelia Osés Fernández <no...@vicomtech.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I've been trying to understand how the UR algorithm works and I
>>>>>>>>>> think I have a general idea. But I would like to have a
>>>>>>>>>> *mathematical description* of the step in which the LLR comes into
>>>>>>>>>> play. In the CCO presentations I have found, it says:
>>>>>>>>>>
>>>>>>>>>> (PtP) compares column to column using a
>>>>>>>>>> *log-likelihood based correlation test*
>>>>>>>>>>
>>>>>>>>>> However, I have searched for "log-likelihood based correlation
>>>>>>>>>> test" on Google but no joy. All I get are explanations of the
>>>>>>>>>> likelihood-ratio test to compare two models.
>>>>>>>>>>
>>>>>>>>>> I would very much appreciate a math explanation of the
>>>>>>>>>> log-likelihood based correlation test. Any pointers to papers or
>>>>>>>>>> any other literature that explains this specifically are much
>>>>>>>>>> appreciated.
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> Noelia
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "actionml-user" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to actionml-user+unsubscr...@googlegroups.com.
>>>> To post to this group, send email to actionml-u...@googlegroups.com.
>>>
>>> --
>>> Noelia Osés Fernández, PhD
>>> Senior Researcher | Investigadora Senior
>>> no...@vicomtech.org
>>> +[34] 943 30 92 30
>>> Data Intelligence for Energy and Industrial Processes | Inteligencia de
>>> Datos para Energía y Procesos Industriales