There is a REST client for Elasticsearch, and bindings in many popular languages, but to get started quickly I found these commands helpful:
List indices:

    curl -XGET 'localhost:9200/_cat/indices?v&pretty'

Get some documents from an index:

    curl -XGET 'localhost:9200/<INDEX>/_search?q=*&pretty'

Then look at the "_source" field in each document to see what values are
associated with it. More info here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-get.html#_source

This might also be helpful for working through a single specific query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html

On Mon, Nov 20, 2017 at 9:49 AM Noelia Osés Fernández <no...@vicomtech.org> wrote:

> Thanks Daniel!
>
> And excuse my ignorance but... how do you inspect the ES index?
>
> On 20 November 2017 at 15:29, Daniel Gabrieli <dgabri...@salesforce.com> wrote:
>
>> There is this CLI tool, and an article with more information, that does
>> produce scores:
>>
>> https://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html
>>
>> But I don't know of any commands that return diagnostics about LLR from
>> the PIO framework / UR engine. That would be a nice feature if it doesn't
>> exist. The way I've gotten some insight into what the model is doing when
>> using PIO / UR is by inspecting the Elasticsearch index that gets created,
>> because it has the "significant" values populated in the documents (though
>> not the actual LLR scores).
>>
>> On Mon, Nov 20, 2017 at 7:22 AM Noelia Osés Fernández <no...@vicomtech.org> wrote:
>>
>>> This thread is very enlightening, thank you very much!
>>>
>>> Is there a way I can see what the P, PtP, and PtL matrices of an app
>>> are? In the handmade case, for example?
>>>
>>> Are there any pio calls I can use to get these?
>>>
>>> On 17 November 2017 at 19:52, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>
>>>> Mahout builds the model by doing matrix multiplication (PtP) then
>>>> calculating the LLR score for every non-zero value.
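The _search command above returns JSON whose hits carry the indexed fields under "_source". A minimal sketch of pulling those fields out in Python; the sample response body, item ids, and field names here are entirely invented for illustration (real field names depend on your engine configuration):

```python
import json

# Hypothetical example of the JSON returned by
#   curl 'localhost:9200/<INDEX>/_search?q=*&pretty'
# for a UR index; all ids and field names are made up for illustration.
sample_response = json.loads("""
{
  "hits": {
    "hits": [
      {"_id": "product-1",
       "_source": {"purchase": ["product-7", "product-42"],
                   "category": ["electronics"]}}
    ]
  }
}
""")

# Map each item document to the indicator fields stored in its _source.
fields = {hit["_id"]: sorted(hit["_source"])
          for hit in sample_response["hits"]["hits"]}
print(fields)
```

Each key under "_source" is one indicator type; the listed values are the "significant" correlated items mentioned earlier in the thread.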
>>>> We then keep the top K or use a threshold to decide whether to keep it
>>>> or not (both are supported in the UR). LLR is a metric for how likely it
>>>> is that 2 events in a large group are correlated. Therefore LLR is only
>>>> used to remove weak data from the model.
>>>>
>>>> So Mahout builds the model, then it is put into Elasticsearch, which is
>>>> used as a KNN (k-nearest neighbors) engine. The LLR score is not put into
>>>> the model, only an indicator that the item survived the LLR test.
>>>>
>>>> The KNN is applied using the user's history as the query, finding the
>>>> items that most closely match it. Since PtP will have items in rows and
>>>> each row will contain that item's correlating items, this "search" method
>>>> works quite well to find items whose co-purchased items closely match the
>>>> items in the user's history.
>>>>
>>>> =============================== that is the simple explanation
>>>> ========================================
>>>>
>>>> Item-based recs take the model items (items correlated by the LLR test)
>>>> as the query, and the results are the most similar items: the items with
>>>> the most similar correlating items.
>>>>
>>>> The model has items in rows and items in columns if you are only using
>>>> one event: PtP. If you think it through, the row key is a purchased item
>>>> and the row holds the other items purchased along with it. LLR filters
>>>> out the weakly correlating non-zero values (0 means no evidence of
>>>> correlation anyway). If we didn't do this it would be a pure
>>>> "cooccurrence" recommender, one of the first useful ones. But filtering
>>>> on raw cooccurrence strength (PtP values without LLR applied to them)
>>>> produces much worse results than using LLR to keep only the most highly
>>>> correlated cooccurrences. You get a similar effect with Matrix
>>>> Factorization, but there you can only use one type of event, for various
>>>> reasons.
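The PtP step Pat describes can be sketched in miniature. This toy Python version only counts co-purchases; the user/item data is invented, and the brute-force loops stand in for Mahout's distributed matrix multiply:

```python
from collections import defaultdict

# Toy purchase history: user -> set of items purchased (invented data).
purchases = {
    "u1": {"A", "B"},
    "u2": {"A", "B", "C"},
    "u3": {"B", "C"},
}

# PtP: for each ordered pair of distinct items, count how many users
# purchased both. The non-zero entries are the candidates that the LLR
# test later filters down to significant correlators.
ptp = defaultdict(int)
for items in purchases.values():
    for a in items:
        for b in items:
            if a != b:
                ptp[(a, b)] += 1

print(ptp[("A", "B")])  # A and B were co-purchased by u1 and u2 -> 2
```

Swapping the inner matrix for a secondary event (likes, views) gives PtL, PtV, and so on, which is the cross-occurrence part of CCO.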
>>>>
>>>> Since LLR is a probabilistic metric that only looks at counts, it can
>>>> be applied equally well to PtV (purchase, view), PtS (purchase, search
>>>> terms), or PtC (purchase, category-preferences). We did an experiment
>>>> using Mean Average Precision for the UR with video "Likes" vs. "Likes"
>>>> and "Dislikes" (so LtL vs. LtL and LtD) scraped from rottentomatoes.com
>>>> reviews, and got a 20% lift in the MAP@k score by including the data for
>>>> "Dislikes".
>>>> https://developer.ibm.com/dwblog/2017/mahout-spark-correlated-cross-occurences/
>>>>
>>>> So the benefit and use of LLR is to filter weak data from the model and
>>>> to let us see whether dislikes, and other events, correlate with likes.
>>>> Adding this type of data, which is usually thrown away, is one of the
>>>> most powerful reasons to use the algorithm. BTW, the algorithm is called
>>>> Correlated Cross-Occurrence (CCO).
>>>>
>>>> The benefit of using Lucene (at the heart of Elasticsearch) to do the
>>>> KNN query is that it is fast, taking the user's realtime events into the
>>>> query, but also that it is trivial to add all sorts of business rules:
>>>> like give me recs based on user events but only ones from a certain
>>>> category, or give me recs but only ones tagged as "in-stock". In fact
>>>> the business rules can have inclusion rules, exclusion rules, and be
>>>> mixed with ANDs and ORs.
>>>>
>>>> BTW there is a version ready for testing with PIO 0.12.0 and ES5 here:
>>>> https://github.com/actionml/universal-recommender/tree/0.7.0-SNAPSHOT
>>>> Instructions are in the readme; notice it is in the 0.7.0-SNAPSHOT branch.
>>>>
>>>> On Nov 17, 2017, at 7:59 AM, Andrew Troemner <atroem...@salesforce.com> wrote:
>>>>
>>>> I'll echo Dan here.
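A sketch of how such a query can combine the user's history with inclusion and exclusion rules in Elasticsearch's bool-query JSON. The field names ("purchase", "category", "stock_status") and values are invented, and the real UR builds its queries internally; this only illustrates the shape:

```python
import json

# User's recent history and business rules (all values invented).
user_history = ["product-7", "product-42"]

# A bool query: "should" terms score candidates by overlap with the user's
# history (the KNN part), while "filter" and "must_not" apply business
# rules without affecting the relevancy scores.
query = {
    "query": {
        "bool": {
            "should": [{"terms": {"purchase": user_history}}],
            "filter": [{"term": {"category": "electronics"}}],        # inclusion rule
            "must_not": [{"term": {"stock_status": "out-of-stock"}}], # exclusion rule
        }
    }
}

print(json.dumps(query, indent=2))
```

Because rules live in filter clauses, they can be combined freely (ANDed, ORed, negated) without disturbing the similarity ranking.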
He and I went through the raw Mahout libraries
>>>> called by the Universal Recommender, and while Noelia's description is
>>>> accurate for an intermediate step, the indexing via Elasticsearch
>>>> generates some separate relevancy scores based on its Lucene indexing
>>>> scheme. The raw LLR scores are used in building this process, but the
>>>> final scores served up by the APIs should be post-processed, and cannot
>>>> be used to reconstruct the raw LLRs (to my understanding).
>>>>
>>>> There are also some additional steps, including down-sampling, which
>>>> scrubs out very rare combinations (which would otherwise have very high
>>>> LLRs for a single observation) and which partially corrects for the
>>>> statistical problem of multiple detection. But the underlying logic is
>>>> per Ted Dunning's research, as summarized by Noelia, and is a solid way
>>>> to approach interaction effects for tens of thousands of items,
>>>> including secondary indicators (like demographics, or implicit
>>>> preferences).
>>>>
>>>> *ANDREW TROEMNER*
>>>> Associate Principal Data Scientist | salesforce.com
>>>> Office: 317.832.4404
>>>> Mobile: 317.531.0216
>>>>
>>>> On Fri, Nov 17, 2017 at 9:55 AM, Daniel Gabrieli <dgabri...@salesforce.com> wrote:
>>>>
>>>>> Maybe someone can correct me if I am wrong, but in the code I believe
>>>>> Elasticsearch is used instead of "resulting LLR is what goes into the
>>>>> AB element in matrix PtP or PtL."
>>>>>
>>>>> By default the strongest 50 LLR scores get set as searchable values in
>>>>> Elasticsearch per item-event pair.
>>>>>
>>>>> You can configure the thresholds for significance using the
>>>>> configuration parameters maxCorrelatorsPerItem and minLLR. And this
>>>>> configuration is important because at the default of 50 you may end up
>>>>> treating all "indicator values" as significant.
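A toy sketch of the thresholding and capping just described, with invented LLR scores standing in for real ones; the parameter names echo maxCorrelatorsPerItem and minLLR but this is not the UR's actual code:

```python
# Toy LLR scores for the correlators of a single item (invented values).
scores = {"B": 42.0, "C": 3.1, "D": 0.2, "E": 17.5}

MAX_CORRELATORS_PER_ITEM = 2   # plays the role of maxCorrelatorsPerItem
MIN_LLR = 1.0                  # plays the role of minLLR

# Keep correlators above the LLR threshold, then cap at the strongest N;
# only the surviving item ids (not their scores) would be indexed.
kept = sorted((c for c, s in scores.items() if s >= MIN_LLR),
              key=lambda c: scores[c],
              reverse=True)[:MAX_CORRELATORS_PER_ITEM]
print(kept)
```

With a generous cap, nearly everything above the threshold survives, which is why tuning these two parameters matters.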
>>>>> More info here:
>>>>> http://actionml.com/docs/ur_config
>>>>>
>>>>> On Fri, Nov 17, 2017 at 4:50 AM Noelia Osés Fernández <no...@vicomtech.org> wrote:
>>>>>
>>>>>> Let's see if I've understood how LLR is used in the UR. Let P be the
>>>>>> matrix for the primary conversion indicator (say purchases) and Pt its
>>>>>> transpose.
>>>>>>
>>>>>> Then, with a second matrix, which can be P again to make PtP or a
>>>>>> matrix for a secondary indicator (say L for likes) to make PtL, we
>>>>>> take a row from Pt (item A) and a column from the second matrix
>>>>>> (either P or L, in this example) (item B) and we build the table that
>>>>>> Ted Dunning explains on his webpage: the number of cooccurrences where
>>>>>> items A *AND* B have both been purchased (or purchased AND liked), the
>>>>>> number of times that item A *OR* B has been purchased (or purchased OR
>>>>>> liked), and the number of times that *neither* item A nor B has been
>>>>>> purchased (or liked). With these counts we calculate LLR following the
>>>>>> formulas that Ted Dunning provides, and the resulting LLR is what goes
>>>>>> into the AB element in matrix PtP or PtL. Correct?
>>>>>>
>>>>>> Thank you!
>>>>>>
>>>>>> On 16 November 2017 at 17:03, Noelia Osés Fernández <no...@vicomtech.org> wrote:
>>>>>>
>>>>>>> Wonderful! Thanks Daniel!
>>>>>>>
>>>>>>> Suneel, I'm still new to the Apache ecosystem, so I know that Mahout
>>>>>>> is used but only vaguely... I still don't know the different parts
>>>>>>> well enough to have a good understanding of what each of them does
>>>>>>> (Spark, MLlib, PIO, Mahout, ...).
>>>>>>>
>>>>>>> Thank you both!
>>>>>>>
>>>>>>> On 16 November 2017 at 16:59, Suneel Marthi <smar...@apache.org> wrote:
>>>>>>>
>>>>>>>> Indeed so. Ted Dunning is an Apache Mahout PMC member and committer,
>>>>>>>> and the whole idea of search-based recommenders stems from his work
>>>>>>>> and insights.
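The 2x2 contingency table Noelia describes can be turned into an LLR score with the entropy formulation from Dunning's paper, which is also the form Mahout implements. A small Python sketch with invented counts:

```python
from math import log

def x_log_x(x: float) -> float:
    return x * log(x) if x > 0 else 0.0

def entropy(*counts: float) -> float:
    # Unnormalized Shannon entropy, as in Mahout's LogLikelihood utility.
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11: float, k12: float, k21: float, k22: float) -> float:
    """Log-likelihood ratio for a 2x2 contingency table:
    k11 = events with both A and B, k12 = A without B,
    k21 = B without A,              k22 = neither A nor B."""
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    mat_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - mat_entropy)

print(llr(1, 1, 1, 1))       # independent events score ~0
print(llr(100, 5, 5, 1000))  # strong cooccurrence scores high
```

Independence gives a score near zero, so thresholding on LLR keeps only pairs whose cooccurrence counts are surprising given the marginals.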
If you didn't know, the PIO UR uses Apache Mahout under the
>>>>>>>> hood, and hence you see the LLR.
>>>>>>>>
>>>>>>>> On Thu, Nov 16, 2017 at 3:49 PM, Daniel Gabrieli <dgabri...@salesforce.com> wrote:
>>>>>>>>
>>>>>>>>> I am pretty sure the LLR stuff in the UR is based off of this blog
>>>>>>>>> post and associated paper:
>>>>>>>>>
>>>>>>>>> http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html
>>>>>>>>>
>>>>>>>>> "Accurate Methods for the Statistics of Surprise and Coincidence"
>>>>>>>>> by Ted Dunning
>>>>>>>>> http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.14.5962
>>>>>>>>>
>>>>>>>>> On Thu, Nov 16, 2017 at 10:26 AM Noelia Osés Fernández <no...@vicomtech.org> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I've been trying to understand how the UR algorithm works and I
>>>>>>>>>> think I have a general idea. But I would like to have a
>>>>>>>>>> *mathematical description* of the step in which the LLR comes into
>>>>>>>>>> play. In the CCO presentations I have found, it says:
>>>>>>>>>>
>>>>>>>>>> (PtP) compares column to column using a
>>>>>>>>>> *log-likelihood based correlation test*
>>>>>>>>>>
>>>>>>>>>> However, I have searched for "log-likelihood based correlation
>>>>>>>>>> test" on Google but no joy. All I get are explanations of the
>>>>>>>>>> likelihood-ratio test to compare two models.
>>>>>>>>>>
>>>>>>>>>> I would very much appreciate a math explanation of the
>>>>>>>>>> log-likelihood based correlation test. Any pointers to papers or
>>>>>>>>>> any other literature that explains this specifically are much
>>>>>>>>>> appreciated.
>>>>>>>>>>
>>>>>>>>>> Best regards,
>>>>>>>>>> Noelia
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "actionml-user" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to actionml-user+unsubscr...@googlegroups.com.
>>>> To post to this group, send email to actionml-u...@googlegroups.com.
>>>
>>> --
>>> Noelia Osés Fernández, PhD
>>> Senior Researcher | Investigadora Senior
>>> no...@vicomtech.org
>>> +[34] 943 30 92 30
>>> Data Intelligence for Energy and Industrial Processes | Inteligencia de
>>> Datos para Energía y Procesos Industriales