Re: MoreLikeThisHandler with multiple input documents

Alessandro Benedetti Tue, 29 Sep 2015 08:22:51 -0700

Hi Roland,
you said "The main goal is that when a customer is on the product page".
But if you are on a product page, I guess you have the product ID.
If you have the product ID, you can simply execute the MLT request with
that single document ID as input.
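
To make the single-ID request concrete, here is a minimal sketch in Python.
It only builds the request URL and parses a JSON response body; it does not
talk to Solr. The host, port and core name are copied from the example URL
in your first mail. The helper names, the use of "rows" to limit the number
of similar docs, and the exact shape of the "interestingTerms" section are
assumptions on my side, so check them against a real response from your
handler.

```python
import json
from urllib.parse import urlencode

# Assumption: same host, port and core as the example URL in this thread.
SOLR_MLT_URL = "http://localhost:8983/solr/bandwhu/mlt"


def build_mlt_url(doc_id, rows=5, base_url=SOLR_MLT_URL):
    """Build the MLT handler URL for one document id (hypothetical helper)."""
    params = {
        "q": f"id:{doc_id}",  # the product page already knows this id
        "fl": "id",           # only ids are needed to look up covers/metadata
        "rows": rows,         # assumed to limit the number of similar docs
        "wt": "json",
    }
    return f"{base_url}?{urlencode(params)}"


def parse_mlt_response(body):
    """Return (similar_ids, interesting_terms) from a JSON response body.

    Assumption: similar docs are under response.docs, and interestingTerms is
    either a mapping or a flat list alternating term and score.
    """
    data = json.loads(body)
    similar_ids = [doc["id"] for doc in data.get("response", {}).get("docs", [])]
    raw = data.get("interestingTerms", [])
    if isinstance(raw, dict):
        terms = dict(raw)
    else:
        # Pair up [term, score, term, score, ...] into {term: score}.
        terms = dict(zip(raw[0::2], raw[1::2]))
    return similar_ids, terms


if __name__ == "__main__":
    print(build_mlt_url(8))
    # A made-up response body, only to exercise the parser.
    sample = json.dumps({
        "response": {"docs": [{"id": "3"}, {"id": "7"}]},
        "interestingTerms": ["content:solr", 1.5, "title:lucene", 2.0],
    })
    print(parse_mlt_response(sample))
```

The same two helpers could also be called in a loop over all IDs if you
still prefer the offline calculation.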


Why do you need to calculate beforehand?

Cheers

2015-09-29 15:44 GMT+01:00 Szűcs Roland <szucs.rol...@bookandwalk.hu>:

> Hello Upayavira,
>
> The main goal is that when a customer is on the product page of an e-book
> and does not like it for some reason, I want to immediately offer her/him
> alternative e-books on the same topic. If I expect the customer to
> click on a button like "similar e-books", I lose half of them as they are
> too lazy to click anywhere. So I would like to present the alternative
> e-books on the product pages without any clicking.
>
> I assumed the best idea was to calculate the similar e-books for each
> e-book against all the others (n*(n-1) similarity calculations) and present
> only the top 5. I planned to do it when our server is not busy. At this
> point I found the description of mlt as a search component, which seemed to
> be a good candidate as it calculates the similar documents for the whole
> result set of the query. So if I say q=*:* and the mlt component is enabled,
> I get similar documents for my entire document set. The only problem with
> this approach was that the mlt search component does not give back the
> interesting terms for my tag cloud calculation.
>
> That's why I tried to combine the flexibility of the mlt component (multiple
> docs accepted as input) with the robustness of MoreLikeThisHandler (which
> returns the interesting terms).
>
> If there is no solution, I will use the mlt component and solve the tag
> cloud calculation some other way. By the way, if I am not mistaken, the
> 5.3.1 version takes the union of the feature sets of the mlt component and
> the handler.
>
> Best Regards,
> Roland
>
>
>
> 2015-09-29 14:38 GMT+02:00 Upayavira <u...@odoko.co.uk>:
>
> > Let's take a step back. So, you have 3000 or so docs, and you want to
> > know which documents are similar to these.
> >
> > Why do you want to know this? What feature do you need to build that
> > will use that information? Knowing this may help us to arrive at the
> > right technology for you.
> >
> > For example, you might want to investigate offline clustering algorithms
> > (e.g. [1], which might be a bit dense to follow). A good book on machine
> > learning if you are okay with Python is "Programming Collective
> > Intelligence" as it explains the usual algorithms with simple for loops
> > making it very clear.
> >
> > Or, you could do searches, and then cluster the results at search time
> > (so if you search for 100 docs, it will identify clusters within those
> > 100 matching documents). That might get you there. See [2]
> >
> > So, if you let us know what the end-goal is, perhaps we can suggest an
> > alternative approach, rather than burying ourselves neck-deep in MLT
> > problems.
> >
> > Upayavira
> >
> > [1] http://mylazycoding.blogspot.co.uk/2012/03/cluster-apache-solr-data-using-apache_13.html
> > [2] https://cwiki.apache.org/confluence/display/solr/Result+Clustering
> >
> > On Tue, Sep 29, 2015, at 12:42 PM, Szűcs Roland wrote:
> > > Hello Upayavira,
> > >
> > > Thanks for dealing with my issue. I have already applied termVectors=true
> > > to all fields involved in the more like this calculation. I have just
> > > 3,000 documents, each of them represented by a relatively big term vector
> > > with more than 20,000 unique terms. If I run the more like this handler
> > > for a Solr doc, it takes close to 1 sec to get back the first 10 similar
> > > documents. After this I have to pass the doc IDs to my other application,
> > > which finds the cover of the e-book and other metadata and puts it on the
> > > web. The end-to-end process takes too much time from the customer's
> > > perspective, which is why I tried to find a solution for offline more
> > > like this calculation. But if my app has to call the MoreLikeThisHandler
> > > for each doc, it adds overhead to the offline calculation.
> > >
> > > Best Regards,
> > > Roland
> > >
> > > 2015-09-29 13:01 GMT+02:00 Upayavira <u...@odoko.co.uk>:
> > >
> > > > If MoreLikeThis is slow for large documents that are indexed, have you
> > > > enabled term vectors on the similarity fields?
> > > >
> > > > Basically, what more like this does is this:
> > > >
> > > > * decide which terms in the source doc are "interesting", and pick the
> > > >   25 most interesting ones
> > > > * build and execute a boolean query using these interesting terms.
> > > >
> > > > Looking at the first phase of this in more detail:
> > > >
> > > > If you pass in a document using stream.body, it will analyse this
> > > > document into terms, and then calculate the most interesting terms from
> > > > that.
> > > >
> > > > If you reference a document in your index with a field that is stored,
> > > > it will take the stored version, analyse it, and identify the
> > > > interesting terms from there.
> > > >
> > > > If, however, you have stored term vectors against that field, this work
> > > > is not needed. You have already done much of the work, and the
> > > > identification of your "interesting terms" will be much faster.
> > > >
> > > > Thus, on the content field of your documents, add termVectors="true" in
> > > > your schema, and re-index. Then you could well find MLT becoming a lot
> > > > more efficient.
> > > >
> > > > Upayavira
> > > >
> > > > On Tue, Sep 29, 2015, at 10:39 AM, Szűcs Roland wrote:
> > > > > Hi Alessandro,
> > > > >
> > > > > My original goal was to get offline suggestions based on content
> > > > > similarity for every e-book we have. We wanted to run a bulk more
> > > > > like this calculation in the evening, when the usage of our site is
> > > > > low and we submit a new e-book. Real-time more like this can take a
> > > > > while as we typically have long documents (2-5 MB of text) with all
> > > > > the content indexed.
> > > > >
> > > > > When we upload a new document, we wanted to recalculate the more
> > > > > like this suggestions and the tf-idf based tag clouds. Both of them
> > > > > are delivered by the MoreLikeThisHandler, but only for one document,
> > > > > as you wrote.
> > > > >
> > > > > The text input is not good for us because we need the similar doc
> > > > > list for each of the matched documents. If I put together the text of
> > > > > 10 documents, I cannot tell which suggestion relates to which matched
> > > > > document, and the tag cloud will also belong to the mixed text.
> > > > >
> > > > > Most likely we will use the MoreLikeThisHandler for each of the
> > > > > documents, parse the JSON response and store the result in a DQL
> > > > > database.
> > > > >
> > > > > Thanks for your help.
> > > > >
> > > > > 2015-09-29 11:18 GMT+02:00 Alessandro Benedetti
> > > > > <benedetti.ale...@gmail.com>:
> > > > >
> > > > > > Hi Roland,
> > > > > > what is your exact requirement?
> > > > > > Do you want to basically build a "description" for a set of
> > > > > > documents and then find documents in the index similar to this
> > > > > > description?
> > > > > >
> > > > > > By default, based on my experience (and on the code), this is the
> > > > > > entry point for the Lucene More Like This:
> > > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > *org.apache.lucene.queries.mlt.MoreLikeThis/*** Return a query
> > that
> > > > will
> > > > > > > return docs like the passed lucene document ID.** @param docNum
> > the
> > > > > > > documentID of the lucene doc to generate the 'More Like This"
> > query
> > > > for.*
> > > > > > > @return a query that will return docs like the passed lucene
> > document
> > > > > > > ID.*/public Query like(int docNum) throws IOException {if
> > > > (fieldNames ==
> > > > > > > null) {// gather list of valid fields from
> > luceneCollection<String>
> > > > > > fields
> > > > > > > = MultiFields.getIndexedFields(ir);fieldNames =
> > fields.toArray(new
> > > > > > > String[fields.size()]);}return
> > createQuery(retrieveTerms(docNum));}*
> > > > > >
> > > > > > It means that, talking about "documents", you can feed only one Solr
> > > > > > doc.
> > > > > >
> > > > > > But you can also feed the MLT with simple text.
> > > > > >
> > > > > > So you should study your use case better and understand which option
> > > > > > fits better:
> > > > > >
> > > > > > 1) customising the MLT component starting from Lucene
> > > > > >
> > > > > > 2) doing some processing client side and using the "text" similarity
> > > > > >    feature.
> > > > > >
> > > > > >
> > > > > > Cheers
> > > > > >
> > > > > >
> > > > > > 2015-09-29 10:05 GMT+01:00 Roland Szűcs <roland.sz...@bookandwalk.com>:
> > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > Is it possible to feed multiple Solr IDs to a MoreLikeThisHandler?
> > > > > > >
> > > > > > > <requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
> > > > > > > <lst name="defaults">
> > > > > > > <str name="mlt.match.include">false</str>
> > > > > > > <str name="mlt.interestingTerms">details</str>
> > > > > > > <str name="mlt.fl">title,content</str>
> > > > > > > <str name="mlt.minwl">4</str>
> > > > > > > <str name="mlt.qf">title^12 content^1</str>
> > > > > > > <str name="mlt.mintf">2</str>
> > > > > > > <int name="mlt.count">10</int>
> > > > > > > <str name="mlt.boost">true</str>
> > > > > > > <str name="wt">json</str>
> > > > > > > <str name="indent">true</str>
> > > > > > > </lst>
> > > > > > >   </requestHandler>
> > > > > > >
> > > > > > > When I call this: http://localhost:8983/solr/bandwhu/mlt?q=id:8&fl=id
> > > > > > > it works fine. Is there any way to make a kind of "bulk" call to the
> > > > > > > more like this handler? I need the interesting terms as well, and as
> > > > > > > far as I know, if I use more like this as a search component it does
> > > > > > > not return them, so it is not an alternative.
> > > > > > >
> > > > > > > Thanks in advance,
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Roland Szűcs
> > > > > > > CEO
> > > > > > > Phone: +36 1 210 81 13
> > > > > > > Bookandwalk.hu <https://bookandwalk.hu/>
> > > > > > > Connect with me on LinkedIn:
> > > > > > > https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu
> > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > --------------------------
> > > > > >
> > > > > > Benedetti Alessandro
> > > > > > Visiting card - http://about.me/alessandro_benedetti
> > > > > > Blog - http://alexbenedetti.blogspot.co.uk
> > > > > >
> > > > > > "Tyger, tyger burning bright
> > > > > > In the forests of the night,
> > > > > > What immortal hand or eye
> > > > > > Could frame thy fearful symmetry?"
> > > > > >
> > > > > > William Blake - Songs of Experience -1794 England
> > > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> >
>
>
>
>



