Re: MoreLikeThisHandler with multiple input documents

Upayavira Tue, 29 Sep 2015 05:39:21 -0700

Let's take a step back. So, you have 3000 or so docs, and you want to
know which documents are similar to these.


Why do you want to know this? What feature do you need to build that
will use that information? Knowing this may help us to arrive at the
right technology for you.

For example, you might want to investigate offline clustering algorithms
(e.g. [1], which might be a bit dense to follow). A good book on machine
learning, if you are okay with Python, is "Programming Collective
Intelligence", as it explains the usual algorithms with simple for loops,
which makes them very clear.
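
To give a flavour of the "simple for loops" style, here is a minimal sketch
of an offline similarity pass: it weights terms by tf-idf and ranks documents
by cosine similarity. It is not taken from the book or from Solr; the function
names, the smoothed idf formula and the assumption that each document is
already tokenised into a list of terms are all illustrative choices.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn {doc_id: [tokens]} into {doc_id: {term: tf-idf weight}}."""
    doc_freq = Counter()
    for tokens in docs.values():
        for term in set(tokens):
            doc_freq[term] += 1
    n_docs = len(docs)
    vectors = {}
    for doc_id, tokens in docs.items():
        vec = {}
        for term, tf in Counter(tokens).items():
            # Smoothed idf (an assumption): terms found in every document
            # still get a small positive weight.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
            vec[term] = tf * idf
        vectors[doc_id] = vec
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = 0.0
    for term, weight in a.items():
        dot += weight * b.get(term, 0.0)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(docs, top_n=10):
    """For every document, list the ids of its top_n most similar documents."""
    vectors = tfidf_vectors(docs)
    result = {}
    for doc_id, vec in vectors.items():
        scored = []
        for other_id, other_vec in vectors.items():
            if other_id != doc_id:
                scored.append((cosine(vec, other_vec), other_id))
        scored.sort(reverse=True)
        result[doc_id] = [other_id for _, other_id in scored[:top_n]]
    return result
```

Calling most_similar({"a": [...], "b": [...]}) returns a dict from each id to
its ranked neighbours. The pairwise loop is quadratic in the number of
documents, which should be tolerable for a few thousand docs in a nightly job,
but that is an estimate and not something measured here.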

Or, you could do searches, and then cluster the results at search time
(so if you search for 100 docs, it will identify clusters within those
100 matching documents). That might get you there. See [2].

So, if you let us know what the end-goal is, perhaps we can suggest an
alternative approach, rather than burying ourselves neck-deep in MLT
problems.

Upayavira

[1]
http://mylazycoding.blogspot.co.uk/2012/03/cluster-apache-solr-data-using-apache_13.html
[2] https://cwiki.apache.org/confluence/display/solr/Result+Clustering

On Tue, Sep 29, 2015, at 12:42 PM, Szűcs Roland wrote:
> Hello Upayavira,
> 
> Thanks for dealing with my issue. I have already applied termVectors=true
> to all fields involved in the more like this calculation. I have just
> 3,000 documents, each of them represented by a relatively big term vector
> with more than 20,000 unique terms. If I run the more like this handler
> for a Solr doc, it takes close to 1 sec to get back the first 10 similar
> documents. After this I have to pass the doc ids to my other application,
> which finds the cover of the e-book and other metadata and puts them on
> the web. The end-to-end process takes too much time from the customer's
> perspective, which is why I tried to find a solution for offline more like
> this calculation. But if my app has to call the MoreLikeThisHandler for
> each doc, that adds overhead to the offline calculation.
> 
> Best Regards,
> Roland
> 
> 2015-09-29 13:01 GMT+02:00 Upayavira <u...@odoko.co.uk>:
> 
> > If MoreLikeThis is slow for large documents that are indexed, have you
> > enabled term vectors on the similarity fields?
> >
> > Basically, what more like this does is this:
> >
> > * decide on what terms in the source doc are "interesting", and pick the
> > 25 most interesting ones
> > * build and execute a boolean query using these interesting terms.
> >
> > Looking at the first phase of this in more detail:
> >
> > If you pass in a document using stream.body, it will analyse this
> > document into terms, and then calculate the most interesting terms from
> > that.
> >
> > If you reference a document in your index with a field that is stored,
> > it will take the stored version, analyse it, and identify the
> > interesting terms from there.
> >
> > If, however, you have stored term vectors against that field, this work
> > is not needed. You have already done much of the work, and the
> > identification of your "interesting terms" will be much faster.
> >
> > Thus, on the content field of your documents, add termVectors="true" in
> > your schema, and re-index. Then you could well find MLT becoming a lot
> > more efficient.
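
The two phases described above can be sketched roughly as follows. This is a
simplified illustration, not Lucene's actual implementation: the tf * idf
score, the function names and the query string format are assumptions. The
limit of 25 terms comes from the description above, and the min_tf and
min_word_len defaults mirror the mlt.mintf and mlt.minwl values quoted further
down the thread.

```python
import math
from collections import Counter


def interesting_terms(doc_tokens, doc_freq, num_docs,
                      max_terms=25, min_tf=2, min_word_len=4):
    """Phase 1: score the source document's terms and keep the best ones.

    doc_freq maps a term to the number of indexed documents containing it.
    """
    scored = []
    for term, tf in Counter(doc_tokens).items():
        if tf < min_tf or len(term) < min_word_len:
            continue  # too rare in this document, or too short
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # term is not in the index, so it cannot match anything
        idf = math.log(num_docs / df) + 1.0
        scored.append((tf * idf, term))
    scored.sort(reverse=True)
    return [(term, score) for score, term in scored[:max_terms]]


def build_boolean_query(field, terms):
    """Phase 2: OR the interesting terms together, boosted by their score."""
    clauses = ["{}:{}^{:.2f}".format(field, term, score) for term, score in terms]
    return " OR ".join(clauses)
```

Term vectors help with phase 1 only: they provide the per-document term
counts directly, so the stored text does not have to be re-analysed.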
> >
> > Upayavira
> >
> > On Tue, Sep 29, 2015, at 10:39 AM, Szűcs Roland wrote:
> > > Hi Alessandro,
> > >
> > > My original goal was to get offline suggestions on content-based
> > > similarity for every e-book we have. We wanted to run a bulk more like
> > > this calculation in the evening, when the usage of our site is low and
> > > we submit a new e-book. Real-time more like this can take a while, as we
> > > typically have long documents (2-5 MB of text) with all the content
> > > indexed.
> > >
> > > When we upload a new document, we want to recalculate the more like this
> > > suggestions and a tf-idf based tag cloud. Both of them are delivered by
> > > the MoreLikeThisHandler, but only for one document, as you wrote.
> > >
> > > The text input is not good for us because we need the similar doc list
> > > for each of the matched documents. If I put together the text of 10
> > > documents, I cannot separate which suggestion relates to which matched
> > > document, and the tag cloud will also belong to the mixed text.
> > >
> > > Most likely we will call the MoreLikeThisHandler for each of the
> > > documents, parse the JSON response and store the result in an SQL
> > > database.
> > >
> > > Thanks for your help.
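
The per-document loop described above could look roughly like this. It is a
sketch under several assumptions: the handler URL is the one quoted elsewhere
in this thread, SQLite stands in for whichever database is actually used, the
table and column names are made up, and the JSON layout (similar docs under
response.docs, interesting terms as a flat [term, boost, ...] list or as a
map) should be checked against a real response before relying on it.

```python
import json
import sqlite3
import urllib.parse
import urllib.request

MLT_URL = "http://localhost:8983/solr/bandwhu/mlt"  # handler URL from the thread


def fetch_mlt(doc_id):
    """Call the MLT handler for one document and return the decoded JSON."""
    params = urllib.parse.urlencode(
        {"q": "id:{}".format(doc_id), "fl": "id", "wt": "json"})
    with urllib.request.urlopen("{}?{}".format(MLT_URL, params)) as resp:
        return json.loads(resp.read().decode("utf-8"))


def parse_mlt(payload):
    """Extract similar doc ids and (term, boost) pairs from a response."""
    similar = [doc["id"] for doc in payload.get("response", {}).get("docs", [])]
    raw = payload.get("interestingTerms", [])
    if isinstance(raw, dict):
        terms = list(raw.items())
    else:
        # Assumed flat layout: [term, boost, term, boost, ...]
        terms = list(zip(raw[0::2], raw[1::2]))
    return similar, terms


def store(conn, doc_id, similar, terms):
    """Replace the stored suggestions and tag cloud for one document."""
    conn.execute("CREATE TABLE IF NOT EXISTS similar_docs "
                 "(doc_id TEXT, pos INTEGER, similar_id TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS tag_cloud "
                 "(doc_id TEXT, term TEXT, boost REAL)")
    conn.execute("DELETE FROM similar_docs WHERE doc_id = ?", (doc_id,))
    conn.execute("DELETE FROM tag_cloud WHERE doc_id = ?", (doc_id,))
    conn.executemany(
        "INSERT INTO similar_docs VALUES (?, ?, ?)",
        [(doc_id, pos, sid) for pos, sid in enumerate(similar, start=1)])
    conn.executemany(
        "INSERT INTO tag_cloud VALUES (?, ?, ?)",
        [(doc_id, term, boost) for term, boost in terms])
    conn.commit()


def run_batch(doc_ids, conn):
    """Nightly job: one MLT request per document, results written to the DB."""
    for doc_id in doc_ids:
        similar, terms = parse_mlt(fetch_mlt(doc_id))
        store(conn, str(doc_id), similar, terms)
```

Because store() deletes the old rows for a document before inserting, the job
can be re-run after each upload without leaving stale suggestions behind.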
> > >
> > > 2015-09-29 11:18 GMT+02:00 Alessandro Benedetti <benedetti.ale...@gmail.com>:
> > >
> > > > Hi Roland,
> > > > what is your exact requirement?
> > > > Do you basically want to build a "description" for a set of documents
> > > > and then find documents in the index similar to this description?
> > > >
> > > > By default, based on my experience (and on the code), this is the
> > > > entry point for the Lucene More Like This:
> > > >
> > > >
> > > > > org.apache.lucene.queries.mlt.MoreLikeThis
> > > > >
> > > > > /**
> > > > >  * Return a query that will return docs like the passed lucene document ID.
> > > > >  *
> > > > >  * @param docNum the documentID of the lucene doc to generate the 'More Like This" query for.
> > > > >  * @return a query that will return docs like the passed lucene document ID.
> > > > >  */
> > > > > public Query like(int docNum) throws IOException {
> > > > >   if (fieldNames == null) {
> > > > >     // gather list of valid fields from lucene
> > > > >     Collection<String> fields = MultiFields.getIndexedFields(ir);
> > > > >     fieldNames = fields.toArray(new String[fields.size()]);
> > > > >   }
> > > > >   return createQuery(retrieveTerms(docNum));
> > > > > }
> > > >
> > > > It means that talking about "documents" you can feed only one Solr doc.
> > > >
> > > > But you can also feed the MLT with simple text.
> > > >
> > > > So you should study your use case more closely and understand which
> > > > option fits better:
> > > >
> > > > 1) customising the MLT component, starting from Lucene
> > > >
> > > > 2) doing some processing client side and using the "text" similarity
> > > > feature.
> > > >
> > > > Cheers
> > > >
> > > >
> > > > 2015-09-29 10:05 GMT+01:00 Roland Szűcs <roland.sz...@bookandwalk.com>:
> > > >
> > > > > Hi all,
> > > > >
> > > > > Is it possible to feed multiple Solr ids to a MoreLikeThisHandler?
> > > > >
> > > > > <requestHandler name="/mlt" class="solr.MoreLikeThisHandler">
> > > > >   <lst name="defaults">
> > > > >     <str name="mlt.match.include">false</str>
> > > > >     <str name="mlt.interestingTerms">details</str>
> > > > >     <str name="mlt.fl">title,content</str>
> > > > >     <str name="mlt.minwl">4</str>
> > > > >     <str name="mlt.qf">title^12 content^1</str>
> > > > >     <str name="mlt.mintf">2</str>
> > > > >     <int name="mlt.count">10</int>
> > > > >     <str name="mlt.boost">true</str>
> > > > >     <str name="wt">json</str>
> > > > >     <str name="indent">true</str>
> > > > >   </lst>
> > > > > </requestHandler>
> > > > >
> > > > > When I call this:
> > > > > http://localhost:8983/solr/bandwhu/mlt?q=id:8&fl=id
> > > > > it works fine. Is there any way to make a kind of "bulk" call to the
> > > > > more like this handler? I need the interesting terms as well, and as
> > > > > far as I know, if I use more like this as a search component it does
> > > > > not return them, so it is not an alternative.
> > > > >
> > > > > Thanks in advance,
> > > > >
> > > > >
> > > > > --
> > > > > Roland Szűcs
> > > > > CEO, Bookandwalk.hu <https://bookandwalk.hu/>
> > > > > Phone: +36 1 210 81 13
> > > > > Connect with me on LinkedIn:
> > > > > <https://www.linkedin.com/pub/roland-sz%C5%B1cs/28/226/24/hu>
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > --------------------------
> > > >
> > > > Benedetti Alessandro
> > > > Visiting card - http://about.me/alessandro_benedetti
> > > > Blog - http://alexbenedetti.blogspot.co.uk
> > > >
> > > > "Tyger, tyger burning bright
> > > > In the forests of the night,
> > > > What immortal hand or eye
> > > > Could frame thy fearful symmetry?"
> > > >
> > > > William Blake - Songs of Experience -1794 England
> > > >
> > >
> > >
> > >
> >
> 
> 
> 
