Hi Upaya, thanks for the explanation. I had actually already done some investigation (my starting point was http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/ ) and then I took a look at the code.
I was just wondering what the community thinks about including/providing numerical similarity (approaches, ideas, possible existing solutions). Customisation should be the last resort if anything is already available.
Thanks for the support anyway!

Cheers

2015-09-25 12:47 GMT+01:00 Upayavira <u...@odoko.co.uk>:

> Alessandro,
>
> I'd suggest you review the code of the MoreLikeThisHandler. It is a
> little knotty, but it would be worth your while understanding what is
> going on there.
>
> Basically, there are three phases:
>
> phase #1: parse the source document into a list of terms (avoided if
> term vectors enabled and source doc is in index)
> phase #2: calculate a score for each of these terms and select the n
> highest scoring ones (default 25)
> phase #3: build and execute a boolean query using these 25 terms
>
> Phase #2 uses a TF/IDF-like approach to calculate the scores for those
> "interesting terms".
>
> Once you understand what MLT is doing, you will probably not find it so
> hard to create your own version which is better suited to your own
> use-case.
>
> Of course, this would probably be better constructed as a QueryParser
> rather than a request handler, but that's a detail.
>
> Upayavira
>
> On Fri, Sep 25, 2015, at 11:08 AM, Alessandro Benedetti wrote:
> > Hi guys,
> > I was just investigating how to include numeric fields in the MLT
> > calculations.
> >
> > As we know, we currently build a smart Lucene query based on the
> > input document (the one to find similar documents for) and run this
> > query to obtain the similar docs.
> > Because MLT is currently built on TF/IDF, it is mainly designed for
> > textual fields.
> > What if we want to include a numeric factor in the similarity
> > calculation?
> >
> > e.g.
> > Solr Document (Hotel)
> > mlt.fl=description,stars,trip_advisor_rating
> >
> > To find the similarity based not only on the description, but also on the
> > numeric fields (stars and rating).
> >
> > The first thought I had is to add support for boosting functions.
> > That way we are more flexible and can add as many functions as we
> > want.
> >
> > For example, adding:
> > bf=div(1,dist(2,seedDocumentRatingA,seedDocumentRatingB,ratingA,ratingB))
> >
> > Other kinds of functions can be applied as well.
> > What do you think? Do you have any alternative ideas?
> >
> > Cheers
> > --
> > --------------------------
> >
> > Benedetti Alessandro
> > Visiting card : http://about.me/alessandro_benedetti
> >
> > "Tyger, tyger burning bright
> > In the forests of the night,
> > What immortal hand or eye
> > Could frame thy fearful symmetry?"
> >
> > William Blake - Songs of Experience -1794 England
>

--
--------------------------

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England
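
P.S. To make the bf idea from my original mail a bit more concrete, here is a rough Python sketch of what such a boost would compute. It is only an illustration, not Solr code: the helper names (`dist`, `numeric_boost`) and the hotel values are hypothetical, and `dist` just mimics the argument order of the `dist(2, seedA, seedB, ratingA, ratingB)` call above (first half of the coordinates is one point, second half the other). One deliberate difference: I use 1 / (1 + distance) rather than a plain div(1, dist(...)), as an assumption on my side, so that a document with exactly the seed's values does not cause a division by zero.

```python
def dist(power, *coords):
    """Minkowski distance between two points passed as flattened coordinates.

    The first half of coords is point A, the second half is point B.
    power=2 gives the Euclidean distance, power=1 the Manhattan distance.
    """
    if len(coords) % 2 != 0:
        raise ValueError("both points need the same number of coordinates")
    half = len(coords) // 2
    point_a, point_b = coords[:half], coords[half:]
    total = sum(abs(a - b) ** power for a, b in zip(point_a, point_b))
    return total ** (1.0 / power)


def numeric_boost(seed, candidate, power=2):
    """Boost in (0, 1]: 1.0 for identical numeric values, smaller as they diverge.

    The +1 in the denominator is an assumption added here to avoid dividing
    by zero when the candidate has exactly the seed's values.
    """
    return 1.0 / (1.0 + dist(power, *seed, *candidate))


# Hypothetical (stars, trip_advisor_rating) values, for illustration only.
seed = (4, 4.5)
hotels = {
    "hotel_a": (4, 4.5),
    "hotel_b": (3, 4.0),
    "hotel_c": (1, 2.0),
}

# Rank candidates by numeric closeness to the seed document.
ranked = sorted(hotels, key=lambda h: numeric_boost(seed, hotels[h]), reverse=True)
print(ranked)
```

In a real setup the two fields are on different scales, so they would probably need normalising before the distance means much. On the Solr side, I believe the 1 / (1 + d) shape could be written as recip(dist(...),1,1,1), but I have not tried it inside MLT.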