Re: More Like This on numeric fields - BF accepted by MLT handler

2015-09-28 Thread Upayavira
You could use the MLT query parser, and combine that with other queries,
whether as filters or boosts.

You can't use stream.body with the query parser yet, so you would need to
use the handler if you need that.
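
As a rough illustration of combining the MLT query parser with a filter and a
boost, the sketch below only builds the request parameters. The document id
"hotel-42", the filter range and the boost function are made-up values, the
field names are borrowed from the hotel example in this thread, and the
wrapping in the boost parser is an untested assumption about how one might
wire it up, not a confirmed recipe.

```python
from urllib.parse import urlencode, parse_qs


def mlt_params(doc_id, text_field, filter_query=None, boost_function=None, rows=10):
    """Build Solr request parameters for an MLT query-parser request.

    All arguments are illustrative inputs; nothing here comes from a real schema.
    """
    # The MLT query parser takes the id of the seed document as its query text.
    mlt_query = "{!mlt qf=%s}%s" % (text_field, doc_id)
    params = {"rows": str(rows)}
    if boost_function:
        # Wrap the MLT query in the boost parser; $b and $mltq are
        # dereferenced from the request parameters of the same name.
        params["q"] = "{!boost b=$b v=$mltq}"
        params["mltq"] = mlt_query
        params["b"] = boost_function
    else:
        params["q"] = mlt_query
    if filter_query:
        # A filter restricts the candidates without affecting the score.
        params["fq"] = filter_query
    return params


# Hypothetical request: hotels similar to "hotel-42", 3 to 5 stars only,
# boosted towards candidates whose star count is close to 4.
params = mlt_params(
    "hotel-42",
    "description",
    filter_query="stars:[3 TO 5]",
    boost_function="recip(abs(sub(stars,4)),1,1,1)",
)
query_string = urlencode(params)
```

The resulting query_string would be appended to a /select URL; whether the
boost wrapper behaves as intended should be checked against your own Solr
version.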

Upayavira



Re: More Like This on numeric fields - BF accepted by MLT handler

2015-09-28 Thread Alessandro Benedetti
Hi Upaya,
thanks for the explanation. I had actually already done some investigation
into it ( my starting point was:
http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/ ) and
then I took a look at the code.

I was just wondering what the community thinks about
including/providing numerical similarity (approaches, ideas, possible
existing solutions).
Customisation should be the last step if anything is already available.

Thanks for the support anyway!

Cheers




-- 
--

Benedetti Alessandro
Visiting card - http://about.me/alessandro_benedetti
Blog - http://alexbenedetti.blogspot.co.uk

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


Re: More Like This on numeric fields - BF accepted by MLT handler

2015-09-25 Thread Upayavira
Alessandro,

I'd suggest you review the code of the MoreLikeThisHandler. It is a
little knotty, but it would be worth your while understanding what is
going on there.

Basically, there are three phases:

phase #1: parse the source document into a list of terms (avoided if
term vectors enabled and source doc is in index)
phase #2: calculate a score for each of these terms and select the n
highest scoring ones (default 25)
phase #3: build and execute a boolean query using these n terms

Phase #2 uses a TF/IDF-like approach to calculate the scores for those
"interesting terms".
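
The three phases above can be sketched in a few lines of Python. This is a
simplified illustration, not the MoreLikeThisHandler code: the tokeniser, the
exact idf formula, the doc_freq table and the sample text are all assumptions
made up for the example.

```python
import math
import re
from collections import Counter


def interesting_terms(source_text, doc_freq, num_docs, max_terms=25):
    """Phases 1 and 2: tokenise the source text and keep the top-scoring terms.

    doc_freq maps a term to the number of indexed documents containing it.
    """
    # Phase 1: parse the source document into a bag of terms.
    term_freq = Counter(re.findall(r"[a-z0-9]+", source_text.lower()))
    scored = []
    for term, tf in term_freq.items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # term is not in the index, so it cannot match anything
        # Phase 2: a TF/IDF-like score (assumed formula, for illustration only).
        idf = 1.0 + math.log(num_docs / (df + 1.0))
        scored.append((tf * idf, term))
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:max_terms]


def build_boolean_query(field, scored_terms):
    """Phase 3: OR the selected terms together, boosting each by its score."""
    return " OR ".join(
        "%s:%s^%.2f" % (field, term, score) for score, term in scored_terms
    )


# Hypothetical index statistics and source text.
doc_freq = {"hotel": 900, "spa": 40, "rooftop": 5, "the": 1000}
terms = interesting_terms(
    "The rooftop spa hotel with the best spa", doc_freq, num_docs=1000, max_terms=2
)
query = build_boolean_query("description", terms)
```

With these made-up statistics the rare terms "spa" and "rooftop" win over the
very common "the" and "hotel", which is the behaviour phase #2 is after.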

Once you understand what MLT is doing, you will probably not find it so
hard to create your own version which is better suited to your own
use-case.

Of course, this would probably be better constructed as a QueryParser
rather than a request handler, but that's a detail.

Upayavira

On Fri, Sep 25, 2015, at 11:08 AM, Alessandro Benedetti wrote:
> Hi guys,
> I was just investigating how to include numeric fields in the
> MLT calculations.
> 
> As we know, we currently build a smart Lucene query based on the
> input document (the one to find similar documents for) and run this
> query to obtain the similar docs.
> Because MLT is currently built on TF/IDF, it is mainly intended for
> textual fields.
> What if we want to include a numeric factor in the similarity
> calculation?
> 
> e.g.
> Solr Document ( Hotel)
> mlt.fl=description,stars,trip_advisor_rating
> 
> To find the similarity based not only on the description, but also on the
> numeric fields (stars and rating).
> 
> The first thought I had is to add support for boosting functions.
> This way we are more flexible and can add as many functions as we
> want.
> 
> For example adding :
> bf=div(1,dist(2,seedDocumentRatingA,seedDocumentRatingB,ratingA,ratingB))
> 
> Other kinds of functions can also be applied.
> What do you think? Do you have any alternative ideas?
> 
> Cheers
> -- 
> --
> 
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
> 
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
> 
> William Blake - Songs of Experience -1794 England
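
For reference, the arithmetic behind the boosting function proposed in the
quoted message, div(1,dist(2,...)), can be sketched as below. The helper names
and the sample ratings are hypothetical. The smoothing term is an added
assumption, not part of the original proposal: without it the boost is
undefined when the candidate has exactly the same ratings as the seed
document, because the distance is zero.

```python
def dist(power, *coords):
    """p-norm distance between two points passed as flat coordinates,
    in the same argument order as the dist() call in the quoted example."""
    half = len(coords) // 2
    a, b = coords[:half], coords[half:]
    return sum(abs(x - y) ** power for x, y in zip(a, b)) ** (1.0 / power)


def numeric_boost(seed, candidate, smoothing=1.0):
    """Boost that shrinks as the candidate's ratings move away from the seed's.

    smoothing is an assumed constant that keeps the result finite when the
    distance is zero; the quoted bf uses a plain 1/distance.
    """
    return 1.0 / (smoothing + dist(2, *seed, *candidate))


# Hypothetical (stars, trip_advisor_rating) pairs.
seed = (4, 4.5)
near = numeric_boost(seed, (4, 4.0))
far = numeric_boost(seed, (1, 0.5))
same = numeric_boost(seed, seed)
```

A candidate with identical ratings gets the maximum boost, and the boost
decays towards zero as the Euclidean distance grows.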