Re: LTR original score feature

2018-01-29 Thread Michael Alcorn
> It seems to me that the original score feature is not useful because it is
> not normalized across all queries and therefore cannot be used to compare
> relevance in different queries.

I don't agree with this statement, and it's not what Alessandro was
suggesting ("When you put the original score together with the rest of the
features, it may be of potential use."). The magnitude of the score could
very well contain useful information in certain contexts. The simplest way
to determine whether or not the score is useful is to train and test the
model with and without the feature included and see which one performs
better.
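
For anyone who wants to run that ablation with RankLib-style (LETOR-format)
training data, here is a minimal sketch. It assumes feature 1 holds the
original score; the feature index and file names are hypothetical.

def drop_feature(in_path, out_path, feature_id=1):
    # Copy a LETOR-format file, removing one "id:value" feature column.
    # RankLib treats a missing feature as zero, so the remaining feature
    # IDs do not need renumbering.
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            data, sep, comment = line.rstrip("\n").partition("#")
            kept = [tok for tok in data.split()
                    if not tok.startswith("%d:" % feature_id)]
            fout.write(" ".join(kept) + (" #" + comment if sep else "") + "\n")

drop_feature("train.txt", "train_no_score.txt")
# Train a model (e.g., RankLib's LambdaMART) on each file and compare a
# held-out metric such as NDCG@10; keep whichever variant performs better.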

On Thu, Jan 25, 2018 at 3:41 PM, Brian Yee  wrote:

> Thanks for the reply Alessandro. I'm starting to agree with you but I
> wanted to see if others agree. It seems to me that the original score
> feature is not useful because it is not normalized across all queries and
> therefore cannot be used to compare relevance in different queries.
>
> -Original Message-
> From: alessandro.benedetti [mailto:a.benede...@sease.io]
> Sent: Wednesday, January 24, 2018 10:22 AM
> To: solr-user@lucene.apache.org
> Subject: Re: LTR original score feature
>
> This is actually an interesting point.
> The original Solr score alone means nothing; the ranking position of the
> document would be a more relevant feature at that stage.
>
> When you put the original score together with the rest of the features
> (number of query terms, tf for a specific field, idf for another field,
> ...), it may be of potential use, partly because some training algorithms
> group the training samples by query.
>
> Personally, I am starting to believe it would be better to decompose the
> original score into finer-grained features and then rely on LTR to weight
> them, as the original score is effectively already mixing finer-grained
> features according to a standard formula (see the sketch after this
> message).
>
> -----
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
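
To make Alessandro's decomposition concrete: the stock LTR feature classes
can express it directly, keeping the raw score as one feature among many
via OriginalScoreFeature and adding finer-grained per-field features next
to it. A minimal sketch in Python; the collection, store, field, and
feature names are hypothetical, and the upload endpoint is the managed
feature store described in the Solr LTR docs.

import json
import requests

features = [
    # The raw Solr score, kept as one feature among many.
    {"store": "myFeatureStore", "name": "original_score",
     "class": "org.apache.solr.ltr.feature.OriginalScoreFeature",
     "params": {}},
    # Finer-grained per-field match scores for the model to weight.
    {"store": "myFeatureStore", "name": "title_match",
     "class": "org.apache.solr.ltr.feature.SolrFeature",
     "params": {"q": "{!dismax qf=title}${user_query}"}},
    {"store": "myFeatureStore", "name": "description_match",
     "class": "org.apache.solr.ltr.feature.SolrFeature",
     "params": {"q": "{!dismax qf=description}${user_query}"}},
]

# Upload the feature definitions to the managed feature store.
resp = requests.put(
    "http://localhost:8983/solr/mycollection/schema/feature-store",
    data=json.dumps(features),
    headers={"Content-Type": "application/json"},
)
print(resp.status_code)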


Re: LTR original score feature

2018-01-12 Thread Michael Alcorn
What you're suggesting is that there's a "nonlinear relationship" between
the original score (the input variable) and some measure of "relevance"
(the output variable). Nonlinear models like decision trees (which include
LambdaMART) and neural networks (which include RankNet) can handle these
types of situations, assuming there's enough data. The nonlinear phenomena
you brought up are also probably part of the reason why pairwise models
tend to perform better than pointwise models in learning to rank tasks.
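
As a toy illustration of why a nonlinear, per-query model can cope with
unnormalized scores, here is a sketch using LightGBM's lambdarank objective
(my choice of library is an assumption; the thread itself uses RankLib).
Two queries get candidate scores on wildly different scales, yet the
grouped, pairwise-style objective only ever compares documents within the
same query.

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
# Two queries whose raw Solr scores live on very different scales.
scores = np.concatenate([rng.uniform(3000.0, 3300.0, 50),
                         rng.uniform(30.0, 33.0, 50)])
X = scores.reshape(-1, 1)
# Graded relevance labels (0-3) correlated with within-query rank.
ranks = np.concatenate([scores[:50].argsort().argsort(),
                        scores[50:].argsort().argsort()])
y = ranks // 13
group = [50, 50]  # 50 candidate documents per query

ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50)
ranker.fit(X, y, group=group)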

On Fri, Jan 12, 2018 at 1:52 PM, Brian Yee  wrote:

> I wanted to get some opinions on using the original score feature. The
> original score produced by Solr is intuitively a very important feature. In
> my data set I'm seeing that the original score varies wildly between
> different queries. This makes sense since the score generated by Solr is
> not normalized across all queries. However, won't this mess with our
> training data? If this feature is 3269.4 for the top result for one query,
> and then 32.7 for the top result for another query, it does not mean that
> the first document was 10x more relevant to its query than the second
> document. I am using a normalize param within Ranklib, but that only
> normalizes features between each other, not within one feature, right? How
> are people handling this? Am I missing something?
>
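
Separately from the model choice, a common way to handle the cross-query
scale issue Brian describes is query-level normalization: rescale each
feature within each query's candidate list before training. A minimal
sketch of per-query min-max scaling (the 3269.4 and 32.7 values echo the
example above; the others are made up):

from collections import defaultdict

# (query_id, original_score) rows.
rows = [("q1", 3269.4), ("q1", 1012.0), ("q2", 32.7), ("q2", 3.1)]

by_query = defaultdict(list)
for qid, score in rows:
    by_query[qid].append(score)

def minmax(qid, score):
    # Scale a score into [0, 1] relative to its own query's candidates.
    lo, hi = min(by_query[qid]), max(by_query[qid])
    return 0.0 if hi == lo else (score - lo) / (hi - lo)

scaled = [(qid, minmax(qid, s)) for qid, s in rows]
print(scaled)  # the top result of each query now maps to 1.0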


Re: Solr LTR plugin - Training

2017-11-16 Thread Michael Alcorn
Hi,

Not sure if this is your issue or not, but the FieldQParser automatically
converts multi-term arguments to phrases, so you might have to switch to
the DisMaxQParser. I talk a little bit more about it here.
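
For illustration, with a hypothetical title field: {!field f=title}bath
towel parses to the phrase query title:"bath towel" (documents must contain
the exact phrase), while {!dismax qf=title}bath towel matches title:bath
and title:towel as individual terms. The exact parsed form depends on your
schema.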

-Michael

On Thu, Nov 16, 2017 at 5:51 AM, ilay  wrote:

> Hi,
>
>  I am using clickstream data for training LTR. Training data is like:
>
> 50888522    bath towel    4.1212012426088345
> 51779533    bath towel    3.9428197899308484
> 16851137    bath towel    3.488605518893958
> ...
>
> When I start training, it fires a query to Solr that looks for an exact
> phrase match in the title, brand, category, etc., which are configured as
> features. Why is it not looking for a term match here?
>
> q=id:"50888522"&fl=id,score,[features+store%3DmySearchFeatureStore+efi.user_query%3Dbath towel]
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


From Zero to Learning to Rank in Apache Solr

2017-11-02 Thread Michael Alcorn
Here's a tutorial I wrote that some of you all might find useful:
https://github.com/airalcorn2/Solr-LTR. Feedback is welcome.

Thanks,
Michael A. Alcorn


How to Efficiently Extract Learning to Rank Similarity Features From Solr?

2017-10-23 Thread Michael Alcorn
Hi,

I'm trying to extract several similarity measures from Solr for use in a
learning to rank model. Doing this mathematically involves taking the dot
product of several different matrices, which is extremely fast for non-huge
data sets (e.g., millions of documents and queries). However, to extract
these similarity features from Solr, I have to perform a Solr query for
each query, which introduces several bottlenecks. Are there more efficient
means of computing these similarity measures for large numbers of queries
(other than increased parallelism)?
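
For reference, the per-query extraction loop being described looks roughly
like the sketch below, with a thread pool standing in for the "increased
parallelism" option; the host, collection, store, and efi names are
hypothetical.

from concurrent.futures import ThreadPoolExecutor
import requests

SOLR = "http://localhost:8983/solr/mycollection/query"

def extract_features(query):
    # One Solr request per query: the bottleneck in question.
    params = {
        "q": query,
        "fl": "id,score,[features store=myFeatureStore efi.user_query='%s']" % query,
        "rows": 10,
    }
    return requests.get(SOLR, params=params).json()

queries = ["bath towel", "fiber channel"]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(extract_features, queries))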

Thanks,
Michael A. Alcorn


Re: Strange Behavior When Extracting Features

2017-10-16 Thread Michael Alcorn
If anyone else is following this thread, I replied on the Jira.

On Mon, Oct 16, 2017 at 4:07 AM, alessandro.benedetti 
wrote:

> This is interesting: the EFI parameter resolution should work using the
> quotes independently of the query parser.
> At that point, the query parsers (both) receive a multi-term text.
> Both of them should work the same.
> At the time I saw the mail, I tried to reproduce it through the LTR module
> tests and I didn't succeed.
> It would be quite useful if you could contribute a test that fails with
> the field query parser.
> Have you tried just with the same query, but in a request handler?
>
> -----
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Strange Behavior When Extracting Features

2017-10-13 Thread Michael Alcorn
I believe I've discovered a workaround. If you use:

{
  "store": "redhat_efi_feature_store",
  "name": "case_description_issue_tfidf",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params": {
    "q": "{!dismax qf=text_tfidf}${text}"
  }
}

instead of:

{
  "store": "redhat_efi_feature_store",
  "name": "case_description_issue_tfidf",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params": {
    "q": "{!field f=issue_tfidf}${case_description}"
  }
}

you can then use single quotes to incorporate multi-term arguments as
Alessandro suggested. I've added this information to the Jira.
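
With the dismax version of the feature, a multi-term value can then be
passed in single quotes, where the efi name matches the ${text} placeholder
in the feature definition. A sketch, reusing the model name from the
original report:

rq={!ltr model=redhat_efi_model reRankDocs=1 efi.text='added couple of fiber channel'}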

On Fri, Sep 22, 2017 at 8:30 AM, alessandro.benedetti 
wrote:

> I think this has nothing to do with the LTR plugin.
> The problem here should be just the way you use the local params;
> to properly pass multi-term local params in Solr you need to use single
> quotes:
>
> efi.case_description='added couple of fiber channel'
>
> This should work.
> If not, only the first term will be passed as a local param and then
> passed in the efi map to LTR.
>
> I will update the Jira issue as well.
>
> Cheers
>
> -----
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Parsing of rq queries in LTR

2017-10-13 Thread Michael Alcorn
I believe I've discovered a workaround. If you use:

{
  "store": "redhat_efi_feature_store",
  "name": "case_description_issue_tfidf",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params": {
    "q": "{!dismax qf=text_tfidf}${text}"
  }
}

instead of:

{
  "store": "redhat_efi_feature_store",
  "name": "case_description_issue_tfidf",
  "class": "org.apache.solr.ltr.feature.SolrFeature",
  "params": {
    "q": "{!field f=issue_tfidf}${case_description}"
  }
}

you can then use single quotes to incorporate multi-term arguments as
Alessandro suggested. I've added this information to the Jira.

On Thu, Oct 12, 2017 at 9:10 AM, Michael Alcorn <malc...@redhat.com> wrote:

> It turns out my last comment on that Jira was mistaken. Multi-term EFI
> arguments still exhibit unexpected behavior. Binoy is trying to help me
> figure out what the issue is. I plan on updating the Jira once we've
> figured out the problem.
>
> On Thu, Oct 12, 2017 at 3:41 AM, alessandro.benedetti <
> a.benede...@sease.io> wrote:
>
>> I don't think this is actually that much related to the LTR SolrFeature.
>> In the Solr feature I see you specify a query with a specific query parser
>> (field).
>> Unless there is a bug in the SolrFeature for LTR, I expect the query
>> parser you defined to be used [1].
>>
>> This means :
>>
>> "rawquerystring":"{!field f=full_name}alessandro benedetti",
>> "querystring":"{!field f=full_name}alessandro benedetti",
>> "parsedquery":"PhraseQuery(full_name:\"alessandro benedetti\")",
>> "parsedquery_toString":"full_name:\"alessandro benedetti\"",
>>
>> In relation to multi-term EFI, you need to pass
>> efi.example='term1 term2'.
>> If not, just one term will be passed as the EFI [2].
>> This is more likely to be your problem.
>> I don't think the dash should be relevant at all.
>>
>> [1]
>> https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-FieldQueryParser
>> [2] https://issues.apache.org/jira/browse/SOLR-11386
>>
>> -----
>> Alessandro Benedetti
>> Search Consultant, R&D Software Engineer, Director
>> Sease Ltd. - www.sease.io
>> --
>> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>>
>
>


Re: Parsing of rq queries in LTR

2017-10-12 Thread Michael Alcorn
It turns out my last comment on that Jira was mistaken. Multi-term EFI
arguments still exhibit unexpected behavior. Binoy is trying to help me
figure out what the issue is. I plan on updating the Jira once we've
figured out the problem.

On Thu, Oct 12, 2017 at 3:41 AM, alessandro.benedetti 
wrote:

> I don't think this is actually that much related to the LTR SolrFeature.
> In the Solr feature I see you specify a query with a specific query parser
> (field).
> Unless there is a bug in the SolrFeature for LTR, I expect the query
> parser you defined to be used [1].
>
> This means :
>
> "rawquerystring":"{!field f=full_name}alessandro benedetti",
> "querystring":"{!field f=full_name}alessandro benedetti",
> "parsedquery":"PhraseQuery(full_name:\"alessandro benedetti\")",
> "parsedquery_toString":"full_name:\"alessandro benedetti\"",
>
> In relation to multi-term EFI, you need to pass
> efi.example='term1 term2'.
> If not, just one term will be passed as the EFI [2].
> This is more likely to be your problem.
> I don't think the dash should be relevant at all.
>
> [1]
> https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-FieldQueryParser
> [2] https://issues.apache.org/jira/browse/SOLR-11386
>
> -----
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> Sease Ltd. - www.sease.io
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Strange Behavior When Extracting Features

2017-09-20 Thread Michael Alcorn
Hi all,

I'm getting some extremely strange behavior when trying to extract features
for a learning to rank model. The following query incorrectly says all
features have zero values:

http://gss-test-fusion.usersys.redhat.com:8983/solr/access/query?q=added
couple of fiber channel&rq={!ltr model=redhat_efi_model reRankDocs=1
efi.case_summary=the efi.case_description=added couple of fiber channel
efi.case_issue=the efi.case_environment=the}&fl=id,score,[features]&rows=10

But this query, which simply moves the word "added" from the front of the
provided text to the back, properly fills in the feature values:

http://gss-test-fusion.usersys.redhat.com:8983/solr/access/query?q=couple
of fiber channel added&rq={!ltr model=redhat_efi_model reRankDocs=1
efi.case_summary=the efi.case_description=couple of fiber channel added
efi.case_issue=the efi.case_environment=the}&fl=id,score,[features]&rows=10

The explain output for the failing query can be found here:

https://gist.github.com/manisnesan/18a8f1804f29b1b62ebfae1211f38cc4

and the explain output for the properly functioning query can be found here:

https://gist.github.com/manisnesan/47685a561605e2229434b38aed11cc65

Have any of you run into this issue? Seems like it could be a bug.

Thanks,
Michael A. Alcorn


Per Text Field Similarity Measures for Learning to Rank

2017-08-04 Thread Michael Alcorn
Hi all,

I recently prototyped a learning to rank system in Python that produced
promising results, so I'm now looking into how to replicate that process in
our Solr setup. For my Python implementation, I was using a number of
features that were per-field text comparisons, e.g.:

   1. tfidf_case_title_solution_title
   2. tfidf_case_description_solution_title
   3. ...
   4. bm25_case_title_solution_description
   5. bm25_case_description_solution_description

where each solution field had its own independent index. I was wondering if
any of you all had recommendations on how to do that type of thing in Solr.
It looks like the SolrFeature class might be the way to go, but my
colleagues who are more familiar with Solr than I am weren't sure it was
possible.
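
For what it's worth, SolrFeature does look like the way to go here: one
feature per (case field, solution field, similarity) combination, assuming
each solution field is indexed once per similarity type (e.g., *_tfidf and
*_bm25 copies whose field types configure different similarities in the
schema). A sketch with hypothetical store and field names:

import json

# Features mirroring the Python prototype's per-field comparisons; upload
# these to the feature store as a JSON list (see the Solr LTR docs).
features = [
    {"store": "myFeatureStore", "name": "tfidf_case_title_solution_title",
     "class": "org.apache.solr.ltr.feature.SolrFeature",
     "params": {"q": "{!dismax qf=solution_title_tfidf}${case_title}"}},
    {"store": "myFeatureStore", "name": "bm25_case_description_solution_description",
     "class": "org.apache.solr.ltr.feature.SolrFeature",
     "params": {"q": "{!dismax qf=solution_description_bm25}${case_description}"}},
]
print(json.dumps(features, indent=2))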

Thanks,
Michael A. Alcorn