Re: Text search NGram

Emir Arnautovic Wed, 16 Mar 2016 03:12:08 -0700

Hi Rajesh,

It seems that the title lengths do not differ enough to produce different fieldNorm values - it is 0.5 for all titles, so all documents matching the exact-match query get the same score.

The query with "Ofice" returns the wrong document first because of its fieldNorm=1.0 - it seems to me that this document was not reindexed after setting omitNorms=false.

I also noticed that the ngram field in the schema is a bit different from the one in the mail - it has maxGramSize="800". That does not change the explanation, but the results are easier to understand when max=min.


HTH,
Emir

On 16.03.2016 10:31, G, Rajesh wrote:

Hi Emir,

Yes I have re-indexed after setting omitNorms to false. Attached is the result 
of the query in debug mode.

I am using LuceneQParser

Thanks
Rajesh



Corporate Executive Board India Private Limited. Registration No: 
U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF Building 
No.10 DLF Cyber City, Gurgaon, Haryana-122002, India.

This e-mail and/or its attachments are intended only for the use of the 
addressee(s) and may contain confidential and legally privileged information 
belonging to CEB and/or its subsidiaries, including CEB subsidiaries that offer 
SHL Talent Measurement products and services. If you have received this e-mail 
in error, please notify the sender and immediately, destroy all copies of this 
email and its attachments. The publication, copying, in whole or in part, or 
use or dissemination in any other way of this e-mail and attachments by anyone 
other than the intended person(s) is prohibited.

-----Original Message-----
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
Sent: Wednesday, March 16, 2016 2:39 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

Hi Rajesh,
Did you reindex after setting omitNorms to false? Can you send the results with 
debug for the first query?

What query parser are you using for these queries? You should run your queries 
with debug=true and see how they are rewritten - that should explain why some 
queries do not return the expected documents. If you have trouble understanding 
why a document is not returned, you can post the response to this thread.
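
As a rough illustration of running a query with debug enabled, the sketch below builds a select URL with debug=true. The host, port, and core name (technologies) are assumptions for the example, not taken from this thread.

```python
from urllib.parse import urlencode


def build_debug_url(base_url, query, rows=10):
    """Return a Solr select URL that asks for debug output.

    base_url is a hypothetical core URL; adjust it to your own setup.
    """
    params = {
        "q": query,        # the query exactly as typed, URL-encoded below
        "rows": rows,
        "wt": "json",
        "debug": "true",   # adds parsed query and score explanations to the response
    }
    return base_url + "/select?" + urlencode(params)


# Hypothetical host and core name, used only for illustration.
url = build_debug_url(
    "http://localhost:8983/solr/technologies",
    "title:(Microsoft Ofice 365)",
)
print(url)
```

In the debug section of the response, the parsed query shows how the query was rewritten and the explain entries show the per-document score breakdown, including fieldNorm.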

Thanks,
Emir

On 16.03.2016 09:30, G, Rajesh wrote:

Hi Emir,

The solution we want to implement is to show the top 100 best-matching technology 
names from the list of technology names we have. Whatever technology names the user 
types first go to SQL Server, where an exact match is attempted if 
possible [name==name]; only those that do not match exactly [spelling mistakes, 
jumbled words] are searched in SOLR.

With the below setup, if I query title:(Microsoft Ofice 365) I get the
below result [note: the scores are the same?]
{
          "title":"Lync - Microsoft Office 365",
          "score":7.7472024
},
{
          "title":"Microsoft Office 365",
          "score":7.7472024
},
When I query title:(Microsoft Ofice 365) OR title_ws:(Microsoft Ofice 365)
{
          "title":"RIM BlackBerry Enterprise Server (BES) for Microsoft Office 365 1.0",
          "title_ws":"RIM BlackBerry Enterprise Server (BES) for Microsoft Office 365 1.0",
          "score":3.9297152
},
        {
          "title":"Microsoft Office 365",
          "title_ws":"Microsoft Office 365",
          "score":3.1437721
}

When I query title:(Microsoft Ofice 365) OR title_ws:(Microsoft Ofice 365)
with qf=title_ws^1, I don't get any results.

The expected result is
{
          "title":"Microsoft Office 365",
          "title_ws":"Microsoft Office 365"
},
{
          "title":"Microsoft Office 365 1.0",
          "title_ws":"Microsoft Office 365 1.0"
},
{
          "title":"Microsoft Office 365 14.0",
          "title_ws":"Microsoft Office 365 14.0"
},
{
          "title":"Microsoft Office 365 14.3",
          "title_ws":"Microsoft Office 365 14.3"
},
{
          "title":"Microsoft Office 365 14.4",
          "title_ws":"Microsoft Office 365 14.4"
}

<fieldType name="txt_token_ng" class="solr.TextField" positionIncrementGap="0" omitNorms="false">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="2"/>
    </analyzer>
</fieldType>

<field name="id" type="int" indexed="true" stored="true" required="true" multiValued="false"/>
<field name="title" type="txt_token_ng" indexed="true" stored="true" multiValued="false"/>
<field name="manufacturername" type="txt_token_ng" indexed="true" stored="true" multiValued="false"/>
<field name="productname" type="txt_token_ng" indexed="true" stored="true" multiValued="false"/>
<field name="version" type="txt_token_ng" indexed="true" stored="true" multiValued="false"/>

<fieldType name="txt_token_ws" class="solr.TextField" positionIncrementGap="0" omitNorms="false">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>

<field name="title_ws" type="txt_token_ws" indexed="true" stored="true" multiValued="false"/>
<field name="manufacturername_ws" type="txt_token_ws" indexed="true" stored="true" multiValued="false"/>
<field name="productname_ws" type="txt_token_ws" indexed="true" stored="true" multiValued="false"/>
<field name="version_ws" type="txt_token_ws" indexed="true" stored="true" multiValued="false"/>

<copyField source="title" dest="title_ws"/>
<copyField source="manufacturername" dest="manufacturername_ws"/>
<copyField source="productname" dest="productname_ws"/>
<copyField source="version" dest="version_ws"/>

Thanks
Rajesh




-----Original Message-----
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
Sent: Monday, March 7, 2016 8:16 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

Not sure I understood the question. What I meant is that you should try setting 
omitNorms="false" on your txt_token field type if you want to stick with an 
ngram-only solution:

<fieldType name="txt_token" class="solr.TextField" positionIncrementGap="0" omitNorms="false">
    <analyzer type="index">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
    </analyzer>
    <analyzer type="query">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
    </analyzer>
</fieldType>


and add a new field type and field to keep a non-ngram version of the field.
Something like:

<fieldType name="txt_token_simple" class="solr.TextField" positionIncrementGap="0">
    <analyzer type="index">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>


and use copyField to copy to both fields, then query title:test OR 
title_simple:test.

Emir


On 07.03.2016 15:31, G, Rajesh wrote:

Hi Emir,

I have already applied <tokenizer class="solr.WhitespaceTokenizerFactory"/> and then
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>.
Is this what you wanted me to have in my config?

Thanks
Rajesh




-----Original Message-----
From: G, Rajesh [mailto:r...@cebglobal.com]
Sent: Monday, March 7, 2016 7:50 PM
To: solr-user@lucene.apache.org
Subject: RE: Text search NGram

Hi Emir,

Thanks for your email. Can you please help me understand what you mean by "e.g. 
boost if matching tokenized fields to make sure exact matches are ordered first"?




-----Original Message-----
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
Sent: Monday, March 7, 2016 7:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

Hi Rajesh,
It is most likely related to norms - you can try setting omitNorms="true" and 
reindexing the content. Anyway, it is not common to use only ngrams for matching content - in 
that case you can expect more unexpected ordering/results. You should combine ngram 
fields with normally tokenized fields (e.g. boost if matching tokenized fields to make 
sure exact matches are ordered first).
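
A minimal sketch of such a combined query, assuming a second, normally tokenized copy of the field exists. The field names (title, title_ws) and the boost value of 10 are illustrative assumptions, and the helper does not escape query syntax characters in the user text.

```python
def combined_query(user_text, ngram_field="title", exact_field="title_ws", exact_boost=10):
    """Build a Lucene-syntax query that searches the ngram field for recall
    and boosts matches on the normally tokenized field so they rank first.

    Field names and boost are assumptions for illustration only.
    """
    clause = "(" + user_text + ")"
    return f"{ngram_field}:{clause} OR {exact_field}:{clause}^{exact_boost}"


query = combined_query("Microsoft Office 365")
print(query)
```

The ngram clause still matches misspelled input, while documents whose whole tokens match also score on the boosted clause; the boost value usually needs tuning against real data.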

Regards,
Emir

On 07.03.2016 11:44, G, Rajesh wrote:

Hi Team,

We have the type below, and we have indexed the values "title": "Microsoft Visual Studio 2006" and 
"title": "Microsoft Visual Studio 8.0.61205.56 (2005)".

When I search for title:(Microsoft Visual AND Studio AND 2005) I get Microsoft 
Visual Studio 8.0.61205.56 (2005) as the second record and Microsoft Visual 
Studio 2006 as the first record. I want Microsoft Visual Studio 
8.0.61205.56 (2005) listed first, since the user has searched for Microsoft 
Visual Studio 2005. Can you please help?

We are using NGram so that it takes care of misspelled or jumbled
words [it works as expected], e.g.
searching Micrs Visual Studio gets Microsoft Visual Studio
searching Visual Microsoft Studio gets Microsoft Visual Studio

<fieldType name="txt_token" class="solr.TextField" positionIncrementGap="0">
    <analyzer type="index">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
    </analyzer>
    <analyzer type="query">
        <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="800"/>
    </analyzer>
</fieldType>
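
To see why an n-gram chain like this tolerates misspellings, the sketch below approximates whitespace tokenizing, lowercasing, and n-gram generation in plain Python. It is not Solr's implementation, and it defaults to min=max=2 to keep the output small (an assumption for readability; the field type above uses 2 to 800).

```python
def ngrams(token, min_size=2, max_size=2):
    """Return all character n-grams of token with lengths min_size..max_size."""
    grams = []
    for size in range(min_size, max_size + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


def analyze(text, min_size=2, max_size=2):
    """Approximate the analysis chain: split on whitespace, lowercase, n-gram."""
    terms = set()
    for token in text.lower().split():
        terms.update(ngrams(token, min_size, max_size))
    return terms


typo_terms = analyze("Micrs")
indexed_terms = analyze("Microsoft")
shared = typo_terms & indexed_terms
print(sorted(shared))
```

The misspelled word still shares several grams with the indexed word, which is enough for a match. The same overlap also lets unrelated titles match on a few common grams, which is why ordering from an ngram-only field can be surprising.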




--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/
