
Re: Text search NGram

2016-03-16 Thread Emir Arnautovic


Hi Rajesh,
Here is a somewhat older presentation, https://vimeo.com/32701503, but all of 
it should still be applicable. You can search for more with "understanding 
solr debug".


Regards,
Emir

On 16.03.2016 11:30, G, Rajesh wrote:

Hi Emir,
Yes, I changed it to 800 to see whether it produces a different result. Sorry 
that I did not mention this before. I have deleted all folders and files in the 
data folder and re-indexed. Attached is the result with debug on.

Can you please let me know whether there is any utility or blog that helps in 
understanding the debug output [parsedquery, explain...]?

Thanks
Rajesh



Corporate Executive Board India Private Limited. Registration No: 
U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF Building 
No.10 DLF Cyber City, Gurgaon, Haryana-122002, India.

This e-mail and/or its attachments are intended only for the use of the 
addressee(s) and may contain confidential and legally privileged information 
belonging to CEB and/or its subsidiaries, including CEB subsidiaries that offer 
SHL Talent Measurement products and services. If you have received this e-mail 
in error, please notify the sender and immediately, destroy all copies of this 
email and its attachments. The publication, copying, in whole or in part, or 
use or dissemination in any other way of this e-mail and attachments by anyone 
other than the intended person(s) is prohibited.


Re: Text search NGram

2016-03-16 Thread Emir Arnautovic


Hi Rajesh,
It seems that the title lengths do not differ enough to produce different 
fieldNorm values - it is 0.5 for all titles, so all documents matching the 
exact-match query end up with the same score.


The query with "Ofice" puts the wrong document first because of its 
fieldNorm=1.0 - it seems to me that this document was not reindexed after 
setting omitNorms=false.


I also noticed that the ngram field in the schema is a bit different from the 
one in your mail - it has maxGramSize="800". That does not change the 
explanation, but the results are easier to understand when max=min.
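
As a rough illustration of the fieldNorm point, here is a minimal sketch. It assumes Lucene's classic TF-IDF similarity, where the length norm of a field without an index-time boost is 1/sqrt(number of indexed terms). The function name is made up for this example. The real index stores the norm in a single byte, so fields of similar length can end up with the same stored value, and a field indexed with norms omitted is reported with fieldNorm=1.0.

```python
import math


def raw_length_norm(num_terms: int) -> float:
    """Un-quantized length norm: 1 / sqrt(number of indexed terms)."""
    if num_terms <= 0:
        raise ValueError("num_terms must be positive")
    return 1.0 / math.sqrt(num_terms)


# Longer fields get a smaller norm, so a match in a short title scores higher.
for n in (1, 4, 16):
    print(n, raw_length_norm(n))
```

With an ngram filter the number of indexed terms is the number of grams, not the number of words, which is one reason the norms are harder to read when minGramSize and maxGramSize differ.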


HTH,
Emir

On 16.03.2016 10:31, G, Rajesh wrote:

Hi Emir,

Yes I have re-indexed after setting omitNorms to false. Attached is the result 
of the query in debug mode.

I am using LuceneQParser

Thanks
Rajesh




Re: Text search NGram

2016-03-16 Thread Emir Arnautovic


Hi Rajesh,
Did you reindex after setting omitNorms to false? Can you send the results 
with debug for the first query?


What query parser are you using for these queries? You should run your 
queries with debug=true and see how they are rewritten - that should 
explain why some cases do not return the expected documents. If you have 
trouble understanding why a document is not returned, you can post the 
response to this thread.
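
A small sketch of what such a debug request could look like when built from Python. The host, port, and core name ("localhost:8983", "techproducts") are placeholders and not taken from this thread; q, fl, wt, and debug are standard Solr request parameters.

```python
from urllib.parse import urlencode


def debug_query_url(base_url: str, query: str) -> str:
    """Build a /select URL that asks Solr to include debug output."""
    params = {
        "q": query,
        "fl": "title,score",   # return only the fields needed to compare scores
        "wt": "json",
        "debug": "true",       # adds parsedquery and explain sections to the response
    }
    return base_url + "/select?" + urlencode(params)


# Hypothetical core URL, used only for illustration.
url = debug_query_url("http://localhost:8983/solr/techproducts",
                      "title:(Microsoft Ofice 365)")
print(url)
```

The parsedquery entry in the response shows how the parser rewrote the query, and the explain entry breaks each document's score down into tf, idf, and fieldNorm.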


Thanks,
Emir


RE: Text search NGram

2016-03-16 Thread G, Rajesh

Hi Emir,

The solution we want to implement is to show the top 100 best-matching 
technology names from our list of technology names. Whatever technology name 
the user types first goes to SQL Server, where an exact match is attempted if 
possible [name==name]; only names that do not match exactly [spelling mistakes, 
jumbled words] are searched in Solr.
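
To make the misspelling case concrete, here is a minimal sketch of character n-gram generation. It is not Solr's NGram filter, only a plain-Python stand-in that assumes lowercased tokens and the gram sizes 2 to 800 mentioned elsewhere in this thread.

```python
def char_ngrams(token: str, min_gram: int = 2, max_gram: int = 800) -> set:
    """Return all character n-grams of a token with sizes min_gram..max_gram."""
    token = token.lower()
    grams = set()
    # Gram size cannot exceed the token length, whatever max_gram says.
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.add(token[start:start + size])
    return grams


# Grams shared by the misspelled and the correct word are what produce a match.
shared = char_ngrams("Ofice") & char_ngrams("Office")
print(sorted(shared))
```

The misspelled "Ofice" still shares several grams with "Office", which is why the document matches at all. The same grams also occur in many other titles, which is part of why ordering by ngram matches alone can be surprising.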

With the setup below, if I query title:(Microsoft Ofice 365) I get the result 
below [note: the scores are the same?]
{
"title":"Lync - Microsoft Office 365",
"score":7.7472024
},
{
"title":"Microsoft Office 365",
"score":7.7472024
},
When I query title:(Microsoft Ofice 365) OR title_ws:(Microsoft Ofice 365)
{
"title":"RIM BlackBerry Enterprise Server (BES) for Microsoft Office 365 1.0",
"title_ws":"RIM BlackBerry Enterprise Server (BES) for Microsoft Office 365 1.0",
"score":3.9297152
},
{
"title":"Microsoft Office 365",
"title_ws":"Microsoft Office 365",
"score":3.1437721
}

When I query title:(Microsoft Ofice 365) OR title_ws:(Microsoft Ofice 365) 
with qf=title_ws^1, I don’t get any results.
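
For context on the qf parameter: in Solr it is read by the DisMax and eDisMax query parsers, so a request that relies on it would normally also select one of them with defType. The sketch below only builds such a parameter set. The boost value 5 and the helper name are assumptions for illustration, not settings taken from this thread.

```python
from urllib.parse import urlencode


def edismax_params(user_text: str) -> dict:
    """Request parameters that search both fields and favour the plain one."""
    return {
        "defType": "edismax",         # qf is only honoured by (e)dismax parsers
        "q": user_text,               # raw user input, no field prefixes
        "qf": "title title_ws^5",     # assumed boost on the non-ngram field
        "fl": "title,title_ws,score",
        "rows": "100",                # the top 100 matches mentioned above
    }


print(urlencode(edismax_params("Microsoft Ofice 365")))
```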

The expected result is
{
"title":"Microsoft Office 365",
"title_ws":"Microsoft Office 365",
},
  {
"title":"Microsoft Office 365 1.0",
"title_ws":"Microsoft Office 365 1.0",
},
  {
"title":"Microsoft Office 365 14.0",
"title_ws":"Microsoft Office 365 14.0",
},
  {
"title":"Microsoft Office 365 14.3",
"title_ws":"Microsoft Office 365 14.3",
},
  {
"title":"Microsoft Office 365 14.4",
"title_ws":"Microsoft Office 365 14.4",
},

Thanks
Rajesh


Re: Text search NGram

2016-03-07 Thread Jack Krupansky

Absolutely, but so what? Nothing in any Solr query is going to be based on
character position.

Also, adding and removing characters in a char filter is a really bad idea
if you might want to do highlighting since the token character position
would not line up with the original source text.
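
The difference between character offsets and token positions can be sketched in a few lines. This is not Solr's tokenizer, only a regular-expression stand-in that splits on whitespace, lowercases, and records the same start, end, and position values that the analysis screen shows.

```python
import re


def whitespace_tokens(text: str):
    """Return (term, start_offset, end_offset, position) per whitespace-separated token."""
    return [
        (m.group().lower(), m.start(), m.end(), pos)
        for pos, m in enumerate(re.finditer(r"\S+", text), start=1)
    ]


# One space versus three spaces between the two words.
print(whitespace_tokens("Microsoft office"))
print(whitespace_tokens("Microsoft   office"))
```

Extra whitespace shifts the start and end offsets of "office" from 10-16 to 12-18, but its position stays 2 in both cases. Offsets point back into the original text, which is what highlighting relies on.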

-- Jack Krupansky


RE: Text search NGram

2016-03-07 Thread G, Rajesh

Hi Jack,



Please correct me if I am wrong. I added the char filter for the following 
reason.



In the Analysis screen [Solr UI] I entered "Microsoft office" as Field Value 
(Index), and WhitespaceTokenizerFactory produces the result below, where 
"office" starts at 10. If I add extra whitespace, say 2 more spaces, "office" 
starts at 12. Should it not start at 10?



text       raw_bytes                     start  end  positionLength  type  position
microsoft  [6d 69 63 72 6f 73 6f 66 74]  0      9    1               word  1
office     [6f 66 66 69 63 65]           10     16   1               word  2

text       raw_bytes                     start  end  positionLength  type  position
microsoft  [6d 69 63 72 6f 73 6f 66 74]  0      9    1               word  1
office     [6f 66 66 69 63 65]           12     18   1               word  2






Thanks

Rajesh






Re: Text search NGram

2016-03-07 Thread Jack Krupansky

The charFilter isn't doing anything useful - the whitespace tokenizer will
ignore extra white space anyway.

-- Jack Krupansky

On Mon, Mar 7, 2016 at 5:44 AM, G, Rajesh  wrote:

> Hi Team,
>
> We have the field type below and have indexed the values "title": "Microsoft
> Visual Studio 2006" and "title": "Microsoft Visual Studio 8.0.61205.56
> (2005)".
>
> When I search for title:(Microsoft Visual AND Studio AND 2005), I get
> Microsoft Visual Studio 8.0.61205.56 (2005) as the second record and
> Microsoft Visual Studio 2006 as the first record. I wanted to have Microsoft
> Visual Studio 8.0.61205.56 (2005) listed first, since the user has searched
> for Microsoft Visual Studio 2005. Can you please help?
>
> We are using NGram so that it takes care of misspelled or jumbled words [it
> works as expected],
> e.g.
> searching for Micrs Visual Studio gets Microsoft Visual Studio
> searching for Visual Microsoft Studio gets Microsoft Visual Studio
>
>positionIncrementGap="0" >
> 
>  class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>  class="solr.WhitespaceTokenizerFactory"/>
>  class="solr.LowerCaseFilterFactory"/>
>  minGramSize="2" maxGramSize="800"/>
> 
>  
>  class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>  class="solr.WhitespaceTokenizerFactory"/>
>  class="solr.LowerCaseFilterFactory"/>
>  minGramSize="2" maxGramSize="800"/>
> 
>   
>
>
>

RE: Text search NGram

2016-03-07 Thread G, Rajesh

Hi Emir,

I got it. Thanks, Emir, that was helpful.

Thanks
Rajesh




Re: Text search NGram

2016-03-07 Thread Emir Arnautovic

Not sure I understood the question. What I meant is that you should try setting 
omitNorms="false" on your txt_token field type if you want to stick with an 
ngram-only solution:

and to add a new field type and field to keep a non-ngram version of the field. 
Something like:

and use copyField to copy to both fields and query title:test OR 
title_simple:test.
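
A minimal sketch of how such a two-field query string could be assembled on the client side. The boost of 5 on the non-ngram field is an assumption added for illustration (the message above does not give a value), and the helper name is hypothetical. The field names title and title_simple come from the message.

```python
def combined_query(user_text: str, ngram_field: str = "title",
                   exact_field: str = "title_simple", boost: float = 5.0) -> str:
    """Build 'ngram_field:(...) OR exact_field:(...)^boost' in Lucene query syntax."""
    # Assumes user_text contains no query-syntax characters; real input
    # would need escaping before being placed inside the parentheses.
    terms = f"({user_text})"
    return f"{ngram_field}:{terms} OR {exact_field}:{terms}^{boost:g}"


print(combined_query("test"))
print(combined_query("Microsoft Ofice 365"))
```

The ngram field keeps misspelled input matching, while the boosted clause on the normally tokenized field lifts documents whose whole words match.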


Emir


On 07.03.2016 15:31, G, Rajesh wrote:

Hi Emir,

I have already applied

 and then I have applied . Is this what you wanted me to 
have in my config?

Thanks
Rajesh




-Original Message-
From: G, Rajesh [mailto:r...@cebglobal.com]
Sent: Monday, March 7, 2016 7:50 PM
To: solr-user@lucene.apache.org
Subject: RE: Text search NGram

Hi Emir,

Thanks for your email. Can you please help me understand what you mean by "e.g.
boost if matching tokenized fields to make sure exact matches are ordered first"?




-Original Message-
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
Sent: Monday, March 7, 2016 7:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

Hi Rajesh,
It is most likely related to norms - you can try setting omitNorms="true" and
reindexing content. Anyway, it is not common to use just ngrams for matching content - in
that case you can expect more unexpected ordering/results. You should combine ngram
fields with normally tokenized fields (e.g. boost if matching tokenized fields to make
sure exact matches are ordered first).

Regards,
Emir

On 07.03.2016 11:44, G, Rajesh wrote:

Hi Team,

We have the field type below, and we have indexed the values "title": "Microsoft Visual Studio 2006" and
"title": "Microsoft Visual Studio 8.0.61205.56 (2005)".

When I search for title:(Microsoft Visual AND Studio AND 2005) I get Microsoft
Visual Studio 8.0.61205.56 (2005) as the second record and Microsoft Visual
Studio 2006 as the first record. I want Microsoft Visual Studio
8.0.61205.56 (2005) listed first, since the user searched for Microsoft
Visual Studio 2005. Can you please help?

We are using NGram so that it takes care of misspelled or jumbled words (this
works as expected), e.g.
searching Micrs Visual Studio returns Microsoft Visual Studio
searching Visual Microsoft Studio returns Microsoft Visual Studio


  
  
  
  
  
  
   
  
  
  
  
  





Re: Text search NGram

2016-03-07 Thread Emir Arnautovic


Hi Rajesh,
The solution includes 2 fields - one "ngram" field (like your txt_token) and
another "nonngram" field that is just tokenized (like your txt_token without
the ngram token filter). If you have two documents:

1. ABCDEF
2. ABCD

and you are searching for ABCD, then with only the ngram field both documents
match and doc 1 can come first. If you instead search for ngram:ABCD OR
nonngram:ABCD, doc 2 will have the higher score because it also matches on the
nonngram field.
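
With the standard query parser, the exact-match clause can also be given an explicit boost, for example title:ABCD OR title_simple:ABCD^5. As an alternative that was not proposed in this thread, the same preference could be configured once through the edismax parser. The fragment below is only a sketch: the handler name /select, the field names title and title_simple, and the boost of 5 are assumptions chosen for illustration.

```xml
<!-- solrconfig.xml sketch: search both fields, weighting the non-ngram field higher. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- edismax is a swapped-in parser here; the thread itself uses the standard parser. -->
    <str name="defType">edismax</str>
    <!-- Arbitrary boost: whole-token matches on title_simple count more than ngram matches. -->
    <str name="qf">title title_simple^5</str>
  </lst>
</requestHandler>
```

The boost value would need tuning against real data, ideally by checking the explain section of the debug output for a few representative queries.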


Regards,
Emir

On 07.03.2016 15:20, G, Rajesh wrote:

Hi Emir,

Thanks for your email. Can you please help me understand what you mean by "e.g.
boost if matching tokenized fields to make sure exact matches are ordered first"?




-Original Message-
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
Sent: Monday, March 7, 2016 7:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

Hi Rajesh,
It is most likely related to norms - you can try setting omitNorms="true" and
reindexing content. Anyway, it is not common to use just ngrams for matching content - in
that case you can expect more unexpected ordering/results. You should combine ngram
fields with normally tokenized fields (e.g. boost if matching tokenized fields to make
sure exact matches are ordered first).

Regards,
Emir

On 07.03.2016 11:44, G, Rajesh wrote:

Hi Team,

We have the field type below, and we have indexed the values "title": "Microsoft Visual Studio 2006" and
"title": "Microsoft Visual Studio 8.0.61205.56 (2005)".

When I search for title:(Microsoft Visual AND Studio AND 2005) I get Microsoft
Visual Studio 8.0.61205.56 (2005) as the second record and Microsoft Visual
Studio 2006 as the first record. I want Microsoft Visual Studio
8.0.61205.56 (2005) listed first, since the user searched for Microsoft
Visual Studio 2005. Can you please help?

We are using NGram so that it takes care of misspelled or jumbled words (this
works as expected), e.g.
searching Micrs Visual Studio returns Microsoft Visual Studio
searching Visual Microsoft Studio returns Microsoft Visual Studio


  
  
  
  
  
  
   
  
  
  
  
  







--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & 
Elasticsearch Support * http://sematext.com/



--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

RE: Text search NGram

2016-03-07 Thread G, Rajesh

Hi Emir,

I have already applied [XML not preserved in the archive] and then I have
applied [XML not preserved in the archive]. Is this what you wanted me to
have in my config?

Thanks
Rajesh


-Original Message-
From: G, Rajesh [mailto:r...@cebglobal.com]
Sent: Monday, March 7, 2016 7:50 PM
To: solr-user@lucene.apache.org
Subject: RE: Text search NGram

Hi Emir,

Thanks for your email. Can you please help me understand what you mean by
"e.g. boost if matching tokenized fields to make sure exact matches are ordered
first"?


-Original Message-
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
Sent: Monday, March 7, 2016 7:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

Hi Rajesh,
It is most likely related to norms - you can try setting omitNorms="true" and
reindexing content. Anyway, it is not common to use just ngrams for matching
content - in that case you can expect more unexpected ordering/results. You
should combine ngram fields with normally tokenized fields (e.g. boost if
matching tokenized fields to make sure exact matches are ordered first).

Regards,
Emir

On 07.03.2016 11:44, G, Rajesh wrote:
> Hi Team,
>
> We have the field type below, and we have indexed the values "title": "Microsoft
> Visual Studio 2006" and "title": "Microsoft Visual Studio 8.0.61205.56 (2005)".
>
> When I search for title:(Microsoft Visual AND Studio AND 2005) I get
> Microsoft Visual Studio 8.0.61205.56 (2005) as the second record and
> Microsoft Visual Studio 2006 as the first record. I want Microsoft
> Visual Studio 8.0.61205.56 (2005) listed first, since the user searched
> for Microsoft Visual Studio 2005. Can you please help?
>
> We are using NGram so that it takes care of misspelled or jumbled words (this
> works as expected), e.g.
> searching Micrs Visual Studio returns Microsoft Visual Studio
> searching Visual Microsoft Studio returns Microsoft Visual Studio
>
> positionIncrementGap="0" >
>  
>   class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>   class="solr.WhitespaceTokenizerFactory"/>
>  
>   minGramSize="2" maxGramSize="800"/>
>  
>   
>   class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>   class="solr.WhitespaceTokenizerFactory"/>
>  
>   minGramSize="2" maxGramSize="800"/>
>  
>
>
>
>

RE: Text search NGram

2016-03-07 Thread G, Rajesh

Hi Emir,

Thanks for your email. Can you please help me understand what you mean by
"e.g. boost if matching tokenized fields to make sure exact matches are ordered
first"?


-Original Message-
From: Emir Arnautovic [mailto:emir.arnauto...@sematext.com]
Sent: Monday, March 7, 2016 7:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

Hi Rajesh,
It is most likely related to norms - you can try setting omitNorms="true" and
reindexing content. Anyway, it is not common to use just ngrams for matching
content - in that case you can expect more unexpected ordering/results. You
should combine ngram fields with normally tokenized fields (e.g. boost if
matching tokenized fields to make sure exact matches are ordered first).

Regards,
Emir

On 07.03.2016 11:44, G, Rajesh wrote:
> Hi Team,
>
> We have the field type below, and we have indexed the values "title": "Microsoft
> Visual Studio 2006" and "title": "Microsoft Visual Studio 8.0.61205.56 (2005)".
>
> When I search for title:(Microsoft Visual AND Studio AND 2005) I get
> Microsoft Visual Studio 8.0.61205.56 (2005) as the second record and
> Microsoft Visual Studio 2006 as the first record. I want Microsoft
> Visual Studio 8.0.61205.56 (2005) listed first, since the user searched
> for Microsoft Visual Studio 2005. Can you please help?
>
> We are using NGram so that it takes care of misspelled or jumbled words (this
> works as expected), e.g.
> searching Micrs Visual Studio returns Microsoft Visual Studio
> searching Visual Microsoft Studio returns Microsoft Visual Studio
>
> positionIncrementGap="0" >
>  
>   class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>   class="solr.WhitespaceTokenizerFactory"/>
>  
>   minGramSize="2" maxGramSize="800"/>
>  
>   
>   class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>   class="solr.WhitespaceTokenizerFactory"/>
>  
>   minGramSize="2" maxGramSize="800"/>
>  
>
>
>
>

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management Solr & 
Elasticsearch Support * http://sematext.com/

Re: Text search NGram

2016-03-07 Thread Emir Arnautovic

Hi Rajesh,
It is most likely related to norms - you can try setting
omitNorms="true" and reindexing content. Anyway, it is not common to use
just ngrams for matching content - in that case you can expect more
unexpected ordering/results. You should combine ngram fields with
normally tokenized fields (e.g. boost if matching tokenized fields to
make sure exact matches are ordered first).

Regards,
Emir

On 07.03.2016 11:44, G, Rajesh wrote:

Hi Team,

We have the field type below, and we have indexed the values "title": "Microsoft Visual Studio 2006" and
"title": "Microsoft Visual Studio 8.0.61205.56 (2005)".

When I search for title:(Microsoft Visual AND Studio AND 2005) I get Microsoft
Visual Studio 8.0.61205.56 (2005) as the second record and Microsoft Visual
Studio 2006 as the first record. I want Microsoft Visual Studio
8.0.61205.56 (2005) listed first, since the user searched for Microsoft
Visual Studio 2005. Can you please help?

We are using NGram so that it takes care of misspelled or jumbled words (this
works as expected), e.g.
searching Micrs Visual Studio returns Microsoft Visual Studio
searching Visual Microsoft Studio returns Microsoft Visual Studio


--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

RE: Text search NGram

2016-03-07 Thread G, Rajesh

Hi Binoy,

It is the Standard Query Parser.

Thanks
Rajesh




-Original Message-
From: Binoy Dalal [mailto:binoydala...@gmail.com]
Sent: Monday, March 7, 2016 5:12 PM
To: solr-user@lucene.apache.org
Subject: Re: Text search NGram

What query parser are you using?

Additionally, run the same query with debug=true and see how your results
are being scored to find out why the MS VS 2006 title shows up before 2005.

On Mon, 7 Mar 2016, 16:14 G, Rajesh, <r...@cebglobal.com> wrote:

> Hi Team,
>
> We have the field type below, and we have indexed the values "title":
> "Microsoft Visual Studio 2006" and "title": "Microsoft Visual Studio
> 8.0.61205.56 (2005)".
>
> When I search for title:(Microsoft Visual AND Studio AND 2005) I get
> Microsoft Visual Studio 8.0.61205.56 (2005) as the second record and
> Microsoft Visual Studio 2006 as the first record. I want
> Microsoft Visual Studio 8.0.61205.56 (2005) listed first, since the
> user searched for Microsoft Visual Studio 2005. Can you please help?
>
> We are using NGram so that it takes care of misspelled or jumbled words (this
> works as expected), e.g.
> searching Micrs Visual Studio returns Microsoft Visual Studio
> searching Visual Microsoft Studio returns Microsoft Visual Studio
>
>positionIncrementGap="0" >
> 
>  class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>  class="solr.WhitespaceTokenizerFactory"/>
>  class="solr.LowerCaseFilterFactory"/>
>  minGramSize="2" maxGramSize="800"/>
> 
>  
>  class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>  class="solr.WhitespaceTokenizerFactory"/>
>  class="solr.LowerCaseFilterFactory"/>
>  minGramSize="2" maxGramSize="800"/>
> 
>   
>
>
>
> --
Regards,
Binoy Dalal

Re: Text search NGram

2016-03-07 Thread Binoy Dalal

What query parser are you using?

Additionally, run the same query with debug=true and see how your
results are being scored to find out why the MS VS 2006 title shows up before
2005.

On Mon, 7 Mar 2016, 16:14 G, Rajesh,  wrote:

> Hi Team,
>
> We have the field type below, and we have indexed the values "title": "Microsoft
> Visual Studio 2006" and "title": "Microsoft Visual Studio 8.0.61205.56
> (2005)".
>
> When I search for title:(Microsoft Visual AND Studio AND 2005) I get
> Microsoft Visual Studio 8.0.61205.56 (2005) as the second record and
> Microsoft Visual Studio 2006 as the first record. I want Microsoft
> Visual Studio 8.0.61205.56 (2005) listed first, since the user searched
> for Microsoft Visual Studio 2005. Can you please help?
>
> We are using NGram so that it takes care of misspelled or jumbled words (this
> works as expected),
> e.g.
> searching Micrs Visual Studio returns Microsoft Visual Studio
> searching Visual Microsoft Studio returns Microsoft Visual Studio
>
>positionIncrementGap="0" >
> 
>  class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>  class="solr.WhitespaceTokenizerFactory"/>
>  class="solr.LowerCaseFilterFactory"/>
>  minGramSize="2" maxGramSize="800"/>
> 
>  
>  class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
>  class="solr.WhitespaceTokenizerFactory"/>
>  class="solr.LowerCaseFilterFactory"/>
>  minGramSize="2" maxGramSize="800"/>
> 
>   
>
>
>
> --
Regards,
Binoy Dalal
