subject:"Edgengram"

Solr 5: hit highlight with NGram/EdgeNgram-fields

2015-04-20 Thread Bjørn Hjelle

With Solr 4.10.3 I was advised to set luceneMatchVersion to 4.3 to make
hit highlighting work with NGram/EdgeNGram fields, like this:

<filter class="solr.EdgeNGramFilterFactory" maxGramSize="20"
        minGramSize="1" luceneMatchVersion="4.3"/>

In Solr 5 and 5.1 this no longer seems to work.
The complete word is highlighted, not just the part that matches the
search term.

The Solr admin analysis page again does not show the proper end-offset
positions. What it shows is this:

LENGTF
text            t      te       tes         test
raw_bytes       [74]   [74 65]  [74 65 73]  [74 65 73 74]
start           0      0        0           0
end             4      4        4           4
positionLength  1      1        1           1
type            word   word     word        word
position        1      1        1           1

In Solr 4.10.3 with luceneMatchVersion set to 4.3 the end offsets would be 1,
2, 3, 4 and hit highlighting would work.
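
The effect of the two sets of end offsets on highlighting can be illustrated with a small Python sketch. This is a toy model, not Lucene code: the function names and the per_gram_offsets flag are made up for illustration, and the "highlighter" simply wraps the offset range of the matching gram in <em> tags.

```python
def edge_ngrams_with_offsets(token, start, min_gram, max_gram, per_gram_offsets):
    """Return (gram, start_offset, end_offset) tuples for the front edge n-grams of token."""
    grams = []
    for n in range(min_gram, min(max_gram, len(token)) + 1):
        # Either each gram ends where its own text ends (the 1, 2, 3, 4 case),
        # or every gram inherits the offsets of the whole token (the 4, 4, 4, 4 case).
        end = start + n if per_gram_offsets else start + len(token)
        grams.append((token[:n], start, end))
    return grams


def highlight(text, query, grams):
    """Wrap the offset range of the gram equal to query in <em> tags."""
    for gram, start, end in grams:
        if gram == query:
            return text[:start] + "<em>" + text[start:end] + "</em>" + text[end:]
    return text


old_style = edge_ngrams_with_offsets("test", 0, 1, 20, per_gram_offsets=True)
new_style = edge_ngrams_with_offsets("test", 0, 1, 20, per_gram_offsets=False)

print([end for _, _, end in old_style])    # [1, 2, 3, 4]
print([end for _, _, end in new_style])    # [4, 4, 4, 4]
print(highlight("test", "te", old_style))  # <em>te</em>st
print(highlight("test", "te", new_style))  # <em>test</em>
```

With whole-token offsets, any highlighter that relies on the stored offsets has no way to mark only the matched prefix, which matches the behaviour described above.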

Any advice on making hit highlighting work with (Edge)NGram fields would be
highly appreciated!

Thanks,
Bjørn

Combination of edgengram and ngram

2011-12-13 Thread Shawn Heisey

I am interested in a new filter type, one that would combine edgengram 
and ngram.  The idea is that it would create all ngrams specified by the 
min/max size, but the ngrams that happen to be edgengrams (specifically 
the left side) would get an index-time boost.  Optionally the boost 
would be higher if it came from the first token.
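
The proposed behaviour can be sketched in a few lines of Python. This only models the idea; it is not a Solr filter, and the function name and the boost values (3.0, 2.0, 1.0) are invented for the example.

```python
def boosted_ngrams(phrase, min_gram, max_gram, edge_boost=2.0, first_token_boost=3.0):
    """Return (gram, boost) pairs for every n-gram of every whitespace-separated token.

    Grams that start at the left edge of a token get edge_boost, grams at the
    left edge of the first token get first_token_boost, all others get 1.0.
    """
    result = []
    for token_index, token in enumerate(phrase.lower().split()):
        for start in range(len(token)):
            for n in range(min_gram, max_gram + 1):
                if start + n > len(token):
                    break  # longer grams would run past the end of the token
                if start == 0:
                    boost = first_token_boost if token_index == 0 else edge_boost
                else:
                    boost = 1.0
                result.append((token[start:start + n], boost))
    return result


for gram, boost in boosted_ngrams("red shoes", 2, 3):
    print(gram, boost)
```

For "red shoes" with gram sizes 2 to 3, "re" and "red" receive the highest boost, "sh" and "sho" the edge boost, and inner grams such as "ed" or "hoe" stay at 1.0.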


The use case:  An automatic autosuggest dropdown that populates as a 
user types into a search box.  The index would have one field and it 
would be built from a manually produced list of suggested search 
phrases.  The boosts mentioned would make it so that matches from the 
beginning of a word, and especially from the beginning of the entire 
suggested phrase, would be returned first.


I could get a similar effect by using a copyfield, analyzing one field 
with ngrams and the other with edgengrams, then using edismax to put a 
boost on the edge version.  I will start with this method, but using 
copyfield makes the index bigger, and using dismax makes the ultimate 
parsed queries more complicated.


If I can avoid the copyfield, the index will be smaller and the queries 
very simple, which should make for very high speed.


I will take a look at the source code, but I'm a bit of a Java novice.  
Does anyone have the knowledge, desire, and time to crank this one out 
quickly?  Is it possible someone has already written such a filter?


Thanks,
Shawn

Problem using EdgeNGram

2011-09-21 Thread Kissue Kissue

Hi,

I am using Solr 3.3 with SolrJ. I am trying to use EdgeNGram to power an
auto-suggest feature in my application. My understanding is that using
EdgeNGram means results will only be returned for records starting with the
search criteria, but this is not happening for me.

For example, if I search for tr, I get the following results:

Greenham Trading 6
IT Training Publications
AA Training

Below are details of my configuration:

<fieldType name="edgytext" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
            maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="businessName" type="edgytext" indexed="true" stored="true"
       required="true" omitNorms="true" omitTermFreqAndPositions="true"/>

Any ideas why this is happening will be much appreciated.

Thanks.

Re: Problem using EdgeNGram

2011-09-21 Thread O. Klein

Try using KeywordTokenizerFactory instead of StandardTokenizerFactory to get
the results you want.
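
Why the tokenizer matters here can be shown with a rough Python model of the two index-time chains. The helper names are made up, and the regular expression is only a crude stand-in for StandardTokenizer; the gram sizes follow the configuration quoted in the question.

```python
import re


def edge_ngrams(token, min_gram=1, max_gram=15):
    """Front edge n-grams of a single token, shortest first."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


def index_terms_per_word(text):
    """Word-level tokenization: every word contributes its own prefixes."""
    terms = set()
    for word in re.findall(r"\w+", text.lower()):
        terms.update(edge_ngrams(word))
    return terms


def index_terms_whole_value(text):
    """Keyword-style tokenization: only prefixes of the entire value are indexed."""
    return set(edge_ngrams(text.lower()))


name = "Greenham Trading 6"
print("tr" in index_terms_per_word(name))     # True: "trading" yields the gram "tr"
print("tr" in index_terms_whole_value(name))  # False: the value starts with "gr"
print("tr" in index_terms_whole_value("Trading Post"))  # True
```

Note that with whole-value tokenization and maxGramSize=15, only the first 15 characters of each value are reachable as prefixes.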




Re: Edgengram

2011-06-01 Thread Brian Lamb

Hi Tomás,

Thank you very much for your suggestion. I took another crack at it using
your recommendation and it worked ideally. The only thing I had to change
was replacing

<analyzer type="query">
  <tokenizer class="solr.KeywordTokenizerFactory" />
</analyzer>

with

<analyzer type="query">
  <tokenizer class="solr.LowerCaseTokenizerFactory" />
</analyzer>

The first did not produce any results but the second worked beautifully.

Thanks!

Brian Lamb

Re: Edgengram

2011-06-01 Thread Erick Erickson

Be a little careful here. LowerCaseTokenizerFactory is different than
KeywordTokenizerFactory.

LowerCaseTokenizerFactory will give you more than one term, e.g.
the string "Intelligence can't be MeaSurEd" will give you 5 terms,
any of which may match, i.e.
intelligence, can, t, be, measured
whereas KeywordTokenizerFactory followed by, say, LowerCaseFilter
would give you exactly one token:
intelligence can't be measured

So searching for "measured" would get a hit in the first case but
not in the second. Searching for intellig* would hit both.
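
The difference can be reproduced with a small Python approximation. The two functions are illustrative stand-ins, not the Lucene classes: the first splits on anything that is not a letter and lowercases, the second keeps the whole input as one lowercased token.

```python
import re


def lower_case_tokenizer(text):
    """Approximation of LowerCaseTokenizer: runs of letters, lowercased."""
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def keyword_then_lowercase(text):
    """Approximation of KeywordTokenizer followed by LowerCaseFilter: one token."""
    return [text.lower()]


text = "Intelligence can't be MeaSurEd"
split_terms = lower_case_tokenizer(text)
single_term = keyword_then_lowercase(text)

print(split_terms)  # ['intelligence', 'can', 't', 'be', 'measured']
print(single_term)  # ["intelligence can't be measured"]

# An exact term lookup for "measured" only succeeds against the split terms.
print("measured" in split_terms, "measured" in single_term)  # True False

# A prefix lookup such as intellig* succeeds against both.
print(any(t.startswith("intellig") for t in split_terms),
      any(t.startswith("intellig") for t in single_term))    # True True
```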

Neither is better, just make sure they do what you want!

This page will help a lot:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.LowerCaseTokenizerFactory
as will the admin/analysis page.

Best
Erick

Re: Edgengram

2011-06-01 Thread Brian Lamb

I think in my case LowerCaseTokenizerFactory will be sufficient because
there will never be spaces in this particular field. But thank you for the
useful link!

Thanks,

Brian Lamb

Re: Edgengram

2011-05-31 Thread Erick Erickson

That'll work for your case, although be aware that string types aren't
analyzed at all,
so case matters, as do spaces etc.

What is the use-case here? If you explain it a bit there might be
better answers

Best
Erick

Re: Edgengram

2011-05-31 Thread Brian Lamb

In this particular case, I will be doing a solr search based on user
preferences. So I will not be depending on the user to type abcdefg. That
will be automatically generated based on user selections.

The contents of the field do not contain spaces and since I am creating the
search parameters, case isn't important either.

Thanks,

Brian Lamb

Re: Edgengram

2011-05-31 Thread bmdakshinamur...@gmail.com

Can you specify the analyzer you are using for your queries?

Maybe you could use a KeywordAnalyzer for your queries so you don't end up
matching parts of your query.
http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
This should help you.

--
Thanks and Regards,
DakshinaMurthy BM

Re: Edgengram

2011-05-31 Thread Brian Lamb

<fieldType name="edgengram" class="solr.TextField"
           positionIncrementGap="1000">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
            maxGramSize="25" side="front" />
  </analyzer>
</fieldType>

I believe I used that link when I initially set up the field and it worked
great (and I'm still using it in other places). In this particular example
however it does not appear to be practical for me. I mentioned that I have a
similarity class that returns 1 for the idf and in the case of an edgengram,
it returns 1 * length of the search string.

Thanks,

Brian Lamb

Re: Edgengram

2011-05-31 Thread Tomás Fernández Löbbe

Hi Brian, I don't know if I understand what you are trying to achieve. You
want the term query abcdefg to have an idf of 1 instead of 7? I think using
the KeywordTokenizerFactory at query time should work. It would be
something like:

<fieldType name="edgengram" class="solr.TextField"
           positionIncrementGap="1000">
  <analyzer type="index">
    <tokenizer class="solr.LowerCaseTokenizerFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
            maxGramSize="25" side="front" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory" />
  </analyzer>
</fieldType>

This way, at query time abcdefg won't be turned into a ab abc abcd abcde
abcdef abcdefg. At index time it will.
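
A minimal Python sketch of this asymmetric setup, under the assumption that the index chain behaves like "split on non-letters, lowercase, emit front edge grams" and the query chain passes the input through untouched. The function names are hypothetical and nothing here calls Solr.

```python
import re


def edge_ngrams(token, min_gram=1, max_gram=25):
    """Front edge n-grams of a single token, shortest first."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


def index_analyzer(value):
    """Stand-in for LowerCaseTokenizer + EdgeNGramFilter."""
    terms = []
    for token in re.findall(r"[^\W\d_]+", value):
        terms.extend(edge_ngrams(token.lower()))
    return terms


def keyword_query_analyzer(query):
    """Stand-in for KeywordTokenizer: the whole query string is one unmodified term."""
    return [query]


indexed = set(index_analyzer("abcdefghijklmnop"))

same_analyzer_terms = index_analyzer("abcdefg")        # query analyzed like the index
keyword_terms = keyword_query_analyzer("abcdefg")      # query kept as a single term

print(same_analyzer_terms)  # ['a', 'ab', 'abc', 'abcd', 'abcde', 'abcdef', 'abcdefg']
print(keyword_terms)        # ['abcdefg']
print(all(term in indexed for term in keyword_terms))  # True
```

The single query term still matches because the index already contains every prefix of the field value, so the expansion is only needed on one side.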

Regards,
Tomás


Re: Edgengram

2011-05-31 Thread Tomás Fernández Löbbe

...or also use the LowerCaseTokenizerFactory at query time for consistency,
but not the edge ngram filter.

Re: Edgengram

2011-05-27 Thread Brian Lamb

For this, I ended up just changing it to string and using abcdefg* to
match. That seems to work so far.

Thanks,

Brian Lamb

Edgengram

2011-05-25 Thread Brian Lamb

Hi all,

I'm running into some confusion with the way edgengram works. I have the
field set up as:

<fieldType name="edgengram" class="solr.TextField"
           positionIncrementGap="1000">
  <analyzer>
    <tokenizer class="solr.LowerCaseTokenizerFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
            maxGramSize="100" side="front" />
  </analyzer>
</fieldType>

I've also set up my own similarity class that returns 1 as the idf score.
What I've found this does is if I match a string "abcdefg" against a field
containing "abcdefghijklmnop", then the idf will score that as a 7:

7.0 = idf(myfield: a=51 ab=23 abc=2 abcd=2 abcde=2 abcdef=2 abcdefg=2)

I get why that's happening, but is there a way to avoid that? Do I need to
do a new field type to achieve the desired effect?

Thanks,

Brian Lamb
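
A minimal sketch of why the explain output above lists seven terms, assuming
the single analyzer in the posted field type is applied at query time as well
as at index time, so the query string is itself expanded into front edge
n-grams. The helper name edge_ngrams and the constant idf of 1.0 are
illustrative stand-ins, not Solr code.

```python
def edge_ngrams(token, min_gram=1, max_gram=100):
    """Front edge n-grams of a token, shortest first (hypothetical helper)."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


# Terms produced at index time for the stored value.
indexed_terms = set(edge_ngrams("abcdefghijklmnop"))

# With one shared analyzer, the query string is expanded the same way.
query_terms = edge_ngrams("abcdefg")

# Every query gram is present in the index, so each contributes its idf.
matching = [term for term in query_terms if term in indexed_terms]
constant_idf = 1.0  # stands in for the custom similarity described above
idf_sum = constant_idf * len(matching)

print(matching)
print(idf_sum)
```

One common way to avoid the expansion, offered here only as an assumption
about what would fit this case, is to declare a separate query analyzer
without the EdgeNGram filter, so that the query contributes a single term.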

LetterTokenizer + EdgeNGram + apostrophe in query = invalid result

2011-02-25 Thread Matt Weber

I have the following field defined in my schema:

  <fieldType name="ngram" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.LetterTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
              maxGramSize="25" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.LetterTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="person" type="ngram" indexed="true" stored="true" />

I have the default field set to person and have indexed the
following document:

<add>
   <doc>
   <field name="id"><![CDATA[1001116609]]></field>
   <field name="person"><![CDATA[Vincent M D'Onofrio]]></field>
   </doc>
</add>


The following queries return the result as expected using the standard
request handler:

vincent m d onofrio
d'o
onofrio
d onofrio

The following query fails:

d'onofrio

This is weird because "d'o" returns a result.  As soon as I type the
"n" I start to get no results.  I ran this through the field analysis
page and it shows that this query is being tokenized correctly and
should yield a result.

I am using a build of trunk Solr (r1073990) and the example
solrconfig.xml.  I am also using the example schema with the addition
of my ngram field.

Any ideas?  I have tried this with other words containing an
apostrophe and they all stop returning results after 4 characters.


Thanks,
Matt Weber

Re: EdgeNgram Auto suggest - doubles ignore

2011-02-08 Thread johnnyisrael


Hi Erick,

If you have time, Can you please take a look and provide your comments (or)
suggestions for this problem?

Please let me know if you need any more information.

Thanks,

Johnny
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2451828.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: EdgeNgram Auto suggest - doubles ignore

2011-02-08 Thread Erick Erickson

I'm afraid I'll have to pass, I'm absolutely swamped at the moment. Perhaps
someone else can pick it up.

I will say that you should be getting terms back when you pre-lower-case
them, so look in your index via the admin page or Luke to see if what's
really in your index is what you think in the name field.

As for sorting, I haven't a clue. Start by backing out your custom sorting,
verify that things are as you expect for everything *except* sorting, and
then add it back in.

Best
Erick




Re: EdgeNgram Auto suggest - doubles ignore

2011-02-01 Thread johnnyisrael


Hi Erick,

I tried to use the TermsComponent, but I ended up with the following problems.

Problem: 1

Custom Sort not working in terms component:

http://lucene.472066.n3.nabble.com/Term-component-sort-is-not-working-td1905059.html#a1909386

I want to sort using one of my custom fields [value_score]. I have already
specified it in my configuration, but it is not sorting properly.

The following is the configuration in solrconfig.xml:

  <searchComponent name="termsComponent"
      class="org.apache.solr.handler.component.TermsComponent"/>

  <requestHandler name="/terms"
      class="org.apache.solr.handler.component.SearchHandler">
    <lst name="defaults">
      <bool name="terms">true</bool>
      <str name="wt">json</str>
      <str name="fl">name</str>
      <str name="sort">value_score desc</str>
      <str name="indent">true</str>
    </lst>
    <arr name="components">
      <str>termsComponent</str>
    </arr>
  </requestHandler>

The Solr response is not returned in the order given by the sort parameter.

Problem: 2

Case-sensitivity problem: [I am searching for "Apple"]

http://localhost/solr/core1/terms?terms.fl=name&terms.prefix=apple -- not
working

http://localhost/solr/core1/terms?terms.fl=name&terms.prefix=Apple --
working

I tried a regex to overcome the case-sensitivity problem: 

http://localhost/solr/core1/terms?terms.fl=name&terms.regex=Apple&terms.regex.flag=case_insensitive

Will this regex-based search help with my requirement?

It is returning irrelevant results. I am using the same syntax that is
mentioned in the wiki.

http://wiki.apache.org/solr/TermsComponent

Am I going wrong anywhere?

Please let me know if you need any more info.

Thanks,

Johnny
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2399330.html
Sent from the Solr - User mailing list archive at Nabble.com.
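
A small sketch of building the TermsComponent request URLs from the message
above with the standard library, so that each parameter is joined with "&"
and escaped consistently. The host, core name and parameter values are taken
from the message; the helper name terms_url is made up for illustration.

```python
from urllib.parse import urlencode

# Host and core as written in the message above.
BASE = "http://localhost/solr/core1/terms"


def terms_url(params):
    """Join the base URL and an ordered dict of request parameters."""
    return BASE + "?" + urlencode(params)


prefix_url = terms_url({"terms.fl": "name", "terms.prefix": "apple"})

regex_url = terms_url({
    "terms.fl": "name",
    "terms.regex": "Apple",
    "terms.regex.flag": "case_insensitive",
})

print(prefix_url)
print(regex_url)
```

Building the query string this way does not change what the server returns;
it only rules out a malformed URL as the cause of unexpected results.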

Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread johnnyisrael


Hi Eric,

You are right, there is a copy field to EdgeNgram. I tried the configuration,
but it is not working as expected.

Configuration I tried:



<fieldType name="query" class="solr.TextField" positionIncrementGap="100"
           termVectors="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="edgytext" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3"
            maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="user_query" type="query" indexed="true" stored="true"
       omitNorms="true" omitTermFreqAndPositions="true" />
<field name="edgy_user_query" type="edgytext" indexed="true" stored="true"
       omitNorms="true" omitTermFreqAndPositions="true" />

<defaultSearchField>edgy_user_query</defaultSearchField>
<copyField source="user_query" dest="edgy_user_query"/>

==

When I search for the term "apple",

it returns results for "pineapple vers apple", "milk with apple",
"apple milk shake" ...

Is there any other way to overcome this problem?

Thanks,

Johnny


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2329370.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Erick Erickson

Let's back up here, because now I'm not clear what you actually want.
EdgeNGrams are a way of matching substrings, which is what's happening here.
Of course searching for "apple" matches all three examples, just as searching
for "apple" without grams would; that's the expected behavior.

So, we need a clear problem definition of what you're trying to do, along
with example queries (please post the results of adding debugQuery=on).

Best
Erick
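
A rough sketch of the behaviour described above, assuming front edge n-grams
with minGramSize=3 and maxGramSize=25 as in the posted edgytext type. It
contrasts whitespace tokenization at index time (every word contributes
prefixes) with keeping the whole value as one token (only the start of the
value contributes prefixes). All function names are hypothetical, and this is
a simulation in plain Python, not Solr's analysis code.

```python
def edge_ngrams(token, min_gram, max_gram):
    """Front edge n-grams of one token."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def whitespace(text):
    """Stand-in for a whitespace tokenizer."""
    return text.split()


def keyword(text):
    """Stand-in for a keyword tokenizer: the whole value is one token."""
    return [text]


def index_terms(text, tokenize):
    """Lowercase, tokenize, then collect the edge n-grams of every token."""
    terms = set()
    for token in tokenize(text.lower()):
        terms.update(edge_ngrams(token, 3, 25))
    return terms


docs = ["pineapple vers apple", "milk with apple", "apple milk shake"]


def matches(query, tokenize):
    """Documents whose indexed terms contain the lowercased query."""
    q = query.lower()
    return [d for d in docs if q in index_terms(d, tokenize)]


print(matches("apple", whitespace))
print(matches("apple", keyword))
print(matches("milk", keyword))
```

With whitespace tokenization all three values match "apple" because each
contains a word starting with it; with a single token per value only the
value that begins with the typed text matches.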


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread johnnyisrael


Hi Eric,

What I want here is, let's say I have 3 documents like 

["pineapple vers apple", "milk with apple", "apple milk shake"]

and if I search for "apple", it should return only "apple milk shake"
because that term alone starts with the letters "apple" which I typed in. It
should not bring back the others, and if I type "milk" it should return only
"milk with apple".

I want output similar to Google auto suggest.

Is there a way to achieve this without enclosing the query in double quotes?

Thanks,

Johnny
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2333602.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Jonathan Rochkind

I haven't figured out any way to achieve that AT ALL without making a 
separate Solr index just to serve autosuggest queries. At least when you 
want to auto-suggest on a multi-value field. Someone posted a crazy 
tricky way to do it with a single-valued field a while ago.  If you 
can/are willing to make a separate Solr index with a schema set up for 
auto-suggest specifically, it's easy. But from an existing schema, where 
you want to auto-suggest just based on the values in one field, it's a 
multi-valued field, and you want to allow matches in the middle of the 
field -- I don't think there's a way to do it.



Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Markus Jelsma

Then you don't need NGrams at all. A wildcard will suffice or you can use the 
TermsComponent.

If these strings are indexed as single tokens (KeywordTokenizer with 
LowercaseFilter) you can simply do field:app* to retrieve the apple milk 
shake. You can also use the string field type but then you must make sure the 
values are already lowercased before indexing.

Be careful though, there is no query time analysis for wildcard (and fuzzy) 
queries so make sure 


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Markus Jelsma

Oh, I should perhaps mention that EdgeNGrams will yield results a lot quicker 
than using wildcards, at the cost of a larger index. You should, of course, use 
EdgeNGrams if you worry about performance and have a huge index and a high 
number of queries per second.


Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread mesenthil


The index contains around 1.5 million documents. As this is used for the
autosuggest feature, performance is an important factor. 

So it looks like, using edgeNgram, it is difficult to achieve the
following: 

The result should return only those terms where the search letter matches
the first word only. For example, when we type "M", it should return
"Mumford and Sons" and not "jackson Michael". 


Jonathan,

Is it possible to achieve this when we have a separate index using edgeNgram?
 
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2334538.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Jonathan Rochkind

Ah, sorry, I got confused about your requirements. If you just want to 
match at the beginning of the field, it may be more possible, using 
edgegrams or a wildcard, if you have a single-valued field. Do you have a 
single-valued or a multi-valued field?  That is, does each document have 
just one value, or multiple?   I still get confused about how to do it 
with edgegrams, even with a single-valued field, but I think maybe it's 
possible.


_Definitely_ possible, with or without edgegrams, if you are 
willing/able to make a completely separate Solr index where each term 
for auto-suggest is a document.  Yes.


The problem lies in what "results" are. In general, Solr's results are 
the documents you have in the Solr index. Thus it makes everything a lot 
easier to deal with if you have an index where each document in the 
index is a term for auto-suggest.   But that doesn't always meet 
requirements if you need to auto-suggest within existing fq's and such, 
and of course it takes more resources to run an additional Solr index.



Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread mesenthil


Right now our configuration says multiValued=true. But that need not be
true in our case. I will make it false, try again, and update this thread with
more details.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/EdgeNgram-Auto-suggest-doubles-ignore-tp2321919p2334627.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: EdgeNgram Auto suggest - doubles ignore

2011-01-25 Thread Erick Erickson

OK, try this.

Use some analysis chain for your field like:

<analyzer>
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory" />
</analyzer>

This can be a multiValued field, BTW.

now use the TermsComponent to fetch your data. See:
http://wiki.apache.org/solr/TermsComponent

and specify terms.prefix, e.g.
http://localhost:8983/solr/terms?terms.prefix=app&terms.fl=blivet

The return list should be what you want. Note that the returned
values will be lower cased, and you can only specify
lower case in your search term (all because of specifying
the lowercase filter in my example).

This should be very fast no matter what your index size, as the
return list size defaults to 10 (though you can specify different
numbers).

Best
Erick
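
A sketch of the idea behind a prefix lookup over whole-value, lowercased
terms, assuming the terms are kept in sorted order: seek to the first term
that is not smaller than the prefix, then read forward while the prefix still
matches, up to a limit of 10. The function name and the in-memory list are
illustrative; the sketch ignores term counts and any sort option, so it is
not a model of the TermsComponent's actual ordering.

```python
from bisect import bisect_left


def terms_with_prefix(sorted_terms, prefix, limit=10):
    """Return up to `limit` terms that start with `prefix`."""
    start = bisect_left(sorted_terms, prefix)
    found = []
    for term in sorted_terms[start:]:
        if len(found) == limit or not term.startswith(prefix):
            break
        found.append(term)
    return found


values = ["Pineapple vers apple", "Milk with apple", "Apple milk shake"]

# Keyword tokenizer plus lowercase filter: one lowercased term per value.
terms = sorted(value.lower() for value in values)

print(terms_with_prefix(terms, "app"))
print(terms_with_prefix(terms, "Apple"))
```

The second call returns nothing because the stored terms are lowercased while
the prefix is not, which is the reason the search term has to be lowercased
by the client.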


Re: Query performance issue while using EdgeNGram

2010-12-22 Thread Shanmugavel SRD


1) Thanks for this update. I have to use 'WhiteSpaceTokenizer'
2) I have to suggest the whole query itself (Say name or title)
3) Could you please let me know if there is a way to find the evicted docs?
4) Yes, we are seeing improvement in the response time if we optimize. But
for some queries QTime is still more than 8 secs. It is a 'Blocker' for us.
Could you please suggest any way to reduce the QTime to 1 sec?
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Query-performance-issue-while-using-EdgeNGram-tp2097056p2130751.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Query performance issue while using EdgeNGram

2010-12-22 Thread Erick Erickson

Hmmm. "Find evicted docs"? If you mean find out how many docs are deleted,
look on the admin schema browser page; the difference between MaxDoc and
NumDocs is the number of deleted documents.

You say "for some queries the QTime is more than 8 secs". What happens if
you re-run that query a bit later? The reason I ask is that if you're not
warming the cache that that particular query uses, you may be seeing cache
loading time here.

Look at the admin stats page, especially for evictions. It's also possible
that your caches are being reclaimed for some queries and you're seeing
response time spikes when the caches are re-loaded.

Best
Erick


Re: Query performance issue while using EdgeNGram

2010-12-16 Thread Erick Erickson

})(.*)? replacement=$1 replace=all /
 </analyzer>
   </fieldType>
  </types>
  <fields>
    <field name="name" type="string" indexed="false" stored="true"/>
    <field name="id" type="string" indexed="true" stored="true" />
    <field name="score" type="sfloat" indexed="true" stored="false" />
    <field name="autosuggest" type="typeahead" indexed="true"
           stored="false"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <defaultSearchField>autosuggest</defaultSearchField>
  <copyField source="name" dest="autosuggest"/>
  <solrQueryParser defaultOperator="AND"/>
 --
 View this message in context:
 http://lucene.472066.n3.nabble.com/Query-performance-issue-while-using-EdgeNGram-tp2097056p2097056.html
 Sent from the Solr - User mailing list archive at Nabble.com.

Re: EdgeNGram relevancy

2010-11-16 Thread Robert Gründler

thanks for the explanation.

the results for the autocompletion are pretty good now, but we still have a 
small problem. 

When there are hits in the edgytext2 fields, results which only have hits in 
the edgytext field
should not be returned at all.

Example:

Query: Martin Sco

Current Results (in that order):

- Martin Scorsese
- Martin Lawrence
- Joseph Martin

However, in an autocompletion context, only Martin Scorsese makes sense, the 
2 others are logically
not correct.

I'm not sure if this can be solved on the solr side, or if we should implement 
the logic in the
application.


thanks!

-robert







On Nov 12, 2010, at 12:13 AM, Jonathan Rochkind wrote:

 Without the parens, the edgytext: only applied to "Mr"; the default field 
 still applied to "Scorsese".
 
 The double quotes are necessary in the second case (rather than parens) 
 because, on a non-tokenized field, the standard query parser will otherwise 
 pre-tokenize on whitespace before sending individual whitespace-separated 
 words to match the index. If the index includes multi-word tokens with 
 internal whitespace, they will never match. But with double quotes, the 
 standard query parser doesn't pre-tokenize like this; it passes the whole 
 phrase to the index intact.
 
 Robert Gründler wrote:
 Did you run your query without using the () and "" operators? If yes, can you 
 try this?
 q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0

 
 I didn't use () and "" in my query before. Using the query with those 
 operators
 works now; stopwords are thrown out as they should be, thanks.
 
 However, I don't understand how the () and "" operators affect the 
 StopWordFilter.
 
 Could you give a brief explanation for the above example?
 
 thanks!
 
 
 -robert

Re: EdgeNGram relevancy

2010-11-16 Thread Robert Gründler

it seems adding the '+' (required) operator to each term in a multi-term query 
does the trick:

http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#+

ie: edgytext2:(+Martin +Sco)


-robert
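
A toy illustration of the difference the '+' (required) operator makes,
assuming a whitespace-tokenized field whose words are indexed as front edge
n-grams: without it, a document matches when any query word matches; with it,
every query word must match. The helper names are hypothetical and the
matching is simulated with plain Python sets rather than Lucene.

```python
def edge_ngrams(token, max_gram=25):
    """All front edge n-grams of a token, from length 1 up to max_gram."""
    return {token[:n] for n in range(1, min(len(token), max_gram) + 1)}


def field_terms(name):
    """Lowercase, split on whitespace, and collect every word's prefixes."""
    terms = set()
    for token in name.lower().split():
        terms |= edge_ngrams(token)
    return terms


names = ["Martin Scorsese", "Martin Lawrence", "Joseph Martin"]


def search(query, require_all):
    """require_all=True mimics prefixing every query word with '+'."""
    words = query.lower().split()
    combine = all if require_all else any
    return [n for n in names if combine(w in field_terms(n) for w in words)]


print(search("Martin Sco", require_all=False))
print(search("Martin Sco", require_all=True))
```

With optional terms all three names are returned because each contains
"Martin"; with required terms only the name that also has a word starting
with "Sco" remains.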




EdgeNGram relevancy

2010-11-11 Thread Robert Gründler

Hi,

consider the following fieldtype (used for autocompletion):

  <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="([^a-z])" replacement="" replace="all" />
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
             maxGramSize="25" />
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="([^a-z])" replacement="" replace="all" />
   </analyzer>
  </fieldType>


This works fine as long as the query string is a single word. For multiple 
words, the ranking is weird though.

Example:

Query String: Bill Cl

Result (in that order):

- Clyde Phillips
- Clay Rogers
- Roger Cloud
- Bill Clinton

Bill Clinton should have the highest rank in that case.  

Does anyone have an idea how to configure this fieldtype to make matches in both 
tokens rank higher than those that match in either token?


thanks!


-robert

Re: EdgeNGram relevancy

2010-11-11 Thread Ahmet Arslan

You can add an additional field, using KeywordTokenizerFactory instead of 
WhitespaceTokenizerFactory, and query both these fields with an OR operator. 

edgytext:(Bill Cl) OR edgytext2:"Bill Cl"

You can even apply a boost so that "begins with" matches come first.
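
A toy model of the two-field suggestion, assuming edgytext holds edge n-grams
of each whitespace-separated word and edgytext2 holds edge n-grams of the
whole value, both after lowercasing and removing everything except a-z as the
pattern filter in the posted field type does. The scoring (one point per
matching word, plus a boost for a whole-value prefix match) is a deliberately
simplified stand-in, not Lucene's scoring formula, and stop words are ignored.

```python
import re


def normalize(text):
    """Lowercase and drop every character outside a-z."""
    return re.sub(r"[^a-z]", "", text.lower())


def grams(token, max_gram=25):
    """Front edge n-grams from length 1 up to max_gram."""
    return {token[:n] for n in range(1, min(len(token), max_gram) + 1)}


def edgytext_terms(name):
    """Whitespace-tokenized field: prefixes of every word."""
    terms = set()
    for word in name.split():
        terms |= grams(normalize(word))
    return terms


def edgytext2_terms(name):
    """Single-token field: prefixes of the whole value."""
    return grams(normalize(name))


def score(name, query, boost=2.0):
    """One point per matching word, plus `boost` for a whole-value prefix."""
    words = [normalize(w) for w in query.split()]
    loose = sum(1.0 for w in words if w in edgytext_terms(name))
    strict = boost if normalize(query) in edgytext2_terms(name) else 0.0
    return loose + strict


names = ["Clyde Phillips", "Clay Rogers", "Roger Cloud", "Bill Clinton"]
ranked = sorted(names, key=lambda n: score(n, "Bill Cl"), reverse=True)

print(ranked)
```

Under these assumptions the value that starts with the typed text collects
both word matches and the boosted whole-value match, so it sorts first.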


Re: EdgeNGram relevancy

2010-11-11 Thread Robert Gründler

thanks a lot, that setup works pretty well now.

the only problem now is that the StopWords do not work that well anymore. I'll 
provide an example, but first the 2 fieldtypes:

  <!-- autocomplete field which finds matches inside strings ("scor" matches
       "Martin Scorsese") -->

  <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="([^a-z])" replacement="" replace="all" />
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
             maxGramSize="25" />
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="([^a-z])" replacement="" replace="all" />
   </analyzer>
  </fieldType>

  <!-- autocomplete field which finds startsWith matches only ("scor" matches
       only "Scorpio", but not "Martin Scorsese") -->

  <fieldType name="edgytext2" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="([^a-z])" replacement="" replace="all" />
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
             maxGramSize="25" />
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.PatternReplaceFilterFactory"
             pattern="([^a-z])" replacement="" replace="all" />
   </analyzer>
  </fieldType>


This setup now causes trouble with stop words. Here's an example:

Let's say the index contains 2 strings: "Mr Martin Scorsese" and "Martin 
Scorsese". "Mr" is in the stopword list.

Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0

This way, the only result I get is "Mr Martin Scorsese", because the strict 
field edgytext2 is boosted by 2.0. 

Any idea why "Martin Scorsese" is not in the result at all in this case?


thanks again!


-robert






On Nov 11, 2010, at 5:57 PM, Ahmet Arslan wrote:

 You can add an additional field, with using KeywordTokenizerFactory instead 
 of WhitespaceTokenizerFactory. And query both these fields with an OR 
 operator. 
 
 edgytext:(Bill Cl) OR edgytext2:Bill Cl
 
 You can even apply boost so that begins with matches comes first.
 
 --- On Thu, 11/11/10, Robert Gründler rob...@dubture.com wrote:
 
 From: Robert Gründler rob...@dubture.com
 Subject: EdgeNGram relevancy
 To: solr-user@lucene.apache.org
 Date: Thursday, November 11, 2010, 5:51 PM
 Hi,
 
 consider the following fieldtype (used for
 autocompletion):
 
   fieldType name=edgytext class=solr.TextField
 positionIncrementGap=100
analyzer type=index
  tokenizer
 class=solr.WhitespaceTokenizerFactory/
  filter
 class=solr.LowerCaseFilterFactory/
  filter
 class=solr.StopFilterFactory ignoreCase=true
 words=stopwords.txt enablePositionIncrements=true
 / 
  filter
 class=solr.PatternReplaceFilterFactory pattern=([^a-z])
 replacement= replace=all /
  filter
 class=solr.EdgeNGramFilterFactory minGramSize=1
 maxGramSize=25 /
/analyzer
analyzer type=query
  tokenizer
 class=solr.WhitespaceTokenizerFactory/
  filter
 class=solr.LowerCaseFilterFactory/
  filter
 class=solr.StopFilterFactory ignoreCase=true
 words=stopwords.txt enablePositionIncrements=true /
  filter
 class=solr.PatternReplaceFilterFactory pattern=([^a-z])
 replacement= replace=all /
/analyzer
   /fieldType
 
 
 This works fine as long as the query string is a single
 word. For multiple words, the ranking is weird though.
 
 Example:
 
 Query String: Bill Cl
 
 Result (in that order):
 
 - Clyde Phillips
 - Clay Rogers
 - Roger Cloud
 - Bill Clinton
 
 "Bill Clinton" should have the highest rank in that
 case.  
 
 Does anyone have an idea how to configure this fieldtype to
 make matches in both tokens rank higher than those that match
 in either token?
 
 
 thanks!
 
 
 -robert

Re: EdgeNGram relevancy

2010-11-11 Thread Ahmet Arslan

 This setup now makes troubles regarding StopWords, here's
 an example:
 
 Let's say the index contains 2 Strings: Mr Martin
 Scorsese and Martin Scorsese. Mr is in the stopword
 list.
 
 Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0
 
 This way, the only result i get is Mr Martin Scorsese,
 because the strict field edgytext2 is boosted by 2.0. 
 
 Any idea why in this case Martin Scorsese is not in the
 result at all?

Did you run your query without using the () and "" operators? If yes, can you try 
this?
q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0

If not, can you paste the output of debugQuery=on?

Re: EdgeNGram relevancy

2010-11-11 Thread Nick Martin


On 12 Nov 2010, at 01:46, Ahmet Arslan iori...@yahoo.com wrote:

 Did you run your query without using the () and "" operators? If yes, can you try 
 this?
 q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0
 
 If not, can you paste the output of debugQuery=on?
 
 
 

This would still not deal with the problem of removing stop words from the 
indexing and query analysis stages.

I really need something that will allow that and give a single token as in the 
example below.

Best

Nick

Re: EdgeNGram relevancy

2010-11-11 Thread Andy

Could anyone help me understand why "Clyde Phillips" appears in the 
results for "Bill Cl"?

"Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so 
why is it even in the results?

Thanks.


Re: EdgeNGram relevancy

2010-11-11 Thread Robert Gründler

According to the fieldtype I posted previously, I think it's because of:

1. WhitespaceTokenizer splits the string "Clyde Phillips" into 2 tokens: 
"Clyde" and "Phillips".
2. EdgeNGramFilter gets the 2 tokens and creates EdgeNGrams for each token: 
"C" "Cl" "Cly" ...   AND  "P" "Ph" "Phi" ...

The query string "Bill Cl" gets split up into 2 tokens, "Bill" and "Cl", by the 
WhitespaceTokenizer.

This creates a match between the 2nd token of the query, "Cl", and one of the 
subtokens the EdgeNGramFilter created: "Cl".
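
The steps above can be sketched in a few lines of Python. This is only a toy model of the analysis chain, not Solr itself: the function names are hypothetical, stop words are ignored, and matching is reduced to set membership.

```python
import re

def analyze_index(text, min_gram=1, max_gram=25):
    """Whitespace-split, lowercase, strip non a-z, then emit edge n-grams per token."""
    terms = set()
    for token in text.split():
        token = re.sub(r"[^a-z]", "", token.lower())
        for n in range(min_gram, min(len(token), max_gram) + 1):
            terms.add(token[:n])
    return terms

def analyze_query(text):
    """Same chain without the n-gram step, as in the query analyzer."""
    return [re.sub(r"[^a-z]", "", t.lower()) for t in text.split()]

names = ["Clyde Phillips", "Clay Rogers", "Roger Cloud", "Bill Clinton"]
query_tokens = analyze_query("Bill Cl")

# For each indexed name, list the query tokens that hit one of its edge n-grams.
matched = {name: [t for t in query_tokens if t in analyze_index(name)] for name in names}
for name, tokens in matched.items():
    print(name, tokens)
```

Every name has a token starting with "cl", so with an OR between the query tokens all four documents match; only "Bill Clinton" matches both tokens.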


-robert





Re: EdgeNGram relevancy

2010-11-11 Thread Andy

Ah I see. Thanks for the explanation.

Could you set the defaultOperator to AND? That way both "Bill" and "Cl" must 
match, and that would exclude "Clyde Phillips".
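
A minimal sketch of the difference, assuming a toy in-memory model rather than Solr (the `search` helper and its `operator` argument are hypothetical stand-ins for the default operator setting):

```python
def edge_terms(text, max_gram=25):
    """Edge n-grams of every whitespace-separated, lowercased token."""
    return {
        tok[:n]
        for tok in text.lower().split()
        for n in range(1, min(len(tok), max_gram) + 1)
    }

def search(names, query, operator="OR"):
    """Return names where any (OR) or all (AND) query tokens hit an edge n-gram."""
    combine = all if operator == "AND" else any
    tokens = query.lower().split()
    return [n for n in names if combine(t in edge_terms(n) for t in tokens)]

names = ["Clyde Phillips", "Clay Rogers", "Roger Cloud", "Bill Clinton"]
print(search(names, "Bill Cl", operator="OR"))
print(search(names, "Bill Cl", operator="AND"))
```

With OR, all four names come back because each has a token starting with "cl". With AND, only "Bill Clinton" remains.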


Re: EdgeNGram relevancy

2010-11-11 Thread Robert Gründler

 
 Did you run your query without using the () and "" operators? If yes, can you try 
 this?
 q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0

I didn't use () and "" in my query before. Using the query with those operators
works now; stop words are thrown out as they should be, thanks.

However, I don't understand how the () and "" operators affect the 
stop word filter.

Could you give a brief explanation for the above example?

thanks!


-robert

Re: EdgeNGram relevancy

2010-11-11 Thread Jonathan Rochkind

Without the parens, the "edgytext:" prefix applied only to "Mr"; the default 
field still applied to "Scorsese".


The double quotes are necessary in the second case (rather than parens) 
because it is a non-tokenized field. Without quotes, the standard query 
parser pre-tokenizes on whitespace before sending the individual 
whitespace-separated words to match the index. If the index includes 
multi-word tokens with internal whitespace, they will never match. With 
double quotes, the standard query parser doesn't pre-tokenize like this; 
it passes the whole phrase to the index intact.
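
To make the field binding concrete, here is a small Python sketch. It is a deliberately simplified toy, not the Lucene query parser: the regular expression, the `clauses` helper and the default field name "text" are all assumptions for illustration.

```python
import re

CLAUSE = re.compile(r'''
    (?:(?P<field>\w+):)?          # optional field prefix
    (?:
        \((?P<group>[^)]*)\)      # parenthesized terms, all bound to the field
      | "(?P<phrase>[^"]*)"       # quoted phrase, passed on intact
      | (?P<term>[^\s()"^]+)      # a bare term
    )
    (?:\^(?P<boost>[\d.]+))?      # optional boost
''', re.VERBOSE)

def clauses(query, default_field="text"):
    """Return (field, text) pairs showing which field each query part is bound to."""
    out = []
    for m in CLAUSE.finditer(query):
        field = m.group("field") or default_field
        if m.group("group") is not None:
            out += [(field, t) for t in m.group("group").split()]
        elif m.group("phrase") is not None:
            out.append((field, m.group("phrase")))
        elif m.group("term") not in ("OR", "AND"):
            out.append((field, m.group("term")))
    return out

print(clauses('edgytext:Mr Scorsese OR edgytext2:"Mr Scorsese"^2.0'))
print(clauses('edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0'))
```

In the first query the bare "Scorsese" falls to the default field; in the second the parentheses bind both terms to edgytext, and the quoted phrase stays in one piece for edgytext2.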


